Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models

Nanxi Li, Xiang Wang, Yuanjie Chen, Haode Zhang, Hong Li, Yong-Lu Li

Published in ICLR, 2026

Despite progress in image and video analysis, multimodal large language models (MLLMs) still struggle with high-level physics reasoning and with understanding the dynamics of continuum objects. To probe this limitation, we introduce two benchmark tasks: Next Frame Selection and Temporal Coherence Verification. Our proposed Scene Dynamic Field method integrates physics simulators into a fine-tuning framework, achieving gains of up to 20.7% on fluid tasks while also performing strongly on unfamiliar physical domains.

[Paper]

[Code]
