评分 6.0 · 来源:cs.CL updates on arXiv.org · 发布于 2026-04-08
评分依据:视频生成作为多模态推理范式
arXiv:2511.04570v2 Announce Type: replace-cross Abstract: The “Thinking with Text” and “Thinking with Images” paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, which hinders unified multimodal understanding and generation. Therefore, we propose “Thinking with Video”, a new paradigm that leverages video gen