Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Published

April 20, 2026

Collected April 20, 2026, 09:04

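Academic Frontier, score 6.0: Do VLMs truly perform visual reasoning? The study suggests that VLMs may rely on textual shortcuts rather than genuine visual reasoning.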

Score 6 · Source: cs.CL updates on arXiv.org · Published 2026-04-20

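Scoring rationale: Do VLMs truly perform visual reasoning? The study suggests that VLMs may rely on textual shortcuts rather than genuine visual reasoning.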

Key points

arXiv:2604.16256v1 Announce Type: cross Abstract: Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, imag…

🤖 AI commentary

This paper offers important information for the AI field and merits the attention of industry practitioners.