The illusion of reasoning: step-level evaluation reveals decorative chain-of-thought in frontier language models

Published

2026-04-08

Collected 2026-04-08 04:31

Academic Frontier · 7.3 points · The reasoning illusion in frontier models: step-level evaluation reveals decorative CoT

Score 7.3 · Source: cs.CL updates on arXiv.org · Published 2026-04-08

Scoring basis: The reasoning illusion in frontier models: step-level evaluation reveals decorative CoT

arXiv:2603.22816v2 Announce Type: replace Abstract: Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or are they decorative narratives generated after the model has already decided? We introduce step-level faithfulness evaluation, which removes one reasoning sentence at a time and checks whether the answer changes; it requires only API access, at $1-2 per model per task. Evaluating 13 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro, MiniMax-M2.5, Kimi-K2.5, and others) across si
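
The ablation loop described in the abstract (drop one reasoning sentence, re-ask, compare answers) can be sketched roughly as below. This is an illustrative sketch only, not the paper's code: `split_steps`, `step_level_faithfulness`, and `toy_answer_fn` are hypothetical names, the regex sentence splitter is a naive assumption, and `answer_fn` stands in for whatever API call returns a model's final answer given a question and a (possibly edited) reasoning trace.

```python
import re
from typing import Callable, List


def split_steps(cot: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]


def step_level_faithfulness(
    question: str,
    cot: str,
    answer_fn: Callable[[str, str], str],
) -> List[bool]:
    """For each reasoning step, report whether deleting it changes the answer.

    answer_fn(question, reasoning) is a hypothetical callable that returns the
    final answer the model gives when conditioned on the supplied reasoning.
    """
    steps = split_steps(cot)
    baseline = answer_fn(question, " ".join(steps))
    changed = []
    for i in range(len(steps)):
        # Rebuild the trace with step i removed and query again.
        ablated = " ".join(steps[:i] + steps[i + 1:])
        changed.append(answer_fn(question, ablated) != baseline)
    return changed


def toy_answer_fn(question: str, reasoning: str) -> str:
    """Toy stand-in for a model call: the answer depends on one key sentence."""
    return "12" if "3 times 4 is 12" in reasoning else "unknown"


flags = step_level_faithfulness(
    "What is 3 times 4?",
    "We need the product. 3 times 4 is 12. So the answer is 12.",
    toy_answer_fn,
)
```

In this toy setup only the middle sentence is load-bearing. Under this reading, a step whose removal never changes the answer would count as decorative, though the abstract as shown does not state the exact scoring rule.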