
The illusion of reasoning: step-level evaluation reveals decorative chain-of-thought in frontier language models

Academic Frontier · Score 7.3: The illusion of reasoning in frontier models: step-level evaluation reveals decorative CoT
Original: cs.CL updates on arXiv.org

Score 7.3 · Source: cs.CL updates on arXiv.org · Published 2026-04-08

Scoring rationale: The illusion of reasoning in frontier models: step-level evaluation reveals decorative CoT

arXiv:2603.22816v2 (announce type: replace)

Abstract: Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or are they decorative narratives generated after the model has already decided? We introduce step-level faithfulness evaluation (removing one reasoning sentence at a time and checking whether the answer changes), which requires only API access at $1-2 per model per task. Evaluating 13 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro, MiniMax-M2.5, Kimi-K2.5, and others) across si
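The ablation idea described in the abstract (drop one reasoning sentence, re-query, compare answers) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the sentence splitter, the function names, and `toy_answer_fn` (a deterministic stand-in for a real model API call) are all assumptions, and the excerpt does not say how the paper feeds the shortened reasoning back to the model.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(reasoning: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", reasoning.strip())
    return [p for p in parts if p]


def step_level_faithfulness(
    question: str,
    reasoning: str,
    answer_fn: Callable[[str, str], str],
) -> Tuple[str, List[Tuple[str, bool]]]:
    """Remove one reasoning sentence at a time and record whether the answer changes."""
    steps = split_sentences(reasoning)
    baseline = answer_fn(question, " ".join(steps))
    results = []
    for i, step in enumerate(steps):
        # Rebuild the reasoning without step i and ask for the answer again.
        ablated = " ".join(steps[:i] + steps[i + 1:])
        changed = answer_fn(question, ablated) != baseline
        results.append((step, changed))
    return baseline, results


def toy_answer_fn(question: str, reasoning: str) -> str:
    # Hypothetical stand-in for a model API call: it "answers" by summing
    # every integer that appears in the supplied reasoning text.
    return str(sum(int(n) for n in re.findall(r"-?\d+", reasoning)))


if __name__ == "__main__":
    baseline, results = step_level_faithfulness(
        "How many apples are there?",
        "Start with 2 apples. Add 3 more apples. Apples are tasty.",
        toy_answer_fn,
    )
    print("baseline answer:", baseline)
    for step, changed in results:
        label = "load-bearing" if changed else "decorative"
        print(f"{label}: {step}")
```

In this toy run the first two sentences change the answer when removed, while the third does not, which is the sense in which a step would count as decorative. With a real model, `answer_fn` would wrap an API request, and answers would likely need normalizing before comparison.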

