Score 7.3 · Source: cs.CL updates on arXiv.org · Published 2026-04-08
Scoring rationale: The reasoning illusion in frontier models: step-level evaluation reveals decorative CoT
arXiv:2603.22816v2 Announce Type: replace Abstract: Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or are they decorative narratives generated after the model has already decided? We introduce step-level faithfulness evaluation, which removes one reasoning sentence at a time and checks whether the answer changes, and which requires only API access at $1-2 per model per task. Evaluating 13 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro, MiniMax-M2.5, Kimi-K2.5, and others) across si
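
The step-level procedure named in the abstract (drop one reasoning sentence, re-ask, compare answers) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: `split_steps`, `step_ablation`, and `toy_answer_fn` are hypothetical names, the sentence splitter is deliberately naive, and the toy answer function stands in for a real model API call, whose prompting and answer-matching details the excerpt does not give.

```python
import re
from typing import Callable, List


def split_steps(reasoning: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", reasoning.strip())
    return [p for p in parts if p]


def step_ablation(
    question: str,
    steps: List[str],
    answer_fn: Callable[[str, List[str]], str],
) -> List[bool]:
    """For each step, report whether removing it changes the final answer.

    answer_fn is a hypothetical hook: given the question and a list of
    reasoning steps, it returns the final answer as a string. In practice
    it would wrap a model API call.
    """
    baseline = answer_fn(question, steps)
    changed = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]  # leave out exactly one step
        changed.append(answer_fn(question, ablated) != baseline)
    return changed


def toy_answer_fn(question: str, steps: List[str]) -> str:
    """Toy stand-in for a model: 'answers' with the sum of all integers in the steps."""
    numbers = [int(n) for s in steps for n in re.findall(r"-?\d+", s)]
    return str(sum(numbers))


if __name__ == "__main__":
    steps = split_steps("Start with 2 apples. Add 3 more. Apples are tasty.")
    flags = step_ablation("How many apples?", steps, toy_answer_fn)
    for step, flag in zip(steps, flags):
        print(f"{'used      ' if flag else 'decorative'} | {step}")
```

With the toy function, the first two steps flip the answer when removed and the third does not, so only the third would count as decorative under this sketch. A real run would replace `toy_answer_fn` with a call that feeds the ablated reasoning back to the model and extracts its answer.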