Score: 7 · Source: cs.LG updates on arXiv.org · Published 2026-04-20
Scoring rationale: a reasoning-targeted jailbreak attack on Large Reasoning Models — injecting harmful content into the reasoning chain via semantic triggers
Highlights
arXiv:2604.15725v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel problem that injects harmful content into the reasoning steps while preserving unchanged answers. This type of attack presents two key challenge…
🤖 AI Commentary
This paper surfaces an overlooked attack surface: the reasoning chain itself, rather than the final answer. Because LRMs are increasingly deployed in high-stakes domains such as healthcare and education, practitioners building on these models should take note.