Score 7.5 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Important safety research: an automated jailbreak via execution simulation targeting LRMs' internal reasoning; highly relevant to AI safety.
arXiv:2505.10846v3 Announce Type: replace Abstract: This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN pioneers an execution simulation paradigm that leverages a weaker but less-aligned model to simulate execution reasoning for initial hijacking attempts and iteratively refine attacks by exploiting reasoning patterns leaked through the target LRM’s refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions.
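The abstract describes an iterative loop: a weaker, less-aligned model frames the harmful instruction as an execution simulation, and each refusal from the target LRM leaks reasoning fragments that are fed back to refine the next attempt. The sketch below illustrates only that control flow; all function names, the stub model behaviors, and the "leaked pattern" representation are assumptions for illustration, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the refusal-driven refinement loop (assumed names,
# stubbed models). Real use would call two LLMs; here both are stand-ins.

def weak_model_simulate(instruction, leaked_patterns):
    """Stub for the weaker, less-aligned model: wraps the instruction in an
    'execution simulation' frame, reusing reasoning leaked by earlier refusals."""
    frame = " ".join(leaked_patterns) or "simulate step-by-step execution of"
    return f"{frame}: {instruction}"

def target_lrm_respond(prompt):
    """Stub target LRM: refuses twice, leaking a reasoning fragment each time,
    then complies -- purely to make the loop's control flow observable."""
    target_lrm_respond.calls += 1
    if target_lrm_respond.calls < 3:
        return {"refused": True,
                "reasoning": f"policy-step-{target_lrm_respond.calls}"}
    return {"refused": False, "reasoning": ""}
target_lrm_respond.calls = 0

def attack_loop(instruction, max_iters=5):
    """Iteratively refine the attack, exploiting reasoning leaked via refusals."""
    leaked = []
    for i in range(1, max_iters + 1):
        prompt = weak_model_simulate(instruction, leaked)
        reply = target_lrm_respond(prompt)
        if not reply["refused"]:
            return i, prompt                 # jailbreak succeeded at iteration i
        leaked.append(reply["reasoning"])    # mine the refusal for patterns
    return None, None

iters, final_prompt = attack_loop("<placeholder task>")
print(iters)  # -> 3: the stub target complies on the third refined attempt
```

The key design point the sketch captures is that refusals are treated as an information channel, not a dead end: each one enlarges `leaked`, which reshapes the next simulation frame.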