
AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models


Score: 7.5 · Source: cs.LG updates on arXiv.org · Published: 2026-04-17

Rating rationale: Important safety research: an automated jailbreak via execution simulation that targets LRMs' internal reasoning; highly relevant to AI safety.

arXiv:2505.10846v3 (Announce Type: replace)

Abstract: This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN pioneers an execution simulation paradigm that leverages a weaker but less-aligned model to simulate execution reasoning for initial hijacking attempts and iteratively refine attacks by exploiting reasoning patterns leaked through the target LRM's refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions.
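The abstract describes an iterative control flow: a weaker, less-aligned model drafts an execution-simulation narrative as the initial attempt, the target LRM's refusal leaks fragments of its safety reasoning, and those fragments drive the next refinement. Below is a minimal structural sketch of that loop only; every function name here (simulate_execution_reasoning, query_target, is_refusal, extract_leaked_reasoning, refine_attempt) is a hypothetical placeholder, not AutoRAN's actual API, and the paper's real prompt templates and model interfaces are deliberately omitted. In practice each stub would be backed by a call to the attacker or target model.

```python
from typing import Optional

# All stubs below are hypothetical placeholders standing in for LLM calls;
# none of these names or bodies come from the AutoRAN paper.

def simulate_execution_reasoning(goal: str) -> str:
    """Weaker, less-aligned model drafts an execution-style reasoning
    narrative to serve as the initial hijacking attempt (stubbed)."""
    return f"<execution narrative for: {goal}>"

def query_target(attempt: str) -> str:
    """Send the current attempt to the target LRM (stubbed)."""
    return "<target response>"

def is_refusal(response: str) -> bool:
    """Classify whether the target refused the request (stubbed)."""
    return "refuse" in response.lower()

def extract_leaked_reasoning(response: str) -> str:
    """Pull the safety-reasoning fragments leaked in a refusal (stubbed)."""
    return "<leaked reasoning patterns>"

def refine_attempt(attempt: str, leaked: str) -> str:
    """Weaker model revises the attempt against the leaked patterns (stubbed)."""
    return f"{attempt} [refined against: {leaked}]"

def autoran_loop(goal: str, max_rounds: int = 5) -> Optional[str]:
    """Iterative refinement loop matching the abstract's description."""
    attempt = simulate_execution_reasoning(goal)
    for _ in range(max_rounds):
        response = query_target(attempt)
        if not is_refusal(response):
            return response  # guardrails bypassed
        # Refusals leak reasoning patterns; feed them back into refinement.
        attempt = refine_attempt(attempt, extract_leaked_reasoning(response))
    return None  # attack did not succeed within the round budget
```

The sketch captures only the loop's shape; per the abstract, the substance of the method lies in how the leaked reasoning patterns are parsed and folded into the next execution-simulation narrative, which this skeleton does not reproduce.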