Score 7.5 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Important safety research: an automated jailbreak via execution simulation targeting LRMs' internal reasoning; highly relevant to AI safety.
arXiv:2505.10846v3 Announce Type: replace Abstract: This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN pioneers an execution simulation paradigm that leverages a weaker but less-aligned model to simulate execution reasoning for initial hijacking attempts and iteratively refine attacks by exploiting reasoning patterns leaked through the target LRM’s refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions.
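The abstract describes an iterative loop: a weaker, less-aligned model frames the harmful instruction as an execution simulation, and each refusal from the target LRM leaks reasoning fragments that are fed back to refine the next attempt. The sketch below illustrates only that control flow; all function names, the stub model behaviors, and the "leaked pattern" representation are assumptions for illustration, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the refusal-driven refinement loop (assumed names,
# stubbed models). Real use would call two LLMs; here both are stand-ins.

def weak_model_simulate(instruction, leaked_patterns):
    """Stub for the weaker, less-aligned model: wraps the instruction in an
    'execution simulation' frame, reusing reasoning leaked by earlier refusals."""
    frame = " ".join(leaked_patterns) or "simulate step-by-step execution of"
    return f"{frame}: {instruction}"

def target_lrm_respond(prompt):
    """Stub target LRM: refuses twice, leaking a reasoning fragment each time,
    then complies -- purely to make the loop's control flow observable."""
    target_lrm_respond.calls += 1
    if target_lrm_respond.calls < 3:
        return {"refused": True,
                "reasoning": f"policy-step-{target_lrm_respond.calls}"}
    return {"refused": False, "reasoning": ""}
target_lrm_respond.calls = 0

def attack_loop(instruction, max_iters=5):
    """Iteratively refine the attack, exploiting reasoning leaked via refusals."""
    leaked = []
    for i in range(1, max_iters + 1):
        prompt = weak_model_simulate(instruction, leaked)
        reply = target_lrm_respond(prompt)
        if not reply["refused"]:
            return i, prompt                 # jailbreak succeeded at iteration i
        leaked.append(reply["reasoning"])    # mine the refusal for patterns
    return None, None

iters, final_prompt = attack_loop("<placeholder task>")
print(iters)  # -> 3: the stub target complies on the third refined attempt
```

The key design point the sketch captures is that refusals are treated as an information channel, not a dead end: each one enlarges `leaked`, which reshapes the next simulation frame.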