Score: 8 · Source: cs.LG updates on arXiv.org · Published: 2026-04-20
Scoring rationale: Jailbreak scaling laws: identifies a crossover in adversarial attack success rate from polynomial to exponential growth
Key points
arXiv:2603.11331v2 (announce type: replace) · Abstract: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that strong adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phen…
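The two regimes in the abstract can be illustrated with a minimal best-of-n sketch. This is our own construction under stated assumptions, not the paper's actual mechanism: each inference-time sample succeeds independently, and the names `p`, `q`, and `alpha` are hypothetical parameters. If every context shares one per-sample success probability, the failure rate decays exponentially in the sample count n; if the per-context success probability is instead drawn from a distribution with mass concentrated near zero, the averaged failure rate decays only polynomially.

```python
import math

def fail_fixed(p: float, n: int) -> float:
    """Single regime: every context shares one per-sample success
    probability p, so the attack failure rate (1 - p)^n decays
    exponentially in the number of samples n."""
    return (1.0 - p) ** n

def fail_mixture(n: int, alpha: float = 0.5) -> float:
    """Mixture regime (hypothetical choice): per-context success
    probability q ~ Beta(alpha, 1), whose density behaves like
    alpha * q^(alpha - 1) near q = 0. Averaging (1 - q)^n over q gives
    E[(1 - q)^n] = alpha * B(alpha, n + 1), which decays only
    polynomially, roughly Gamma(alpha + 1) * n^(-alpha)."""
    return alpha * math.exp(
        math.lgamma(alpha) + math.lgamma(n + 1) - math.lgamma(n + 1 + alpha)
    )

# Exponential regime: going from n=100 to n=400 raises the
# failure rate to the fourth power.
print(fail_fixed(0.05, 100), fail_fixed(0.05, 400))
# Polynomial regime (alpha = 0.5): the same quadrupling of n
# only roughly halves the failure rate.
print(fail_mixture(100), fail_mixture(400))
```

Under these assumptions the crossover the abstract describes corresponds to an attack collapsing the heavy-tailed per-context distribution toward a fixed, nonzero per-sample success probability.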
🤖 AI Commentary
This paper provides important information for the AI field and merits attention from industry practitioners.