Score 6.0 · Source: cs.CL updates on arXiv.org · Published 2026-04-08
Rating rationale: AI research paper of some reference value
arXiv:2604.05179v1 Announce Type: new Abstract: Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade the user experience. Prior work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts using a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail tha