星际流动

Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Academic Frontier · Score 6.0 — an AI research paper of some reference value
Source: cs.CL updates on arXiv.org

Score 6.0 · Source: cs.CL updates on arXiv.org · Published 2026-04-08


arXiv:2604.05179v1 Announce Type: new Abstract: Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail tha
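To make the "dual-anchor" idea concrete: a single-anchor detector like GradSafe scores a prompt by how much its gradient resembles a reference "accept" direction, which leaves the decision threshold brittle. A dual-anchor variant can instead contrast similarity to a compliance anchor against similarity to a refusal anchor. The sketch below is a hypothetical illustration with synthetic gradient vectors, not the paper's actual method; all names (`dual_anchor_score`, the toy gradients) are invented for this example.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_anchor_score(grad: np.ndarray,
                      grad_accept: np.ndarray,
                      grad_refuse: np.ndarray) -> float:
    # Positive score: the prompt's gradient looks more like the
    # compliance ("accept") anchor than the refusal anchor, which
    # in this toy setup flags the prompt as likely unsafe.
    return cosine(grad, grad_accept) - cosine(grad, grad_refuse)

# Synthetic stand-ins for per-prompt loss gradients (64-dim toy vectors).
rng = np.random.default_rng(0)
g_accept = rng.normal(size=64)   # hypothetical compliance-anchor gradient
g_refuse = rng.normal(size=64)   # hypothetical refusal-anchor gradient

unsafe_grad = g_accept + 0.1 * rng.normal(size=64)  # aligns with "accept"
benign_grad = g_refuse + 0.1 * rng.normal(size=64)  # aligns with "refuse"

print(dual_anchor_score(unsafe_grad, g_accept, g_refuse))  # positive
print(dual_anchor_score(benign_grad, g_accept, g_refuse))  # negative
```

Contrasting two anchors rather than thresholding one similarity gives a signed margin, which is one plausible way to make the decision less sensitive to a single cutoff; how GCD actually combines the anchors with decoding control is described in the paper itself.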

