
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Academic Frontier, score 6.0 — Lightweight jailbreak detection via token-logit grading; a fast, deterministic alternative to semantic guards
Original source: cs.AI updates on arXiv.org

Score: 6 · Source: cs.AI updates on arXiv.org · Published: 2026-04-17


arXiv:2604.01473v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, and so either introduce substantial latency or suffer from the randomness of text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem over token-level logits.
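The abstract only states that detection is cast as numerical grading over token-level logits, not how the grade is computed. A minimal sketch of one plausible reading: ask the model to rate a query's harmfulness with a single digit, read the logits of the digit tokens "0"–"9" at the answer position, and take the probability-weighted expected digit. Because no sampling is involved, the score is deterministic for a fixed model and prompt. All function names, the threshold, and the scoring rule below are illustrative assumptions, not the paper's method.

```python
import math

def grade_from_digit_logits(digit_logits):
    """Deterministic harmfulness grade in [0, 9] from digit-token logits.

    digit_logits: list of 10 floats; index i holds the logit of the
    token str(i) at the single answer position. Returns the softmax
    probability-weighted expected digit (no sampling).
    """
    m = max(digit_logits)
    exps = [math.exp(x - m) for x in digit_logits]  # numerically stable softmax
    z = sum(exps)
    return sum(i * e / z for i, e in enumerate(exps))

def is_jailbreak(digit_logits, threshold=5.0):
    # Flag the query if the expected grade crosses a (hypothetical) threshold.
    return grade_from_digit_logits(digit_logits) >= threshold

# Toy logit vectors standing in for a real model's output at the
# answer position (hypothetical values for illustration only):
benign = [4.0, 3.0, 1.0] + [-2.0] * 7   # probability mass on low digits
harmful = [-2.0] * 8 + [3.0, 5.0]       # probability mass on high digits
```

Reading logits at one fixed position avoids both costs named in the abstract: no internal-feature probes over the full forward pass, and no dependence on a sampled textual response.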