
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Academic Frontier, score 6.0 — Lightweight jailbreak detection via token-logit grading; a fast, deterministic alternative to semantic guards
Original source: cs.AI updates on arXiv.org

Score: 6 · Source: cs.AI updates on arXiv.org · Published: 2026-04-17


arXiv:2604.01473v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, and so either introduce substantial latency or suffer from the randomness of text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem over token-level logits.
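The abstract only states that detection is cast as numerical grading over token-level logits, not how the grade is computed. A minimal sketch of one plausible reading: ask the model to rate a query's harmfulness with a single digit, read the logits of the digit tokens "0"–"9" at the answer position, and take the probability-weighted expected digit. Because no sampling is involved, the score is deterministic for a fixed model and prompt. All function names, the threshold, and the scoring rule below are illustrative assumptions, not the paper's method.

```python
import math

def grade_from_digit_logits(digit_logits):
    """Deterministic harmfulness grade in [0, 9] from digit-token logits.

    digit_logits: list of 10 floats; index i holds the logit of the
    token str(i) at the single answer position. Returns the softmax
    probability-weighted expected digit (no sampling).
    """
    m = max(digit_logits)
    exps = [math.exp(x - m) for x in digit_logits]  # numerically stable softmax
    z = sum(exps)
    return sum(i * e / z for i, e in enumerate(exps))

def is_jailbreak(digit_logits, threshold=5.0):
    # Flag the query if the expected grade crosses a (hypothetical) threshold.
    return grade_from_digit_logits(digit_logits) >= threshold

# Toy logit vectors standing in for a real model's output at the
# answer position (hypothetical values for illustration only):
benign = [4.0, 3.0, 1.0] + [-2.0] * 7   # probability mass on low digits
harmful = [-2.0] * 8 + [3.0, 5.0]       # probability mass on high digits
```

Reading logits at one fixed position avoids both costs named in the abstract: no internal-feature probes over the full forward pass, and no dependence on a sampled textual response.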