Score: 6 · Source: cs.AI updates on arXiv.org · Published 2026-04-17
Scoring rationale: Lightweight jailbreak detection via token-logit grading; a fast, deterministic alternative to semantic guardrails
arXiv:2604.01473v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries; the former introduces substantial latency, while the latter suffers from the randomness of text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem over token-level logits.
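The abstract does not specify SelfGrader's exact grading scheme, but the idea of scoring from token-level logits can be sketched as follows. The sketch assumes a hypothetical setup in which the model is prompted to rate a query's harmfulness with a single digit 0-9, and the detector reads the logits assigned to the ten digit tokens at that position; the softmax-weighted expected grade is then a deterministic score, avoiding sampling randomness. The 0-9 scale, the `threshold` value, and the function interface are all illustrative assumptions, not the paper's actual design.

```python
import math

def grade_from_digit_logits(digit_logits, threshold=5.0):
    """Turn the logits of digit tokens "0".."9" into a deterministic grade.

    digit_logits: list of 10 floats, the logit for each digit token at the
    grading position (hypothetical interface; the paper may use a different
    scale or decision rule).  Returns (expected_grade, is_flagged).
    """
    assert len(digit_logits) == 10
    # Numerically stable softmax over the ten digit logits.
    m = max(digit_logits)
    exps = [math.exp(x - m) for x in digit_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expected grade = sum of digit * probability; no sampling involved,
    # so the same logits always yield the same score.
    expected = sum(d * p for d, p in enumerate(probs))
    return expected, expected >= threshold

# Logits peaked at digit 8: scored as harmful.
score_bad, flagged_bad = grade_from_digit_logits(
    [0, 0, 0, 0, 0, 0, 0, 1, 6, 2])
# Logits peaked at digit 0: scored as benign.
score_ok, flagged_ok = grade_from_digit_logits(
    [6, 2, 1, 0, 0, 0, 0, 0, 0, 0])
```

Reading the full distribution over digit tokens, rather than the single sampled digit, uses information the model already computes and costs only one forward pass, which is where the claimed speed advantage over response-based semantic guards would come from.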