Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

Published

April 29, 2026

Collected April 29, 2026, 06:31

Academic Frontier, score 4.0: a preregistered experiment on detecting LLM sandbagging

Original: arXiv cs.CL

Score 4 · Source: arXiv cs.CL · Published 2026-04-29

Scoring rationale: a preregistered experiment on detecting LLM sandbagging

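Detecting sandbagging remains an open problem in AI safety. This paper tests whether the logic of symptom validity testing can identify sandbagging through below-chance performance. It reports a preregistered pilot experiment on models at the 7-9B parameter scale.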
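The below-chance logic can be illustrated with a one-sided exact binomial test: on a multiple-choice task, a model that scores significantly worse than random guessing probably knows the correct answers and is avoiding them. The sketch below is only an illustration of that idea. The function names, the four-option default, and the 0.05 threshold are assumptions made here; this summary does not describe the paper's actual statistical procedure.

```python
from math import comb


def below_chance_p_value(correct: int, total: int, n_options: int = 4) -> float:
    """Exact one-sided binomial p-value P(X <= correct) under random guessing.

    Assumes every item has the same number of options and that a pure
    guesser picks each option with probability 1 / n_options.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    if n_options < 2:
        raise ValueError("need at least two answer options")
    p = 1.0 / n_options
    # Sum the binomial probability mass from 0 up to the observed score.
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct + 1)
    )


def is_below_chance(correct: int, total: int,
                    n_options: int = 4, alpha: float = 0.05) -> bool:
    """Flag a score as significantly below chance (alpha is an assumed threshold)."""
    return below_chance_p_value(correct, total, n_options) < alpha


if __name__ == "__main__":
    # Hypothetical scores on a 100-item, four-option benchmark.
    for score in (10, 25):
        p = below_chance_p_value(score, 100)
        print(f"{score}/100 correct: p = {p:.4g}, flagged = {is_below_chance(score, 100)}")
```

A score near the chance level (for example 25 of 100 on four-option items) is not flagged, which matches the title's finding: if prompted underperformance shows up as positional bias rather than answer avoidance, accuracy can stay near chance and a test of this kind would not detect it.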


G-Loss: Graph-Guided Fine-Tuning of Language Models

Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives