Removing Sandbagging in LLMs by Training with Weak Supervision

Published

2026-04-27

Collected 2026-04-27 06:32

Academic Frontier, score 7.5: An important alignment problem: eliciting the best work from capable models under weak supervision. Relevant to AI safety in deployment.

Original: arxiv.org



Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training