Tag: evaluation

All articles tagged "evaluation".

6.0
Escaping the Agreement Trap: Defensibility Signals for Rule-Governed AI
2026-04-24
· arXiv · collected 04/24 08:00
Proposes a Defensibility Index for evaluating rule-governed AI systems, breaking out of the Agreement Trap of traditional agreement metrics.
6.0
GPT-5.5 System Card
2026-04-23
· OpenAI · collected 04/24 08:00
GPT-5.5 System Card: a complete account of safety evaluations and capability boundaries.
6.0
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
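2026-04-22
· cs.CL updates on arXiv.org · collected 04/22 14:31
Beyond Rating: a holistic framework for evaluating the argumentative quality of AI-written review text, moving beyond the scalar-rating paradigm.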
7.0
Lost in Translation: Do LVLM Judges Generalize Across Languages?
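2026-04-22
· cs.CL updates on arXiv.org · collected 04/22 14:31
MM-JudgeBench: the first large-scale multilingual, multimodal judging benchmark, revealing cross-lingual generalization failures in LVLM evaluators.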
7.0
Are Large Language Models Economically Viable for Industry Deployment?
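2026-04-22
· cs.CL updates on arXiv.org · collected 04/22 14:31
Examines the economic viability of LLMs under industrial constraints such as energy, latency, and hardware utilization, and critiques the accuracy-only evaluation paradigm.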
6.0
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
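2026-04-22
· cs.CL updates on arXiv.org · collected 04/22 14:31
An empirical study of output-based jailbreak detection under realistic conditions, comparing TF-IDF and generation-inconsistency detectors.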
7.5
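LLM Evaluation: AI's New Bottleneck
2026-03-15
· ML Frontiers · collected 03/16 18:32
Language models are advancing faster than our ability to measure them reliably, and that is becoming a problem.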
8.7
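METR Study: Many PRs That Pass SWE-bench Would Not Actually Be Merged
2026-03-10
· METR / Hacker News
A METR study found that many AI-generated PRs that receive a passing score on SWE-bench fall well short of real-world code review standards.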
