
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Academic frontier · Score 6.4 — an AI research paper of some reference value

Source: cs.AI updates on arXiv.org · Published 2026-04-08

arXiv:2604.06132v1 Announce Type: new

Abstract: Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (ge
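To make the first limitation concrete, the following minimal sketch (all names are hypothetical, not from the paper) contrasts "trajectory-opaque" grading, which inspects only an agent's final output, with trajectory-aware grading, which also checks the intermediate steps the agent took:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list          # actions the agent took, in order
    final_output: str    # what the agent returned at the end

def grade_output_only(traj: Trajectory, expected: str) -> bool:
    """Trajectory-opaque grading: pass iff the final answer matches."""
    return traj.final_output == expected

def grade_trajectory(traj: Trajectory, expected: str,
                     forbidden: set) -> bool:
    """Trajectory-aware grading: also fail if any intermediate step
    is on a list of unsafe actions, even when the answer is right."""
    if any(step in forbidden for step in traj.steps):
        return False
    return traj.final_output == expected

# An agent that reaches the correct answer via an unsafe action:
t = Trajectory(steps=["read_config", "rm -rf /tmp/cache", "write_report"],
               final_output="report.md")

print(grade_output_only(t, "report.md"))                        # True
print(grade_trajectory(t, "report.md", {"rm -rf /tmp/cache"}))  # False
```

An output-only grader accepts this run; a trajectory-aware grader rejects it because of the destructive intermediate command, which is the kind of distinction the abstract says existing benchmarks miss.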

