Score: 6.4 · Source: cs.AI updates on arXiv.org · Published: 2026-04-08
Rating rationale: an AI research paper with some reference value
arXiv:2604.06132v1 Announce Type: new Abstract: Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (ge
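To make the first limitation concrete, here is a minimal Python sketch contrasting trajectory-opaque grading (final output only) with trajectory-aware grading, where intermediate actions are also checked. All names here (`Step`, `final_output_grade`, `trajectory_grade`, the unsafe-command patterns) are hypothetical illustrations, not APIs from the paper:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str    # e.g. "run_shell", "edit_file"
    argument: str  # the command or payload the agent issued

def final_output_grade(final_answer: str, expected: str) -> bool:
    # Trajectory-opaque: only the end state is compared.
    return final_answer.strip() == expected.strip()

def trajectory_grade(steps: list[Step], final_answer: str, expected: str) -> bool:
    # Trajectory-aware: every intermediate action must also pass a
    # safety check, so an unsafe path to a correct answer still fails.
    unsafe_patterns = ("rm -rf", "curl | sh")  # illustrative rules only
    for step in steps:
        if step.action == "run_shell" and any(p in step.argument for p in unsafe_patterns):
            return False
    return final_output_grade(final_answer, expected)

# Example: a correct final answer reached via a destructive command.
steps = [Step("run_shell", "rm -rf /tmp/workdir"), Step("edit_file", "patch applied")]
print(final_output_grade("done", "done"))       # True: the opaque grader passes it
print(trajectory_grade(steps, "done", "done"))  # False: the trajectory grader flags it
```

The point of the contrast is that an output-only grader cannot distinguish these two runs, which is why trajectory-opaque grading also underlies the second gap the abstract names: safety and robustness failures typically surface in intermediate steps, not in the final answer.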