
ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Academic Frontier · Score 6.4 · Source: cs.CL updates on arXiv.org · Published 2026-04-08

Rating rationale: an AI research paper of some reference value

arXiv:2604.06111v1 Announce Type: cross Abstract: Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench, built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule, subject to both local slot constraints and global constraints. The benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons and Controllable Difficulty.
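The core task can be pictured as a small constraint-satisfaction check. Below is a minimal, hypothetical sketch of the slot-filling setup the abstract describes; all names (`is_valid`, `local_rules`, `global_rule`) are illustrative assumptions, not the paper's actual environment API.

```python
def is_valid(schedule, slot, value, local_rules, global_rule):
    """Check whether assigning `value` to `slot` keeps the partial schedule consistent."""
    # Local slot constraint: the value must be allowed for this specific slot.
    if value not in local_rules.get(slot, set()):
        return False
    # Global constraint: the whole (partial) schedule must still satisfy it.
    trial = dict(schedule)
    trial[slot] = value
    return global_rule(trial)

# Toy instance: three slots, each accepting only a subset of values (local),
# with the global rule that all assigned values must be distinct.
local_rules = {"s0": {"A", "B"}, "s1": {"B", "C"}, "s2": {"A", "C"}}
global_rule = lambda sched: len(set(sched.values())) == len(sched)

schedule = {"s0": "A"}  # partially completed; s1 and s2 are hidden slots

print(is_valid(schedule, "s1", "B", local_rules, global_rule))  # → True
print(is_valid(schedule, "s2", "A", local_rules, global_rule))  # → False (global: duplicates "A")
```

Under this framing, the two control axes map naturally onto the instance generator: the horizon scales with the number of hidden slots, and difficulty with how tightly the local and global rules constrain valid assignments.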

