AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Published: April 14, 2026

Collected: April 14, 2026, 04:31

Academic Frontier · Score 7.0: A long-context autonomous-agent benchmark that addresses the gap between single-capability tests and real-world multi-step scenarios. Its 1M-token context is forward-looking.

Source: cs.AI updates on arXiv.org

Score 7 · Source: cs.AI updates on arXiv.org · Published 2026-04-14

Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility