CocoaBench: Evaluating Unified Digital Agents in the Wild

Published: April 14, 2026

Collected: April 14, 2026, 04:31

Academic Frontier, score 7.0: Fills a gap by evaluating combined capabilities (coding, research, and GUI use) rather than isolated skills. Highly relevant as agents move toward unified architectures.

Source: cs.AI updates on arXiv.org

EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems