Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Published

2026-04-29

Collected 2026-04-29 06:31

Academic Frontier, score 7.5: this web agent benchmark fills the gap in evaluating long-horizon, multi-site tasks and is highly relevant to current agent research.

Source: arXiv cs.LG

Score 7.5 · Source: arXiv cs.LG · Published 2026-04-29


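Existing web agent benchmarks concentrate on short-horizon, single-site tasks, on which frontier models are already close to saturation. Real-world web use, by contrast, is long-horizon and multi-site: cross-domain price comparison, travel planning across multiple services, multi-round search and summarization, and similar workflows. Odysseys designs long-horizon tasks that require sustained context and cross-site reasoning, filling this gap.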

