Tag: benchmark

All the articles with the tag "benchmark".

7.5
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
April 29, 2026
· arXiv cs.LG · collected 04/29 14:31
The first web agent benchmark targeting realistic long-horizon, multi-site workflows
7.0
Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four
April 29, 2026
· arXiv cs.LG · collected 04/29 14:31
Measures AI's ability to autonomously implement an end-to-end ML pipeline, as an early warning signal of AI self-improvement potential
6.0
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
April 29, 2026
· arXiv cs.CL · collected 04/29 14:31
A 260-task benchmark for data visualization agents, covering the full lifecycle of real-world professional scenarios
4.5
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in LLMs
April 29, 2026
· arXiv cs.CL · collected 04/29 14:31
SOB, a multi-source benchmark that evaluates LLM structured output quality, going beyond mere schema compliance
4.5
GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
April 29, 2026
· arXiv cs.CL · collected 04/29 14:31
A minimal workflow for multilingual adaptation of agent benchmarks that goes beyond machine translation
6.5
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
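April 23, 2026
· arXiv cs.AI · collected 04/23 14:32
REST/REST+ Benchmark: a systematic evaluation of cross-modal inconsistency in multimodal large language models, where the same content yields contradictory answers across modalities.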
6.0
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
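April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
Beyond Rating: a holistic framework for evaluating the argumentative quality of AI-written review text, moving beyond the scalar-rating paradigm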
7.0
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
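April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
A large-scale study of LLM-guided evolutionary search: collects optimization trajectories from 15 LLMs across 8 tasks and reveals the mechanisms that drive optimization gains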
7.0
Lost in Translation: Do LVLM Judges Generalize Across Languages?
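April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
MM-JudgeBench: the first large-scale multilingual, multimodal judge benchmark, revealing cross-lingual generalization deficiencies in LVLM evaluators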
7.0
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
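April 22, 2026
· cs.CL updates on arXiv.org · collected 04/22 14:31
The first benchmark to detect harmful behavior at the reasoning-chain level, capturing the full chain of behaviors in a jailbreak, from suppressing refusals to concealing risks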
6.6
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
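April 13, 2026
· arXiv cs.AI · collected 04/13 12:31
HiL-Bench is the first benchmark designed specifically to evaluate an AI agent's "judgment": rather than scoring performance on perfect instructions, it measures whether the agent can recognize uncertainty and proactively seek human help when the specification is incomplete or ambiguous.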
5.5
DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?
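April 13, 2026
· arXiv cs.AI · collected 04/13 12:31
DRBENCHER is a synthetic benchmark generator that produces deep research questions requiring both web browsing and multi-step computation, used to evaluate how deep research agents perform in realistic research scenarios.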
6.5
SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
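April 13, 2026
· arXiv cs.AI · collected 04/13 12:31
SEA-Eval is the first to propose a benchmark framework for self-evolving agents that goes beyond within-episode evaluation, assessing whether an agent can accumulate experience across tasks, refine its strategies, and evolve its toolset rather than starting from scratch each time.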
8.0
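OpenAI Releases GPT-5.4: The First General-Purpose Model with Native Computer Use
March 22, 2026
· OpenAI · collected 03/22 14:45
GPT-5.4 surpasses its predecessor across coding, agent workflows, and general reasoning; it scores 75% on OSWorld, exceeding the human baseline, and 83% on GDPval, which covers 44 occupations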
8.0
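Qwen3.5-9B Hits 93.8% Accuracy Running Locally, Just 4 Percentage Points Behind GPT-5.4
March 21, 2026
· Hacker News · collected 03/21 14:45
The HomeSec-Bench benchmark shows that the 9B-parameter Qwen3.5, running fully offline on a MacBook M5, achieves security-domain performance close to that of top cloud models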
7.2
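PIXAR: From Masks to Pixels - A New Taxonomy and Benchmark for VLM Image Tampering Detection
March 21, 2026
· arXiv · collected 03/23 18:34
A CVPR 2026 paper that redefines VLM image tampering detection: it moves from coarse-grained region masks to a unified, pixel-level, semantics-aware evaluation framework, revealing systematic biases that lead existing methods to severely overestimate or underestimate detection capability.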
8.2
HorizonMath: Measuring AI Progress Toward Mathematical Discovery
March 16, 2026
New benchmark of 100+ unsolved math problems with automated verification. GPT 5.4 Pro proposes solutions improving on best-known results for two problems.
7.5
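LLM Evaluation: AI's New Bottleneck
March 15, 2026
· ML Frontiers · collected 03/16 18:32
Language models are advancing faster than our ability to measure them reliably, and this is becoming a problem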
7.5
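LLM PR Merge Rates on SWE-Bench Have Not Improved
March 13, 2026
· Entropic Thoughts
The study finds that although SWE-Bench scores keep rising, the share of LLM-generated PRs actually merged into the main branch has not improved, suggesting a disconnect between evaluation and reality.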
7.5
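Morgan Stanley Warns: AI Will See a Breakthrough Leap in the First Half of 2026, and the World Is Not Ready
March 13, 2026
· Fortune
A new Morgan Stanley report says accumulated compute scale will drive a jump in AI capabilities in the first half of 2026, and that GPT-5.4 has already surpassed human experts on the GDPval benchmark, but energy infrastructure constraints are intensifying.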