Tag: Evaluation
All the articles with the tag "Evaluation".
- 7.7
Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode
arXiv:2604.04978v1 Announce Type: cross Abstract: Claude Code's auto mode is the first deployed permission system for AI coding agents, using a two...
- 7.3
The illusion of reasoning: step-level evaluation reveals decorative chain-of-thought in frontier language models
arXiv:2603.22816v2 Announce Type: replace Abstract: Language models increasingly "show their work" by writing step-by-step reasoning before answeri...
- 6.4
Can We Predict Before Executing Machine Learning Agents?
arXiv:2601.05930v2 Announce Type: replace Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain co...
- 6.4
MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
arXiv:2603.09643v4 Announce Type: replace-cross Abstract: Current evaluation frameworks and benchmarks for LLM powered agents focus on text chat dr...
- 6.4
ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
arXiv:2604.06111v1 Announce Type: cross Abstract: Existing Agent benchmarks suffer from two critical limitations: high environment interaction over...
- 6.4
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
arXiv:2604.06132v1 Announce Type: new Abstract: Large language models are increasingly deployed as autonomous agents executing multi-step workflows...
- 5.7
Context-Value-Action Architecture for Value-Driven Large Language Model Agents
arXiv:2604.05939v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents o...
- 6.5
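AI Planning Framework: Diagnosing LLM Web Agents with Classical Planning Paradigms
This arXiv paper maps Web Agent architectures onto BFS/DFS/Best-First search and proposes 5 new evaluation metrics along with a dataset of 794 human-annotated trajectories.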
- 7.8