Tag: Evaluation

All articles tagged "Evaluation".

7.7
Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode
2026-04-08
· cs.AI updates on arXiv.org · collected 04/08 12:31
arXiv:2604.04978v1 Announce Type: cross Abstract: Claude Code's auto mode is the first deployed permission system for AI coding agents, using a two...
7.3
The illusion of reasoning: step-level evaluation reveals decorative chain-of-thought in frontier language models
2026-04-08
· cs.CL updates on arXiv.org · collected 04/08 12:31
arXiv:2603.22816v2 Announce Type: replace Abstract: Language models increasingly "show their work" by writing step-by-step reasoning before answeri...
6.4
Can We Predict Before Executing Machine Learning Agents?
2026-04-08
· cs.CL updates on arXiv.org · collected 04/08 12:31
arXiv:2601.05930v2 Announce Type: replace Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain co...
6.4
MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
2026-04-08
· cs.AI updates on arXiv.org · collected 04/08 12:31
arXiv:2603.09643v4 Announce Type: replace-cross Abstract: Current evaluation frameworks and benchmarks for LLM powered agents focus on text chat dr...
6.4
ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
2026-04-08
· cs.CL updates on arXiv.org · collected 04/08 12:31
arXiv:2604.06111v1 Announce Type: cross Abstract: Existing Agent benchmarks suffer from two critical limitations: high environment interaction over...
6.4
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
2026-04-08
· cs.AI updates on arXiv.org · collected 04/08 12:31
arXiv:2604.06132v1 Announce Type: new Abstract: Large language models are increasingly deployed as autonomous agents executing multi-step workflows...
5.7
Context-Value-Action Architecture for Value-Driven Large Language Model Agents
2026-04-08
· cs.AI updates on arXiv.org · collected 04/08 12:31
arXiv:2604.05939v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents o...
6.5
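AI Planning Framework: Diagnosing LLM Web Agents with Classical Planning Paradigms
2026-03-18
· arXiv · collected 03/18 02:32
An arXiv paper maps Web Agent architectures onto BFS/DFS/Best-First search and proposes 5 new evaluation metrics and a dataset of 794 human-annotated trajectories.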
7.8
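AI Agent Research Weekly: Strategic Reasoning vs. Brute-Force Search
2026-03-16
· LLM Watch
A roundup of this week's Agent research papers, covering reasoning ability, evaluation methods, safety risks, and continual learning.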
