Score 3.2 · Source: cs.CL updates on arXiv.org · Published: 2026-04-15
Score rationale: Moderate AI relevance +novelty(1) +practical(2)
arXiv:2604.12843v1 Announce Type: new
Abstract: The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item…
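The abstract only sketches the method, but the core idea, holding anchor-item parameters fixed so that new items are placed on an existing latent scale, can be illustrated. Below is a minimal sketch assuming a multidimensional 2PL IRT model and joint maximum-likelihood estimation; the simulation setup, dimensions, and function names are illustrative assumptions, not the paper's implementation (the paper may well use a different estimator, e.g. marginal ML).

```python
# Minimal sketch (NOT the paper's code): anchor-based calibration under a
# multidimensional 2PL IRT model. Anchor items have previously calibrated
# parameters that are held fixed; only abilities and new-item parameters
# are estimated, so new items inherit the existing ability scale.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
D = 2   # latent ability dimensions (assumed)
M = 20  # models (respondents)
A = 15  # anchor items with known, previously calibrated parameters
N = 10  # new benchmark items to calibrate

# Previously calibrated anchor-item parameters (held fixed during fitting).
a_anchor = rng.normal(1.0, 0.3, (A, D))  # discriminations
b_anchor = rng.normal(0.0, 1.0, A)       # difficulties

# Ground truth used only to simulate responses.
theta_true = rng.normal(0.0, 1.0, (M, D))
a_new_true = rng.normal(1.0, 0.3, (N, D))
b_new_true = rng.normal(0.0, 1.0, N)

def prob(theta, a, b):
    """P(correct) under the multidimensional 2PL: sigmoid(theta @ a.T - b)."""
    return expit(theta @ a.T - b)

# Simulated binary response matrices, shape (models, items).
y_anchor = rng.binomial(1, prob(theta_true, a_anchor, b_anchor))
y_new = rng.binomial(1, prob(theta_true, a_new_true, b_new_true))

def neg_log_lik(params):
    """Joint NLL over anchor + new items. Anchor parameters are constants,
    so responses on anchors pin down the ability scale."""
    theta = params[:M * D].reshape(M, D)
    a_new = params[M * D:M * D + N * D].reshape(N, D)
    b_new = params[M * D + N * D:]
    ll = 0.0
    for y, a, b in [(y_anchor, a_anchor, b_anchor), (y_new, a_new, b_new)]:
        p = np.clip(prob(theta, a, b), 1e-9, 1 - 1e-9)
        ll += np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -ll

x0 = np.concatenate([np.zeros(M * D), np.ones(N * D), np.zeros(N)])
res = minimize(neg_log_lik, x0, method="L-BFGS-B")
b_new_hat = res.x[M * D + N * D:]
print("recovered vs true difficulties (first 3):",
      np.round(b_new_hat[:3], 2), np.round(b_new_true[:3], 2))
```

Because the anchor parameters enter the likelihood as fixed constants, the estimated abilities and new-item parameters are expressed on the scale of the existing evaluation suite; refitting everything from scratch would instead leave the latent space free to shift and rotate, which is exactly the cross-study comparability problem the abstract describes.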