
Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Academic Frontier · Score 3.2 — Moderate AI relevance +novelty(1) +practical(2)
Source: cs.CL updates on arXiv.org

Published: 2026-04-15


arXiv:2604.12843v1 · Announce Type: new

Abstract: The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item…
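The abstract's core idea — estimating model abilities from already-calibrated anchor items, then calibrating new items against those fixed estimates — can be sketched in a few lines. The snippet below is a minimal illustration only, not the paper's method: it uses a unidimensional 2PL model (the paper uses multidimensional IRT), and all item parameters, abilities, and helper names here are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, a, b):
    """2PL IRT success probability: ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical anchor items whose parameters were previously calibrated; they
# stay FIXED, which pins the latent scale so new estimates are comparable.
anchor_a = np.array([1.0, 1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0])
anchor_b = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, -1.0, 1.5, 0.2])

rng = np.random.default_rng(42)

def estimate_theta(responses):
    """MLE of one model's ability from its anchor-item responses (anchors fixed)."""
    def nll(theta):
        p = np.clip(p_correct(theta, anchor_a, anchor_b), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(-4, 4), method="bounded").x

# Simulate a few "models" with known abilities answering the anchors
# plus one new benchmark item whose difficulty we want to calibrate.
true_thetas = np.array([-1.0, 0.0, 1.0, 2.0])
true_new_b = 0.8  # assumed ground-truth difficulty of the new item
thetas_hat, new_item_responses = [], []
for t in true_thetas:
    r = (rng.random(len(anchor_a)) < p_correct(t, anchor_a, anchor_b)).astype(float)
    thetas_hat.append(estimate_theta(r))
    new_item_responses.append(float(rng.random() < p_correct(t, 1.0, true_new_b)))
thetas_hat = np.array(thetas_hat)

def calibrate_new_item(thetas, responses, a=1.0):
    """MLE of a new item's difficulty b, holding abilities (anchored scale) fixed."""
    responses = np.asarray(responses)
    def nll(b):
        p = np.clip(p_correct(thetas, a, b), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(-4, 4), method="bounded").x

b_hat = calibrate_new_item(thetas_hat, new_item_responses)
```

Because the anchor parameters never move, `b_hat` lands on the same scale as every previously calibrated item, which is what lets scores from different benchmark samples be compared. In practice one would use many responses per item and joint estimation rather than this two-stage toy.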