Score 3.2 · Source: cs.CL updates on arXiv.org · Published: 2026-04-15
Score rationale: Moderate AI relevance +novelty(1) +practical(2)
arXiv:2604.12843v1 Announce Type: new
Abstract: The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item…
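The abstract only sketches the method, but the core idea, holding anchor-item parameters fixed so that new items are placed on an existing latent scale, can be illustrated. Below is a minimal sketch assuming a multidimensional 2PL IRT model and joint maximum-likelihood estimation; the simulation setup, dimensions, and function names are illustrative assumptions, not the paper's implementation (the paper may well use a different estimator, e.g. marginal ML).

```python
# Minimal sketch (NOT the paper's code): anchor-based calibration under a
# multidimensional 2PL IRT model. Anchor items have previously calibrated
# parameters that are held fixed; only abilities and new-item parameters
# are estimated, so new items inherit the existing ability scale.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
D = 2   # latent ability dimensions (assumed)
M = 20  # models (respondents)
A = 15  # anchor items with known, previously calibrated parameters
N = 10  # new benchmark items to calibrate

# Previously calibrated anchor-item parameters (held fixed during fitting).
a_anchor = rng.normal(1.0, 0.3, (A, D))  # discriminations
b_anchor = rng.normal(0.0, 1.0, A)       # difficulties

# Ground truth used only to simulate responses.
theta_true = rng.normal(0.0, 1.0, (M, D))
a_new_true = rng.normal(1.0, 0.3, (N, D))
b_new_true = rng.normal(0.0, 1.0, N)

def prob(theta, a, b):
    """P(correct) under the multidimensional 2PL: sigmoid(theta @ a.T - b)."""
    return expit(theta @ a.T - b)

# Simulated binary response matrices, shape (models, items).
y_anchor = rng.binomial(1, prob(theta_true, a_anchor, b_anchor))
y_new = rng.binomial(1, prob(theta_true, a_new_true, b_new_true))

def neg_log_lik(params):
    """Joint NLL over anchor + new items. Anchor parameters are constants,
    so responses on anchors pin down the ability scale."""
    theta = params[:M * D].reshape(M, D)
    a_new = params[M * D:M * D + N * D].reshape(N, D)
    b_new = params[M * D + N * D:]
    ll = 0.0
    for y, a, b in [(y_anchor, a_anchor, b_anchor), (y_new, a_new, b_new)]:
        p = np.clip(prob(theta, a, b), 1e-9, 1 - 1e-9)
        ll += np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -ll

x0 = np.concatenate([np.zeros(M * D), np.ones(N * D), np.zeros(N)])
res = minimize(neg_log_lik, x0, method="L-BFGS-B")
b_new_hat = res.x[M * D + N * D:]
print("recovered vs true difficulties (first 3):",
      np.round(b_new_hat[:3], 2), np.round(b_new_true[:3], 2))
```

Because the anchor parameters enter the likelihood as fixed constants, the estimated abilities and new-item parameters are expressed on the scale of the existing evaluation suite; refitting everything from scratch would instead leave the latent space free to shift and rotate, which is exactly the cross-study comparability problem the abstract describes.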