Score 6.5 · Source: cs.CL updates on arXiv.org · Published 2026-04-17
Scoring rationale: Important methodological warning: 42% of turn-level findings may be autocorrelation artifacts, which is critical for conversational analysis research
arXiv:2604.14414v1 Announce Type: new Abstract: Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent — a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference.
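To make the non-independence point concrete, here is a minimal sketch of how autocorrelation between consecutive turns shrinks the amount of independent evidence in a conversation. It is not the paper's method: it assumes the per-turn scores behave roughly like an AR(1) process and uses the textbook effective-sample-size approximation n_eff = n(1 - rho)/(1 + rho). The function names and the `scores` values are hypothetical, chosen only for illustration.

```python
from statistics import mean


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a sequence of per-turn scores."""
    m = mean(xs)
    denom = sum((x - m) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # constant series: treat as uncorrelated
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(len(xs) - 1))
    return num / denom


def effective_sample_size(n, rho):
    """AR(1) approximation (an assumption here): n_eff = n * (1 - rho) / (1 + rho)."""
    rho = max(min(rho, 0.999), -0.999)  # keep the ratio finite
    return n * (1 - rho) / (1 + rho)


# Hypothetical per-turn metric values from one conversation (illustrative only).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

rho = lag1_autocorrelation(scores)
n_eff = effective_sample_size(len(scores), rho)

print(f"lag-1 autocorrelation: {rho:.3f}")
print(f"nominal turns: {len(scores)}, effective turns: {n_eff:.2f}")
```

With positive autocorrelation the effective number of turns falls below the nominal count, so a test that treats every turn as an independent observation understates its standard errors. Clustering by conversation or using a mixed-effects model are other common ways to handle this; the abstract excerpt does not say which correction the authors apply.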