Score: 5.5 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Large-scale evaluation (3333 diagnoses, 300 cases) comparing an LLM jury to expert clinicians; a meaningful medical AI benchmark
arXiv:2604.14892v1 Announce Type: new Abstract: Evaluating medical AI systems with expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world hospital cases from a middle-income country (MIC). Model performance was benchmarked against evaluations by an expert clinician panel and an independent human re-scoring panel.
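A minimal sketch of how such a jury-versus-panel comparison could be computed. The mean-score aggregation, the 0-2 rubric, and the agreement tolerance are illustrative assumptions for this sketch; the abstract does not specify the paper's actual aggregation or agreement metric.

```python
from statistics import mean

def jury_score(model_scores):
    """Aggregate per-diagnosis scores from several LLM jurors by averaging.
    (Hypothetical aggregation rule, not necessarily the paper's.)"""
    return mean(model_scores)

def agreement_rate(jury, human, tolerance=0.5):
    """Fraction of diagnoses where jury and human-panel scores
    agree to within a tolerance (an assumed agreement criterion)."""
    matches = sum(1 for j, h in zip(jury, human) if abs(j - h) <= tolerance)
    return matches / len(jury)

# Three hypothetical jurors scoring four diagnoses on an assumed 0-2 rubric.
juror_a = [2, 1, 0, 2]
juror_b = [2, 1, 1, 2]
juror_c = [1, 1, 0, 2]
jury = [jury_score(scores) for scores in zip(juror_a, juror_b, juror_c)]
human = [2, 1, 0, 2]  # hypothetical expert-panel scores
print(round(agreement_rate(jury, human), 2))  # → 1.0
```

At the paper's scale this loop would simply run over 3333 diagnosis-level scores rather than four.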