Score: 5.5 · Source: cs.LG updates on arXiv.org · Published 2026-04-17
Rating rationale: Large-scale evaluation (3333 diagnoses, 300 cases) comparing an LLM jury to expert clinicians; a meaningful medical AI benchmark
arXiv:2604.14892v1 Announce Type: new Abstract: Evaluating medical AI systems with expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world hospital cases from a middle-income country (MIC). Model performance was benchmarked against evaluations by an expert clinician panel and an independent human re-scoring panel.
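A minimal sketch of how such a jury-versus-panel comparison could be computed. The mean-score aggregation, the 0-2 rubric, and the agreement tolerance are illustrative assumptions for this sketch; the abstract does not specify the paper's actual aggregation or agreement metric.

```python
from statistics import mean

def jury_score(model_scores):
    """Aggregate per-diagnosis scores from several LLM jurors by averaging.
    (Hypothetical aggregation rule, not necessarily the paper's.)"""
    return mean(model_scores)

def agreement_rate(jury, human, tolerance=0.5):
    """Fraction of diagnoses where jury and human-panel scores
    agree to within a tolerance (an assumed agreement criterion)."""
    matches = sum(1 for j, h in zip(jury, human) if abs(j - h) <= tolerance)
    return matches / len(jury)

# Three hypothetical jurors scoring four diagnoses on an assumed 0-2 rubric.
juror_a = [2, 1, 0, 2]
juror_b = [2, 1, 1, 2]
juror_c = [1, 1, 0, 2]
jury = [jury_score(scores) for scores in zip(juror_a, juror_b, juror_c)]
human = [2, 1, 0, 2]  # hypothetical expert-panel scores
print(round(agreement_rate(jury, human), 2))  # → 1.0
```

At the paper's scale this loop would simply run over 3333 diagnosis-level scores rather than four.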