This paper is like a standardized test report card for AI doctors: it compares how well different Chinese and international chatbots (large language models) can answer official exam questions used to certify radiology attending physicians in China.
Hospitals and regulators need to know whether general-purpose AI models are actually safe and reliable enough to assist radiologists, and which models perform best on real, high‑stakes medical exams in Chinese.
Using a regulated, high‑stakes national qualification exam in Chinese as the benchmark, combined with domain‑expert grading and error analysis, yields a high-quality, hard-to-replicate evaluation dataset and methodology.
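As a concrete illustration of the grading side of such a benchmark, here is a minimal Python sketch of scoring one model on single-choice exam items against an official answer key. The JSON layout, field names, and the `ask` callable are hypothetical conveniences, not the paper's actual pipeline, and the real study also relied on domain-expert grading beyond automatic letter matching.

```python
import json
from typing import Callable

def score_exam(items_path: str, ask: Callable[[str], str]) -> float:
    """Score a model on single-choice exam items stored as JSON (hypothetical format), e.g.
    [{"question": "...", "options": {"A": "...", "B": "..."}, "answer": "B"}, ...]."""
    with open(items_path, encoding="utf-8") as f:
        items = json.load(f)
    correct = 0
    for item in items:
        option_lines = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
        # Ask for the option letter only, so replies are easy to grade automatically.
        reply = ask(f"{item['question']}\n{option_lines}\n只输出正确选项的字母。")
        # Treat the first option letter appearing in the reply as the model's choice.
        choice = next((ch for ch in reply.upper() if ch in item["options"]), None)
        correct += choice == item["answer"]
    return correct / len(items)
```

Comparing Chinese and international models is then just a matter of calling `score_exam` once per model's `ask` function and ranking the resulting accuracies.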
Frontier Wrapper (GPT-4)
Context Window Stuffing
Low (No-Code/Wrapper)
Context window cost, plus the need for careful prompt design and translation for medical exam-style questions (see the sketch after this list).
Early Adopters
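The "context window stuffing" pattern here amounts to packing an entire exam item — stem, options, and answer-format instruction — into a single prompt. A minimal sketch, assuming the official `openai` Python client and a GPT-4 chat model as one of the models under test; the Chinese prompt wording is illustrative, not the paper's actual template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_exam_question(question: str, options: dict[str, str]) -> str:
    """Stuff one exam item into a single prompt and return the raw reply."""
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt = (
        "你是一名放射科主治医师，请回答下面的资格考试单选题。\n"  # "You are a radiology attending physician; answer this single-choice qualification exam question."
        f"{question}\n{option_lines}\n"
        "只输出正确选项的字母。"  # "Output only the letter of the correct option."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic answers make grading reproducible
    )
    return response.choices[0].message.content.strip()
```

The same stuffed prompt would need re-wording or translation per model, which is where the prompt-design and translation cost noted above comes in.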
Unlike typical generic benchmarks, this work evaluates multiple Chinese and international LLMs directly on an official Chinese radiology attending physician qualification exam, providing a culturally and linguistically specific safety/competence signal for clinical decision support in China.
4 use cases in this application