1. Rao and colleagues evaluated the clinical reasoning performance of 21 off-the-shelf LLMs across 29 standardized clinical vignettes.
2. While the LLMs achieved high accuracy on final diagnoses, they performed poorly in generating differential diagnoses, with higher failure rates in managing young adult and middle-aged patients.
Evidence Rating Level: 2 (Good)
Study Rundown: LLMs are being marketed for patient-facing clinical use, with an emphasis on their high accuracy on benchmark tasks. However, concerns remain regarding their safety and ability to navigate diagnostic uncertainty. Rao and colleagues assessed the accuracy of 21 off-the-shelf LLMs in working through 29 clinical vignettes, each spanning five clinical reasoning tasks: formulating a differential diagnosis, selecting diagnostic testing, reaching a final diagnosis, planning management, and miscellaneous clinical reasoning. The outcomes were the PrIME-LLM score, which encapsulated performance across these five domains, and the failure rate, defined as the proportion of questions not answered fully correctly. Across the LLMs assessed, mean PrIME-LLM scores ranged from 0.64 to 0.78, and failure rates for differential diagnosis exceeded 0.80 in all models. Additionally, differential diagnosis and management failure rates were higher for cases involving young adult and middle-aged patients, whereas pediatric cases showed lower failure rates across all categories except differential diagnosis. This study demonstrated that LLMs have persistent limitations in independent clinical reasoning.
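For concreteness, the two outcome measures can be written as follows. The notation is ours, not the study's: A_model and A_ref denote the areas of the model's and the reference five-domain performance polygons described in the In-Depth section below.

```latex
\[
\text{failure rate} = \frac{\#\,\text{questions not answered fully correctly}}{\#\,\text{questions asked}},
\qquad
\text{PrIME-LLM} = \frac{A_{\text{model}}}{A_{\text{ref}}}
\]
```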
Click here to read the study in JAMA Network Open
Relevant Reading: Usability and feasibility of a Socratic LLM-supported learning tool for clinical reasoning in undergraduate nursing education
In-Depth [cross-sectional study]: Clinical vignettes were gathered from the MSD Manual web modules. Each vignette presented a case with a history of present illness, review of systems, physical examination findings, and laboratory results. The cases employed sequential select-all-that-apply questions to simulate the diagnostic process from differential diagnosis through testing and management planning. Medical student evaluators scored model responses against the MSD Manual answer keys. Performance across the five domains was visualized as a polygon, and the PrIME-LLM score was calculated as the area of the model’s polygon divided by the area of the reference polygon. Failure rates served as a complementary reliability measure to the PrIME-LLM score. PrIME-LLM scores differed significantly across the models. The top performers included Grok 4, GPT-4.5, Claude 4.5 Opus, and Gemini 3.0 Flash, with no significant differences among them: Grok 4 scored 0.78 [range, 0.77-0.79], whereas a weaker performer, Gemini 1.5 Flash, scored 0.64 [range, 0.63-0.65]. LLMs performed better on final diagnosis items than on the other four domains while exhibiting the highest failure rates in differential diagnosis; for GPT-4o, the mean difference between final diagnosis and diagnostic testing items was 0.12 [95% confidence interval, 0.03-0.07], with similar differences observed in other models. Models consistently collapsed prematurely onto a single answer when asked for a differential diagnosis. Overall, this study demonstrated that LLMs performed poorly in early diagnostic reasoning.
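The study does not publish its area calculation, but the geometry is straightforward. The Python sketch below shows one way to compute the polygon-area ratio, assuming the five domain scores are plotted on equally spaced radar-chart spokes and the reference polygon is a perfect score of 1.0 in every domain; the model_scores values are illustrative, not the study's data.

```python
import math

def radar_polygon_area(scores):
    """Area of the polygon formed by plotting each score along
    equally spaced spokes of a radar chart (shoelace formula)."""
    n = len(scores)
    # Place vertex i at radius scores[i] and angle 2*pi*i/n.
    pts = [
        (r * math.cos(2 * math.pi * i / n), r * math.sin(2 * math.pi * i / n))
        for i, r in enumerate(scores)
    ]
    # Shoelace formula over consecutive vertex pairs, wrapping around.
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]))) / 2

# Illustrative per-domain scores on [0, 1] for the five reasoning domains:
# differential diagnosis, diagnostic testing, final diagnosis,
# management, miscellaneous reasoning. (Not the study's data.)
model_scores = [0.55, 0.70, 0.90, 0.75, 0.68]
reference_scores = [1.0] * 5  # assumed reference: perfect score everywhere

score = radar_polygon_area(model_scores) / radar_polygon_area(reference_scores)
print(f"PrIME-LLM score: {score:.2f}")
```

Under this construction, the area depends on products of adjacent domain scores, so a model with one strong and one weak domain loses more area than a model with the same average spread evenly, which would make the score reward balanced reasoning performance.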
Image: PD
©2026 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.