2 Minute Medicine

Large language models (LLMs) performed poorly in navigating early clinical diagnostic uncertainty

by Cheng En Xi
April 20, 2026
in Artificial Intelligence

1. Rao and colleagues evaluated the clinical reasoning performance of 21 off-the-shelf LLMs across 29 standardized clinical vignettes.

2. While the LLMs achieved high accuracy on final diagnoses, they performed poorly in generating differential diagnoses and managing young adult patients.

Evidence Rating Level: 2 (Good)

Study Rundown: LLMs are being marketed for patient-facing clinical use, with an emphasis on their high accuracy on benchmark tasks. However, concerns remain about their safety and their ability to navigate diagnostic uncertainty. Rao and colleagues assessed the accuracy of 21 off-the-shelf LLMs in working through 29 clinical vignettes and completing five types of clinical reasoning tasks: differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning. The outcomes were the PrIME-LLM score, which summarized performance across these five domains, and the failure rate, defined as the proportion of questions not answered fully correctly. Mean PrIME-LLM scores ranged from 0.64 to 0.78 across the LLMs assessed, and failure rates for differential diagnosis exceeded 0.80 in every model. Additionally, differential diagnosis and management failure rates were higher for cases involving young adult and middle-aged patients, whereas pediatric cases showed lower failure rates across all categories except differential diagnosis. This study demonstrated that LLMs have persistent limitations in independent clinical reasoning.
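
To make the failure-rate definition concrete, the following is a minimal Python sketch, not the study's scoring code. It assumes that "fully correct" means the set of options a model selects exactly matches the answer key; the function names and the example responses are hypothetical.

```python
def is_fully_correct(selected, answer_key):
    """Assumed rule: a question counts as fully correct only when the
    selected options exactly match the answer key (no misses, no extras)."""
    return set(selected) == set(answer_key)


def failure_rate(responses):
    """Proportion of questions not answered fully correctly.

    `responses` is an iterable of (selected_options, answer_key) pairs.
    """
    responses = list(responses)
    if not responses:
        raise ValueError("at least one response is required")
    failures = sum(
        1 for selected, key in responses if not is_fully_correct(selected, key)
    )
    return failures / len(responses)


# Hypothetical differential-diagnosis items, not data from the study.
example_responses = [
    ({"A", "B"}, {"A", "B", "C"}),  # missed option C: failure
    ({"D"}, {"D"}),                 # exact match: success
    ({"A", "C"}, {"A", "C"}),       # exact match: success
    ({"B"}, {"A", "B"}),            # collapsed onto one answer: failure
]

print(failure_rate(example_responses))  # 0.5
```

Under this all-or-nothing rule, a model that names only the single most likely diagnosis on a select-all-that-apply item fails that item even when its one choice is in the key.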

Click here to read the study in JAMA Network Open

Relevant Reading: Usability and feasibility of a Socratic LLM-supported learning tool for clinical reasoning in undergraduate nursing education

In-Depth [cross-sectional study]: Clinical vignettes were gathered from the MSD Manual web modules. Each vignette presented a case with a history of present illness, review of systems, physical examination findings, and laboratory results. The cases used sequential select-all-that-apply questions to simulate the diagnostic process from differential diagnosis through testing and management planning. Medical student evaluators scored the responses against the MSD Manual answer keys. Performance across the five domains was visualized as a polygon, and the PrIME-LLM score was calculated as the area of the model’s polygon divided by the area of the reference polygon. Failure rates served as a reliability measure to complement the PrIME-LLM score. PrIME-LLM scores differed significantly across the models. The top-performing LLMs included Grok 4, GPT-4.5, Claude 4.5 Opus, and Gemini 3.0 Flash, with no significant differences among these models. Specifically, Grok 4 had a score of 0.78 [range, 0.77-0.79], while a weaker performer, Gemini 1.5 Flash, had a score of 0.64 [range, 0.63-0.65]. Additionally, LLMs performed better on final diagnosis items than on the other four domains and had their highest failure rates in differential diagnosis. For GPT-4o, the mean difference between the final diagnosis and diagnostic testing items was 0.12 [95% confidence interval, 0.03-0.07], with similar differences observed in other models. When asked for a differential diagnosis, LLMs consistently collapsed prematurely onto a single answer. Overall, this study demonstrated that LLMs performed poorly in early diagnostic reasoning.
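
The polygon-area ratio described above can be illustrated with a short Python sketch. This is not the authors' implementation: it assumes the five domain scores are plotted as radii on equally spaced radar-chart axes, that the reference polygon corresponds to a perfect score of 1.0 on every domain, and that the axis order is fixed. The function names and the example scores are hypothetical.

```python
import math


def radar_polygon_area(radii):
    """Area of a radar-chart polygon whose vertices lie on equally spaced axes.

    Each pair of neighbouring axes spans a triangle with area
    0.5 * r_i * r_next * sin(2*pi/n); the polygon is the sum of those triangles.
    """
    n = len(radii)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    wedge = math.sin(2 * math.pi / n)
    return 0.5 * wedge * sum(radii[i] * radii[(i + 1) % n] for i in range(n))


def area_ratio_score(domain_scores, reference_scores=None):
    """Model polygon area divided by reference polygon area.

    Assumption: the reference defaults to a perfect score on every domain.
    """
    if reference_scores is None:
        reference_scores = [1.0] * len(domain_scores)
    if len(reference_scores) != len(domain_scores):
        raise ValueError("model and reference need the same number of domains")
    return radar_polygon_area(domain_scores) / radar_polygon_area(reference_scores)


# Hypothetical domain scores, in a fixed axis order: differential diagnosis,
# diagnostic testing, final diagnosis, management, miscellaneous.
hypothetical_scores = [0.60, 0.85, 0.95, 0.80, 0.85]
print(round(area_ratio_score(hypothetical_scores), 3))
```

Under these assumptions the ratio is the mean of the products of neighbouring domain scores, so a weak domain is penalized more heavily than it would be in a simple average: a model scoring 0.5 on every domain would receive 0.25 rather than 0.5. The result also depends on how the axes are ordered, which the summary does not specify.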


©2026 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.

Tags: AI, artificial intelligence, chatGPT, large language models, llm, machine learning