1. In this clinical decision support study, Chat Generative Pre-trained Transformer (ChatGPT) was presented with decision-making questions based on Merck Sharp & Dohme (MSD) Clinical Manual scenarios, with an overall performance of 71.8% accuracy.
2. In terms of prompt type, ChatGPT had the highest performance on final diagnosis questions (76.9% accuracy) and the lowest performance on initial differential diagnosis questions (60.3% accuracy).
Evidence Rating Level: 1 (Excellent)
Study Rundown: Artificial intelligence has gained increasing popularity in health care, with potential applications in individual patient care. Specifically, artificial intelligence may be leveraged to aid in clinical decision-making tasks such as developing a differential diagnosis. ChatGPT is an autoregressive large language model trained on data from sources across the internet, which it uses to generate responses to user inputs. Accordingly, the current study assessed the performance of ChatGPT version 3.5 in answering questions about MSD Clinical Manual vignettes. Experts in the field scored outputs as per the MSD Manual answer guide. ChatGPT was most likely to score correctly on final diagnosis questions and scored lowest on initial differential diagnosis questions. Accuracy did not vary significantly based on the acuity of the clinical presentation, patient age, or patient gender presented in the clinical vignette. ChatGPT showed constraints in clinical judgment, including difficulty with medication dosing. Further, the study is limited by the inaccessibility of ChatGPT’s training data, which may include the MSD Clinical Manual. The results of this study suggest that ChatGPT may be able to support clinicians in solving clinical vignettes and making decisions about patient care, with specific utility in establishing a final diagnosis.
Click to read the study in the Journal of Medical Internet Research
In-Depth [clinical decision support study]: This was a clinical decision support study that evaluated the performance of ChatGPT version 3.5 in correctly answering questions related to patient vignettes from the MSD Clinical Manual. The primary outcome of interest was the overall accuracy of ChatGPT. Results were also stratified by type of question, acuity of clinical presentation, and patient demographics (age, gender). A total of 36 clinical vignettes were used, with the exclusion of questions involving image analysis. Each clinical vignette was tested in three separate ChatGPT sessions, with two independent individuals scoring the answers. Notably, there were no scoring discrepancies throughout the study. Data were analyzed via multivariable linear regression. Overall, ChatGPT achieved an accuracy of 71.8% (range 55.9% to 83.8%). The average score across question types varied from 60.3% for initial differential diagnosis to 76.9% for final diagnosis, suggesting that performance improved as more clinical information was provided. When assessing performance by patient demographics, no significant difference was found based on age (p=0.35) or gender (p=0.59). Similarly, the acuity of the clinical vignettes, as assessed by the Emergency Severity Index, did not significantly impact accuracy (p=0.55). Finally, the majority of medication errors were attributable to incorrect dosing. In summary, ChatGPT was able to solve clinical vignettes with improving accuracy as more information was introduced, with notable limitations in generating an initial differential diagnosis and in medication dosing.
Image: PD
©2023 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.