2 Minute Medicine

Large language models (LLMs) demonstrated higher error rates compared to humans in a clinical oncology question bank

by Cheng En Xi and Deepti Shroff
May 28, 2024
in 2 Minute Medicine, Oncology

1. A comparative evaluation tested five publicly available LLMs on 2044 oncology questions spanning a broad range of topics in the field. Responses were compared with a human benchmark.

2. Only one of the five models performed above the 50th percentile of the human benchmark, and performance was worse in clinical oncology subcategories and female-predominant malignancies.

Evidence Rating Level: 2 (Good)

Study Rundown: Many medical professionals have begun to use large language models (LLMs), such as ChatGPT, as augmented search engines for medical information. LLMs have demonstrated high performance on subspecialty medical examinations across multiple specialties, but their utility in clinical applications within oncology remains unexplored.

Rydzewski and colleagues compared the performance of five LLMs on a set of multiple-choice clinical oncology questions against a random guess algorithm and against radiation oncology trainees. For each question, the LLMs were asked to provide an answer, a confidence score, and an explanation of the response. Each LLM was evaluated on 2044 unique questions across three independent replicates, and the authors assessed accuracy, self-appraised confidence, and the consistency of responses across replicates.

The study found that only one of the five LLMs (GPT-4) scored higher than the 50th percentile when compared to human trainees, despite all five showing high self-appraised confidence. The remaining LLMs had substantially lower accuracy, with some performing similarly to the random guess strategy. LLMs scored higher on foundational topics and lower on clinical oncology topics, especially those related to female-predominant malignancies. The authors found that combining model selection, self-appraised confidence, and output consistency helped identify more reliable outputs. Overall, this study demonstrated a need to assess the safety of implementing LLMs in clinical settings and pointed to training bias, in the form of medical misinformation related to female-predominant malignancies.

Click here to read the study in NEJM AI

Click to read an accompanying editorial in NEJM AI

Relevant Reading: Performance of ChatGPT on a primary FRCA multiple choice question bank

In-Depth [cross-sectional study]: In this study, Rydzewski and colleagues assessed the accuracy of five LLMs (LLaMA, PaLM 2, Claude-v1, GPT-3.5, and GPT-4) on 2044 multiple-choice questions and sought strategies to help end users recognize reliable LLM outputs. The questions were sourced from the American College of Radiology in-training radiation oncology examinations from 2013-2017, 2020, and 2021. Each question was repeated across three independent replicates. The authors compared LLM performance with a random guessing strategy and, for the questions sourced from the 2013 and 2014 examinations, with human scores. The authors also assessed the self-appraised confidence of the LLMs by prompting for a confidence score ranging from 1-4, with 1 indicating a random guess and 4 indicating maximal confidence.
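The scoring setup described above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the four-option answer format, the fixed seed, and the toy answer key are all assumptions made for illustration. It averages accuracy over three replicates and builds a random guess baseline to compare against.

```python
import random
from statistics import mean


def accuracy(answers, key):
    """Fraction of questions answered correctly in a single replicate."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def mean_accuracy(replicates, key):
    """Average accuracy across independent replicates of the same question set."""
    return mean(accuracy(r, key) for r in replicates)


def random_guess_replicates(n_questions, options="ABCD", n_replicates=3, seed=0):
    """Baseline that picks an option uniformly at random for every question.

    The four-option format and the seed are assumptions for this sketch.
    """
    rng = random.Random(seed)
    return [
        [rng.choice(options) for _ in range(n_questions)]
        for _ in range(n_replicates)
    ]


# Toy data (hypothetical): six questions, three replicates of one model.
key = ["A", "C", "B", "D", "A", "B"]
model_replicates = [
    ["A", "C", "B", "D", "B", "B"],
    ["A", "C", "D", "D", "A", "B"],
    ["A", "B", "B", "D", "A", "B"],
]

print("model mean accuracy:", mean_accuracy(model_replicates, key))
print("random guess mean accuracy:",
      mean_accuracy(random_guess_replicates(len(key)), key))
```

With a large question set and four answer options, the random guess baseline would be expected to settle near 25%, which is in line with the baseline accuracy reported in the study.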

The five LLMs had mean accuracies ranging from 25.6%-68.7%, compared with 25.2% for the random guess strategy. When compared against humans, only GPT-4 scored higher than the 50th percentile, achieving the 69th and 89th percentiles. Overall LLM performance was positively correlated with performance in specific topics (Pearson’s r = 0.630; p < 0.001). Other than LLaMA 65B, all LLMs performed better on foundational topics (e.g., medical statistics, cancer biology) than on clinical subcategories (p < 0.02). LLMs performed worst on subjects involving breast and gynecologic malignancies. All LLMs produced a confidence score of 3 or 4 in more than 94% of responses. Finally, by combining self-assessed confidence and output consistency, the authors achieved accuracies of 81.7% with Claude-v1 and 81.1% with GPT-4.
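One way to picture how confidence and consistency might be combined is sketched below in Python. The exact rule the authors used is not given in this summary, so the rule here is an assumption: an answer is kept only when all three replicates agree and every confidence score is at least 3. The function names, threshold, and toy data are hypothetical.

```python
def reliable_answer(replicates, min_confidence=3):
    """Return the answer if all replicates agree and every confidence score
    meets the threshold; otherwise return None.

    replicates: list of (answer, confidence) tuples for one question.
    The agreement rule and the threshold are assumptions for this sketch.
    """
    answers = [answer for answer, _ in replicates]
    confidences = [confidence for _, confidence in replicates]
    if len(set(answers)) == 1 and min(confidences) >= min_confidence:
        return answers[0]
    return None


def filtered_accuracy(responses, key):
    """Accuracy on the subset of questions that pass the reliability filter.

    Returns (accuracy, number_of_questions_kept); accuracy is None if
    no question passes the filter.
    """
    kept = 0
    correct = 0
    for question_id, replicates in responses.items():
        answer = reliable_answer(replicates)
        if answer is None:
            continue  # inconsistent or low-confidence output is set aside
        kept += 1
        correct += answer == key[question_id]
    return (correct / kept if kept else None), kept


# Toy data (hypothetical): answer and 1-4 confidence score per replicate.
responses = {
    "q1": [("A", 4), ("A", 4), ("A", 3)],  # consistent and confident
    "q2": [("B", 4), ("C", 4), ("B", 4)],  # inconsistent across replicates
    "q3": [("D", 2), ("D", 4), ("D", 4)],  # one low confidence score
    "q4": [("C", 4), ("C", 4), ("C", 4)],  # consistent and confident, but wrong
}
key = {"q1": "A", "q2": "B", "q3": "D", "q4": "B"}

print(filtered_accuracy(responses, key))
```

Because nearly all responses in the study carried a confidence score of 3 or 4, a filter of this kind would likely lean more on consistency than on confidence. The fourth toy question shows that a confident, consistent answer can still be wrong.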

In conclusion, the authors assessed the ability of five LLMs to answer clinical oncology examination questions. This work highlighted the need for further LLM safety evaluations before routine clinical implementation and offered insight into a potential strategy for using LLM output more reliably.

©2024 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.

Tags: artificial intelligence, large language models, oncology