
Large language models (LLMs) demonstrated higher error rates than humans on a clinical oncology question bank

by Cheng En Xi and Deepti Shroff Karhade
May 28, 2024
in 2 Minute Medicine, Oncology, Tech

1. A comparative evaluation tested five publicly available LLMs on 2044 oncology questions covering a comprehensive range of topics in the field. Model responses were compared to a human benchmark.

2. Only one of the five models performed above the 50th percentile, with worse performance observed in clinical oncology subcategories, particularly female-predominant malignancies.

Evidence Rating Level: 2 (Good)

Study Rundown: Many medical professionals have begun to use large language models (LLMs), such as ChatGPT, as augmented search engines for medical information. LLMs have demonstrated high performance on subspecialty medical examinations across multiple medical specialties, but their utility in clinical oncology applications remains unexplored.

Rydzewski and colleagues compared the performance of five LLMs on a set of multiple-choice questions related to clinical oncology against a random guess algorithm and the performance of radiation oncology trainees. The authors assessed the accuracy of the models, their self-appraised confidence, and the consistency of their responses. For each question, an LLM was prompted to provide an answer, a confidence score, and an explanation of its response; each model was evaluated on 2044 unique questions across three independent replicates.
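For readers who want to see the shape of such an evaluation, here is a minimal sketch of the question-answering loop under stated assumptions: query_model, the prompt wording, and the JSON response format are hypothetical placeholders for illustration, not the authors' published harness.

```python
import json

# Hypothetical stand-in for a model API call; the paper does not publish
# its harness. Replace the body with the real call for each model tested.
def query_model(model: str, prompt: str) -> str:
    return json.dumps({"answer": "A", "confidence": 1, "explanation": "stub"})

# Assumed prompt wording; the authors' exact prompt may differ.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice oncology question.\n"
    "{question}\n"
    "Reply in JSON with keys: answer (a letter choice), confidence "
    "(1 = random guess ... 4 = maximal confidence), and explanation."
)

def evaluate(model: str, questions: list[dict], n_replicates: int = 3) -> list[dict]:
    """Ask each question n_replicates times; record answer, confidence, correctness."""
    records = []
    for q in questions:
        for rep in range(n_replicates):
            raw = query_model(model, PROMPT_TEMPLATE.format(question=q["text"]))
            parsed = json.loads(raw)  # assumes the model replies with valid JSON
            records.append({
                "question_id": q["id"],
                "replicate": rep,
                "answer": parsed["answer"],
                "confidence": parsed["confidence"],
                "correct": parsed["answer"] == q["correct_answer"],
            })
    return records
```

Running every question three times is what makes the consistency check described in the In-Depth section below possible.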

The study found that only one of the five LLMs (GPT-4) scored higher than the 50th percentile when compared to human trainees, despite all models expressing high self-appraised confidence. The remaining LLMs had much lower accuracies, some comparable to the random guess strategy. LLMs scored higher on foundational topics and worse on clinical oncology topics, especially those related to female-predominant malignancies. The authors found that combining model selection, self-appraised confidence, and output consistency helped identify more reliable outputs. Overall, this study demonstrated a need to assess the safety of LLMs before implementing them in clinical settings and highlighted possible training bias in the form of medical misinformation related to female-predominant malignancies.


Click here to read the study in NEJM AI

Click here to read an accompanying editorial in NEJM AI

Relevant Reading: Performance of ChatGPT on a primary FRCA multiple choice question bank

In-Depth [cross-sectional study]: In this study, Rydzewski and colleagues assessed the accuracy of five LLMs (LLaMA, PaLM 2, Claude-v1, GPT-3.5, and GPT-4) on 2044 multiple-choice questions and aimed to identify strategies to help end users identify reliable LLM outputs. The questions were sourced from the American College of Radiology in-training radiation oncology examinations from 2013 to 2017, 2020, and 2021. Each question was repeated across three independent replicates. The authors compared LLM performance with a random guessing strategy and with human scores on the questions sourced from the 2013 and 2014 examinations. The authors also assessed the self-appraised confidence of the LLMs by prompting for a confidence score ranging from 1 to 4, with 1 indicating a random guess and 4 indicating maximal confidence.

The five LLMs had mean accuracies ranging from 25.6% to 68.7%, compared with 25.2% for the random guess strategy. Against the human benchmark, only GPT-4 scored higher than the 50th percentile, achieving the 69th and 89th percentiles. Each model's overall performance was positively correlated with its performance on individual topics (Pearson's r = 0.630; p < 0.001). Other than LLaMA 65B, all LLMs performed better on foundational topics (e.g., medical statistics, cancer biology) than on clinical subcategories (p < 0.02). LLMs performed worst on subjects involving breast and gynecologic malignancies. All LLMs produced a confidence score of 3 or 4 in more than 94% of responses. Finally, by combining self-assessed confidence and output consistency, the authors achieved accuracies of 81.7% and 81.1% for Claude-v1 and GPT-4, respectively.
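As an illustration of that combined strategy, the sketch below (reusing the records produced by the evaluation loop sketched earlier) keeps only questions whose three replicate answers agree and whose reported confidence is high, then scores accuracy on the retained subset. The unanimity rule and the confidence cutoff of 3 are illustrative assumptions; the paper's exact criteria may differ.

```python
from collections import defaultdict

def filter_and_score(records: list[dict], min_confidence: int = 3) -> float:
    """Accuracy restricted to questions with unanimous, high-confidence replicates.

    `records` are assumed to come from the evaluation loop sketched earlier;
    the unanimity rule and the confidence cutoff are illustrative choices.
    """
    by_question = defaultdict(list)
    for r in records:
        by_question[r["question_id"]].append(r)

    kept = correct = 0
    for reps in by_question.values():
        consistent = len({r["answer"] for r in reps}) == 1
        confident = all(r["confidence"] >= min_confidence for r in reps)
        if consistent and confident:
            kept += 1
            correct += reps[0]["correct"]  # replicates agree, so any one works
    return correct / kept if kept else 0.0
```

The trade-off in this kind of filtering is abstention: inconsistent or low-confidence responses are discarded in exchange for higher accuracy on the questions that remain.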

In conclusion, the authors assessed the ability of five LLMs to answer clinical oncology examination questions. This work demonstrated the need for further LLM safety evaluations before routine clinical implementation and provided insight into a potential strategy for using LLM outputs more reliably.

Image: PD

©2024 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.

Tags: artificial intelligence, large language models, oncology