Large language models (LLMs) demonstrated higher error rates compared to humans in a clinical oncology question bank

1. A comparative evaluation tested five publicly available LLMs on 2044 oncology questions, covering comprehensive topics in the field. The responses were compared to a human benchmark.

2. Only one of the five models tested performed above the 50th percentile, with worse performance observed in clinical oncology subcategories and female-predominant malignancies.

Evidence Rating Level: 2 (Good)

Study Rundown: Many medical professionals have begun to use large language models (LLMs), such as ChatGPT, as augmented search engines for medical information. LLMs have demonstrated high performance on subspecialty examinations across multiple medical specialties, but their utility in clinical oncology applications remains unexplored.

Rydzewski and colleagues compared the performance of five LLMs on a set of multiple-choice clinical oncology questions against a random guess algorithm and against radiation oncology trainees. The authors assessed the accuracy of the models, their self-appraised confidence, and the consistency of their responses. For each question, the LLMs were asked to provide an answer, a confidence score, and an explanation of the response. Each LLM was evaluated on 2044 unique questions across three independent replicates.

The study found that only one of the five LLMs (GPT-4) scored higher than the 50th percentile when compared to human trainees, despite all five showing high self-appraised confidence. The remaining LLMs had much lower accuracies, with some performing similarly to the random guess strategy. LLMs scored higher on foundational topics and lower on clinical oncology topics, especially those related to female-predominant malignancies. The authors found that combining model selection, self-appraised confidence, and output consistency helped identify more reliable outputs. Overall, this study demonstrated a need to assess the safety of implementing LLMs in clinical settings and pointed to the presence of training bias, in the form of medical misinformation related to female-predominant malignancies.

Click here to read the study in NEJM AI

Click to read an accompanying editorial in NEJM AI

Relevant Reading: Performance of ChatGPT on a primary FRCA multiple choice question bank

In-Depth [cross-sectional study]: In this study, Rydzewski and colleagues assessed the accuracies of five LLMs (LLaMA, PaLM 2, Claude-v1, GPT-3.5, and GPT-4) on 2044 multiple-choice questions and aimed to identify strategies to help end users identify reliable LLM outputs. The questions were sourced from the American College of Radiology in-training radiation oncology examinations from 2013-2017, 2020, and 2021. Each question was repeated across three independent replicates. The authors compared LLM performance with a random guessing strategy and, for the questions sourced from the 2013 and 2014 examinations, with human scores. The authors also assessed the self-appraised confidence of the LLMs by prompting for a confidence score ranging from 1-4, with 1 indicating a random guess and 4 indicating maximal confidence.

The five LLMs had mean accuracies ranging from 25.6%-68.7%, as compared to 25.2% for the random guess strategy. When compared against humans, only GPT-4 scored higher than the 50th percentile, achieving the 69th and 89th percentiles. The overall performances of LLMs were positively correlated (Pearson’s r = 0.630; p < 0.001) with their performances in a specific topic. Other than LLaMA 65B, all LLMs performed better on foundational topics (e.g., medical statistics, cancer biology) than on clinical subcategories (p < 0.02). LLMs performed the worst on subjects involving breast and gynecologic malignancies. All LLMs produced a confidence score of 3 or 4 in more than 94% of responses. Finally, by combining self-assessed confidence and output consistency, the authors achieved accuracies of 81.7% and 81.1% with Claude-v1 and GPT-4, respectively.
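The idea of combining self-assessed confidence with output consistency can be illustrated with a minimal Python sketch. This is not the authors' procedure: the rule used here (keep a question only when all three replicates give the same answer and every confidence score meets a threshold), the function names, and the toy data are all assumptions made for illustration.

```python
def reliable_answers(responses, min_confidence=4):
    """Keep questions whose replicates all agree and are all confident.

    responses: dict mapping question id -> list of (answer, confidence)
    tuples, one tuple per replicate. The unanimity rule and the default
    threshold are assumptions, not the rule used in the study.
    """
    kept = {}
    for qid, replicates in responses.items():
        answers = {answer for answer, _ in replicates}
        confident = all(conf >= min_confidence for _, conf in replicates)
        if len(answers) == 1 and confident:
            kept[qid] = replicates[0][0]
    return kept


def accuracy(selected, answer_key):
    """Fraction of selected answers that match the key (None if empty)."""
    if not selected:
        return None
    correct = sum(1 for qid, ans in selected.items() if answer_key[qid] == ans)
    return correct / len(selected)


# Hypothetical toy data: three replicates per question, confidence 1-4.
toy_responses = {
    "q1": [("A", 4), ("A", 4), ("A", 4)],  # consistent and confident
    "q2": [("B", 4), ("C", 4), ("B", 3)],  # inconsistent across replicates
    "q3": [("D", 3), ("D", 3), ("D", 3)],  # consistent but below threshold
    "q4": [("C", 4), ("C", 4), ("C", 4)],  # consistent, confident, but wrong
}
toy_key = {"q1": "A", "q2": "B", "q3": "D", "q4": "B"}

toy_selected = reliable_answers(toy_responses)
toy_accuracy = accuracy(toy_selected, toy_key)
```

In this toy example the filter keeps only q1 and q4. Because q4 is answered consistently and confidently yet incorrectly, the sketch also shows that such filtering can raise accuracy on the retained subset without guaranteeing correctness.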

In conclusion, the authors assessed the ability of five LLMs to answer clinical oncology examination questions. This work highlighted the need for further LLM safety evaluations before routine clinical implementation. It also provided insight into a potential strategy for using LLM output more reliably.

Image: PD

©2024 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.
