Machine learning models diagnose celiac disease at similar performance levels to pathologists

1. Jaeckle and colleagues assessed the performance of machine learning models to diagnose celiac disease based on duodenal biopsies.

2. The machine learning models achieved accuracy above 95%, and their inter-rater reliability was similar to that of independent pathologists.

Evidence Rating Level: 2 (Good)

Study Rundown: The gold standard for celiac disease (CD) diagnosis remains duodenal biopsy with pathologist interpretation. However, there is a severe pathologist shortage, and concordance among pathologists for CD diagnosis can be as low as 70%. Jaeckle and colleagues trained five machine learning (ML) models to diagnose CD using duodenal biopsies. The models were trained on a dataset of 3383 whole-slide images (WSIs), and their performance was evaluated by assessing accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The models’ performance was also compared to that of four independent pathologists by examining the diagnostic concordance between the ML models and the pathologists. The study found that, for a test dataset from a previously unseen source, the ML models achieved a mean accuracy of 97.5%, a sensitivity of 95.5%, and a specificity of 97.8%. During cross-validation, the PPV was 91.4% and the NPV was 98.5%. The concordance between the models and the pathologists was also nearly identical to the inter-rater agreement among the pathologists. This study demonstrated the potential for ML models to achieve human-level accuracy for CD biopsy diagnosis.
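For readers less familiar with the five metrics named above, the following minimal Python sketch shows how each is conventionally derived from the counts of true and false positives and negatives. The function name `diagnostic_metrics` and the counts in the example are hypothetical and chosen only for illustration; they are not data from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic test metrics from confusion-matrix counts.

    tp / fp: true / false positives; tn / fn: true / false negatives.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all slides classified correctly
        "sensitivity": tp / (tp + fn),      # share of CD-positive slides detected
        "specificity": tn / (tn + fp),      # share of CD-negative slides cleared
        "ppv": tp / (tp + fp),              # probability a positive call is correct
        "npv": tn / (tn + fn),              # probability a negative call is correct
    }


# Hypothetical counts for illustration only (not taken from the study).
example = diagnostic_metrics(tp=90, fp=10, tn=380, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, PPV and NPV depend on how common CD is in the dataset being evaluated, so they should be read alongside the dataset's prevalence.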

Click here to read the study in NEJM AI

Relevant Reading: Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy

In-Depth [retrospective cohort]: Jaeckle and colleagues trained five ML models with 3383 WSIs of hematoxylin- and eosin-stained duodenal biopsies and their corresponding diagnoses from four hospitals. The WSIs were pre-processed, and Macenko’s method for stain normalization was used to improve generalization performance. All five models were run independently, and their outputs were averaged. A celiac-positive diagnosis was made if the mean output exceeded a specified diagnostic threshold. The following performance metrics were evaluated: accuracy, sensitivity, specificity, PPV, and NPV. Model performance was also compared to that of four independent pathologists by calculating the mean agreement and Cohen’s kappa coefficient. When the models were evaluated during cross-validation, the mean accuracy was 96.8%, the sensitivity was 95.4%, the specificity was 97.2%, the PPV was 91.4%, and the NPV was 98.5%. Additionally, the models were assessed using a new dataset from a new hospital, for which they achieved an accuracy of 97.5%, a sensitivity of 95.5%, and a specificity of 97.8%. Finally, the mean agreement between the ML models and the independent pathologists was 0.905±0.050, with a corresponding kappa coefficient of 0.813±0.095. These values were nearly identical to the concordance among the four pathologists: 0.903±0.066, with a kappa coefficient of 0.810±0.127. The main limitation of the study was the subjective nature of CD diagnosis, which affects the accuracy of the ground truth used to train the models. Overall, the authors concluded that the ML models achieved pathologist-level performance in diagnosing CD.
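The ensemble decision rule and the agreement statistic described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the threshold of 0.5, and all scores and labels below are assumptions made for the example, since the summary does not report the threshold the study used.

```python
from statistics import mean


def ensemble_diagnosis(model_outputs, threshold=0.5):
    """Average the per-model scores; call the slide CD-positive if the mean
    exceeds the threshold. The 0.5 default is an assumed placeholder."""
    return mean(model_outputs) > threshold


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical scores from five models for one slide.
print(ensemble_diagnosis([0.9, 0.8, 0.7, 0.6, 0.4]))

# Hypothetical labels (1 = CD, 0 = not CD) from a model and one pathologist.
model_calls = [1, 1, 0, 0, 1, 0, 0, 0]
pathologist_calls = [1, 0, 0, 0, 1, 0, 0, 1]
print(round(cohens_kappa(model_calls, pathologist_calls), 3))
```

Kappa is reported alongside raw agreement because two raters who both label most slides negative will agree often by chance alone; kappa subtracts that expected agreement before scaling.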

©2025 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.
