1. Jaeckle and colleagues assessed the performance of machine learning models to diagnose celiac disease based on duodenal biopsies.
2. The machine learning models achieved accuracy above 95%, and their agreement with independent pathologists was similar to the agreement among the pathologists themselves.
Evidence Rating Level: 2 (Good)
Study Rundown: The gold standard for celiac disease (CD) diagnosis remains duodenal biopsy with pathologist interpretation. However, there is a severe pathologist shortage, and concordance among pathologists for CD diagnosis can be as low as 70%. Jaeckle and colleagues trained five machine learning (ML) models to diagnose CD using duodenal biopsies. The models were trained on a dataset of 3383 whole-slide images (WSIs), and their performance was evaluated by assessing accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The models’ performance was also compared with that of four independent pathologists by examining the diagnostic concordance between the ML models and the pathologists. On a test dataset from a previously unseen source, the ML models achieved a mean accuracy of 97.5%, a sensitivity of 95.5%, and a specificity of 97.8%. During cross-validation, the PPV was 91.4% and the NPV was 98.5%. The concordance between the models and the pathologists was also nearly identical to the inter-rater agreement among the pathologists. This study demonstrated the potential for ML models to achieve human-level accuracy for CD biopsy diagnosis.
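For readers less familiar with the five metrics named above, the following minimal Python sketch shows how each is derived from the counts of true and false positives and negatives. The `ConfusionCounts` class, the `diagnostic_metrics` function, and the example counts are hypothetical illustrations and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of biopsy classifications against the reference diagnosis."""
    tp: int  # celiac-positive biopsies called positive
    fp: int  # celiac-negative biopsies called positive
    tn: int  # celiac-negative biopsies called negative
    fn: int  # celiac-positive biopsies called negative


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, sensitivity, specificity, PPV, and NPV as fractions."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # share of true cases detected
        "specificity": c.tn / (c.tn + c.fp),  # share of non-cases cleared
        "ppv": c.tp / (c.tp + c.fp),          # trust in a positive call
        "npv": c.tn / (c.tn + c.fn),          # trust in a negative call
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
example = ConfusionCounts(tp=90, fp=10, tn=190, fn=10)
metrics = diagnostic_metrics(example)
```

Unlike sensitivity and specificity, PPV and NPV depend on how common CD is in the dataset being evaluated, so they may shift when the same model is applied to a population with a different prevalence.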
Click here to read the study in NEJM AI
Relevant Reading: Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy
In-Depth [retrospective cohort]: Jaeckle and colleagues trained five ML models with 3383 WSIs of hematoxylin- and eosin-stained duodenal biopsies and their corresponding diagnoses from four hospitals. The WSIs were pre-processed, and Macenko’s method for stain normalization was used to improve generalization performance. All five models were run independently, and their outputs were averaged. A celiac-positive diagnosis was made if the mean output exceeded a specified diagnostic threshold. The following performance metrics were evaluated: accuracy, sensitivity, specificity, PPV, and NPV. Model performance was also compared with that of four independent pathologists by calculating the mean agreement and Cohen’s kappa coefficient. During cross-validation, the mean accuracy was 96.8%, the sensitivity was 95.4%, the specificity was 97.2%, the PPV was 91.4%, and the NPV was 98.5%. Additionally, the models were assessed using a new dataset from a previously unseen hospital, for which they achieved an accuracy of 97.5%, a sensitivity of 95.5%, and a specificity of 97.8%. Finally, the mean agreement between the ML models and the independent pathologists was 0.905±0.050, with a corresponding kappa coefficient of 0.813±0.095. These values were nearly identical to the concordance among the four pathologists: 0.903±0.066, with a kappa coefficient of 0.810±0.127. The main limitation of the study was the subjective nature of CD diagnosis, which affects the accuracy of the ground truth used to train the models. Overall, the authors concluded that the ML models achieved pathologist-level performance in diagnosing CD.
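The two computational steps described above, averaging the five model outputs against a threshold and measuring agreement with Cohen’s kappa, can be sketched as follows. This is an illustration under stated assumptions rather than the authors’ implementation: the function names, the 0.5 threshold, and all scores and labels are hypothetical, and the study’s actual threshold is not given in this summary.

```python
from statistics import mean


def ensemble_diagnosis(model_outputs, threshold=0.5):
    """Call a biopsy celiac-positive if the mean model output exceeds the threshold.

    The 0.5 default is an assumption for illustration only.
    """
    return mean(model_outputs) > threshold


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical outputs from five models for two biopsies.
positive_call = ensemble_diagnosis([0.9, 0.8, 0.7, 0.6, 0.4])
negative_call = ensemble_diagnosis([0.2, 0.3, 0.6, 0.1, 0.4])

# Hypothetical diagnoses (1 = celiac, 0 = not celiac) on ten biopsies.
model_labels = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
pathologist_labels = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(model_labels, pathologist_labels)
```

In this toy example the two raters agree on 8 of 10 biopsies (raw agreement 0.80), yet kappa is lower because a portion of that agreement would be expected by chance alone. The same distinction explains why the study reports both a mean agreement and a kappa coefficient for each comparison.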
©2025 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.