1. Liang and colleagues retrospectively compared comments on scientific papers generated by Generative Pretrained Transformer 4 (GPT-4) with those of human peer reviewers.
2. Substantial overlap was found between the comments made by GPT-4 and those made by human reviewers, and surveyed researchers found the GPT-4-generated feedback beneficial.
Evidence Rating Level: 2 (Good)
Study Rundown: Effective feedback from peer reviewers is crucial for rigorous scientific research, but peer review is a time- and resource-intensive process. Large language models (LLMs) have the potential to automate feedback generation. Liang and colleagues developed a GPT-4-based scientific feedback generator and retrospectively evaluated its feedback against that of human reviewers. Research papers and their reviewer comments were collected, and the LLM generated structured feedback from the PDFs of the papers. Extractive text summarization was applied to the LLM- and human-generated feedback, and semantic text matching was used to identify overlaps. Additionally, the authors surveyed 308 researchers who had used LLM-generated feedback to evaluate its utility. For papers from the Nature family of journals, the study found that more than half of the comments made by GPT-4 were also made by at least one human reviewer. In the survey, 50.3% of researchers who had used the LLM found its feedback helpful, and 20.1% considered it similarly helpful to human feedback. This study demonstrated the potential for LLMs to provide useful and timely comments to researchers when human expert feedback is unavailable.
Click here to read the study in NEJM AI
Relevant Reading: Leveraging artificial intelligence to enhance systematic reviews in health research: advanced tools and challenges
In-Depth [retrospective cohort]: Two datasets were assembled: one consisting of 3096 scientific papers and 8745 reviewer comments from 15 Nature family journals, and another consisting of 1709 papers and 6505 comments from the International Conference on Learning Representations (ICLR). The research papers were given to the LLM to generate structured feedback, and both the LLM feedback and the human reviewer comments underwent extractive text summarization followed by semantic text matching. Further, 308 researchers from 110 institutions who had received LLM-generated feedback on their papers were asked to evaluate the LLM's utility and performance. For the Nature dataset, 57.55% of LLM-generated comments overlapped with those of at least one human reviewer, and 30.85% overlapped with those of an individual reviewer, similar to the degree of overlap between two human reviewers (28.58%). For the ICLR dataset, 77.18% of LLM-generated comments overlapped with those of at least one human reviewer, and the overlap between the LLM and individual reviewers was again similar to that between two human reviewers. Additionally, the LLM commented on research implications 7.27 times more frequently than humans did, while focusing less on a study's novelty. In the prospective user survey, 50.3% of respondents found the feedback helpful and 7.1% found it very helpful; 50.5% were willing to reuse the system, and respondents were optimistic about its continued use. The study's limitations included a lack of fine-tuning of the GPT-4 model and the restriction to English-language papers. In summary, this study provided promising evidence that LLMs can generate useful feedback on research papers.
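The overlap metrics above come from matching each LLM comment against each human comment and counting an LLM comment as "overlapping" when at least one human comment is semantically similar. The study's actual pipeline used extractive summarization and a more sophisticated semantic matcher; the sketch below is only a minimal illustration of the matching-and-counting step, using a bag-of-words cosine similarity as a stand-in for true semantic matching. The function names, the example comments, and the 0.5 threshold are all illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real NLP preprocessing."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def overlap_fraction(llm_comments, human_comments, threshold=0.5):
    """Fraction of LLM comments matched by at least one human comment."""
    human_vecs = [Counter(tokenize(c)) for c in human_comments]
    matched = 0
    for comment in llm_comments:
        vec = Counter(tokenize(comment))
        if any(cosine_similarity(vec, h) >= threshold for h in human_vecs):
            matched += 1
    return matched / len(llm_comments) if llm_comments else 0.0

# Toy example: the first LLM comment paraphrases a human comment, the second does not.
llm = ["The sample size is small and limits generalizability.",
       "The discussion should address clinical implications."]
human = ["Generalizability is limited by the small sample size.",
         "Figure 2 axis labels are unclear."]
print(overlap_fraction(llm, human))  # → 0.5
```

In the study, this kind of per-comment matching is what makes the LLM-vs-human overlap directly comparable to the human-vs-human overlap: the same procedure is run with one human reviewer's comments substituted for the LLM's.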
Image: PD
©2025 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.