1. Liang and colleagues retrospectively compared comments on scientific papers generated by Generative Pretrained Transformer 4 (GPT-4) with those of human peer reviewers.
2. Substantial overlap was found between the comments made by GPT-4 and those made by human reviewers, and surveyed researchers found the GPT-4-generated feedback beneficial.
Evidence Rating Level: 2 (Good)
Study Rundown: Effective feedback from peer reviewers is crucial for rigorous scientific research, but peer review is a time- and resource-intensive process. Large language models (LLMs) have the potential to automate feedback generation. Liang and colleagues developed a GPT-4-based scientific feedback generator and retrospectively evaluated its feedback against that of human reviewers. Research papers and their reviewer comments were collected, and the LLM generated structured feedback from the PDFs of the papers. Extractive text summarization was applied to the LLM- and human-generated feedback, and semantic text matching was used to identify overlaps. Additionally, the authors surveyed 308 researchers who had used LLM-generated feedback to evaluate its utility. For papers from the Nature family of journals, the study found that more than half of the comments made by GPT-4 were also made by at least one human reviewer. In the survey, 50.3% of researchers who had used the LLM found its feedback helpful, and 20.1% considered it similarly helpful to human feedback. This study demonstrated the potential for LLMs to provide useful and timely comments to researchers when human expert feedback is unavailable.
Click here to read the study in NEJM AI
Relevant Reading: Leveraging artificial intelligence to enhance systematic reviews in health research: advanced tools and challenges
In-Depth [retrospective cohort]: Two datasets were assembled: one consisting of 3096 scientific papers and 8745 reviewer comments from 15 Nature family journals, and another consisting of 1709 papers and 6505 comments from the International Conference on Learning Representations (ICLR). The research papers were given to the LLM to generate structured feedback, and both the LLM feedback and the human reviewer comments underwent extractive text summarization followed by semantic text matching. Further, 308 researchers from 110 institutions who had received LLM-generated feedback on their papers were asked to evaluate the LLM's utility and performance. For the Nature dataset, 57.55% of LLM-generated comments overlapped with those of at least one human reviewer, and 30.85% overlapped with those of an individual reviewer, similar to the degree of overlap between two human reviewers (28.58%). For the ICLR dataset, 77.18% of LLM-generated comments overlapped with those of at least one human reviewer, and the overlap between the LLM and individual reviewers was again similar to that between two human reviewers. Additionally, the LLM commented on research implications 7.27 times more frequently than humans did, while focusing less on a study's novelty. In the prospective user survey, 50.3% of respondents found the feedback helpful and 7.1% found it very helpful; 50.5% were willing to reuse the system, and respondents were optimistic about its continued use. The study's limitations included a lack of fine-tuning of the GPT-4 model and the restriction to English-language papers. In summary, this study provided promising evidence that LLMs can generate useful feedback on research papers.
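The overlap metrics above come from matching each LLM comment against each human comment and counting an LLM comment as "overlapping" when at least one human comment is semantically similar. The study's actual pipeline used extractive summarization and a more sophisticated semantic matcher; the sketch below is only a minimal illustration of the matching-and-counting step, using a bag-of-words cosine similarity as a stand-in for true semantic matching. The function names, the example comments, and the 0.5 threshold are all illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real NLP preprocessing."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def overlap_fraction(llm_comments, human_comments, threshold=0.5):
    """Fraction of LLM comments matched by at least one human comment."""
    human_vecs = [Counter(tokenize(c)) for c in human_comments]
    matched = 0
    for comment in llm_comments:
        vec = Counter(tokenize(comment))
        if any(cosine_similarity(vec, h) >= threshold for h in human_vecs):
            matched += 1
    return matched / len(llm_comments) if llm_comments else 0.0

# Toy example: the first LLM comment paraphrases a human comment, the second does not.
llm = ["The sample size is small and limits generalizability.",
       "The discussion should address clinical implications."]
human = ["Generalizability is limited by the small sample size.",
         "Figure 2 axis labels are unclear."]
print(overlap_fraction(llm, human))  # → 0.5
```

In the study, this kind of per-comment matching is what makes the LLM-vs-human overlap directly comparable to the human-vs-human overlap: the same procedure is run with one human reviewer's comments substituted for the LLM's.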
Image: PD
©2025 2 Minute Medicine, Inc. All rights reserved. No works may be reproduced without expressed written consent from 2 Minute Medicine, Inc. Inquire about licensing here. No article should be construed as medical advice and is not intended as such by the authors or by 2 Minute Medicine, Inc.