Arq. Bras. Oftalmol. 2026; 89 (4): 10.5935/0004-2749.2025-0283
Mauro Gobira1,2; Rodrigo Moreira1; Flavio J. L. Galhardo Carvalho Filho1; Kevin Waquim Pessoa Carvalho1; Francisco N. Murta1; Lucas Antônio Avelar Carvalho1; Rubens Belfort Jr.1,2; Ivan M. Tavares2
DOI: 10.5935/0004-2749.2025-0283
ABSTRACT
PURPOSE: To assess the performance of a contemporary large language model (ChatGPT-5) against ophthalmology residents on a standardized set of glaucoma multiple-choice questions.
METHODS: We conducted a cross-sectional comparative study using 189 text-only glaucoma multiple-choice questions from the Cybersight question bank. ChatGPT-5 was tested under standardized conditions: each item was submitted in a new chat, and responses were restricted to a single answer letter. Six ophthalmology residents from a Brazilian training program (two Postgraduate Year 1, two Postgraduate Year 2, and two Postgraduate Year 3) answered the same questions under supervision. Accuracy was scored against the official answer key. McNemar’s exact test was used to compare paired item-level responses between ChatGPT-5 and each resident, and matched odds ratios with 95% confidence intervals (95% CIs) were calculated using the Haldane–Anscombe correction.
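The paired comparison described in the methods can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the two-sided exact (binomial) form of McNemar's test on the discordant pairs, a matched odds ratio of (b + 0.5)/(c + 0.5), and a Wald interval on the log scale. The function names are hypothetical, and the discordant counts in the usage line (41 and 22) are not reported in the abstract; they are illustrative values chosen because they are consistent with the smallest reported odds ratio.

```python
from math import comb, exp, log, sqrt
from statistics import NormalDist


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items the model answered correctly and the resident missed,
    c = items the model missed and the resident answered correctly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


def matched_or_ci(b: int, c: int, level: float = 0.95):
    """Matched odds ratio with Haldane-Anscombe correction (assumed form:
    0.5 added to each discordant cell) and a log-scale Wald interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    b_adj, c_adj = b + 0.5, c + 0.5
    odds_ratio = b_adj / c_adj
    se = sqrt(1 / b_adj + 1 / c_adj)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Illustrative discordant counts (assumed, not taken from the abstract).
print(matched_or_ci(41, 22), mcnemar_exact_p(41, 22))
```

With these illustrative counts, the sketch gives an odds ratio of about 1.84 with an interval of roughly 1.10 to 3.08. Only the discordant pairs enter the calculation; items that both the model and the resident answered correctly (or both missed) carry no information about the difference between them.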
RESULTS: ChatGPT-5 answered 164 of 189 items correctly (86.8%; 95% CI, 81.2–90.9). Residents’ overall accuracy was 62.9% (713/1,134; 95% CI, 60.0–65.6). The top-performing resident achieved 76.7% accuracy. ChatGPT-5 outperformed every resident in head-to-head comparisons, with odds ratios ranging from 1.84 (95% CI, 1.10–3.08) to 13.15 (95% CI, 5.93–29.20), all p≤0.023. On 17/189 items (9.0%), ChatGPT-5 answered correctly while fewer than half of the residents did (“large language model-only wins”), whereas residents were more successful on the items that ChatGPT-5 answered incorrectly.
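The abstract does not state which method produced the accuracy intervals. As a hedged check, the sketch below assumes a Wilson score interval, which appears to match both reported intervals when applied to 164/189 and 713/1,134. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, level: float = 0.95):
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its interval method; Wilson is
    used here only because it is consistent with the reported bounds.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts as reported in the abstract.
for label, k, n in [("ChatGPT-5", 164, 189), ("Residents (pooled)", 713, 1134)]:
    lower, upper = wilson_interval(k, n)
    print(f"{label}: {k / n:.1%} ({lower:.1%} to {upper:.1%})")
```

The pooled resident interval treats all 1,134 responses (six residents × 189 items) as independent, which ignores clustering of answers within each resident and within each item; that is a simplification of this sketch, not a claim about the authors' analysis.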
CONCLUSIONS: ChatGPT-5 outperformed ophthalmology residents on text-based glaucoma multiple-choice questions, indicating its potential as a subspecialty education and assessment tool. Generalizability is limited by the single question bank, text-only items, a small resident cohort, and the evaluation of one large language model version at a single time point. Before incorporating these findings into clinical decision-making, larger, multimodal, and longitudinal studies are required.
Keywords: Glaucoma; Artificial intelligence; Large language models; Education, medical; Medical staff, hospital