Assessing a large language model for glaucoma knowledge: ChatGPT-5 versus residents

Mauro Gobira¹,²; Rodrigo Moreira¹; Flavio J. L. Galhardo Carvalho Filho¹; Kevin Waquim Pessoa Carvalho¹; Francisco N. Murta¹; Lucas Antônio Avelar Carvalho¹; Rubens Belfort Jr.¹,²; Ivan M. Tavares²

DOI: 10.5935/0004-2749.2025-0283

ABSTRACT

PURPOSE: To assess the performance of a contemporary large language model (ChatGPT-5) against ophthalmology residents on a standardized set of glaucoma multiple-choice questions.
METHODS: We conducted a cross-sectional comparative study with 189 text-only glaucoma multiple-choice questions from the Cybersight question bank. ChatGPT-5 was tested under standardized conditions, with each item submitted in a new chat and responses limited to a single answer letter. Six ophthalmology residents from a Brazilian training program (two Postgraduate Year 1, two Postgraduate Year 2, and two Postgraduate Year 3) answered the same questions under supervision. Accuracy was calculated against the official answer key. McNemar’s exact test was used for paired, item-level comparisons between ChatGPT-5 and residents, and matched odds ratios with 95% confidence intervals (95% CIs) were calculated using the Haldane–Anscombe correction.
RESULTS: ChatGPT-5 answered 164 of 189 items correctly (86.8%; 95% CI, 81.2–90.9). Residents’ overall accuracy was 62.9% (713/1,134; 95% CI, 60.0–65.6), and the top-performing resident scored 76.7%. ChatGPT-5 outperformed every resident in head-to-head comparisons, with odds ratios ranging from 1.84 (95% CI, 1.10–3.08) to 13.15 (95% CI, 5.93–29.20), all p≤0.023. On 17/189 items (9.0%), ChatGPT-5 answered correctly while fewer than half of the residents did (“large language model-only wins”), whereas residents were more successful on items that ChatGPT-5 missed.
CONCLUSIONS: ChatGPT-5 outperformed ophthalmology residents on text-based glaucoma multiple-choice questions, indicating its potential as a subspecialty education and assessment tool. Generalizability is limited by the single question bank, text-only items, a small resident cohort, and the evaluation of one large language model version at a single time point. Before incorporating these findings into clinical decision-making, larger, multimodal, and longitudinal studies are required.
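The statistics named in the abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' analysis code. Two points are assumptions: the abstract does not state which method produced the accuracy intervals (a Wilson score interval is used here because it reproduces the reported 81.2–90.9 and 60.0–65.6), and the discordant counts 41 and 22 in the usage lines are hypothetical values, not reported in the abstract, chosen only because they are consistent with the reported odds ratio of 1.84 (95% CI, 1.10–3.08). The function names are likewise illustrative.

```python
from math import comb, exp, log, sqrt

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wilson_ci(successes, n, z=Z95):
    """Wilson score interval for a proportion (assumed CI method, see lead-in)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items the model got right and the resident got wrong
    c: items the resident got right and the model got wrong
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def matched_odds_ratio(b, c, z=Z95):
    """Matched-pairs odds ratio with 0.5 added to each discordant cell
    (Haldane-Anscombe correction) and a Wald interval on the log scale."""
    b_adj, c_adj = b + 0.5, c + 0.5
    odds_ratio = b_adj / c_adj
    se = sqrt(1 / b_adj + 1 / c_adj)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Accuracy intervals from the counts reported in the abstract.
model_ci = wilson_ci(164, 189)
resident_ci = wilson_ci(713, 1134)

# Hypothetical discordant counts (NOT reported in the abstract).
p_value = mcnemar_exact(41, 22)
or_est, or_lo, or_hi = matched_odds_ratio(41, 22)

print([round(100 * x, 1) for x in model_ci])
print([round(100 * x, 1) for x in resident_ci])
print(round(or_est, 2), round(or_lo, 2), round(or_hi, 2), p_value)
```

Only the discordant pairs enter McNemar's test and the matched odds ratio; items that both the model and a resident answered correctly (or both answered incorrectly) carry no information about which of the two performs better.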

Keywords: Glaucoma; Artificial intelligence; Large language models; Education, medical; Medical staff, hospital

How to cite this article:

Gobira M, Moreira R, Carvalho Filho FJLG, Carvalho KWP, Murta FN, et al. Assessing a large language model for glaucoma knowledge: ChatGPT-5 versus residents. Arq. Bras. Oftalmol. 2026;89(4):e2025-0283:1-6. DOI: 10.5935/0004-2749.2025-0283
