AI Chatbots Equal Ophthalmologists in Glaucoma, Retinal Disease Management

A large language model demonstrated diagnostic accuracy and completeness in glaucoma and retina disease comparable to that of fellowship-trained ophthalmologists.

Artificial intelligence (AI) models can match or exceed fellowship-trained ophthalmologists in the diagnosis and management of glaucoma and retina disease, according to new research.¹

In the comparative cross-sectional study, GPT-4, a large language model (LLM) AI system, exhibited diagnostic accuracy and completeness on both clinical questions and clinical cases comparable to that of 12 attending specialists and 3 senior trainees in ophthalmology.

“The performance of GPT-4 in our study was quite eye-opening,” said lead study author Andy Huang, MD, an ophthalmology resident at the New York Eye and Ear Infirmary of Mount Sinai.² “We recognized the enormous potential of this AI system from the moment we started testing it and were fascinated to observe that GPT-4 could not only assist but in some cases, match or exceed, the expertise of seasoned ophthalmic specialists.”

LLMs have been increasingly integrated into medical decision-making and patient education, suggesting the potential for AI use in ophthalmology. Recent evidence has supported the consistent performance of LLM chatbots in providing answers comparable to those of ophthalmologists for a range of patient eye care questions, as well as their strong performance on an ophthalmic knowledge assessment.³

However, Huang and colleagues indicated that a broader evaluation of an LLM’s accuracy compared with trained professionals is needed to address real-life clinical situations.¹ To explore that real-life potential, the investigative team compared GPT-4’s responses with those of fellowship-trained glaucoma and retinal specialists on ophthalmic questions and patient case management.

In the single-center, comparative cross-sectional study, investigators recruited 12 attending physicians (8 in glaucoma and 4 in retina) and 3 ophthalmology trainees from eye clinics associated with the investigative team’s institution. Glaucoma and retina questions (10 of each) were randomly selected from the American Academy of Ophthalmology (AAO) Commonly Asked Questions. Deidentified glaucoma and retinal cases (10 of each) were randomly selected from ophthalmology patients seen at the affiliated clinics.

The LLM’s role was defined as a medical assistant providing concise answers that emulate an ophthalmologist’s response. Answers were rated for medical accuracy and completeness on a 10-point Likert scale, with the lowest scores representing very poor accuracy. Data were collected from June to August 2023.

Upon analysis, the combined question-case mean rank for accuracy was 506.2 for the LLM chatbot and 403.4 for glaucoma specialists (n = 831; Mann-Whitney U = 27976.5; P <.001). The mean rank for completeness was 528.3 and 398.7 for these groups, respectively (n = 828; Mann-Whitney U = 25218.5; P <.001).

Meanwhile, the mean rank for accuracy was 235.3 for the LLM chatbot and 216.1 for retina specialists (n = 440; Mann-Whitney U = 15518.0; P = .17) and the mean rank for completeness was 258.3 and 208.7 in these groups, respectively (n = 439; Mann-Whitney U = 13123.5; P = .005).
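For readers unfamiliar with the statistics reported above, the following is a minimal sketch of how mean ranks and a Mann-Whitney U statistic can be derived from two sets of Likert ratings. It is not the investigators' code. The ratings are hypothetical, the function names are illustrative, and reporting the smaller of the two U values is one common convention that the study may or may not have followed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mean_ranks_and_u(group_a, group_b):
    """Return (mean rank of A, mean rank of B, smaller Mann-Whitney U)."""
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    rank_sum_b = sum(ranks[n_a:])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return rank_sum_a / n_a, rank_sum_b / n_b, min(u_a, u_b)


# Hypothetical 10-point Likert ratings, NOT data from the study.
chatbot_ratings = [9, 8, 10, 9, 7]
specialist_ratings = [7, 8, 6, 9, 7, 5]

mean_a, mean_b, u = mean_ranks_and_u(chatbot_ratings, specialist_ratings)
print(f"mean rank (chatbot) = {mean_a:.2f}")
print(f"mean rank (specialists) = {mean_b:.2f}")
print(f"U = {u}")
```

A higher mean rank indicates that a group's ratings tended to sit higher in the pooled ordering of all ratings, which is how the chatbot and specialist figures above can be read.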

The analysis identified differences between specialists and trainees in both accuracy Likert scoring (n = 1271; Kruskal-Wallis H = 44.36; P <.001) and completeness Likert scoring (n = 1268; Kruskal-Wallis H = 88.27; P <.001). After performing the Dunn test, investigators identified a significant difference in all pairwise comparisons, aside from specialist versus trainee in rating chatbot completeness.
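The Kruskal-Wallis H statistic extends the same rank-based idea to more than two groups. The sketch below computes H with the standard correction for tied ratings. It is illustrative only: the function name and the three rating groups are hypothetical, and the article does not specify how the study's comparison groups were constructed.

```python
from collections import Counter


def kruskal_wallis_h(*groups):
    """Tie-corrected Kruskal-Wallis H for two or more groups of ratings."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    counts = Counter(pooled)
    rank_of = {}
    position = 0
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = position + (ties + 1) / 2
        position += ties

    # Sum of (rank sum squared / group size) over all groups.
    weighted = 0.0
    for g in groups:
        rank_sum = sum(rank_of[x] for x in g)
        weighted += rank_sum ** 2 / len(g)

    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

    # Ties shrink the variance of the ranks, so H is divided by this factor.
    tie_correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n_total ** 3 - n_total)
    return h / tie_correction


# Hypothetical 10-point Likert ratings for three rater groups, NOT study data.
group_1 = [9, 8, 10, 9, 7]
group_2 = [7, 8, 6, 9, 7]
group_3 = [6, 5, 7, 6, 8]
print(f"H = {kruskal_wallis_h(group_1, group_2, group_3):.2f}")
```

A large H indicates that at least one group's ratings differ in rank from the others. A post hoc procedure such as the Dunn test is then used to determine which pairs differ.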

Overall, the pairwise comparisons revealed that both trainees and specialists rated the chatbot’s accuracy and completeness higher than those of their specialist counterparts, with the specialists indicating a significant difference in the chatbot’s accuracy (z = 3.23; P = .007) and completeness (z = 5.86; P <.001).

Huang and colleagues noted the enhanced performance of the chatbot could be attributable to the prompting techniques used in the analysis, particularly instructing the LLM to act as a clinician in an ophthalmology note format.

They pointed to the need for further testing but shared their belief that these data support the possibility of AI tools as both diagnostic and therapeutic adjuncts in ophthalmology.

“It could serve as a reliable assistant to eye specialists by providing diagnostic support and potentially easing their workload, especially in complex cases or areas of high patient volume,” Huang said.² “For patients, the integration of AI into mainstream ophthalmic practice could result in quicker access to expert advice, coupled with more informed decision-making to guide their treatment.”

References

1. Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a Large Language Model’s Responses to Questions and Cases About Glaucoma and Retina Management. JAMA Ophthalmol. Published online February 22, 2024. doi:10.1001/jamaophthalmol.2023.6917
2. Icahn Mount Sinai. Artificial intelligence matches or outperforms human specialists in retina and glaucoma management, Mount Sinai Study finds. EurekAlert! February 22, 2024. Accessed February 23, 2024. https://www.eurekalert.org/news-releases/1034711
3. Bernstein IA, Zhang Y, Govil D, et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw Open. 2023;6(8):e2330320. doi:10.1001/jamanetworkopen.2023.30320