Modern large language models perform strongly on ophthalmic knowledge assessments and approach or exceed the scores of fully qualified and trainee ophthalmologists.
A recent head-to-head cross-sectional study found that state-of-the-art large language models (LLMs), including GPT-4, are nearing expert-level clinical knowledge and reasoning in the field of ophthalmology.1
In the analysis, GPT-4 compared well with expert ophthalmologists and specialty trainees, achieving a pass-worthy performance on a mock ophthalmological examination. However, as top-performing doctors remained superior on the knowledge examination, LLMs might be most beneficial for providing eye-related advice or management suggestions.
“We could realistically deploy AI in triaging patients with eye issues to decide which cases are emergencies that need to be seen by a specialist immediately, which can be seen by a general practitioner, and which don’t need treatment,” lead study author Arun J Thirunavukarasu, MB BChir, University of Cambridge School of Clinical Medicine, explained in a statement.2 “The models could follow clear algorithms already in use, and we’ve found GPT-4 is as good as expert clinicians at processing eye symptoms and signs to answer more complicated questions.”
LLMs have proven revolutionary in natural language processing, and their introduction into medicine is the center of significant discussion around their utility and ethical concerns. In eye care, ChatGPT showed superior performance on eye-related questions in an examination for general practitioners and improved significantly on ophthalmic knowledge assessments after GPT-3.5 was succeeded by GPT-4.3,4
However, Thirunavukarasu and colleagues suggested these analyses are affected by the potential for “contamination,” or inflated performance driven by recall of previously encountered text rather than by clinical reasoning.1 In addition, examination performance alone does little to indicate the potential of LLMs to improve clinical practice as a medical assistance tool.
For this analysis, investigators used the United Kingdom’s FRCOphth Part 2 examination to evaluate the capability of state-of-the-art LLMs, with fully qualified ophthalmologists and ophthalmologists in training serving as a robust clinical benchmark, rather than raw examination scores. As these questions were not fully available online, the risk of contamination was minimized.
Each eligible question was entered into ChatGPT (GPT-3.5 and GPT-4 versions) between April and May 2023. If the LLM did not provide a definitive answer, the question was re-trialed up to three times and regarded as “null” if no answer was provided. After their release, the PaLM 2 and LLaMA models were trialed on the 90-question mock examination between June and July 2023.
To evaluate the performance and accuracy of the LLM outputs, five expert ophthalmologists who had passed the examination, three trainees currently in residency, and two unspecialized junior doctors independently answered the 90-question mock examination without reference to textbooks, the internet, or the LLMs’ recorded answers.
Ultimately, 347 of 360 questions from the textbook were used for analysis, including 87 of the 90 questions from the mock examination chapter. Overall performance across the 347 questions was significantly higher for GPT-4 than for GPT-3.5 (61.7% vs. 48.41%; P <.01).
Performance in the mock examination revealed GPT-4 compared well with other LLMs, junior and trainee doctors, and ophthalmology experts. GPT-4 was the top-scoring model (69%), performing higher than GPT-3.5 (48%) and LLaMA (32%) and, despite its superior score, statistically similar to PaLM 2 (56%).
Moreover, the performance of GPT-4 was statistically similar to the mean score achieved by expert ophthalmologists (P = .28). Overall, GPT-4 compared favorably with expert ophthalmologists (median, 76%; range, 64–90%), ophthalmology trainees (median, 59%; range, 57–63%), and unspecialized junior doctors (median, 43%; range, 41–44%).
Low agreement between LLMs and doctors indicated idiosyncratic differences in knowledge and reasoning, while performance remained consistent across question subjects and types (P >.05). This suggests LLM knowledge and reasoning ability may be general across ophthalmology, rather than strong in a particular subspecialty or question type.
The LLM examination performance translated into subjective preference, as all ophthalmologists preferred responses from GPT-4. In particular, they rated the accuracy and relevance of GPT-4’s answers higher than those of GPT-3.5 (P <.05).
Thirunavukarasu and colleagues noted the remarkable performance of GPT-4 in ophthalmology may point to its usefulness in clinical contexts, particularly in areas without access to ophthalmologists. Its ophthalmic knowledge and reasoning ability are likely superior to those of non-specialist doctors and could assist clinicians in their day-to-day work.
“Even taking the future use of AI into account, I think doctors will continue to be in charge of patient care,” Thirunavukarasu said.2 “The most important thing is to empower patients to decide whether they want computer systems to be involved or not. That will be an individual decision for each patient to make.”
References