Across 40 clinical scenarios, ChatGPT failed to provide a comprehensive response in approximately 50% of clinical questions, and nearly 30% of the sources it cited were hallucinated.
ChatGPT correctly answered more than 80% of complex open-ended vitreoretinal clinical scenarios but demonstrated a reduced capability to offer comprehensive responses, according to data presented at the American Society of Retina Specialists (ASRS) 42nd Annual Meeting.1
Across the 40 open-ended clinical scenarios, the artificial intelligence (AI) chatbot failed to provide a comprehensive response in approximately 50% of clinical questions, and nearly 30% of the sources it generated were hallucinated. Hallucinations occur when a large language model (LLM) produces nonsensical or inaccurate responses presented as factual.2
“This demonstrates that while ChatGPT is rapidly growing more accurate, it is not yet suitable as an information source for patients,” wrote the investigative team, led by Michael J. Maywood, MD, department of ophthalmology, Corewell Health William Beaumont University Hospital.1
AI chatbots continue to evolve as medical tools, particularly in ophthalmology, making it critical to evaluate their strengths and limitations. In this retrospective, cross-sectional study, Maywood and colleagues assessed the performance of ChatGPT by determining the accuracy of the chatbot’s responses to complex open-ended vitreoretinal clinical scenarios, as well as the sources it used in answering the clinical prompts.
Investigators designed 40 open-ended clinical scenarios across 4 primary topics in vitreoretinal disease, with responses graded on correctness and comprehensiveness by 3 blinded retina specialists. The primary outcome of the analysis was the number of clinical scenarios answered correctly and comprehensively by the chatbot.
Secondary outcomes involved theoretical harm to patients from an incorrect response, the distribution of the type of references used by ChatGPT, and the occurrence of hallucinated references.
In the June 2023 analysis, ChatGPT answered 83% (n = 33 of 40) of clinical scenarios correctly but provided a comprehensive answer in only 52.5% (n = 21 of 40) of cases. Subgroup analysis showed average correct response rates of 86.7% in neovascular age-related macular degeneration (nAMD), 100% in diabetic retinopathy (DR), 76.7% in retinal vascular disease, and 70% in the surgical domain.
After assessment of the references, 70% of those generated by ChatGPT were real and 30% were hallucinated. Overall, there were 6 incorrect responses: 1 case (16.7%) of no harm, 3 cases (50%) of possible harm, and 2 cases (33.3%) of definitive harm.
“It was unable to provide a comprehensive response in ~50% of clinical questions and generated 30% hallucinated sources,” Maywood added.
References