ChatGPT-4 Vision Performs Better on Text-Based Radiology Test Questions Than Image-Based Ones

Chad Klochko, MD | Credit: Henry Ford Health

In a recent study, ChatGPT-4 Vision performed well on text-based radiology test questions but struggled to answer image-related questions accurately.

ChatGPT-4 Vision, released in September 2023, is the first large language model with both text and image processing capabilities. The competence of the AI tool for radiology-related tasks is still being studied.

“ChatGPT-4 has shown promise for assisting radiologists in tasks such as simplifying patient-facing radiology reports and identifying the appropriate protocol for imaging exams,” said Chad Klochko, MD, musculoskeletal radiologist and artificial intelligence (AI) researcher at Henry Ford Health in Detroit, Michigan. “With image processing capabilities, GPT-4 Vision allows for new potential applications in radiology.”

Investigators set out to evaluate the performance of ChatGPT-4 Vision on both text-based and image-based radiology in-training examination questions to gauge the model’s baseline knowledge in radiology. To accomplish this, the team conducted a prospective study between September 2023 and March 2024.

Hayden and colleagues asked ChatGPT-4 Vision 386 retired questions from the American College of Radiology Diagnostic Radiology In-Training Examinations, of which 189 were image-based and 197 were text-based. Nine questions were duplicates, and only the first instance of each was included in the assessment, leaving 377 unique questions across 13 domains. A subanalysis evaluated the impact of several zero-shot prompts on performance.

ChatGPT-4 Vision correctly answered 65.3% of the 377 unique questions. The AI had significantly greater accuracy on text-only questions (81.5%) than on image-based questions (47.8%).

“The 81.5% accuracy for text-only questions mirrors the performance of the model’s predecessor,” Klochko said. “This consistency on text-based questions may suggest that the model has a degree of textual understanding in radiology.”

Investigators evaluated how several zero-shot prompts affected the performance of ChatGPT-4 Vision; a sketch of how such prompts might be submitted programmatically follows the list. The prompts included:

· Original: “You are taking a radiology board exam. Images of the questions will be uploaded. Choose the correct answer for each question.”

· Basic: “Choose the single best answer in the following retired radiology board exam question.”

· Short instruction: “This is a retired radiology board exam question to gauge your medical knowledge. Choose the single best answer letter and do not provide any reasoning for your answer.”

· Long instruction: “You are a board-certified diagnostic radiologist taking an examination. Evaluate each question carefully and if the question additionally contains an image, please evaluate the image carefully in order to answer the question. Your response must include a single best answer choice. Failure to provide an answer choice will count as incorrect.”

· Chain of thought: “You are taking a retired board exam for research purposes. Given the provided image, think step by step for the provided question.”
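
For readers unfamiliar with zero-shot prompting of a multimodal model, the minimal sketch below shows one way an image-based exam question could be submitted with the study’s "basic" prompt. It uses the OpenAI Python SDK; the model name, image file, and helper function are illustrative assumptions, not the study’s actual methods or code.

```python
# Minimal sketch (not the study's code) of sending one image-based, zero-shot
# question to a vision-capable GPT-4 model via the OpenAI Python SDK.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_image_question(prompt: str, image_path: str) -> str:
    # Encode the question image as base64 so it can be sent inline.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed vision-capable model name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical usage: the "basic" prompt paired with an example question image.
answer = ask_image_question(
    "Choose the single best answer in the following retired radiology board exam question.",
    "question_001.png",
)
print(answer)
```

Because each prompt is zero-shot, no worked examples are supplied to the model; only the instruction wording changes between the prompting strategies listed above.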

With the basic prompt, the model correctly answered 183 of 265 questions but declined to answer 120 questions, many of which were image-based.

“The phenomenon of declining to answer questions was something we hadn’t seen in our initial exploration of the model,” Klochko said.

On text-based questions, chain-of-thought prompting outperformed long instruction by 6.1% (P = .02), basic prompting by 6.8% (P = .009), and the original prompting style by 8.9% (P = .001). Investigators observed no differences between prompts on image-based questions, with P values ranging from .27 to > .99.

Genitourinary radiology was the only subspecialty in which ChatGPT-4 Vision performed better on image-based questions (67% accuracy) than on text-only questions (57%). In every other subspecialty, the model scored higher on text-only questions.

Although the model generally fared better on text-based questions, among image-based questions it achieved its highest accuracy in the chest (69%) and genitourinary (67%) subspecialties. It performed worst on image-based questions in the nuclear medicine domain, answering only 2 of 10 correctly.

“Given the current challenges in accurately interpreting key radiologic images and the tendency for hallucinatory responses, the applicability of GPT-4 Vision in information-critical fields such as radiology is limited in its current state,” Klochko said.

References

  1. Hayden N, Gilbert S, Poisson LM, Griffith B, Klochko C. Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions. Radiology. 2024;312(3):e240153. doi:10.1148/radiol.240153
  2. Vision-based ChatGPT shows deficits interpreting radiologic images. EurekAlert! September 3, 2024. https://www.eurekalert.org/news-releases/1055870. Accessed September 9, 2024.

