A recent study demonstrated that physicians surpass GPT-4-turbo and GPT-3.5-turbo at making clinical recommendations in the emergency department.
ChatGPT will not be helping physicians make decisions any time soon, a new study demonstrated.1
GPT-4-turbo may have performed better than the earlier GPT-3.5-turbo, particularly at predicting the need for antibiotics for a patient in the emergency department, but even the newer language model did not outperform a resident physician.
Artificial intelligence (AI) in healthcare has been studied across specialties, from psychiatry and dermatology to ophthalmology and now hospital medicine. Although AI can help physicians complete their tasks more quickly, such as by speeding up the diagnostic process, it is not a replacement for a human, especially in the emergency department.
“This is a valuable message to clinicians not to blindly trust these models,” said lead investigator Christopher Y.K. Williams, MD, from Bakar Computational Health Sciences Institute, University of California, San Francisco.2 “ChatGPT can answer medical exam questions and help draft clinical notes, but it’s not currently designed for situations that call for multiple considerations, like the situations in an emergency department.”
Investigators conducted a study to determine whether large language models, such as GPT-4, can use clinical notes from the emergency department to provide clinical recommendations on 3 tasks: admission status, radiological investigation request status, and antibiotic prescription status.1 The team randomly selected 10,000 emergency department visits (out of 351,401 visits) and assessed the accuracy of zero-shot clinical recommendations generated by GPT-3.5-turbo and GPT-4-turbo under 4 different prompts, labeled A through D.
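For illustration only, a zero-shot query of the kind the study describes, shown here for the admission decision, might look like the following Python sketch. The prompt wording, the answer parsing, and the function name are hypothetical stand-ins rather than the study's actual materials; the sketch assumes the OpenAI Python client and an API key in the environment.

```python
# Illustrative sketch only -- not the study's code or prompts.
# Assumes the OpenAI Python client (pip install openai) and an
# OPENAI_API_KEY environment variable; prompt text is hypothetical.
from openai import OpenAI

client = OpenAI()

def recommend_admission(note_text: str, model: str = "gpt-4-turbo") -> bool:
    """Zero-shot binary recommendation: should this ED patient be admitted?"""
    response = client.chat.completions.create(
        model=model,      # swap in "gpt-3.5-turbo" for the older model
        temperature=0,    # reduce randomness for evaluation runs
        messages=[{
            "role": "user",
            "content": (
                "Based on the emergency department note below, should this "
                "patient be admitted to the hospital? Answer Yes or No only.\n\n"
                + note_text
            ),
        }],
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("yes")
```

Each model's Yes/No outputs can then be scored against what actually happened during the visit, which is how measures such as accuracy, sensitivity, and specificity are derived.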
Both ChatGPT models performed poorly compared with the resident physicians: GPT-4-turbo was 8% less accurate than the physicians, and GPT-3.5-turbo was 24% less accurate. The language models tended to be overly cautious in their recommendations, producing high sensitivity at the expense of specificity.
Prompt A led to high-sensitivity, low-specificity performance. Prompt B marginally improved the specificity. Prompts C and D generated the greatest specificity, with limited effect on sensitivity.
The team discovered that physician sensitivity was below that of the GPT-3.5-turbo responses, but physician specificity was significantly greater. They observed similar findings when comparing GPT-4-turbo with the physicians, except on the antibiotic prescription task, where the language model outperformed the physicians overall but had worse sensitivity.
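For readers less familiar with these metrics, sensitivity is the share of truly positive cases a model flags, and specificity is the share of truly negative cases it correctly leaves alone. The toy confusion counts below are invented purely to illustrate the cautious pattern the study describes:

```python
# Invented confusion counts for illustration; not the study's data.
tp, fn = 90, 10   # positive visits (e.g., antibiotics truly needed): caught vs. missed
tn, fp = 40, 60   # negative visits: correctly spared vs. over-recommended

sensitivity = tp / (tp + fn)   # 90 / 100 = 0.90 -- rarely misses a sick patient
specificity = tn / (tn + fp)   # 40 / 100 = 0.40 -- often over-recommends

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

A profile like this errs on the side of caution: it seldom misses a patient who needs an intervention, but it recommends unnecessary interventions for many who do not.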
However, when the language models were evaluated in a more representative setting, using an unbalanced sample of 1000 emergency department visits reflecting real-world prevalence, the resident physician recommendations were more accurate than the GPT-3.5-turbo recommendations for all prompts. GPT-4-turbo performed better than the physicians on the antibiotic prescription task but worse on admission status and radiological investigation.
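The shift between the two settings follows directly from class balance. Using invented rates, a high-sensitivity, low-specificity model can look acceptable on a 50/50 sample yet lose accuracy on a real-world sample where most visits are negative:

```python
# Invented rates for illustration; not the study's figures.
sens, spec = 0.90, 0.40

def accuracy(prevalence: float) -> float:
    """Overall accuracy = sens * P(positive) + spec * P(negative)."""
    return sens * prevalence + spec * (1 - prevalence)

print(accuracy(0.5))   # balanced sample:   0.65
print(accuracy(0.1))   # unbalanced sample: 0.45 -- same model, worse accuracy
```

This is why an unbalanced, representative sample gives a fairer picture of how such a model would behave in routine practice.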
The study ultimately revealed that the AI models tended to overprescribe, producing many false-positive suggestions. This can be harmful not only to the patient but also to the healthcare system itself, straining hospital resource availability and raising costs.
Williams explained that AI's tendency to overprescribe may stem from the models being trained on the internet, where trustworthy medical advice sites are not designed to answer emergency medical questions, only to direct readers to a doctor who can address their concerns.
“These models are almost fine-tuned to say, ‘seek medical advice,’ which is quite right from a general public safety perspective,” Williams said.2 “But erring on the side of caution isn’t always appropriate in the ED setting, where unnecessary interventions could cause patients harm, strain resources, and lead to higher costs for patients.”
References