Physicians Should Not Use ChatGPT for Clinical Recommendations, Study Indicates

Key Takeaways

GPT-4-turbo and GPT-3.5-turbo underperformed compared to resident physicians in emergency department tasks, except for antibiotic prescriptions.
AI models demonstrated high sensitivity but low specificity, often leading to overprescription and false positives.
AI's cautious recommendations stem from training on general internet data, not tailored for emergency medical decision-making.
Resident physicians outperformed AI in real-world settings, highlighting AI's current limitations in complex clinical environments.

A recent study found that resident physicians outperform GPT-4-turbo and GPT-3.5-turbo at making clinical recommendations in the emergency department.

ChatGPT will not be guiding physicians' decision-making any time soon, a new study suggests.¹

GPT-4-turbo outperformed the earlier GPT-3.5-turbo, particularly in predicting whether a patient in the emergency department needed antibiotics, but it still did not outperform a resident physician overall.

Artificial intelligence (AI) in healthcare has been studied across many specialties, from psychiatry and dermatology to ophthalmology and now hospital medicine. Although AI can help physicians complete tasks more quickly, such as by speeding up diagnosis, it is not a replacement for a human, especially in the emergency department.

“This is a valuable message to clinicians not to blindly trust these models,” said lead investigator Christopher Y.K. Williams, MD, from Bakar Computational Health Sciences Institute, University of California, San Francisco.² “ChatGPT can answer medical exam questions and help draft clinical notes, but it’s not currently designed for situations that call for multiple considerations, like the situations in an emergency department.”

Investigators conducted a study to determine whether large language models, such as GPT-4, can provide clinical recommendations for the tasks of admission status, radiological investigation request status, and antibiotic prescription status using clinical notes from the emergency department.¹ The team randomly selected 10,000 emergency department visits (out of 351,401 visits) to assess the accuracy of zero-shot, GPT-3.5-turbo- and GPT-4-turbo-generated clinical recommendations across 4 different prompts:

Prompt A: “Please return whether the patient should be admitted to hospital/requires radiological investigation/requires antibiotics”
Prompt B: Added “only suggest… if absolutely required”
Prompt C: “Removing restrictions on the verbosity of GPT-3.5-turbo response”
Prompt D: “Let’s think step by step” chain-of-thought prompting

Both ChatGPT models, GPT-4-turbo and GPT-3.5-turbo, performed poorly compared to a resident physician, with accuracy that was on average 8% and 24% lower than the physician's, respectively. The language models tended to be overly cautious in their recommendations, which produced high sensitivity at the cost of specificity.

Prompt A led to high sensitivity and low specificity. Prompt B marginally improved specificity. Prompts C and D produced the greatest specificity, with limited effect on sensitivity.
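To make the two metrics concrete, here is a minimal sketch of how sensitivity, specificity, and accuracy fall out of a confusion matrix. The counts are invented for illustration and are not taken from the study, and the names `Metrics` and `screening_metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float  # share of true positives that were flagged
    specificity: float  # share of true negatives that were left alone
    accuracy: float     # share of all recommendations that were correct


def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> Metrics:
    """Compute the three metrics from confusion-matrix counts."""
    return Metrics(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        accuracy=(tp + tn) / (tp + fp + tn + fn),
    )


# Hypothetical counts, NOT study data: 100 patients who needed the
# intervention and 100 who did not.
cautious = screening_metrics(tp=95, fp=70, tn=30, fn=5)    # recommends almost always
selective = screening_metrics(tp=75, fp=20, tn=80, fn=25)  # recommends less often

print(cautious)
print(selective)
```

In this made-up example, the cautious recommender catches nearly every patient who needs the intervention but also recommends it for most patients who do not, which is the pattern of false positives the article describes.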

The team found that physician sensitivity was lower than that of GPT-3.5-turbo responses, but physician specificity was significantly greater. They observed similar findings when comparing GPT-4-turbo with a physician, except for the antibiotic prescription task, where the language model surpassed the physician's performance but had lower sensitivity.

However, when the language models were evaluated in a more representative setting, using an unbalanced sample of 1000 emergency department visits that reflects real-world conditions, resident physician recommendations were more accurate than GPT-3.5-turbo recommendations for all prompts. GPT-4-turbo performed better than a physician on the antibiotic prescription status task but worse on admission status and radiological investigation.
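One reason accuracy can shift between samples is that accuracy is a prevalence-weighted average of sensitivity and specificity. The sketch below illustrates this with hypothetical operating points and prevalences; none of the numbers come from the study, and treating the earlier sample as balanced (50% prevalence) is an assumption inferred from the article's wording.

```python
def accuracy_at_prevalence(sensitivity: float, specificity: float,
                           prevalence: float) -> float:
    """Accuracy as the prevalence-weighted average of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Hypothetical operating points, NOT study data.
cautious = {"sensitivity": 0.95, "specificity": 0.30}   # flags almost everyone
selective = {"sensitivity": 0.75, "specificity": 0.80}  # misses more, over-calls less

# 0.5 stands in for a balanced sample; 0.1 is an assumed rarer real-world outcome.
for prevalence in (0.5, 0.1):
    c = accuracy_at_prevalence(prevalence=prevalence, **cautious)
    s = accuracy_at_prevalence(prevalence=prevalence, **selective)
    print(f"prevalence={prevalence:.1f}  cautious={c:.3f}  selective={s:.3f}")
```

Under these assumed values, the cautious recommender's accuracy drops sharply when the outcome becomes rarer, because most patients are then true negatives and low specificity dominates, while the selective recommender's accuracy barely changes.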

The study ultimately revealed AI tended to overprescribe, resulting in many false positive suggestions. This can be harmful not only for the patient but also for the healthcare system itself by impacting hospital resource availability and costs.

Williams explained that AI’s tendency to overprescribe could stem from the models being trained on internet data: trustworthy medical advice sites are not designed to answer emergency medical questions, only to send readers to a doctor who can address their concerns.

“These models are almost fine-tuned to say, ‘seek medical advice,’ which is quite right from a general public safety perspective,” Williams said.² “But erring on the side of caution isn’t always appropriate in the ED setting, where unnecessary interventions could cause patients harm, strain resources, and lead to higher costs for patients.”

References

1. Williams CYK, Miao BY, Kornblith AE, Butte AJ. Evaluating the use of large language models to provide clinical recommendations in the Emergency Department. Nat Commun. 2024;15(1):8236. Published 2024 Oct 8. doi:10.1038/s41467-024-52415-1
2. When It Comes to Emergency Care, ChatGPT Overprescribes. EurekAlert! October 8, 2024. https://www.eurekalert.org/news-releases/1060326. Accessed October 17, 2024.