Study Compares Accuracy of ChatGPT-3.5, GPT-4 in Diagnosing Skin of Color Cases

This analysis was designed to examine ChatGPT’s accuracy in diagnosing dermatologic conditions among those with and without skin of color.

New findings suggest there is no significant difference between GPT-3.5 and GPT-4 in diagnostic accuracy for skin of color and non-skin of color cases when diagnoses are based on workup or histopathology, with both artificial intelligence (AI) models showing high accuracy rates of 72%–100%.¹

These data were included in a new research letter which began by highlighting the growing literature on the implementation of ChatGPT in the dermatology space, given the application’s ability to generate human-like text responses to input by users.

The investigators—led by Simal Qureshi from Memorial University of Newfoundland’s Faculty of Medicine in Canada—noted that the application features a standard model (GPT-3.5) as well as a premium version (GPT-4), the latter of which provides greater processing capacity. Qureshi et al. noted the widely held view that AI datasets often do not include skin of color cases.²

“However, no studies have explored the accuracy of this model in providing clinical information on (skin of color), which could be a valuable tool for clinicians and medical trainees,” Qureshi and colleagues wrote. “We, therefore, sought to understand the accuracy of ChatGPT in diagnosing dermatologic conditions in both (skin of color) and non-(skin of color) cases.”¹

Study Design

The research team evaluated 29 cases in total. Fourteen were drawn from a general dermatology textbook and assigned to the non-skin of color group. The other 15 were drawn from a dermatology textbook focused on skin of color and assigned to the skin of color group.

The cases covered a variety of skin conditions across both cohorts. The investigators entered the history and physical assessment details of each case, written in medical terminology, into the GPT-3.5 and GPT-4 models and asked each model to identify its top 3 differential diagnoses.

Where additional diagnostic data were available, such as imaging or laboratory tests, the team supplied them to the model. The model was then asked to provide a final diagnosis.

Twelve of the skin of color cases involved histopathological reports. These reports were also supplied to the models before a final diagnosis was requested.

Chi-squared tests were then used to compare the diagnostic accuracy of GPT-3.5 and GPT-4 within the skin of color and non-skin of color cohorts.

ChatGPT-3.5 Versus GPT-4

The 2 cohorts were reported to be similar in age and gender distribution. However, the non-skin of color cases were longer in word count than the skin of color cases (251.4 vs 145.9 words; P = .01). Case length did not correlate with either model's accuracy in diagnosing patients (correlation coefficients r = .11 and .26, respectively).

Overall, the investigators identified no significant difference between GPT-3.5 and GPT-4 in accuracy when formulating differential diagnoses or final diagnoses based on additional workup or histopathology. Notably, GPT-3.5's accuracy decreased as additional clinical data were added, whereas GPT-4's accuracy improved.

GPT-4 reached a correct diagnosis in 100% of the skin of color cases involving histopathology, compared with an accuracy rate of 66.7% for GPT-3.5. However, the research team reported that this difference was not statistically significant (P = .093).
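
For readers who want to see how a 2-group accuracy comparison of this kind can be checked, the following Python sketch is one possible illustration, not the investigators' analysis. It assumes the reported percentages correspond to 12 of 12 correct diagnoses for GPT-4 and 8 of 12 for GPT-3.5; these counts are inferred from the 12 histopathology cases and are not stated in the article. The function names are hypothetical, and Fisher's exact test is included only as a common alternative for small samples.

```python
from math import comb, erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test, without continuity correction, for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(x):
        # Number of tables with x in the top-left cell and the same margins.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(weight(x) for x in range(low, high + 1) if weight(x) <= observed)
    return extreme / comb(n, col1)


# Assumed counts (not from the article): rows are models, columns are correct/incorrect.
gpt4_correct, gpt4_wrong = 12, 0
gpt35_correct, gpt35_wrong = 8, 4

chi2, p_chi2 = chi_squared_2x2(gpt4_correct, gpt4_wrong, gpt35_correct, gpt35_wrong)
p_fisher = fisher_exact_2x2(gpt4_correct, gpt4_wrong, gpt35_correct, gpt35_wrong)

print(f"chi-squared = {chi2:.2f}, P = {p_chi2:.3f}")
print(f"Fisher exact two-sided P = {p_fisher:.3f}")
```

Under these assumed counts, the exact test gives a two-sided P of about .093. The article does not specify which variant of the test produced the reported value, so this should be read as a consistency check only.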

What These Findings Mean

The research demonstrated that these AI models maintained a high level of diagnostic accuracy (72%–100%), and the team concluded that accuracy was comparable across both patient cohorts.

The investigators acknowledged their study’s limitations, noting that the sample size was small and that the skin of color cohort was made up only of individuals identified as Black or Hispanic. Additionally, the team noted that the cases were selected from only 2 textbooks, which limits generalizability.

“As AI tools become increasingly used in clinical practice, dermatologists must understand the implications for diverse skin types,” they wrote. “To improve the efficacy of ChatGPT in diagnosing conditions in SOC, the body of scientific literature on which the model is trained needs to include more studies of larger sample sizes in patients with (skin of color).”

References

1. Qureshi S, Alli SR, Ogunyemi B. Accuracy of ChatGPT-3.5 and GPT-4 in diagnosing clinical scenarios in dermatology involving skin of color. Int J Dermatol. 2024. https://doi.org/10.1111/ijd.17425
2. Butt S, Butt H, Gnanappiragasam D. Unintentional consequences of artificial intelligence in dermatology for patients with skin of colour. Clin Exp Dermatol. 2021;46(7):1333–1334.