Acknowledging the potential medical harm from errors and bias, investigators believe clinical AI needs to be held to a higher standard.
A scoping review from the Stanford School of Medicine found 3 issues in data sets that were used to develop and test clinical artificial intelligence (AI) algorithms for skin disease.
The issues pertained to sparse data set characterization and a lack of transparency, nonstandard and unverified disease labels, and an inability to fully assess the diversity of the patients whose images were used for algorithm development and testing.
The study was led by Roxana Daneshjou, MD, PhD, of the Stanford Department of Dermatology.
Daneshjou and investigators noted the potential of clinical AI to improve clinical care, but added that how well an algorithm generalizes depends on the data sets it is trained and tested on.
To better understand the characteristics of data sets used to develop dermatology AI algorithms and to identify potential problems with them, the team performed a scoping review covering 5 years of literature, assessing data set availability and data set descriptions such as labeling and patient diversity.
Daneshjou and investigators initially searched PubMed for peer-reviewed articles published between January 1, 2015, and November 1, 2020.
Articles were included if they used clinical or dermoscopic images of skin disease either to develop a deep learning algorithm for a clinically relevant task or to test an existing deep learning algorithm.
Data set audit criteria were determined by consensus of all authors after reviewing the existing literature on data set transparency and sources of bias, and data set descriptions were recorded as the data were presented in each study.
The team identified 70 studies that met their criteria, a majority of which were published in the past 2 years. Of these, 57 developed and tested a new deep learning algorithm and 13 tested an existing algorithm on an additional data set. Only 14 studies used test data from a source different from the training set or performed additional external validation.
Additionally, 56 studies included at least 1 cutaneous malignant neoplasm in their task, but only 36 of those (64.3%) met the gold standard of histopathological diagnosis for their disease labels. Because more than a third did not, a substantial number of algorithms were likely trained and/or tested on noisy annotations, which could affect diagnostic accuracy.
Among all studies, only 14 of 70 (20.0%) described any information on the ethnicity or race of patients in at least 1 data set and only 7 of 70 (10.0%) included any information on the Fitzpatrick skin types of images used in at least 1 data set.
Investigators noted that data on the prevalence of Fitzpatrick skin types in the US population were inadequate.
They also recorded a lack of clear labeling in many studies. As for availability, the investigators estimated that of a total of 1,065,291 images, only 257,372 (24.2%) were publicly available or available on request, whereas 807,919 (75.8%) were private.
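For readers tallying the figures, the reported proportions follow directly from the image counts:

\[
\frac{257{,}372}{1{,}065{,}291} \approx 24.2\%,
\qquad
\frac{807{,}919}{1{,}065{,}291} \approx 75.8\%
\]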
Most of the AI literature on skin conditions was thus built from private data sets used once to generate a single study, with no possibility of replication, which greatly limited the work's impact and generalizability.
Models were similarly inaccessible: only 21 of the 70 studies (30.0%) stated that the model was available on request or through a commercial product.
Based on their findings, the investigators made 3 recommendations for improving transparency and reducing the potential for bias in clinical algorithms. The first was to share data sets and/or models at the time of publication. The second was to clearly describe important data set characteristics, including the patient populations, the skin tones represented, and the image labeling process, whenever data cannot be shared. The third was to clearly describe how data sets were used for training, validation, testing, or additional external validation.
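To make the recommendations concrete, the minimal sketch below shows one way a published study might report these characteristics in a machine-readable form. It is purely illustrative: every field name and value here is an assumption made for this example, not a schema specified by the study.

```python
# Hypothetical "datasheet" for a dermatology AI data set, illustrating the
# three recommendations from the review. All field names and values are
# illustrative assumptions; the study does not prescribe a specific schema.

dataset_description = {
    "name": "example_derm_dataset",       # hypothetical data set name
    # Recommendation 1: share data sets and/or models at publication.
    "data_availability": "public",        # e.g., "public", "on_request", "private"
    "model_availability": "on_request",
    # Recommendation 2: describe patient populations, skin tones, and labeling.
    "patient_population": {
        "countries": ["US"],
        "race_ethnicity_reported": True,
    },
    "fitzpatrick_skin_types": {           # image counts per skin type group
        "I-II": 410,
        "III-IV": 350,
        "V-VI": 240,
    },
    "labeling": {
        "method": "histopathology",       # gold standard for malignancy labels
        "reviewers_per_image": 2,
    },
    # Recommendation 3: describe how the data were used.
    "usage": {
        "training_images": 700,
        "validation_images": 100,
        "test_images": 200,
        "external_test_source": "separate_institution",
    },
}

if __name__ == "__main__":
    # Print the description so readers can audit it at a glance.
    for field, value in dataset_description.items():
        print(f"{field}: {value}")
```

Even a simple record like this would let readers verify whether test data came from a different source than the training data, one of the gaps the review flagged.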
“Developing clinically applicable deep learning models in medicine is predicated on the creation of robust models developed from data sets that are either publicly available for scrutiny or well described,” the team wrote. “Clinical AI must be held to a high standard owing to the potential for significant medical harm from errors and bias.”
The study, “Lack of Transparency and Potential Bias in Artificial Intelligence Data Sets and Algorithms: A Scoping Review,” was published online in JAMA Dermatology.