The findings of a recent diagnostic study suggest AI can infer the self-reported race of infants from retinal vessel maps, something human graders are unable to do.
Results from a recent diagnostic study suggest artificial intelligence (AI) can infer self-reported race from retinal fundus images that were not previously thought to contain information relevant to race.1
Based on these findings, the investigative team, led by J. Peter Campbell, MD, MPH, of the Casey Eye Institute, Oregon Health & Science University, suggests that biomarker-based approaches to training AI models may not remove the potential risk of racial bias in practice.
“These results suggest that preliminary preprocessing steps may not always be successful, and future work ought to ensure that as much attention is paid to potential sources of bias in preprocessing as it is during the training of diagnostic models,” Campbell and colleagues wrote.1
Race is considered a social construct; however, it is associated with variations in skin pigmentation, a phenotypic feature that can affect image-based classification performance. As the application of AI in medicine continues to increase, the potential harm from these biases has warranted attention. Ophthalmology is a field that has experienced a significant increase in AI applications, and its imaging modalities, which rely on both visible and nonvisible light, could introduce biases through differential reflectance related to pigmentation.
In the current study, Campbell and colleagues evaluated whether image-based medical AI algorithms can infer self-reported race directly from color retinal fundus images (RFIs) and retinal vessel maps (RVMs) of infants screened for retinopathy of prematurity (ROP). Between January 2012 and July 2020, infants born at a gestational age of less than 31 weeks or with a birth weight ≤1501 g were routinely screened for ROP in the Imaging and Informatics in ROP cohort study. From this population, those with parent-reported Black or White race were included in the current analysis.
Grayscale RVMs were segmented from color RFIs by an algorithm developed for ROP classification that uses a U-Net to focus attention on the retinal vasculature and to standardize the appearance of images with respect to image quality, brightness, and color variations that may arise from differences in choroidal pigmentation. Additionally, RVMs were iteratively transformed via thresholding, binarizing, or skeletonizing, and convolutional neural networks (CNNs) were trained to learn self-reported race from color RFIs or from each iteration of thresholded, binarized, and skeletonized RVMs.
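The study's own code is not reproduced in the article, but the three RVM transformations described above can be illustrated with a short, hypothetical sketch. In the snippet below, the function name, the array conventions, and the 0.5 threshold are illustrative assumptions, not the investigators' implementation; it uses NumPy and scikit-image.

```python
# Minimal sketch (not the study's code) of the three RVM transformations.
# Assumes a grayscale vessel map with pixel values in [0, 1]; the 0.5
# threshold is an illustrative choice.
import numpy as np
from skimage.morphology import skeletonize

def transform_rvm(rvm: np.ndarray, threshold: float = 0.5):
    # Thresholded: suppress weak responses but keep graded vessel brightness
    thresholded = np.where(rvm >= threshold, rvm, 0.0)

    # Binarized: collapse every vessel pixel to 1, nullifying brightness
    # differences along the vasculature
    binarized = (rvm >= threshold).astype(np.float32)

    # Skeletonized: thin vessels to 1-pixel centerlines so widths are uniform
    skeletonized = skeletonize(binarized.astype(bool)).astype(np.float32)

    return thresholded, binarized, skeletonized
```

Each transformation removes progressively more information from the vessel map, which is why the finding that race remained predictable even from skeletonized RVMs is notable.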
As a result of an imbalance between Black and White infants in the data set, the analysis used area under the precision-recall curve (AUC-PR) as the main metric of model performance but additionally evaluated area under the receiver operating characteristic curve (AUROC). Both AUC-PR and AUROC were evaluated at the image and infant levels. Study data were analyzed from July to September 2021.
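For readers unfamiliar with the two metrics, the hypothetical snippet below shows how both can be computed with scikit-learn; average precision is a common summary of the precision-recall curve, and the labels and scores here are made-up placeholders, not study data.

```python
# Illustrative computation of both performance metrics with scikit-learn.
# y_true and y_score are placeholders, not data from the study.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1 = minority class
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])  # model outputs

# Average precision summarizes the precision-recall curve and is more
# informative than AUROC when one class is much rarer than the other
auc_pr = average_precision_score(y_true, y_score)
auroc = roc_auc_score(y_true, y_score)
print(f"AUC-PR: {auc_pr:.3f}  AUROC: {auroc:.3f}")
```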
Of the infants in the data set, 245 neonates with parent-reported Black (mean age, 27.2 weeks; 55 males [58.5%]) or White (mean age, 27.6 weeks; 80 males [53.0%]) race met inclusion criteria for the study: 94 Black infants (38.4%) and 151 White infants (61.6%). A total of 40 CNN models were trained on color RFIs or on thresholded grayscale, binarized, or skeletonized RVMs.
Upon analysis, investigators found the models had near-perfect ability to predict self-reported race from color RFIs, with an AUC-PR of 0.999 (95% CI, 0.999 - 1.000) at the image level and 1.000 at the infant level (95% CI, 0.999 - 1.000). Campbell and colleagues noted that raw RVMs were nearly as predictive as color RFIs (image-level AUC-PR, 0.938; 95% CI, 0.926 - 0.950; infant-level AUC-PR, 0.995; 95% CI, 0.992 - 0.998).
The investigative team noted that CNNs could learn whether RFIs or RVMs were from Black or White infants, regardless of whether images contained color, vessel segmentation brightness differences were nullified, or vessel segmentation widths were uniform.
In an accompanying editorial, Daniel Shu Wei Ting, MD, PhD, Surgical Retina, Singapore National Eye Center, noted the critical contribution of the study’s findings is the evidence that AI algorithms can still detect features to identify self-reported race beyond what human graders are able to identify in images.
“It highlights the challenge in mitigating the risk of AI racial bias, where previously deployed biomarker-based AI models may not fully address racial bias of the AI algorithm,” Ting wrote.2 “This suggests that further work into AI explainability is required to understand how AI models can differentiate race beyond features identified by humans.”
Ting noted that future research should focus on additional strategies to identify and mitigate AI bias at various stages of the model development pipeline, ranging from the data collection and preparation stage to the model training and evaluation stage to the post-authorization deployment stage.
“As health care AI applications approach large-scale adoption, potential racial biases in AI models must be proactively evaluated and mitigated to prevent patient harm and to reduce inequities in health care,” he wrote.2
References