Mammograms are an important tool in women’s health, used to catch early indicators of breast cancer. However, mammograms require human specialists to interpret them, and even highly trained specialists make mistakes: the false positive rate of mammograms in the USA is estimated at approximately 10%. To create systems that read mammograms more accurately, the Digital Mammography (DM) Dream Challenge was recently launched as a crowdsourced effort to develop new mammogram-reading algorithms.
As PhysicsWorld reports, computer vision techniques and deep learning algorithms have grown more sophisticated over the past few years, and researchers and engineers are turning to AI to interpret mammograms in the hope of increasing accuracy. The DM Dream Challenge launched a competition to investigate how well AI algorithms can recognize possible signs of breast cancer.
The DM Dream Challenge is the largest-ever study of deep learning algorithms for mammography interpretation. Justin Guinney, the president of the DREAM challenges, explained that the challenge format allowed for a structured assessment of dozens of deep learning models on two different databases. Participants were required to design algorithms that could be trained on mammography data and output a probability that a patient would be diagnosed with breast cancer within a year. There was also a secondary task, in which the algorithms could additionally be trained on information such as demographic risk data, clinical data, and images collected from previous screening exams.
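The required input/output contract can be illustrated with a toy sketch. Everything below — the feature extractor, the weights, the clinical-feature handling — is a hypothetical stand-in for the deep networks participants actually built; only the shape of the task (mammogram in, one-year risk probability out, with optional clinical features for the secondary task) comes from the challenge description.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(image):
    """Toy feature extractor standing in for a deep network:
    mean intensity and intensity variance of the mammogram."""
    return np.array([image.mean(), image.var()])

def predict_risk(image, weights, bias, clinical=None, clinical_weights=None):
    """Output a probability that the patient is diagnosed with breast
    cancer within a year. All weights here are hypothetical placeholders,
    not trained values."""
    score = extract_features(image) @ weights + bias
    if clinical is not None:  # secondary task: extra demographic/clinical data
        score += np.asarray(clinical) @ np.asarray(clinical_weights)
    return 1.0 / (1.0 + np.exp(-score))  # sigmoid maps score to (0, 1)

image = rng.random((64, 64))  # stand-in for a screening mammogram
p = predict_risk(image, weights=np.array([1.5, -0.5]), bias=-1.0)
print(0.0 < p < 1.0)  # True: the output is a valid probability
```

The sigmoid at the end is what makes the output usable as a risk score rather than an unbounded number, which is what the challenge scoring required.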
Two datasets were used to train the models. The first was the Kaiser Permanente Washington (KPW) dataset, assembled by researchers in the US; the second was assembled by Swedish researchers at the Karolinska Institute (KI).
Over 1,100 participants from all over the world joined the challenge, divided into 126 teams. During the first part of the challenge, the models were trained on the KPW dataset, which included images from over 140,000 screening exams. Comparing these results against those of the secondary task showed that access to additional features such as clinical data did not meaningfully improve the discriminatory power of the algorithms. However, the DM Dream team did suggest that future algorithm development should include the analysis of patients’ prior images, positing that the participants may not have fully utilized that data.
According to Health IT Analytics, the DM Dream team wanted the top eight performing teams to collaborate on an ensemble classification model, to see whether it could outperform the individual models. The challenge coordinators used a weighted aggregation of the predictions of the various algorithms, creating the Challenge Ensemble Model (CEM). The probabilistic predictions of the CEM were compared against radiologists’ interpretations, evaluated using specificity. The radiologists achieved 90.5% specificity, while the CEM achieved only 76.1%. While those results seem disappointing, when the CEM’s predictions and the radiologists’ interpretations were aggregated into another model (CEM+R), specificity improved to 92%.
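The aggregation and evaluation steps can be sketched roughly as follows. The scores, weights, and labels below are invented for illustration; the article does not describe the organizers’ actual weighting scheme, so a simple normalized weighted average stands in for it.

```python
import numpy as np

def specificity(y_true, y_pred):
    """Fraction of true negatives among all actual negatives,
    i.e. how often healthy patients are correctly not flagged."""
    negatives = (y_true == 0)
    true_negatives = negatives & (y_pred == 0)
    return true_negatives.sum() / negatives.sum()

def weighted_ensemble(probabilities, weights):
    """Combine per-model probability scores via a weighted average."""
    probabilities = np.asarray(probabilities)  # shape: (n_models, n_exams)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize to sum to 1
    return weights @ probabilities             # one score per exam

# Toy data: 3 models each scoring 6 exams; labels are invented
model_scores = [
    [0.9, 0.2, 0.1, 0.8, 0.3, 0.05],
    [0.7, 0.4, 0.2, 0.6, 0.1, 0.10],
    [0.8, 0.1, 0.3, 0.9, 0.2, 0.15],
]
labels = np.array([1, 0, 0, 1, 0, 0])

cem_scores = weighted_ensemble(model_scores, weights=[0.5, 0.2, 0.3])
cem_decisions = (cem_scores >= 0.5).astype(int)  # threshold at 0.5
print(specificity(labels, cem_decisions))  # → 1.0 on this toy data
```

On real screening data, the threshold trades sensitivity against specificity, which is why the challenge reported specificity at a fixed operating point when comparing models to radiologists.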
Comparable results were achieved on the KI dataset, used exclusively to validate the models, which contained images from over 166,000 exams: the CEM was slightly inferior to radiologists (92.5% specificity versus 96.7%), but CEM+R achieved better results (98.5% specificity).
While none of the individual models outperformed the human specialists, the CEM+R model held a slight edge over radiologist interpretation alone, suggesting that pairing human intuition with an AI assistant could improve accuracy.