Unsupervised Pathology Detection (UPD) has recently received considerable attention in medical image diagnosis. However, the lack of publicly available benchmark datasets for UPD forces researchers to fall back on datasets that were originally created for other tasks. These datasets may exhibit a domain shift that acts as a confounding variable, fooling observers into believing that a model excels at detecting pathologies, while a significant part of its performance actually stems from detecting the domain shift. In this short paper, we use the Hyper-Kvasir dataset as an example to show how confounding variables can dramatically skew the apparent performance of pathology detection methods.
Unsupervised anomaly detection models learn to distinguish samples from the training distribution from samples of other distributions. Many works have benchmarked their models on the Hyper-Kvasir dataset, a dataset of gastrointestinal endoscopy videos, using it to detect polyps. We show that anomaly detection models trained on this dataset actually detect general distribution shifts instead of polyps.
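To make this failure mode concrete, below is a minimal sketch of a reconstruction-based anomaly detector of the kind commonly benchmarked in this setting. The architecture, the 64x64 input size implied by the layer shapes, and the use of mean squared reconstruction error as the anomaly score are illustrative assumptions on our part, not any particular published model. The key point is that such a score rises for any deviation from the training distribution, not only for the pathology itself.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Small convolutional autoencoder; reconstruction error is the anomaly score."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, x):
    """Per-image mean squared reconstruction error.

    Note: this score increases for *any* shift from the training
    distribution (lighting, viewpoint, camera distance), not only
    for the presence of a polyp.
    """
    with torch.no_grad():
        recon = model(x)
    return ((x - recon) ** 2).flatten(1).mean(dim=1)
```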
In normal images, the operator simply navigates the endoscope through the gastrointestinal tract. If they see a polyp, they inspect it more closely, resulting in vastly different orientations and lighting conditions. The two resulting distributions differ greatly, but the presence of a polyp is not the largest difference between them.
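One way to probe for such a confound is to check whether normal and abnormal images can already be separated using only global low-level statistics that carry no polyp-level information. The sketch below is our own illustration of that idea, not part of any evaluation protocol from the paper; the feature set and helper names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def global_stats(images):
    """Per-image channel means and standard deviations.

    `images` is an (N, H, W, 3) float array. Only global
    color/brightness statistics are used; no spatial or
    polyp-specific information enters the features.
    """
    means = images.mean(axis=(1, 2))  # (N, 3)
    stds = images.std(axis=(1, 2))    # (N, 3)
    return np.concatenate([means, stds], axis=1)

def confound_auroc(normal_images, abnormal_images, seed=0):
    """AUROC of a linear classifier on global statistics alone.

    A value far above 0.5 means the two sets are separable
    without looking at polyps at all, i.e. a domain shift
    confound is present.
    """
    X = global_stats(np.concatenate([normal_images, abnormal_images]))
    y = np.concatenate([np.zeros(len(normal_images)),
                        np.ones(len(abnormal_images))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

If such a trivial classifier achieves high AUROC, a large part of any anomaly detector's reported performance on the benchmark may stem from the distribution shift rather than from pathology detection.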
Fig. 1. Normal training and abnormal test images in Hyper-Kvasir come from different distributions.