In the Task 3 supplemental information, it is stated that the validation and test set predictions are to be made on "single image data only". I was wondering whether this means that there is no lesion overlap between the training and validation sets.
For the test set this is obvious, but I was wondering whether the feedback from validation submissions is meaningful. If we cannot assume strict lesion separation between the validation and training sets, then the performance feedback is not very helpful, and results potentially reported in the arXiv paper would be unrealistic.
This means you are not supposed to use aggregate dataset statistics in any way to affect the judgement on single images, i.e. no normalization based on the entire test set. The system should make its decision on each test set image from that image alone.
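To make the distinction concrete, here is a minimal sketch (assuming NumPy arrays of equal shape; the function names are purely illustrative): standardizing each image with its own statistics stays within the "single image only" rule, while normalizing with statistics pooled over the submitted image set does not.

```python
import numpy as np

def per_image_standardize(image: np.ndarray) -> np.ndarray:
    # Allowed: the statistics come from this image alone, so the
    # prediction for one test image cannot depend on any other image.
    return (image - image.mean()) / (image.std() + 1e-8)

def testset_standardize(images):
    # NOT allowed during the Validation/Test phases: the mean and std
    # are computed over the whole submitted image set, so each image's
    # preprocessing (and hence its prediction) depends on the other
    # images in the set.
    stacked = np.stack([img.astype(np.float64) for img in images])
    mean, std = stacked.mean(), stacked.std()
    return [(img - mean) / (std + 1e-8) for img in images]
```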
To directly answer your question about overlap, there is no overlap between the Test and Validation image sets. However, both image sets were randomly sampled from the same population of clinical images:
[1] Noel C. F. Codella, David Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti, Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection: A Challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), Hosted by the International Skin Imaging Collaboration (ISIC)”, 2017; arXiv:1710.05006.
[2] Philipp Tschandl, Cliff Rosendahl, Harald Kittler: “The HAM10000 Dataset: A Large Collection of Multi-Source Dermatoscopic Images of Common Pigmented Skin Lesions”, 2018; arXiv:1803.10417.
Accordingly, the two image sets should be statistically quite similar. Of course, due to the significantly smaller sample size of the Validation image set (and the non-blinded scoring), we are not confident that the Validation phase provides a robust measure of algorithm performance; it is intended only as a tool to confirm that submissions are formatted correctly.
Still regarding the image set distributions, can we assume that the test set has the same statistical distribution/representativeness as the training set?
No, the Training and Test sets do not necessarily have the same statistical properties. An important aspect of clinically useful algorithms is generalizability, so the Test sets may contain some images drawn from additional sources. Of course, all participants are evaluated equally against the same Test sets, which we believe is fundamentally fair. Furthermore, note that we are gathering and reporting information on whether submitted algorithms were trained with additional dermoscopic/skin imaging data sources beyond the provided Training images.