Dear All,
I used perceptual hashes (pHash) to search for duplicate images, whether they have the same or different ground-truth (GT) labels. The same approach can also be used to detect possible duplicates between the training and test sets, to avoid leakage.
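For anyone who wants to try something similar, here is a minimal sketch of the grouping step. It assumes the hashes have already been computed and stored as strings, one per image ID (for example with a perceptual-hash library). The function names and the IDs and hash values in the example are made up for illustration and are not the code used for the analysis above.

```python
from collections import defaultdict


def group_by_hash(hashes):
    """Map each hash value to the image IDs sharing it; keep groups of 2+."""
    groups = defaultdict(list)
    for image_id, h in hashes.items():
        groups[h].append(image_id)
    return {h: sorted(ids) for h, ids in groups.items() if len(ids) > 1}


def cross_split_matches(train_hashes, test_hashes):
    """Return sorted (train_id, test_id) pairs whose hashes are identical."""
    by_hash = defaultdict(list)
    for image_id, h in train_hashes.items():
        by_hash[h].append(image_id)
    return sorted(
        (train_id, test_id)
        for test_id, h in test_hashes.items()
        for train_id in by_hash.get(h, [])
    )


# Made-up IDs and hash strings, for illustration only.
train = {"train_a": "ff00", "train_b": "ff00", "train_c": "0f0f"}
test = {"test_x": "0f0f", "test_y": "abcd"}

print(group_by_hash(train))             # duplicate candidates within one split
print(cross_split_matches(train, test)) # possible train/test leakage
```

This only finds exact hash matches. Near-duplicates (resized or re-compressed copies) would need a Hamming-distance threshold on the hashes instead of equality.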
Hi,
I’ve checked the possible duplicates, mainly searching for those with different labels. The fact that the hashes match doesn’t necessarily mean that the images are the same.
There are 104 groups of images with matching hashes: 103 groups contain 2 images, and one group contains 3. In 3 of these groups the labels do not match. In one of them the images are actually different (a hash collision). In the other 2, the same image (under different IDs) was labeled as both NV and MEL.
Hash collision: (‘ISIC_0060085.jpg’, ‘1_NV’), (‘ISIC_0053484.jpg’, ‘4_BKL’)
Double labeling:
Case 1: (‘ISIC_0069013.jpg’, ‘1_NV’), (‘ISIC_0071017.jpg’, ‘0_MEL’)
Case 2: (‘ISIC_0067980.jpg’, ‘1_NV’), (‘ISIC_0067502.jpg’, ‘0_MEL’)
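A check like the one above could be sketched roughly as follows. This is only an illustration under stated assumptions: each record is an (image ID, hash, label) tuple, and pixel data is available as a flat sequence of grayscale values of equal length for both images. The helper names `label_conflicts` and `mean_abs_diff` and the example records are hypothetical.

```python
from collections import defaultdict


def label_conflicts(records):
    """Return hash groups whose members carry more than one distinct label.

    records: iterable of (image_id, hash_value, label) tuples.
    """
    groups = defaultdict(list)
    for image_id, h, label in records:
        groups[h].append((image_id, label))
    return {
        h: members
        for h, members in groups.items()
        if len({label for _, label in members}) > 1
    }


def mean_abs_diff(pixels_a, pixels_b):
    """Mean absolute difference between two equal-length pixel sequences.

    A value near 0 suggests the same image; a large value suggests that
    the matching hashes are a collision between different images.
    """
    if len(pixels_a) != len(pixels_b) or not pixels_a:
        raise ValueError("pixel sequences must be non-empty and equal length")
    return sum(abs(a - b) for a, b in zip(pixels_a, pixels_b)) / len(pixels_a)


# Made-up records, for illustration only.
records = [
    ("img_1", "aaaa", "1_NV"),
    ("img_2", "aaaa", "0_MEL"),  # same hash, different label: needs review
    ("img_3", "bbbb", "1_NV"),
    ("img_4", "bbbb", "1_NV"),   # same hash, same label: not a conflict
]
print(label_conflicts(records))
print(mean_abs_diff([10, 20, 30], [10, 20, 30]))
```

The label check only flags candidates. Deciding whether a flagged group is a true duplicate or a hash collision still requires comparing the images themselves, by pixel difference or by eye.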
This issue should be reviewed further before the new dataset is published on the ISIC platform. The analysis performed by @deeponcologyai was very useful.