A list of duplicate images in the training set

Dear All,
I used perceptual hashes (pHash) to search for duplicate images with the same or different ground-truth (GT) labels. The same approach can also be used to detect possible duplicates between the train and test sets, to avoid leakage.
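
For readers without the notebook at hand, here is a minimal, dependency-free sketch of the idea. It is not the notebook's code: it swaps in a simpler average hash for a true pHash (which is based on a DCT), and it runs on toy grayscale grids rather than real image files. All function names are illustrative.

```python
from typing import Dict, Sequence, Set


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Average hash of a small grayscale grid (values 0-255), returned as an int."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


def shared_hashes(train: Dict[str, int], test: Dict[str, int]) -> Set[int]:
    """Hashes that occur in both splits, i.e. candidate train/test leakage."""
    return set(train.values()) & set(test.values())


# Toy 2x2 "images": image_b is a slightly brighter copy of image_a,
# image_c has the opposite bright/dark layout.
image_a = [[10, 200], [220, 30]]
image_b = [[12, 205], [225, 33]]
image_c = [[200, 10], [30, 220]]

hash_a = average_hash(image_a)
hash_b = average_hash(image_b)
hash_c = average_hash(image_c)

print(hamming_distance(hash_a, hash_b))  # small distance: likely duplicates
print(hamming_distance(hash_a, hash_c))  # large distance: different images

# Hypothetical file names, used only to show the leakage check.
leaks = shared_hashes({"train_1.jpg": hash_a}, {"test_1.jpg": hash_b})
```

With real images, each file would first be converted to grayscale and shrunk to a small fixed grid (for example 8x8) before hashing.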

The full Jupyter notebook is attached.

Regards,
duplicated-list-csv-file.zip (784.9 KB)

Hi,
I’ve checked the possible duplicates, mainly searching for those with different labels. The fact that the hashes match doesn’t necessarily mean that the images are the same.
There are 104 groups of images with matching hashes: 103 groups of 2 images and one group of 3 images. In 3 of these groups the labels do not match. In one of them the images are actually different (a hash collision). In the other 2, the same image (under different IDs) was labeled both as NEV and as MEL.

Hash collision: (‘ISIC_0060085.jpg’, ‘1_NV’), (‘ISIC_0053484.jpg’, ‘4_BKL’)
Double labeling:
Case 1: (‘ISIC_0069013.jpg’, ‘1_NV’), (‘ISIC_0071017.jpg’, ‘0_MEL’)
Case 2: (‘ISIC_0067980.jpg’, ‘1_NV’), (‘ISIC_0067502.jpg’, ‘0_MEL’)
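
The check described above (group images by hash, then look for groups with more than one label) can be sketched roughly as follows. The hash strings are placeholders, not the real hash values, and the `example_*.jpg` entries are hypothetical rows added only to show the non-conflicting paths. The function names are illustrative as well.

```python
from collections import defaultdict


def group_by_hash(records):
    """records: iterable of (image_id, label, hash_value) tuples."""
    groups = defaultdict(list)
    for image_id, label, hash_value in records:
        groups[hash_value].append((image_id, label))
    # Keep only hashes shared by two or more images.
    return {h: items for h, items in groups.items() if len(items) > 1}


def label_conflicts(groups):
    """Return the groups whose members do not all share one label."""
    return {
        h: items
        for h, items in groups.items()
        if len({label for _, label in items}) > 1
    }


records = [
    ("ISIC_0069013.jpg", "1_NV", "hash_A"),   # placeholder hash
    ("ISIC_0071017.jpg", "0_MEL", "hash_A"),
    ("ISIC_0067980.jpg", "1_NV", "hash_B"),   # placeholder hash
    ("ISIC_0067502.jpg", "0_MEL", "hash_B"),
    ("example_1.jpg", "1_NV", "hash_C"),      # hypothetical same-label pair
    ("example_2.jpg", "1_NV", "hash_C"),
    ("example_3.jpg", "4_BKL", "hash_D"),     # hypothetical unique image
]

groups = group_by_hash(records)
conflicts = label_conflicts(groups)

for hash_value, items in conflicts.items():
    print(hash_value, items)
```

A flagged group can be either a true double labeling or a hash collision between different images, so each one still needs a visual check.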

This issue should be reviewed further before the new dataset is published on the ISIC platform. The analysis performed by @deeponcologyai was very useful.

Best regards,

Hello Eduardo Pérez,
Are you saying that there are only 2 cases of double labeling? Please confirm.

Hi @alla1g2f0 ,
Reproducing the work of @deeponcologyai, I obtained these 2 cases in which the same image appears with different labels.

A pending task in this topic is to manually review the rest of the images whose hashes matched.
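
One possible way to shrink that manual review, offered only as a sketch: groups whose files are byte-for-byte identical can be confirmed as duplicates automatically, leaving only the remaining groups for visual inspection. This triage step is an assumption of mine, not something done in the thread, and the function names and file names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of raw file contents."""
    return hashlib.sha256(data).hexdigest()


def split_for_review(group: Dict[str, bytes]) -> Tuple[List[str], List[str]]:
    """group maps image_id -> raw file bytes for images sharing a perceptual hash.

    Returns (exact_duplicates, needs_manual_review).
    """
    digests = {sha256_of(data) for data in group.values()}
    if len(digests) == 1:
        # Every file has identical bytes: certainly the same image.
        return sorted(group), []
    # Bytes differ: could be a re-encoded copy or a hash collision.
    return [], sorted(group)


# Toy byte strings stand in for file contents read with open(path, "rb").
same = split_for_review({"a.jpg": b"same-bytes", "b.jpg": b"same-bytes"})
different = split_for_review({"c.jpg": b"bytes-1", "d.jpg": b"bytes-2"})
```

Files that differ in bytes may still show the same lesion (for example after re-encoding or resizing), so they stay in the manual queue.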

Hello,

I can confirm the images correspond to the diagnostic class MEL.

Sorry for any inconvenience.

We will make sure we check the rest of the hash collisions for any other diagnostic inconsistencies.

Marc Combalia

So there are duplicates in the training data?