This repository provides a slightly cleaner wash list for MS-Celeb-1M. The dataset is known to contain a lot of noise. For example, some images of one celebrity are filed under other celebrities, some images are very blurry, and some are clearly not human faces.
We provide a wash list to clean the dataset. You can download it from OneDrive (coming soon) or Baidu Yun.
This list is based on XiangWu's repo, so it may be slightly cleaner than his. The ID mapping is the same as Wu's, and each entry in the list follows this format:
ID/\$(foldername)_\$(filename)
We extracted a feature for every image with a CNN and then crudely applied a hierarchical clustering algorithm to find the cluster containing the most images in each celebrity folder. The images in this cluster are regarded as the noise-free data for that celebrity. If this largest cluster contains 5 images or fewer, the whole folder is dropped.
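The cleaning rule above can be sketched as follows. This is only a minimal illustration, not the code used to build the list: the linkage (single linkage), the distance (cosine distance), the cut threshold `DISTANCE_THRESHOLD`, and all function names are assumptions, since the text only states that hierarchical clustering was applied to CNN features. Feature vectors are assumed to be non-zero.

```python
import math
from collections import Counter

MIN_CLUSTER_SIZE = 5       # from the text: folders whose largest cluster has <= 5 images are dropped
DISTANCE_THRESHOLD = 0.5   # assumed cut-off; the value actually used is not stated


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def largest_cluster(features, threshold=DISTANCE_THRESHOLD):
    """Single-linkage clustering cut at `threshold`.

    Cutting a single-linkage dendrogram at a distance threshold is the same as
    merging every pair closer than the threshold, which union-find does here.
    Returns the indices of the images in the biggest cluster.
    """
    if not features:
        return []
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine_distance(features[i], features[j]) <= threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(features))]
    biggest_root, _ = Counter(roots).most_common(1)[0]
    return [i for i, root in enumerate(roots) if root == biggest_root]


def clean_folder(features):
    """Return indices of images to keep, or [] if the whole folder is dropped."""
    kept = largest_cluster(features)
    return kept if len(kept) > MIN_CLUSTER_SIZE else []
```

Calling `clean_folder` on the list of feature vectors from one celebrity folder returns the indices of the images to keep. The pairwise loop is quadratic in the folder size, which is fine for a sketch but would need a faster implementation on large folders.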
Dataset | Celebrities | Images |
---|---|---|
Original Dataset | 99,892 | 8,456,240 |
XiangWu's Cleaned Dataset | 79,099 | 5,049,824 |
Our Cleaned Dataset | 78,579 | 4,621,640 |
We spot-checked some cases manually, and found a few typical cases:
Before:
After:
Because the variance of the images in this folder is very large, the biggest cluster contains only 1/5 of the images.
Before:
After:
We searched MS-Celeb-1M for every individual in LFW and list each one's nearest neighbor, together with the cosine similarity, in msceleb1m_lfw_mapping_probability.txt. Each line has the format:
ID lfw_celeb_name cosine_similarity
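As a rough illustration of the nearest-neighbor search, the sketch below finds the most similar MS-Celeb-1M identity for one LFW individual and formats the result in the line layout shown above. It assumes that each identity is represented by a single non-zero feature vector; how identities were actually represented is not stated here. The function names, the example IDs, and the four-decimal formatting are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor(query, gallery):
    """Return (identity, similarity) of the gallery entry most similar to `query`.

    `gallery` maps an identity ID to its feature vector.
    """
    best_id = max(gallery, key=lambda k: cosine_similarity(query, gallery[k]))
    return best_id, cosine_similarity(query, gallery[best_id])


def mapping_line(lfw_name, lfw_feature, msceleb_features):
    """Format one line as: ID lfw_celeb_name cosine_similarity."""
    ms_id, similarity = nearest_neighbor(lfw_feature, msceleb_features)
    return f"{ms_id} {lfw_name} {similarity:.4f}"


# Hypothetical toy data: two MS-Celeb-1M identities and one LFW individual.
msceleb_features = {"id_a": [1.0, 0.0], "id_b": [0.0, 1.0]}
print(mapping_line("Some_Name", [0.0, 2.0], msceleb_features))
```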
In our tests, a pair with a cosine similarity above 0.5 can be considered the same person with high probability. We therefore list the 1266 pairs whose similarity is more than 0.5 in msceleb1m_lfw_overlaplist.txt as the overlap with LFW. Of course, you can choose your own threshold.
We trained the same model on each of the 3 datasets. The LFW accuracies are listed below:
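If you want a different threshold, a filter along the following lines could rebuild an overlap list from the mapping probability file. This is a sketch under assumptions: it presumes the three fields on each line are separated by whitespace and that names contain no spaces, and the function names and sample IDs are hypothetical.

```python
def parse_mapping_line(line):
    """Split one 'ID lfw_celeb_name cosine_similarity' line into typed fields."""
    ms_id, lfw_name, similarity = line.split()
    return ms_id, lfw_name, float(similarity)


def overlap_pairs(lines, threshold=0.5):
    """Return (ID, lfw_celeb_name) pairs whose similarity is above `threshold`."""
    pairs = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ms_id, lfw_name, similarity = parse_mapping_line(line)
        if similarity > threshold:
            pairs.append((ms_id, lfw_name))
    return pairs


# Hypothetical sample lines in the assumed format.
sample = [
    "id_001 Person_One 0.8123",
    "id_002 Person_Two 0.4310",
    "id_003 Person_Three 0.5000",
]
print(overlap_pairs(sample))        # default threshold of 0.5
print(overlap_pairs(sample, 0.4))   # a looser, user-chosen threshold
```

To run it on the released file, pass an open file handle, for example `overlap_pairs(open("msceleb1m_lfw_mapping_probability.txt"))`.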
Dataset | Accuracy |
---|---|
Original Dataset | 98.21% |
XiangWu's Cleaned Dataset | 99.42% |
Our Cleaned Dataset | 99.55% |
Because our work is limited, this result may not be conclusive.
- Because our CNN model is not good enough, this clean list certainly still contains some noise, and some images that are not noise were deleted. We will update the list if we get a better CNN model;
- We will try to find the more than 1000 identities that overlap between MS-Celeb-1M and LFW;
- Unfortunately, the mapping probability list was also generated by our CNN model. We have to admit that it must contain some false negatives and false positives. We will manually check suspicious pairs;
The released list is only allowed for non-commercial use.