Bring Your Own Data - collecting a dataset with 5k images of clothes

BYOD
Bring Your Own Data
Or how I collected 5,000 images of clothes
Alexey Grigorev
22.10.2020

Plan
● Why collect a dataset?
● Collecting a dataset
○ Amazon MTurk
○ Yandex.Toloka
○ Tagias
● Annotating it
● Accessing and using it

Which dataset to use for the deep learning chapter?
🤔

What if I collected a dataset myself?
🤔

Crowdsourcing platforms
● Amazon MTurk
● Yandex.Toloka

https://blog.mturk.com/tutorial-how-to-create-hits-that-ask-workers-to-upload-files-using-amazon-cognito-and-amazon-s3-38acb1108633
I gave up somewhere here

Uploading images of clothes
Take pictures of clothes and upload them
Take as many pictures as possible of the following items:
● T-shirts
● Sweaters
● Shirts
● Jeans and pants
● Dresses and skirts
● ...

Other options?
Let’s ask the network!

https://medium.com/data-science-insider/clothing-dataset-call-for-action-3cad023246c1

Quality of labels
● Train a simple neural network on all data
● High learning rate, 1-2 epochs
● Apply the model to the dataset and look at the error
● Correct mistakes in labels manually
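
The "apply the model and look at the error" step can be sketched roughly as follows. This is a minimal illustration, not the code used for the talk: it assumes the quickly trained network has already produced per-class probabilities for every image, and the function name `rank_suspicious_labels` and the toy numbers are hypothetical.

```python
import math


def rank_suspicious_labels(probs, labels, top_k=5):
    """Return indices of the items whose assigned label the model finds least likely.

    probs  -- list of per-class probability lists, one per image
    labels -- list of integer class indices assigned by the annotators
    """
    losses = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Cross-entropy of the assigned label; clip to avoid log(0).
        loss = -math.log(max(p[y], 1e-12))
        losses.append((loss, i))
    # Highest loss first: these are the labels worth a manual look.
    losses.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in losses[:top_k]]


# Toy example with 3 classes (hypothetical numbers, not from the real dataset).
probs = [
    [0.90, 0.05, 0.05],  # labelled 0, model agrees
    [0.10, 0.80, 0.10],  # labelled 0, model strongly prefers class 1
    [0.30, 0.30, 0.40],  # labelled 2, model is unsure
]
labels = [0, 0, 2]
suspicious = rank_suspicious_labels(probs, labels, top_k=2)
print(suspicious)
```

The items at the top of this ranking are only candidates for review: a high loss can mean a wrong label, but it can also mean a hard image, so the final correction stays manual.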

https://www.linkedin.com/posts/djemeljanovs_datasciencedj-100daysofdatascience-rigadsclub-activity-6704485793743847425-NsLr/

Uploading to Kaggle
● Run kaggle datasets init
● Update the generated dataset-metadata.json file
● Run kaggle datasets create --dir-mode=tar
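
The "update the generated dataset-metadata.json file" step could be scripted along these lines. This is a sketch under assumptions: it relies on the `title`, `id`, and `licenses` fields of the Kaggle dataset metadata file, the helper name `fill_metadata` is hypothetical, and the title and license values below are placeholders rather than the ones actually used. The demo writes to a temporary folder so it runs without the Kaggle CLI.

```python
import json
import tempfile
from pathlib import Path


def fill_metadata(folder, title, username, slug, license_name="CC0-1.0"):
    """Fill in dataset-metadata.json inside `folder` and return the new content."""
    path = Path(folder) / "dataset-metadata.json"
    # Start from the generated file if it exists, otherwise from scratch.
    meta = json.loads(path.read_text()) if path.exists() else {}
    meta["title"] = title
    meta["id"] = f"{username}/{slug}"            # becomes kaggle.com/<username>/<slug>
    meta["licenses"] = [{"name": license_name}]  # placeholder license, adjust as needed
    path.write_text(json.dumps(meta, indent=2))
    return meta


# Demo in a temporary folder standing in for the dataset directory.
with tempfile.TemporaryDirectory() as folder:
    meta = fill_metadata(
        folder,
        title="Clothing dataset (full)",  # hypothetical title
        username="agrigorev",
        slug="clothing-dataset-full",
    )
    saved = json.loads((Path(folder) / "dataset-metadata.json").read_text())
    print(saved["id"])
```

After the file is filled in, the upload itself is still done with the `kaggle datasets create` command from the list above.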

https://www.kaggle.com/agrigorev/clothing-dataset-full

Subset: top-10 classes
● T-shirt (928 items)
● Long Sleeve (576 items)
● Pants (559 items)
● Shirt (345 items)
● Shoes (297 items)
● Dress (288 items)
● Shorts (257 items)
● Outwear (246 items)
● Hat (149 items)
● Skirt (136 items)
https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/chapter-07-neural-nets/07-neural-nets-train.ipynb
https://github.com/alexeygrigorev/clothing-dataset-small
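
Selecting the most frequent classes, as in the top-10 subset above, can be sketched as follows. This is an illustration only and not the script used to build clothing-dataset-small: the `(image, label)` row layout, the file names, and the helper names are assumptions. The class counts are copied from the list above.

```python
from collections import Counter


def top_classes(labels, k=10):
    """Return the k most frequent labels with their counts, most frequent first."""
    return Counter(labels).most_common(k)


def filter_to_classes(rows, keep):
    """Keep only the (image, label) rows whose label is in `keep`."""
    keep = set(keep)
    return [(image, label) for image, label in rows if label in keep]


# Toy rows (hypothetical file names), standing in for the full label file.
rows = [
    ("a.jpg", "T-shirt"), ("b.jpg", "T-shirt"), ("c.jpg", "T-shirt"),
    ("d.jpg", "Pants"), ("e.jpg", "Pants"),
    ("f.jpg", "Hat"),
]
top = top_classes([label for _, label in rows], k=2)
subset = filter_to_classes(rows, [label for label, _ in top])
print(top, len(subset))

# Class counts of the top-10 subset, as listed above.
slide_counts = {
    "T-shirt": 928, "Long Sleeve": 576, "Pants": 559, "Shirt": 345,
    "Shoes": 297, "Dress": 288, "Shorts": 257, "Outwear": 246,
    "Hat": 149, "Skirt": 136,
}
total = sum(slide_counts.values())
print(total)
```

With the counts listed above, the subset contains 3,781 images in total, and the largest class (T-shirt) is roughly seven times the size of the smallest (Skirt), so the classes are noticeably imbalanced.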

https://medium.com/data-science-insider/clothing-dataset-5b72cd7c3f1f

Download
● https://www.kaggle.com/agrigorev/clothing-dataset-full
● https://github.com/alexeygrigorev/clothing-dataset
● https://github.com/alexeygrigorev/clothing-dataset-small

Summary
● Amazon MTurk: takes too much time to set up
● Yandex.Toloka: easy to set up, but the data is not easy to validate
● Networking: great, but difficult and needs a good incentive
● Tagias: good quality, with 100% certainty that the data is not copied

mlbookcamp.com
● Learn Machine Learning by doing projects
● http://bit.ly/mlbookcamp
● Get 40% off with code “grigorevpc”
Machine Learning Bookcamp

@Al_Grigor
agrigorev
https://airtable.com/shrrHhErcwaqH59TY
Get the slides and win a free copy of ML Bookcamp!
