Practical tips for handling noisy data and annotation
1. KaggleDays Tokyo Workshop
Practical tips for handling
noisy data and annotation
Ryuichi Kanoh (RK)
December 11, 2019 https://www.kaggle.com/ryuichi0704
2. Overview
- This is a KaggleDays workshop on noise handling sponsored by DeNA.
- In addition to explaining the techniques, I will touch on:
- Experimental results
- Implementations (https://github.com/ryuichi0704/workshop_noise_handling)
- Interactive communication is welcome.
- Both in English and Japanese.
5. Big data and machine learning
- A large, high-quality dataset drives the success of ML.
- However, it is very hard to prepare such a dataset.
- So you will probably want to use crowdsourcing, crawling, and so on.
https://medium.com/syncedreview/sensetime-trains-imagenet-alexnet-in-record-1-5-minutes-e944ab049b2c
6. Possible noise from web crawling
[Figure: Google image search results]
- The keywords may not be relevant to the image content.
7. Possible noise from crowdsourcing
Annotation errors may occur due to:
- limited communication between seekers and solvers.
- limited working time.
https://www.flickr.com/photos/kjempekjekt/3470254809
8. Noise examples in Kaggle competitions
- Task
- 340-class classification of timestamped vectors
- Difficulty
- The dataset is collected from a browser game.
- Quality differs depending on the drawer.
https://quickdraw.withgoogle.com/
10. There are many noisy datasets in Kaggle
- Classes are fine-grained; it is difficult even for a human to annotate consistently.
- 75% of the dataset is annotated from metadata (not by a human).
- Annotation granularity is not stable (e.g. [face] vs [face, nose, eye, mouth, ...]).
11. There are many noisy datasets in Kaggle
- Labels vary depending on the annotator. Annotation was crowd-sourced.
- There are external datasets with noisy annotations.
- Each video was automatically annotated by the YouTube annotation system.
13. Setup
- Use the QuickDraw (Link) dataset.
- 340-class image classification.
- The evaluation metric is top-1 accuracy.
- Timestamped vectors are converted to 1-channel images with 32x32 resolution.
- The dataset is randomly subsampled from the original dataset.
- Train: 81600 samples, Test: 20400 samples (random split)
- 300 images per class in total.
- The test accuracy at the point of maximum validation accuracy is reported.
- Base setting:
- model: ResNet18
- base-lr: 0.1
- batch-size: 128
- epoch: 50
- train:valid: 9:1 (random split)
- objective: cross entropy
- optimizer: SGD with Nesterov momentum, weight-decay 1e-4
- scheduler: MultiStepLR (x0.1 at epochs 40 and 45)
*Other details are in the GitHub repository.
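For reference, a minimal PyTorch sketch of this base setting (the momentum value and the 1-channel input adaptation are assumptions; everything else follows the table above):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# ResNet18 for 340-class classification of 1-channel 32x32 images.
model = resnet18(num_classes=340)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # assumed input adaptation

criterion = nn.CrossEntropyLoss()                      # objective: cross entropy
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,                    # base-lr
                            momentum=0.9,              # momentum value is an assumption
                            nesterov=True,
                            weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 45], gamma=0.1)         # x0.1 at epochs 40 and 45
```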
14. Setup
- Experiments were done with AI Platform Training on Google Cloud.
[Diagram: the local machine pushes a container to Container Registry, AI Platform Training pulls it, and results and notifications are written to a Google Sheet.]
15. Base results
- Test accuracy distribution over 50 random seeds.
- The average test accuracy is 0.563.
- The random-seed effect is around 0.004 (maximum: ~0.010).
16. Model output analysis
- Check hard samples and easy samples.
[Figure: for each validation sample, an error is computed from the model prediction (e.g. (0.01, 0.90, 0.03, ...)) and the label.]
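A minimal sketch of how such per-sample errors can be computed (function and variable names are illustrative, not taken from the repository):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_sample_cross_entropy(model, valid_loader, device="cpu"):
    """Return one cross-entropy value per validation sample for easy/hard ranking."""
    model.eval()
    errors = []
    for x, y in valid_loader:
        logits = model(x.to(device))
        errors.append(F.cross_entropy(logits, y.to(device), reduction="none").cpu())
    return torch.cat(errors)

# errors.argsort() -> easy samples first; errors.argsort(descending=True) -> hard samples first.
```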
17. - Model output analysis
Easy samples (based on cross-entropy)
[Figure: easy samples, shown with label / pred]
18. - Model output analysis
Hard samples (based on cross-entropy)
[Figure: hard samples, shown with label / pred]
19. - Model output analysis
Why difficult?
- There are several reasons why a sample can be difficult for the model:
- The image itself is noisy or wrong (samples 1, 2, 8, 9).
- There are similar, confusing classes (samples 3, 4, 7).
- So, the model is definitely struggling.
- Are there any techniques we can use to improve model training?
[Figure: 3x3 grid of the hard samples, numbered 1-9]
20. Agenda
● Introduction
● Setup of the experiment
● Techniques for learning with a noisy dataset
1. Mixup
2. Large batch size
3. Distillation
21. [1] Mixup
- Construct a virtual training sample with x̃ = λ·x_i + (1 − λ)·x_j and ỹ = λ·y_i + (1 − λ)·y_j.
- λ is randomly sampled from a symmetric beta distribution.
https://arxiv.org/abs/1710.09412
http://moustaphacisse.com/
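A minimal mixup sketch in PyTorch (the interface is an assumption; the repository's own implementation is linked on the Implementations slide below):

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=1.6):
    """Mix a batch with a randomly permuted copy of itself."""
    lam = np.random.beta(alpha, alpha)        # lambda ~ symmetric Beta(alpha, alpha)
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam

# The loss is mixed the same way:
#   output = model(mixed_x)
#   loss = lam * criterion(output, y_a) + (1 - lam) * criterion(output, y_b)
```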
23. - Mixup
Why mixup for a noisy dataset?
- Even though we have a blue (noisy) sample here, its effect is suppressed by the surrounding red samples.
- Effectiveness against label noise is also mentioned in the original paper.
https://www.inference.vc/mixup-data-dependent-data-augmentation/
24. - Mixup
Variants of mixup
- You may ask what happens if we mix intermediate feature vectors instead of the inputs.
- This is called "Manifold mixup". https://arxiv.org/abs/1806.05236
- Feature vectors at a randomly selected layer are mixed using the same mixup procedure.
https://arxiv.org/abs/1512.03385
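A rough sketch of manifold mixup, mixing hidden features at a randomly chosen stage (the way the network is split into `blocks` and `head` is an assumption for illustration):

```python
import numpy as np
import torch

def manifold_mixup_forward(blocks, head, x, y, alpha=1.6):
    """blocks: ordered list of nn.Module stages; head: the final classifier."""
    k = np.random.randint(len(blocks) + 1)     # which stage output to mix (0 = input mixup)
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0))
    h = x
    for i, block in enumerate(blocks):
        if i == k:
            h = lam * h + (1 - lam) * h[index]   # mix features before this block
        h = block(h)
    if k == len(blocks):
        h = lam * h + (1 - lam) * h[index]       # mix the final features
    return head(h), y, y[index], lam
```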
25. - Mixup
Experimental results
- Mixup performance is better than the base performance (0.563).
- Important aspects:
- Performance changes drastically with alpha (the beta-distribution parameter).
- Manifold mixup is also a viable alternative.
- It can be used not only for images, but also for categorical tabular data and so on.
26. - Mixup
Implementations
- Data mixing based on the beta distribution
- https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/model.py#L39-L51
- Selecting the mixing layer (for manifold mixup)
- https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/model.py#L28-L37
- Loss calculation
- https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/mixup_runner.py#L17-L20
27. - Mixup
Tips on training with mixup
- Stopping strong augmentation (like mixup or auto-augment) in the final training phase is helpful (https://arxiv.org/abs/1909.09148); a sketch follows below.
- A performance improvement is observed with our QuickDraw dataset:
- QuickDraw dataset, with mixup (alpha=1.6): 0.6074 → 0.6165
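One possible way to implement this, reusing the `mixup_batch` helper sketched earlier and the training objects from the base setup (the 5-epoch cutoff is an illustrative assumption; the slide only reports the resulting accuracy):

```python
for epoch in range(num_epochs):
    use_mixup = epoch < num_epochs - 5        # train the last 5 epochs without mixup (assumed cutoff)
    for x, y in train_loader:
        if use_mixup:
            x, y_a, y_b, lam = mixup_batch(x, y, alpha=1.6)
        output = model(x)
        if use_mixup:
            loss = lam * criterion(output, y_a) + (1 - lam) * criterion(output, y_b)
        else:
            loss = criterion(output, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```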
28. - Mixup
Examples in competitions
- iMet Collection 2019 - FGVC6, 6th place
- For image data
- Freesound Audio Tagging 2019, 1st place
- For audio data
- The 2nd YouTube-8M Video Understanding Challenge, 2nd place
- For video feature vectors
29. [2] Large batch size
- With severe label noise, a large batch size is helpful for training.
- Within a large batch, the gradients from random noisy labels cancel out, so a larger batch size is effective.
https://arxiv.org/abs/1705.10694
30. - Large batch size
Other aspect: sharp and flat minima
- If the noise in the gradient is too small, the model is likely to converge to a sharp minimum.
- At a sharp minimum, the model is not likely to generalize well.
- For balance, it is said that the learning rate should be tuned together with the batch size (a sketch of a common scaling heuristic follows below).
https://arxiv.org/abs/1609.04836
- Practically, considering only the batch size is not enough.
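A common heuristic for this coupling is the linear scaling rule, sketched below (the rule itself is a general heuristic, not a recommendation from these slides):

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=128):
    """Linear scaling rule: keep lr / batch_size roughly constant."""
    return base_lr * batch_size / base_batch_size

for bs in [128, 256, 512, 1024]:
    print(bs, scaled_lr(bs))   # 0.1, 0.2, 0.4, 0.8
```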
31. - Large batch size
Experimental results
*Trained with the same number of iterations (epoch = batch_size).
- A clear proportional relationship is observed.
- A not-so-large batch size looks optimal for this dataset.
32. - Large batch size
Experimental results
- Note that there are other such relationships, e.g. learning rate vs. weight decay.
- Although the purposes of the algorithms are different, the parameters have a strong relationship.
- It is important to tune parameters while considering their interactions.
33. - Large batch size
Tips for setting a large batch size
- Usually, it is hard to use a large batch size because of GPU memory.
- Approach 1:
- Gradient accumulation (a sketch follows below).
- Approach 2:
- Mixed precision training
- https://github.com/NVIDIA/apex
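A minimal sketch of gradient accumulation (Approach 1); `accumulation_steps = 8` is illustrative, and `model`, `criterion`, `optimizer`, `train_loader` are assumed from the base setup:

```python
accumulation_steps = 8           # effective batch size = loader batch size * 8
optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = criterion(model(x), y)
    (loss / accumulation_steps).backward()   # scale so accumulated gradients are averaged
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```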
34. - Large batch size
Implementations
- Gradient accumulation
- https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/base_runner.py#L102-L104
- Hyper-parameters are set via arguments
- https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/common.py#L15-L34
35. - Large batch size
Examples in competitions
- Quick, Draw! Doodle Recognition Challenge, 5th place
- Batch size up to 10K.
- iMet Collection 2019 - FGVC6, 1st place
- Batch size 1000~1500 (accumulation 10~20 times)
36. [3] Distillation
- Train a student network with the predictions of a pre-trained teacher.
- It eases the student model's training (the student can see which samples are difficult).
https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
37. - Distillation
Procedure [1/3]: Train the teacher
- The teacher model is trained on the train data (data + label) and produces teacher predictions for the train data and, later, for the test data.
- For making teacher predictions, the model is often trained with cross-validation: the train data is split into folds (Fold1-Fold5), each fold serves once as validation, and the out-of-fold (OOF) predictions are collected (a sketch follows below).
[Diagram: operation/data flow for training the teacher, with a 5-fold train/valid split table producing the OOF prediction.]
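A sketch of generating out-of-fold teacher predictions with 5-fold cross-validation (`train_one_model` and `predict_proba` are placeholders for your own training and inference code):

```python
import numpy as np
from sklearn.model_selection import KFold

def make_oof_predictions(X, y, n_classes, train_one_model, predict_proba):
    oof = np.zeros((len(X), n_classes))
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(X):
        teacher = train_one_model(X[train_idx], y[train_idx])
        # each sample is predicted by a teacher that never saw it during training
        oof[valid_idx] = predict_proba(teacher, X[valid_idx])
    return oof   # used later as soft labels for the student
```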
38. - Distillation
Procedure [2/3]: Train the student
- The student model is trained on the train data, using the teacher's OOF predictions for the train data as soft targets.
[Diagram: same flow as the previous slide, with the student-training step added.]
39. - Distillation
Procedure [3/3]: Predict
- The trained student model predicts on the test data, producing the final prediction result.
[Diagram: same flow as the previous slides, with the test-prediction step added.]
40. - Distillation
How to use teacher prediction
There are several strategies for training the student:
- Use (a * soft-label loss) + (b * hard-label loss) as the student's loss function (a sketch follows below).
- Use max(0.7 * soft label, hard label) as the new label (e.g. for an F2 metric).
- Softmax with temperature is sometimes used for the teacher prediction.
- In the original paper, the KL divergence between the student and the teacher was also used.
- https://arxiv.org/abs/1503.02531
Example of max(0.7 * soft label, hard label):
- teacher prediction (= soft label): 0.20 0.70 0.10 0.70
- hard label: 0.00 1.00 0.00 0.00
- new target: 0.14 1.00 0.07 0.49
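A sketch of the first strategy, a weighted (soft-label loss) + (hard-label loss) with temperature and a KL-divergence soft term (the weights here are illustrative; the experiments slide reports a soft-loss weight of 2 as the best setting):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_targets,
                      soft_weight=2.0, hard_weight=1.0, temperature=1.0):
    T = temperature
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         F.softmax(teacher_logits / T, dim=1),
                         reduction="batchmean") * (T * T)   # rescale gradient magnitude
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    return soft_weight * soft_loss + hard_weight * hard_loss
```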
41. - Distillation
Why distillation for a noisy dataset?
- Distillation can smooth out the extremity of noisy hard labels.
- If the data is complex and hard to annotate, the teacher prediction for that data may not have high confidence.
- When a dog is annotated as a cat by mistake, the teacher prediction for that sample may still be close to "dog" if the rest of the dataset is reliable:
  (class): dog / cat
  label (noisy): 0 / 1
  teacher prediction: 0.9 / 0.1
42. - Distillation
Experimental results
- Distillation performance is better than the base performance (0.563).
- Performance improves even with the same model architecture for teacher and student.
- The weight of the soft loss affects performance.
- A weight of 2 (the soft-loss effect is double the hard-loss) is the best.
- Is this because of the noisy dataset?
43. - Distillation
Implementations
- Calculating the hard and soft losses
- https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/distillation_runner.py#L64-L77
44. - Distillation
Examples in competitions
- iMet Collection 2019 - FGVC6, 9th place
- max(0.7 * soft label, hard label) as the new targets
- The property of the competition metric (F2) is considered.
- The 2nd YouTube-8M Video Understanding Challenge, 2nd place
- Multi-stage distillation