Practical tips for handling noisy data and annotation

KaggleDays Tokyo Workshop
Practical tips for handling
noisy data and annotation
Ryuichi Kanoh (RK)
December 11, 2019 https://www.kaggle.com/ryuichi0704
Overview
- This is a KaggleDays workshop on noise handling, sponsored by DeNA.
- In addition to explaining the techniques, I will touch on:
  - Experimental results
  - Implementations (https://github.com/ryuichi0704/workshop_noise_handling)
- Interactive communication is welcome.
  - Both in English and Japanese.
Agenda
● Introduction
● Setup of the experiment
● Techniques for learning with a noisy dataset
Big data and machine learning
- Large, high-quality datasets drive the success of ML.
- However, it is very hard to prepare such a dataset.
- So, you will probably want to use crowdsourcing, web crawling, and so on.
https://medium.com/syncedreview/sensetime-trains-imagenet-alexnet-in-record-1-5-minutes-e944ab049b2c
Possible noise from web crawling
Google Search
- The keywords may not be relevant to the image content.
Possible noise from crowdsourcing
Annotation errors may occur with
- limited communication between seekers and solvers.
- limited working time.
https://www.flickr.com/photos/kjempekjekt/3470254809
Noise example in Kaggle competitions
- Task
  - 340-class classification of timestamped vectors
- Difficulty
  - The dataset is collected from a browser game.
  - Quality differs depending on the drawer.
https://quickdraw.withgoogle.com/
Dataset example (class: monkey)
There are many noisy datasets in Kaggle
- Classes are fine-grained. It is difficult even for a human to annotate consistently.
- 75% of the dataset is annotated by metadata (not by a human).
- Annotation granularity is not stable (e.g. [face] vs [face, nose, eye, mouth...]).
- Labels vary depending on the annotator.
- Annotation was crowd-sourced.
- There are external datasets with noisy annotations.
- Each video was automatically annotated by the YouTube annotation system.
Setup
- Use the QuickDraw (Link) dataset.
- 340-class image classification.
- The evaluation metric is top-1 accuracy.
- Timestamped vectors are converted to 1-channel images with 32x32 resolution.
- The dataset is randomly subsampled from the original dataset.
- Train: 81600 samples, Test: 20400 samples (random split)
- 300 images per class in total.
- The test accuracy at the point of maximum validation accuracy is reported.
- Base setting:
  - model: ResNet18
  - base-lr: 0.1
  - batch-size: 128
  - epoch: 50
  - train:valid: 9:1 (random split)
  - objective: Cross entropy
  - optimizer: SGD with Nesterov momentum, w-decay 1e-4
  - scheduler: MultistepLR (x 0.1 at 40, 45 epoch)
*Other details are in the GitHub repository.
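The conversion from stroke vectors to a 32x32 1-channel image can be sketched as below. This is only an illustration, not the repository's code: `strokes_to_image` is a hypothetical helper, the input is assumed to be a list of strokes given as `[xs, ys]` with coordinates in 0 to 255, and the timestamps are ignored.

```python
import numpy as np

def strokes_to_image(strokes, size=32, src_size=256):
    """Rasterize strokes ([xs, ys] per stroke, coords in 0..src_size-1) to a size x size image."""
    img = np.zeros((size, size), dtype=np.uint8)
    for xs, ys in strokes:
        # Rescale source coordinates to the target grid.
        xs = np.asarray(xs, dtype=float) * (size - 1) / (src_size - 1)
        ys = np.asarray(ys, dtype=float) * (size - 1) / (src_size - 1)
        if len(xs) == 1:
            # A stroke with a single point becomes a single pixel.
            col = int(np.clip(np.rint(xs[0]), 0, size - 1))
            row = int(np.clip(np.rint(ys[0]), 0, size - 1))
            img[row, col] = 255
        for x0, y0, x1, y1 in zip(xs[:-1], ys[:-1], xs[1:], ys[1:]):
            # Sample each segment densely enough that no pixel is skipped.
            n = int(max(abs(x1 - x0), abs(y1 - y0))) + 2
            px = np.clip(np.rint(np.linspace(x0, x1, n)), 0, size - 1).astype(int)
            py = np.clip(np.rint(np.linspace(y0, y1, n)), 0, size - 1).astype(int)
            img[py, px] = 255
    return img

# One diagonal stroke from the top-left to the bottom-right corner.
img = strokes_to_image([[[0, 255], [0, 255]]])
```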
Setup
- Experiments were done with AI platform training.
(Diagram labels: Local machine, Container registry, AI platform training (Google Cloud), Google sheet; arrows: push, pull, results, notification.)
Base results
- Test accuracy distribution with 50 seeds.
- The average accuracy is 0.563.
- The random seed effect is around 0.004 (maximum: ~0.010).
Model output analysis
- Check hard samples and easy samples.
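(Figure: prediction (for validation dataset) compared with the label, with an error value per sample, e.g. Error = 0.01, 0.90, 0.03)
Model output analysis: Easy samples (based on cross-entropy)
(Figure: sample images annotated with label / pred)
Model output analysis: Hard samples (based on cross-entropy)
(Figure: sample images annotated with label / pred)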
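Ranking validation samples by their cross-entropy error can be sketched as follows. The function names are hypothetical, and `probs` is assumed to hold the model's softmax outputs for the validation set.

```python
import numpy as np

def per_sample_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy of each sample: -log of the probability given to its label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(picked, eps, 1.0))

def easiest_and_hardest(probs, labels, k=3):
    """Return indices of the k lowest-error and the k highest-error samples."""
    errors = per_sample_cross_entropy(probs, labels)
    order = np.argsort(errors)          # ascending error
    return order[:k], order[::-1][:k]

probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 0, 0]
easy, hard = easiest_and_hardest(probs, labels, k=1)
```

Here sample 0 is the easiest (the label gets probability 0.9) and sample 1 is the hardest (the label gets probability 0.2).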
Model output analysis: Why difficult?
- There are a number of reasons why a sample can be difficult for the model.
  - The image itself is noisy or wrong
    - 1, 2, 8, 9
  - There are similar, confusing classes
    - 3, 4, 7
- So, the model is definitely struggling.
- Are there any techniques we can use to improve model training?
(Figure: 3x3 grid of hard samples, numbered 1 to 9)
Agenda
● Introduction
● Setup of the experiment
● Techniques for learning with a noisy dataset
  1. Mixup
  2. Large batch size
  3. Distillation
[1] Mixup
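- Construct a virtual training sample with
  x' = λ x_i + (1 - λ) x_j,  y' = λ y_i + (1 - λ) y_j
  where (x_i, y_i) and (x_j, y_j) are two training samples.
- λ is randomly sampled from a symmetric beta distribution, Beta(α, α).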
https://arxiv.org/abs/1710.09412
http://moustaphacisse.com/
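A minimal NumPy sketch of input mixup is shown below. It assumes one λ per batch and pairs each sample with a randomly permuted partner, which is a common implementation choice and not necessarily the one in the workshop repository. `mixup_batch` is a hypothetical name.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix each sample with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)             # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))           # partner index for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 1, 32, 32))                    # 8 one-channel 32x32 images
y = np.eye(340)[rng.integers(0, 340, size=8)]          # one-hot labels, 340 classes
x_mix, y_mix, lam = mixup_batch(x, y, alpha=1.6, rng=rng)
```

The mixed labels are soft: each row of `y_mix` still sums to 1 and has at most two non-zero entries.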
Mixup: Beta distribution
http://wazalabo.com/mixup_1.html
- Large alpha: strong smoothing (the original paper uses 0.2 for ImageNet).
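The effect of alpha can be checked empirically. The sketch below samples λ from Beta(α, α) and measures how often λ falls near 0.5, where the two samples are mixed most strongly. The helper name and the 0.25 window are arbitrary choices for illustration.

```python
import numpy as np

def beta_lambda_stats(alpha, n=100_000, seed=0):
    """Sample lambda ~ Beta(alpha, alpha); return its mean and the share near 0.5."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=n)
    near_half = np.mean(np.abs(lam - 0.5) < 0.25)   # fraction in (0.25, 0.75)
    return lam.mean(), near_half

for alpha in (0.2, 1.0, 2.0):
    mean, near_half = beta_lambda_stats(alpha)
    print(f"alpha={alpha}: mean={mean:.3f}, share near 0.5={near_half:.3f}")
```

With a small alpha most λ values sit close to 0 or 1, so a sample is barely mixed. With a large alpha they concentrate around 0.5.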
Mixup: Why mixup for noisy dataset?
- Even though there is a blue (noisy) sample here, its effect is suppressed by the surrounding red samples.
- Effectiveness against label noise is also mentioned in the original paper.
https://www.inference.vc/mixup-data-dependent-data-augmentation/
Mixup: Derivatives of mixup
- You may ask what happens if we mix intermediate feature vectors instead of the inputs.
  - This is called “Manifold mixup”. https://arxiv.org/abs/1806.05236
  - Feature vectors at a randomly selected layer are mixed using the same mixup procedure.
https://arxiv.org/abs/1512.03385
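Manifold mixup can be sketched on a toy fully connected network. Everything below is an illustrative assumption: the plain ReLU MLP stands in for ResNet18, and `manifold_mixup_forward` is a hypothetical name. Mixing at layer index 0 is ordinary input mixup.

```python
import numpy as np

def manifold_mixup_forward(x, y_onehot, weights, alpha, rng):
    """Forward pass of a toy MLP that mixes activations at one random layer."""
    k = int(rng.integers(0, len(weights)))   # layer whose input is mixed
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    h = x
    for i, w in enumerate(weights):
        if i == k:
            h = lam * h + (1.0 - lam) * h[perm]     # same rule as input mixup
        h = h @ w
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)                  # ReLU on hidden layers only
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return h, y_mix, k, lam

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 6)), rng.normal(size=(6, 3))]
x = rng.normal(size=(5, 4))
y = np.eye(3)[[0, 1, 2, 0, 1]]
logits, y_mix, k, lam = manifold_mixup_forward(x, y, weights, alpha=1.0, rng=rng)
```

The loss is then computed between `logits` and the mixed labels `y_mix`, exactly as with input mixup.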
Mixup: Experimental results
- Mixup performance is better than the base performance (0.563).
- Important aspects:
  - Performance changes drastically with alpha (the beta-distribution parameter).
  - Manifold mixup is also a viable alternative.
- It can be used not only for images, but also for categorical tabular data and so on.
Mixup: Implementations
- Data mixing based on beta distribution
  - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/model.py#L39-L51
- Select mixing layer (for manifold mixup)
  - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/model.py#L28-L37
- Loss calculation
  - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/mixup_runner.py#L17-L20
Mixup: Tips on training with mixup
- Stopping strong augmentation (like mixup or auto-augment) in the final phase of training is helpful.
  (https://arxiv.org/abs/1909.09148)
- A performance improvement is observed with our QuickDraw dataset.
  - QuickDraw dataset, with mixup (alpha=1.6): 0.6074 → 0.6165
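Turning mixup off for the last epochs only needs a small switch in the training loop. The sketch below is an assumption about how this could be wired up. The slides do not say how many final epochs were trained without mixup, so `plain_epochs=5` is a placeholder value.

```python
def use_mixup(epoch, total_epochs, plain_epochs):
    """Return True while mixup should still be applied (epoch is 0-indexed)."""
    return epoch < total_epochs - plain_epochs

total_epochs = 50      # matches the base setting
plain_epochs = 5       # hypothetical: last 5 epochs without mixup
schedule = [use_mixup(e, total_epochs, plain_epochs) for e in range(total_epochs)]
```

In a training loop, the mixed batch and soft labels are used when the function returns True, and the original batch and labels otherwise.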
Mixup: Examples in competitions
- iMet Collection 2019 - FGVC6, 6th place
  - For image data
- Freesound Audio Tagging 2019, 1st place
  - For audio data
- The 2nd YouTube-8M Video Understanding Challenge, 2nd place
  - For video feature vectors
[2] Large batch size
- With severe label noise, a large batch size is helpful for training.
- Within a large batch, the gradients from random noisy labels tend to cancel out.
- A larger batch size is effective.
https://arxiv.org/abs/1705.10694
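The cancellation argument can be illustrated with a toy simulation. It assumes each per-sample gradient is a true gradient plus independent zero-mean noise, which is a strong simplification of real label noise. The function name and all values are hypothetical.

```python
import numpy as np

def batch_gradient_error(batch_size, n_trials=2000, noise_std=1.0, seed=0):
    """RMS error of the batch-mean gradient when per-sample gradients are noisy."""
    rng = np.random.default_rng(seed)
    true_grad = 1.0
    # Toy model: per-sample gradient = true gradient + independent zero-mean noise.
    per_sample = true_grad + rng.normal(0.0, noise_std, size=(n_trials, batch_size))
    batch_grad = per_sample.mean(axis=1)
    return np.sqrt(np.mean((batch_grad - true_grad) ** 2))

for batch_size in (16, 128, 1024):
    print(batch_size, batch_gradient_error(batch_size))
```

Under this assumption the error of the averaged gradient shrinks roughly as 1/sqrt(batch size).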
Large batch size: Other aspect (sharp and flat minimum)
- If the gradient noise is too small, the model is likely to converge to a sharp minimum.
- At a sharp minimum, the model is unlikely to generalize well.
- For balance, it is said that the learning rate should be tuned together with the batch size.
https://arxiv.org/abs/1609.04836
- Practically, considering only the batch size is not enough.
Large batch size: Experimental results
*Trained with the same number of iterations, so the number of epochs scales with the batch size. (epoch=batch_size)
- A clear proportional relationship between the learning rate and the batch size is observed.
- A moderate (not very large) batch size looks optimal for this dataset.
- Note that there are other relationships.
  - Learning rate vs weight decay
  - Although the purposes of the algorithms are different, they have a strong relationship.
- It is important to tune parameters while considering their interactions.
Large batch size: Tips for setting large batch size
- Usually, it is hard to set a large batch size because of GPU memory limits.
- Approach 1:
  - Gradient accumulation
- Approach 2:
  - Mixed precision training
  - https://github.com/NVIDIA/apex
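Gradient accumulation can be checked on a toy problem. The sketch below uses a linear model with a mean-squared-error loss (an assumption for illustration, not the workshop model) and shows that averaging the gradients of equal-size micro-batches reproduces the full-batch gradient.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean((x @ w - y)**2) with respect to w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(y)

def accumulated_grad(w, x, y, micro_batch_size):
    """Accumulate micro-batch gradients; assumes len(y) is divisible by the micro-batch size."""
    n_steps = len(y) // micro_batch_size
    grad = np.zeros_like(w)
    for i in range(n_steps):
        sl = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
        # Divide by n_steps so the sum equals the mean over the large batch.
        grad += mse_grad(w, x[sl], y[sl]) / n_steps
    return grad

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(128, 5)), rng.normal(size=128), rng.normal(size=5)
full = mse_grad(w, x, y)
accumulated = accumulated_grad(w, x, y, micro_batch_size=32)   # 4 steps of 32
```

The optimizer step is taken only after all micro-batches are processed. Note that batch-dependent layers such as batch normalization still see only one micro-batch at a time, so the equivalence is exact only for the gradient averaging itself.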
Large batch size: Implementations
- Gradient accumulation
  - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/base_runner.py#L102-L104
- Hyper-parameters are set by arguments
  - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/common.py#L15-L34
Large batch size: Examples in competitions
- Quick, Draw! Doodle Recognition Challenge, 5th place
  - Batch size up to 10K.
- iMet Collection 2019 - FGVC6, 1st place
  - Batch size 1000~1500 (Accumulation 10~20 times)
[3] Distillation
- Train a student network using the predictions of a pre-trained teacher.
- This eases the student model’s training (the student can understand which samples are difficult).
https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
Distillation: Procedure [1/3]
- Train the teacher model on the train data (data and label).
- For making teacher predictions, the model is often trained with cross validation. The out-of-fold (OOF) predictions serve as the teacher prediction for the train data.
  Fold1: Train Train Train Train Valid
  Fold2: Train Train Train Valid Train
  Fold3: Train Train Valid Train Train
  Fold4: Train Valid Train Train Train
  Fold5: Valid Train Train Train Train
Distillation: Procedure [2/3]
- Train the student model on the train data (data and label) together with the teacher prediction for the train data.
Distillation: Procedure [3/3]
- Use the student model to predict on the test data (data), which gives the prediction result for the test data.
(Diagram legend: "Operation" boxes are Train teacher, Train student, and Model prediction; "Data" boxes are the datasets and predictions.)
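Out-of-fold teacher predictions can be sketched as below. A nearest-centroid classifier stands in for the teacher network so the example stays small. That choice, the helper names, and the toy data are all assumptions for illustration.

```python
import numpy as np

def fit_centroids(x, y, n_classes):
    """Toy 'teacher': one centroid per class (every class must appear in x)."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the class centroids."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def oof_predictions(x, y, n_classes, n_folds=5, seed=0):
    """Each sample is predicted by the model that did NOT see it during training."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    oof = np.zeros((len(x), n_classes))
    filled = np.zeros(len(x), dtype=int)
    all_idx = np.arange(len(x))
    for valid_idx in folds:
        train_idx = np.setdiff1d(all_idx, valid_idx)
        model = fit_centroids(x[train_idx], y[train_idx], n_classes)
        oof[valid_idx] = predict_proba(model, x[valid_idx])
        filled[valid_idx] += 1
    return oof, filled

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
x = rng.normal(size=(100, 2)) + np.where(y[:, None] == 0, -3.0, 3.0)
oof, filled = oof_predictions(x, y, n_classes=2)
```

`oof` then plays the role of the teacher prediction (soft label) for every training sample.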
Distillation: How to use teacher prediction
There are some strategies for training the student.
- Use (a*soft-label loss) + (b*hard-label loss) as the student’s loss function.
- Use Max(0.7*soft-label, hard-label) as the new label (e.g. for the F2 metric).
  teacher prediction (= soft label) = 0.10 0.70 0.20 0.70
  hard label                        = 0.00 0.00 0.00 1.00
  new target                        = 0.07 0.49 0.14 1.00
- Softmax with temperature is sometimes used for the teacher prediction.
- In the original paper, the KL divergence between the student and the teacher was also used.
  - https://arxiv.org/abs/1503.02531
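The two strategies can be sketched in NumPy as follows. This is not the repository's implementation: the function names are hypothetical, the default weights `a=2.0, b=1.0` are only illustrative, and cross-entropy against the soft label is used as the soft loss.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives softer probabilities."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Mean cross-entropy between target distributions and predictions."""
    return -(target_probs * np.log(np.clip(pred_probs, eps, 1.0))).sum(axis=-1).mean()

def distillation_loss(student_logits, teacher_probs, hard_onehot, a=2.0, b=1.0):
    """a * (soft-label loss) + b * (hard-label loss)."""
    student_probs = softmax(student_logits)
    soft = cross_entropy(teacher_probs, student_probs)
    hard = cross_entropy(hard_onehot, student_probs)
    return a * soft + b * hard

def max_target(soft_label, hard_label, scale=0.7):
    """Element-wise Max(scale * soft-label, hard-label), for multi-label targets."""
    return np.maximum(scale * np.asarray(soft_label), np.asarray(hard_label))

new_target = max_target([0.10, 0.70, 0.20, 0.70], [0.0, 0.0, 0.0, 1.0])
loss = distillation_loss(
    student_logits=np.array([[2.0, 0.5, -1.0]]),
    teacher_probs=np.array([[0.7, 0.2, 0.1]]),
    hard_onehot=np.array([[1.0, 0.0, 0.0]]),
)
```

`new_target` evaluates to 0.07, 0.49, 0.14, 1.00. If teacher logits are available, `softmax(teacher_logits, temperature=T)` gives the softened teacher prediction.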
Distillation: Why distillation for noisy dataset?
- Distillation can smooth out extreme values in noisy hard labels.
  - If the data is complex and hard to annotate, the teacher prediction for that data may not have high confidence.
  - When a dog is annotated as a cat by mistake, the teacher prediction for that data may still be close to dog if the rest of the data is reliable.
                      dog   cat
  label (noisy)       0     1
  teacher prediction  0.9   0.1
Distillation: Experimental results
- Distillation performance is better than the base performance (0.563).
  - Performance improved even with the same model architecture.
- The weight of the soft loss affects performance.
  - A weight of 2 (the soft loss counts twice as much as the hard loss) is the best.
  - Is it because of the noisy dataset?
Distillation: Implementations
- Calculate hard and soft loss
  - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/distillation_runner.py#L64-L77
Distillation: Examples in competitions
- iMet Collection 2019 - FGVC6, 9th place
  - Max(0.7*soft-label, hard-label) as new targets
  - The property of the competition metric (F2) is considered.
- The 2nd YouTube-8M Video Understanding Challenge, 2nd place
  - Multi-stage distillation
Summary
● Introduction
● Setup of the experiment
● Techniques for learning with a noisy dataset
There are many other techniques
- Learning with selected samples
  - Drop large error samples from training
  - Curriculum learning
- Learning with noise transition
  - Forward correction (modify objective)
(Curriculum learning)
(Noise transition matrix)
https://arxiv.org/abs/1808.01097
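Forward correction can be sketched as below, assuming the noise transition matrix is known. The model's clean-class probabilities are pushed through the matrix before the loss is computed against the noisy label. The function name and the symmetric 20% noise matrix are hypothetical, and this is not code from the workshop repository.

```python
import numpy as np

def forward_corrected_loss(clean_probs, noisy_labels, transition, eps=1e-12):
    """Cross-entropy against noisy labels after applying the noise transition matrix.

    transition[i, j] = P(observed label = j | true label = i)
    """
    noisy_probs = np.asarray(clean_probs) @ np.asarray(transition)
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.log(np.clip(picked, eps, 1.0)).mean()

# Hypothetical symmetric noise: 20% of labels are flipped to the other class.
transition = np.array([[0.8, 0.2],
                       [0.2, 0.8]])
clean_probs = np.array([[0.99, 0.01]])   # model is confident in class 0
noisy_labels = np.array([1])             # but the observed label says class 1
corrected = forward_corrected_loss(clean_probs, noisy_labels, transition)
plain = forward_corrected_loss(clean_probs, noisy_labels, np.eye(2))
```

With the identity matrix this reduces to ordinary cross-entropy. With the noise matrix, a confident prediction that disagrees with a possibly flipped label is penalized less.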

Practical tips for handling noisy data and annotaiton

  • 1. KaggleDays Tokyo Workshop Practical tips for handling noisy data and annotation Ryuichi Kanoh (RK) December 11, 2019 https://www.kaggle.com/ryuichi0704
  • 2. Overview - This is a KaggleDays workshop on noise handling sponsored by DeNA. - In addition to explaining the techniques, I will touch on: - Experimental results - Implementations (https://github.com/ryuichi0704/workshop_noise_handling) - Interactive communication is welcome. - Both in English and Japanese. 2
  • 3. Agenda ● Introduction ● Setup of the experiment ● Techniques for learning with a noisy dataset 3
  • 4. Agenda ● Introduction ● Setup of the experiment ● Techniques for learning with a noisy dataset 4
  • 5. Big data and machine learning - Large and high-quality dataset drives the success of ML. - However, it is very hard to prepare such a dataset. - So, you probably want to use crowdsourcing / crawling and so on. https://medium.com/syncedreview/sensetime-trains-imagenet-alexnet-in-record-1-5-minutes-e944ab049b2c 5
  • 6. Possible noise from web crawling Google Search - The keywords may not be relevant to the image content. 6
  • 7. Possible noise from crowd sourcing Annotation error may occur with - limited communication between seekers and solvers. - limited working time. https://www.flickr.com/photos/kjempekjekt/3470254809 7
  • 8. Noise example in kaggle competitions 8 - Task - 340 class timestamped vector classification - Difficulty - Dataset is collected from browser game. - Quality differences depending on drawer https://quickdraw.withgoogle.com/
  • 10. There are many noisy datasets in kaggle 10 Classes are fine-grained. It is difficult even for a human to annotate consistently. 75% of the dataset is annotated by metadata. (not by a human) Annotation granularity is not stable. (e.g. [face] vs [face, nose, eye, mouth...])
  • 11. There are many noisy datasets in kaggle 11 Labels vary depending on the annotator. Annotation was crowd-sourced. There are external datasets with noisy annotations. Each video was automatically annotated by the YouTube annotation system.
  • 12. Agenda ● Introduction ● Setup of the experiment ● Techniques for learning with a noisy dataset 12
  • 13. Setup - Use QuickDraw (Link) dataset. - 340 class image classification. - Evaluation metric is top-1 accuracy. - Timestamped vectors are converted to 1-channel images with 32x32 resolution. - Dataset is randomly subsampled from original dataset. - Train: 81600 samples, Test: 20400 samples (random split) - 300 images per class in total. - Test accuracy observed with the maximum validation accuracy is reported. - Base setting: 13 model base-lr batch-size epoch train:valid objective optimizer scheduler ResNet18 0.1 128 50 9:1 (random split) Cross entropy SGD with nesterov momentum, w-decay 1e-4 MultistepLR (x 0.1 at 40, 45 epoch) *Other details are in the GitHub repository.
  • 14. Setup 14 Local machine Container registry AI platform training Google Cloud push Google sheet results pull - Experiments were done with AI platform training. notification
  • 15. Base results 15 - Test accuracy distribution with 50 seeds. - 0.563 is an average performance. - random seed effect is around 0.004 (maximum: ~0.010)
  • 16. Model output analysis - Check hard samples and easy samples. 16 prediction (for validation dataset) label Error = 0.01 0.90 0.03
  • 17. Model output analysis: Easy samples (based on cross-entropy)
    - [Figure: sample images shown with label / pred]
  • 18. Model output analysis: Hard samples (based on cross-entropy)
    - [Figure: sample images shown with label / pred]
  • 19. Model output analysis: Why difficult?
    - There are a number of ways in which a sample can be difficult for the model.
      - The image itself is noisy or wrong: 1, 2, 8, 9
      - There are similar, easily confused classes: 3, 4, 7
    - So, the model is definitely struggling.
    - Are there any techniques we can use to improve model training?
    - [Figure: nine hard samples, numbered 1 to 9]
  • 20. Agenda
    ● Introduction
    ● Setup of the experiment
    ● Techniques for learning with a noisy dataset
      1. Mixup
      2. Large batch size
      3. Distillation
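  • 21. [1] Mixup
    - Construct a virtual training sample with $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$.
    - λ is randomly sampled from a symmetric beta distribution, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$.
    - https://arxiv.org/abs/1710.09412
    - http://moustaphacisse.com/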
  • 22. Mixup: Beta distribution
    - Large alpha: strong smoothing (in the original paper, 0.2 for ImageNet)
    - http://wazalabo.com/mixup_1.html
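The construction of a mixed sample can be sketched in a few lines. This is a minimal plain-Python illustration, not the repository's implementation: the function name `mixup_pair`, the list-based inputs, and the toy values are assumptions made for the example. It uses the standard library's `random.betavariate` to draw λ from a symmetric beta distribution.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha, rng=random):
    """Mix two (input, one-hot label) pairs with lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    # The same lambda is applied to the inputs and to the labels.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x_mix, y_mix, lam = mixup_pair(
    [0.0, 1.0, 1.0, 0.0], [1.0, 0.0],   # toy "image" and one-hot label (class 0)
    [1.0, 1.0, 0.0, 0.0], [0.0, 1.0],   # toy "image" and one-hot label (class 1)
    alpha=0.2,
    rng=rng,
)
```

A small alpha concentrates λ near 0 or 1, so most mixed samples stay close to one of the originals; a larger alpha pushes λ toward 0.5 and smooths more strongly.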
  • 23. Mixup: Why mixup for a noisy dataset?
    - Even if there is a blue (noisy) sample here, its effect is suppressed by the surrounding red samples.
    - Effectiveness against label noise is also mentioned in the original paper.
    - https://www.inference.vc/mixup-data-dependent-data-augmentation/
  • 24. Mixup: Derivatives of mixup
    - You may ask what happens if we mix intermediate feature vectors instead of inputs.
    - This is called “Manifold mixup”. https://arxiv.org/abs/1806.05236
    - Feature vectors at a randomly chosen layer are mixed using the same mixup procedure.
    - https://arxiv.org/abs/1512.03385
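The idea of mixing at an intermediate layer can be sketched as follows. This is a toy illustration with list-based "layers", not the linked implementation: `manifold_mixup_forward`, `scale2`, and `relu` are hypothetical names, and in practice the mixing layer would be drawn at random per batch and the labels mixed with the same λ as in ordinary mixup.

```python
def scale2(h):
    """Toy linear layer: doubles every activation."""
    return [2.0 * v for v in h]


def relu(h):
    """Toy non-linearity."""
    return [max(0.0, v) for v in h]


def manifold_mixup_forward(layers, x_a, x_b, lam, mix_layer):
    """Run both inputs through layers[:mix_layer], mix the hidden vectors,
    then finish the forward pass on the mixed vector.
    mix_layer == 0 reduces to ordinary (input-level) mixup."""
    h_a, h_b = x_a, x_b
    for layer in layers[:mix_layer]:
        h_a, h_b = layer(h_a), layer(h_b)
    h = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    for layer in layers[mix_layer:]:
        h = layer(h)
    return h


layers = [scale2, relu, scale2]
x_a, x_b = [1.0, -1.0], [-1.0, 1.0]
out_input_mix = manifold_mixup_forward(layers, x_a, x_b, lam=0.5, mix_layer=0)
out_hidden_mix = manifold_mixup_forward(layers, x_a, x_b, lam=0.5, mix_layer=2)
```

Because of the non-linearity, mixing after the second layer gives a different output from mixing the raw inputs, which is why the choice of layer matters.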
  • 25. Mixup: Experimental results
    - Mixup performs better than the base setting (0.563).
    - Important aspects:
      - Performance changes drastically with alpha (the beta-distribution parameter).
      - Manifold mixup is also a viable alternative.
    - It can be used not only for images, but also for categorical tabular data and so on.
  • 26. Mixup: Implementations
    - Data mixing based on beta distribution
      - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/model.py#L39-L51
    - Select mixing layer (for manifold mixup)
      - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/model.py#L28-L37
    - Loss calculation
      - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/mixup_runner.py#L17-L20
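The linked files are not reproduced here. As a hedged illustration of how a mixup loss is commonly computed, the sketch below weights the cross-entropy against each of the two original labels by λ and (1 − λ), and checks that this equals the cross-entropy against the mixed soft label. All function names and numbers are assumptions for the example.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy against a single integer class label."""
    return -math.log(probs[label])


def mixup_loss(probs, label_a, label_b, lam):
    """Weighted sum of the losses against the two labels that were mixed."""
    return lam * cross_entropy(probs, label_a) + (1.0 - lam) * cross_entropy(probs, label_b)


def soft_cross_entropy(probs, soft_target):
    """Cross-entropy against a probability vector (the mixed label)."""
    return -sum(t * math.log(p) for t, p in zip(soft_target, probs) if t > 0.0)


lam = 0.7
probs = [0.6, 0.3, 0.1]              # model output for one mixed sample (toy values)
loss_two_labels = mixup_loss(probs, label_a=0, label_b=2, lam=lam)
loss_soft_label = soft_cross_entropy(probs, [lam, 0.0, 1.0 - lam])
```

The two forms give the same value because cross-entropy is linear in the target, so either can be used depending on what is more convenient in the training loop.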
  • 27. Mixup: Tips on training with mixup
    - Stopping strong augmentation (such as mixup or AutoAugment) in the final phase of training is helpful. (https://arxiv.org/abs/1909.09148)
    - A performance improvement is observed on our QuickDraw dataset.
      - QuickDraw dataset, with mixup (alpha=1.6): 0.6074 → 0.6165
  • 28. Mixup: Examples in competitions
    - iMet Collection 2019 - FGVC6, 6th place
      - For image data
    - Freesound Audio Tagging 2019, 1st place
      - For audio data
    - The 2nd YouTube-8M Video Understanding Challenge, 2nd place
      - For video feature vectors
  • 29. [2] Large batch size
    - With severe label noise, a large batch size is helpful for training.
    - Within a large batch, the gradients from random noisy labels cancel out.
    - [Figure caption: Larger batch size is effective.] https://arxiv.org/abs/1705.10694
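The cancellation argument can be illustrated with a deliberately simplified simulation. The model below is an assumption made for the example, not the experiment from the cited paper: each clean sample contributes a gradient of +1 along the useful direction, while each randomly mislabeled sample contributes +1 or -1 with equal probability. Averaging over a batch leaves the clean signal intact while the noise contribution shrinks as the batch grows.

```python
import random
import statistics


def batch_gradient(batch_size, noise_rate, rng):
    """Mean per-sample gradient of one batch under the toy model."""
    total = 0.0
    for _ in range(batch_size):
        if rng.random() < noise_rate:
            total += rng.choice((-1.0, 1.0))  # random label: zero-mean contribution
        else:
            total += 1.0                      # clean label: consistent direction
    return total / batch_size


def gradient_spread(batch_size, noise_rate=0.3, n_batches=1000, seed=0):
    """Mean and standard deviation of the batch gradient over many batches."""
    rng = random.Random(seed)
    grads = [batch_gradient(batch_size, noise_rate, rng) for _ in range(n_batches)]
    return statistics.mean(grads), statistics.pstdev(grads)


mean_small, std_small = gradient_spread(batch_size=8)
mean_large, std_large = gradient_spread(batch_size=256)
```

In this toy setting both batch sizes point in the same average direction, but the batch-to-batch spread is several times smaller for the larger batch.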
  • 30. Large batch size: Other aspect: sharp and flat minimum
    - If the gradient noise is too small, the model is likely to converge to a sharp minimum.
    - At a sharp minimum, the model is unlikely to generalize well.
    - To balance this, it is said that the learning rate should be tuned together with the batch size.
    - https://arxiv.org/abs/1609.04836
    - In practice, considering only the batch size is not enough.
  • 31. Large batch size: Experimental results
    - *Trained under the same number of iterations. (epoch=batch_size)
    - A clear proportional relationship is observed.
    - A batch size that is not very large looks optimal for this dataset.
  • 32. Large batch size: Experimental results
    - Note that there are other relationships.
      - Learning rate vs weight decay
    - Although they serve different purposes, they are strongly related.
    - It is important to tune parameters while considering their interactions.
  • 33. Large batch size: Tips for setting a large batch size
    - Usually, it is hard to set a large batch size because of GPU memory limits.
    - Approach 1: Gradient accumulation
    - Approach 2: Mixed precision training
      - https://github.com/NVIDIA/apex
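Gradient accumulation can be sketched without any deep learning framework. The example below is a minimal illustration on a one-parameter least-squares problem, with hypothetical function names and toy data; it assumes equal-sized micro-batches. It shows that averaging the gradients of several small batches reproduces the gradient of one large batch, so the optimizer step can be taken once after all micro-batches have been processed.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients so they match one large-batch gradient."""
    n_steps = len(xs) // micro_batch_size  # assumes equal-sized micro-batches
    grad = 0.0
    for step in range(n_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient by 1 / n_steps before adding it up.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_steps
    return grad  # a single optimizer step would use this accumulated value


xs = [float(i) for i in range(8)]
ys = [3.0 * x + 1.0 for x in xs]
full_batch = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

Only one micro-batch needs to fit in memory at a time, which is what makes the effective batch size independent of GPU memory. Note that layers with batch-level statistics, such as batch normalization, still see only the micro-batch.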
  • 34. Large batch size: Implementations
    - Gradient accumulation
      - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/base_runner.py#L102-L104
    - Hyper-parameters are set by arguments
      - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/common.py#L15-L34
  • 35. Large batch size: Examples in competitions
    - Quick, Draw! Doodle Recognition Challenge, 5th place
      - Batch size up to 10K.
    - iMet Collection 2019 - FGVC6, 1st place
      - Batch size 1000~1500 (Accumulation 10~20 times)
  • 36. [3] Distillation
    - Train the student network with the predictions of a pre-trained teacher.
    - This eases the student model's training (the student can learn which samples are difficult).
    - https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
  • 37. Distillation: Procedure [1/3]
    - Train teacher: the teacher model is trained on the train data (data and label) and produces the teacher prediction for the train data.
    - For making teacher predictions, the model is often trained with cross validation.
    - [Figure: pipeline diagram (legend: Operation / Data) and a 5-fold cross-validation grid (Fold1 to Fold5) in which each fold holds out a different Valid part; the held-out predictions form the OOF prediction]
  • 38. Distillation: Procedure [2/3]
    - Train student: the student model is trained on the train data together with the teacher prediction for the train data.
    - [Figure: same pipeline diagram and cross-validation grid, with the Train student / Student model step added]
  • 39. Distillation: Procedure [3/3]
    - Model prediction: the student model produces the prediction result for the test data.
    - [Figure: same pipeline diagram and cross-validation grid, with the Model prediction / Prediction result for test data step added]
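The out-of-fold (OOF) step of this procedure can be sketched as follows. This is a plain-Python illustration under stated assumptions: the "teacher" is a stand-in that only predicts class frequencies, and `kfold_indices`, `fit_class_prior`, and `oof_predictions` are hypothetical helpers, not functions from the workshop repository. The point is the data flow: every training sample receives a teacher prediction from a model that never saw that sample.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds disjoint validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def fit_class_prior(labels, n_classes):
    """Stand-in 'teacher': predicts the class frequencies of its training labels."""
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]


def oof_predictions(labels, n_classes, n_folds=5):
    """Teacher prediction for every training sample, made by a model not trained on it."""
    oof = [None] * len(labels)
    for valid_idx in kfold_indices(len(labels), n_folds):
        valid = set(valid_idx)
        train_labels = [y for i, y in enumerate(labels) if i not in valid]
        teacher = fit_class_prior(train_labels, n_classes)  # trained without this fold
        for i in valid_idx:
            oof[i] = list(teacher)
    return oof


labels = [i % 3 for i in range(30)]  # toy labels for 30 samples, 3 classes
oof = oof_predictions(labels, n_classes=3)
```

The resulting `oof` list plays the role of the "teacher prediction for train data" that the student is trained on in the second step.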
  • 40. Distillation: How to use teacher prediction
    - There are some strategies for training the student.
      - Use (a * soft-label loss) + (b * hard-label loss) as the student's loss function.
      - Use Max(0.7 * soft-label, hard-label) as the new label (e.g. for the F2 metric).
      - Softmax with temperature is sometimes used for the teacher prediction.
      - In the original paper, the KL divergence between the student and the teacher was also used.
        - https://arxiv.org/abs/1503.02531
    - Example for Max(0.7 * soft-label, hard-label), where the teacher prediction is the soft label:
      - soft label = 0.20, 0.70, 0.10, 0.70
      - hard label = 0.00, 1.00, 0.00, 0.00
      - new target = 0.14, 1.00, 0.07, 0.49
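The strategies listed on this slide can be sketched in plain Python. This is an illustration rather than the repository's code: the function names and the single-label toy values for the loss are assumptions, while the four-value soft and hard labels are the example numbers from the slide (they do not sum to 1, which suggests independent per-class scores as in a multi-label task).

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a higher temperature gives a flatter output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(probs, target, eps=1e-12):
    """Cross-entropy of predicted probabilities against a (soft or one-hot) target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


def distillation_loss(student_probs, soft_label, hard_label, a=1.0, b=1.0):
    """(a * soft-label loss) + (b * hard-label loss)."""
    return (a * soft_cross_entropy(student_probs, soft_label)
            + b * soft_cross_entropy(student_probs, hard_label))


def max_target(soft_label, hard_label, scale=0.7):
    """Element-wise Max(scale * soft-label, hard-label)."""
    return [max(scale * s, h) for s, h in zip(soft_label, hard_label)]


# Example values from the slide.
soft = [0.20, 0.70, 0.10, 0.70]
hard = [0.00, 1.00, 0.00, 0.00]
new_target = max_target(soft, hard)

# Toy single-label example for the weighted loss (hypothetical values).
student_probs = softmax_with_temperature([2.0, 1.0, 0.1])
loss = distillation_loss(student_probs, [0.6, 0.3, 0.1], [1.0, 0.0, 0.0], a=2.0, b=1.0)
```

Rounded to two decimals, `new_target` matches the "new target" row on the slide. The weights `a` and `b` and the temperature are hyper-parameters to tune.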
  • 41. Distillation: Why distillation for a noisy dataset?
    - Distillation can smooth out extreme values in noisy hard labels.
    - If the data is complex and hard to annotate, the teacher prediction for that data may not have high confidence.
    - When a dog is annotated as a cat by mistake, the teacher prediction for that sample may still be close to dog if the rest of the data is reliable.
    - Example:
      - label (noisy): dog 0, cat 1
      - teacher prediction: dog 0.9, cat 0.1
  • 42. Distillation: Experimental results
    - Distillation performs better than the base setting (0.563).
    - Performance improves even with the same model architecture.
    - The weight of the soft loss affects performance.
      - A weight of 2 (the soft loss counts twice as much as the hard loss) is the best.
      - Is it because of the noisy dataset?
  • 43. Distillation: Implementations
    - Calculate hard and soft loss
      - https://github.com/ryuichi0704/workshop_noise_handling/blob/master/project/work/runner/distillation_runner.py#L64-L77
  • 44. Distillation: Examples in competitions
    - iMet Collection 2019 - FGVC6, 9th place
      - Max(0.7*soft-label, hard-label) as new targets
      - The property of the competition metric (F2) is considered.
    - The 2nd YouTube-8M Video Understanding Challenge, 2nd place
      - Multi-stage distillation
  • 45. Summary
    ● Introduction
    ● Setup of the experiment
    ● Techniques for learning with a noisy dataset
  • 46. There are many other techniques
    - Learning with selected samples
      - Drop large-error samples from training
      - Curriculum learning
    - Learning with noise transition
      - Forward correction (modify the objective)
    - [Figures: (Curriculum learning) https://arxiv.org/abs/1808.01097; (Noise transition matrix)]
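As a rough illustration of forward correction, the sketch below pushes the model's clean-class probabilities through a noise transition matrix and computes the cross-entropy against the observed noisy label. It is a generic formulation under stated assumptions, not code from the workshop: the function name is hypothetical, the transition matrix values are made up, and in practice the matrix has to be estimated or known in advance.

```python
import math


def forward_corrected_loss(clean_probs, noisy_label, transition, eps=1e-12):
    """Cross-entropy on noisy-label probabilities.

    transition[i][j] is assumed to be P(noisy label = j | true label = i).
    """
    n_classes = len(clean_probs)
    # Probability of observing each noisy label given the model's clean prediction.
    noisy_probs = [
        sum(clean_probs[i] * transition[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    loss = -math.log(max(noisy_probs[noisy_label], eps))
    return loss, noisy_probs


# Hypothetical 2-class transition matrix: class 0 is flipped to class 1 20% of the time.
transition = [
    [0.8, 0.2],
    [0.1, 0.9],
]
loss, noisy_probs = forward_corrected_loss([0.9, 0.1], noisy_label=1, transition=transition)
```

Compared with plain cross-entropy, the model is penalized less for a confident "class 0" prediction on a sample labeled 1, because the matrix says such a label is a plausible corruption of class 0.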

Editor's Notes

  1. 24 min up to this point
  2. 37 min up to this point
  3. 46 min up to this point