Bag of tricks for image classification
with convolutional neural networks
Choi Dongmin
Yonsei University Severance Hospital CCIDS
Contents
• Abstract

• Introduction

• Training Procedures

• Efficient Training

• Model Tweaks

• Training Refinements

• Transfer Learning

• Conclusion
Abstract
• Focus on training refinements, such as data augmentations and optimization methods

• Tips on how to apply these refinements

• Raised ResNet-50's top-1 accuracy from 73.5% to 79.29% on ImageNet

• These improvements also lead to better transfer learning performance
Introduction
• Image classification performance has been raised with various architectures including AlexNet, VGG, ResNet, DenseNet and NASNet.

• However, these advancements did not solely come from improved model architectures
- Training refinements (ex. loss functions, data preprocessing, optimization) have also played a major role
Introduction
Training Procedures
• 1. Baseline Training Procedure

• 2. Experiment Results
Training Procedures
1. Baseline Training Procedure - During Training
• (1) Randomly sample an image and decode it into 32-bit floating point raw pixel values in [0, 255].

• (2) Randomly crop a rectangular region whose aspect ratio is randomly sampled in [3/4, 4/3] and area randomly sampled in [8%, 100%], then resize the cropped region into a 224-by-224 square image.

• (3) Flip horizontally with 0.5 probability.

• (4) Scale hue, saturation, and brightness with coefficients uniformly drawn from [0.6, 1.4].

• (5) Add PCA noise with a coefficient sampled from a normal distribution N(0, 0.1).

• (6) Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively.
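For reference, a minimal torchvision sketch of steps (1)-(6). This is an assumption-laden translation, not the paper's code (which used MXNet/Gluon): ToTensor rescales pixels to [0, 1], so the means and std devs are divided by 255, and torchvision has no built-in PCA noise, so step (5) is only marked.

    from torchvision import transforms

    # RGB means / std devs from step (6), rescaled because ToTensor maps pixels to [0, 1]
    MEAN = [123.68 / 255, 116.779 / 255, 103.939 / 255]
    STD = [58.393 / 255, 57.12 / 255, 57.375 / 255]

    train_transform = transforms.Compose([
        # (2) aspect ratio in [3/4, 4/3], area in [8%, 100%], resized to 224x224
        transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
        # (3) horizontal flip with 0.5 probability
        transforms.RandomHorizontalFlip(p=0.5),
        # (4) approximates the [0.6, 1.4] scaling of hue/saturation/brightness
        transforms.ColorJitter(brightness=0.4, saturation=0.4, hue=0.4),
        # (1) decode to 32-bit floats (in [0, 1] here rather than [0, 255])
        transforms.ToTensor(),
        # (5) PCA noise omitted: torchvision has no built-in equivalent
        # (6) per-channel normalization
        transforms.Normalize(mean=MEAN, std=STD),
    ])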
Training Procedures
1. Baseline Training Procedure - During Validation
• Resize each image's shorter edge to 256 pixels while keeping its aspect ratio

• Crop out a 224-by-224 region in the center

• Normalize RGB channels as in training

• No random data augmentations
Training Procedures
1. Baseline Training Procedure - Weights Initialization
• Convolutional and fully-connected layers : Xavier algorithm

• All biases : 0

• For batch normalization : γ vectors initialized to 1, β vectors to 0
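A sketch of this initialization in PyTorch (a hypothetical helper; the paper's implementation is in MXNet):

    import torch.nn as nn

    def init_weights(model: nn.Module) -> None:
        for m in model.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.xavier_uniform_(m.weight)   # Xavier algorithm
                if m.bias is not None:
                    nn.init.zeros_(m.bias)          # all biases : 0
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)             # gamma vectors : 1
                nn.init.zeros_(m.bias)              # beta vectors : 0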
Training Procedures
1. Baseline Training Procedure - Optimizer and Hyper-parameters
• Optimizer : Nesterov Accelerated Gradient (NAG) descent

• Epochs : 120

• Batch size : 256

• GPU : 8 NVIDIA V100 GPUs

• Learning rate : 0.1, divided by 10 at epochs 30, 60, and 90
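The same recipe as a PyTorch sketch (momentum 0.9 is an assumption; model is assumed to be defined elsewhere):

    import torch

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, nesterov=True)   # NAG descent
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1)         # decay at epochs 30/60/90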
Training Procedures
2. Experiment Results
Efļ¬cient Training
ā€¢ 1. Large-batch Training

ā€¢ 2. Low-precision Training

ā€¢ 3. Experiment Result
Efļ¬cient Training
1. Large-batch Training
ā€¢ It is more eļ¬ƒcient to use larger batch size

ā€¢ But using large batch size decreases convergence rate for convex
problems and degrades validation accuracy.

ā€¢ Four heuristics that help scale the batch size upā€Ø
- Linear scaling learning rateā€Ø
- Learning rate warmupā€Ø
- Zero ā€Ø
- No bias decay
Ī³
Efļ¬cient Training
1. Large-batch Training : Linear Scaling Learning Rate
ā€¢ A large batch size reduces the noise in the gradient (= reduces variance)

ā€¢ It is possible to increase the learning rate with a large batch size

ā€¢ In Goyal et al*, linearly increasing the learning rate with the batch size
works empirically for ResNet-50 training.

ā€¢ In He et al**, set the initial learning rate to 0.1 x b/256 (b = batch size)
Goyal et al*, Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017
He et al**, Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770ā€“778, 2016
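A one-line sketch of this rule (a hypothetical helper, not code from the paper):

    def scaled_lr(batch_size: int, base_lr: float = 0.1, base_batch: int = 256) -> float:
        # Linear scaling rule: 0.1 x b/256
        return base_lr * batch_size / base_batch

    scaled_lr(1024)  # -> 0.4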
Efļ¬cient Training
1. Large-batch Training : Learning Rate Warmup
ā€¢ Initial network parameters are typically far away from the ļ¬nal solutionā€Ø
- Using a too large learning rate may result in numerical instability

ā€¢ Using a small learning rate at the beginning and thenā€Ø
switch back to the initial learning rate when the training process is stable

ā€¢ In Goyal et al*, set the learning rate as belowā€Ø
- Use the ļ¬rst batches to warm upā€Ø
- Initial learning rate : ā€Ø
- At batch , the learning rate :
m
Ī·
i(1 ā‰¤ i ā‰¤ m) iĪ·/m
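A minimal sketch of this linear warmup (a hypothetical helper):

    def warmup_lr(i: int, m: int, eta: float) -> float:
        # Linearly ramp the learning rate from eta/m up to eta over the first m batches
        return eta * i / m if i <= m else eta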
Efļ¬cient Training
1. Large-batch Training : Zero Ī³
ā€¢ In batch normalization, the output is ā€Ø
( : standardized input / : learnable parameters initialized to 1s and 0s)

ā€¢ Zero initialization heuristicā€Ø
- Initialize for all BN layers that sit at the end of a residual blockā€Ø
- All residual blocks just return their inputs [ output = + block ]ā€Ø
- Mimics network that has less number of layers and is easier to trainā€Ø
at the initial stage
Ī³ Ģ‚x + Ī²
Ģ‚x x Ī³, Ī²
Ī³
Ī³ = 0
x (x)
https://mc.ai/bag-of-tricks-for-image-classiļ¬cation-with-convolutional-neural-networks-in-keras/
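A sketch of the heuristic for torchvision ResNets, assuming bn3/bn2 is the last BN of each residual branch (this mirrors, but is not, the paper's MXNet code):

    import torch.nn as nn
    import torchvision

    def zero_init_last_bn(model: nn.Module) -> None:
        # Set gamma = 0 in the last BN of every residual block,
        # so each block initially returns its input.
        for m in model.modules():
            if isinstance(m, torchvision.models.resnet.Bottleneck):
                nn.init.zeros_(m.bn3.weight)
            elif isinstance(m, torchvision.models.resnet.BasicBlock):
                nn.init.zeros_(m.bn2.weight)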
Efļ¬cient Training
1. Large-batch Training : No bias decay
ā€¢ Weight decay is often applied to all learnable parameters in order to ā€Ø
avoid overļ¬tting

ā€¢ According to Jia et al*, itā€™s recommended to only apply weight decay to
weights (not to bias)

ā€¢ That is, other parameters (biases and and in BN layers) are left
unregularized
Ī³ Ī²
Jia et al*, Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018. ā€Ø
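A sketch of the corresponding optimizer parameter groups (a hypothetical helper; the 1-D test catches both biases and BN γ/β):

    import torch.nn as nn

    def no_bias_decay_groups(model: nn.Module, weight_decay: float = 1e-4):
        # Apply weight decay only to conv/linear weights;
        # biases and BN gamma/beta (all 1-D tensors) stay unregularized.
        decay, no_decay = [], []
        for p in model.parameters():
            (no_decay if p.ndim == 1 else decay).append(p)
        return [{"params": decay, "weight_decay": weight_decay},
                {"params": no_decay, "weight_decay": 0.0}]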
Efļ¬cient Training
2. Low-precision training
ā€¢ Neural networks are commonly trained with 32-bit-ļ¬‚oating-point (FP32)

ā€¢ However, new hardware support lower precision data types

ā€¢ 2 to 3 times faster after switchingā€Ø
from FP32 to FP16 on V100

ā€¢ Micikevicius et al* suggestedā€Ø
Mixed precision trainingā€Ø
( Forward and backward in FP16,ā€Ø
Weight update in FP32)
Micikevicius et al*, Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
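A minimal mixed-precision loop with torch.cuda.amp, assuming model, criterion, optimizer, and loader are defined elsewhere (a sketch of the idea, not the paper's implementation):

    import torch

    scaler = torch.cuda.amp.GradScaler()
    for images, labels in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():      # forward pass runs in FP16 where safe
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()        # scale loss to avoid FP16 gradient underflow
        scaler.step(optimizer)               # unscale and update FP32 master weights
        scaler.update()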
Efļ¬cient Training
3. Experiment Results
Model Tweaks
• A minor adjustment to the network architecture, such as changing the stride

• Such a tweak often barely changes the computational complexity but might have a non-negligible effect on accuracy
Model Tweaks
Original ResNet-50
Input Stem
Downsampling Block
Model Tweaks
Original Downsampling Block
ResNet-B Downsampling Block
Using a kernel size of 1x1 with a stride of 2
→ ignores three-quarters of the input feature map
First appeared in a Torch implementation
Model Tweaks
Original Input Stem
ResNet-C Input Stem
A 7 x 7 convolution is 5.4 times more expensive than a 3 x 3 convolution
Originally proposed in Inception-v2
Model Tweaks
Original Downsampling Block
ResNet-D Downsampling Block
Path B also ignores 3/4 of the input feature maps because of the stride size
Proposed model tweak
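A sketch of the ResNet-D fix for path B: average-pool first, then convolve with stride 1, so no activations are skipped (a hypothetical PyTorch module, not the paper's code):

    import torch.nn as nn

    def resnet_d_shortcut(in_ch: int, out_ch: int) -> nn.Sequential:
        # 2x2 average pooling with stride 2 does the downsampling, so the
        # following 1x1 convolution can use stride 1 and see every input value.
        return nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )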
Model Tweaks
Training Reļ¬nements
ā€¢ 1. Cosine Learning Rate Decay

ā€¢ 2. Label Smoothing

ā€¢ 3. Knowledge Distillation

ā€¢ 4. Mixup Training 

ā€¢ 5. Experiment Results
Training Reļ¬nements
1. Cosine Learning Rate Decay
ā€¢ Cosine Annealing Strategy proposed by Loshchilov et al*
Ī·t =
1
2 (
1 + cos
(
tĻ€
T ))
Ī·
- : Learning rate at batch
- : The number of total batches
Ī·t t
T
Loshchilov et al*, SGDR: stochastic gradient de- scent with restarts. CoRR, abs/1608.03983, 2016.
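A direct translation of the formula (a hypothetical helper):

    import math

    def cosine_lr(t: int, T: int, eta: float) -> float:
        # eta_t = 1/2 * (1 + cos(t * pi / T)) * eta
        return 0.5 * (1 + math.cos(t * math.pi / T)) * eta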
Training Reļ¬nements
1. Cosine Learning Rate Decay
Remains large ā†’ Potentially improves training process
Training Reļ¬nements
2. Label Smoothing Proposed by Inception-v2*
Inception-v2* : C.Szegedy et al. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
Ground Truth Probability :
Optimal Solution :
- : the number of labelsK
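A sketch of a cross-entropy loss with these smoothed targets (a hypothetical PyTorch helper):

    import torch
    import torch.nn.functional as F

    def label_smoothing_loss(logits: torch.Tensor, target: torch.Tensor,
                             eps: float = 0.1) -> torch.Tensor:
        # Smoothed targets: 1 - eps on the true class, eps/(K - 1) on the others.
        K = logits.size(-1)
        log_probs = F.log_softmax(logits, dim=-1)
        smooth = torch.full_like(log_probs, eps / (K - 1))
        smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)
        return -(smooth * log_probs).sum(dim=-1).mean()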
Training Reļ¬nements
2. Label Smoothing Proposed by Inception-v2*
Inception-v2* : C.Szegedy et al. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
centers at the theoretical value
fewer extreme values
Training Reļ¬nements
3. Knowledge Distillation
G Hinton et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. ā€Ø
Image
Teacher Modelā€Ø
(Large; ex. ResNet-152)
Student Modelā€Ø
(Small; ex. ResNet-50)
r
z
Output
Transfer
loss(p, softmax(z)) + T2
loss(softmax(r/T), softmax(z/T))
To improve accuracy of the student model while keeping its complexity the same
ā€» : the temperature hyper-parameter to make the softmax outputs smootherT
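A sketch of this objective in PyTorch, assuming cross-entropy for the hard term and KL divergence for the soft term:

    import torch
    import torch.nn.functional as F

    def distillation_loss(z: torch.Tensor, r: torch.Tensor,
                          target: torch.Tensor, T: float = 20.0) -> torch.Tensor:
        # Hard-label cross-entropy on the student output z ...
        hard = F.cross_entropy(z, target)
        # ... plus T^2-weighted divergence between temperature-softened
        # teacher (r) and student (z) distributions.
        soft = F.kl_div(F.log_softmax(z / T, dim=-1),
                        F.softmax(r / T, dim=-1), reduction="batchmean")
        return hard + (T ** 2) * soft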
Training Reļ¬nements
4. Mixup Training
H Zhang et al. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017.
https://hoya012.github.io/blog/Bag-of-Tricks-for-Image-Classiļ¬cation-with-Convolutional-Neural-Networks-Review/
Only use , generated by a weighted linear interpolation of two example( Ģ‚x, Ģ‚y) (xi, yi), (xj, yj)
drawn from the Beta distribution(Ī±, Ī±)
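A batch-level sketch of mixup, assuming one-hot labels so they can be interpolated the same way as the images:

    import numpy as np
    import torch

    def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
        # Blend each example with a randomly chosen partner from the same batch.
        lam = float(np.random.beta(alpha, alpha))   # lambda ~ Beta(alpha, alpha)
        idx = torch.randperm(x.size(0))
        x_hat = lam * x + (1 - lam) * x[idx]
        y_hat = lam * y + (1 - lam) * y[idx]        # y must be one-hot / soft labels
        return x_hat, y_hat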
Training Reļ¬nements
5. Experiment Results
- for label smoothing
- for knowledge distillation
- for mixup training
Ļµ = 0.1
T = 20
Ī± = 0.2
Transfer Learning
Object Detection
Transfer Learning
Semantic Segmentation
Thank you
Yonsei University Severance Hospital CCIDS