Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate Shift
Ho Kun Lin
2023/8/3
Outline
Introduction
Related Works
Methodology
Experimental Results
Conclusion
Introduction
Sergey Ioffe and Christian Szegedy, Google Research, ICML 2015
It's hard to train deep neural networks
SGD is simple but requires careful tuning of hyperparameters
The inputs to each layer are affected by all preceding layers
Each layer needs to continuously adapt to a new input distribution
Introduction
Vanishing gradients slow down convergence
A changing input distribution will likely move x into the saturated regime
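A quick sketch of the saturation effect (illustrative, not from the original slides): the sigmoid derivative sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) shrinks rapidly as |x| grows, so activations pushed into the saturated regime pass back almost no gradient.

```python
import numpy as np

# Illustration: the sigmoid gradient collapses once |x| is large,
# so inputs pushed into the saturated regime barely propagate gradient.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    grad = sigmoid(x) * (1.0 - sigmoid(x))
    print(f"x = {x:5.1f}  sigmoid'(x) = {grad:.6f}")
# x = 0.0 -> 0.25, x = 10.0 -> ~0.000045: gradients vanish in saturation
```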
Introduction
Batch Normalization fixes the means and variances of layer inputs
Reduces the dependence of gradients on the scale of parameters and their initial values
Reduces the need for Dropout
Matches the SOTA model's ImageNet accuracy using only 7% of the training steps
Outline
Introduction
Related Works
Methodology
Experimental Results
Conclusion
Related Works
Normalizing the Inputs
Covariate shift
The input distribution to a learning system changes
Internal Covariate Shift
The input distribution to each internal layer changes as the preceding layers are updated during training
Outline
Introduction
Related Works
Methodology
Experimental Results
Conclusion
Methodology
Fix the distribution of the layer inputs
Normalize each scalar feature independently
Since full whitening of the layer inputs is costly
Normalization via mini-batch statistics
Since we already use mini-batches in SGD
Methodology - BN Transform
Consider a mini-batch B = {x_1, ..., x_m} of size m
Mini-batch mean: mu_B = (1/m) * sum_i x_i
Mini-batch variance: sigma_B^2 = (1/m) * sum_i (x_i - mu_B)^2
Normalize: x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)
Add learnable parameters gamma and beta: y_i = gamma * x_hat_i + beta (scale and shift)
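A minimal NumPy sketch of the transform above (variable names and the eps value are illustrative):

```python
import numpy as np

def batch_norm_transform(x, gamma, beta, eps=1e-5):
    """BN transform over a mini-batch x of shape (m, num_features)."""
    mu = x.mean(axis=0)                      # mini-batch mean, per feature
    var = x.var(axis=0)                      # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # scale and shift (learnable)

x = np.random.randn(64, 100) * 3.0 + 2.0     # a mini-batch of m=64 activations
y = batch_norm_transform(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3]) # ~0 and ~1 per feature
```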
Methodology - Train & Inference
During training, the BN output for an example depends on the other examples in the mini-batch!
Training: normalize with the mini-batch mean and variance
Inference: normalize with the population statistics E[x] and Var[x], so the output depends only on the input
Train the Batch Normalization network, then estimate E[x] and Var[x] over the training data and use the fixed transform y = gamma * (x - E[x]) / sqrt(Var[x] + epsilon) + beta at inference
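A minimal sketch of the training/inference split, assuming running averages are used to estimate E[x] and Var[x] (the momentum value is an illustrative choice, not taken from the paper):

```python
import numpy as np

class BatchNorm1D:
    """Sketch: batch statistics at training time, population statistics at inference."""
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)   # estimate of E[x]
        self.running_var = np.ones(num_features)     # estimate of Var[x]
        self.eps, self.momentum = eps, momentum

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)  # depends on the mini-batch
            self.running_mean += self.momentum * (mu - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mu, var = self.running_mean, self.running_var  # fixed E[x], Var[x]
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta        # deterministic at inference
```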
Methodology - CNN
Put BN right before the nonlinearity
Learn one pair of gamma and beta per feature map (channel), rather than per activation
Normalize jointly over the mini-batch and all spatial locations: the effective mini-batch size is m times the feature-map size, which preserves the convolutional property
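A minimal NumPy sketch of per-channel BN for convolutional feature maps, assuming NCHW layout:

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W). One gamma/beta per feature map (channel).
    Effective mini-batch per channel: N * H * W values."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)       # per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)       # per-channel variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 28, 28)                   # N=8, C=16, 28x28 feature maps
y = batch_norm_conv(x, np.ones(16), np.zeros(16))
```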
Methodology - Observation
With BN, backpropagation is unaffected by the scale of the parameters: BN(Wu) = BN((aW)u), so rescaling W does not change the layer Jacobian with respect to u
Larger weights lead to smaller gradients, so BN stabilizes the parameter growth
BN enables higher learning rates
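A quick numerical check of the scale-invariance property BN(Wu) = BN((aW)u), using toy shapes in NumPy (the forward outputs match, which is what makes the gradient with respect to u insensitive to the scale of W):

```python
import numpy as np

def bn(z, eps=1e-5):
    # per-feature normalization, no gamma/beta needed for this check
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

u = np.random.randn(32, 10)        # mini-batch of layer inputs
W = np.random.randn(10, 5)         # layer weights
a = 100.0                          # arbitrary rescaling of the weights
print(np.allclose(bn(u @ W), bn(u @ (a * W)), atol=1e-4))  # True: BN(Wu) == BN(aWu)
```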
Outline
Introduction
Related Works
Methodology
Experimental Results
Conclusion
MNIST dataset
Handwritten digits dataset
28 x 28 pixel monochrome images
60K training images
10K testing images
10 labels
Used here to examine internal covariate shift
MNIST dataset - NN Setup
28 x 28 binary image input
3 FC hidden layers with 100 sigmoid nonlinearities each
1 FC output layer with 10 activations and cross-entropy loss
Train for 50K steps, 60 examples per mini-batch
BN added to each hidden layer
W initialized to small random Gaussian values
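A minimal PyTorch-style sketch of the network described above (layer sizes from the slide; placing BN before each sigmoid follows the paper's recommendation and is assumed here):

```python
import torch.nn as nn

# Sketch of the MNIST setup: 3 FC hidden layers of 100 sigmoid units with BN,
# plus a 10-way output layer trained with cross-entropy loss.
model = nn.Sequential(
    nn.Flatten(),                      # 28 x 28 image -> 784-dim vector
    nn.Linear(784, 100), nn.BatchNorm1d(100), nn.Sigmoid(),
    nn.Linear(100, 100), nn.BatchNorm1d(100), nn.Sigmoid(),
    nn.Linear(100, 100), nn.BatchNorm1d(100), nn.Sigmoid(),
    nn.Linear(100, 10),                # 10 logits
)
loss_fn = nn.CrossEntropyLoss()
```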
MNIST dataset - Result
x-axis: epoch
y-axis: test accuracy
NN with BN has higher test accuracy
MNIST dataset - Result
x-axis: epoch
y-axis: activation output value
Lines represent {15, 50, 85}th percentiles
Distribution in NN without BN is unstable
Distribution in NN with BN is stable
ImageNet
Train with ILSVRC 2012 dataset
1000 labels
150K test and validation images
1.2M training images
ImageNet - Inception Model
GoogLeNet is an instance of the Inception architecture
Won the 2014 ImageNet competition
Used as the SOTA baseline
ImageNet - Inception Setup
Replace each 5x5 conv. layer with two consecutive 3x3 conv. layers
Increase the number of 28x28 Inception modules from 2 to 3
Use average pooling in some modules and max pooling in others
No across-the-board pooling layers between any two Inception modules
Add stride-2 conv./pooling layers before the filter concatenation in modules 3c and 4e
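A minimal sketch of the first modification, assuming PyTorch-style layers (helper names are illustrative): two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 convolution, with fewer weights and an extra nonlinearity.

```python
import torch.nn as nn

# 5x5: 25*c_in*c_out weights per layer; two 3x3: 9*c_in*c_out + 9*c_out*c_out
def conv5x5(c_in, c_out):
    return nn.Conv2d(c_in, c_out, kernel_size=5, padding=2)

def two_conv3x3(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
    )
```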
ImageNet - BN Setup
Increase learning rate
Remove Dropout
Reduce the L2 weight regularization by a factor of 5
Accelerate the learning rate decay (6 times faster)
Remove Local Response Normalization
Shuffle training examples more thoroughly
Reduce the photometric distortions
ImageNet - BN Setup
BN-Baseline
Inception + BN before each nonlinearity
BN-x5 / BN-x30
BN-Baseline with the learning rate increased by a factor of 5 / 30 (to 0.0075 / 0.045)
BN-x30-Sigmoid
BN-x30 with sigmoid instead of ReLU
ImageNet - Result
x-axis: epoch
y-axis: validation accuracy
Same acc. in fewer steps with BN
Inception with sigmoid (without BN) never achieves accuracy better than 1/1000
ImageNet - Result
BN-x30 trains more slowly at first
The higher learning rate leads to a higher final accuracy
ImageNet - Result
BN-Baseline reaches Inception's 72.2% accuracy in less than half the training steps
BN-x5 needs 14 times fewer steps than Inception to reach 72.2%
Dropout and Local Response Normalization are no longer needed
ImageNet Ensemble - Setup
6 networks based on BN-x30 form the BN-Inception ensemble, each modified via some of:
Increased initial weights in the conv. layers
Dropout with probability 5% or 10%
Non-convolutional, per-activation BN in the last hidden layer
Prediction is based on the arithmetic average of the class probabilities predicted by the individual networks
ImageNet Ensemble - Result
Outline
Introduction
Related Works
Methodology
Experimental Results
Conclusion
Conclusion
Reducing internal covariate shift speeds up training
Adding BN to a SOTA model yields a substantial speedup in training
BN preserves model expressivity (gamma and beta can restore the original activations)
Allows higher learning rates
Reduces the need for Dropout and careful parameter initialization
Beats the SOTA model on ImageNet classification
THANK YOU
Reference: Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.