Why Batch Normalization
Works so Well
Group: We are the REAL baseline
D05921027 Chun-Min Chang, D05921018 Chia-Ching Lin
F03942038 Chia-Hao Chung, R05942102 Kuan-Hua Wang
Internal Covariate Shift
•  During training, layers need to continuously adapt to the new
distribution of their inputs
[Diagram: a small network where weights $w$ produce hidden activations $z_1$ and $z_2$; as the weights before them change during training, the distribution of $z_1$ and $z_2$ keeps shifting.]
Batch Normalization (BN)
•  Goal: to speed up the process of training deep neural networks by reducing internal covariate shift
[Diagram: the same network with BN layers inserted after the weights, yielding normalized activations $z'_1$ and $z'_2$ as inputs to the next layer.]
Idea of BN
•  Full whitening? Too costly!
•  2 necessary simplifications
a.  Normalize each feature dimension independently (no decorrelation)
b.  Normalize over each mini-batch rather than the full training set
•  E.g., for the $k$-th dimension of the input:
•  Also, “scale” and “shift” parameters are introduced to preserve network
capacity
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}} \qquad \text{(E and Var are the batch mean and batch variance)}$$
$$y^{(k)} = \gamma^{(k)}\,\hat{x}^{(k)} + \beta^{(k)}$$
BN Algorithm (1/2)
•  Training: normalize each activation with the mini-batch mean $\mu_B$ and variance $\sigma_B^2$, then scale and shift:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\hat{x} + \beta$$
$\epsilon$ is a constant preventing division by zero, e.g., 0.001
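Below is a minimal NumPy sketch of this training-time transform (our own illustration; the deck's experiments used TensorFlow, and the function and variable names here are assumptions):

```python
# Minimal sketch of training-time batch normalization (per feature).
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """x: (batch, features); gamma, beta: (features,) learned scale/shift."""
    mu_b = x.mean(axis=0)                       # batch mean
    var_b = x.var(axis=0)                       # batch variance
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)   # eps prevents division by zero
    y = gamma * x_hat + beta                    # learned scale and shift
    return y, mu_b, np.sqrt(var_b)

# Usage with the deck's default batch size of 64 and 100 hidden units:
x = np.random.randn(64, 100)
y, mu_b, sigma_b = batch_norm_train(x, np.ones(100), np.zeros(100))
```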
BN Algorithm (2/2)
•  Testing: use population statistics ($\mu$ and $\sigma$) estimated via moving averages of the batch statistics ($\mu_B$ and $\sigma_B$) during training
$\alpha$ is the moving-average momentum, e.g., 0.999
$$\mu \leftarrow \alpha\mu + (1-\alpha)\,\mu_B \qquad \sigma \leftarrow \alpha\sigma + (1-\alpha)\,\sigma_B$$
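A hedged sketch of the corresponding bookkeeping (names follow the slides; this mirrors the equations above, not any particular library API):

```python
# Moving-average update applied after each training batch, and the
# inference-time normalization that consumes the resulting estimates.
import numpy as np

def update_population_stats(mu, sigma, mu_b, sigma_b, alpha=0.999):
    mu = alpha * mu + (1 - alpha) * mu_b          # mu <- a*mu + (1-a)*mu_B
    sigma = alpha * sigma + (1 - alpha) * sigma_b
    return mu, sigma

def batch_norm_test(x, gamma, beta, mu, sigma, eps=1e-3):
    # At test time, the fixed population estimates replace batch statistics.
    x_hat = (x - mu) / np.sqrt(sigma ** 2 + eps)
    return gamma * x_hat + beta
```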
Problems of Interest
•  To understand the effect of BN w.r.t. the following network components
(1) activation function
(2) optimizer
(3) batch size
(4) training/testing data distribution
•  To validate the claims in the original BN paper
(5) BN solves the issue of gradient vanishing
(6) BN regularizes the model
(7) BN helps make the singular values of layers’ Jacobians closer to 1
•  (8) To compare BN with batch renormalization (BRN)
Experiment Setup
•  Toolkit: TensorFlow
•  Dataset: MNIST
•  Network structure: DNN with 2 hidden layers of 100 neurons each
•  Default parameters (may change for different experiments)
(1) learning rate: 0.0001
(2) batch size: 64
(3) activation function: sigmoid
(4) optimizer: SGD
•  BN is applied before each activation function
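As a rough reconstruction of this setup in present-day tf.keras (an assumption on our part; the original code is not shown and this API postdates the deck):

```python
# Sketch of the default experiment: MNIST, 2x100 DNN, BN before activations,
# sigmoid units, SGD with learning rate 1e-4, batch size 64.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),      # BN before the activation
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64)   # x_train: flattened MNIST
```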
To understand the effect of BN w.r.t. the
following network components
(1) activation function
(2) optimizer
(3) batch size
(4) training/testing data distribution
(1) Activation Function
•  In all cases, BN significantly
improves the speed of
training
•  Sigmoid w/o BN: gradient
vanishing
(2) Optimizer
•  ReLU+Adam ≈ ReLU+SGD+BN (and likewise for Sigmoid)
•  With BN, the choice of optimizer does not lead to a significant difference
(3) Batch Size
•  For a small batch size (e.g., 4), BN degrades performance
(4) Mismatch between Training and Testing
•  For a binary classification task with an extremely imbalanced testing distribution (e.g., 99:1), it is no surprise that BN ruins performance
Brief Summary I
1.  BN speeds up the training process and improves performance for all tested activation functions and optimizers, with the biggest improvement when Sigmoid is used
2.  For BN, the choice of activation function is more crucial than the choice of optimizer
3.  BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
To validate the claims in the BN paper
(5) BN solves the issue of gradient vanishing
(6) BN regularizes the model
(7) BN helps make the singular values of layers’ Jacobians closer to 1
(5) BN does solve the issue of gradient vanishing
[Figure: average gradient magnitudes at layers 1 and 2, for Sigmoid and ReLU, with and without BN. Without BN, the Sigmoid network's layer-1 gradients (≈0.02) are about 5× smaller than its layer-2 gradients (≈0.10); with BN, the magnitudes are comparable across layers (≈0.10–0.20).]
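The measurement behind this figure can be sketched as follows (our illustrative reconstruction, not the deck's code): backpropagate a signal through two sigmoid layers and compare average gradient magnitudes per layer.

```python
# Exhibit gradient vanishing: layer-1 gradients shrink relative to layer 2
# when errors pass through sigmoid derivatives and small weights.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal((64, 100))
W1 = 0.1 * rng.standard_normal((100, 100))   # hypothetical small init
W2 = 0.1 * rng.standard_normal((100, 100))

a1 = sigmoid(x @ W1)
a2 = sigmoid(a1 @ W2)

g = rng.standard_normal(a2.shape)            # stand-in for the upstream error
g2 = g * a2 * (1.0 - a2)                     # gradient at layer-2 pre-activation
g1 = (g2 @ W2.T) * a1 * (1.0 - a1)           # gradient at layer-1 pre-activation

print("avg |grad|, layer 2:", np.abs(g2).mean())
print("avg |grad|, layer 1:", np.abs(g1).mean())  # typically much smaller
```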
(6) BN does regularize the model
•  E.g., average magnitude of weights in layer 2
[Diagram: a two-layer network where each BN applies $\times\gamma^{(l)}$ and $+\beta^{(l)}$ to the activations $a$ between the weight layers; the $w$'s can be smaller since the scale is absorbed by the $\gamma$'s.]
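One way to see why the $w$'s can shrink: BN is invariant to a rescaling of the preceding weights, so the effective scale lives entirely in $\gamma$. A tiny sketch of this property (our own illustration):

```python
# BN(a * w * x) ~= BN(w * x) for any a > 0 (up to the eps term), so
# scaling the weights up or down does not change the normalized output.
import numpy as np

def bn(x, gamma=1.0, beta=0.0, eps=1e-3):
    return gamma * (x - x.mean(0)) / np.sqrt(x.var(0) + eps) + beta

x = np.random.randn(64, 1)
print(np.allclose(bn(5.0 * x), bn(x), atol=1e-2))  # True: scale is absorbed
```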
Does BN benefit the gradient flow?
•  Isometry (a distance-preserving transformation) → singular values are close to 1
•  Recall that errors are back-propagated via the layer Jacobian matrices
•  Claim: BN helps make the singular values of layers’ Jacobians closer to 1
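To make the claim measurable, here is a hedged NumPy sketch (our construction): for a sigmoid layer $a = \sigma(xW)$, the input-output Jacobian at $x$ is $J = \mathrm{diag}(\sigma'(z))\,W^\top$, and its singular values can be read off with an SVD.

```python
# Singular values of one sigmoid layer's Jacobian at a random input.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = 0.1 * rng.standard_normal((100, 100))    # hypothetical layer weights
x = rng.standard_normal(100)
z = x @ W
s_prime = sigmoid(z) * (1.0 - sigmoid(z))    # sigmoid derivative at z
J = np.diag(s_prime) @ W.T                   # layer Jacobian d(a)/d(x)

s = np.linalg.svd(J, compute_uv=False)
print("singular values: min %.4f  median %.4f  max %.4f"
      % (s.min(), np.median(s), s.max()))
# Claim under test: with BN before the activation, these concentrate near 1.
```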
Singular Values of Layer Jacobian (Sigmoid)
[Three slides of singular-value plots for the Sigmoid network.]
Singular Values of Layer Jacobian (ReLU)
[Three slides of singular-value plots for the ReLU network.]
Brief Summary II
1.  BN does solve the issue of gradient vanishing
2.  BN does regularize the weights
3.  BN does benefit the gradient flow by making the singular values of layers’ Jacobians closer to 1
To compare BN with batch
renormalization (BRN)
(8) Does BRN really solve the problems of BN?
Batch Renormalization (BRN)
•  Recall that BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
•  This is mainly due to the mismatch between batch statistics (used during
training) and estimated population statistics (used during testing)
•  BRN introduces two correction terms $r$ and $d$ to fix this mismatch:
$$\text{BN: } \hat{x} = \frac{x-\mu_B}{\sigma_B} \qquad \text{BRN: } \hat{x} = \frac{x-\mu_B}{\sigma_B}\cdot r + d, \quad r = \frac{\sigma_B}{\sigma},\; d = \frac{\mu_B-\mu}{\sigma}$$
(with $r$ and $d$ treated as constants during backpropagation)
BRN Algorithm
•  During training, population statistics are maintained and used in the normalization process
•  During testing, the estimated population statistics are used
Note that when $r=1$ and $d=0$, BRN reduces to BN
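A minimal NumPy sketch of one BRN training step under these definitions (our reconstruction from the slides; the clipping bounds r_max and d_max follow the BRN paper and are assumptions here):

```python
# Training-time batch renormalization for one mini-batch (per feature).
import numpy as np

def batch_renorm_train(x, gamma, beta, mu, sigma, alpha=0.999,
                       r_max=3.0, d_max=5.0, eps=1e-3):
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    # Correction terms; in a real implementation gradients do not flow
    # through r and d (they are treated as constants).
    r = np.clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = np.clip((mu_b - mu) / sigma, -d_max, d_max)
    x_hat = (x - mu_b) / sigma_b * r + d     # r=1, d=0 recovers plain BN
    y = gamma * x_hat + beta
    # Unlike BN, population statistics are maintained during training and
    # already enter the forward pass via r and d.
    mu = alpha * mu + (1 - alpha) * mu_b
    sigma = alpha * sigma + (1 - alpha) * sigma_b
    return y, mu, sigma
```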
BN vs. BRN under small batch size
•  BRN still trains well at a small batch size of 4
Conclusions
We have shown experimentally that
1.  BN speeds up the training process and improves performance no matter which activation function or optimizer is used
◦ With BN, the choice of activation function is more crucial than that of the optimizer
2.  BN does…
(1) solve the issue of gradient vanishing
(2) regularize the weights
(3) benefit gradient flow through network
3.  BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
→ Solved by BRN
References
•  [S. Ioffe & C. Szegedy, 2015] Ioffe, Sergey, Szegedy, Christian. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
•  [Saxe et al., 2013] Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
•  [Nair & Hinton, 2010] Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010.
•  [Shimodaira, 2000] Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.
•  [LeCun et al., 1998b] LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and K., Muller (eds.),
Neural Networks: Tricks of the trade. Springer, 1998b.
•  [Wiesler & Ney, 2011] Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.
References
•  [Wiesler et al., 2014] Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.
•  [Raiko et al., 2012] Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.
•  [Povey et al., 2014] Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.
•  [Wang et al., 2016] Wang, S., Mohamed, A. R., Caruana, R., Bilmes, J., Philipose, M., Richardson, M., ... & Aslan, O. (2016, June). Analysis of Deep Neural Networks with the Extended Data Jacobian Matrix. In Proceedings of The 33rd International Conference on Machine Learning (pp. 718–726).
•  [K. Jia, 2016] Jia, Kui. Improving training of deep neural networks via Singular Value Bounding. arXiv preprint arXiv:1611.06013, 2016.
•  [R2RT] Implementing Batch Normalization in Tensorflow:
https://r2rt.com/implementing-batch-normalization-in-tensorflow.html
