Why Batch Normalization
Works so Well
Group: We are the REAL baseline
D05921027 Chun-Min Chang, D05921018 Chia-Ching Lin
F03942038 Chia-Hao Chung, R05942102 Kuan-Hua Wang
Internal Covariate Shift
•  During training, layers need to continuously adapt to the new
distribution of their inputs
[Figure: a unit z computed from inputs z₁ and z₂ through weights w]
Batch Normalization (BN)
•  Goal: to speed up the process of training deep neural networks by
reducing internal covariate shift
[Figure: the same network with BN blocks inserted before the layer outputs z₁′, z₂′]
Idea of BN
•  Full whitening? Too costly!
•  Two necessary simplifications
a.  Normalize each feature dimension independently (no decorrelation)
b.  Normalize over each mini-batch
•  E.g., for the k-th dimension of the input:
•  Also, “scale” and “shift” parameters are introduced to preserve network
capacity
$\hat{x}^{(k)} = \dfrac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$   (E[x^{(k)}] is the batch mean, Var[x^{(k)}] the batch variance)
$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
BN Algorithm (1/2)
•  Training:
𝜖 is a constant preventing
division by zero, e.g., 0.001
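A minimal NumPy sketch of this training-time transform on one mini-batch, assuming inputs of shape (batch, features); ε = 0.001 follows the slide, while the function name, shapes, and initial γ, β values are only illustrative:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-time BN for one mini-batch x of shape (batch, features)."""
    mu_b = x.mean(axis=0)                        # batch mean, per feature
    var_b = x.var(axis=0)                        # batch variance, per feature
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)    # normalize each feature dimension
    y = gamma * x_hat + beta                     # scale and shift (learned gamma, beta)
    return y, mu_b, var_b

# Usage: a batch of 64 examples with 100 features (the hidden width used later).
x = np.random.randn(64, 100)
gamma, beta = np.ones(100), np.zeros(100)
y, mu_b, var_b = batch_norm_train(x, gamma, beta)
```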
BN Algorithm (2/2)
•  Testing: use population statistics (μ and σ) estimated with moving averages of the batch statistics (μ_B and σ_B) during training
𝛼 is the moving-average momentum, e.g., 0.999
$\mu \leftarrow \alpha\mu + (1-\alpha)\mu_B$
$\sigma \leftarrow \alpha\sigma + (1-\alpha)\sigma_B$
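A sketch of how the population statistics could be maintained during training and reused at test time, with the same per-feature layout as above; the helper names are ours, only α = 0.999 and the update rule come from the slide:

```python
import numpy as np

def update_population_stats(mu, sigma, mu_b, sigma_b, alpha=0.999):
    """Moving-average update of the population mean/std from batch statistics."""
    mu = alpha * mu + (1 - alpha) * mu_b
    sigma = alpha * sigma + (1 - alpha) * sigma_b
    return mu, sigma

def batch_norm_test(x, gamma, beta, mu, sigma, eps=1e-3):
    """Testing-time BN: normalize with the estimated population statistics."""
    x_hat = (x - mu) / np.sqrt(sigma ** 2 + eps)
    return gamma * x_hat + beta
```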
Problems of Interest
•  To understand the effect of BN w.r.t. the following network components
(1) activation function
(2) optimizer
(3) batch size
(4) training/testing data distribution
•  To validate the claims in the original BN paper
(5) BN solves the issue of gradient vanishing
(6) BN regularizes the model
(7) BN helps make the singular values of the layers' Jacobians closer to 1
•  (8) To compare BN with batch renormalization (BRN)
Experiment Setup
•  Toolkit: TensorFlow
•  Dataset: MNIST
•  Network structure: DNN with 2 hidden layers of 100 neurons each
•  Default parameters (may change for different experiments)
(1) learning rate: 0.0001
(2) batch size: 64
(3) activation function: sigmoid
(4) optimizer: SGD
•  BN layers are applied before the activation functions
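The group's actual training script is not included, so the following is only a rough restatement of the described setup in today's tf.keras API (the original used lower-level TensorFlow). The layer sizes, learning rate, batch size, default Sigmoid/SGD choices, and the "BN before activation" placement follow the slide; everything else (epoch count, output layer, preprocessing) is an assumption:

```python
import tensorflow as tf

def build_model(use_bn=True, activation="sigmoid"):
    layers = [tf.keras.layers.Flatten(input_shape=(28, 28))]
    for _ in range(2):                                         # 2 hidden layers of 100 neurons
        layers.append(tf.keras.layers.Dense(100))
        if use_bn:
            layers.append(tf.keras.layers.BatchNormalization())  # BN before the activation
        layers.append(tf.keras.layers.Activation(activation))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    return tf.keras.Sequential(layers)

model = build_model(use_bn=True, activation="sigmoid")
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),  # lr = 0.0001
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, batch_size=64, epochs=5)
```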
To understand the effect of BN w.r.t. the
following network components
(1) activation function
(2) optimizer
(3) batch size
(4) training/testing data distribution
(1) Activation Function
•  In all cases, BN significantly improves the speed of training
•  Sigmoid w/o BN suffers from gradient vanishing
(2) Optimizer
•  ReLU+Adam ≈ ReLU+SGD+BN (the same holds for Sigmoid)
•  With BN, the choice of optimizer does not lead to a significant difference
(3) Batch Size
•  For a small batch size (e.g., 4), BN degrades performance
(4) Mismatch between Training and Testing
•  For a binary classification task with an extremely imbalanced testing distribution (e.g., 99 : 1), it is no surprise that BN ruins performance
Brief Summary I
1.  BN speeds up the training process and improves performance for all tested activation functions and optimizers, with the biggest improvement when Sigmoid is used
2.  With BN, the choice of activation function is more crucial than the choice of optimizer
3.  BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
To validate the claims in the BN paper
(5) BN solves the issue of gradient vanishing
(6) BN regularizes the model
(7) BN helps make the singular values of the layers' Jacobians closer to 1
(5) BN does solve the issue of gradient vanishing
[Figure: average gradient magnitudes of Layer 1 and Layer 2, without BN vs. with BN, for Sigmoid and ReLU; without BN the Sigmoid gradients span a ~5x gap across layers (0.02 vs 0.10), while with BN the layers are comparable]
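One way such per-layer gradient magnitudes could be measured, sketched with tf.GradientTape on the hypothetical build_model() from the previous sketch; this is our reconstruction, not the group's measurement code:

```python
import tensorflow as tf

def mean_abs_layer_gradients(model, x, y):
    """Average |dLoss/dW| for each Dense kernel; one number per layer."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    kernels = [l.kernel for l in model.layers
               if isinstance(l, tf.keras.layers.Dense)]
    grads = tape.gradient(loss, kernels)
    return [float(tf.reduce_mean(tf.abs(g))) for g in grads]

# Usage (with the build_model sketch above and a labeled mini-batch x_batch, y_batch):
# print(mean_abs_layer_gradients(build_model(use_bn=False), x_batch, y_batch))
# print(mean_abs_layer_gradients(build_model(use_bn=True),  x_batch, y_batch))
```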
(6) BN does regularize the model
•  E.g., average magnitude of weights in layer 2
[Figure: a two-layer network in which the weights w11, w12, w21, w22 feed BN blocks that apply ×γ¹, +β¹ and ×γ², +β² before producing the activations a]
•  The w's can be smaller since we have the γ's
Does BN benefit the gradient flow?
•  Isometry (a distance-preserving transformation) → singular values are close to 1
•  Recall that errors are back-propagated through the layers' Jacobian matrices
•  Claim: BN helps make the singular values of the layers' Jacobians closer to 1
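A rough NumPy sketch of how one might inspect these singular values for a single sigmoid layer. For the "with BN" case it treats the batch statistics as fixed constants (inference-style), which is a simplification of the full training-time Jacobian; all shapes, scales, and names are illustrative:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def layer_jacobian_svals(x, W, gamma=None, eps=1e-3, batch=None):
    """Singular values of d(sigmoid(BN(Wx)))/dx at a single input x.

    If gamma is given, apply inference-mode BN with statistics taken from
    `batch` and treated as constants (a simplification)."""
    h = W @ x
    if gamma is not None:
        hb = batch @ W.T                              # pre-activations over the batch
        mu, var = hb.mean(axis=0), hb.var(axis=0)
        scale = gamma / np.sqrt(var + eps)
        h = scale * (h - mu)
        J = np.diag(sigmoid(h) * (1 - sigmoid(h))) @ np.diag(scale) @ W
    else:
        J = np.diag(sigmoid(h) * (1 - sigmoid(h))) @ W
    return np.linalg.svd(J, compute_uv=False)

x = np.random.randn(100)
W = 0.1 * np.random.randn(100, 100)
batch = np.random.randn(64, 100)
print(layer_jacobian_svals(x, W)[:5])                                     # without BN
print(layer_jacobian_svals(x, W, gamma=np.ones(100), batch=batch)[:5])    # with BN
```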
Singular Values of Layer Jacobian (Sigmoid) [three slides of figures]
Singular Values of Layer Jacobian (ReLU) [three slides of figures]
Brief Summary II
1.  BN does solve the issue of gradient vanishing
2.  BN does regularize the weights
3.  BN does benefit the gradient flow by making the singular values of the layers' Jacobians closer to 1
To compare BN with batch
renormalization (BRN)
(8) Does BRN really solve the problems of BN?
Batch Renormalization (BRN)
•  Recall that BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
•  This is mainly due to the mismatch between the batch statistics (used during training) and the estimated population statistics (used during testing)
•  BRN introduces two parameters r and d to fix this mismatch:
[Formulas: the BN and BRN normalization transforms, shown side by side]
BRN Algorithm
•  During training, population statistics are maintained and used in the normalization process
•  During testing, the estimated population statistics are used
Note that when r = 1 and d = 0, BRN reduces to BN
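A NumPy sketch of the training-time BRN transform as described in the batch renormalization paper; the clipping bounds r_max and d_max are part of that recipe, the concrete values here are illustrative, and in the real algorithm r and d are treated as constants during backpropagation:

```python
import numpy as np

def batch_renorm_train(x, gamma, beta, mu, sigma, r_max=3.0, d_max=5.0, eps=1e-3):
    """Training-time BRN: normalize with batch statistics, then correct toward
    the maintained population statistics (mu, sigma) via r and d.
    With r = 1 and d = 0 this reduces to plain BN."""
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    r = np.clip(sigma_b / sigma, 1.0 / r_max, r_max)   # stop-gradient in the real algorithm
    d = np.clip((mu_b - mu) / sigma, -d_max, d_max)    # stop-gradient in the real algorithm
    x_hat = (x - mu_b) / sigma_b * r + d
    return gamma * x_hat + beta
```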
BN vs. BRN under small batch size
•  BRN survives under a small batch size (4)
Conclusions
We have shown experimentally that
1.  BN speeds up the training process and improves performance no matter which activation functions or optimizers are used
◦ With BN, the choice of activation function is more crucial than the choice of optimizer
2.  BN does…
(1) solve the issue of gradient vanishing
(2) regularize the weights
(3) benefit the gradient flow through the network
3.  BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
→ Solved by BRN
References
•  [S. Ioffe & C. Szegedy, 2015] Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
•  [Saxe et al., 2013] Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
•  [Nair & Hinton, 2010] Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010.
•  [Shimodaira, 2000] Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.
•  [LeCun et al., 1998b] LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the Trade. Springer, 1998b.
•  [Wiesler & Ney, 2011] Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.
References
•  [Wiesler et al., 2014] Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.
•  [Raiko et al., 2012] Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.
•  [Povey et al., 2014] Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.
•  [Wang et al., 2016] Wang, S., Mohamed, A. R., Caruana, R., Bilmes, J., Philipose, M., Richardson, M., ... & Aslan, O. Analysis of deep neural networks with the extended data Jacobian matrix. In Proceedings of the 33rd International Conference on Machine Learning, pp. 718–726, June 2016.
•  [K. Jia, 2016] Jia, Kui. Improving training of deep neural networks via singular value bounding. arXiv preprint arXiv:1611.06013, 2016.
•  [R2RT] Implementing Batch Normalization in Tensorflow: https://r2rt.com/implementing-batch-normalization-in-tensorflow.html
