Final project, Machine Learning and Having It Deep and Structured, NTU
- Rank 1/25 in peer review, original score: 16.2/17
- 2nd presentation prize (voted by audience)
1. Why Batch Normalization Works So Well
Group: We are the REAL baseline
D05921027 Chun-Min Chang, D05921018 Chia-Ching Lin
F03942038 Chia-Hao Chung, R05942102 Kuan-Hua Wang
2. Internal Covariate Shift
• During training, layers need to continuously adapt to the new
distribution of their inputs
[Figure: hidden activations 𝑧₁ and 𝑧₂ connected by weights w; as the weights change during training, each layer's input distribution shifts.]
3. Batch Normalization (BN)
• Goal: to speed up the process of training deep neural networks by
reducing internal covariate shift
[Figure: the same network with BN layers inserted after 𝑧₁ and 𝑧₂, producing normalized activations 𝑧₁′ and 𝑧₂′ before the next layer's weights.]
4. Idea of BN
• Full whitening? Too costly!
• 2 necessary simplifications
a. Normalize each feature dimension (no decorrelation)
b. Normalize each batch
• E.g., for the 𝑘-th dimension of the input vector:
x̂^(k) = (x^(k) − E[x^(k)]) / √Var[x^(k)], where E[x^(k)] is the batch mean and Var[x^(k)] is the batch variance
• Also, “scale” and “shift” parameters 𝛾^(k) and 𝛽^(k) are introduced to preserve network capacity:
y^(k) = 𝛾^(k) x̂^(k) + 𝛽^(k)
5. BN Algorithm (1/2)
• Training: normalize each activation using the batch mean 𝜇_B and variance 𝜎_B²
(𝜖 is a small constant preventing division by zero, e.g., 0.001); a sketch follows below
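A minimal NumPy sketch of the training-mode computation described above (illustrative only, not the project's actual TensorFlow code; the 𝜖 value follows the slide):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode BN: normalize each feature dimension using batch statistics.

    x: (batch_size, num_features) pre-activations of one layer.
    gamma, beta: learnable per-feature scale and shift.
    """
    mu_B = x.mean(axis=0)                      # batch mean (per feature)
    var_B = x.var(axis=0)                      # batch variance (per feature)
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)  # eps prevents division by zero
    y = gamma * x_hat + beta                   # scale and shift preserve capacity
    return y, mu_B, np.sqrt(var_B)             # batch stats feed the moving averages
```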
6. BN Algorithm (2/2)
• Testing: use population statistics (𝜇 and 𝜎) estimated with moving averages of the batch statistics (𝜇_B and 𝜎_B) during training (a matching sketch follows below)
(𝛼 is the moving-average momentum, e.g., 0.999)
𝜇 ← 𝛼𝜇 + (1 − 𝛼)𝜇_B
𝜎 ← 𝛼𝜎 + (1 − 𝛼)𝜎_B
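A matching NumPy sketch of the inference path (again illustrative; the momentum value follows the slide, and 𝜎 is tracked as a standard deviation to mirror the update rule above):

```python
import numpy as np

def update_population_stats(mu, sigma, mu_B, sigma_B, alpha=0.999):
    """Moving-average update of population statistics, run once per training batch."""
    mu = alpha * mu + (1.0 - alpha) * mu_B
    sigma = alpha * sigma + (1.0 - alpha) * sigma_B
    return mu, sigma

def batch_norm_test(x, gamma, beta, mu, sigma, eps=1e-3):
    """Testing-mode BN: normalize with the estimated population statistics."""
    x_hat = (x - mu) / (sigma + eps)
    return gamma * x_hat + beta
```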
7. Problems of Interest
• To understand the effect of BN w.r.t. the following network components
(1) activation function
(2) optimizer
(3) batch size
(4) training/testing data distribution
• To validate the claims in the original BN paper
(5) BN solves the issue of gradient vanishing
(6) BN regularizes the model
(7) BN helps make the singular values of each layer’s Jacobian closer to 1
• (8) To compare BN with batch renormalization (BRN)
8. Experiment Setup
• Toolkit: TensorFlow
• Dataset: MNIST
• Network structure: DNN with 2 hidden layers, 100 neurons each
• Default parameters (may change for different experiments)
(1) learning rate: 0.0001
(2) batch size: 64
(3) activation function: sigmoid
(4) optimizer: SGD
• BN is applied before each activation function (a sketch of the setup follows below)
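A minimal sketch of this setup in today's Keras API (the project dates from the TensorFlow of 2017, so layer names, the MNIST pipeline, and the number of epochs are assumptions, not the team's code):

```python
import tensorflow as tf

# Dataset: MNIST, scaled to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# DNN with 2 hidden layers of 100 neurons; BN before the sigmoid activations
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),  # default (1), (4)
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=5,                  # default (2); epochs arbitrary
          validation_data=(x_test, y_test))
```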
9. To understand the effect of BN w.r.t. the
following network components
(1) activation function
(2) optimizer
(3) batch size
(4) training/testing data distribution
10. (1) Activation Function
• In all cases, BN significantly
improves the speed of
training
• Sigmoid w/o BN: gradient
vanishing
11. (2) Optimizer
• ReLU+Adam ≈ ReLU+SGD+BN (same as Sigmoid)
• With BN, the choice of optimizer does not lead to a significant difference
12. (3) Batch Size
• For a small batch size (e.g., 4), BN degrades the performance
13. (4) Mismatch between Training and Testing
• For a binary classification task with an extremely imbalanced testing distribution (e.g., 99:1), it is no surprise that BN ruins the performance
14. Brief Summary I
1. BN speeds up the training process and improves performance for all choices of activation function and optimizer, with the biggest improvement when Sigmoid is used
2. With BN, the choice of activation function is more crucial than the choice of optimizer
3. BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
15. To validate the claims in the BN paper
(5) BN solves the issue of gradient vanishing
(6) BN regularizes the model
(7) BN helps make the singular values of each layer’s Jacobian closer to 1
16. (5) BN does solve the issue of gradient vanishing
[Figure: average gradient magnitudes in Layer 1 and Layer 2 for Sigmoid and ReLU. Without BN, the Sigmoid network's Layer-1 gradients are about 5x smaller than its Layer-2 gradients (≈0.02 vs. ≈0.10); with BN, the two layers have comparable magnitudes (≈0.10–0.20).]
17. (6) BN does regularize the model
• E.g., average magnitude of weights in layer 2
[Figure: a two-neuron layer with weights w11, w12, w21, w22; each pre-activation 𝑎 is batch-normalized and then scaled and shifted (×𝛾¹ + 𝛽¹, ×𝛾² + 𝛽²).]
• The w’s can be smaller since we have the 𝛾’s (see the check below)
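One way to see why the weights can stay small: BN removes the scale of the incoming weights, so the effective scale is carried by 𝛾 instead. A small NumPy check (illustrative, not from the slides; the batch, 𝛾, and 𝛽 values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 100))   # pre-activations for one batch (hypothetical)
gamma, beta = 1.5, 0.2           # illustrative scale and shift

def bn(x, eps=1e-3):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Multiplying the incoming weights by 10 multiplies z by 10, but BN divides that
# scale back out (up to the small effect of eps), so the output barely changes.
y_small = gamma * bn(z) + beta
y_large = gamma * bn(10.0 * z) + beta
print(np.max(np.abs(y_small - y_large)))   # close to 0: weights can stay small
```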
18. Does BN benefit the gradient flow?
• Isometry (distance-preserving transformation):
→ singular values are close to 1
• Recall that errors are back-propagated through each layer’s Jacobian matrix
• Claim: BN can help make the singular values of each layer’s Jacobian closer to 1 (an illustrative check follows below)
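A rough illustration of how such singular values can be inspected (an assumed setup, not the team's evaluation code): for a fully-connected layer z = sigmoid(Wx), the Jacobian with respect to the input is diag(sigmoid′(Wx))·W, and its singular values come from an SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(100, 100))   # hypothetical layer weights
x = rng.normal(size=100)                     # one input example

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u = W @ x
s = sigmoid(u)
J = (s * (1.0 - s))[:, None] * W             # Jacobian: diag(sigmoid'(u)) @ W

sv = np.linalg.svd(J, compute_uv=False)
print(sv.max(), sv.min())
# The claim is that with BN before the nonlinearity these values cluster near 1,
# so back-propagated errors are neither amplified nor attenuated layer by layer.
```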
25. Brief Summary II
1. BN does solve the issue of gradient vanishing
2. BN does regularize the weights
3. BN does benefit the gradient flow by making the singular values of each layer’s Jacobian closer to 1
26. To compare BN with batch
renormalization (BRN)
(8) Does BRN really solve the problems of BN?
27. Batch Renormalization (BRN)
• Recall that BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
• This is mainly due to the mismatch between batch statistics (used during
training) and estimated population statistics (used during testing)
• BRN introduces two parameters 𝑟 and 𝑑 to correct this mismatch:
BN: x̂ = (x − 𝜇_B) / 𝜎_B
BRN: x̂ = (x − 𝜇_B) / 𝜎_B · 𝑟 + 𝑑, where 𝑟 = 𝜎_B / 𝜎 and 𝑑 = (𝜇_B − 𝜇) / 𝜎
28. BRN Algorithm
• During training, population statistics are maintained and incorporated into the normalization process
• During testing, estimated population
statistics are used
Note that when 𝑟 = 1 and 𝑑 = 0, BRN reduces to BN (a sketch of the training step follows below)
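A minimal NumPy sketch of the training-mode BRN step, following the formulation in the BRN paper (the clipping bounds r_max and d_max are illustrative values, not the project's settings):

```python
import numpy as np

def batch_renorm_train(x, gamma, beta, mu, sigma,
                       r_max=3.0, d_max=5.0, eps=1e-3):
    """Training-mode BRN: correct batch statistics toward the population statistics.

    mu, sigma: running population estimates maintained during training.
    r and d are treated as constants (no gradient is propagated through them).
    """
    mu_B = x.mean(axis=0)
    sigma_B = np.sqrt(x.var(axis=0) + eps)
    r = np.clip(sigma_B / sigma, 1.0 / r_max, r_max)   # r = sigma_B / sigma
    d = np.clip((mu_B - mu) / sigma, -d_max, d_max)    # d = (mu_B - mu) / sigma
    x_hat = (x - mu_B) / sigma_B * r + d               # r = 1, d = 0 recovers plain BN
    return gamma * x_hat + beta
```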
29. BN vs. BRN under small batch size
• BRN survives under a small batch size (4)
30. Conclusions
We have shown experimentally that
1. BN speeds up the training process and improves performance no matter which activation function or optimizer is used
◦ With BN, the choice of activation function is more crucial than the choice of optimizer
2. BN does…
(1) solve the issue of gradient vanishing
(2) regularize the weights
(3) benefit the gradient flow through the network
3. BN worsens performance if (1) the batch size is too small, or (2) the training and testing data distributions are greatly mismatched
→ solved by BRN
31. References
• [S. Ioffe & C. Szegedy, 2015] Ioffe, Sergey, Szegedy, Christian. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
• [Saxe et al., 2013] Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
• [Nair & Hinton, 2010] Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann
machines. In ICML, pp. 807–814. Omnipress, 2010.
• [Shimodaira, 2000] Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-
likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.
• [LeCun et al., 1998b] LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and K., Muller (eds.),
Neural Networks: Tricks of the trade. Springer, 1998b.
• [Wiesler & Ney, 2011] Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-
Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information
Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.
32. References
• [Wiesler et al., 2014] Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized
stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal
Processing, pp. 180–184, Florence, Italy, May 2014.
• [Raiko et al., 2012] Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear
transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp.
924–932, 2012.
• [Povey et al., 2014] Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural
networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.
• [Wang et al., 2016] Wang, S., Mohamed, A. R., Caruana, R., Bilmes, J., Philipose, M., Richardson, M., ... & Aslan, O.
(2016, June). Analysis of Deep Neural Networks with the Extended Data Jacobian Matrix. In Proceedings of The 33rd
International Conference on Machine Learning (pp. 718-726).
• [K. Jia, 2016] Jia, Kui. Improving training of deep neural networks via Singular Value Bounding. arXiv preprint arXiv:1611.06013, 2016.
• [R2RT] Implementing Batch Normalization in Tensorflow:
https://r2rt.com/implementing-batch-normalization-in-tensorflow.html