Batch normalization is a technique introduced in 2015 by Google researchers Sergey Ioffe and Christian Szegedy to address training issues such as internal covariate shift and vanishing gradients. During training, it normalizes the inputs to each unit to zero mean and unit variance using the statistics of the current mini-batch, then applies a learnable scale and shift (gamma and beta) so the layer retains its representational power. This lets deeper models train with higher learning rates and makes them less sensitive to weight initialization. In the original formulation, batch normalization is inserted before the activation function of each layer; at inference time, the mini-batch statistics are replaced by running estimates of the mean and variance accumulated during training.
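To make the training-versus-inference distinction concrete, here is a minimal NumPy sketch of a batch normalization forward pass. The function name, parameter names, and the momentum-based running-average update are illustrative choices, not the API of any particular library.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       training=True, momentum=0.1, eps=1e-5):
    """Normalize a batch of activations x with shape (batch, features)."""
    if training:
        # Use the statistics of the current mini-batch.
        batch_mean = x.mean(axis=0)
        batch_var = x.var(axis=0)
        # Update running estimates for use at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * batch_mean
        running_var = (1 - momentum) * running_var + momentum * batch_var
        x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
    else:
        # At inference, use the running estimates, not the batch statistics.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    # Learnable scale (gamma) and shift (beta) restore representational power.
    out = gamma * x_hat + beta
    return out, running_mean, running_var

# Usage: a mini-batch of 4 examples with 3 features each.
x = np.random.randn(4, 3)
gamma, beta = np.ones(3), np.zeros(3)
running_mean, running_var = np.zeros(3), np.ones(3)
out, running_mean, running_var = batch_norm_forward(
    x, gamma, beta, running_mean, running_var, training=True)
```

The same call with `training=False` skips the batch statistics entirely and normalizes with the accumulated running mean and variance, which is why inference behavior does not depend on the composition of the batch.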