3. Recap
When training neural networks, we feed in observations and compare the
network's predicted output to the true (target) output. We then use gradient
descent to update the parameters of the model in the direction that
minimizes the difference between the prediction and the target. In other
words, we're attempting to minimize the error we observe in our model's
predictions.
Ultimately, gradient descent is a search over the loss surface for the
parameter values that minimize the loss function. In other words, we're
looking for the lowest point on the loss surface.
Source: Jeremy Jordan (jeremyjordan.me)
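To make the recap concrete, here is a minimal sketch of gradient descent on a toy one-dimensional loss; the quadratic loss, starting point, and learning rate are illustrative assumptions, not taken from the slides.

```python
# A minimal sketch of gradient descent on a toy 1-D quadratic loss.
# The loss, starting point, and learning rate are illustrative choices.
def loss(w):
    return (w - 3.0) ** 2        # surface with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)       # dL/dw, the slope of the surface at w

w = 0.0                          # sub-optimal starting point
lr = 0.1                         # learning rate (step size)
for _ in range(50):
    w -= lr * grad(w)            # step downhill on the loss surface

print(w, loss(w))                # w -> 3, loss -> 0
```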
25. Pros and Cons
Pros
➔ Allows sub-optimal starting points (weight initialization)
➔ Speeds up training (permits a larger learning rate)
➔ Helps mitigate the vanishing gradient problem
➔ Acts as a regularizer (introduces randomness)
Cons
➔ A small batch size leads to unstable mean and variance estimates;
use a batch size >= 32 (see the sketch after this list)
➔ Not suitable for RNNs (recurrent neural networks)
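A minimal NumPy sketch of the batch-norm transform and of the small-batch problem flagged above; the toy data, batch sizes, and function names are illustrative assumptions, not the slides' implementation.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature with the mean and variance of the current
    # mini-batch, then rescale with the learnable parameters gamma, beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=(1024, 1))    # toy feature, true mean 5.0

out = batch_norm(data[:32], gamma=np.ones(1), beta=np.zeros(1))
print(out.mean(), out.std())                    # ~0 mean, ~1 std

# Why small batches hurt: per-batch means are noisy estimates of the
# true mean, and the noise shrinks as the batch size grows.
for batch_size in (4, 32, 256):
    means = [data[i:i + batch_size].mean() for i in range(0, 1024, batch_size)]
    print(batch_size, float(np.std(means)))
```

The spread of the per-batch means is large at batch size 4 and much smaller by 32, which is why the slide recommends a batch size of at least 32.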
26. Further Reading
- What is Batch Normalization? (YouTube, 5:08, in Mandarin)
- Batch Normalization - Explained (YouTube, 8:48)
- Introduction to Batch Normalization (Medium, in Mandarin)
- Normalizing your data (specifically, input and batch normalization)