Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
1. Introduction to Deep Learning:
How to make your own deep
learning framework
Azure iPython Notebook
https://notebooks.azure.com/ryotat/libraries/DLTutorial
2. Agenda
• This lecture covers
• Introduction to machine learning
(keywords: model, training, inference, stochastic
gradient descent, overfitting)
• How to compute the gradient
(keywords: backpropagation, multi-layer perceptrons,
activation function)
3. What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, ML combines statistics and computational tools (optimization)
• Example (supervised learning) tasks, each realized by a model mapping input to output:
  • Image recognition / classification: image → “Cat” or “Dog”
  • Speech recognition: audio → “Hello”
  • Machine translation: “How are you?” → “Wie geht’s dir?”
  • Conversational agent / chatbot: “How are you?” → “I am fine thank you”
8. What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, ML combines statistics and computational tools (optimization)
• Example (supervised learning) tasks
  • Image recognition (image → “Cat” or “Dog”)
  • Speech recognition (audio → “Hello”)
  • Machine translation (“Hello” → “Bonjour”)
• Other forms of learning
  • Unsupervised learning
  • Reinforcement learning
9. Training and inference
Training
• The loss tells what the output of the model should have been
• The training objective can be overly optimistic (overfitting)
Inference (validation)
• Parameters are frozen; we care about the performance in this setting
[Diagram: training data feeds f_θ(x) into the loss against the label “cat”; at inference the frozen model outputs 0.99 (cat), 0.1 (dog)]
11. Learning objective
• Objective: minimize Loss(f_θ(x), y) for a randomly chosen (x, y) from some distribution D
• We don’t know the distribution D
• We only have access to (training) samples from D
[Diagram: input x (an image of the digit “4”) with its label y; the model with parameters θ produces the prediction f_θ(x)]
13. Mapping from input to prediction
• Input x: 784-dim
• Score z: 10-dim, computed as z = Wx + b (W, b: parameters)
• Probability p: 10-dim, computed by the softmax
  p_c = exp(z_c) / Σ_{c′} exp(z_{c′})
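The mapping above can be sketched in NumPy; the 784/10 dimensions follow the slide, while the random weights and input values are purely illustrative:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # softmax is invariant to adding a constant to z, so the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784)) * 0.01  # parameters (illustrative init)
b = np.zeros(10)
x = rng.random(784)                    # e.g. a flattened 28x28 image

z = W @ x + b   # score, 10-dim
p = softmax(z)  # probability, 10-dim
```
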
14. Cross-entropy loss
• Interpretation 1: you pay the penalty −log p_y, where y is the correct label
• Interpretation 2: the Kullback-Leibler divergence D_KL(target, model), where the target puts all the probability mass on the correct label (‘4’) and the model gives the prediction p
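As a quick sketch (the probability vector below is made up for illustration), the two interpretations coincide when the target is one-hot:

```python
import numpy as np

def cross_entropy(p, y):
    # Interpretation 1: pay -log of the probability assigned to the correct label.
    return -np.log(p[y])

p = np.array([0.1, 0.05, 0.05, 0.05, 0.6, 0.05, 0.02, 0.03, 0.03, 0.02])
y = 4  # correct label '4', as in the slide's example

loss = cross_entropy(p, y)

# Interpretation 2: KL divergence from the one-hot target to the prediction.
# The one-hot target has zero entropy, so KL equals the cross-entropy.
target = np.zeros(10)
target[y] = 1.0
kl = np.sum(np.where(target > 0, target * np.log(target / p), 0.0))
```
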
15. Landscape of training objective
• Objective function: L(θ) = (1/n) · Σ_{i=1}^{n} Loss(f_θ(x_i), y_i)
[Diagram: the loss surface over the parameters θ, from the initial parameters to the final parameters]
16. Landscape of training objective
[Diagram: parameter space (θ_1, θ_2) with θ_init, next to example space (x_1, x_2)]
17. Landscape of training objective
[Diagram: the same two spaces after training, with the parameters moving from θ_init to θ_final]
18. Gradient descent
• Initialize θ_0 randomly
• For t in 0, …, T_maxiter:
  θ_{t+1} = θ_t − η_t · ∇L(θ_t)
  where η_t is the learning rate (step size) and ∇L(θ_t) is the gradient of the objective
• Computing ∇L(θ_t) requires a full sweep over the training data
• Per-iteration computational cost = O(n)
[Diagram: descent trajectory θ_0 → θ_t → θ_{t+1} → θ_final on the loss surface]
19. Stochastic gradient descent (SGD)
• Initialize θ_0 randomly
• For t in 0, …, T_maxiter:
  θ_{t+1} = θ_t − η_t · ∇Loss(f_θ(x_i), y_i)
  where the index i is chosen randomly (the stochastic gradient)
• Computing ∇Loss(f_θ(x_i), y_i) requires only one training example
• Per-iteration computational cost = O(1)
20. Minibatch stochastic gradient descent
• Initialize θ_0 randomly
• For t in 0, …, T_maxiter:
  θ_{t+1} = θ_t − η_t · ∇_B L(θ_t)
  where the minibatch B is chosen randomly
• ∇_B L(θ_t) is the average gradient over a random subset of the data of size B
• Per-iteration computational cost = O(B)
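A minimal sketch of the minibatch SGD loop on a toy least-squares problem; all data below is synthetic, and only the update rule mirrors the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem: n examples, d features, noiseless linear labels.
n, d, B = 200, 5, 16
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true

def minibatch_grad(theta, idx):
    # Average gradient of the squared loss over the minibatch idx.
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)

theta = np.zeros(d)  # theta_0
eta = 0.05           # learning rate
for t in range(500):
    idx = rng.choice(n, size=B, replace=False)  # random minibatch of size B
    theta = theta - eta * minibatch_grad(theta, idx)
```

Each iteration touches only B examples, so the per-iteration cost is O(B) rather than O(n).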
21. Overfitting – what is signal vs. noise?
• Imagine: [figure contrasting training data (cat, dog) with held-out validation data]
• Powerful models are more likely to overfit
• We need validation data: leave out some portion of the training data to validate the generalizability of the model
23. Techniques to reduce overfitting
• Reduce the number of parameters
  • Parameter sharing (convnets, recurrent neural nets)
• Weight decay (aka L2 regularization)
  • Penalizes the magnitude of the parameters: Σ_{j=1}^{d} w_j²
• Early stopping
  • Indirectly controls the magnitude of the parameters
• More recent techniques
  • Dropout, batch normalization
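Weight decay can be sketched as an extra penalty term added to the training loss, together with its gradient (the value of λ below is an arbitrary choice):

```python
import numpy as np

def l2_penalty(w, lam):
    # lam * sum_j w_j^2, added to the training loss.
    return lam * np.sum(w ** 2)

def l2_grad(w, lam):
    # Gradient 2*lam*w: each step shrinks the weights toward zero,
    # which is why this is also called "weight decay".
    return 2.0 * lam * w

w = np.array([1.0, -2.0, 0.5])
lam = 0.01  # illustrative regularization strength
penalty = l2_penalty(w, lam)
```
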
24. Summary so far
• A machine learning problem can be specified by:
• Task: What’s the input? What’s the output?
• Model: maps from the input to some numbers
• Loss function: measures how the model is doing
• Training: mini-batch SGD on the sum of empirical losses
• Validation: Are we overfitting?
28. How do we compute the gradient?
• Manually
  • Tedious (model-specific), error-prone, and makes it hard to explore new models
• Algorithmically
  • Back-propagation
  • Allows researchers to focus on model building rather than implementing each model correctly
29. Back propagation for the linear predictor
• Identify how each variable influences the loss
• Computation graph: x, W, b → z → p → Loss, where
  z = W·x + b
  p = softmax(z)
30. Back propagation for the linear predictor
• Identify how each variable influences the loss
• Computation graph: x, W, b → z → p → Loss
• Chain rule along the graph:
  ∂Loss/∂W = ∂Loss/∂p · ∂p/∂z · ∂z/∂W
  ∂Loss/∂b = ∂Loss/∂p · ∂p/∂z · ∂z/∂b
31. Back propagation
• Don’t repeat shared computation: propagate the gradients backward
• Computation graph: x, W, b → z → p → Loss
  ∂Loss/∂W = ∂Loss/∂p · ∂p/∂z · ∂z/∂W
  ∂Loss/∂b = ∂Loss/∂p · ∂p/∂z · ∂z/∂b
• The factor ∂Loss/∂p · ∂p/∂z is shared between the two derivatives
32. Back propagation
• Don’t repeat shared computation: propagate the gradients backward
• Computation graph: x, W, b → z → p → Loss
  ∂Loss/∂W = ∂Loss/∂p · ∂p/∂z · ∂z/∂W
  ∂Loss/∂b = ∂Loss/∂p · ∂p/∂z · ∂z/∂b
• Computed backward, each intermediate gradient is reused:
  Δp = ∂Loss/∂p
  Δz = Δp · ∂p/∂z
  ΔW = Δz · ∂z/∂W
  Δb = Δz · ∂z/∂b
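This backward pass can be sketched for the linear predictor with softmax and cross-entropy loss. The sizes and values below are illustrative; the code uses the standard identity that for softmax plus cross-entropy, Δz = p − onehot(y), and checks one gradient entry against a finite difference:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, C = 4, 3                  # illustrative small sizes
W = rng.normal(size=(C, D))
b = np.zeros(C)
x = rng.random(D)
y = 1                        # correct class index

# Forward pass
z = W @ x + b
p = softmax(z)
loss = -np.log(p[y])

# Backward pass: each Delta is computed once and reused.
dz = p.copy()
dz[y] -= 1.0                 # Delta_z for softmax + cross-entropy
dW = np.outer(dz, x)         # Delta_W = Delta_z . dz/dW
db = dz.copy()               # Delta_b = Delta_z . dz/db

# Finite-difference check on one entry of W
eps = 1e-6
W2 = W.copy()
W2[0, 0] += eps
numeric = (-np.log(softmax(W2 @ x + b)[y]) - loss) / eps
```
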
47. Deep learning frameworks
• Collection of implementations of popular layers (or
modules), e.g., ReLU, Softmax, Convolution, RNNs
• Provides an easy front-end to the layers/modules
• Handles different array libraries / hardware backends (CPUs,
GPUs, …)
• If there were an exchange format…
49. Books
• Convex Optimization, by Stephen Boyd and Lieven Vandenberghe (Cambridge University Press)
• Information Theory, Inference and Learning Algorithms, by David J. C. MacKay (2003)
• Neural Networks for Pattern Recognition, by Christopher M. Bishop
50. Conclusion
• Training a network consists of
• Forward propagation: computing the loss
• Backward propagation: computing the gradient
• Parameter update: move in the direction of the computed stochastic gradient
• A fairly standard set of building blocks is used to build complex models
• Linear, ReLU, Softmax, Tanh, …
• Advanced topics
• How to prevent overfitting
• How to scale neural network training to multiple machines / devices
57. Dropout
• Idea: randomly drop activations during training
• Benefit: reduces overfitting and improves
generalization
• Can be implemented as a layer
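A minimal sketch of dropout as a layer, using the common “inverted dropout” variant so that inference needs no rescaling (the drop probability and sizes are illustrative):

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    # During training, zero each activation with probability p_drop and
    # rescale the survivors by 1/(1 - p_drop); at inference, pass through.
    if not train:
        return h
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones(10000)
out = dropout(h, 0.5, rng)  # roughly half the units are zeroed
```

Because survivors are rescaled, the expected activation is unchanged, so the same forward code works at inference with `train=False`.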
59. Batch normalization
[Diagram: a 784-dim input, scaled to [0, 1], feeds a 1024-dim hidden layer with weights and biases drawn from N(0, 1); each unit’s pre-activation is a random variable with scale at most 784, and the per-unit statistics (μ_1, σ_1), …, (μ_H, σ_H) differ across units]
60. Batch normalization [Ioffe & Szegedy, 2015]
• Idea: normalize the activation of each unit to have zero
mean and unit standard deviation using a mini-batch
estimate of mean and variance.
• Benefit: more stable and faster training. Often
generalizes better
• Can be implemented as a layer
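A sketch of the batch-normalization forward pass over a minibatch, including the learnable scale γ and shift β (the synthetic batch below is only for illustration):

```python
import numpy as np

def batchnorm_forward(H, gamma, beta, eps=1e-5):
    # Normalize each unit (column) to zero mean and unit variance using
    # minibatch statistics, then apply the learnable scale and shift.
    mu = H.mean(axis=0)
    var = H.var(axis=0)
    H_hat = (H - mu) / np.sqrt(var + eps)
    return gamma * H_hat + beta

rng = np.random.default_rng(0)
H = rng.normal(loc=3.0, scale=5.0, size=(64, 8))  # batch of 64, 8 units
out = batchnorm_forward(H, gamma=np.ones(8), beta=np.zeros(8))
```
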
62. Adaptive optimization algorithms
• Adam [Kingma & Ba 2015]: uses first- and second-moment statistics of the gradients so that the updates are normalized
• Benefit: mitigates the effect of vanishing/exploding gradient scales
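A sketch of one Adam step with the usual bias-corrected moment estimates (the hyperparameter defaults follow the paper; the toy quadratic objective is only for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient (m, first moment)
    # and of its square (v, second moment); the bias-corrected ratio
    # yields a step whose scale is roughly eta, regardless of gradient scale.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = sum(theta^2), whose gradient is 2*theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, eta=0.01)
```
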
63. Learning rate decay
• Reduce the learning rate or step-size parameter (𝜂) once in
a while
• Typical setting: multiply 𝜂 by 0.98 every epoch.
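The schedule can be sketched in one line (η_0 = 0.1 is an arbitrary starting value):

```python
# Multiplicative decay: after each epoch, the learning rate shrinks by 2%.
eta0, decay = 0.1, 0.98
etas = [eta0 * decay ** epoch for epoch in range(100)]
```
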
64. Summary
• Training and inference
• Training objective and optimization
• Neural networks and backpropagation
• Importance of software tools – turns research into Lego
block engineering
• Various tricks to speed-up training and reduce
overfitting
66. Gradient explosion/diminishing problem
• A chain of linear layers h_0 → h_1 → h_2 → h_3, each multiplying by W in the forward pass, multiplies by W^T at each step of the backward pass:
  Δh_2 = W^T · Δh_3
  Δh_1 = (W^T)² · Δh_3
  Δh_0 = (W^T)³ · Δh_3
• The gradient is magnified or diminished by a factor of W at every layer.
• If we have many layers, gradients can explode or diminish to zero.
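The effect is easy to reproduce numerically. The matrix scales below are arbitrary, chosen so that one chain has spectral norm above 1 (explodes) and the other below 1 (vanishes):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
delta0 = rng.normal(size=d)  # gradient arriving at the top layer

def backprop_norm(W, n_layers=20):
    # Repeatedly apply Delta <- W^T Delta, as in the chain above.
    delta = delta0.copy()
    for _ in range(n_layers):
        delta = W.T @ delta
    return np.linalg.norm(delta)

big = rng.normal(size=(d, d)) * 0.5  # spectral norm well above 1
small = big * 0.01                   # spectral norm well below 1

explode = backprop_norm(big) / np.linalg.norm(delta0)
vanish = backprop_norm(small) / np.linalg.norm(delta0)
```
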
67. What is a model?
• A model is a function specified by a set of parameters θ
• Example: linear predictor
  f_θ(x) = wᵀ·x + b, with θ = (w, b)
[Diagram: the model f_θ(x), with its parameters, outputs 0.99]
68. What is a model?
• A model is a function specified by a set of parameters θ
• Example: linear predictor
  f_θ(x) = wᵀ·x + b, with θ = (w, b)
[Diagram: each input x_1, …, x_5 is multiplied by its weight w_1, …, w_5; the products are summed and the bias b is added]
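The picture above is just an elementwise multiply, a sum, and a bias, which equals the dot product wᵀx + b; a sketch with made-up numbers:

```python
import numpy as np

w = np.array([0.2, -0.5, 1.0, 0.0, 0.3])   # weights w1..w5 (illustrative)
x = np.array([1.0, 2.0, 0.5, 3.0, -1.0])   # inputs x1..x5 (illustrative)
b = 0.1                                     # bias

# Multiply each x_i by its w_i, sum, then add b: exactly the slide's picture.
f = np.sum(w * x) + b
```
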
71. Loss functions for binary classification
• p = prediction, y = ground truth (0 or 1)
• Misclassification loss (i.e., negative accuracy): I[(p − 0.5) · (y − 0.5) < 0]
• Squared loss: (p − y)²
• Cross-entropy loss: −y·log(p) − (1 − y)·log(1 − p)
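The three losses can be sketched directly; the misclassification loss is written as an indicator that the thresholded prediction disagrees with the label:

```python
import numpy as np

def misclassification(p, y):
    # 1 when (p - 0.5) and (y - 0.5) have opposite signs, i.e. the
    # prediction thresholded at 0.5 disagrees with the 0/1 label.
    return float((p - 0.5) * (y - 0.5) < 0)

def squared(p, y):
    return (p - y) ** 2

def cross_entropy(p, y):
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

p, y = 0.9, 1  # illustrative confident, correct prediction
```
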