Introduction to Deep Learning:
How to make your own deep
learning framework
Azure iPython Notebook
https://notebooks.azure.com/ryotat/libraries/DLTutorial
Agenda
• This lecture covers
• Introduction to machine learning
(keywords: model, training, inference, stochastic
gradient descent, overfitting)
• How to compute the gradient
(keywords: backpropagation, multi-layer perceptrons,
activation function)
What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, combines statistics and
computational tools (optimization)
• Example (supervised learning) tasks
Cat or Dog
Image recognition / classification
Model
What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, combines statistics and
computational tools (optimization)
• Example (supervised learning) tasks
“Hello”
Speech recognition
Model
What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, combines statistics and
computational tools (optimization)
• Example (supervised learning) tasks
“How are you?” “Wie geht’s dir?”
Machine translation
Model
What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, combines statistics and
computational tools (optimization)
• Example (supervised learning) tasks
“How are you?” “I am fine thank you”
Conversational agent / chatbot
Model
What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, combines statistics and
computational tools (optimization)
• Example (supervised learning) tasks
Model
“How are you?”
“I am fine thank you”
Cat or Dog
“Wie geht’s dir?”
What is Machine Learning (ML)?
• The goal of ML is to learn from data
• Technically, combines statistics and
computational tools (optimization)
• Example (supervised learning) tasks
• Image recognition
• Speech recognition
• Machine translation
• Other forms of learning
• Unsupervised learning
• Reinforcement learning
Cat or Dog
“Hello”
“Hello” “Bonjour”
Training and inference
Training
• The loss tells what the output of
the model should have been
• Training objective can be overly
optimistic (overfitting)
Inference (validation)
• We care about the performance
in this setting
Training
data
𝑓𝜃(𝑥) Loss 𝑓𝜃(𝑥)
Frozen
parameter
cat
0.99
(cat)
0.1
(dog)
Learning objective
• Objective: minimize
Loss 𝑓𝜃 𝑥 , 𝑦
for a randomly chosen (x,y) from some distribution D.
• We don’t know the distribution D.
• We only have access to (training) samples from D.
input x label
4
𝑓𝜃 𝑥
prediction
𝜃: parameters
model
Training objective
Training
data
Objective function: L(θ) = (1/n) ⋅ Σᵢ₌₁ⁿ Loss(f_θ(xᵢ), yᵢ)
Approximate the unknown distribution D with the training data average
(x₁ = image of a ‘4’, y₁ = 4) → f_θ(x₁) → Loss
(x₂ = image of a ‘5’, y₂ = 5) → f_θ(x₂) → Loss
+ …
Mapping from input to prediction
input x (784-dim) → [ z = Wx + b ] → score z (10-dim) → [ Softmax: p_c = exp(z_c) / Σ_{c′} exp(z_{c′}) ] → probability p (10-dim)
W, b: parameters
Cross-entropy loss
• Interpretation 1: you pay a penalty of −log p_y, where y is the correct label.
• Interpretation 2: Kullback-Leibler divergence D_KL(target, model)
(Figure: the target distribution, with all the probability mass on the correct label (‘4’), compared with the prediction.)
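To make the last two slides concrete, here is a minimal NumPy sketch of the linear score, the softmax, and the cross-entropy penalty −log p_y. The shapes (784-dim input, 10 classes) follow the slides; the initialization and variable names are illustrative and not taken from the lecture notebook.

```python
import numpy as np

# Minimal sketch of the linear model + softmax + cross-entropy from the slides.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(10, 784))    # parameters
b = np.zeros(10)

x = rng.random(784)                           # one flattened input image
y = 4                                         # correct label

z = W @ x + b                                 # score (10-dim)
p = np.exp(z - z.max())                       # softmax, shifted for numerical stability
p /= p.sum()

loss = -np.log(p[y])                          # cross-entropy: penalty of -log p_y
print(loss)
```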
Landscape of training objective
Objective function: L(θ) = (1/n) ⋅ Σᵢ₌₁ⁿ Loss(f_θ(xᵢ), yᵢ)
(Plot: the objective L(θ) over the parameters θ, descending from the initial parameters to the final parameters.)
Landscape of training objective
(Figure: parameter space with axes θ₁, θ₂, showing θ_init; example space with axes x₁, x₂, showing the model at θ_init.)
Landscape of training objective
(Figure: parameter space with axes θ₁, θ₂, showing the path from θ_init to θ_final; example space with axes x₁, x₂, showing the model at θ_final.)
Gradient descent
• Initialize 𝜃0 randomly
• For t in 0,…, Tmaxiter
θ_{t+1} = θ_t − η_t ⋅ ∇L(θ_t)
(∇L(θ_t): gradient of the objective; η_t: learning rate, i.e. step size)
(Figure: the iterates θ₀, …, θ_t, θ_{t+1}, … move toward θ_final.)
• Computation of ∇L(θ_t) requires a full sweep over the training data
• Per-iteration comp. cost = O(n)
Stochastic gradient descent (SGD)
• Initialize 𝜃0 randomly
• For t in 0,…, Tmaxiter
θ_{t+1} = θ_t − η_t ⋅ ∇Loss(f_θ(x_i), y_i), where index i is chosen randomly
(Figure: the iterates θ₀, …, θ_t, θ_{t+1}, … move toward θ_final.)
• Computation of the stochastic gradient ∇Loss(…) requires only one training example
• Per-iteration comp. cost = O(1)
Minibatch stochastic gradient descent
• Initialize 𝜃0 randomly
• For t in 0,…, Tmaxiter
θ_{t+1} = θ_t − η_t ⋅ ∇_B L(θ), where minibatch B is chosen randomly
(Figure: the iterates θ₀, …, θ_t, θ_{t+1}, … move toward θ_final.)
• The minibatch gradient ∇_B L(θ) is the average gradient over a random subset of the data of size B
• Per-iteration comp. cost = O(B)
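The three variants above differ only in how many examples enter each gradient. Here is a hedged sketch of the minibatch SGD loop applied to the softmax model, on synthetic data and with illustrative hyperparameters; setting B = 1 recovers plain SGD and B = n recovers full-batch gradient descent.

```python
import numpy as np

# Sketch of minibatch SGD on the linear softmax model; data and settings are synthetic.
rng = np.random.default_rng(0)
n, d, k = 1000, 784, 10
X = rng.random((n, d))
Y = rng.integers(0, k, size=n)

W, b = np.zeros((k, d)), np.zeros(k)
eta, B = 0.1, 32                                  # learning rate and minibatch size

for t in range(200):
    idx = rng.choice(n, size=B, replace=False)    # random minibatch
    Z = X[idx] @ W.T + b                          # scores, shape (B, k)
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
    G = P.copy()
    G[np.arange(B), Y[idx]] -= 1.0                # dLoss/dz for each example
    W -= eta * (G.T @ X[idx]) / B                 # theta_{t+1} = theta_t - eta * minibatch gradient
    b -= eta * G.mean(axis=0)
```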
Overfitting – what is signal vs noise?
• Imagine:
• Powerful models are more likely to overfit
• We need validation data: leave out some portion of the
training data to validate the generalizability of the model
Training
data
cat
dog
Validation data
Typical learning curve
(Plot: training loss and validation loss as a function of the number of training steps.)
Techniques to reduce overfitting
• Reduce the number of parameters
• Parameter sharing (convnets, recurrent neural nets)
• Weight decay (aka L2 regularization)
• Penalizes the magnitude of the parameters: Σⱼ₌₁ᵈ wⱼ² (see the sketch after this list)
• Early stopping
• Indirectly controls the magnitude of the parameters
• More recent techniques
• Dropout, batch normalization
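As a small illustration of the weight-decay item above, the penalty λ ⋅ Σⱼ wⱼ² simply adds 2λW to the gradient used in the parameter update. A sketch with illustrative values, where grad_W stands in for the data gradient computed by backpropagation:

```python
import numpy as np

# Sketch: weight decay (L2 regularization) adds lam * sum_j w_j**2 to the objective,
# so its contribution to the gradient is 2 * lam * W.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))
grad_W = rng.normal(size=(10, 784))   # stand-in for the data gradient from backprop
eta, lam = 0.1, 1e-4

penalty = lam * np.sum(W ** 2)        # value added to the training objective
W -= eta * (grad_W + 2 * lam * W)     # SGD step including the weight-decay gradient
```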
Summary so far
• A machine learning problem can be specified by:
• Task: What’s the input? What’s the output?
• Model: maps from the input to some numbers
• Loss function: measures how the model is doing
• Training: mini-batch SGD on the sum of empirical losses
• Validation: Are we overfitting?
How do we compute the gradient?
d/dx ( x² + exp(x²) ) = 2x + 2x ⋅ exp(x²)
(Computation graph: x → Square → Add, and x → Square → Exp → Add.)
d/dx ( x² + exp(x²) ) = 2x + 2x ⋅ exp(x²)
(Same computation graph with each Square node implemented as Mul(x, x).)
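A quick way to sanity-check a hand-derived gradient like the one above is to compare it against a finite difference, which is also how backpropagation code is commonly tested. A small sketch:

```python
import numpy as np

# Sketch: check the analytic gradient 2x + 2x*exp(x^2) against a central finite difference,
# mirroring the computation graph above.
def f(x):
    return x ** 2 + np.exp(x ** 2)

def df(x):
    return 2 * x + 2 * x * np.exp(x ** 2)

x, eps = 0.7, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(df(x), numeric)   # the two values should agree to several decimal places
```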
How do we compute the gradient?
• Manually
• Tedious (model-specific), error-prone, and makes it hard to explore new models
• Algorithmically
• Back-propagation
• Allows researchers to focus on model building rather than on implementing each model's gradients correctly
Back propagation for the linear predictor
• Identify how each variable influences the loss
x
w b
z p Loss 𝑧 = 𝑊 ⋅ 𝑥 + 𝑏
𝑝 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑧)
Back propagation for the linear predictor
• Identify how each variable influences the loss
x
w b
z p Loss
∂Loss/∂W = (∂Loss/∂p) ⋅ (∂p/∂z) ⋅ (∂z/∂W)
∂Loss/∂b = (∂Loss/∂p) ⋅ (∂p/∂z) ⋅ (∂z/∂b)
Back propagation
• Don’t repeat shared compute. Propagate the gradients
backward
x
w b
z p Loss
∂Loss/∂W = (∂Loss/∂p) ⋅ (∂p/∂z) ⋅ (∂z/∂W)
∂Loss/∂b = (∂Loss/∂p) ⋅ (∂p/∂z) ⋅ (∂z/∂b)
Back propagation
• Don’t repeat shared compute. Propagate the gradients
backward
x
w b
z p Loss
∂Loss/∂W = (∂Loss/∂p) ⋅ (∂p/∂z) ⋅ (∂z/∂W)
∂Loss/∂b = (∂Loss/∂p) ⋅ (∂p/∂z) ⋅ (∂z/∂b)
Δp = ∂Loss/∂p
Δz = Δp ⋅ (∂p/∂z)
ΔW = Δz ⋅ (∂z/∂W)
Δb = Δz ⋅ (∂z/∂b)
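The delta recursion above can be written directly in NumPy for a single example. One hedged sketch, using the standard simplification that for softmax followed by cross-entropy the product Δp ⋅ (∂p/∂z) equals p − onehot(y); the initialization and names are illustrative:

```python
import numpy as np

# Sketch of the backward recursion for the linear predictor (single example).
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.01, size=(10, 784)), np.zeros(10)
x, y = rng.random(784), 4

# forward
z = W @ x + b
p = np.exp(z - z.max()); p /= p.sum()
loss = -np.log(p[y])

# backward: propagate deltas from the loss toward the parameters
delta_z = p.copy()
delta_z[y] -= 1.0                 # Δz = Δp · ∂p/∂z  (simplified form for softmax + CE)
delta_W = np.outer(delta_z, x)    # ΔW = Δz · ∂z/∂W
delta_b = delta_z                 # Δb = Δz · ∂z/∂b
```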
Going deeper
Going deeper
input x (784-dim) → [ h_in = W₀x + b₀ ] → pre-hidden h_in (1024-dim) → [ ReLU ] → hidden h (1024-dim) → [ z = W₁h + b₁ ] → score z (10-dim) → [ Softmax ] → probability p (10-dim)
W₀, b₀, W₁, b₁: parameters
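A forward-pass sketch of this two-layer network in NumPy, with the sizes from the slide (784 → 1024 → 10) and illustrative initialization:

```python
import numpy as np

# Sketch of the two-layer network: Linear -> ReLU -> Linear -> Softmax.
rng = np.random.default_rng(0)
W0, b0 = rng.normal(scale=0.01, size=(1024, 784)), np.zeros(1024)
W1, b1 = rng.normal(scale=0.01, size=(10, 1024)), np.zeros(10)

x = rng.random(784)
h_in = W0 @ x + b0                       # pre-hidden
h = np.maximum(0.0, h_in)                # ReLU
z = W1 @ h + b1                          # score
p = np.exp(z - z.max()); p /= p.sum()    # softmax probabilities
```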
Activation functions
• Rectified Linear Unit (ReLU)
ℎ = max 0, ℎ𝑖𝑛
• Hyperbolic tangent (tanh)
ℎ = tanh(ℎ𝑖𝑛)
• Sigmoid
ℎ = 1 / (1 + e^(−h_in))
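For backpropagation each activation also needs its derivative. A sketch of the three activations above together with the derivatives used in the backward pass:

```python
import numpy as np

# Sketch of the activation functions and their derivatives (used during backprop).
def relu(h_in): return np.maximum(0.0, h_in)
def relu_grad(h_in): return (h_in > 0).astype(float)

def tanh(h_in): return np.tanh(h_in)
def tanh_grad(h_in): return 1.0 - np.tanh(h_in) ** 2

def sigmoid(h_in): return 1.0 / (1.0 + np.exp(-h_in))
def sigmoid_grad(h_in):
    s = sigmoid(h_in)
    return s * (1.0 - s)
```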
XOR problem
Rectified linear unit (ReLU)
Comparison to tanh
(Figure: the gradient of tanh is almost zero in its saturated regions, while the gradient can flow backwards through ReLU: "ReLU is non-saturating".)
Chain rule
x
W0 b0
hin p Loss
W1 b1
z
h
∂Loss/∂W₁ = (∂Loss/∂p) ⋅ ⋯ ⋅ (∂z/∂W₁)
∂Loss/∂b₁ = (∂Loss/∂p) ⋅ ⋯ ⋅ (∂z/∂b₁)
∂Loss/∂W₀ = (∂Loss/∂p) ⋅ ⋯ ⋅ (∂h_in/∂W₀)
∂Loss/∂b₀ = (∂Loss/∂p) ⋅ ⋯ ⋅ (∂h_in/∂b₀)
• Identify how each variable influences the loss
Back propagation
x
W0 b0
hin p Loss
W1 b1
z
h
Δp = ∂Loss/∂p
Δz = Δp ⋅ (∂p/∂z)
ΔW₁ = Δz ⋅ (∂z/∂W₁)
Δb₁ = Δz ⋅ (∂z/∂b₁)
Δh = Δz ⋅ (∂z/∂h)
Δh_in = Δh ⋅ (∂h/∂h_in)
ΔW₀ = Δh_in ⋅ (∂h_in/∂W₀)
Δb₀ = Δh_in ⋅ (∂h_in/∂b₀)
Softmax
ReLU Linear
Linear
• Don’t repeat shared compute - Propagate the gradients backward
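Putting the pieces together, here is a hedged sketch of one forward and backward pass through the two-layer network for a single example, following the deltas on the slide; the initialization and names are illustrative:

```python
import numpy as np

# Sketch of forward + backward for Linear -> ReLU -> Linear -> Softmax + cross-entropy.
rng = np.random.default_rng(0)
W0, b0 = rng.normal(scale=0.01, size=(1024, 784)), np.zeros(1024)
W1, b1 = rng.normal(scale=0.01, size=(10, 1024)), np.zeros(10)
x, y = rng.random(784), 4

# forward
h_in = W0 @ x + b0
h = np.maximum(0.0, h_in)
z = W1 @ h + b1
p = np.exp(z - z.max()); p /= p.sum()
loss = -np.log(p[y])

# backward, following the deltas on the slide
delta_z = p.copy(); delta_z[y] -= 1.0   # Δz (softmax + cross-entropy simplification)
delta_W1 = np.outer(delta_z, h)         # ΔW1 = Δz · ∂z/∂W1
delta_b1 = delta_z                      # Δb1
delta_h = W1.T @ delta_z                # Δh = Δz · ∂z/∂h
delta_h_in = delta_h * (h_in > 0)       # Δh_in = Δh · ∂h/∂h_in (ReLU gate)
delta_W0 = np.outer(delta_h_in, x)      # ΔW0
delta_b0 = delta_h_in                   # Δb0
```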
More complex example
x
W0 b0
hin p Loss
W1 b1
z0
h
Softmax
ReLU Linear
Linear
z
Δp = ∂Loss/∂p
Δz = Δp ⋅ (∂p/∂z)
Δz₀ = Δz ⋅ (∂z/∂z₀)
ΔW₁ = Δz₀ ⋅ (∂z₀/∂W₁)
Δb₁ = Δz₀ ⋅ (∂z₀/∂b₁)
Δh = Δz₀ ⋅ (∂z₀/∂h) + Δz ⋅ (∂z/∂h)
Δh_in = Δh ⋅ (∂h/∂h_in)
ΔW₀ = Δh_in ⋅ (∂h_in/∂W₀)
Δb₀ = Δh_in ⋅ (∂h_in/∂b₀)
Demo
Deep learning frameworks
• Collection of implementations of popular layers (or
modules), e.g., ReLU, Softmax, Convolution, RNNs
• Provides an easy front-end to the layers/modules
• Handles different array libraries / hardware backends (CPUs,
GPUs, …)
• If there were an exchange format…
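In the spirit of the lecture title, here is a minimal sketch of the layer abstraction such a framework provides: each module implements forward() and backward(), and a network is just a list of modules. This is an illustrative toy under my own naming, not the API of any particular framework or of the lecture notebook.

```python
import numpy as np

# Toy layer abstraction: modules with forward/backward, chained into a network.
class Linear:
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.01, size=(n_out, n_in))
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x                          # cache the input for the backward pass
        return self.W @ x + self.b
    def backward(self, delta):
        self.dW = np.outer(delta, self.x)   # parameter gradients
        self.db = delta
        return self.W.T @ delta             # delta passed to the layer below

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return np.maximum(0.0, x)
    def backward(self, delta):
        return delta * self.mask

rng = np.random.default_rng(0)
net = [Linear(784, 1024, rng), ReLU(), Linear(1024, 10, rng)]
activation = rng.random(784)
for layer in net:                           # forward pass through the stack
    activation = layer.forward(activation)
```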
Books
Convex Optimization
Stephen Boyd and Lieven
Vandenberghe
Cambridge University Press
Information Theory,
Inference and Learning
Algorithms
David J. C. MacKay
2003
Neural Networks for
Pattern Recognition
Christopher M. Bishop
Conclusion
• Training a network consists of
• Forward propagation: computing the loss
• Backward propagation: computing the gradient
• Parameter update: move in the direction opposite to the computed stochastic gradient
• A fairly standard set of building blocks is used to build complex models
• Linear, ReLU, Softmax, Tanh, …
• Advanced topics
• How to prevent overfitting
• How to scale neural network training to multiple machines / devices
Advanced topics
Minibatch size and convergence speed
(Plots: error vs. #samples processed for full-batch, large-minibatch, and small-minibatch gradient descent, with the parameter updates marked; 1 pass over the dataset = 1 epoch.)
Effect of minibatch size B
Learning rate 𝜂 scaled as 𝜂 = 0.025 ⋅ 𝐵
Dropout – randomly drops activations during training (Instance 1)
Dropout – randomly drops activations during training (Instance 2)
Dropout – randomly drops activations during training (Instance 3)
Dropout
• Idea: randomly drop activations during training
• Benefit: reduces overfitting and improves
generalization
• Can be implemented as a layer
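Here is a sketch of dropout implemented as a layer, using the common "inverted dropout" convention of scaling kept activations by 1/(1 − drop_rate) at training time so that inference needs no change; the drop rate and names are illustrative, not from the lecture.

```python
import numpy as np

# Sketch of dropout as a layer with a forward/backward interface.
class Dropout:
    def __init__(self, drop_rate=0.5, rng=None):
        self.drop_rate = drop_rate
        self.rng = rng or np.random.default_rng()
    def forward(self, h, training=True):
        if not training:
            return h                        # identity at inference time
        keep = self.rng.random(h.shape) >= self.drop_rate
        self.mask = keep / (1.0 - self.drop_rate)   # "inverted dropout" scaling
        return h * self.mask
    def backward(self, delta):
        return delta * self.mask            # gradients flow only through kept units
```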
Batch normalization
(Figure: inputs scaled to [0,1], 784-dim; weights and biases drawn from N(0, 1); each of the 1024 hidden units is a random variable with scale at most 784.)
Batch normalization
(Figure: the same setup, with inputs scaled to [0,1] (784-dim) and weights and biases drawn from N(0, 1); each of the 1024 hidden units, a random variable with scale at most 784, is standardized using its per-unit mini-batch statistics μ₁, σ₁, …, μ_H, σ_H.)
Batch normalization [Ioffe & Szegedy, 2015]
• Idea: normalize the activation of each unit to have zero
mean and unit standard deviation using a mini-batch
estimate of mean and variance.
• Benefit: more stable and faster training. Often
generalizes better
• Can be implemented as a layer
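A sketch of the batch-normalization forward pass described above: each hidden unit is standardized with its mini-batch mean and variance, then given a learnable scale γ and shift β, following the usual formulation; eps and the shapes are illustrative.

```python
import numpy as np

# Sketch of the batch-norm forward pass over a mini-batch of pre-activations.
def batchnorm_forward(H, gamma, beta, eps=1e-5):
    mu = H.mean(axis=0)                    # per-unit mini-batch mean
    var = H.var(axis=0)                    # per-unit mini-batch variance
    H_hat = (H - mu) / np.sqrt(var + eps)  # zero mean, unit standard deviation
    return gamma * H_hat + beta            # learnable scale and shift

B, D = 32, 1024
rng = np.random.default_rng(0)
H = rng.normal(size=(B, D))                # pre-activations for one mini-batch
out = batchnorm_forward(H, gamma=np.ones(D), beta=np.zeros(D))
```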
More optimization algorithms
• Momentum SGD: improves SGD by incorporating
“momentum”
Adaptive optimization algorithms
• Adam [Kingma & Ba 2015]: uses first and second order
statistics of the gradients so that gradients are
normalized
• Benefit: prevents the vanishing/exploding gradient
problem
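A hedged sketch of the two update rules, momentum SGD and Adam, written as plain functions over a parameter vector; the hyperparameter values shown are the commonly used defaults, not values from the lecture.

```python
import numpy as np

# Sketch of momentum SGD and Adam parameter updates.
def momentum_step(theta, grad, velocity, eta=0.01, mu=0.9):
    velocity = mu * velocity - eta * grad          # accumulate "momentum"
    return theta + velocity, velocity

def adam_step(theta, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t is the update count, starting at 1
    m = b1 * m + (1 - b1) * grad                   # first-order statistics
    v = b2 * v + (1 - b2) * grad ** 2              # second-order statistics
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # normalized step
    return theta, m, v
```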
Learning rate decay
• Reduce the learning rate or step-size parameter (𝜂) once in
a while
• Typical setting: multiply 𝜂 by 0.98 every epoch.
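A minimal sketch of that schedule:

```python
# Sketch of exponential learning-rate decay: shrink eta after every epoch.
eta = 0.1
for epoch in range(100):
    # ... run one epoch of minibatch SGD with step size eta ...
    eta *= 0.98        # multiply eta by 0.98 at the end of each epoch
```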
Summary
• Training and inference
• Training objective and optimization
• Neural networks and backpropagation
• Importance of software tools – turns research into Lego
block engineering
• Various tricks to speed-up training and reduce
overfitting
Gradient explosion/diminishing problem
(Figure: a stack of Linear + ReLU layers with weights W₁, W₂, W₃ mapping h₀ → h₁ → h₂ → h₃ in the forward pass; the backward pass multiplies by W₃ᵀ, W₂ᵀ, W₁ᵀ.)
Gradient explosion/diminishing problem
(Figure: the same stack of Linear layers with the weight W tied across layers: forward h₀ → h₁ → h₂ → h₃, backward multiplies by Wᵀ at every layer.)
Δh₂ = Wᵀ Δh₃,  Δh₁ = (Wᵀ)² Δh₃,  Δh₀ = (Wᵀ)³ Δh₃
• The gradient is magnified or diminished by a factor of W at every layer.
• With many layers, gradients can explode or diminish to zero.
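A tiny numerical sketch of this effect: repeatedly multiplying a gradient by Wᵀ scales its norm geometrically, so slightly contractive or expansive weights lead to vanishing or exploding gradients after many layers. The matrix used here is an illustrative scaled identity.

```python
import numpy as np

# Sketch: backprop through 50 tied linear layers scales the gradient norm geometrically.
rng = np.random.default_rng(0)
for scale in (0.9, 1.1):                      # contractive vs. expansive weights
    W = scale * np.eye(100)                   # illustrative weight matrix
    delta = rng.normal(size=100)
    for _ in range(50):                       # 50 layers
        delta = W.T @ delta
    print(scale, np.linalg.norm(delta))       # roughly 0.9**50 vs 1.1**50 of the original norm
```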
What is a model?
• A model is a function specified by a set of parameters 𝜃
• Example: linear predictor
f_θ(x) = wᵀ ⋅ x + b   (θ = (w, b))
0.99
parameters
𝑓𝜃(𝑥)
What is a model?
• A model is a function specified by a set of parameters 𝜃
• Example: linear predictor
f_θ(x) = wᵀ ⋅ x + b   (θ = (w, b))
parameters
(Figure: each input x₁, …, x₅ is multiplied by its weight w₁, …, w₅; the products are summed and the bias b is added.)
Training and inference
Training
• The loss tells what the output of
the model should have been
• Training objective can be overly
optimistic (overfitting)
Inference (validation)
• We care about the performance
in this setting
Training
data
𝑓𝜃(𝑥) Loss 𝑓𝜃(𝑥)
Frozen
parameter
cat
0.99
(cat)
0.1
(dog)
Loss functions for binary classification
(p = prediction, y = ground truth, 0 or 1)
• Misclassification loss (the negative of accuracy): I( (p − 0.5) ⋅ (y − 0.5) < 0 )
• Squared loss: (p − y)²
• Cross-entropy loss: −y ⋅ log p − (1 − y) ⋅ log(1 − p)
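A sketch of the three losses as functions, with the misclassification indicator written out as (p − 0.5)(y − 0.5) < 0 for p in [0,1] and y in {0,1}:

```python
import numpy as np

# Sketch of the three binary-classification losses above.
def misclassification_loss(p, y):
    return float((p - 0.5) * (y - 0.5) < 0)   # 1 if the prediction is on the wrong side of 0.5

def squared_loss(p, y):
    return (p - y) ** 2

def cross_entropy_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)              # avoid log(0)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(misclassification_loss(0.8, 1), squared_loss(0.8, 1), cross_entropy_loss(0.8, 1))
```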