"A Shallow Dive into Training Deep Neural Networks," a Presentation from DeepScale

1. Copyright © 2017 DeepScale 1
A Shallow Dive into Training Deep
Neural Networks
Sammy Sidhu
May 2017
2. Copyright © 2017 DeepScale 2
• Perception systems for autonomous vehicles
• Focusing on enabling technologies for mass-produced autonomous
vehicles
• Working with a number of OEMs and automotive suppliers
• Open Source ☺
• Visit http://deepscale.ai
About DeepScale
3. Copyright © 2017 DeepScale 3
• Feature Engineering vs. Learned Features
• Neural Network Review
• Loss Function (Objective Function)
• Gradients
• Optimization Techniques
• Datasets
• Overfitting and Underfitting
Overview
4. Copyright © 2017 DeepScale 4
Feature Engineering vs. Learned Features
Example of hand-crafted features for face detection
5. Copyright © 2017 DeepScale 5
• Feature Engineering for computer vision can work well
• Very time consuming to find useful features
• Requires BOTH domain expertise and programming know-how
• Hard to generalize to all cases (illumination, pose and variations in
domain)
• Can use generalized features like HOG/SIFT but accuracy suffers
Feature Engineering vs. Learned Features (Cont’d.)
6. Copyright © 2017 DeepScale 6
Feature Engineering vs. Learned Features (Cont’d.)
Example of learned features of a CNN for facial
classification [DeepFace CVPR14]
7. Copyright © 2017 DeepScale 7
• Learned Features for computer vision can work extremely well
• Image Classification: 5.71% vs. 26.2% error [ResNet-152 vs. SIFT
sparse]
• Only requires labeled data, deep learning expertise and computing
power
• “Training” the network is essentially learning features layer by layer
• The deeper you go, the more complex the features become
• Hard to perform validation outside of putting in data and seeing what
happens
Feature Engineering vs. Learned Features (Cont’d.)
8. Copyright © 2017 DeepScale 8
y = f_w(x)
where w is a set of parameters we can learn and f is a nonlinear function
A neural network can be seen as a function approximator
Neural Networks — Quick Review
Typical nonlinear functions in DNN
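Below is a minimal sketch (not DeepScale's code) of y = f_w(x) as a tiny two-layer network built from typical nonlinearities; all layer sizes and values are illustrative assumptions.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)          # ReLU: max(0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))    # squashes any input into (0, 1)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # learnable parameters w
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

    def f(x):
        h = relu(W1 @ x + b1)              # linear map, then a nonlinearity
        return sigmoid(W2 @ h + b2)

    print(f(np.array([0.5, -1.2, 3.0])))   # the approximation y = f_w(x)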
9. Copyright © 2017 DeepScale 9
• Take the example of a Linear Regression
• Given data, we fit a line (𝑦 = 𝑚𝑥 + 𝑏) that minimizes the sum of the
squares of differences (Euclidean distance loss function)
• This function that we minimize is the loss function
• An example would be to predict house value given square footage and
median income
• f(sqft, income) --> value where value is [0, inf] dollars
• we want to minimize L(actual_value, predicted_value), where L is the
loss function (sketched in code below)
Loss Function (Objective Function)
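A minimal sketch of this squared-error loss in Python; the house-value predictor f and its weights are made-up stand-ins, not a trained model.

    def f(sqft, income):
        # toy linear predictor of house value in dollars (assumed weights)
        return 200.0 * sqft + 3.0 * income + 10_000.0

    def L(actual_value, predicted_value):
        # squared-error loss: grows with the square of the prediction error
        return 0.5 * (actual_value - predicted_value) ** 2

    pred = f(sqft=1500.0, income=80_000.0)
    print(L(actual_value=550_000.0, predicted_value=pred))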
11. Copyright © 2017 DeepScale 11
Loss Function (Objective Function) (Cont’d.)
• Another loss function is the Softmax loss for classification
• This is useful when we want to predict the probability of an event
• For example: predict if an image is of a cat or a dog (sketched in code below)
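A minimal sketch of the softmax loss for the cat-vs-dog example; the raw scores (logits) are made-up network outputs.

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()      # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()             # probabilities that sum to 1

    logits = np.array([2.0, 0.5])      # network scores for [cat, dog]
    probs = softmax(logits)
    label = 0                          # ground truth: cat
    loss = -np.log(probs[label])       # softmax (cross-entropy) loss
    print(probs, loss)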
12. Copyright © 2017 DeepScale 12
• Loss functions can be used for either classification or regression
• The goal is to pick a set of weights that makes this loss value as small
as possible
• It is crucial to pick the right objective function for the task; e.g., one
technically can use a squared loss for predicting probability, but it is a poor fit
Loss Function (Objective Function) (Cont’d.)
13. Copyright © 2017 DeepScale 13
• Now if we have a loss function and a neural network, how do we know
what part of the network is “responsible” for causing that error?
• Let’s go back to the simple linear regression!
Gradients
14. Copyright © 2017 DeepScale 14
• Let’s define the loss function
• L = ½(Y − Ŷ)², where Ŷ is the predicted value
• Let’s then take the derivative to see how Ŷ contributes to the loss L
• dL/dŶ = −(Y − Ŷ) = Ŷ − Y
• We’re fitting a line
• Ŷ = mX + b
• Two weights to optimize (slope m and bias b)
• dŶ/dm = X, dŶ/db = 1
Gradients (Cont’d.)
15. Copyright © 2017 DeepScale 15
Gradients (Cont’d.)
Left: a noisy line to fit. Right: surface of the loss w.r.t. slope and bias (m, b)
https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
16. Copyright © 2017 DeepScale 16
• We know dL/dŶ = Ŷ − Y and dŶ/dm = X, dŶ/db = 1
• To optimize our line [slope and bias] we use the chain rule!
• dL/dm = (dL/dŶ)(dŶ/dm) = X(Ŷ − Y) and dL/db = (dL/dŶ)(dŶ/db) = Ŷ − Y
• Together, these two derivatives make a gradient!
• We update our weights with the following (sketched in code below)
• m = m − α·dL/dm and b = b − α·dL/db
• where α is the learning rate; the minus sign steps against the gradient to
reduce the loss
Gradients (Cont’d.)
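A minimal sketch of these chain-rule gradients and one descent step for Ŷ = mX + b; the data and learning rate are assumptions.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = 2.0 * X + 1.0                     # line we are trying to recover

    m, b, alpha = 0.0, 0.0, 0.05          # weights and learning rate

    Y_hat = m * X + b                     # forward pass: predicted values
    dL_dYhat = Y_hat - Y                  # dL/dŶ = Ŷ − Y
    dL_dm = np.mean(X * dL_dYhat)         # chain rule: dL/dm = X(Ŷ − Y)
    dL_db = np.mean(dL_dYhat)             # chain rule: dL/db = Ŷ − Y

    m -= alpha * dL_dm                    # step against the gradient
    b -= alpha * dL_db

Looping these update lines over randomly drawn samples is the SGD procedure described a few slides later.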
17. Copyright © 2017 DeepScale 17
• How to minimize loss?
• Walk down surface via gradient steps until you reach the minimum!
Gradients (Cont’d.)
https://github.com/mattnedrich/GradientDescentExample
18. Copyright © 2017 DeepScale 18
• Gradient descent is not just limited to linear regression
• We can take derivatives with respect to any parameter in the
neural network
• To avoid math complexity and recomputation, we can use the
chain rule again
• We can even do this through nonlinear functions that are not
differentiable everywhere, such as ReLU at zero (see the sketch below)
Gradients (Cont’d.)
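A minimal sketch of backpropagating through a ReLU, which is continuous but not differentiable at zero; by convention the gradient there is taken to be 0. The inputs and upstream gradient are made up.

    import numpy as np

    z = np.array([-1.5, 0.0, 2.0])       # pre-activation values
    a = np.maximum(0.0, z)               # forward pass: ReLU(z)

    dL_da = np.array([0.3, -0.7, 1.1])   # gradient from the layer above
    dL_dz = dL_da * (z > 0)              # chain rule: passes only where z > 0
    print(dL_dz)                         # [ 0.   0.   1.1]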
19. Copyright © 2017 DeepScale 19
Gradients (Cont’d.)
• This process of computing and applying gradient updates to a neural
network layer by layer is called Back Propagation
20. Copyright © 2017 DeepScale 20
• Now that we have gradients and weights, what’s the best way to apply
the updates?
• In the previous linear regression example
• Grab a random sample and apply updates to the slope and bias
• Repeat until convergence
• Known as Stochastic Gradient Descent (SGD)
• Can we do better to find the best possible set of weights to minimize
loss? (Optimization)
Optimization Techniques
21. Copyright © 2017 DeepScale 21
• Momentum
• Keep a running average of previous updates and add it to each update
(sketched in code below)
Optimization Techniques (Cont’d.)
Figure: steps without momentum vs. steps with momentum
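A minimal sketch of SGD with momentum, keeping a velocity term that accumulates previous updates; the coefficients are assumed values.

    import numpy as np

    def momentum_step(w, grad, velocity, alpha=0.01, mu=0.9):
        # running average of past updates, added to each new update
        velocity = mu * velocity - alpha * grad
        return w + velocity, velocity

    w, v = np.array([1.0, -2.0]), np.zeros(2)
    grad = np.array([0.5, -0.3])          # gradient from backpropagation
    w, v = momentum_step(w, grad, v)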
22. Copyright © 2017 DeepScale 22
• AdaGrad, AdaProp, RMSProp, ADAM
• Automatically tune the learning rate to reach convergence in fewer
updates
• Great for fast convergence
• Sometimes finicky when trying to reach the lowest loss possible for a
network (an ADAM-style update is sketched below)
Optimization Techniques (Cont’d.)
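A minimal sketch of an ADAM-style update that adapts the step size per weight; the hyperparameters shown are commonly used defaults, stated here as assumptions.

    import numpy as np

    def adam_step(w, grad, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad           # running mean of gradients
        v = b2 * v + (1 - b2) * grad ** 2      # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)              # correct startup bias toward zero
        v_hat = v / (1 - b2 ** t)
        return w - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

    w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
    w, m, v = adam_step(w, np.array([0.5, -0.3]), m, v, t=1)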
24. Copyright © 2017 DeepScale 24
• When it comes to neural networks, you want a diverse dataset that is
large enough to train your network without overfitting (more on
this later)
• You can also augment your data to generate more samples (see the
sketch below)
• Rotations / reflections, when they make sense
• Add noise / hue / contrast
• This is extremely useful when you have rare classes with few samples
Datasets
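A minimal sketch of the augmentations listed above (reflection, noise, contrast) on a stand-in image array; the parameter ranges are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.uniform(0.0, 1.0, size=(32, 32, 3))     # fake 32x32 RGB image

    flipped = img[:, ::-1, :]                          # horizontal reflection
    noisy = np.clip(img + rng.normal(scale=0.05, size=img.shape), 0.0, 1.0)
    contrast = np.clip((img - 0.5) * 1.2 + 0.5, 0.0, 1.0)  # stretch contrast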
28. Copyright © 2017 DeepScale 28
• What is Overfitting?
• Fitting to the training data but not generalizing well
• What is Underfitting?
• The model does not capture the trends in the data
• How to tell?
Overfitting and Underfitting
30. Copyright © 2017 DeepScale 30
• We can split the training data into 3 disjoint parts
• Training set, Validation set, Test set
• During training
• “Learn” via the training set
• Evaluate the model every epoch with the validation set
• After Training
• Test the model with the test set, which the model hasn’t seen before
(split sketched in code below)
Overfitting and Underfitting (Cont’d.)
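A minimal sketch of the three-way split; the 80/10/10 proportions are an assumption, not a recommendation from the talk.

    import numpy as np

    n = 1000                                       # number of labeled examples
    idx = np.random.default_rng(0).permutation(n)  # shuffle before splitting

    train_idx = idx[: int(0.8 * n)]                # "learn" on this set
    val_idx = idx[int(0.8 * n): int(0.9 * n)]      # evaluate every epoch
    test_idx = idx[int(0.9 * n):]                  # touch only once, after training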
31. Copyright © 2017 DeepScale 31
Overfitting and Underfitting (Cont’d.)
• Overfitting occurs when
• Training loss is low but validation and test loss are high
32. Copyright © 2017 DeepScale 32
• How to combat overfitting?
• More data
• Data augmentation
• Regularization (weight decay)
• Add the magnitude of the weights to the loss function (sketched in
code below)
• Ignore some of the weight updates (Dropout)
• Simpler model?
Overfitting and Underfitting (Cont’d.)
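A minimal sketch of weight decay: the magnitude of the weights is added to the loss so that large weights are penalized; the coefficient lam is an assumed setting.

    import numpy as np

    def loss_with_decay(data_loss, weights, lam=1e-4):
        # L2 regularization: data loss plus lam * ||w||^2
        return data_loss + lam * np.sum(weights ** 2)

    w = np.array([0.5, -1.2, 3.0])
    print(loss_with_decay(data_loss=0.8, weights=w))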
33. Copyright © 2017 DeepScale 33
• Underfitting occurs when
• Training loss drops at first then stops
• Training loss is still high
• Training loss tracks validation loss
• More complex model?
• Turn down regularization
Overfitting and Underfitting (Cont’d.)
34. Copyright © 2017 DeepScale 34
• Neural Nets are function approximators
• Deep Learning can work surprisingly well
• Optimizing nets is an art that requires intuition
• Making good datasets is hard
• Overfitting makes it hard to generalize in applications
• We can measure how robust our models are with validation and test sets
Takeaways