"A Shallow Dive into Training Deep Neural Networks," a Presentation from DeepScale

1. Copyright © 2017 DeepScale 1
A Shallow Dive into Training Deep
Neural Networks
Sammy Sidhu
May 2017
2. Copyright © 2017 DeepScale 2
• Perception systems for autonomous vehicles
• Focusing on enabling technologies for mass-produced autonomous
vehicles
• Working with a number of OEMs and automotive suppliers
• Open Source ☺
• Visit http://deepscale.ai
About DeepScale
3. Copyright © 2017 DeepScale 3
• Feature Engineering vs. Learned Features
• Neural Network Review
• Loss Function (Objective Function)
• Gradients
• Optimization Techniques
• Datasets
• Overfitting and Underfitting
Overview
4. Copyright © 2017 DeepScale 4
Feature Engineering vs. Learned Features
Example of hand-crafted features for face detection
5. Copyright © 2017 DeepScale 5
• Feature Engineering for computer vision can work well
• Very time consuming to find useful features
• Requires BOTH domain expertise and programming know-how
• Hard to generalize to all cases (illumination, pose and variations in
domain)
• Can use generalized features like HOG/SIFT but accuracy suffers
Feature Engineering vs. Learned Features (Cont’d.)
6. Copyright © 2017 DeepScale 6
Feature Engineering vs. Learned Features (Cont’d.)
Example of learned features of a CNN for facial
classification [DeepFace CVPR14]
7. Copyright © 2017 DeepScale 7
• Learned Features for computer vision can work extremely well
• Image Classification: 5.71% vs. 26.2% error [ResNet-152 vs. SIFT
sparse]
• Only requires labeled data, deep learning expertise and computing
power
• “Training” the network is essentially learning features layer by layer
• The deeper you go, the more complex the features become
• Hard to perform validation outside of putting in data and seeing what
happens
Feature Engineering vs. Learned Features (Cont’d.)
8. Copyright © 2017 DeepScale 8
y = f_w(x)
where w is a set of parameters we can learn and f is a nonlinear function
A neural network can be seen as a function approximator
Neural Networks — Quick Review
Typical nonlinear functions in DNN
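Below is a minimal sketch (not DeepScale's code) of y = f_w(x) as a tiny two-layer network built from typical nonlinearities; all layer sizes and values are illustrative assumptions.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)          # ReLU: max(0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))    # squashes any input into (0, 1)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # learnable parameters w
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

    def f(x):
        h = relu(W1 @ x + b1)              # linear map, then a nonlinearity
        return sigmoid(W2 @ h + b2)

    print(f(np.array([0.5, -1.2, 3.0])))   # the approximation y = f_w(x)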
9. Copyright © 2017 DeepScale 9
• Take the example of a Linear Regression
• Given data, we fit a line (𝑦 = 𝑚𝑥 + 𝑏) that minimizes the sum of the
squares of differences (Euclidean distance loss function)
• This function that we minimize is the loss function
• An example would be to predict house value given square footage and
median income
• f(sqft, income) --> value where value is [0, inf] dollars
• we want to minimize L(actual_value, predicted_value), where L is the
loss function (sketched in code below)
Loss Function (Objective Function)
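A minimal sketch of this squared-error loss in Python; the house-value predictor f and its weights are made-up stand-ins, not a trained model.

    def f(sqft, income):
        # toy linear predictor of house value in dollars (assumed weights)
        return 200.0 * sqft + 3.0 * income + 10_000.0

    def L(actual_value, predicted_value):
        # squared-error loss: grows with the square of the prediction error
        return 0.5 * (actual_value - predicted_value) ** 2

    pred = f(sqft=1500.0, income=80_000.0)
    print(L(actual_value=550_000.0, predicted_value=pred))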
11. Copyright © 2017 DeepScale 11
Loss Function (Objective Function) (Cont’d.)
• Another loss function is the Softmax loss for classification
• This is useful when we want to predict the probability of an event
• For example: predict if an image is of a cat or a dog (sketched in code below)
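A minimal sketch of the softmax loss for the cat-vs-dog example; the raw scores (logits) are made-up network outputs.

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()      # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()             # probabilities that sum to 1

    logits = np.array([2.0, 0.5])      # network scores for [cat, dog]
    probs = softmax(logits)
    label = 0                          # ground truth: cat
    loss = -np.log(probs[label])       # softmax (cross-entropy) loss
    print(probs, loss)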
12. Copyright © 2017 DeepScale 12
• Loss functions can be used for either classification or regression
• The goal is to pick a set of weights that makes this loss value as small
as possible
• It is crucial to pick the right objective function for the task; e.g., one
technically can use a squared loss for predicting probability, but it is a poor fit
Loss Function (Objective Function) (Cont’d.)
13. Copyright © 2017 DeepScale 13
• Now if we have a loss function and a neural network, how do we know
what part of the network is “responsible” for causing that error?
• Let’s go back to the simple linear regression!
Gradients
14. Copyright © 2017 DeepScale 14
• Let’s define the loss function
• L = ½(Y − Ŷ)², where Ŷ is the predicted value
• Let’s then take the derivative to see how Ŷ contributes to the loss L
• dL/dŶ = −(Y − Ŷ) = Ŷ − Y
• We’re fitting a line
• Ŷ = mX + b
• Two weights to optimize (slope m and bias b)
• dŶ/dm = X, dŶ/db = 1
Gradients (Cont’d.)
15. Copyright © 2017 DeepScale 15
Gradients (Cont’d.)
Left: a noisy line to fit. Right: surface of the loss w.r.t. slope and bias (m, b)
https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
16. Copyright © 2017 DeepScale 16
• We know dL/dŶ = Ŷ − Y and dŶ/dm = X, dŶ/db = 1
• To optimize our line [slope and bias] we use the chain rule!
• dL/dm = (dL/dŶ)(dŶ/dm) = X(Ŷ − Y) and dL/db = (dL/dŶ)(dŶ/db) = Ŷ − Y
• Together, these two derivatives make a gradient!
• We update our weights with the following (sketched in code below)
• m = m − α·dL/dm and b = b − α·dL/db
• where α is the learning rate; the minus sign steps against the gradient to
reduce the loss
Gradients (Cont’d.)
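A minimal sketch of these chain-rule gradients and one descent step for Ŷ = mX + b; the data and learning rate are assumptions.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = 2.0 * X + 1.0                     # line we are trying to recover

    m, b, alpha = 0.0, 0.0, 0.05          # weights and learning rate

    Y_hat = m * X + b                     # forward pass: predicted values
    dL_dYhat = Y_hat - Y                  # dL/dŶ = Ŷ − Y
    dL_dm = np.mean(X * dL_dYhat)         # chain rule: dL/dm = X(Ŷ − Y)
    dL_db = np.mean(dL_dYhat)             # chain rule: dL/db = Ŷ − Y

    m -= alpha * dL_dm                    # step against the gradient
    b -= alpha * dL_db

Looping these update lines over randomly drawn samples is the SGD procedure described a few slides later.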
17. Copyright © 2017 DeepScale 17
• How to minimize loss?
• Walk down surface via gradient steps until you reach the minimum!
Gradients (Cont’d.)
https://github.com/mattnedrich/GradientDescentExample
18. Copyright © 2017 DeepScale 18
• Gradient descent is not just limited to linear regression
• We can take derivatives with respect to any parameter in the
neural network
• To avoid math complexity and recomputation, we can use the
chain rule again
• We can even do this through nonlinear functions that are not
differentiable everywhere, such as ReLU at zero (see the sketch below)
Gradients (Cont’d.)
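A minimal sketch of backpropagating through a ReLU, which is continuous but not differentiable at zero; by convention the gradient there is taken to be 0. The inputs and upstream gradient are made up.

    import numpy as np

    z = np.array([-1.5, 0.0, 2.0])       # pre-activation values
    a = np.maximum(0.0, z)               # forward pass: ReLU(z)

    dL_da = np.array([0.3, -0.7, 1.1])   # gradient from the layer above
    dL_dz = dL_da * (z > 0)              # chain rule: passes only where z > 0
    print(dL_dz)                         # [ 0.   0.   1.1]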
19. Copyright © 2017 DeepScale 19
Gradients (Cont’d.)
• This process of computing and applying gradient updates to a neural
network layer by layer is called Back Propagation
20. Copyright © 2017 DeepScale 20
• Now that we have gradients and weights, what’s the best way to apply
the updates?
• In the previous linear regression example
• Grab a random sample and apply updates to the slope and bias
• Repeat until convergence
• Known as Stochastic Gradient Descent (SGD)
• Can we do better to find the best possible set of weights to minimize
loss? (Optimization)
Optimization Techniques
21. Copyright © 2017 DeepScale 21
• Momentum
• Keep a running average of previous updates and add it to each update
(sketched in code below)
Optimization Techniques (Cont’d.)
Figure: steps without momentum vs. steps with momentum
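A minimal sketch of SGD with momentum, keeping a velocity term that accumulates previous updates; the coefficients are assumed values.

    import numpy as np

    def momentum_step(w, grad, velocity, alpha=0.01, mu=0.9):
        # running average of past updates, added to each new update
        velocity = mu * velocity - alpha * grad
        return w + velocity, velocity

    w, v = np.array([1.0, -2.0]), np.zeros(2)
    grad = np.array([0.5, -0.3])          # gradient from backpropagation
    w, v = momentum_step(w, grad, v)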
22. Copyright © 2017 DeepScale 22
• AdaGrad, AdaProp, RMSProp, ADAM
• Automatically tune the learning rate to reach convergence in fewer
updates
• Great for fast convergence
• Sometimes finicky when trying to reach the lowest loss possible for a
network (an ADAM-style update is sketched below)
Optimization Techniques (Cont’d.)
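A minimal sketch of an ADAM-style update that adapts the step size per weight; the hyperparameters shown are commonly used defaults, stated here as assumptions.

    import numpy as np

    def adam_step(w, grad, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad           # running mean of gradients
        v = b2 * v + (1 - b2) * grad ** 2      # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)              # correct startup bias toward zero
        v_hat = v / (1 - b2 ** t)
        return w - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

    w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
    w, m, v = adam_step(w, np.array([0.5, -0.3]), m, v, t=1)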
24. Copyright © 2017 DeepScale 24
• When it comes to neural networks, you want a diverse dataset that is
large enough to train your network without overfitting (more on
this later)
• You can also augment your data to generate more samples (see the
sketch below)
• Rotations / reflections, when they make sense
• Add noise / hue / contrast
• This is extremely useful when you have rare classes with few samples
Datasets
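A minimal sketch of the augmentations listed above (reflection, noise, contrast) on a stand-in image array; the parameter ranges are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.uniform(0.0, 1.0, size=(32, 32, 3))     # fake 32x32 RGB image

    flipped = img[:, ::-1, :]                          # horizontal reflection
    noisy = np.clip(img + rng.normal(scale=0.05, size=img.shape), 0.0, 1.0)
    contrast = np.clip((img - 0.5) * 1.2 + 0.5, 0.0, 1.0)  # stretch contrast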
28. Copyright © 2017 DeepScale 28
• What is Overfitting?
• Fitting to the training data but not generalizing well
• What is Underfitting?
• The model does not capture the trends in the data
• How to tell?
Overfitting and Underfitting
30. Copyright © 2017 DeepScale 30
• We can split the training data into 3 disjoint parts
• Training set, Validation set, Test set
• During training
• “Learn” via the training set
• Evaluate the model every epoch with the validation set
• After Training
• Test the model with the test set, which the model hasn’t seen before
(split sketched in code below)
Overfitting and Underfitting (Cont’d.)
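A minimal sketch of the three-way split; the 80/10/10 proportions are an assumption, not a recommendation from the talk.

    import numpy as np

    n = 1000                                       # number of labeled examples
    idx = np.random.default_rng(0).permutation(n)  # shuffle before splitting

    train_idx = idx[: int(0.8 * n)]                # "learn" on this set
    val_idx = idx[int(0.8 * n): int(0.9 * n)]      # evaluate every epoch
    test_idx = idx[int(0.9 * n):]                  # touch only once, after training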
31. Copyright © 2017 DeepScale 31
Overfitting and Underfitting (Cont’d.)
• Overfitting occurs when
• Training loss is low but validation and test loss are high
32. Copyright © 2017 DeepScale 32
• How to combat overfitting?
• More data
• Data augmentation
• Regularization (weight decay)
• Add the magnitude of the weights to the loss function (sketched in
code below)
• Ignore some of the weight updates (Dropout)
• Simpler model?
Overfitting and Underfitting (Cont’d.)
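A minimal sketch of weight decay: the magnitude of the weights is added to the loss so that large weights are penalized; the coefficient lam is an assumed setting.

    import numpy as np

    def loss_with_decay(data_loss, weights, lam=1e-4):
        # L2 regularization: data loss plus lam * ||w||^2
        return data_loss + lam * np.sum(weights ** 2)

    w = np.array([0.5, -1.2, 3.0])
    print(loss_with_decay(data_loss=0.8, weights=w))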
33. Copyright © 2017 DeepScale 33
• Underfitting occurs when
• Training loss drops at first then stops
• Training loss is still high
• Training loss tracks validation loss
• More complex model?
• Turn down regularization
Overfitting and Underfitting (Cont’d.)
34. Copyright © 2017 DeepScale 34
• Neural Nets are function approximators
• Deep Learning can work surprisingly well
• Optimizing nets is an art that requires intuition
• Making good datasets is hard
• Overfitting makes it hard to generalize in applications
• We can measure how robust our models are with validation and test sets
Takeaways