CS294-129: Designing, Visualizing and
Understanding Deep Neural Networks
John Canny
Fall 2016
Lecture 4: Optimization
Based on notes from Andrej Karpathy, Fei-Fei Li,
Justin Johnson
Slides based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Last Time: Linear Classifier
Example trained weights
of a linear classifier
trained on CIFAR-10:
Interpreting a Linear Classifier
[32x32x3]
array of numbers 0...1
(3072 numbers total)
Last Time: Loss Functions
- We have some dataset of (x, y)
- We have a score function: $s = f(x; W) = Wx$
- We have a loss function, e.g.
  Softmax: $L_i = -\log\left( e^{s_{y_i}} / \sum_j e^{s_j} \right)$
  SVM: $L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)$
  Full loss: $L = \frac{1}{N} \sum_{i=1}^{N} L_i + R(W)$
Optimization
Strategy #1: A first very bad idea solution: Random search
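A minimal numpy sketch of random search (it assumes the CIFAR-10 arrays X_train, Y_train, with the bias dimension folded in, and the loss function L from the previous slides are already defined):

import numpy as np

bestloss = float('inf')
bestW = None
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001      # try a random parameter matrix
    loss = L(X_train, Y_train, W)               # loss over the training set
    if loss < bestloss:                         # keep the best W seen so far
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))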
Let's see how well this works on the test set...
15.5% accuracy! Not bad!
(State of the art is ~95%)
Strategy #2: Follow the slope
In 1 dimension, the derivative of a function:
$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
In multiple dimensions, the gradient is the vector of partial derivatives.
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …]
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25322
gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …]
(1.25322 - 1.25347)/0.0001 = -2.5
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25353
gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, …]
(1.25353 - 1.25347)/0.0001 = 0.6
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, …]
(1.25347 - 1.25347)/0.0001 = 0
Evaluating the gradient numerically
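A minimal numpy sketch of this finite-difference procedure (a sketch only; f is assumed to be the loss as a function of the parameter array x):

import numpy as np

def eval_numerical_gradient(f, x, h=1e-4):
    # Perturb one dimension at a time by h and measure the change in f,
    # exactly as in the worked example above.
    grad = np.zeros_like(x)
    fx = f(x)                          # loss at the original point
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h                # increment this coordinate by h
        fxh = f(x)                     # re-evaluate the loss
        x[ix] = old                    # restore the original value
        grad[ix] = (fxh - fx) / h      # finite-difference slope
        it.iternext()
    return grad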
Evaluating the gradient numerically:
- approximate
- very slow to evaluate
This is silly. The loss is just a function of W: we want $\nabla_W L$.
Calculus: write down an expression for the gradient analytically,
$\nabla_W L = \ldots$
gradient dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …]
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
dW = ... (some function of the data and W)
In summary:
- Numerical gradient: approximate, slow, easy to write
- Analytic gradient: exact, fast, error-prone
=>
In practice: Always use analytic gradient, but check
implementation with numerical gradient. This is called a
gradient check.
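A gradient check can be sketched as follows (illustrative only: f is the loss as a function of the parameters x, and analytic_grad is the gradient your derivation produced at the same x):

import numpy as np

def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5):
    # Compare the analytic gradient to a centered finite difference
    # at a few random coordinates and report the relative error.
    for _ in range(num_checks):
        ix = tuple(np.random.randint(d) for d in x.shape)
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)                    # f(x + h)
        x[ix] = old - h
        fxmh = f(x)                    # f(x - h)
        x[ix] = old                    # restore
        grad_numerical = (fxph - fxmh) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = (abs(grad_numerical - grad_analytic)
                     / (abs(grad_numerical) + abs(grad_analytic) + 1e-12))
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numerical, grad_analytic, rel_error))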
Gradient Descent
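In pseudocode (a sketch in the cs231n style; evaluate_gradient, loss_fun, data, weights and step_size are assumed to be defined):

# Vanilla gradient descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad    # perform a parameter update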
[Figure: loss contours over (W_1, W_2); from the original W, each step follows the negative gradient direction.]
Mini-batch Gradient Descent
- only use a small portion of the training set to compute the
gradient.
Common mini-batch sizes are 32/64/128 examples
e.g. Krizhevsky ILSVRC ConvNet used 256 examples
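In the same pseudocode style (sample_training_data is an assumed helper that draws a random batch):

# Mini-batch gradient descent
while True:
    data_batch = sample_training_data(data, 256)    # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad            # parameter update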
Example of optimization progress
while training a neural network.
(Loss over mini-batches goes
down over time.)
The effects of step size (or “learning rate”)
Mini-batch Gradient Descent
• only use a small portion of the training set to compute the
gradient.
Common mini-batch sizes are 32/64/128 examples
e.g. Krizhevsky ILSVRC ConvNet used 256 examples
We will look at fancier update formulas (momentum, Adagrad, RMSProp, Adam, …).
Minibatch updates
[Figure: from the original W over (W_1, W_2), the true negative gradient direction.]
Stochastic Gradient
[Figure: from the original W, a noisy gradient computed from a minibatch.]
Stochastic Gradient Descent
[Figure: true gradients in blue, minibatch gradients in red.]
Gradients are noisy but still make good progress on average.
Updates
• We have one GSI and are looking for a second.
• Recording starts today; lectures should be posted on YouTube soon after class.
• We are trying to increase enrollment. This depends on:
  • a petition to the Dean and fire approval;
  • new students being willing to view video lectures.
• Assignment 1 is out, due Sept 14. Please start soon.
Standard gradient
[Figure: standard gradient descent steps over (W_1, W_2).]
You might be wondering… can we compute this?
Newton’s method for zeros of a function
Based on the Taylor series for $f(x+h)$:
$f(x+h) = f(x) + h f'(x) + O(h^2)$
To find a zero of $f$, set $f(x+h) = 0$, so
$h \approx -\frac{f(x)}{f'(x)}$
And as an iteration:
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
Newton’s method for optima
For zeros of $f(x)$:
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
At a local optimum, $f'(x) = 0$, so we apply the same iteration to $f'$:
$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$
If $f''(x)$ is constant ($f$ is quadratic), then Newton’s method finds the optimum in one step. More generally, Newton’s method has quadratic convergence.
Newton’s method for gradients:
To find an optimum of a function $f(x)$ for high-dimensional $x$, we want zeros of its gradient: $\nabla f(x) = 0$.
For zeros of $\nabla f(x)$ with a vector displacement $h$, the Taylor expansion is:
$\nabla f(x+h) = \nabla f(x) + H_f(x)\, h + O(\|h\|^2)$
where $H_f$ is the Hessian matrix of second derivatives of $f$. The update is:
$x_{n+1} = x_n - H_f(x_n)^{-1} \nabla f(x_n)$
Standard gradient vs. Newton step
[Figure: over (W_1, W_2), standard gradient steps vs. the Newton gradient step.]
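As a concrete toy sketch (not from the slides): for a quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ the gradient is $Ax - b$ and the Hessian is $A$, so a single Newton step lands on the optimum:

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                 # Hessian (symmetric positive definite)
b = np.array([1.0, -2.0])

x = np.zeros(2)                            # starting point
grad = A @ x - b                           # gradient of the quadratic at x
x = x - np.linalg.solve(A, grad)           # Newton update: x <- x - H^{-1} grad

print(x, A @ x - b)                        # the gradient at the new x is ~0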
Newton’s method for gradients:
The Newton update is:
$x_{n+1} = x_n - H_f(x_n)^{-1} \nabla f(x_n)$
It converges very fast, but is rarely used in DL. Why?
Too expensive: if $x_n$ has dimension M, the Hessian $H_f(x_n)$ has $M^2$ entries and takes $O(M^3)$ time to invert.
Too unstable: it involves a high-dimensional matrix inverse, which has poor numerical stability. The Hessian may even be singular.
L-BFGS
There is an approximate Newton method that addresses these issues called L-BFGS (Limited-memory BFGS). BFGS stands for Broyden–Fletcher–Goldfarb–Shanno.
Idea: compute a low-dimensional approximation of $H_f(x_n)^{-1}$ directly.
Expense: use a k-dimensional approximation of $H_f(x_n)^{-1}$; its size is $O(kM)$ and its cost is $O(k^2 M)$.
Stability: much better. Depends on the largest singular values of $H_f(x_n)$.
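For reference, a limited-memory quasi-Newton solver is available off the shelf in SciPy; a sketch on the same toy quadratic as above (illustrative only, not how deep networks are trained):

import numpy as np
from scipy.optimize import minimize

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

f = lambda x: 0.5 * x @ A @ x - b @ x      # objective
grad = lambda x: A @ x - b                 # its gradient

# L-BFGS-B maintains a low-rank approximation of the inverse Hessian
res = minimize(f, x0=np.zeros(2), jac=grad, method='L-BFGS-B')
print(res.x)                               # ~ A^{-1} b, the minimizer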
Convergence Nomenclature
Quadratic convergence: the error decreases quadratically (Newton’s method):
$\epsilon_{n+1} \propto \epsilon_n^2$, where $\epsilon_n = x_{opt} - x_n$
Linear convergence: the error decreases linearly:
$\epsilon_{n+1} \le \mu\, \epsilon_n$, where $\mu$ is the rate of convergence.
SGD: ??
Convergence Behavior
Quadratic convergence: $\epsilon_{n+1} \propto \epsilon_n^2$, so $\epsilon_n$ is $O(\mu^{2^n})$.
Linear convergence: $\epsilon_{n+1} \le \mu\, \epsilon_n$, so $\epsilon_n$ is $O(\mu^n)$.
SGD: if the learning rate is decayed as 1/n, then (Nemirovski) $\epsilon_n$ is $O(1/n)$.
Convergence Comparison
Quadratic convergence: $\epsilon_n$ is $O(\mu^{2^n})$; time to reach error $\epsilon$ is $O(\log \log (1/\epsilon))$.
Linear convergence: $\epsilon_n$ is $O(\mu^n)$; time is $O(\log (1/\epsilon))$.
SGD: $\epsilon_n$ is $O(1/n)$; time is $O(1/\epsilon)$.
SGD is terrible compared to the others. Why is it used?
Convergence Comparison
Quadratic convergence: $\epsilon_n$ is $O(\mu^{2^n})$; time is $O(\log \log (1/\epsilon))$.
Linear convergence: $\epsilon_n$ is $O(\mu^n)$; time is $O(\log (1/\epsilon))$.
SGD: $\epsilon_n$ is $O(1/n)$; time is $O(1/\epsilon)$.
SGD is good enough for machine learning applications. Remember that “n” for SGD is the minibatch count: after 1 million updates, $\epsilon$ is of order $10^{-6}$, which is approaching single-precision floating-point accuracy.
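To see where the doubly exponential rate for quadratic convergence comes from, unroll the recursion (with a constant $C$ and $\mu = C\,\epsilon_0 < 1$):

\epsilon_{n+1} \le C\,\epsilon_n^{2}
\;\Rightarrow\; C\,\epsilon_{n+1} \le (C\,\epsilon_n)^{2}
\;\Rightarrow\; C\,\epsilon_n \le (C\,\epsilon_0)^{2^{n}}
\;\Rightarrow\; \epsilon_n = O\!\big(\mu^{2^{n}}\big), \qquad \mu = C\,\epsilon_0 < 1 .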
The effects of different update formulas (image credits to Alec Radford)
Another approach to poor gradients:
Q: What is the trajectory along which we converge
towards the minimum with SGD?
Q: What is the trajectory along which we converge towards the minimum with SGD? Very slow progress along the flat direction, jitter along the steep one.
Momentum update
- Physical interpretation: a ball rolling down the loss surface, with friction (coefficient mu).
- mu is usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99).
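The momentum update is conventionally written as the following numpy-style pseudocode (a sketch: x are the parameters, dx the gradient at x, and v the velocity, initialized to zero):

# Momentum update
v = mu * v - learning_rate * dx           # integrate velocity
x += v                                    # integrate position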
Momentum update
- Allows a velocity to “build up” along shallow directions
- Velocity becomes damped in steep direction due to quickly changing sign
SGD vs Momentum: notice momentum overshooting the target, but overall getting to the minimum much faster than vanilla SGD.
Nesterov Momentum update
Ordinary momentum update: [Figure: the actual step is the sum of the momentum step and the gradient step taken at the current point.]
Nesterov Momentum update
Momentum update vs. Nesterov momentum update: [Figure: in ordinary momentum the actual step is the momentum step plus the gradient step at the current point; in Nesterov momentum the gradient step is a “lookahead” gradient taken after the momentum step (a bit different from the original), and the actual step is their sum.]
Nesterov: the only difference is where the gradient is evaluated.
Nesterov Momentum update
Slightly inconvenient… usually we have the gradient at the current parameter vector, not at the lookahead point.
Variable transform and rearranging saves the day: replace all thetas with phis (the lookahead variables), rearrange, and obtain an update expressed entirely in terms of the new variables.
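One standard way to write the resulting update so that the gradient dx is evaluated at the current parameter vector x (same pseudocode style and names as the momentum sketch above):

# Nesterov accelerated gradient, in terms of the current parameters
v_prev = v                                # back up the previous velocity
v = mu * v - learning_rate * dx           # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v          # position update changes form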
nag = Nesterov Accelerated Gradient
AdaGrad update
Added element-wise scaling of the gradient based on the
historical sum of squares in each dimension
[Duchi et al., 2011]
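In numpy-style pseudocode (a sketch: cache is an array of the same shape as x, initialized to zero):

# AdaGrad update
cache += dx**2                                        # accumulate squared gradients per dimension
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)    # element-wise scaled step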
AdaGrad update
Q: What happens with AdaGrad?
Q2: What happens to the step size over long time?
RMSProp update [Tieleman and Hinton, 2012]
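RMSProp replaces AdaGrad's running sum with a leaky (exponentially decaying) average; in the same pseudocode style, with decay_rate typically around 0.9:

# RMSProp update
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)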
RMSProp was introduced in a slide in Geoff Hinton’s Coursera class, lecture 6, and is cited by several papers as: Tieleman, T. and Hinton, G., “Lecture 6.5 — RMSProp”, COURSERA: Neural Networks for Machine Learning, 2012.
[Figure: adagrad vs. rmsprop update trajectories.]
Adam update [Kingma and Ba, 2014] (incomplete, but close)
A momentum-like term plus an RMSProp-like term: looks a bit like RMSProp with momentum.
Adam update [Kingma and Ba, 2014]
Momentum term, RMSProp-like scaling, and a bias correction (only relevant in the first few iterations, when t is small).
The bias correction compensates for the fact that m and v are initialized at zero and need some time to “warm up”.
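Putting the pieces together in the same pseudocode style (typical values beta1 = 0.9, beta2 = 0.999; t is the iteration count starting at 1; m and v start at zero):

# Adam update
m = beta1 * m + (1 - beta1) * dx            # momentum-like first moment
v = beta2 * v + (1 - beta2) * (dx**2)       # RMSProp-like second moment
mt = m / (1 - beta1**t)                     # bias-corrected first moment
vt = v / (1 - beta2**t)                     # bias-corrected second moment
x += -learning_rate * mt / (np.sqrt(vt) + 1e-8)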
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have
learning rate as a hyperparameter.
Q: Which one of these
learning rates is best to use?
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
=> Learning rate decay over time!
step decay: e.g. decay the learning rate by half every few epochs.
exponential decay: $\alpha = \alpha_0 e^{-kt}$
1/t decay: $\alpha = \alpha_0 / (1 + kt)$
Summary
• Simple gradient methods like SGD can make adequate progress toward an optimum when used on minibatches of data.
• Second-order methods make much better progress toward the goal, but are more expensive and unstable.
• Convergence rates: quadratic, linear, O(1/n).
• Momentum is another method to produce better effective gradients.
• AdaGrad and RMSProp diagonally scale the gradient. Adam both scales and applies momentum.
• Assignment 1 is out, due Sept 14. Please start soon.