Optimization Algorithms for
Deep Learning
Dr. Ash Pahwa
Data Con - LA
August 17, 2019
Copyright 2019 - Dr. Ash Pahwa 1
Bio: Dr. Ash Pahwa
◼ Ph.D. Computer Science
◼ Website: www.AshPahwa.com
◼ Affiliation
◼ California Institute of Technology, Pasadena
◼ UC Irvine
◼ UCLA
◼ UCSD
◼ Chapman University: Adjunct
◼ Field of Expertise
◼ Machine Learning, Deep Learning, Digital Image Processing,
Database Management, CD-ROM/DVD
◼ Worked for
◼ General Electric, AT&T Bell Laboratories, Oracle, UC Santa Barbara
Copyright 2019 - Dr. Ash Pahwa 2
Outline
◼ Gradient Descent
◼ Momentum
◼ Nesterov Momentum
◼ AdaGrad
◼ RMS Prop
◼ Adam: Adaptive Moments
Copyright 2019 - Dr. Ash Pahwa 3
Artificial Intelligence
Copyright 2019 - Dr. Ash Pahwa 4
Deep Learning
◼ A Deep Neural Network (DNN) has
multiple hidden layers
Copyright 2019 - Dr. Ash Pahwa 5
Problems that can use
Neural Networks
◼ For simple problems we can
define the rules
◼ We can automate the process
◼ Write software
◼ For complex problems
◼ We cannot define the rules
◼ Object recognition in an
image
◼ To solve these types of problems
◼ We provide data and the
answers
◼ System will create the rules
Copyright 2019 - Dr. Ash Pahwa 6
[Diagram: Classical Programming takes Rules + Data as input and produces Answers; Machine Learning takes Data + Answers as input and produces Rules]
What is an Optimization Problem?
◼ Regression
◼ Solution: Minimize the cost function
◼ Closed form solution: Matrix Approach
◼ Iterative Approach: Gradient Descent Algorithm
◼ Neural Networks: Deep Learning
◼ Solution: Minimize the cost function
◼ Iterative Approach: Gradient Descent Algorithm
Copyright 2019 - Dr. Ash Pahwa 7
Who Invented
Gradient Descent Algorithm?
◼ Gradient Descent algorithm was
invented by Cauchy in 1847
◼ "Méthode générale pour la résolution des systèmes d'équations
simultanées" ("General method for the solution of systems of
simultaneous equations"), pp. 536–538
Copyright 2019 - Dr. Ash Pahwa 8
What is
Gradient Descent Algorithm?
◼ Suppose a function y = f(x) is given
◼ The Gradient Descent algorithm finds the value of 'x' at which 'y' reaches its
minimum (running the same procedure uphill finds a maximum)
◼ The Gradient Descent algorithm can be extended to any function of 2 or more
variables, e.g. z = f(x, y) (a minimal R sketch of the two-variable case follows the update rules below)
Copyright 2019 - Dr. Ash Pahwa 9
◼ Function y = f(x)
◼ Gradient Descent Algorithm: Minimum
◼ Initialize the value of x
◼ Learning rate = η
◼ While NOT converged:
◼ x_{t+1} ← x_t − η (∂y/∂x)|_{x_t}
◼ Function z = f(x, y)
◼ Gradient Descent Algorithm: Minimum
◼ Initialize the values of x and y
◼ Learning rate = η
◼ While NOT converged:
◼ x_{t+1} ← x_t − η (∂z/∂x)|_{x_t, y_t}
◼ y_{t+1} ← y_t − η (∂z/∂y)|_{x_t, y_t}
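The following R sketch is not part of the original slides; it illustrates the two-variable update rule above on an assumed example function z = (x-1)^2 + (y+2)^2, whose minimum is at x = 1, y = -2.
# Assumed example function: z = (x-1)^2 + (y+2)^2
dz_dx = function (x, y) { 2*(x-1) }      # partial derivative of z w.r.t. x
dz_dy = function (x, y) { 2*(y+2) }      # partial derivative of z w.r.t. y
x = 5                                    # initial value of x
y = 5                                    # initial value of y
learningRate = 0.1                       # eta
for ( t in 1:200 ) {                     # fixed iteration count instead of a convergence test
  gx = dz_dx(x, y)
  gy = dz_dy(x, y)
  x = x - learningRate * gx              # x_{t+1} <- x_t - eta * dz/dx
  y = y - learningRate * gy              # y_{t+1} <- y_t - eta * dz/dy
}
c(x, y)                                  # approaches the minimum at (1, -2)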
Problems with
Gradient Descent Algorithm
◼ When the gradient of the error function is low (flat)
◼ It takes a long time for the algorithm to converge
(reach the minimum point)
◼ If there are multiple local minima, then
◼ There is no guarantee that the procedure will find the
global minimum
Copyright 2019 - Dr. Ash Pahwa 10
Error Function
◼ Error Function: y = f(x) = 1 − e^{−(x−2)²}
◼ dy/dx = 2(x − 2) e^{−(x−2)²}
◼ The error function 'y' is almost flat (dy/dx ≈ 0) for
◼ −∞ < x < 0 and
◼ 4 < x < ∞
Copyright 2019 - Dr. Ash Pahwa 11
> x = seq(-5,5,0.1)
> y1 = 1 - exp(-(x-2)^2)
> plot(x,y1,type='l',ylim=c(-2,2),lwd=3)
> text(-2,1.2,"Error Function")
> text(2,1.4,"Minimum Point")
> abline(v=2)
> dy_dx = function (w1) { 2*(w1-2)*exp(-(w1-2)^2) }
> slope = dy_dx(x)
> points(x,slope,type='l',col='red')
> text(-2,0.2,"Derivative of Error Function")
Gradient Descent
Algorithm
◼ Starting Point = -4
◼ Number of iterations = 1,000
Copyright 2019 - Dr. Ash Pahwa 12
> xStart = -4
> learningRate = 0.1
> maxLimit = 1000
> xStartHistory = rep(0,maxLimit)
> for ( i in 1:maxLimit )
+ {
+ xStartHistory[i] = xStart
+ xStart = xStart - learningRate * dy_dx(xStart)
+ }
> plot(xStartHistory,type='l',ylim=c(-5,3),
main="After 1,000 Iterations")
> text(500,-3.5,"X Value")
> abline(h=2, col='dark red', lwd=3)
> text(500,2.5,"Minimum Point")
>
Starting Point = -4, DOES NOT CONVERGE
Problems with
Gradient Descent Algorithm
◼ If the error function has very small
gradient
◼ Starting point is on the flat surface
◼ Gradient Descent Algorithm will take
a long time to converge
Copyright 2019 - Dr. Ash Pahwa 13
Solutions to Gradient Descent
Algorithm Problem
◼ Solution#1: Increase the step size
◼ Momentum based GD
◼ Nesterov GD
◼ Solution#2: Reduce the data points for approximate gradient
◼ Stochastic Gradient Descent
◼ Mini Batch Gradient Descent
◼ Solution#3: Adjust the Learning rate (𝜂)
◼ AdaGrad
◼ RMSProp
◼ Adam
Copyright 2019 - Dr. Ash Pahwa 14
◼ Function y = f(x)
◼ Gradient Descent Algorithm: Minimum
◼ Initialize the value of x
◼ Learning rate = η
◼ While NOT converged:
◼ x_{t+1} ← x_t − η (∂y/∂x)|_{x_t}
First Solution to the
Gradient Descent Algorithm Problem
◼ When the gradient becomes (almost) zero
◼ Find some way to keep moving forward
◼ Make the step size a function not only of the current gradient at time 't'
◼ But also of the previous gradients at times 't−1', 't−2', …
◼ Solution
◼ Momentum
◼ Exponentially Weighted Moving Average of Gradients
◼ x_{t+1} = x_t − η (∂z/∂x)|_t − (weighted contribution of previous values of ∂z/∂x)
Copyright 2019 - Dr. Ash Pahwa 15
Exponentially Weighted
Moving Average of Gradients
◼ update_t = γ · update_{t−1} + η (∂z/∂x)|_t
◼ x_{t+1} = x_t − update_t
◼ ------------------------------------------------------------
◼ update_0 = 0
◼ update_1 = γ · update_0 + η (∂z/∂x)|_1
◼ update_2 = γ · update_1 + η (∂z/∂x)|_2 = γ² · update_0 + γ · η (∂z/∂x)|_1 + η (∂z/∂x)|_2
◼ update_3 = γ · update_2 + η (∂z/∂x)|_3 = γ³ · update_0 + γ² · η (∂z/∂x)|_1 + γ · η (∂z/∂x)|_2 + η (∂z/∂x)|_3
◼ update_4 = γ · update_3 + η (∂z/∂x)|_4 = γ⁴ · update_0 + γ³ · η (∂z/∂x)|_1 + γ² · η (∂z/∂x)|_2 + γ · η (∂z/∂x)|_3 + η (∂z/∂x)|_4
◼ ⋮
◼ update_t = γ · update_{t−1} + η (∂z/∂x)|_t = γ^t · update_0 + γ^{t−1} · η (∂z/∂x)|_1 + γ^{t−2} · η (∂z/∂x)|_2 + ⋯ + η (∂z/∂x)|_t
◼ update_t = γ^{t−1} · η (∂z/∂x)|_1 + γ^{t−2} · η (∂z/∂x)|_2 + ⋯ + η (∂z/∂x)|_t (since update_0 = 0; a small numerical check of this expansion follows this slide)
Copyright 2019 - Dr. Ash Pahwa 16
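The following R check is not part of the original slides; it uses a few hypothetical gradient values to confirm that the recursive update rule and the expanded weighted-sum form above give the same result.
gamma = 0.9
learningRate = 0.1
gradients = c(0.5, -0.2, 0.8, 0.3)       # hypothetical values of dz/dx at t = 1..4
# Recursive form: update_t = gamma * update_{t-1} + eta * gradient_t
update = 0
for ( g in gradients ) { update = (gamma * update) + (learningRate * g) }
# Expanded form: gamma^{t-1}*eta*g_1 + gamma^{t-2}*eta*g_2 + ... + eta*g_t
t = length(gradients)
expanded = sum(gamma^(t - seq_along(gradients)) * learningRate * gradients)
c(update, expanded)                      # both equal 0.12225 here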
Example#1: Gradient Descent with Momentum
Converges after 150,000 Iterations
Copyright 2019 - Dr. Ash Pahwa 17
############################################
xStart = -2.0
learningRate = 0.1
#########################################
# Momentum
maxLimit = 160000
xStartHistory = rep(0,maxLimit)
gamma = 0.9
update = 0
for ( i in 1:maxLimit ){
xStartHistory[i] = xStart
gradient = dy_dx(xStart)
update = (gamma * update) + (learningRate * gradient)
xStart = xStart - update
}
plot(xStartHistory,type='l')
abline(h=2,col='red')
◼ update_t = γ · update_{t−1} + η (∂z/∂x)|_t
◼ x_{t+1} = x_t − update_t
◼ update_t = γ^{t−1} · η (∂z/∂x)|_1 + γ^{t−2} · η (∂z/∂x)|_2 + ⋯ + η (∂z/∂x)|_t
Example#1:
Just Before Reaching the Minimum Point
Oscillation
Copyright 2019 - Dr. Ash Pahwa 18
> #####################################################
> # Expand when the values are converging
> #
> xStartHistorySmall = xStartHistory[148500:149500]
> plot(xStartHistorySmall,type='l')
> abline(h=2,col='red')
>
> #################################################
>
Solution to Momentum
Nesterov Momentum
◼ Before jumping to the next position, also check the
gradient at that look-ahead position
◼ If the look-ahead gradient would force a 'U' turn
◼ Reduce the size of the step (a minimal R sketch of this look-ahead follows this slide)
Copyright 2019 - Dr. Ash Pahwa 19
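The following R sketch is not from the original deck; it applies the look-ahead idea to the same dy_dx error-function derivative defined on the earlier slides. The only change from the plain Momentum code is that the gradient is evaluated at the look-ahead point x - gamma*update.
#########################################
# Nesterov momentum (sketch)
xStart = -2.0
learningRate = 0.1
gamma = 0.9
update = 0
maxLimit = 160000
xStartHistory = rep(0,maxLimit)
for ( i in 1:maxLimit ){
  xStartHistory[i] = xStart
  lookAhead = xStart - (gamma * update)            # where momentum alone would move us
  gradient = dy_dx(lookAhead)                      # gradient at the look-ahead point
  update = (gamma * update) + (learningRate * gradient)
  xStart = xStart - update
}
plot(xStartHistory,type='l')
abline(h=2,col='red')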
Learning Rate 𝜂
◼ How to decide the Learning Rate (LR)?
◼ If LR is small
◼ Gradient will gently descend, and we will get the optimum
value of ‘x’ for which the error is minimum
◼ But it may take a long time to converge
◼ If LR is large
◼ The NN will converge faster
◼ But near the "minimum" point of the error function the 'x' values will
oscillate around it (see the comparison sketch after this slide)
Copyright 2019 - Dr. Ash Pahwa 20
◼ Function y = f(x)
◼ Gradient Descent Algorithm: Minimum
◼ Initialize the value of x
◼ Learning rate = η
◼ While NOT converged:
◼ x_{t+1} ← x_t − η (∂y/∂x)|_{x_t}
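The following R comparison is not part of the slides; it runs plain gradient descent on the same dy_dx derivative with an arbitrarily chosen small and large learning rate, to illustrate slow, smooth progress versus oscillation around the minimum at x = 2.
run_gd = function (eta, steps, x0) {
  history = rep(0, steps)
  x = x0
  for ( i in 1:steps ) {
    history[i] = x
    x = x - eta * dy_dx(x)                 # plain gradient descent step
  }
  history
}
smallLR = run_gd(0.05, 200, 1.0)           # descends smoothly toward x = 2
largeLR = run_gd(1.5, 200, 1.0)            # overshoots and keeps oscillating around x = 2
plot(smallLR, type='l', ylim=c(0,3))
lines(largeLR, col='red')
abline(h=2, lty=2)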
AdaGrad
Adaptive Gradient
◼ AdaGrad
◼ It dynamically varies the learning rate
◼ At each update
◼ For each weight individually
◼ Learning rate decreases as the algorithm moves forward
◼ Every variable will have a separate learning rate
◼ In the example network on this slide
◼ 5 weights from the input to the output layer
◼ 5 separate learning rates, each of which decreases as the algorithm
moves forward
Copyright 2019 - Dr. Ash Pahwa 21
AdaGrad
Adaptive Gradient Algorithm
◼ Gradient Descent
◼ While NOT converged:
◼ x_{t+1} ← x_t − η (∂y/∂x)|_{x_t}
◼ AdaGrad: the update rule is applied to each individual weight
◼ x_i^{t+1} = x_i^t − (η / (√(G_i^t) + ε)) · (∂y/∂x_i)|_t
◼ G_i^t = G_i^{t−1} + ((∂y/∂x_i)|_t)²
◼ G_i^0 = 0
◼ As we move ahead, the G value keeps increasing because we keep adding
the square of the derivative, which is always non-negative
◼ As G increases, the effective learning rate η / (√(G) + ε) decreases
◼ Epsilon (an arbitrarily small number) is added to the denominator to
avoid dividing by zero (a minimal R sketch of AdaGrad follows this slide)
Copyright 2019 - Dr. Ash Pahwa 22
Adaptive Learning Rate Schedule
for each weight
1 ≤ 𝑖 ≤ 𝑛
Where ‘n’ = number of weights
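The following R sketch is not part of the original deck; it applies the AdaGrad update to the single-weight error function used in the earlier examples (dy_dx), so G is a single accumulated sum of squared derivatives.
#########################################
# AdaGrad (sketch)
xStart = -2.0
learningRate = 0.1
epsilon = 1e-8
G = 0                                    # G_0 = 0
maxLimit = 20000
for ( i in 1:maxLimit ){
  gradient = dy_dx(xStart)
  G = G + gradient^2                     # accumulate the squared derivative
  xStart = xStart - (learningRate / (sqrt(G) + epsilon)) * gradient
}
xStart                                   # moves toward the minimum at x = 2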
Difference:
AdaGrad and RMSProp
◼ AdaGrad
◼ Since the ‘G’ value will increase continuously
◼ Learning rate will decrease continuously
◼ Eventually learning rate will be very small
◼ RMS Prop
◼ Since the ‘G’ value will NOT increase continuously
◼ It will adjust to the data and decrease or increase the
learning rate accordingly
Copyright 2019 - Dr. Ash Pahwa 23
◼ x_i^{t+1} = x_i^t − (η / (√(G_i^t) + ε)) · (∂y/∂x_i)|_t
◼ AdaGrad
◼ G_i^t = G_i^{t−1} + ((∂y/∂x_i)|_t)²
◼ RMS Prop
◼ G_i^t = β · G_i^{t−1} + (1 − β) · ((∂y/∂x_i)|_t)²
Root Mean Square Propagation
“RMS Prop” Algorithm
◼ AdaGrad
◼ x_i^{t+1} = x_i^t − (η / (√(G_i^t) + ε)) · (∂y/∂x_i)|_t
◼ G_i^t = G_i^{t−1} + ((∂y/∂x_i)|_t)²
◼ G_i^0 = 0
◼ RMS Prop
◼ x_i^{t+1} = x_i^t − (η / (√(G_i^t) + ε)) · (∂y/∂x_i)|_t
◼ G_i^t = β · G_i^{t−1} + (1 − β) · ((∂y/∂x_i)|_t)²
◼ Where β is another hyperparameter: typical value = 0.9 (a minimal R sketch of RMS Prop follows this slide)
Copyright 2019 - Dr. Ash Pahwa 24
Adaptive Learning Rate Schedule
for each weight
1 ≤ 𝑖 ≤ 𝑛
Where ‘n’ = number of weights
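The following R sketch is not part of the original deck; it is the same single-weight example as the AdaGrad sketch above, with only the accumulator changed to an exponentially weighted average (beta = 0.9, as suggested on this slide).
#########################################
# RMS Prop (sketch)
xStart = -2.0
learningRate = 0.1
epsilon = 1e-8
beta = 0.9
G = 0
maxLimit = 5000
for ( i in 1:maxLimit ){
  gradient = dy_dx(xStart)
  G = (beta * G) + ((1 - beta) * gradient^2)       # decaying average of squared derivatives
  xStart = xStart - (learningRate / (sqrt(G) + epsilon)) * gradient
}
xStart                                   # approaches x = 2, then hovers near it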
What is Adam?
Adaptive Moment Estimation
◼ Adam = RMS Prop + Momentum
◼ Adam takes the positive points of
◼ Momentum &
◼ RMS Prop algorithm
◼ Authors: Diederik Kingma and Jimmy Ba
Copyright 2019 - Dr. Ash Pahwa 25
Adam
Adaptive Moment Estimation
◼ Adaptive Moment Estimation (Adam) combines ideas from both
RMSProp and Momentum
◼ It computes adaptive learning rates for each parameter and works as
follows
◼ m_t = β₁ · m_{t−1} + (1 − β₁) (∂z/∂x)|_t
◼ v_t = β₂ · v_{t−1} + (1 − β₂) ((∂z/∂x)|_t)²
◼ Lastly, the parameters are updated using the information from the
calculated averages
◼ w_t = w_{t−1} − η · m_t / (√(v_t) + ε)
◼ m_t: exponentially weighted average of past gradients
◼ v_t: exponentially weighted average of the past squares of gradients
m₀ = 0
v₀ = 0
Copyright 2019 - Dr. Ash Pahwa 26
Bias Correction
◼ Adam works best when
◼ Expectation(m_t) = Expectation(distribution of the derivative)
◼ Expectation(v_t) = Expectation(distribution of the square of the derivative)
◼ Because m₀ = v₀ = 0, the early values of m_t and v_t are biased toward zero,
so after every update we correct them
◼ Adam
◼ m_t = β₁ · m_{t−1} + (1 − β₁) (∂z/∂x)|_t
◼ v_t = β₂ · v_{t−1} + (1 − β₂) ((∂z/∂x)|_t)²
◼ w_t = w_{t−1} − η · m_t / (√(v_t) + ε)
◼ Adam with Bias Correction
◼ m_t = β₁ · m_{t−1} + (1 − β₁) (∂z/∂x)|_t
◼ v_t = β₂ · v_{t−1} + (1 − β₂) ((∂z/∂x)|_t)²
◼ m̂_t = m_t / (1 − β₁^t)        v̂_t = v_t / (1 − β₂^t)
◼ w_t = w_{t−1} − η · m̂_t / (√(v̂_t) + ε)
Copyright 2019 - Dr. Ash Pahwa 27
Define Adam Hyper Parameters
Epochs = 500
Copyright 2019 - Dr. Ash Pahwa 28
> ############################################
> # EWMA: Exponentially Weighted Moving Average
> # Hyper parameters
> #
> beta1 <- 0.9
> beta2 <- 0.999
>
> ############################################
> # Initial value of 'm' and 'v' is zero
> m <- 0
> v <- 0
>
> epochs <- 500
> l <- nrow(Y)
> costHistory <- array(dim=c(epochs,1))
TensorFlow default parameter values
◼ β₁: Hyper parameter = 0.9
◼ Exponential decay rate for the first moment estimates
◼ β₂: Hyper parameter = 0.999
◼ Exponential decay rate for the second moment estimates
◼ η: Learning Rate = 0.01
◼ ε: Small value to avoid dividing by zero = 1e-08
m₀ = 0
v₀ = 0
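For reference (not in the original slides), the same hyper-parameter values can be passed to the Adam optimizer in the keras R package; this assumes the keras package is installed, and older package versions name the first argument lr instead of learning_rate.
library(keras)
opt = optimizer_adam(
  learning_rate = 0.01,                  # eta
  beta_1 = 0.9,                          # decay rate for the first moment estimate
  beta_2 = 0.999,                        # decay rate for the second moment estimate
  epsilon = 1e-08                        # small value to avoid dividing by zero
)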
Adam Implementation
Copyright 2019 - Dr. Ash Pahwa 29
𝑚0 = 0
𝑣0 = 0
> for(i in 1:epochs) {
+ # 1. Computed output: h = x values * weight Matrix
+ h = X %*% t(W)
+
+ # 2. Loss = Computed output - observed output
+ loss = h - Y
+
+ error = (h - Y)^2
+ costHistory[i] = sum(error)/(2*nrow(Y))
+
+ # 3. gradient = loss * X
+ gradient = (t(loss) %*% X)/l
+
+ # 4. calculate new W weight values
+ m = beta1*m + (1 - beta1)*gradient
+ v = beta2*v + (1 - beta2)*(gradient^2)
+
+   # 5. Bias-corrected values of m and v
+ m_hat = m/(1 - beta1^i)
+ v_hat = v/(1 - beta2^i)
+
+ # 6. Update the weights
+ W = W - learningRate*(m_hat/(sqrt(v_hat) + epsilon))
+ }
◼ Adam
◼ m_t = β₁ · m_{t−1} + (1 − β₁) (∂z/∂x)|_t
◼ v_t = β₂ · v_{t−1} + (1 − β₂) ((∂z/∂x)|_t)²
◼ w_t = w_{t−1} − η · m_t / (√(v_t) + ε)
◼ Adam Bias Correction
◼ m̂_t = m_t / (1 − β₁^t)        v̂_t = v_t / (1 − β₂^t)
◼ w_t = w_{t−1} − η · m̂_t / (√(v̂_t) + ε)
FINAL RESULTS
◼ Raw Data
◼ Gradient Descent
◼ L.R = 0.000,000,1
◼ Epochs = 100 Million
◼ Adam
◼ L.R = 0.01
◼ Epochs = 100,000
Copyright 2019 - Dr. Ash Pahwa 30
◼ Normalized Data
◼ Gradient Descent
◼ L.R = 0.001
◼ Epochs = 2,000
◼ Adam
◼ L.R = 0.01
◼ Epochs = 500
Summary
◼ Gradient Descent
◼ Momentum
◼ Nesterov Momentum
◼ AdaGrad
◼ RMS Prop
◼ Adam: Adaptive Moments
Copyright 2019 - Dr. Ash Pahwa 31