Gradient Descent is the most commonly used learning algorithm in machine learning, including for training Deep Neural Networks with Back Propagation. This was one lecture of a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
2. Linear Regression
• In its generic form, Multiple Linear Regression is
• Used when the X variables are linearly correlated with the Y variable
• Trying to represent the data points with a linear hyperplane (e.g.: a straight line in 2D, a flat plane in 3D),
• Denoted by, Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn
• The ML problem is finding the coefficients (βi values) of this hyperplane such that the error (the total of the distances from the data points to the hyperplane) is minimized
• We use the Mean Square Error to represent the error
• We can use polynomials of the Xi as variables to represent non-linear relationships between the Xi and Y.
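The last point above can be sketched in a few lines of NumPy: by stacking polynomial terms of a single X as extra columns, plain linear regression can capture a non-linear relationship. The data values here are illustrative, not from the lecture.

```python
import numpy as np

# Turn a single variable X into polynomial "features" (X, X^2, X^3) so that
# a linear model Y = β0 + β1*X + β2*X^2 + β3*X^3 can fit a non-linear curve.
X = np.array([1.0, 2.0, 3.0, 4.0])

# Each column is one "variable" Xi in the linear model
X_poly = np.column_stack([X, X**2, X**3])

print(X_poly.shape)  # one row per data point, one column per polynomial term
```

The model stays linear in the β parameters, which is why the same fitting machinery applies.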
3. Linear Regression Method
• As Linear Regression is a function of parameters, f_β(X) = Ŷ, we have to find β so that the error ε (= Y − Ŷ) is minimized
• There are two ways to computationally minimize this error and find the parameters
• In the Closed Form, the Normal Equation can directly find the parameter values (β values) from the matrix formula, β = (XᵀX)⁻¹XᵀY
• Using the iterative technique, Gradient Descent
• In this lesson we learn about Gradient Descent because
• The Normal Equation is computationally expensive for large datasets, as it requires computing a matrix inverse
• Learning with Gradient Descent lets us tune its algorithm-related parameters (hyperparameters) to reach a stable solution, which the Normal Equation does not allow
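For contrast with the iterative method in the rest of the lesson, the closed-form Normal Equation β = (XᵀX)⁻¹XᵀY can be sketched directly in NumPy. The dataset below is a noise-free synthetic example generated from known coefficients, so the formula recovers them exactly.

```python
import numpy as np

# Closed-form Normal Equation on a tiny synthetic dataset.
rng = np.random.default_rng(0)
n = 100
X_raw = rng.uniform(-1, 1, size=(n, 2))
X = np.column_stack([np.ones(n), X_raw])   # prepend a column of 1s for β0
true_beta = np.array([0.5, 2.0, -3.0])
Y = X @ true_beta                          # noise-free for clarity

# β = (XᵀX)⁻¹XᵀY; np.linalg.solve is preferred over forming the inverse
beta = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta)  # recovers [0.5, 2.0, -3.0] up to floating-point error
```

Solving the linear system avoids explicitly inverting XᵀX, but the cost still grows quickly with the number of variables and data points, which motivates Gradient Descent.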
4. Gradient Descent – Simple Linear Regression
• In the simplest form of Linear Regression we have f_β(X) = β0 + β1*X1, where β1 is the gradient and β0 is the intercept of a straight line
• If we visualize how the error J(β) (also known as the Cost) varies with β0 and β1, we get a 3D graph like the following
5. Gradient Descent – Simple Linear Regression
• In the Gradient Descent algorithm, we first assign initial values to the constants β0 and β1. For example,
1. We can assign random values to β0 and β1 – known as Random Initialization
2. We can assign 0 (zero) to β0 and β1 – known as Zero Initialization
• Then we try to iteratively move to the lowest cost point. E.g.:
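The two initialization options above can be sketched as follows; the scale of the random values is an illustrative choice, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1. Random Initialization: small random values for β0 and β1
beta_random = rng.normal(loc=0.0, scale=0.01, size=2)

# 2. Zero Initialization: both parameters start at 0
beta_zero = np.zeros(2)

print(beta_random, beta_zero)
```

For Linear Regression either choice works, since the cost surface has a single minimum; the distinction matters more for the neural networks mentioned at the start.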
6. Gradient Descent
• As it is difficult to explain this 3D scenario, let's assume we want to minimize the cost function J(β) with respect to a single weight, β
[Figure: the curve of the cost J(β) plotted against β]
7. Gradient Descent
• When we iteratively move towards the minimum cost point, you can see that the gradient (the slope of the curve) reduces and goes to zero
• The gradient of a function is its derivative
• Therefore, the slope at β is dJ(β)/dβ
• But in reality, there is more than one β, like β0 and β1
• Therefore, we have to use partial derivatives, where the slope at β is ∂J(β)/∂β
8. Gradient Descent
• When the slope is positive, that means the β value is higher than the optimal (least-cost) value of β
• In that case we have to subtract some value from the current β to bring it to the optimal value
• What is the value to be subtracted from β?
• It is best to use a value proportional to the derivative, ∂J(β)/∂β
• But that number should also be sufficiently small
• Otherwise, the new β will overshoot and end up much smaller than the optimal β
• For that we use a pre-defined very small constant value α known as the Learning Rate
• So we subtract the product of these two values: α·∂J(β)/∂β
9. Gradient Descent
• Now we have Gradient Descent's parameter update formula, to be applied in each iteration (epoch),
β := β − α·∂J(β)/∂β
where α is a small value like 0.01
• Once we have initialized the value of β, we can iteratively update it until the cost function shows no significant reduction
• Finally, we can use the value of β as the solution of the Linear Regression
• The same formula can be used when there is more than one parameter, taking β as the vector of all parameters β0, β1, …, βn
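The update rule β := β − α·∂J(β)/∂β can be sketched on a one-parameter cost. The cost J(β) = (β − 3)², its minimum at β = 3, and the starting point are illustrative choices, not from the lecture.

```python
# Minimizing J(β) = (β - 3)² with the Gradient Descent update rule.
def dJ(beta):
    return 2 * (beta - 3)   # derivative of (β - 3)²

alpha = 0.1                  # Learning Rate
beta = 0.0                   # Zero Initialization
for epoch in range(200):
    beta = beta - alpha * dJ(beta)   # β := β - α·dJ(β)/dβ

print(beta)  # close to the optimal value 3
```

Note how the step size shrinks automatically as β approaches 3, because the derivative itself goes to zero there, just as the slides describe.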
10. Gradient Descent – Derivative of Cost
In Linear Regression (which is what we discuss in this lesson), we use a slightly different version of the Mean Square of Errors (MSE) as the Cost Function, J(β):

J(β) = (1/2) · Σᵢ₌₁ⁿ (Ŷi − Yi)²

where n is the number of data points.
(This is why you get a convex, bowl-like shape for Simple Linear Regression when there are 2 parameters)
Let's find the derivative of the Cost with respect to any parameter, βj:

∂J(β)/∂βj = 2 · (1/2) · Σᵢ₌₁ⁿ (Ŷi − Yi) · ∂(Ŷi − Yi)/∂βj   (from the chain rule of derivation)
= Σᵢ₌₁ⁿ (Ŷi − Yi) · ∂/∂βj (β0 + β1·Xi,1 + β2·Xi,2 + … + βj·Xi,j + … + βn·Xi,n − Yi)
= Σᵢ₌₁ⁿ (Ŷi − Yi) · Xi,j
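The derived gradient ∂J/∂βj = Σᵢ (Ŷi − Yi)·Xi,j can be sanity-checked numerically: a finite-difference derivative of J(β) = ½ Σᵢ (Ŷi − Yi)² should agree with the analytic formula. The random data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # [1, X1] per row
Y = rng.normal(size=n)
beta = rng.normal(size=2)

def cost(b):
    return 0.5 * np.sum((X @ b - Y) ** 2)   # J(β) = ½ Σ (Ŷi - Yi)²

# Analytic gradient from the derivation: Σᵢ (Ŷi - Yi)·Xi,j for each j
analytic = X.T @ (X @ beta - Y)

# Central finite differences: (J(β+ε) - J(β-ε)) / 2ε in each direction
eps = 1e-6
numeric = np.zeros(2)
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    numeric[j] = (cost(beta + step) - cost(beta - step)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny: the two gradients agree
```

This kind of gradient check is a standard way to catch sign or indexing mistakes before running the full algorithm.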
11. Gradient Descent – Update Rule
• Parameter update rule for parameter βj, where n is the total number of data points,

βj := βj − α·∂J(β)/∂βj

βj := βj − α · Σᵢ₌₁ⁿ (Ŷi − Yi) · Xi,j
12. Gradient Descent – Algorithm (Summary)
• Initialize the βj parameters
• Assign a small value to the Learning Rate α (e.g.: 0.01)
• Apply the parameter update rule for each parameter βj (where n is the number of data points) in each epoch,

βj := βj − α · Σᵢ₌₁ⁿ (Ŷi − Yi) · Xi,j

• Stop the repetition when the cost reduction becomes very small
• Now you can use the βj values to predict Ŷ values for new X values
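The summarized algorithm can be sketched end-to-end for Simple Linear Regression. The dataset, the learning rate, and the stopping tolerance below are illustrative choices; in particular α must be small enough for the scale of the data (here 0.005), or the updates diverge.

```python
import numpy as np

# Synthetic data roughly following Y = 4 + 2.5·X1 plus small noise
rng = np.random.default_rng(7)
n = 200
X1 = rng.uniform(0, 1, size=n)
Y = 4.0 + 2.5 * X1 + rng.normal(scale=0.05, size=n)

X = np.column_stack([np.ones(n), X1])   # columns for β0 (intercept) and β1
beta = np.zeros(2)                      # Zero Initialization
alpha = 0.005                           # Learning Rate

prev_cost = np.inf
for epoch in range(100_000):
    residual = X @ beta - Y                  # Ŷi - Yi for every data point
    beta -= alpha * (X.T @ residual)         # βj := βj - α Σᵢ (Ŷi-Yi)·Xi,j
    cost = 0.5 * np.sum(residual ** 2)       # J(β) = ½ Σ (Ŷi - Yi)²
    if prev_cost - cost < 1e-10:             # cost reduction is very little
        break
    prev_cost = cost

print(beta)  # close to [4.0, 2.5]
```

The matrix expression `X.T @ residual` computes all n per-point contributions and all parameters' updates at once, which is exactly the summed update rule above.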
13. Gradient Descent – Convergence
• In each epoch the cost is reduced at a decreasing rate if the process is Convergent (approaching a certain lower error level)
• After a large number of epochs the cost reduction becomes insignificant, and the cost stabilizes around a certain value
• Linear Regression is always Convergent when a proper learning rate is used, as there are no multiple local minima (i.e. there is only one point where the cost is minimized)
14. Cost Function – 2D Visualization
• As the cost function of Simple Linear Regression, J(β), needs a 3D visualization, we need a way to visualize it as a 2D image
• Contour Curves are a way of converting a 3D visualization to 2D
15. Cost Function – Effect of Learning Rate
• The learning rate is a hyperparameter that has to be set manually, making sure that,
• The model converges to a solution
• i.e.: it should not diverge
• The training time is low
• The final cost is low
• Too large Learning Rates have a higher tendency to diverge
• Too low Learning Rates train slower
• Hence, we have to find an optimum rate
16. Cost Function – Effect of Learning Rate
• The Learning Rate is a compromise: a higher rate risks instability in exchange for faster convergence
• Depending on the situation, a higher learning rate may converge faster, or convergence may even slow down due to higher oscillation, or the process may diverge altogether
• On the other hand, a lower learning rate converges slowly but is highly likely to converge
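These three regimes can be demonstrated on the one-parameter cost J(β) = β² (derivative 2β). The specific rates below are tuned to this toy cost and are illustrative only: for this cost any α above 1.0 diverges.

```python
# Gradient Descent on J(β) = β² with three different learning rates.
def run(alpha, epochs=50):
    beta = 5.0
    for _ in range(epochs):
        beta = beta - alpha * 2 * beta   # β := β - α·dJ/dβ
    return beta

small = run(0.01)   # converges, but slowly: still far from 0 after 50 epochs
good = run(0.3)     # converges quickly to ~0
large = run(1.5)    # diverges: each step overshoots and |β| grows

print(small, good, large)
```

With α = 1.5 each update multiplies β by (1 − 2·1.5) = −2, so the parameter oscillates with exponentially growing magnitude, which is exactly the divergence described above.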
17. Batch Gradient Descent
• The iteration step we already learned is Batch Gradient Descent
• The update rule is,
• In each iteration (epoch)
• βj := βj − α · Σᵢ₌₁ⁿ (Ŷi − Yi) · Xi,j
• Here we use the whole dataset (Batch) of size n in each epoch
• Very good at updating in the correct direction in each epoch
• But very computationally expensive, as the whole batch of size n is iterated over inside each epoch
18. Stochastic Gradient Descent (SGD)
• Instead of the whole batch, a single data point is used to update βj at a time
• The update rule is,
• In each epoch,
• For each data point i
• βj := βj − α · (Ŷi − Yi) · Xi,j
• As every single data point triggers an update, convergence is faster for larger datasets (e.g.: 100,000 data points)
• As each individual data point can differ substantially from the overall distribution, each update may not happen in the correct direction
• The process will not settle exactly on a certain minimum cost, as the cost keeps changing with each update
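The per-point update can be sketched as follows; the dataset, learning rate, and epoch count are illustrative. Shuffling the data each epoch is a common practice so that updates do not follow a fixed order.

```python
import numpy as np

# Synthetic data roughly following Y = 1 + 2·X1 plus noise
rng = np.random.default_rng(3)
n = 500
X1 = rng.uniform(-1, 1, size=n)
Y = 1.0 + 2.0 * X1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), X1])

beta = np.zeros(2)
alpha = 0.01
for epoch in range(50):
    for i in rng.permutation(n):            # shuffle the visiting order
        residual_i = X[i] @ beta - Y[i]     # Ŷi - Yi for one data point
        beta -= alpha * residual_i * X[i]   # per-point update of all βj

print(beta)  # hovers near [1.0, 2.0] but keeps fluctuating
```

Each epoch performs n small updates instead of one large one, which is why the parameters move quickly at first but never settle exactly, as the slide notes.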
19. Mini-Batch Gradient Descent
• This is a balance between Batch Gradient Descent and Stochastic Gradient Descent
• The update rule is,
• In each iteration (epoch)
• For each of the mini-batches (i.e.: n/m of them)
• βj := βj − α · Σᵢ₌₁ᵐ (Ŷi − Yi) · Xi,j
• Here n is the dataset (full batch) size and m is the mini-batch size
• where m is generally 64, 128, 256, 512 or 1024
• As m >> 1, the gradient changes in a much more correct direction in each update, and stabilizes much closer to the optimum point than in SGD
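The mini-batch variant can be sketched as follows, with n = 1024 and m = 64 as illustrative sizes; each epoch processes n/m = 16 shuffled mini-batches.

```python
import numpy as np

# Synthetic data roughly following Y = 3 - 1.5·X1 plus noise
rng = np.random.default_rng(5)
n, m = 1024, 64                      # n data points, mini-batch size m
X1 = rng.uniform(-1, 1, size=n)
Y = 3.0 - 1.5 * X1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), X1])

beta = np.zeros(2)
alpha = 0.01
for epoch in range(200):
    order = rng.permutation(n)                 # reshuffle each epoch
    for start in range(0, n, m):               # n/m mini-batches per epoch
        batch = order[start:start + m]
        residual = X[batch] @ beta - Y[batch]  # Ŷi - Yi over one mini-batch
        beta -= alpha * (X[batch].T @ residual)  # summed update over m points

print(beta)  # close to [3.0, -1.5]
```

Because each update averages over m points, its direction is far less noisy than a single-point SGD step, yet each epoch is still split into many cheap updates rather than one pass over the full batch.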
21. One Hour Homework
• Officially we have one more hour of work after the end of the lectures
• Therefore, for this week's extra hour you have homework
• Gradient Descent is the core learning algorithm in almost all of the ML ahead, including the Deep Learning related subject modules
• Go through the slides until you clearly understand Gradient Descent
• Refer to external sources to clarify all the ambiguities related to it
• Good Luck!