DA 5230 – Statistical & Machine Learning
Lecture 5 – Gradient Descent
Maninda Edirisooriya
manindaw@uom.lk
Linear Regression
• In its generic form, Multiple Linear Regression is
• Used when the X variables are linearly correlated with the Y variable
• An attempt to represent the data points with a linear hyperplane (e.g.: a flat plane in 2D)
• Denoted by Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn
• The ML problem is finding the coefficients (βi values) of this hyperplane so
that the error (the total of the distances from the data points to the
hyperplane) is minimized
• We use the Mean Squared Error (MSE) to represent this error
• We can use polynomials of the Xi as variables to represent non-linear
relationships between the Xi and Y
Linear Regression Method
• As Linear Regression is a function of its parameters, f𝜷(X) = Ŷ, we have to
find β so that the error ε (= Y − Ŷ) is minimized
• There are two ways to computationally minimize this error and find the
parameters
• In Closed Form, the Normal Equation can directly find the parameter values (β
values) from the matrix formula, β = (XᵀX)⁻¹XᵀY
• Using the iterative technique, Gradient Descent
• In this lesson we learn about Gradient Descent because
• The Normal Equation is computationally expensive for large datasets, as it
requires computing a matrix inverse
• Gradient Descent lets us tune its algorithm-related parameters
(hyperparameters) to reach a stable solution, which the Normal Equation cannot
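A minimal NumPy sketch (not from the slides) of the Normal Equation on synthetic data; the data, seed and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept column + one feature
Y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, n)        # noisy linear data

# Closed form: beta = (X^T X)^{-1} X^T Y
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta)  # approximately [2, 3]; inverting X^T X becomes costly for many features
```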
Gradient Descent – Simple Linear Regression
• In the simplest form of Linear Regression we have f𝜷(X) = β0 + β1*X1
where β1 is the gradient and β0 is the intercept of a straight line
• If we visualize how the error J(β) (also known as the Cost) varies with β0
and β1, we get a 3D graph like the following
Gradient Descent – Simple Linear Regression
• In the Gradient Descent algorithm we first assign some values to the
parameters β0 and β1. For example,
1. We can assign random values to β0 and β1 – known as Random Initialization
2. We can assign 0 (zero) to both β0 and β1 – known as Zero Initialization
• Then we iteratively move towards the point of lowest cost
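A tiny sketch of the two initialization options; the 0.01 scale for random initialization is an illustrative choice, not something the slides prescribe.

```python
import numpy as np

rng = np.random.default_rng(42)
beta_random = rng.normal(0, 0.01, size=2)  # 1. Random Initialization: small noise
beta_zero = np.zeros(2)                    # 2. Zero Initialization: all zeros
```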
Gradient Descent
• As this 3D scenario is difficult to explain, let's assume we want to
minimize the cost function J(β) with respect to a single weight, β
[Figure: the cost J(β) plotted against a single parameter β]
Gradient Descent
• As we iteratively move towards the minimum cost point, you can see that the
gradient (the slope of the curve) reduces and goes to zero
• The gradient of a function is its derivative
• Therefore, the slope at β is dJ(β)/dβ
• But in reality there is more than one β, like β0 and β1
• Therefore, we have to use the partial derivative, where the slope at β is
∂J(β)/∂β
Gradient Descent
• When the slope is positive, the β value is higher than the optimal (least
cost) value of β
• In that case we have to subtract some value from the current β to bring it
towards the optimal value
• What is the value to be subtracted from β?
• It is better to use a value proportional to the derivative, ∂J(β)/∂β
• But that number should be sufficiently small too
• Otherwise, the new β will overshoot and end up far below the optimal β
• For that we scale the derivative by a pre-defined small constant α known as the Learning Rate
• So we subtract the product of these two values: α ∂J(β)/∂β
Gradient Descent
• Now we have Gradient Descent's parameter update formula, to be applied in
each iteration (epoch),
β ≔ β − α ∂J(β)/∂β
where α is a small value like 0.01
• Once we have initialized the value of β we can iteratively update it until
the cost function shows no significant reduction
• Finally, we use the resulting value of β as the solution of Linear Regression
• The same formula can be used when there is more than one parameter, taking β
as the vector of all parameters β0, β1, ..., βn
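A toy illustration of the update rule on J(β) = β², whose derivative is 2β; this cost is an assumption chosen only to show the mechanics, not the regression cost (which follows next).

```python
beta = 5.0    # some initial value
alpha = 0.1   # the Learning Rate

for epoch in range(50):
    grad = 2 * beta            # dJ(beta)/dbeta at the current beta
    beta = beta - alpha * grad # the update formula

print(beta)   # close to 0, the minimizer of beta**2
```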
Gradient Descent – Derivative of Cost
In Linear Regression (which is what we discuss in this lesson), we use a
slightly different version of the Mean Squared Error (MSE) as the Cost
Function, J(β):

J(β) = (1/2) Σi=1..n (Ŷi − Yi)²

Where n is the number of data points
(This is why you get a convex, bowl-like shape for Simple Linear Regression
when there are 2 parameters)
Let's find the derivative of the Cost with respect to any parameter, βj:

∂J(β)/∂βj = 2 * (1/2) Σi=1..n (Ŷi − Yi) * ∂(Ŷi − Yi)/∂βj    (from the chain rule)
          = Σi=1..n (Ŷi − Yi) * ∂/∂βj (β0 + β1*Xi,1 + β2*Xi,2 + ... + βj*Xi,j + ... + βn*Xi,n − Yi)
          = Σi=1..n (Ŷi − Yi) * Xi,j
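A finite-difference sanity check (not part of the slides) that the derived gradient Σ (Ŷi − Yi) Xi,j matches the numerical derivative of J; all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
Y = rng.normal(size=20)
beta = rng.normal(size=2)

def cost(b):
    return 0.5 * np.sum((X @ b - Y) ** 2)   # J(beta) as defined above

analytic = X.T @ (X @ beta - Y)             # the result of the derivation
eps = 1e-6
numeric = np.array([
    (cost(beta + eps * np.eye(2)[j]) - cost(beta - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])
print(np.allclose(analytic, numeric))       # True: the derivation is consistent
```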
Gradient Descent – Update Rule
• Parameter update rule for parameter βj, where n is the total number of data
points,
βj ≔ βj − α ∂J(β)/∂βj
βj ≔ βj − α Σi=1..n (Ŷi − Yi) Xi,j
Gradient Descent – Algorithm (Summary)
• Initialize the βj parameters
• Assign a small value to the Learning Rate α (e.g.: 0.01)
• Apply the parameter update rule for each parameter βj (where n is the
number of data points) in each epoch,
βj ≔ βj − α Σi=1..n (Ŷi − Yi) Xi,j
• Stop iterating when the reduction in the cost function becomes very small
• Now you can use the βj values to predict Ŷ values for new X values
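A minimal end-to-end sketch of the summary above on synthetic data; the learning rate 1e-4 and the stopping threshold 1e-8 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one feature
Y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, n)

beta = np.zeros(X.shape[1])   # Zero Initialization
alpha = 1e-4                  # small Learning Rate (must be small enough to converge)
prev_cost = np.inf

for epoch in range(100_000):
    residual = X @ beta - Y               # Yhat_i - Y_i for every data point
    beta -= alpha * (X.T @ residual)      # the update rule over all n points
    cost = 0.5 * np.sum(residual ** 2)
    if prev_cost - cost < 1e-8:           # cost reduction is very little: stop
        break
    prev_cost = cost

print(beta)   # approximately [2, 3]
```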
Gradient Descent – Convergence
• In each epoch the cost is reduced at a decreasing rate if the process is
Convergent (approaching a certain lower error level)
• After a large number of epochs the cost reduction becomes insignificant and
the cost stabilizes around a certain value
• Linear Regression is always Convergent when a proper learning rate is used,
as there are no multiple local minima (i.e. there is no more than one point
where the cost is minimized)
Cost Function – 2D Visualization
• As the cost function of Simple Linear Regression, J(β), needs a 3D
visualization, we need a way to view it as a 2D image
• Contour Curves are a way of converting a 3D visualization to 2D
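A sketch of Contour Curves for the Simple Linear Regression cost, assuming matplotlib is available; the data and grid ranges are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)

b0, b1 = np.meshgrid(np.linspace(-2, 6, 200), np.linspace(0, 6, 200))
J = 0.5 * ((b0[..., None] + b1[..., None] * x - y) ** 2).sum(axis=-1)  # cost on the grid

plt.contour(b0, b1, J, levels=30)  # each curve joins (beta0, beta1) pairs of equal cost
plt.xlabel("beta0")
plt.ylabel("beta1")
plt.show()
```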
Cost Function – Effect of Learning Rate
• The learning rate is a hyperparameter that
has to be set manually, making sure,
• The model converges to a solution
• i.e.: it should not diverge
• The training time is low
• The final cost is low
• Too large a Learning Rate has a higher
tendency to diverge
• Too low a Learning Rate trains slowly
• Hence, we have to find an optimum rate
Cost Function – Effect of Learning Rate
• Choosing the Learning Rate is a compromise: higher risk in exchange for
faster convergence
• Depending on the situation, a higher learning rate may converge faster,
converge more slowly due to heavier oscillation, or even diverge
• On the other hand, a lower learning rate converges slowly but is much more
likely to converge
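A quick experiment (all values illustrative) showing the trade-off: too low a rate barely moves in 100 epochs, while too high a rate makes the cost explode.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, 100)])
Y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, 100)

for alpha in [1e-6, 1e-4, 5e-4, 1e-3]:
    beta = np.zeros(2)
    for _ in range(100):
        beta -= alpha * (X.T @ (X @ beta - Y))   # 100 epochs of batch updates
    cost = 0.5 * np.sum((X @ beta - Y) ** 2)
    print(f"alpha={alpha:g}  cost after 100 epochs = {cost:.4g}")  # 1e-3 diverges
```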
Batch Gradient Descent
• The iteration step we have already learned is Batch Gradient Descent
• The update rule is,
• In each iteration (epoch)
• βj ≔ βj − α Σi=1..n (Ŷi − Yi) Xi,j
• Here the whole dataset (the Batch) of size n is used in each epoch
• Very good at updating in the correct direction in each epoch
• But computationally expensive, as the whole batch of size n is iterated
over inside each epoch
Stochastic Gradient Descent (SGD)
• Instead of the whole batch, a single data point is used to update βj at a time
• The update rule is,
• In each epoch,
• For each data point i
• βj ≔ βj − α (Ŷi − Yi) Xi,j
• As every single data point triggers an update, convergence is faster for
larger datasets (e.g.: 100,000 data points)
• As an individual data point can differ greatly from the overall
distribution, each update may not move in the correct direction
• The cost will not settle at a certain minimum, as it keeps changing with
every update
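A minimal SGD sketch; shuffling the visiting order each epoch is a common practice added here, not something the slides prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
Y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, n)

beta = np.zeros(2)
alpha = 0.001

for epoch in range(20):
    for i in rng.permutation(n):        # one update per data point, random order
        residual = X[i] @ beta - Y[i]   # Yhat_i - Y_i for this single point
        beta -= alpha * residual * X[i]

print(beta)   # near [2, 3], but it keeps fluctuating around the minimum
```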
Mini-Batch Gradient Descent
• This is a balance between Batch Gradient Descent and Stochastic Gradient
Descent
• The update rule is,
• In each iteration (epoch)
• For each of the mini-batches (i.e.: n/m of them)
• βj ≔ βj − α Σi=1..m (Ŷi − Yi) Xi,j
• Here n is the batch size and m is the mini-batch size
• In general m is 64, 128, 256, 512 or 1024
• As m >> 1, the gradient moves in a much more correct direction in each
epoch, and stabilizes much closer to the optimum point than in SGD
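A mini-batch sketch with m = 64 (one of the typical sizes listed above); note that m = 1 recovers SGD and m = n recovers Batch Gradient Descent. Data and the learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1024, 64
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
Y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, n)

beta = np.zeros(2)
alpha = 5e-4

for epoch in range(200):
    order = rng.permutation(n)
    for start in range(0, n, m):               # n/m mini-batches per epoch
        idx = order[start:start + m]
        residual = X[idx] @ beta - Y[idx]
        beta -= alpha * (X[idx].T @ residual)  # update over one mini-batch

print(beta)   # close to [2, 3], steadier than SGD
```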
Convergence Patterns (Summary)
One Hour Homework
• Officially we have one more hour of work to do after the end of the lectures
• Therefore, for this week's extra hour you have homework
• Gradient Descent is the core learning algorithm in almost all the ML ahead,
including in the Deep Learning related subject modules
• Go through the slides until you clearly understand Gradient Descent
• Refer to external sources to clarify any remaining ambiguities
• Good Luck!
Questions?