Machine Learning Programming
BDA712-00
Lecturer: Josué Obregón PhD
Kyung Hee University
Department of Big Data Analytics
September 28, 2022
Linear Regression II:
Gradient Descent and Multiple Linear Regression
Previously, in our course…
• Your first learning program: building a tiny supervised learning program
• Hyperspace!: multiple linear regression
• Getting real: recognizing a single digit using MNIST
• A discerning machine: from regression to classification
• Walking the gradient: the gradient descent algorithm
And today…
(The same course roadmap, now focused on "Walking the gradient: the gradient descent algorithm" and "Hyperspace!: multiple linear regression".)
Today's agenda
• What’s wrong with our current train() function?
• Gradient descent
• Multiple linear regression implementation
• Interpreting a linear regression model
What's wrong with our current train() function?
• We are learning just one parameter on each iteration.
• How can we learn both parameters at the same time?
• Find all possible combinations of per-parameter adjustments: that is $3^n$ combinations, with $n$ = number of parameters (each parameter can move up, move down, or stay put).
• We would call loss() on every combination!!
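As a concrete reference point, here is a minimal sketch of the kind of train() we have built so far. The names and details are illustrative rather than the exact lab code: it nudges one parameter at a time by ±lr and keeps whichever single nudge lowers the loss.

```python
import numpy as np

def predict(X, w, b):
    return X * w + b

def loss(X, Y, w, b):
    # mean squared error of the current line
    return np.average((predict(X, w, b) - Y) ** 2)

def train(X, Y, iterations, lr):
    w = b = 0.0
    for _ in range(iterations):
        current = loss(X, Y, w, b)
        # each step tweaks exactly ONE of the two parameters
        if loss(X, Y, w + lr, b) < current:
            w += lr
        elif loss(X, Y, w - lr, b) < current:
            w -= lr
        elif loss(X, Y, w, b + lr) < current:
            b += lr
        elif loss(X, Y, w, b - lr) < current:
            b -= lr
        else:
            return w, b   # no single-parameter nudge helps anymore
    raise Exception(f"Couldn't converge within {iterations} iterations")
```

Extending this scheme to update all parameters jointly would mean evaluating every combination of {+lr, 0, -lr} across the parameters, hence the $3^n$ calls to loss().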
Enter Gradient Descent
• Brief review of the intuition of our loss/cost function
• Intuition behind Gradient Descent
• Gradient Descent for linear regression
• Implement Gradient Descent in our code
Cost function

Training Set:

Size in feet² (X)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

Function: $\hat{y}(x) = \beta_0 + \beta_1 x_1$, with $\beta_0 = b$ and $\beta_1 = w$ in our earlier notation.
The $\beta$'s are the parameters. How do we choose the $\beta$'s?
Cost function
[Figure: three example fits of $\hat{y}(x) = \beta_0 + \beta_1 x_1$ on the same axes, with $\beta_0 = b$ and $\beta_1 = w$]
• $\beta_0 = 1.5$, $\beta_1 = 0$ (a horizontal line)
• $\beta_0 = 0$, $\beta_1 = 0.5$ (a line through the origin)
• $\beta_0 = 1$, $\beta_1 = 0.5$ (the same slope, shifted up by 1)
Cost function
[Figure: training examples $(x, y)$ scattered around a candidate line]

Idea: choose $\beta_0$, $\beta_1$ so that our function $\hat{y}(x)$ is close to $y$ for the training examples $(x, y)$.

$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

The sum $\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ is the residual sum of squares (RSS); dividing by $n$ gives the mean squared error.

With $\beta_0 = b$ and $\beta_1 = w$, we want

$$\operatorname*{argmin}_{\beta_0, \beta_1} L(\beta_0, \beta_1), \qquad L(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2$$
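The mean squared error above translates directly into a few lines of NumPy. This is a minimal sketch, assuming X and Y are arrays holding the training inputs and targets; the function and argument names are ours, not prescribed by the lab code.

```python
import numpy as np

def loss(X, Y, beta0, beta1):
    """Mean squared error of the line y_hat = beta0 + beta1 * x over the training set."""
    y_hat = beta0 + beta1 * X           # prediction for every training example at once
    return np.average((y_hat - Y) ** 2)

X = np.array([1.0, 2.0, 3.0])
Y = np.array([1.0, 2.5, 3.5])
print(loss(X, Y, 0.0, 1.0))  # MSE of the line through the origin with slope 1
```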
Cost function (with $\beta_0 = 0$)
[Figure: left, the data and the line $\hat{y}(x)$ for a fixed $\beta_1$ (a function of $x$); right, the loss $L(\beta_1)$ (a function of the parameter $\beta_1$). The slopes $\beta_1 = 2$, $\beta_1 = 1$, and $\beta_1 = 0.5$ each correspond to one point on the $L(\beta_1)$ curve.]
Cost function
[Figure: left, $\hat{y}(x)$ for fixed $\beta_0, \beta_1$ (a function of $x$); right, the loss surface $L(\beta_0, \beta_1)$ (a function of the parameters $\beta_0, \beta_1$)]
Gradient Descent

Repeat until convergence {
$$\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} L(\beta_0, \beta_1)$$
}

Here $L$ is the loss/cost function, $\frac{\partial}{\partial \beta_j} L$ is its derivative (gradient) with respect to $\beta_j$, and $\alpha$ is the learning rate.
Gradient Descent Intuition
Gradient Descent

$$\beta_1 = \beta_1 - \alpha \frac{\partial}{\partial \beta_1} L(\beta_1)$$

If $\alpha$ is too small, gradient descent can be slow. If $\alpha$ is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
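A toy illustration of both failure modes. This is an assumption-laden sketch: a made-up three-point dataset whose loss $L(\beta_1)$ is minimized exactly at $\beta_1 = 1$, with three learning rates tried on the same gradient.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([1.0, 2.0, 3.0])          # L(beta1) is minimized at beta1 = 1

def dL_dbeta1(beta1):
    # derivative of (1/n) * sum((beta1*x_i - y_i)^2) with respect to beta1
    return 2 * np.average((beta1 * X - Y) * X)

for alpha in (0.01, 0.1, 0.5):
    beta1 = 0.0
    for _ in range(50):
        beta1 -= alpha * dL_dbeta1(beta1)
    print(f"alpha={alpha}: beta1 = {beta1:.4g}")
# alpha=0.01 crawls slowly toward 1, alpha=0.1 converges, alpha=0.5 diverges
```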
Gradient Descent for Linear Regression

$$L(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2$$

$$\frac{\partial L}{\partial \beta_0} = \frac{2}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i) \qquad \frac{\partial L}{\partial \beta_1} = \frac{2}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)\, x_i$$

Repeat until convergence {
$$\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} L(\beta_0, \beta_1) \quad \text{(for } j = 0 \text{ and } j = 1\text{)}$$
}
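Putting the pieces together, here is a minimal sketch of gradient-descent training for our two-parameter model; the derivatives above become a gradient() function, and the function names are placeholders for what we build in the lab.

```python
import numpy as np

def gradient(X, Y, beta0, beta1):
    # partial derivatives of L = (1/n) * sum((beta0 + beta1*x_i - y_i)^2)
    error = (beta0 + beta1 * X) - Y
    d_beta0 = 2 * np.average(error)
    d_beta1 = 2 * np.average(error * X)
    return d_beta0, d_beta1

def train(X, Y, iterations, lr):
    beta0 = beta1 = 0.0
    for _ in range(iterations):
        d_beta0, d_beta1 = gradient(X, Y, beta0, beta1)
        beta0 -= lr * d_beta0      # both parameters move
        beta1 -= lr * d_beta1      # in the same iteration
    return beta0, beta1
```

Note that both parameters are updated from the same gradient evaluation, which is exactly what the one-at-a-time train() could not do.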
Lab session 04
• Link: https://classroom.github.com/a/Tg1rlOGQ
Let’s go back to some theory…
Linear regression
• Linear regression is a supervised learning approach that models the dependence of $Y$ on the covariates $X_1, X_2, \ldots, X_p$ as being linear:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon = \underbrace{\beta_0 + \sum_{j=1}^{p} \beta_j X_j}_{f_L(X)} + \epsilon$$

where $\epsilon$ is the error term.
• With a single predictor, $Y = \beta_0 + \beta_1 X_1 + \epsilon$ is simple linear regression (with $\beta_0 = b$ and $\beta_1 = w$ in our earlier notation); with multiple predictors it is multiple linear regression.
• The true regression function $E(Y \mid X = x)$ might not be linear (it almost never is).
• Linear regression aims to estimate $f_L(X)$: the best linear approximation to the true regression function.
Best linear approximation
[Figure: the true regression function $f(x)$, one linear regression estimate $\hat{f}(x)$, and another linear regression estimate $\hat{f}(x)$, plotted over the same data]
Linear regression
• Here's the linear regression model again:

$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \epsilon$$

• The $\beta_j$, $j = 0, \ldots, p$ are called model coefficients or parameters.
• Given estimates $\hat{\beta}_j$ for the model coefficients, we can predict the response at a value $x = (x_1, \ldots, x_p)$ via

$$\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j$$

• The hat symbol denotes values estimated from the data.
Linear regression estimates in 1-dimension
[Figure: scatterplot of Sales versus TV with the fitted regression line]
Figure: 3.1 from ISLR. Blue line shows the best fit for the regression of Sales onto TV. Lines from observed points to the regression line illustrate the residuals. For any other choice of slope or intercept, the sum of squared vertical distances between that line and the observed data would be larger than that of the line shown here.
Linear regression estimates in 2-dimensions
[Figure: plane fit to points in $(X_1, X_2, Y)$ space]
Figure: 3.4 from ISLR. The 2-dimensional plane is the best fit of $Y$ onto the predictors $X_1$ and $X_2$. If you tilt this plane in any way, you would get a larger sum of squared vertical distances between the plane and the observed data.
Linear Regression
• Linear regression aims to predict the response $Y$ by estimating the best linear predictor: the linear function that is closest to the true regression function $f$.
• The parameter estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are obtained by minimizing the residual sum of squares

$$\mathrm{RSS}(\hat{\beta}) = \sum_{i=1}^{n} \Big( y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij} \Big)^2$$

• Once we have our parameter estimates, we can predict $y$ at a new value of $x = (x_1, x_2, \ldots, x_p)$ with:

$$\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j$$
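Gradient descent is one way to minimize the RSS; for linear regression the minimizer also has a closed form, which NumPy's least-squares solver computes directly. A minimal sketch (fit_ols and predict are hypothetical helper names of ours):

```python
import numpy as np

def fit_ols(X, y):
    """Minimize RSS over an (n, p) predictor matrix X and response vector y."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s for beta0
    beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta_hat                              # [beta0_hat, beta1_hat, ..., betap_hat]

def predict(X, beta_hat):
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ beta_hat
```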
Linear regression is easily* interpretable
(*as long as the number of predictors is small)
• In the Advertising data, our model is

$$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$$

• The coefficient $\beta_1$ tells us the expected change in sales per unit change of the TV budget, with all other predictors held fixed.
• Using the ols function in Python, we get:

             Coefficient   Std. Error   t-statistic   p-value
Intercept    2.939         0.3119        9.42         < 0.0001
TV           0.046         0.0014       32.81         < 0.0001
radio        0.189         0.0086       21.89         < 0.0001
newspaper   -0.001         0.0059       -0.18           0.8599

• So, holding the other budgets fixed, for every $1000 spent on TV advertising, sales on average increase by 1000 × 0.046 = 46 units sold.²

² sales is recorded in 1000's of units sold
* Hypothesis tests on the coefficients: see page 67 of ISLR.
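One way to reproduce a table like this is the statsmodels formula API. This sketch assumes the Advertising.csv file from the ISLR website sits in the working directory, with columns TV, radio, newspaper, and sales.

```python
import pandas as pd
import statsmodels.formula.api as smf

advertising = pd.read_csv("Advertising.csv")

model = smf.ols("sales ~ TV + radio + newspaper", data=advertising).fit()
print(model.summary())   # coefficients, std. errors, t-statistics, p-values
```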
The perils of over-interpreting regression coefficients
• A regression coefficient $\beta_j$ estimates the expected change in $Y$ per unit change in $X_j$, assuming all other predictors are held fixed.
• But predictors typically change together!
• Example: a firm might not be able to increase the TV ad budget without reallocating funds from the newspaper or radio budgets.
• Example:³ $Y$ = total amount of money in your pocket; $X_1$ = number of coins; $X_2$ = number of pennies, nickels and dimes.
◦ By itself, a regression of $Y \sim \beta_0 + \beta_2 X_2$ would have $\hat{\beta}_2 > 0$. But what happens if we add $X_1$ to the model?

³ Data Analysis and Regression, Mosteller and Tukey, 1977
In the words of a famous statistician…

"Essentially, all models are wrong, but some are useful." — George Box

• As an analyst, you can make your models more useful by
1. Making sure you're solving useful problems
2. Carefully interpreting your models in meaningful, practical terms
• So that just leaves one question…

How can we make our models less wrong?
Making linear regression great (again)
• Linear regression imposes two key restrictions on the model: we assume the relationship between the response $Y$ and the predictors $X_1, \ldots, X_p$ is:
1. Linear
2. Additive
• The truth is almost never linear; but often the linearity and additivity assumptions are good enough.
• When we think linearity might not hold, we can try…
◦ Polynomials
◦ Step functions
◦ Splines
◦ Local regression
◦ Generalized additive models
• When we think the additivity assumption doesn't hold, we can incorporate interaction terms (see the sketch below).
• These variants offer increased flexibility, while retaining much of the ease and interpretability of ordinary linear regression.
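For instance, with the statsmodels formula API from the earlier sketch, an interaction term is one extra symbol in the formula (again assuming the Advertising data is available as a CSV file).

```python
import pandas as pd
import statsmodels.formula.api as smf

advertising = pd.read_csv("Advertising.csv")

# 'TV * radio' expands to TV + radio + TV:radio, i.e. both main effects plus their interaction
model = smf.ols("sales ~ TV * radio", data=advertising).fit()
print(model.params)
```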
Acknowledgements
Some of the lecture notes for this class feature content borrowed with or without modification from the following sources:
• 95-791 Data Mining, Carnegie Mellon University, lecture notes (Prof. Alexandra Chouldechova)
• An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
• Machine learning online course by Andrew Ng