This document presents a lecture on linear regression and gradient descent. It reviews cost functions and the intuition behind gradient descent, derives gradient descent for linear regression, implements it in code, and covers multiple linear regression and how to interpret the resulting models.
Machine Learning Programming Guide to Gradient Descent
1. Machine Learning Programming
BDA712-00
Lecturer: Josué Obregón PhD
Kyung Hee University
Department of Big Data Analytics
September 28, 2022
Linear Regression II:
Gradient Descent and Multiple Linear Regression
2. Previously, in our course…
• Your first learning program: building a tiny supervised learning program
• Hyperspace! Multiple linear regression
• Getting real: recognize a single digit using MNIST
• A discerning machine: from regression to classification
• Walking the gradient: gradient descent algorithm
3. And today…
4. Today's agenda
• What’s wrong with our current train() function?
• Gradient descent
• Multiple linear regression implementation
• Interpreting a linear regression model
5. What's wrong with our current train() function?
• We are learning just one parameter on each iteration
• How can we learn both parameters at the same time?
  ◦ Find all possible combinations: $3^n$ of them, with $n$ = number of parameters
  ◦ We call loss() on every combination!!
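For reference, a minimal sketch of the kind of train() loop being criticized here, assuming a simple one-feature linear model and a mean-squared-error loss(); the names predict, loss, train, and lr are illustrative, not the course's actual code:

```python
import numpy as np

def predict(X, w, b):
    # simple linear model: y_hat = X * w + b
    return X * w + b

def loss(X, Y, w, b):
    # mean squared error between predictions and ground truth
    return np.average((predict(X, w, b) - Y) ** 2)

def train(X, Y, iterations, lr):
    # naive search: nudge ONE parameter at a time and keep the change
    # only if it lowers the loss -- this is what the slide finds wasteful
    w = b = 0.0
    for _ in range(iterations):
        current = loss(X, Y, w, b)
        if loss(X, Y, w + lr, b) < current:
            w += lr
        elif loss(X, Y, w - lr, b) < current:
            w -= lr
        elif loss(X, Y, w, b + lr) < current:
            b += lr
        elif loss(X, Y, w, b - lr) < current:
            b -= lr
        else:
            return w, b  # no single-parameter step improves the loss
    return w, b
```

Each iteration adjusts only one parameter, and testing every combination of adjustments instead would require calling loss() an exponential number of times as parameters are added.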
6. Enter Gradient Descent
• Brief review of the intuition of our loss/cost function
• Intuition behind Gradient Descent
• Gradient Descent for linear regression
• Implement Gradient Descent in our code
7. Cost function
Training Set

Size in feet² ($X$)    Price ($) in 1000's ($y$)
2104                   460
1416                   232
1534                   315
852                    178
…                      …

Function: $y(x) = \beta_0 + \beta_1 x_1$
$\beta$'s: parameters ($\beta_0 = b$, $\beta_1 = w$)
How do we choose the $\beta$'s?
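To make the question concrete with a hypothetical (not from the slide) choice of parameters: with $\beta_0 = 0$ and $\beta_1 = 0.2$, the first training example would be predicted as $y(2104) = 0 + 0.2 \times 2104 = 420.8$, i.e. about $420,800, versus the observed $460,000. The cost function on the following slides measures how far off such a choice is across all the training examples.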
9. Cost function
[Figure: training data plotted as y vs. x, with a candidate regression line.]

Idea: choose $\beta_0, \beta_1$ so that our function $y(x)$ is close to $y$ for the training examples $(x, y)$.

Mean squared error (the residual sum of squares, RSS, divided by $n$):

$L(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_i - y_i \right)^2$

Goal: $\underset{\beta_0, \beta_1}{\operatorname{argmin}} \; L(\beta_0, \beta_1)$

(Recall $\beta_0 = b$ and $\beta_1 = w$.)
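Continuing the hypothetical choice $\beta_0 = 0$, $\beta_1 = 0.2$ from the previous slide, the cost over the four listed examples works out to

$L(0, 0.2) = \tfrac{1}{4}\left[(420.8 - 460)^2 + (283.2 - 232)^2 + (306.8 - 315)^2 + (170.4 - 178)^2\right] = \tfrac{4283.08}{4} \approx 1070.8$

Gradient descent, introduced next, drives this number down by adjusting $\beta_0$ and $\beta_1$ together.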
10. Cost function (with $\beta_0 = 0$)
[Figure: left panel plots the data and the line $y(x)$ for a fixed $\beta_1$ (a function of $x$), drawn for $\beta_1 = 2$, $\beta_1 = 1$ and $\beta_1 = 0.5$; right panel plots the corresponding cost $L(\beta_1)$ as a function of the parameter $\beta_1$.]
11. Cost function
[Figure: left panel plots $y(x)$ for fixed $\beta_0, \beta_1$ (a function of $x$); right panel plots the cost surface $L(\beta_0, \beta_1)$ as a function of the parameters $\beta_0$ and $\beta_1$.]
12. Gradient Descent
Repeat until convergence {
    $\beta_j := \beta_j - \alpha \, \dfrac{\partial}{\partial \beta_j} L(\beta_0, \beta_1)$
}

Here $L(\beta_0, \beta_1)$ is the loss/cost function, $\dfrac{\partial}{\partial \beta_j} L$ is its derivative/gradient, and $\alpha$ is the learning rate.
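The slide leaves the derivative term generic; for the mean-squared-error cost defined earlier, the partial derivatives work out to

$\dfrac{\partial L}{\partial \beta_0} = \dfrac{2}{n}\sum_{i=1}^{n}(\beta_0 + \beta_1 x_i - y_i), \qquad \dfrac{\partial L}{\partial \beta_1} = \dfrac{2}{n}\sum_{i=1}^{n}(\beta_0 + \beta_1 x_i - y_i)\,x_i$

A minimal NumPy sketch of the resulting training loop, reusing the illustrative names from the earlier sketch (assumptions, not the course's actual code):

```python
import numpy as np

def gradient(X, Y, w, b):
    # partial derivatives of the mean squared error with respect to w (beta_1) and b (beta_0)
    error = (X * w + b) - Y
    w_gradient = 2 * np.average(error * X)
    b_gradient = 2 * np.average(error)
    return w_gradient, b_gradient

def train_gd(X, Y, iterations, lr):
    # gradient descent: update BOTH parameters on every iteration
    w = b = 0.0
    for _ in range(iterations):
        w_gradient, b_gradient = gradient(X, Y, w, b)
        w -= lr * w_gradient
        b -= lr * b_gradient
    return w, b
```

Unlike the search-based train(), each iteration costs a single gradient evaluation regardless of how many parameters the model has.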
14. Gradient Descent
If $\alpha$ is too small, gradient descent can be slow.
If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

$\beta_1 := \beta_1 - \alpha \, \dfrac{\partial}{\partial \beta_1} L(\beta_1)$
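A tiny self-contained illustration (not from the lecture) of both behaviours, using the one-parameter cost $L(\beta_1) = (\beta_1 - 3)^2$:

```python
def descend(lr, steps=10, b1=0.0):
    # gradient descent on L(b1) = (b1 - 3)^2, whose derivative is 2 * (b1 - 3)
    for _ in range(steps):
        b1 = b1 - lr * 2 * (b1 - 3)
    return b1

print(descend(lr=0.1))  # creeps toward the minimum at b1 = 3 (about 2.68 after 10 steps)
print(descend(lr=1.1))  # overshoots further on every step and diverges
```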
16. Lab session 04
• Link: https://classroom.github.com/a/Tg1rlOGQ
17. Let’s go back to some theory…
18. Linear regression
• Linear regression is a supervised learning approach that models the dependence of $Y$ on the covariates $X_1, X_2, \ldots, X_p$ as being linear:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon = \underbrace{\beta_0 + \sum_{j=1}^{p} \beta_j X_j}_{f_L(X)} + \epsilon$

where $\epsilon$ is the error term.
• The true regression function $E(Y \mid X = x)$ might not be linear (it almost never is).
• Linear regression aims to estimate $f_L(X)$: the best linear approximation to the true regression function.
• With a single predictor, $Y = \beta_0 + \beta_1 X_1 + \epsilon$, this is simple linear regression; with multiple predictors it is multiple linear regression. (As before, $\beta_0 = b$ and $\beta_1 = w$.)
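A compact way to write the same model for all $n$ training examples at once (standard matrix notation, not shown on the slide): stack the responses into $y \in \mathbb{R}^n$ and the predictors into a design matrix $X \in \mathbb{R}^{n \times (p+1)}$ whose first column is all ones, so that $y = X\beta + \epsilon$ with $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$. This is the form used in the implementation sketch later in this lecture.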
19. Best linear approximation
[Figure: scatter of y vs. x showing the true regression function $f(x)$, the linear regression estimate $\hat{f}(x)$, and another linear regression estimate $\hat{f}(x)$.]
20. Linear regression
• Here’s the linear regression model again:

$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \epsilon$

• The $\beta_j$, $j = 0, \ldots, p$, are called model coefficients or parameters.
• Given estimates $\hat{\beta}_j$ of the model coefficients, we can predict the response at a value $x = (x_1, \ldots, x_p)$ via

$\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j$

• The hat symbol denotes values estimated from the data.
21. Linear regression estimates in 1-dimension
[Figure 3.1 from ISLR: Sales vs. TV.] The blue line shows the best fit for the regression of Sales onto TV. Lines from the observed points to the regression line illustrate the residuals. For any other choice of slope or intercept, the sum of squared vertical distances between that line and the observed data would be larger than for the line shown here.
22. Linear regression estimates in 2-dimensions
[Figure 3.4 from ISLR: regression plane of Y onto X1 and X2.] The two-dimensional plane is the best fit of $Y$ onto the predictors $X_1$ and $X_2$. If you tilt this plane in any way, you get a larger sum of squared vertical distances between the plane and the observed data.
23. Linear Regression
• Linear regression aims to predict the response $Y$ by estimating the best linear predictor: the linear function that is closest to the true regression function $f$.
• The parameter estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are obtained by minimizing the residual sum of squares

$\mathrm{RSS}(\hat{\beta}) = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij} \right)^2$

• Once we have our parameter estimates, we can predict $y$ at a new value of $x = (x_1, x_2, \ldots, x_p)$ with:

$\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j$
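Matching the "Multiple linear regression implementation" item on today's agenda, here is a hedged NumPy sketch that minimizes this RSS (scaled by $1/n$) with gradient descent; the function names and the commented example data are illustrative only:

```python
import numpy as np

def predict(X, beta):
    # X includes a leading column of ones for the intercept beta_0
    return X @ beta

def rss_gradient(X, y, beta):
    # gradient of the residual sum of squares with respect to beta
    return 2 * X.T @ (predict(X, beta) - y)

def fit(X, y, iterations=100_000, lr=1e-3):
    # gradient descent on the mean squared error (RSS divided by n)
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        beta -= lr * rss_gradient(X, y, beta) / len(y)
    return beta

# Example with hypothetical data (features rescaled so a fixed lr behaves):
# X_raw = np.array([[2104., 3.], [1416., 2.], [1534., 3.], [852., 2.]])
# y = np.array([460., 232., 315., 178.])
# X = np.column_stack([np.ones(len(y)), X_raw / X_raw.max(axis=0)])
# beta_hat = fit(X, y)
```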
24. Linear regression is easily∗ interpretable
(∗ as long as the number of predictors is small)
• In the Advertising data, our model is

sales = $\beta_0$ + $\beta_1$ × TV + $\beta_2$ × radio + $\beta_3$ × newspaper + $\epsilon$

• The coefficient $\beta_1$ tells us the expected change in sales per unit change of the TV budget, with all other predictors held fixed.
• Using the ols function in Python, we get:

            Coefficient   Std. Error   t-statistic   p-value
Intercept   2.939         0.3119        9.42         < 0.0001
TV          0.046         0.0014       32.81         < 0.0001
radio       0.189         0.0086       21.89         < 0.0001
newspaper   -0.001        0.0059       -0.18           0.8599

• So, holding the other budgets fixed, for every $1000 spent on TV advertising, sales on average increase by 1000 × 0.046 = 46 units sold (sales are recorded in 1000's of units sold).
∗ Hypothesis tests on the coefficients: page 67.
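The slide does not show the code behind this table; one common way to reproduce it, assuming the ISLR Advertising data are available locally as Advertising.csv with columns TV, radio, newspaper and sales, is statsmodels' formula interface (a sketch, not necessarily the exact ols call used in class):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Advertising.csv: the ISLR advertising data set (columns: TV, radio, newspaper, sales)
advertising = pd.read_csv("Advertising.csv")

model = smf.ols("sales ~ TV + radio + newspaper", data=advertising).fit()
print(model.summary())  # coefficients, standard errors, t-statistics, p-values
```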
25. The perils of over-interpreting regression coefficients
• A regression coefficient $\beta_j$ estimates the expected change in $Y$ per unit change in $X_j$, assuming all other predictors are held fixed.
• But predictors typically change together!
• Example: A firm might not be able to increase the TV ad budget without reallocating funds from the newspaper or radio budgets.
• Example:³ $Y$ = total amount of money in your pocket; $X_1$ = # of coins; $X_2$ = # of pennies, nickels and dimes.
  ◦ By itself, a regression of $Y \sim \beta_0 + \beta_2 X_2$ would have $\hat{\beta}_2 > 0$. But how about if we add $X_1$ to the model? (See the sketch below.)
³ Data Analysis and Regression, Mosteller and Tukey, 1977
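A small simulation can make the sign flip concrete. This is a hypothetical sketch with made-up data, only meant to mirror the pocket-money example (quarters worth $0.25, the small coins worth about $0.05 each here):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1 = rng.poisson(10, n)        # X1: total number of coins in your pocket
x2 = rng.binomial(x1, 0.6)     # X2: how many of them are pennies, nickels and dimes
# Y: total money; the non-small coins are quarters, so small coins are worth less
y = 0.25 * (x1 - x2) + 0.05 * x2 + rng.normal(0, 0.1, n)

# Y ~ X2 alone: more small coins usually means more coins overall, so the slope is positive
print(sm.OLS(y, sm.add_constant(x2)).fit().params)

# Y ~ X1 + X2: holding the total number of coins fixed, extra small coins mean less money,
# so the coefficient on X2 turns negative
X = sm.add_constant(np.column_stack([x1, x2]))
print(sm.OLS(y, X).fit().params)
```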
26. In the words of a famous statistician…
“Essentially, all models are wrong, but some are useful.”
—George Box
• As an analyst, you can make your models more useful by
  1. Making sure you're solving useful problems
  2. Carefully interpreting your models in meaningful, practical terms
• So that just leaves one question…
How can we make our models less wrong?
27. Making linear regression great (again)
• Linear regression imposes two key restrictions on the model: we assume the relationship between the response $Y$ and the predictors $X_1, \ldots, X_p$ is:
  1. Linear
  2. Additive
• The truth is almost never linear; but often the linearity and additivity assumptions are good enough.
• When we think linearity might not hold, we can try…
  ◦ Polynomials
  ◦ Step functions
  ◦ Splines
  ◦ Local regression
  ◦ Generalized additive models
• When we think the additivity assumption doesn't hold, we can incorporate interaction terms (see the sketch below).
• These variants offer increased flexibility, while retaining much of the ease and interpretability of ordinary linear regression.
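As a sketch of two of these fixes, again using statsmodels formulas and the hypothetical Advertising.csv from the earlier example:

```python
import pandas as pd
import statsmodels.formula.api as smf

advertising = pd.read_csv("Advertising.csv")

# Interaction term: lets the effect of TV depend on the radio budget (relaxes additivity)
interaction = smf.ols("sales ~ TV + radio + TV:radio", data=advertising).fit()

# Polynomial term: a simple way to relax the linearity assumption for TV
quadratic = smf.ols("sales ~ TV + I(TV ** 2)", data=advertising).fit()

print(interaction.params)
print(quadratic.params)
```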
28. Acknowledgements
Some of the lecture notes for this class feature content borrowed, with or without modification, from the following sources:
• 95-791 Data Mining, Carnegie Mellon University, lecture notes (Prof. Alexandra Chouldechova)
• An Introduction to Statistical Learning, with Applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
• Machine learning online course by Andrew Ng