3. Linear Methods for Regression
Contents
• Least Squares Regression
• QR Decomposition for Multiple Regression
• Subset Selection
• Coefficient Shrinkage
1. Introduction
• Outline
• The simple linear regression model
• Multiple linear regression
• Model selection and shrinkage—the state of
the art
Regression

[Scatter plot of sample data: Y versus X]

How can we model the generative process for this data?
Linear Assumption
• A linear model assumes the regression function E(Y | X) is reasonably well approximated as linear, i.e.

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, \qquad X = (X_1, X_2, \ldots, X_p)$$

• The regression function f(x) = E(Y | X = x) is the minimizer of expected squared prediction error
• Making this assumption gives high bias, but low variance
Least Squares Regression
• Estimate the parameters β based on a set of training data: (x₁, y₁), …, (x_N, y_N)
• Minimize the residual sum of squares:

$$RSS(\beta) = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$

This is a reasonable criterion when:
• the training samples are random, independent draws
• OR, the y_i's are conditionally independent given the x_i
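To make the criterion concrete, here is a minimal numpy sketch (the function name and layout are ours, not from the slides; X is assumed to carry a leading column of ones as defined on the next slide):

```python
import numpy as np

def rss(beta, X, y):
    """Residual sum of squares RSS(beta) = sum_i (y_i - x_i^T beta)^2.

    X is N x (p+1) with a leading column of ones; beta has length p+1."""
    residuals = y - X @ beta
    return residuals @ residuals
```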
Matrix Notation
• X is the N × (p+1) matrix of input vectors (each row begins with a 1)
• y is the N-vector of outputs (labels)
• β is the (p+1)-vector of parameters

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$$
Perfectly Linear Data
• When the data is exactly linear, there exists β s.t.

$$y = X\beta$$

(the linear regression model in matrix form)
• Usually the data is not an exact fit, so…
Finding the Best Fit?

[Scatter plot: data generated from Y = 1.5X + 0.35 + N(0, 1.2)]
Minimize the RSS
• We can rewrite the RSS in matrix form:

$$RSS(\beta) = (y - X\beta)^T (y - X\beta)$$

• Getting a least squares fit involves minimizing the RSS
• Solve for the parameters at which the first derivative of the RSS is zero
Solving Least Squares
• Derivative of a quadratic product:

$$\frac{d}{dx}\left[(Ax+b)^T C (Dx+e)\right] = A^T C (Dx+e) + D^T C^T (Ax+b)$$

• Then, writing $RSS(\beta) = (y - X\beta)^T I_N (y - X\beta)$ and applying the rule with $A = D = -X$, $C = I_N$, $b = e = y$:

$$\frac{\partial\,RSS}{\partial\beta} = -2X^T(y - X\beta)$$

• Setting the first derivative to zero:

$$X^T X\,\beta = X^T y$$
Least Squares Solution
• Least squares coefficients: $\hat{\beta} = (X^T X)^{-1} X^T y$
• Least squares predictions: $\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y$
• Estimated variance: $\hat{\sigma}^2 = \dfrac{RSS(\hat{\beta})}{N - p - 1} = \dfrac{1}{N - p - 1}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$
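A minimal numpy sketch of these three formulas (the function name is ours; solving the normal equations directly mirrors the slide, though the QR route shown later is numerically preferable):

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: coefficients, predictions, variance estimate."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X) beta = X^T y
    y_hat = X @ beta_hat
    N, k = X.shape                                   # k = p + 1
    sigma2_hat = np.sum((y - y_hat) ** 2) / (N - k)  # divide by N - p - 1
    return beta_hat, y_hat, sigma2_hat
```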
The N-dimensional Geometry of Least
Squares Regression
Statistics of Least Squares
• We can draw inferences about the parameters β by assuming the true model is linear with noise, i.e.

$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

• Then,

$$\hat{\beta} \sim N\!\left(\beta,\ (X^T X)^{-1}\sigma^2\right), \qquad (N - p - 1)\,\hat{\sigma}^2 \sim \sigma^2\,\chi^2_{N-p-1}$$
Significance of One Parameter
• Can we eliminate one parameter, $X_j$ ($j \neq 0$)?
• Look at the standardized coefficient

$$z_j = \frac{\hat{\beta}_j}{\sqrt{v_j}\,\hat{\sigma}} \sim t_{N-p-1}$$

where $v_j$ is the $j$th diagonal element of $(X^T X)^{-1}$.
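A sketch of the z-score computation assembled from the formulas above (the helper name is ours):

```python
import numpy as np
from scipy import stats

def z_scores(X, y):
    """Standardized coefficients z_j = beta_hat_j / (sqrt(v_j) * sigma_hat)."""
    N, k = X.shape                                   # k = p + 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - k))
    v = np.diag(np.linalg.inv(X.T @ X))              # v_j from (X^T X)^{-1}
    z = beta_hat / (np.sqrt(v) * sigma_hat)
    p_values = 2 * stats.t.sf(np.abs(z), df=N - k)   # two-sided t_{N-p-1} test
    return z, p_values
```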
Significance of Many Parameters
• We may want to test many features at once
• Compare model M₁ with p₁+1 parameters to a nested model M₀ formed from p₀+1 of M₁'s parameters (p₀ < p₁)
• Use the F statistic:

$$F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(N - p_1 - 1)} \sim F_{p_1 - p_0,\ N - p_1 - 1}$$
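The F statistic is a one-liner once the two fitted models' RSS values are in hand; a minimal sketch (argument names are ours):

```python
from scipy import stats

def f_test(rss0, rss1, p0, p1, N):
    """F statistic comparing nested models M0 (p0+1 params) and M1 (p1+1 params)."""
    F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
    p_value = stats.f.sf(F, p1 - p0, N - p1 - 1)  # upper tail of the F distribution
    return F, p_value
```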
Confidence Interval for Beta
• We can find a confidence interval for β_j
• Confidence interval for a single parameter (a 1−2α confidence interval for β_j):

$$\left(\hat{\beta}_j - z^{(1-\alpha)}\sqrt{v_j}\,\hat{\sigma},\ \ \hat{\beta}_j + z^{(1-\alpha)}\sqrt{v_j}\,\hat{\sigma}\right)$$

• Confidence set for the entire parameter vector (bounds on β):

$$C_\beta = \left\{\beta : (\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le \hat{\sigma}^2\,{\chi^2_{p+1}}^{(1-2\alpha)}\right\}$$
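A sketch of the single-coefficient interval (we use the t quantile in place of the slide's $z^{(1-\alpha)}$; for large N the normal quantile is nearly identical, and the function name is ours):

```python
import numpy as np
from scipy import stats

def beta_interval(beta_j, v_j, sigma_hat, df, alpha=0.025):
    """1 - 2*alpha confidence interval for one coefficient beta_j."""
    z = stats.t.ppf(1 - alpha, df)         # df = N - p - 1
    half_width = z * np.sqrt(v_j) * sigma_hat
    return beta_j - half_width, beta_j + half_width
```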
2.1: Prostate Cancer (Example)
• Data
  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: age
  • lbph: log of benign prostatic hyperplasia amount
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason scores 4 or 5
Technique for Multiple Regression
• Computing $\hat{\beta} = (X^T X)^{-1} X^T y$ directly has poor numerical properties
• QR decomposition of X
• Decompose X = QR, where
  • Q is an N × (p+1) matrix with orthonormal columns ($Q^T Q = I_{p+1}$)
  • R is a (p+1) × (p+1) upper triangular matrix
• Then

$$\hat{\beta} = (X^T X)^{-1} X^T y = (R^T Q^T Q R)^{-1} R^T Q^T y = (R^T R)^{-1} R^T Q^T y = R^{-1} Q^T y$$

$$\hat{y} = X\hat{\beta} = Q Q^T y$$

• The columns of X are built up from the columns of Q:

$$x_1 = r_{11} q_1, \qquad x_2 = r_{12} q_1 + r_{22} q_2, \qquad x_3 = r_{13} q_1 + r_{23} q_2 + r_{33} q_3, \quad \ldots$$
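A minimal sketch of the QR route (solving $R\beta = Q^T y$ by back-substitution; the function name is ours):

```python
import numpy as np
from scipy.linalg import solve_triangular

def fit_via_qr(X, y):
    """Least squares via the thin QR decomposition of X."""
    Q, R = np.linalg.qr(X)                   # Q: N x (p+1), R: (p+1) x (p+1)
    beta_hat = solve_triangular(R, Q.T @ y)  # back-substitute upper-triangular R
    y_hat = Q @ (Q.T @ y)                    # fitted values, y_hat = Q Q^T y
    return beta_hat, y_hat
```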
Gram-Schmidt Procedure
1) Initialize z₀ = x₀ = 1
2) For j = 1 to p:
   For k = 0 to j−1, regress x_j on the z_k's, giving the univariate least squares estimates

   $$\hat{\gamma}_{kj} = \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle}$$

   Then compute the next residual

   $$z_j = x_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj}\, z_k$$

3) Let Z = [z₀ z₁ … z_p] and let Γ be upper triangular with entries γ̂_kj; then
   X = ZΓ = ZD⁻¹DΓ = QR,
   where D is diagonal with D_jj = ||z_j||
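A direct transcription of this procedure in numpy (a teaching sketch that assumes X has full column rank; classical Gram-Schmidt like this is not numerically robust for ill-conditioned X):

```python
import numpy as np

def gram_schmidt_qr(X):
    """Successive orthogonalization: returns Q, R with X = QR."""
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    Z = np.zeros((N, k))
    Gamma = np.eye(k)                      # upper triangular, unit diagonal
    for j in range(k):
        z = X[:, j].copy()
        for m in range(j):
            # gamma_mj = <z_m, x_j> / <z_m, z_m>
            Gamma[m, j] = (Z[:, m] @ X[:, j]) / (Z[:, m] @ Z[:, m])
            z -= Gamma[m, j] * Z[:, m]     # next residual z_j
        Z[:, j] = z
    norms = np.linalg.norm(Z, axis=0)      # D_jj = ||z_j||
    Q = Z / norms                          # Q = Z D^{-1}
    R = np.diag(norms) @ Gamma             # R = D Gamma
    return Q, R
```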
Subset Selection
• We want to eliminate unnecessary features
• Best subset regression
  • Choose the subset of size k with the lowest RSS
  • The Leaps and Bounds procedure works for p up to about 40
• Forward stepwise selection
  • Repeatedly add the feature with the largest F-ratio
• Backward stepwise selection
  • Repeatedly remove the feature with the smallest F-ratio
These are greedy techniques – not guaranteed to find the best model. The F-ratio for adding or dropping a single feature is

$$F = \frac{RSS_0 - RSS_1}{RSS_1/(N - p_1 - 1)} \sim F_{1,\ N - p_1 - 1}$$
Coefficient Shrinkage
• Use additional penalties to shrink the coefficients
• Ridge regression
  • Minimize least squares subject to $\sum_{j=1}^{p} \beta_j^2 \le s$
• The lasso
  • Minimize least squares subject to $\sum_{j=1}^{p} |\beta_j| \le s$
• Principal components regression
  • Regress on M < p principal components of X
• Partial least squares
  • Regress on M < p directions of X weighted by y
4.2 Prostate Cancer Data Example (Continued)
Error Comparison
Shrinkage Methods (Ridge Regression)
• Minimize RSS(β) + λβᵀβ
  • Use centered data, so β₀ is not penalized:

$$\hat{\beta}_0 = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad x_{ij} \leftarrow x_{ij} - \frac{1}{N}\sum_{i=1}^{N} x_{ij}$$

  • The x_j are of length p, no longer including the initial 1
• The ridge estimates are:

$$RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\,\beta^T\beta$$

$$\frac{\partial\,RSS(\lambda)}{\partial\beta} = -2X^T(y - X\beta) + 2\lambda\beta = 0 \quad\Rightarrow\quad \hat{\beta}^{\,ridge} = (X^T X + \lambda I_p)^{-1} X^T y$$
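The ridge estimate is the same normal-equations solve with λI added; a minimal sketch on centered data (the function name is ours):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: center the data, then solve (Xc^T Xc + lam I) beta = Xc^T yc."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean        # beta_0 is not penalized
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta         # intercept on the original scale
    return beta0, beta
```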
Shrinkage Methods (Ridge Regression)
The Lasso
• Use centered data, as before
• The L1 penalty makes the solutions nonlinear in the y_i
  • Quadratic programming is used to compute them

$$RSS(\beta) = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$
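In practice one usually solves the equivalent Lagrangian (penalized) form rather than the constrained form above; for example, scikit-learn's Lasso minimizes ||y − Xβ||²/(2N) + α Σ|β_j|, where larger α corresponds to a smaller bound s. A hedged usage sketch (X and y are assumed to be centered arrays defined elsewhere):

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)   # alpha is illustrative; choose it by cross-validation
lasso.fit(X, y)            # X, y: centered data, as above
print(lasso.coef_)         # some coefficients are driven exactly to zero
```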
Shrinkage Methods (Lasso Regression)
Principal Components Regression
• Singular value decomposition (SVD) of X:

$$X = U D V^T$$

  • U is N × p, V is p × p; both are orthogonal
  • D is a p × p diagonal matrix
• Use linear combinations z_j = Xv_j of X as new features
  • v_j is the principal component (column of V) corresponding to the jth largest element of D
  • The v_j are the directions of maximal sample variance
  • Use only M < p features; [z₁ … z_M] replaces X

$$z_j = X v_j,\ \ j = 1, \ldots, M; \qquad \hat{y}^{\,pcr} = \bar{y}\,\mathbf{1} + \sum_{m=1}^{M} \hat{\theta}_m z_m, \qquad \hat{\theta}_m = \frac{\langle z_m, y \rangle}{\langle z_m, z_m \rangle}$$
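A minimal numpy sketch of PCR via the SVD (the function name is ours; data are assumed centered, so the ȳ term is approximately zero):

```python
import numpy as np

def pcr_fit(Xc, yc, M):
    """Principal components regression keeping the top M directions."""
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U D V^T
    Z = Xc @ Vt[:M].T                                  # z_m = X v_m, m = 1..M
    theta = (Z.T @ yc) / np.sum(Z * Z, axis=0)         # <z_m, y> / <z_m, z_m>
    y_hat = yc.mean() + Z @ theta                      # y_bar term (~0 if centered)
    return theta, y_hat
```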
Partial Least Squares
• Construct linear combinations of the inputs, incorporating y
• Finds directions that have both high variance and high correlation with the output
• The variance aspect tends to dominate, so partial least squares often operates like principal components regression
4.4 Methods Using Derived Input Directions
(PLS)
• Partial Least Squares
4.5 Discussion: a comparison of the selection and shrinkage methods
A Unifying View
• We can view all of these linear regression techniques under a common framework:

$$\hat{\beta} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p} |\beta_j|^q\right\}$$

• β includes the bias term β₀; q indicates a prior distribution on β
  • λ = 0: least squares
  • λ > 0, q = 0: subset selection (counts the number of nonzero parameters)
  • λ > 0, q = 1: the lasso
  • λ > 0, q = 2: ridge regression
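A sketch of the unified criterion as a function one could evaluate (the names are ours; note that only q ≥ 1 gives a convex problem, and q = 0 reduces to counting nonzero coefficients):

```python
import numpy as np

def penalized_rss(beta0, beta, X, y, lam, q):
    """RSS plus lambda * sum_j |beta_j|^q: q=0 subset size, q=1 lasso, q=2 ridge."""
    resid = y - beta0 - X @ beta
    if q == 0:
        penalty = np.count_nonzero(beta)   # number of nonzero parameters
    else:
        penalty = np.sum(np.abs(beta) ** q)
    return resid @ resid + lam * penalty
```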
Discussion: a comparison of the selection and shrinkage methods
• Family of Shrinkage Regression