3. Linear Methods for Regression
Contents
• Least Squares Regression
• QR Decomposition for Multiple Regression
• Subset Selection
• Coefficient Shrinkage
1. Introduction
• Outline
• The simple linear regression model
• Multiple linear regression
• Model selection and shrinkage—the state of
the art
Regression

[Scatter plot of sample data: Y versus X]

How can we model the generative process for this data?
Linear Assumption
• A linear model assumes the regression function E(Y | X) is reasonably well approximated as linear, i.e.

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, \qquad X = (X_1, X_2, \ldots, X_p)$$

• The regression function f(x) = E(Y | X = x) is the minimizer of expected squared prediction error
• Making this assumption gives high bias, but low variance
Least Squares Regression
• Estimate the parameters β based on a set of training data: (x₁, y₁), …, (x_N, y_N)
• Minimize the residual sum of squares:

$$RSS(\beta) = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$

This is a reasonable criterion when:
• the training samples are random, independent draws
• OR, the y_i's are conditionally independent given the x_i
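To make the criterion concrete, here is a minimal numpy sketch (the function name and layout are ours, not from the slides; X is assumed to carry a leading column of ones as defined on the next slide):

```python
import numpy as np

def rss(beta, X, y):
    """Residual sum of squares RSS(beta) = sum_i (y_i - x_i^T beta)^2.

    X is N x (p+1) with a leading column of ones; beta has length p+1."""
    residuals = y - X @ beta
    return residuals @ residuals
```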
Matrix Notation
• X is the N × (p+1) matrix of input vectors (each row begins with a 1)
• y is the N-vector of outputs (labels)
• β is the (p+1)-vector of parameters

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$$
Perfectly Linear Data
• When the data is exactly linear, there exists β s.t.

$$y = X\beta$$

(the linear regression model in matrix form)
• Usually the data is not an exact fit, so…
Finding the Best Fit?

[Scatter plot: data generated from Y = 1.5X + 0.35 + N(0, 1.2)]
Minimize the RSS
• We can rewrite the RSS in matrix form:

$$RSS(\beta) = (y - X\beta)^T (y - X\beta)$$

• Getting a least squares fit involves minimizing the RSS
• Solve for the parameters at which the first derivative of the RSS is zero
Solving Least Squares
• Derivative of a quadratic product:

$$\frac{d}{dx}\left[(Ax+b)^T C (Dx+e)\right] = A^T C (Dx+e) + D^T C^T (Ax+b)$$

• Then, writing $RSS(\beta) = (y - X\beta)^T I_N (y - X\beta)$ and applying the rule with $A = D = -X$, $C = I_N$, $b = e = y$:

$$\frac{\partial\,RSS}{\partial\beta} = -2X^T(y - X\beta)$$

• Setting the first derivative to zero:

$$X^T X\,\beta = X^T y$$
Least Squares Solution
• Least squares coefficients: $\hat{\beta} = (X^T X)^{-1} X^T y$
• Least squares predictions: $\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y$
• Estimated variance: $\hat{\sigma}^2 = \dfrac{RSS(\hat{\beta})}{N - p - 1} = \dfrac{1}{N - p - 1}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$
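A minimal numpy sketch of these three formulas (the function name is ours; solving the normal equations directly mirrors the slide, though the QR route shown later is numerically preferable):

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: coefficients, predictions, variance estimate."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X) beta = X^T y
    y_hat = X @ beta_hat
    N, k = X.shape                                   # k = p + 1
    sigma2_hat = np.sum((y - y_hat) ** 2) / (N - k)  # divide by N - p - 1
    return beta_hat, y_hat, sigma2_hat
```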
The N-dimensional Geometry of Least
Squares Regression
Statistics of Least Squares
• We can draw inferences about the parameters β by assuming the true model is linear with noise, i.e.

$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

• Then,

$$\hat{\beta} \sim N\!\left(\beta,\ (X^T X)^{-1}\sigma^2\right), \qquad (N - p - 1)\,\hat{\sigma}^2 \sim \sigma^2\,\chi^2_{N-p-1}$$
Significance of One Parameter
• Can we eliminate one parameter, $X_j$ ($j \neq 0$)?
• Look at the standardized coefficient

$$z_j = \frac{\hat{\beta}_j}{\sqrt{v_j}\,\hat{\sigma}} \sim t_{N-p-1}$$

where $v_j$ is the $j$th diagonal element of $(X^T X)^{-1}$.
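A sketch of the z-score computation assembled from the formulas above (the helper name is ours):

```python
import numpy as np
from scipy import stats

def z_scores(X, y):
    """Standardized coefficients z_j = beta_hat_j / (sqrt(v_j) * sigma_hat)."""
    N, k = X.shape                                   # k = p + 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - k))
    v = np.diag(np.linalg.inv(X.T @ X))              # v_j from (X^T X)^{-1}
    z = beta_hat / (np.sqrt(v) * sigma_hat)
    p_values = 2 * stats.t.sf(np.abs(z), df=N - k)   # two-sided t_{N-p-1} test
    return z, p_values
```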
Significance of Many Parameters
• We may want to test many features at once
• Compare model M₁ with p₁+1 parameters to a nested model M₀ formed from p₀+1 of M₁'s parameters (p₀ < p₁)
• Use the F statistic:

$$F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(N - p_1 - 1)} \sim F_{p_1 - p_0,\ N - p_1 - 1}$$
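The F statistic is a one-liner once the two fitted models' RSS values are in hand; a minimal sketch (argument names are ours):

```python
from scipy import stats

def f_test(rss0, rss1, p0, p1, N):
    """F statistic comparing nested models M0 (p0+1 params) and M1 (p1+1 params)."""
    F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
    p_value = stats.f.sf(F, p1 - p0, N - p1 - 1)  # upper tail of the F distribution
    return F, p_value
```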
Confidence Interval for Beta
• We can find a confidence interval for β_j
• Confidence interval for a single parameter (a 1−2α confidence interval for β_j):

$$\left(\hat{\beta}_j - z^{(1-\alpha)}\sqrt{v_j}\,\hat{\sigma},\ \ \hat{\beta}_j + z^{(1-\alpha)}\sqrt{v_j}\,\hat{\sigma}\right)$$

• Confidence set for the entire parameter vector (bounds on β):

$$C_\beta = \left\{\beta : (\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le \hat{\sigma}^2\,{\chi^2_{p+1}}^{(1-2\alpha)}\right\}$$
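A sketch of the single-coefficient interval (we use the t quantile in place of the slide's $z^{(1-\alpha)}$; for large N the normal quantile is nearly identical, and the function name is ours):

```python
import numpy as np
from scipy import stats

def beta_interval(beta_j, v_j, sigma_hat, df, alpha=0.025):
    """1 - 2*alpha confidence interval for one coefficient beta_j."""
    z = stats.t.ppf(1 - alpha, df)         # df = N - p - 1
    half_width = z * np.sqrt(v_j) * sigma_hat
    return beta_j - half_width, beta_j + half_width
```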
2.1: Prostate Cancer (Example)
• Data
  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: age
  • lbph: log of benign prostatic hyperplasia amount
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason scores 4 or 5
Technique for Multiple Regression
• Computing $\hat{\beta} = (X^T X)^{-1} X^T y$ directly has poor numerical properties
• QR decomposition of X
• Decompose X = QR, where
  • Q is an N × (p+1) matrix with orthonormal columns ($Q^T Q = I_{p+1}$)
  • R is a (p+1) × (p+1) upper triangular matrix
• Then

$$\hat{\beta} = (X^T X)^{-1} X^T y = (R^T Q^T Q R)^{-1} R^T Q^T y = (R^T R)^{-1} R^T Q^T y = R^{-1} Q^T y$$

$$\hat{y} = X\hat{\beta} = Q Q^T y$$

• The columns of X are built up from the columns of Q:

$$x_1 = r_{11} q_1, \qquad x_2 = r_{12} q_1 + r_{22} q_2, \qquad x_3 = r_{13} q_1 + r_{23} q_2 + r_{33} q_3, \quad \ldots$$
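A minimal sketch of the QR route (solving $R\beta = Q^T y$ by back-substitution; the function name is ours):

```python
import numpy as np
from scipy.linalg import solve_triangular

def fit_via_qr(X, y):
    """Least squares via the thin QR decomposition of X."""
    Q, R = np.linalg.qr(X)                   # Q: N x (p+1), R: (p+1) x (p+1)
    beta_hat = solve_triangular(R, Q.T @ y)  # back-substitute upper-triangular R
    y_hat = Q @ (Q.T @ y)                    # fitted values, y_hat = Q Q^T y
    return beta_hat, y_hat
```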
Gram-Schmidt Procedure
1) Initialize z₀ = x₀ = 1
2) For j = 1 to p:
   For k = 0 to j−1, regress x_j on the z_k's, giving the univariate least squares estimates

   $$\hat{\gamma}_{kj} = \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle}$$

   Then compute the next residual

   $$z_j = x_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj}\, z_k$$

3) Let Z = [z₀ z₁ … z_p] and let Γ be upper triangular with entries γ̂_kj; then
   X = ZΓ = ZD⁻¹DΓ = QR,
   where D is diagonal with D_jj = ||z_j||
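A direct transcription of this procedure in numpy (a teaching sketch that assumes X has full column rank; classical Gram-Schmidt like this is not numerically robust for ill-conditioned X):

```python
import numpy as np

def gram_schmidt_qr(X):
    """Successive orthogonalization: returns Q, R with X = QR."""
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    Z = np.zeros((N, k))
    Gamma = np.eye(k)                      # upper triangular, unit diagonal
    for j in range(k):
        z = X[:, j].copy()
        for m in range(j):
            # gamma_mj = <z_m, x_j> / <z_m, z_m>
            Gamma[m, j] = (Z[:, m] @ X[:, j]) / (Z[:, m] @ Z[:, m])
            z -= Gamma[m, j] * Z[:, m]     # next residual z_j
        Z[:, j] = z
    norms = np.linalg.norm(Z, axis=0)      # D_jj = ||z_j||
    Q = Z / norms                          # Q = Z D^{-1}
    R = np.diag(norms) @ Gamma             # R = D Gamma
    return Q, R
```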
Subset Selection
• We want to eliminate unnecessary features
• Best subset regression
  • Choose the subset of size k with the lowest RSS
  • The Leaps and Bounds procedure works for p up to about 40
• Forward stepwise selection
  • Repeatedly add the feature with the largest F-ratio
• Backward stepwise selection
  • Repeatedly remove the feature with the smallest F-ratio
These are greedy techniques – not guaranteed to find the best model. The F-ratio for adding or dropping a single feature is

$$F = \frac{RSS_0 - RSS_1}{RSS_1/(N - p_1 - 1)} \sim F_{1,\ N - p_1 - 1}$$
Coefficient Shrinkage
• Use additional penalties to shrink the coefficients
• Ridge regression
  • Minimize least squares subject to $\sum_{j=1}^{p} \beta_j^2 \le s$
• The lasso
  • Minimize least squares subject to $\sum_{j=1}^{p} |\beta_j| \le s$
• Principal components regression
  • Regress on M < p principal components of X
• Partial least squares
  • Regress on M < p directions of X weighted by y
4.2 Prostate Cancer Data Example (Continued)
Error Comparison
Shrinkage Methods (Ridge Regression)
• Minimize RSS(β) + λβᵀβ
  • Use centered data, so β₀ is not penalized:

$$\hat{\beta}_0 = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad x_{ij} \leftarrow x_{ij} - \frac{1}{N}\sum_{i=1}^{N} x_{ij}$$

  • The x_j are of length p, no longer including the initial 1
• The ridge estimates are:

$$RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\,\beta^T\beta$$

$$\frac{\partial\,RSS(\lambda)}{\partial\beta} = -2X^T(y - X\beta) + 2\lambda\beta = 0 \quad\Rightarrow\quad \hat{\beta}^{\,ridge} = (X^T X + \lambda I_p)^{-1} X^T y$$
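The ridge estimate is the same normal-equations solve with λI added; a minimal sketch on centered data (the function name is ours):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: center the data, then solve (Xc^T Xc + lam I) beta = Xc^T yc."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean        # beta_0 is not penalized
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta         # intercept on the original scale
    return beta0, beta
```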
Shrinkage Methods (Ridge Regression)
The Lasso
• Use centered data, as before
• The L1 penalty makes the solutions nonlinear in the y_i
  • Quadratic programming is used to compute them

$$RSS(\beta) = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$
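In practice one usually solves the equivalent Lagrangian (penalized) form rather than the constrained form above; for example, scikit-learn's Lasso minimizes ||y − Xβ||²/(2N) + α Σ|β_j|, where larger α corresponds to a smaller bound s. A hedged usage sketch (X and y are assumed to be centered arrays defined elsewhere):

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)   # alpha is illustrative; choose it by cross-validation
lasso.fit(X, y)            # X, y: centered data, as above
print(lasso.coef_)         # some coefficients are driven exactly to zero
```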
Shrinkage Methods (Lasso Regression)
Principal Components Regression
• Singular value decomposition (SVD) of X:

$$X = U D V^T$$

  • U is N × p, V is p × p; both are orthogonal
  • D is a p × p diagonal matrix
• Use linear combinations z_j = Xv_j of X as new features
  • v_j is the principal component (column of V) corresponding to the jth largest element of D
  • The v_j are the directions of maximal sample variance
  • Use only M < p features; [z₁ … z_M] replaces X

$$z_j = X v_j,\ \ j = 1, \ldots, M; \qquad \hat{y}^{\,pcr} = \bar{y}\,\mathbf{1} + \sum_{m=1}^{M} \hat{\theta}_m z_m, \qquad \hat{\theta}_m = \frac{\langle z_m, y \rangle}{\langle z_m, z_m \rangle}$$
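A minimal numpy sketch of PCR via the SVD (the function name is ours; data are assumed centered, so the ȳ term is approximately zero):

```python
import numpy as np

def pcr_fit(Xc, yc, M):
    """Principal components regression keeping the top M directions."""
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U D V^T
    Z = Xc @ Vt[:M].T                                  # z_m = X v_m, m = 1..M
    theta = (Z.T @ yc) / np.sum(Z * Z, axis=0)         # <z_m, y> / <z_m, z_m>
    y_hat = yc.mean() + Z @ theta                      # y_bar term (~0 if centered)
    return theta, y_hat
```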
Partial Least Squares
• Construct linear combinations of the inputs, incorporating y
• Finds directions that have both high variance and high correlation with the output
• The variance aspect tends to dominate, so partial least squares often operates like principal components regression
4.4 Methods Using Derived Input Directions
(PLS)
• Partial Least Squares
4.5 Discussion: a comparison of the selection and shrinkage methods
A Unifying View
• We can view all of these linear regression techniques under a common framework:

$$\hat{\beta} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p} |\beta_j|^q\right\}$$

• β includes the bias term β₀; q indicates a prior distribution on β
  • λ = 0: least squares
  • λ > 0, q = 0: subset selection (counts the number of nonzero parameters)
  • λ > 0, q = 1: the lasso
  • λ > 0, q = 2: ridge regression
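A sketch of the unified criterion as a function one could evaluate (the names are ours; note that only q ≥ 1 gives a convex problem, and q = 0 reduces to counting nonzero coefficients):

```python
import numpy as np

def penalized_rss(beta0, beta, X, y, lam, q):
    """RSS plus lambda * sum_j |beta_j|^q: q=0 subset size, q=1 lasso, q=2 ridge."""
    resid = y - beta0 - X @ beta
    if q == 0:
        penalty = np.count_nonzero(beta)   # number of nonzero parameters
    else:
        penalty = np.sum(np.abs(beta) ** q)
    return resid @ resid + lam * penalty
```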
Discussion: a comparison of the selection and shrinkage methods
• Family of Shrinkage Regression