- 1. Machine Learning Specialization
Ridge Regression: Regulating overfitting when using many features
Emily Fox & Carlos Guestrin
University of Washington
©2015 Emily Fox & Carlos Guestrin
- 3. Flexibility of high-order polynomials
[Figure: fitted curves fŵ of price ($) vs. square feet (sq.ft.)]
yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi
- 4. Flexibility of high-order polynomials
[Figure: wildly oscillating high-order fit fŵ of price ($) vs. square feet (sq.ft.), labeled OVERFIT]
yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi
- 5. Symptom of overfitting
Overfitting is often associated with very large estimated parameters ŵ.
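The symptom above can be sketched numerically. This is a minimal illustration on made-up data (points, noise level, and degrees are arbitrary choices, not from the lecture): the coefficients of an overfit high-order polynomial blow up in magnitude compared with a simple fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = x + 0.3 * rng.standard_normal(12)  # noisy, essentially linear data

# Least-squares fits of a low-order and a high-order polynomial
w_low = np.polyfit(x, y, deg=1)
w_high = np.polyfit(x, y, deg=9)

# The overfit model's estimated coefficients are far larger in magnitude
print(np.abs(w_low).max(), np.abs(w_high).max())
```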
- 7. Overfitting with many features
Not unique to polynomial regression; also arises if lots of inputs (d large) or, generically, lots of features (D large):
yi = Σj=0..D wj hj(xi) + εi
Example inputs:
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
- 8. How does # of observations influence overfitting?
Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit
[Figure: fits fŵ of price ($) vs. square feet (sq.ft.) for small and large N]
- 9. How does # of inputs influence overfitting?
1 input (e.g., sq.ft.):
Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting. HARD
[Figure: fit fŵ of price ($) vs. square feet (sq.ft.)]
- 10. How does # of inputs influence overfitting?
d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …):
Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting. MUCH HARDER!!!
[Figure: price ($) vs. two inputs x[1], x[2]]
- 12. [Block diagram of the regression workflow: Training Data → Feature extraction h(x) → ML model ŷ → Quality metric → ML algorithm → ŵ]
- 13. Desired total cost format
Want to balance:
i. How well function fits data (measure of quality of fit: small # = good fit to training data)
ii. Magnitude of coefficients (small # = not overfit)
Total cost = measure of fit + measure of magnitude of coefficients
- 14. Measure of fit to training data
RSS(w) = Σi=1..N (yi - h(xi)ᵀw)² = Σi=1..N (yi - ŷi(w))²
[Figure: price ($) vs. inputs x[1], x[2]]
- 15. Measure of magnitude of regression coefficients
What summary # is indicative of size of regression coefficients?
- Sum?
- Sum of absolute value?
- Sum of squares (L2 norm)
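A quick numerical sketch of why the three candidate summaries behave differently (the coefficient values are made up): the plain sum lets large positive and negative coefficients cancel, while the absolute-value and squared summaries do not.

```python
import numpy as np

w = np.array([1000.0, -1000.0, 2.0])  # hypothetical coefficients

print(w.sum())          # 2.0 -- cancellation hides the large values
print(np.abs(w).sum())  # 2002.0 -- sum of absolute values
print((w ** 2).sum())   # 2000004.0 -- sum of squares
```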
- 16. Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
- 17. Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
= RSS(w) + ||w||₂²
- 18. Consider resulting objective
What if ŵ selected to minimize RSS(w) + λ||w||₂²?
tuning parameter λ = balance of fit and magnitude
If λ=0: reduces to minimizing RSS(w) (the old least squares solution)
If λ=∞: ŵ = 0 (any other w gives infinite total cost)
If λ in between: a balance, with 0 ≤ ||ŵ||₂² ≤ ||ŵ(least squares)||₂²
- 19. Consider resulting objective
What if ŵ selected to minimize RSS(w) + λ||w||₂²?
tuning parameter λ = balance of fit and magnitude
Ridge regression (a.k.a. L2 regularization)
- 20. Bias-variance tradeoff
Large λ: high bias, low variance (e.g., ŵ = 0 for λ=∞)
Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of high-order polynomial for λ=0)
In essence, λ controls model complexity.
- 21. Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using ridge regression?
Will consider a few settings of λ …
- 22. Coefficient path
[Figure: "coefficient paths -- Ridge": coefficients ŵj for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront plotted against λ from 0 to 200000]
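A coefficient path like the one in the figure can be sketched with numpy on synthetic data (the data, dimensions, and λ grid below are invented for illustration; the solver is the normal-equations form of ridge, not the lecture's house-sales dataset): as λ grows, every coefficient is pulled toward 0.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 5
H = rng.standard_normal((N, D))
w_true = np.array([3.0, -2.0, 1.0, 0.0, 0.5])
y = H @ w_true + 0.1 * rng.standard_normal(N)

lambdas = [0.0, 1.0, 10.0, 100.0, 10000.0]
paths = []
for lam in lambdas:
    # ridge fit: w = (H^T H + lam I)^{-1} H^T y
    w = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
    paths.append(w)

# coefficient magnitudes shrink toward 0 as lambda grows
norms = [np.linalg.norm(w) for w in paths]
print(norms)
```

Plotting each coordinate of `paths` against `lambdas` reproduces the coefficient-path picture.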
- 24. [Block diagram of the regression workflow: Training Data → Feature extraction h(x) → ML model ŷ → Quality metric → ML algorithm → ŵ]
- 28. Rewrite magnitude of coefficients in vector notation
||w||₂² = w0² + w1² + w2² + … + wD² = wᵀw
- 29. Putting it all together
In matrix form, ridge regression cost is:
RSS(w) + λ||w||₂² = (y-Hw)ᵀ(y-Hw) + λwᵀw
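The matrix form can be checked against the elementwise definition on random data (a small sanity-check sketch; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
w = rng.standard_normal(4)
lam = 0.5

# Elementwise: RSS plus lam times the sum of squared coefficients
cost_elem = ((y - H @ w) ** 2).sum() + lam * (w ** 2).sum()

# Matrix form: (y-Hw)^T (y-Hw) + lam w^T w
r = y - H @ w
cost_mat = r @ r + lam * (w @ w)

print(cost_elem, cost_mat)  # identical up to floating-point rounding
```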
- 31. Gradient of ridge regression cost
∇[RSS(w) + λ||w||₂²] = ∇[(y-Hw)ᵀ(y-Hw)] + λ∇[wᵀw]
= -2Hᵀ(y-Hw) + λ(2w)
Why? By analogy to the 1d case: wᵀw is analogous to w², and the derivative of w² is 2w.
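The gradient formula can be verified numerically with central finite differences (a sketch on random data; the dimensions and λ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
w = rng.standard_normal(4)
lam = 0.5

def cost(w):
    r = y - H @ w
    return r @ r + lam * (w @ w)

# Analytic gradient from the slide: -2 H^T (y - Hw) + 2 lam w
grad = -2 * H.T @ (y - H @ w) + 2 * lam * w

# Central-difference approximation, one coordinate at a time
eps = 1e-6
num = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
                for e in np.eye(4)])
print(np.abs(grad - num).max())  # tiny
```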
- 33. Aside: Refresher on identity matrices
∇cost(w) = -2Hᵀ(y-Hw) + 2λw
= -2Hᵀ(y-Hw) + 2λIw
Fun facts: Iv = v, IA = A, A⁻¹A = I, AA⁻¹ = I
- 36. Recall discussion on previous closed-form solution
ŵ = (HᵀH)⁻¹Hᵀy
Invertible if: N > D (in general, N = # linearly independent obs)
Complexity of inverse: O(D³)
- 37. Discussion of ridge closed-form solution
ŵ = (HᵀH + λI)⁻¹Hᵀy
Invertible if: always if λ > 0, even if N < D
Complexity of inverse: O(D³)… big for large D!
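The closed form is a one-liner with numpy (a sketch on synthetic data; `np.linalg.solve` is used instead of an explicit inverse, a standard numerical choice). The sanity check below confirms the gradient of the ridge cost vanishes at the solution:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 30, 6
H = rng.standard_normal((N, D))
y = rng.standard_normal(N)
lam = 2.0

# Ridge closed form: w = (H^T H + lam I)^{-1} H^T y
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

# At the minimizer, the ridge gradient -2H^T(y-Hw) + 2 lam w is ~0
grad = -2 * H.T @ (y - H @ w_hat) + 2 * lam * w_hat
print(np.abs(grad).max())  # essentially zero
```

Note that with λ > 0 this system is solvable even when N < D, matching the slide.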
- 39. Elementwise ridge regression gradient descent algorithm
∇cost(w) = -2Hᵀ(y-Hw) + 2λw
Update to jth feature weight:
wj(t+1) ← wj(t) – η [-2 Σi=1..N hj(xi)(yi - ŷi(w(t))) + 2λwj(t)]
- 40. Elementwise ridge regression gradient descent algorithm
∇cost(w) = -2Hᵀ(y-Hw) + 2λw
Equivalently:
wj(t+1) ← (1-2ηλ)wj(t) + 2η Σi=1..N hj(xi)(yi - ŷi(w(t)))
- 42. Recall previous algorithm
init w(1)=0 (or randomly, or smartly), t=1
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σi=1..N hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← wj(t) – η partial[j]
  t ← t + 1
- 43. Summary of ridge regression algorithm
init w(1)=0 (or randomly, or smartly), t=1
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σi=1..N hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← (1-2ηλ)wj(t) – η partial[j]
  t ← t + 1
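The summary algorithm can be sketched in vectorized numpy on synthetic data (dimensions, η, λ, the iteration cap, and the use of the full ridge gradient in the stopping test are illustrative choices). The vector `partial` holds all the partial[j] sums at once, and the result matches the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 40, 3
H = rng.standard_normal((N, D))
y = rng.standard_normal(N)
lam, eta, eps = 1.0, 1e-3, 1e-6

w = np.zeros(D)  # init w(1) = 0
for _ in range(100000):
    # partial[j] = -2 sum_i h_j(x_i)(y_i - yhat_i(w)), all j at once
    partial = -2 * H.T @ (y - H @ w)
    grad = partial + 2 * lam * w
    if np.linalg.norm(grad) <= eps:
        break
    # w_j <- (1 - 2 eta lam) w_j - eta partial[j]
    w = (1 - 2 * eta * lam) * w - eta * partial

# Agrees with the closed-form ridge solution
w_closed = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
print(np.abs(w - w_closed).max())
```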
- 45. If sufficient amount of data…
Training set → fit ŵλ
Validation set → test performance of ŵλ to select λ*
Test set → assess generalization error of ŵλ*
- 48. How do we use the other data?
Rest of data: use for both training and validation, but not so naively
- 49. Recall naïve approach
Is a small validation set enough to compare performance of ŵλ across λ values? No.
- 50. Choosing the validation set
Didn't have to use the last data points tabulated to form the small validation set.
Can use any data subset.
- 51. Choosing the validation set
Which subset should I use? ALL! Average performance over all choices.
- 52. K-fold cross validation
Preprocessing: Randomly assign data to K groups, each of size N/K
(use same split of data for all other steps)
- 53. K-fold cross validation
For k=1,…,K:
1. Estimate ŵλ(k) on the training blocks
2. Compute error on validation block: errork(λ)
[Fold 1 held out as validation set: ŵλ(1), error1(λ)]
- 54. K-fold cross validation
[Same procedure, fold 2 held out: ŵλ(2), error2(λ)]
- 55. K-fold cross validation
[Same procedure, fold 3 held out: ŵλ(3), error3(λ)]
- 56. K-fold cross validation
[Same procedure, fold 4 held out: ŵλ(4), error4(λ)]
- 57. K-fold cross validation
For k=1,…,K:
1. Estimate ŵλ(k) on the training blocks
2. Compute error on validation block: errork(λ)
[Fold 5 held out as validation set: ŵλ(5), error5(λ)]
Compute average error: CV(λ) = (1/K) Σk=1..K errork(λ)
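The full procedure fits in a short numpy sketch (synthetic data; the closed-form ridge fit stands in for whatever estimator is used, and the λ grid is illustrative). Note the fold assignment is made once and reused for every λ, as slide 52 prescribes:

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, K = 50, 4, 5
H = rng.standard_normal((N, D))
y = H @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.2 * rng.standard_normal(N)

# Preprocessing: randomly assign data to K groups (same split for all steps)
folds = np.array_split(rng.permutation(N), K)

def ridge(Ht, yt, lam):
    # closed-form ridge fit on the training blocks
    return np.linalg.solve(Ht.T @ Ht + lam * np.eye(Ht.shape[1]), Ht.T @ yt)

def cv(lam):
    errs = []
    for k in range(K):
        valid = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge(H[train], y[train], lam)
        errs.append(np.mean((y[valid] - H[valid] @ w) ** 2))  # error_k(lam)
    return np.mean(errs)  # CV(lam) = (1/K) sum_k error_k(lam)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_star = min(lambdas, key=cv)  # choose lambda* minimizing CV(lambda)
print(lam_star)
```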
- 58. K-fold cross validation
Repeat procedure for each choice of λ
Choose λ* to minimize CV(λ)
- 59. What value of K?
Formally, the best approximation occurs for validation sets of size 1 (K=N): leave-one-out cross validation
Computationally intensive: requires computing N fits of model per λ
Typically, K=5 or 10 (5-fold or 10-fold CV)
- 61. Recall multiple regression model
Model:
yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
= Σj=0..D wj hj(xi) + εi
feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x[1]
feature 3 = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]
- 62. If constant feature…
yi = w0 + w1 h1(xi) + … + wD hD(xi) + εi
In matrix notation for N observations: the first column of H (the feature multiplying w0) is a column of N 1s.
- 63. Do we penalize intercept?
Standard ridge regression cost: RSS(w) + λ||w||₂² (λ = strength of penalty)
Encourages intercept w0 to also be small.
Do we want a small intercept? Conceptually, not indicative of overfitting…
- 64. Option 1: Don't penalize intercept
Modified ridge regression cost: RSS(w0, wrest) + λ||wrest||₂²
How to implement this in practice?
- 65. Option 1: Don't penalize intercept – Closed-form solution –
ŵ = (HᵀH + λImod)⁻¹Hᵀy
where Imod is the identity matrix with its top-left (0,0) entry set to 0, so w0 is not penalized
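A sketch of the Imod trick on synthetic data (the feature values and the deliberately extreme λ are invented to make the effect visible): with a huge penalty, the penalized weights are crushed to ~0 while the unpenalized intercept survives and lands near the mean of y.

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 40, 3
# constant feature in column 0, as on slide 62
H = np.column_stack([np.ones(N), rng.standard_normal((N, D))])
y = 5.0 + H[:, 1:] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

lam = 1e8  # extreme penalty, purely for illustration

# I_mod: identity with the (0,0) entry zeroed so w0 is not penalized
I_mod = np.eye(D + 1)
I_mod[0, 0] = 0.0

w_hat = np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)
print(w_hat)  # w_hat[0] near mean(y); remaining weights near 0
```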
- 66. Option 1: Don't penalize intercept – Gradient descent algorithm –
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σi=1..N hj(xi)(yi - ŷi(w(t)))
    if j==0:
      w0(t+1) ← w0(t) – η partial[j]
    else:
      wj(t+1) ← (1-2ηλ)wj(t) – η partial[j]
  t ← t + 1
- 67. Option 2: Center data first
If data are first centered about 0, then favoring a small intercept is not so worrisome.
Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
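A sketch of Option 2 on made-up data (the shapes and λ are arbitrary; adding the stored mean back at prediction time, in place of a fitted intercept, is one common convention and an assumption here, not spelled out on the slide):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 30
H = rng.standard_normal((N, 3))
y = 100.0 + rng.standard_normal(N)  # targets with a large mean

# Step 1: transform y to have 0 mean
y_mean = y.mean()
y_centered = y - y_mean

# Step 2: run ridge regression as normal on the centered targets
lam = 1.0
w = np.linalg.solve(H.T @ H + lam * np.eye(3), H.T @ y_centered)

# Predict by adding the stored mean back
y_pred = H @ w + y_mean
print(y_pred[:3])
```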
- 69. What you can do now…
• Describe what happens to magnitude of estimated coefficients when model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate ridge regression parameters:
- In closed form
- Using an iterative gradient descent algorithm
• Implement K-fold cross validation to select the ridge regression tuning parameter λ