- 1. Machine Learning Specialization
Ridge Regression: Regulating overfitting when using many features
Emily Fox & Carlos Guestrin
University of Washington
©2015 Emily Fox & Carlos Guestrin
- 3. Flexibility of high-order polynomials
[Figure: fitted curves fŵ of price ($) vs. square feet (sq.ft.)]
yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi
- 4. Flexibility of high-order polynomials
[Figure: wildly oscillating high-order fit fŵ of price ($) vs. square feet (sq.ft.), labeled OVERFIT]
yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi
- 5. Symptom of overfitting
Overfitting is often associated with very large estimated parameters ŵ.
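The symptom above can be sketched numerically. This is a minimal illustration on made-up data (points, noise level, and degrees are arbitrary choices, not from the lecture): the coefficients of an overfit high-order polynomial blow up in magnitude compared with a simple fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = x + 0.3 * rng.standard_normal(12)  # noisy, essentially linear data

# Least-squares fits of a low-order and a high-order polynomial
w_low = np.polyfit(x, y, deg=1)
w_high = np.polyfit(x, y, deg=9)

# The overfit model's estimated coefficients are far larger in magnitude
print(np.abs(w_low).max(), np.abs(w_high).max())
```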
- 7. Overfitting with many features
Not unique to polynomial regression; also arises if lots of inputs (d large) or, generically, lots of features (D large):
yi = Σj=0..D wj hj(xi) + εi
Example inputs:
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
- 8. How does # of observations influence overfitting?
Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit
[Figure: fits fŵ of price ($) vs. square feet (sq.ft.) for small and large N]
- 9. How does # of inputs influence overfitting?
1 input (e.g., sq.ft.):
Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting. HARD
[Figure: fit fŵ of price ($) vs. square feet (sq.ft.)]
- 10. How does # of inputs influence overfitting?
d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …):
Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting. MUCH HARDER!!!
[Figure: price ($) vs. two inputs x[1], x[2]]
- 12. [Block diagram of the regression workflow: Training Data → Feature extraction h(x) → ML model ŷ → Quality metric → ML algorithm → ŵ]
- 13. Desired total cost format
Want to balance:
i. How well function fits data (measure of quality of fit: small # = good fit to training data)
ii. Magnitude of coefficients (small # = not overfit)
Total cost = measure of fit + measure of magnitude of coefficients
- 14. Measure of fit to training data
RSS(w) = Σi=1..N (yi - h(xi)ᵀw)² = Σi=1..N (yi - ŷi(w))²
[Figure: price ($) vs. inputs x[1], x[2]]
- 15. Measure of magnitude of regression coefficients
What summary # is indicative of size of regression coefficients?
- Sum?
- Sum of absolute value?
- Sum of squares (L2 norm)
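A quick numerical sketch of why the three candidate summaries behave differently (the coefficient values are made up): the plain sum lets large positive and negative coefficients cancel, while the absolute-value and squared summaries do not.

```python
import numpy as np

w = np.array([1000.0, -1000.0, 2.0])  # hypothetical coefficients

print(w.sum())          # 2.0 -- cancellation hides the large values
print(np.abs(w).sum())  # 2002.0 -- sum of absolute values
print((w ** 2).sum())   # 2000004.0 -- sum of squares
```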
- 16. Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
- 17. Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
= RSS(w) + ||w||₂²
- 18. Consider resulting objective
What if ŵ selected to minimize RSS(w) + λ||w||₂²?
tuning parameter λ = balance of fit and magnitude
If λ=0: reduces to minimizing RSS(w) (the old least squares solution)
If λ=∞: ŵ = 0 (any other w gives infinite total cost)
If λ in between: a balance, with 0 ≤ ||ŵ||₂² ≤ ||ŵ(least squares)||₂²
- 19. Consider resulting objective
What if ŵ selected to minimize RSS(w) + λ||w||₂²?
tuning parameter λ = balance of fit and magnitude
Ridge regression (a.k.a. L2 regularization)
- 20. Bias-variance tradeoff
Large λ: high bias, low variance (e.g., ŵ = 0 for λ=∞)
Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of high-order polynomial for λ=0)
In essence, λ controls model complexity.
- 21. Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using ridge regression?
Will consider a few settings of λ …
- 22. Coefficient path
[Figure: "coefficient paths -- Ridge": coefficients ŵj for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront plotted against λ from 0 to 200000]
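A coefficient path like the one in the figure can be sketched with numpy on synthetic data (the data, dimensions, and λ grid below are invented for illustration; the solver is the normal-equations form of ridge, not the lecture's house-sales dataset): as λ grows, every coefficient is pulled toward 0.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 5
H = rng.standard_normal((N, D))
w_true = np.array([3.0, -2.0, 1.0, 0.0, 0.5])
y = H @ w_true + 0.1 * rng.standard_normal(N)

lambdas = [0.0, 1.0, 10.0, 100.0, 10000.0]
paths = []
for lam in lambdas:
    # ridge fit: w = (H^T H + lam I)^{-1} H^T y
    w = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
    paths.append(w)

# coefficient magnitudes shrink toward 0 as lambda grows
norms = [np.linalg.norm(w) for w in paths]
print(norms)
```

Plotting each coordinate of `paths` against `lambdas` reproduces the coefficient-path picture.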
- 24. [Block diagram of the regression workflow: Training Data → Feature extraction h(x) → ML model ŷ → Quality metric → ML algorithm → ŵ]
- 28. Rewrite magnitude of coefficients in vector notation
||w||₂² = w0² + w1² + w2² + … + wD² = wᵀw
- 29. Putting it all together
In matrix form, ridge regression cost is:
RSS(w) + λ||w||₂² = (y-Hw)ᵀ(y-Hw) + λwᵀw
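The matrix form can be checked against the elementwise definition on random data (a small sanity-check sketch; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
w = rng.standard_normal(4)
lam = 0.5

# Elementwise: RSS plus lam times the sum of squared coefficients
cost_elem = ((y - H @ w) ** 2).sum() + lam * (w ** 2).sum()

# Matrix form: (y-Hw)^T (y-Hw) + lam w^T w
r = y - H @ w
cost_mat = r @ r + lam * (w @ w)

print(cost_elem, cost_mat)  # identical up to floating-point rounding
```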
- 31. Gradient of ridge regression cost
∇[RSS(w) + λ||w||₂²] = ∇[(y-Hw)ᵀ(y-Hw)] + λ∇[wᵀw]
= -2Hᵀ(y-Hw) + λ(2w)
Why? By analogy to the 1d case: wᵀw is analogous to w², and the derivative of w² is 2w.
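The gradient formula can be verified numerically with central finite differences (a sketch on random data; the dimensions and λ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
w = rng.standard_normal(4)
lam = 0.5

def cost(w):
    r = y - H @ w
    return r @ r + lam * (w @ w)

# Analytic gradient from the slide: -2 H^T (y - Hw) + 2 lam w
grad = -2 * H.T @ (y - H @ w) + 2 * lam * w

# Central-difference approximation, one coordinate at a time
eps = 1e-6
num = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
                for e in np.eye(4)])
print(np.abs(grad - num).max())  # tiny
```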
- 33. Aside: Refresher on identity matrices
∇cost(w) = -2Hᵀ(y-Hw) + 2λw
= -2Hᵀ(y-Hw) + 2λIw
Fun facts: Iv = v, IA = A, A⁻¹A = I, AA⁻¹ = I
- 36. Recall discussion on previous closed-form solution
ŵ = (HᵀH)⁻¹Hᵀy
Invertible if: N > D (in general, N = # linearly independent obs)
Complexity of inverse: O(D³)
- 37. Discussion of ridge closed-form solution
ŵ = (HᵀH + λI)⁻¹Hᵀy
Invertible if: always if λ > 0, even if N < D
Complexity of inverse: O(D³)… big for large D!
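The closed form is a one-liner with numpy (a sketch on synthetic data; `np.linalg.solve` is used instead of an explicit inverse, a standard numerical choice). The sanity check below confirms the gradient of the ridge cost vanishes at the solution:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 30, 6
H = rng.standard_normal((N, D))
y = rng.standard_normal(N)
lam = 2.0

# Ridge closed form: w = (H^T H + lam I)^{-1} H^T y
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

# At the minimizer, the ridge gradient -2H^T(y-Hw) + 2 lam w is ~0
grad = -2 * H.T @ (y - H @ w_hat) + 2 * lam * w_hat
print(np.abs(grad).max())  # essentially zero
```

Note that with λ > 0 this system is solvable even when N < D, matching the slide.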
- 39. Elementwise ridge regression gradient descent algorithm
∇cost(w) = -2Hᵀ(y-Hw) + 2λw
Update to jth feature weight:
wj(t+1) ← wj(t) – η [-2 Σi=1..N hj(xi)(yi - ŷi(w(t))) + 2λwj(t)]
- 40. Elementwise ridge regression gradient descent algorithm
∇cost(w) = -2Hᵀ(y-Hw) + 2λw
Equivalently:
wj(t+1) ← (1-2ηλ)wj(t) + 2η Σi=1..N hj(xi)(yi - ŷi(w(t)))
- 42. Recall previous algorithm
init w(1)=0 (or randomly, or smartly), t=1
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σi=1..N hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← wj(t) – η partial[j]
  t ← t + 1
- 43. Summary of ridge regression algorithm
init w(1)=0 (or randomly, or smartly), t=1
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σi=1..N hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← (1-2ηλ)wj(t) – η partial[j]
  t ← t + 1
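The summary algorithm can be sketched in vectorized numpy on synthetic data (dimensions, η, λ, the iteration cap, and the use of the full ridge gradient in the stopping test are illustrative choices). The vector `partial` holds all the partial[j] sums at once, and the result matches the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 40, 3
H = rng.standard_normal((N, D))
y = rng.standard_normal(N)
lam, eta, eps = 1.0, 1e-3, 1e-6

w = np.zeros(D)  # init w(1) = 0
for _ in range(100000):
    # partial[j] = -2 sum_i h_j(x_i)(y_i - yhat_i(w)), all j at once
    partial = -2 * H.T @ (y - H @ w)
    grad = partial + 2 * lam * w
    if np.linalg.norm(grad) <= eps:
        break
    # w_j <- (1 - 2 eta lam) w_j - eta partial[j]
    w = (1 - 2 * eta * lam) * w - eta * partial

# Agrees with the closed-form ridge solution
w_closed = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
print(np.abs(w - w_closed).max())
```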
- 45. If sufficient amount of data…
Training set → fit ŵλ
Validation set → test performance of ŵλ to select λ*
Test set → assess generalization error of ŵλ*
- 48. How do we use the other data?
Rest of data: use for both training and validation, but not so naively
- 49. Recall naïve approach
Is a small validation set enough to compare performance of ŵλ across λ values? No.
- 50. Choosing the validation set
Didn't have to use the last data points tabulated to form the small validation set.
Can use any data subset.
- 51. Choosing the validation set
Which subset should I use? ALL! Average performance over all choices.
- 52. K-fold cross validation
Preprocessing: Randomly assign data to K groups, each of size N/K
(use same split of data for all other steps)
- 53. K-fold cross validation
For k=1,…,K:
1. Estimate ŵλ(k) on the training blocks
2. Compute error on validation block: errork(λ)
[Fold 1 held out as validation set: ŵλ(1), error1(λ)]
- 54. K-fold cross validation
[Same procedure, fold 2 held out: ŵλ(2), error2(λ)]
- 55. K-fold cross validation
[Same procedure, fold 3 held out: ŵλ(3), error3(λ)]
- 56. K-fold cross validation
[Same procedure, fold 4 held out: ŵλ(4), error4(λ)]
- 57. K-fold cross validation
For k=1,…,K:
1. Estimate ŵλ(k) on the training blocks
2. Compute error on validation block: errork(λ)
[Fold 5 held out as validation set: ŵλ(5), error5(λ)]
Compute average error: CV(λ) = (1/K) Σk=1..K errork(λ)
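The full procedure fits in a short numpy sketch (synthetic data; the closed-form ridge fit stands in for whatever estimator is used, and the λ grid is illustrative). Note the fold assignment is made once and reused for every λ, as slide 52 prescribes:

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, K = 50, 4, 5
H = rng.standard_normal((N, D))
y = H @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.2 * rng.standard_normal(N)

# Preprocessing: randomly assign data to K groups (same split for all steps)
folds = np.array_split(rng.permutation(N), K)

def ridge(Ht, yt, lam):
    # closed-form ridge fit on the training blocks
    return np.linalg.solve(Ht.T @ Ht + lam * np.eye(Ht.shape[1]), Ht.T @ yt)

def cv(lam):
    errs = []
    for k in range(K):
        valid = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge(H[train], y[train], lam)
        errs.append(np.mean((y[valid] - H[valid] @ w) ** 2))  # error_k(lam)
    return np.mean(errs)  # CV(lam) = (1/K) sum_k error_k(lam)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_star = min(lambdas, key=cv)  # choose lambda* minimizing CV(lambda)
print(lam_star)
```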
- 58. K-fold cross validation
Repeat procedure for each choice of λ
Choose λ* to minimize CV(λ)
- 59. What value of K?
Formally, the best approximation occurs for validation sets of size 1 (K=N): leave-one-out cross validation
Computationally intensive: requires computing N fits of model per λ
Typically, K=5 or 10 (5-fold or 10-fold CV)
- 61. Recall multiple regression model
Model:
yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
= Σj=0..D wj hj(xi) + εi
feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x[1]
feature 3 = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]
- 62. If constant feature…
yi = w0 + w1 h1(xi) + … + wD hD(xi) + εi
In matrix notation for N observations: the first column of H (the feature multiplying w0) is a column of N 1s.
- 63. Do we penalize intercept?
Standard ridge regression cost: RSS(w) + λ||w||₂² (λ = strength of penalty)
Encourages intercept w0 to also be small.
Do we want a small intercept? Conceptually, not indicative of overfitting…
- 64. Option 1: Don't penalize intercept
Modified ridge regression cost: RSS(w0, wrest) + λ||wrest||₂²
How to implement this in practice?
- 65. Option 1: Don't penalize intercept – Closed-form solution –
ŵ = (HᵀH + λImod)⁻¹Hᵀy
where Imod is the identity matrix with its top-left (0,0) entry set to 0, so w0 is not penalized
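A sketch of the Imod trick on synthetic data (the feature values and the deliberately extreme λ are invented to make the effect visible): with a huge penalty, the penalized weights are crushed to ~0 while the unpenalized intercept survives and lands near the mean of y.

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 40, 3
# constant feature in column 0, as on slide 62
H = np.column_stack([np.ones(N), rng.standard_normal((N, D))])
y = 5.0 + H[:, 1:] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

lam = 1e8  # extreme penalty, purely for illustration

# I_mod: identity with the (0,0) entry zeroed so w0 is not penalized
I_mod = np.eye(D + 1)
I_mod[0, 0] = 0.0

w_hat = np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)
print(w_hat)  # w_hat[0] near mean(y); remaining weights near 0
```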
- 66. Option 1: Don't penalize intercept – Gradient descent algorithm –
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σi=1..N hj(xi)(yi - ŷi(w(t)))
    if j==0:
      w0(t+1) ← w0(t) – η partial[j]
    else:
      wj(t+1) ← (1-2ηλ)wj(t) – η partial[j]
  t ← t + 1
- 67. Option 2: Center data first
If data are first centered about 0, then favoring a small intercept is not so worrisome.
Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
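A sketch of Option 2 on made-up data (the shapes and λ are arbitrary; adding the stored mean back at prediction time, in place of a fitted intercept, is one common convention and an assumption here, not spelled out on the slide):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 30
H = rng.standard_normal((N, 3))
y = 100.0 + rng.standard_normal(N)  # targets with a large mean

# Step 1: transform y to have 0 mean
y_mean = y.mean()
y_centered = y - y_mean

# Step 2: run ridge regression as normal on the centered targets
lam = 1.0
w = np.linalg.solve(H.T @ H + lam * np.eye(3), H.T @ y_centered)

# Predict by adding the stored mean back
y_pred = H @ w + y_mean
print(y_pred[:3])
```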
- 69. What you can do now…
• Describe what happens to magnitude of estimated coefficients when model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate ridge regression parameters:
- In closed form
- Using an iterative gradient descent algorithm
• Implement K-fold cross validation to select the ridge regression tuning parameter λ