Ridge Regression:
Regulating overfitting when using many features

Emily Fox & Carlos Guestrin
Machine Learning Specialization, University of Washington
©2015 Emily Fox & Carlos Guestrin
Overfitting of polynomial regression
Flexibility of high-order polynomials

yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi

[Figure: two price ($) vs. square feet (sq.ft.) scatter plots, each with a fitted curve fŵ; a low-order fit and a higher-order fit]
Flexibility of high-order polynomials

yi = w0 + w1 xi + w2 xi² + … + wp xi^p + εi

[Figure: same price ($) vs. square feet (sq.ft.) scatter plots; the high-order fit oscillates wildly through the training points. OVERFIT]
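To see this concretely, here is a minimal sketch (synthetic data, not the course's house data) of how the estimated coefficients blow up as the polynomial order p grows:

```python
import numpy as np

# Illustrative only: fit polynomials of increasing order to a few noisy
# points and watch the largest coefficient magnitude grow with p.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 10))
y = np.sin(4 * x) + rng.normal(0, 0.1, 10)   # stand-in for house prices

for p in (2, 9):
    w = np.polyfit(x, y, deg=p)              # least-squares fit of order p
    print(f"p={p}, max |w| = {np.abs(w).max():.1f}")
```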
Symptom of overfitting

Often, overfitting is associated with very large estimated parameters ŵ.
Overfitting of linear regression models more generically
Overfitting with many features

Not unique to polynomial regression; it also arises with lots of inputs (d large) or, generically, lots of features (D large):

yi = Σ_{j=0}^D wj hj(xi) + εi

Example inputs:
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
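As a concrete sketch of the generic-features view, a feature matrix H can be assembled column by column from the hj (the numbers and column choices below are illustrative):

```python
import numpy as np

# Hypothetical feature extraction: each column of H is one feature h_j(x),
# with h_0(x) = 1 as the constant feature. Values are made up.
X = np.array([[1430., 1., 3.],           # sq.ft., #bathrooms, #bedrooms
              [2950., 2., 4.],
              [1710., 2., 3.]])
H = np.column_stack([np.ones(len(X)), X])   # N x (D+1) matrix of h_j(x_i)
```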
How does # of observations influence overfitting?

Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit

[Figure: price ($) vs. square feet (sq.ft.) scatter plots with fitted curves fŵ for a small dataset and a large dataset]
How does # of inputs influence overfitting?

1 input (e.g., sq.ft.):
Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting. HARD.

[Figure: fitted curve fŵ over a price ($) vs. square feet (sq.ft.) scatter]
How does # of inputs influence overfitting?

d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …):
Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting. MUCH!!! HARDER.

[Figure: price y as a surface over two inputs x[1], x[2]]
Adding term to cost-of-fit to prefer small coefficients
[Block diagram: Training Data → Feature extraction h(x) → ML model → ŷ; Quality metric compares ŷ to y; ML algorithm outputs ŵ]
Desired total cost format

Want to balance:
i.  How well the function fits the data (measure quality of fit: small # = good fit to training data)
ii. Magnitude of coefficients (small # = not overfit)

Total cost = measure of fit + measure of magnitude of coefficients
Measure of fit to training data

RSS(w) = Σ_{i=1}^N (yi - h(xi)ᵀw)² = Σ_{i=1}^N (yi - ŷi(w))²

[Figure: price y over inputs x[1], x[2]]
Measure of magnitude of regression coefficients

What summary # is indicative of the size of the regression coefficients?
-  Sum?
-  Sum of absolute values?
-  Sum of squares (L2 norm)
  
Consider specific total cost
Total cost =
measure of fit + measure of magnitude
of coefficients
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients
           = RSS(w) + ||w||₂²
Consider resulting objective

What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

where the tuning parameter λ balances fit and magnitude?

If λ=0: reduces to minimizing RSS(w) alone, the old least-squares objective
If λ=∞: any w≠0 has infinite cost, so ŵ=0
If λ in between: ŵ shrinks between the least-squares solution and 0
Consider resulting objective

What if ŵ is selected to minimize

RSS(w) + λ||w||₂²

with tuning parameter λ balancing fit and magnitude?

This is ridge regression (a.k.a. L2 regularization).
Bias-variance tradeoff

Large λ: high bias, low variance (e.g., ŵ=0 for λ=∞)
Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of a high-order polynomial for λ=0)

In essence, λ controls model complexity.
Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using ridge regression?

Will consider a few settings of λ …
Coefficient path

[Figure: "coefficient paths -- Ridge". x-axis: λ from 0 to 200000; y-axis: coefficients ŵj. One curve per feature: bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront]
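A path like this can be traced by refitting ridge over a grid of penalties. A hedged sketch on synthetic data (scikit-learn's `alpha` plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data standing in for the house features above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=100)

lambdas = np.logspace(-2, 5, 30)
path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])
# Each column of `path` traces one w_j; all shrink toward 0 as lambda grows.
```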
Fitting the ridge regression model (for given λ value)
[Same block diagram as before: Training Data, Feature extraction h(x), ML model ŷ, Quality metric with y, ML algorithm producing ŵ]
Step 1: Rewrite total cost in matrix notation
Recall matrix form of RSS

Model for all N observations together:

y = Hw + ε
Recall matrix form of RSS

RSS(w) = Σ_{i=1}^N (yi - h(xi)ᵀw)² = (y - Hw)ᵀ(y - Hw)
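A quick numerical check of this equivalence, on made-up data:

```python
import numpy as np

# Verify that the summation and matrix forms of RSS agree.
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)

rss_sum = sum((y[i] - H[i] @ w) ** 2 for i in range(len(y)))
rss_mat = (y - H @ w) @ (y - H @ w)
assert np.isclose(rss_sum, rss_mat)
```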
Rewrite magnitude of coefficients in vector notation

||w||₂² = w0² + w1² + w2² + … + wD² = wᵀw
Putting it all together

In matrix form, the ridge regression cost is:

RSS(w) + λ||w||₂² = (y - Hw)ᵀ(y - Hw) + λwᵀw
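A minimal sketch of this matrix-form cost (the function name is ours, not the course's):

```python
import numpy as np

# (y - Hw)^T (y - Hw) + lam * w^T w
def ridge_cost(w, H, y, lam):
    r = y - H @ w
    return r @ r + lam * (w @ w)
```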
Step 2: Compute the gradient
Gradient of ridge regression cost

∇[RSS(w) + λ||w||₂²] = ∇[(y - Hw)ᵀ(y - Hw) + λwᵀw]
                     = ∇[(y - Hw)ᵀ(y - Hw)] + λ∇[wᵀw]
                     = -2Hᵀ(y - Hw) + λ(2w)

Why? By analogy to the 1-d case: wᵀw is analogous to w², and the derivative of w² is 2w.
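The same gradient as a one-line helper (illustrative, not library code):

```python
import numpy as np

# grad cost(w) = -2 H^T (y - Hw) + 2*lam*w
def ridge_gradient(w, H, y, lam):
    return -2 * H.T @ (y - H @ w) + 2 * lam * w
```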
Step 3, Approach 1: Set the gradient = 0
Aside: Refresher on identity matrices

∇cost(w) = -2Hᵀ(y - Hw) + 2λw
         = -2Hᵀ(y - Hw) + 2λIw

Fun facts: Iv = v,  IA = A,  A⁻¹A = I,  AA⁻¹ = I
Ridge closed-form solution

∇cost(w) = -2Hᵀ(y - Hw) + 2λIw = 0

Solve for w:
Hᵀ(y - Hw) = λIw
Hᵀy = HᵀHw + λIw = (HᵀH + λI)w
ŵ = (HᵀH + λI)⁻¹Hᵀy
Interpreting ridge closed-form solution

ŵ = (HᵀH + λI)⁻¹Hᵀy

If λ=0: ŵ = (HᵀH)⁻¹Hᵀy, the old least-squares solution
If λ=∞: ŵ = 0, since the λI term dominates and (HᵀH + λI)⁻¹ → 0
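A hedged sketch of this closed-form fit (the helper name is ours); np.linalg.solve is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

# w_hat = (H^T H + lam*I)^(-1) H^T y
def ridge_closed_form(H, y, lam):
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
```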
Recall discussion on previous closed-form solution

ŵ = (HᵀH)⁻¹Hᵀy

Invertible if: in general, N > D (# linearly independent observations)
Complexity of inverse: O(D³)
Discussion of ridge closed-form solution

ŵ = (HᵀH + λI)⁻¹Hᵀy

Invertible if: always, for λ>0, even if N < D
Complexity of inverse: O(D³)… big for large D!
Step 3, Approach 2: Gradient descent
Elementwise ridge regression gradient descent algorithm

∇cost(w) = -2Hᵀ(y - Hw) + 2λw

Update to jth feature weight:
wj(t+1) ← wj(t) - η [ -2 Σ_{i=1}^N hj(xi)(yi - ŷi(w(t))) + 2λwj(t) ]
Elementwise ridge regression gradient descent algorithm

∇cost(w) = -2Hᵀ(y - Hw) + 2λw

Equivalently:
wj(t+1) ← (1 - 2ηλ) wj(t) + 2η Σ_{i=1}^N hj(xi)(yi - ŷi(w(t)))
Recall previous algorithm

init w(1)=0 (or randomly, or smartly), t=1
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σ_{i=1}^N hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← wj(t) - η partial[j]
  t ← t + 1
Summary of ridge regression algorithm

init w(1)=0 (or randomly, or smartly), t=1
while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σ_{i=1}^N hj(xi)(yi - ŷi(w(t)))
    wj(t+1) ← (1 - 2ηλ) wj(t) - η partial[j]
  t ← t + 1
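A vectorized sketch of this loop (η, λ, and the tolerance are illustrative; the stopping test below uses the full cost gradient, a small deviation from the slide's ||∇RSS|| check):

```python
import numpy as np

def ridge_gradient_descent(H, y, lam, eta=1e-4, tol=1e-6, max_iter=100_000):
    w = np.zeros(H.shape[1])                  # init w(1) = 0
    for _ in range(max_iter):
        grad = -2 * H.T @ (y - H @ w) + 2 * lam * w
        if np.linalg.norm(grad) <= tol:       # stop when gradient is tiny
            break
        w = w - eta * grad                    # same as (1-2*eta*lam)*w_j + ...
    return w
```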
How to choose λ
If sufficient amount of data…

Split: Training set | Validation set | Test set

- Training set: fit ŵλ
- Validation set: test performance of ŵλ to select λ*
- Test set: assess generalization error of ŵλ*
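A minimal sketch of such a split; the 60/20/20 fractions are an assumption, not something the slides specify:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                       # illustrative dataset size
idx = rng.permutation(N)
n_train, n_valid = int(0.6 * N), int(0.2 * N)
train, valid, test = np.split(idx, [n_train, n_train + n_valid])
```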
Start with smallish dataset

[Diagram: a single bar representing all the data]
Still form test set and hold out

[Diagram: data split into "Rest of data" and a held-out "Test set"]
How do we use the other data?

Rest of data: use for both training and validation, but not so naively.
Recall naïve approach

Is a small validation set enough to compare performance of ŵλ across λ values? No.

[Diagram: "Rest of data" split into a training set and a small validation set]
Choosing the validation set

Didn't have to use the last tabulated data points to form the small validation set. Can use any data subset.
Choosing the validation set

Which subset should I use? ALL! Average performance over all choices.
K-fold cross validation

Preprocessing: Randomly assign data to K groups, each of size N/K (use the same split of data for all other steps).

[Diagram: "Rest of data" divided into K equal blocks of N/K points]
K-fold cross validation

For k=1,…,K:
1.  Estimate ŵλ(k) on the training blocks
2.  Compute error on the kth validation block: errork(λ)

[Diagram sequence: the validation block rotates through positions k=1,…,5, producing ŵλ(1),…,ŵλ(5) and error1(λ),…,error5(λ)]

Compute average error: CV(λ) = (1/K) Σ_{k=1}^K errork(λ)
K-fold cross validation

Repeat the procedure for each choice of λ. Choose λ* to minimize CV(λ).
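Putting the procedure together, a sketch that reuses the ridge_closed_form helper sketched earlier (the data layout and λ grid are assumptions):

```python
import numpy as np

# H is the N x (D+1) feature matrix, y the targets, lambdas a 1-d grid.
def cross_validate(H, y, lambdas, K=5, seed=0):
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    cv = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            valid = folds[k]                  # k-th block is validation
            train = np.hstack([f for j, f in enumerate(folds) if j != k])
            w = ridge_closed_form(H[train], y[train], lam)
            errs.append(np.mean((y[valid] - H[valid] @ w) ** 2))
        cv.append(np.mean(errs))              # CV(lambda) = average error
    return lambdas[int(np.argmin(cv))]        # lambda* minimizes CV
```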
What value of K?

Formally, the best approximation occurs for validation sets of size 1 (K=N): leave-one-out cross validation.
Computationally intensive: requires computing N fits of the model per λ.

Typically, K=5 or 10 (5-fold or 10-fold CV).
How to handle the intercept
Recall multiple regression model

Model:
yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
   = Σ_{j=0}^D wj hj(xi) + εi

feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x[1]
feature 3 = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]
If constant feature…

yi = w0 + w1 h1(xi) + … + wD hD(xi) + εi

In matrix notation for N observations: the first column of H is all 1s, and it multiplies w0.
Do we penalize intercept?

Standard ridge regression cost:

RSS(w) + λ||w||₂²   (λ = strength of penalty)

This encourages the intercept w0 to also be small. Do we want a small intercept? Conceptually, it is not indicative of overfitting…
Option 1: Don't penalize intercept

Modified ridge regression cost:

RSS(w0, wrest) + λ||wrest||₂²

How to implement this in practice?
Option 1: Don't penalize intercept
– Closed-form solution –

ŵ = (HᵀH + λImod)⁻¹Hᵀy

where Imod is the identity matrix with its top-left (intercept) entry set to 0.
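A sketch of this modified closed form, with Imod built explicitly (helper name is ours):

```python
import numpy as np

def ridge_no_intercept_penalty(H, y, lam):
    I_mod = np.eye(H.shape[1])
    I_mod[0, 0] = 0.0                         # don't penalize w0
    return np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)
```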
Option 1: Don't penalize intercept
– Gradient descent algorithm –

while ||∇RSS(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = -2 Σ_{i=1}^N hj(xi)(yi - ŷi(w(t)))
    if j==0:
      w0(t+1) ← w0(t) - η partial[j]
    else:
      wj(t+1) ← (1 - 2ηλ) wj(t) - η partial[j]
  t ← t + 1
Option 2: Center data first

If data are first centered about 0, then favoring a small intercept is not so worrisome.

Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
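A minimal sketch of Option 2, again reusing the hypothetical ridge_closed_form helper:

```python
def ridge_centered(H, y, lam):
    y_mean = y.mean()                         # Step 1: zero-mean targets
    w = ridge_closed_form(H, y - y_mean, lam) # Step 2: ridge as normal
    return w, y_mean                          # add y_mean back at predict time
```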
Summary for ridge regression
What you can do now…
•  Describe what happens to the magnitude of estimated coefficients when a model is overfit
•  Motivate the form of the ridge regression cost function
•  Describe what happens to the estimated coefficients of ridge regression as the tuning parameter λ is varied
•  Interpret a coefficient path plot
•  Estimate ridge regression parameters:
   -  In closed form
   -  Using an iterative gradient descent algorithm
•  Implement K-fold cross validation to select the ridge regression tuning parameter λ

©2015 Emily Fox & Carlos Guestrin
