Assessing
Performance
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015 Emily Fox & Carlos Guestrin
Make predictions, get $, right??
Model + algorithm → fitted function f
Predictions → decisions → outcome
Or, how much am I losing?
Example: Lost $ due to inaccurate listing price
- Too low → low offers
- Too high → few lookers + no/low offers
How much am I losing compared to perfection?
Perfect predictions: Loss = 0
My predictions: Loss = ???
Measuring loss
Loss function:
L(y,fŵ(x))
Examples:
(assuming loss for underpredicting = overpredicting)
Absolute error: L(y,fŵ(x)) = |y − fŵ(x)|
Squared error: L(y,fŵ(x)) = (y − fŵ(x))²
y = actual value
ŷ = fŵ(x) = predicted value
L(y,fŵ(x)) = cost of using ŵ at x when y is true
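As a sketch, the two losses above in code (function names are illustrative):

```python
def absolute_error(y, y_hat):
    """L(y, f(x)) = |y - f(x)|"""
    return abs(y - y_hat)

def squared_error(y, y_hat):
    """L(y, f(x)) = (y - f(x))^2 -- penalizes large mistakes much more heavily"""
    return (y - y_hat) ** 2

# True price $310,000, predicted price $300,000
print(absolute_error(310_000, 300_000))  # 10000
print(squared_error(310_000, 300_000))   # 100000000
```

Both are symmetric, matching the slide's assumption that the loss for underpredicting equals the loss for overpredicting.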
“Remember that all models are
wrong; the practical question is
how wrong do they have to be to
not be useful.” George Box, 1987.
Assessing the loss
Assessing the loss
Part 1: Training error
Define training data
[Plot: training data — price ($) vs. square feet (sq.ft.)]
Example:
Fit quadratic to minimize RSS
[Plot: quadratic fit to price ($) vs. square feet (sq.ft.)]
ŵ minimizes RSS of training data
Compute training error
1.  Define a loss function L(y,fŵ(x))
-  E.g., squared error, absolute error,…
2.  Training error
= avg. loss on houses in training set
= (1/N) Σᵢ₌₁ᴺ L(yᵢ, fŵ(xᵢ)),  with ŵ fit using training data
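The average-loss definition translates directly; a minimal sketch where `loss` is any per-house loss function (names illustrative):

```python
def training_error(ys, preds, loss):
    """(1/N) * sum_i L(y_i, f_w(x_i)) over the training set."""
    assert len(ys) == len(preds)
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)

squared = lambda y, p: (y - p) ** 2
err = training_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0], squared)
print(err)  # (0 + 0.25 + 1.0) / 3
```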
Example:
Use squared error loss (y − fŵ(x))²
Training error (ŵ) = 1/N ×
[($train 1 − fŵ(sq.ft.train 1))²
+ ($train 2 − fŵ(sq.ft.train 2))²
+ ($train 3 − fŵ(sq.ft.train 3))²
+ … include all training houses]
Example:
Use squared error loss (y − fŵ(x))²
Training error (ŵ) = (1/N) Σᵢ₌₁ᴺ (yᵢ − fŵ(xᵢ))²

RMSE (ŵ) = √[ (1/N) Σᵢ₌₁ᴺ (yᵢ − fŵ(xᵢ))² ]
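A sketch of both formulas on synthetic sq.ft./price data (the data-generating numbers are made up), fitting the quadratic by least squares with `numpy.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=50)
price = 50_000 + 100 * sqft + 0.01 * sqft**2 + rng.normal(0, 20_000, size=50)

# Fit a quadratic by least squares (i.e., w-hat minimizes RSS on the training data)
w = np.polyfit(sqft, price, deg=2)
preds = np.polyval(w, sqft)

train_err = np.mean((price - preds) ** 2)   # squared-error training error, units of $^2
rmse = np.sqrt(train_err)                   # RMSE, back in units of $
print(rmse)
```

RMSE is often preferred for reporting because it is in the same units as y (dollars here), unlike the squared-error training error.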
Training error vs.
model complexity
[Plot: Error vs. model complexity — training error decreases as complexity increases; insets show fits to price ($) vs. sq.ft. data]
Is training error a good measure
of predictive performance?
How do we expect to perform on
a new house?
Is training error a good measure
of predictive performance?
Is there something particularly bad
about having xt square feet???
  
Is training error a good measure
of predictive performance?
Issue: Training error is overly optimistic
because ŵ was fit to training data
Small training error ⇏ good predictions, unless
training data includes everything you might ever see
Assessing the loss
Part 2: Generalization (true) error
Generalization error
Really want an estimate of loss
over all possible (house, $) pairs
Lots of houses
in neighborhood,
but not in dataset
Distribution over houses
In our neighborhood, houses of what
# sq.ft. are we likely to see?
[Plot: distribution over square feet (sq.ft.)]
  
Distribution over sales prices
For houses with a given # sq.ft.,
what house prices $ are we likely to see?
[Plot: distribution over price ($), for a fixed # sq.ft.]
Generalization error definition
Really want an estimate of loss
over all possible (house, $) pairs
Formally:
generalization error = E_{x,y}[L(y,fŵ(x))]
ŵ fit using training data; the expectation averages over
all possible (x,y) pairs, weighted by how likely each is
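Generalization error can never be computed on real data, but on synthetic data, where we choose the true function and the (x, y) distribution ourselves, the expectation can be approximated by averaging the loss over a large fresh sample. A sketch (all distributions here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: 50_000 + 150 * x          # assumed "true" price function

def sample(n):
    """Draw n (x, y) pairs from the assumed house distribution."""
    x = rng.uniform(500, 4000, n)
    y = f_true(x) + rng.normal(0, 20_000, n)
    return x, y

# Fit a line on a small training set
x_tr, y_tr = sample(30)
w = np.polyfit(x_tr, y_tr, deg=1)

# Approximate E_{x,y}[ (y - f_w(x))^2 ] with a large Monte Carlo sample
x_new, y_new = sample(200_000)
gen_err = np.mean((y_new - np.polyval(w, x_new)) ** 2)
print(gen_err)  # close to the noise variance (20_000^2), plus the fit's own error
```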
Generalization error vs.
model complexity
[Plot: generalization error vs. model complexity — decreases, then increases; insets show fits fŵ to price ($) vs. sq.ft. data]
Can’t compute!
Assessing the loss
Part 3: Test error
Approximating
generalization error
Wanted estimate of loss
over all possible (house, $) pairs
Approximate by looking at
houses not in training set
Forming a test set
Hold out some (house, $) pairs that are
not used for fitting the model
Training set
Test set
Proxy for “everything you
might see”
Compute test error
Test error
= avg. loss on houses in test set
= (1/N_test) Σ_{i in test set} L(yᵢ, fŵ(xᵢ)),
with ŵ fit using training data — it has never seen the test data; N_test = # test points
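A sketch of the test-error computation with a held-out set (data and split sizes invented):

```python
import numpy as np

rng = np.random.default_rng(2)
sqft = rng.uniform(500, 4000, 100)
price = 50_000 + 150 * sqft + rng.normal(0, 20_000, 100)

# Hold out 20 houses that are never used for fitting
idx = rng.permutation(100)
train, test = idx[:80], idx[80:]

w = np.polyfit(sqft[train], price[train], deg=1)  # w-hat has never seen the test data
test_preds = np.polyval(w, sqft[test])
test_err = np.mean((price[test] - test_preds) ** 2)  # (1/N_test) * sum of squared errors
print(test_err)
```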
Example: As before,
fit quadratic to training data
ŵ minimizes
RSS of
training data
Example: As before,
use squared error loss (y − fŵ(x))²
Test error (ŵ) = 1/N_test ×
[($test 1 − fŵ(sq.ft.test 1))²
+ ($test 2 − fŵ(sq.ft.test 2))²
+ ($test 3 − fŵ(sq.ft.test 3))²
+ … include all test houses]
Training, true, & test error vs.
model complexity
[Plot: training error decreases with model complexity; true/test error decreases, then increases]
Overfitting if there exists a model w′ such that:
- training_error(ŵ) < training_error(w′), but
- true_error(ŵ) > true_error(w′)
Training/test split
Training/test splits
Training set vs. test set: how many points in each?
Training/test splits
Too few training points → ŵ poorly estimated
Too few test points → test error is a bad approximation
of generalization error
Training/test splits
Typically, keep just enough test points to form a
reasonable estimate of generalization error.
If this leaves too few for training, use other
methods like cross validation (will see later…)
3 sources of error +
the bias-variance tradeoff
3 sources of error
In forming predictions, there
are 3 sources of error:
1.  Noise
2.  Bias
3.  Variance
Data inherently noisy
yᵢ = f_w(true)(xᵢ) + εᵢ
ε is inherent noise, with variance σ² — the irreducible error
Bias contribution
Assume we fit a constant function
[Plots: constant fits fŵ(train1) and fŵ(train2) to two different training sets — N house sales (house, $) vs. N other house sales]
Bias contribution
Over all possible size N training sets,
what do I expect my fit to be?
[Plot: specific fits fŵ(train1), fŵ(train2), fŵ(train3), their average f_w̄, and the true function f_w(true)]
Bias contribution
Bias(x) = fw(true)(x) - fw̄(x)
low complexity → high bias
Is our approach flexible
enough to capture fw(true)?
If not, error in predictions.
Variance contribution
How much do specific fits
vary from the expected fit?
low complexity → low variance
Can specific fits
vary widely?
If so, erratic
predictions
Variance of high-complexity models
Assume we fit a high-order polynomial
Variance of high-complexity models
high complexity → high variance
Bias of high-complexity models
high complexity → low bias
Bias-variance tradeoff
[Plot: as model complexity grows, bias² decreases and variance increases; their sum is minimized at an intermediate complexity]
Error vs. amount of data
[Plot: Error vs. # data points in training set — true error decreases, and training error increases, toward a common limit as N grows]
More in depth on the
3 sources of error…
OPTIONAL
Accounting for training set randomness
Training set was just a random sample of N houses sold
What if N other houses had been sold and recorded?
[Plots: fits fŵ(1) and fŵ(2) obtained from two different random samples of N house sales]
generalization error of ŵ(1) vs. generalization error of ŵ(2)
Accounting for training set randomness
Ideally, want performance averaged
over all possible training sets of size N
Expected prediction error
E_{training set}[generalization error of ŵ(training set)]
averaging over all training sets (weighted by how likely each is);
ŵ(training set) = parameters fit on a specific training set
Prediction error at target input
Start by considering:
1.  Loss at target xt (e.g., 2640 sq.ft.)
2.  Squared error loss L(y,fŵ(x)) = (y − fŵ(x))²
Sum of 3 sources of error
Average prediction error at xt
= σ² + [bias(fŵ(xt))]² + var(fŵ(xt))
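The decomposition can be checked numerically on synthetic data, where f_w(true) and σ are known. This sketch fits a constant function (the low-complexity example above) on many independent training sets and compares the simulated average prediction error at xt with σ² + bias² + variance:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
f_true = lambda x: 2.0 + 0.5 * x
xt = 1.5                                   # target input

# Fit a constant (the mean of y) on many independent size-N training sets
N, trials = 20, 20_000
fits = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0.0, 2.0, N)
    y = f_true(x) + rng.normal(0.0, sigma, N)
    fits[t] = y.mean()                     # constant fit: f_hat(x) = mean(y)

bias = f_true(xt) - fits.mean()            # f_true(xt) - f_wbar(xt)
var = fits.var()                           # E[(f_hat(xt) - f_wbar(xt))^2]

# Simulated average prediction error at xt, over fresh (training set, y_t) pairs
y_t = f_true(xt) + rng.normal(0.0, sigma, trials)
avg_err = np.mean((y_t - fits) ** 2)

print(avg_err, sigma**2 + bias**2 + var)   # the two numbers should nearly agree
```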
Error variance of the model
Average prediction error at xt
= σ² + [bias(fŵ(xt))]² + var(fŵ(xt))
σ² = variance of the noise ε, where y = f_w(true)(x) + ε — the irreducible error
Bias of function estimator
Average prediction error at xt
= σ² + [bias(fŵ(xt))]² + var(fŵ(xt))
Bias of function estimator
Average estimated function = fw̄(x)
True function = fw(x)
f_w̄(x) = E_train[fŵ(train)(x)], averaging over all training sets of size N
bias(fŵ(xt)) = f_w(xt) − f_w̄(xt)
Variance of function estimator
var(fŵ(xt)) = E_train[(fŵ(train)(xt) − f_w̄(xt))²]
- expectation over all training sets of size N
- fŵ(train): fit on a specific training dataset
- f_w̄: what I expect to learn over all training sets
- the squared term: deviation of the specific fit from the expected fit at xt
Why 3 sources of error?
A formal derivation
OPTIONAL
Deriving expected
prediction error
Expected prediction error
= E_train[generalization error of ŵ(train)]
= E_train[E_{x,y}[L(y,fŵ(train)(x))]]
1.  Look at specific xt
2.  Consider L(y,fŵ(x)) = (y − fŵ(x))²
Expected prediction error at xt
= E_{train,yt}[(yt − fŵ(train)(xt))²]
Expected prediction error at xt
= E_{train,yt}[(yt − fŵ(train)(xt))²]
= E_{train,yt}[((yt − f_w(true)(xt)) + (f_w(true)(xt) − fŵ(train)(xt)))²]
= σ² + MSE[fŵ(train)(xt)]
(the cross term vanishes: the noise yt − f_w(true)(xt) has mean zero and is independent of the training set)
Equating MSE with
bias and variance
MSE[fŵ(train)(xt)]
= E_train[(f_w(true)(xt) − fŵ(train)(xt))²]
= E_train[((f_w(true)(xt) − f_w̄(xt)) + (f_w̄(xt) − fŵ(train)(xt)))²]
= [bias(fŵ(xt))]² + var(fŵ(xt))
(the cross term vanishes by the definition f_w̄ = E_train[fŵ(train)])
Putting it all together
Expected prediction error at xt
= σ² + MSE[fŵ(xt)]
= σ² + [bias(fŵ(xt))]² + var(fŵ(xt))
3 sources of error
Summary of tasks
The regression/ML workflow
1.  Model selection
Often, need to choose tuning
parameters λ controlling model
complexity (e.g. degree of polynomial)
2.  Model assessment
Having selected a model, assess
the generalization error
Hypothetical implementation
1.  Model selection
For each considered model complexity λ :
i.  Estimate parameters ŵλ on training data
ii.  Assess performance of ŵλ on test data
iii.  Choose λ* to be λ with lowest test error
2.  Model assessment
Compute test error of ŵλ* (fitted model for selected
complexity λ*) to approx. generalization error
  
Hypothetical implementation
1.  Model selection
For each considered model complexity λ :
i.  Estimate parameters ŵλ on training data
ii.  Assess performance of ŵλ on test data
iii.  Choose λ* to be λ with lowest test error
2.  Model assessment
Compute test error of ŵλ* (fitted model for selected
complexity λ*) to approx. generalization error
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Overly optimistic!
Hypothetical implementation
Issue: Just like fitting ŵ and assessing its
performance both on training data
•  λ* was selected to minimize test error
(i.e., λ* was fit on test data)
•  If test data is not representative of the whole
world, then ŵλ* will typically perform worse than
test error indicates
Practical implementation
Solution: Create two “test” sets!
1.  Select λ* such that ŵλ* minimizes error on
validation set
2.  Approximate generalization error of ŵλ* using
test set
Training set | Validation set | Test set
Practical implementation
Training set: fit ŵλ
Validation set: test performance of ŵλ to select λ*
Test set: assess generalization error of ŵλ*
Typical splits
Training set | Validation set | Test set
    80%     |      10%       |   10%
    50%     |      25%       |   25%
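The full workflow — fit on training, select λ on validation, assess on test — can be sketched end to end, with polynomial degree playing the role of λ (data and split sizes invented):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 3.0, 300)
y = np.sin(x) + rng.normal(0.0, 0.1, 300)

# 80% / 10% / 10% split
idx = rng.permutation(300)
tr, va, te = idx[:240], idx[240:270], idx[270:]

def mse(w, ix):
    return np.mean((y[ix] - np.polyval(w, x[ix])) ** 2)

# 1. Model selection: choose degree (the tuning parameter) on the VALIDATION set
best_deg, best_err = None, np.inf
for deg in range(1, 9):
    w = np.polyfit(x[tr], y[tr], deg)      # parameters fit on training data only
    err = mse(w, va)
    if err < best_err:
        best_deg, best_err = deg, err

# 2. Model assessment: the untouched TEST set approximates generalization error
w_star = np.polyfit(x[tr], y[tr], best_deg)
print(best_deg, mse(w_star, te))
```

Reusing the test set to pick the degree would make the reported test error overly optimistic — exactly the issue the validation set avoids.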
Summary of
assessing performance
What you can do now…
•  Describe what a loss function is and give examples
•  Contrast training, generalization, and test error
•  Compute training and test error given a loss function
•  Discuss issue of assessing performance on training set
•  Describe tradeoffs in forming training/test splits
•  List and interpret the 3 sources of avg. prediction error
-  Irreducible error, bias, and variance
•  Discuss issue of selecting model complexity on test data
and then using test error to assess generalization error
•  Motivate use of a validation set for selecting tuning
parameters (e.g., model complexity)
•  Describe overall regression workflow

KHUSWANT SINGH.pptx ALL YOU NEED TO KNOW ABOUT KHUSHWANT SINGH
 
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
220711130100 udita Chakraborty  Aims and objectives of national policy on inf...220711130100 udita Chakraborty  Aims and objectives of national policy on inf...
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
 
Educational Technology in the Health Sciences
Educational Technology in the Health SciencesEducational Technology in the Health Sciences
Educational Technology in the Health Sciences
 
Accounting for Restricted Grants When and How To Record Properly
Accounting for Restricted Grants  When and How To Record ProperlyAccounting for Restricted Grants  When and How To Record Properly
Accounting for Restricted Grants When and How To Record Properly
 
THE SACRIFICE HOW PRO-PALESTINE PROTESTS STUDENTS ARE SACRIFICING TO CHANGE T...
THE SACRIFICE HOW PRO-PALESTINE PROTESTS STUDENTS ARE SACRIFICING TO CHANGE T...THE SACRIFICE HOW PRO-PALESTINE PROTESTS STUDENTS ARE SACRIFICING TO CHANGE T...
THE SACRIFICE HOW PRO-PALESTINE PROTESTS STUDENTS ARE SACRIFICING TO CHANGE T...
 
INTRODUCTION TO HOSPITALS & AND ITS ORGANIZATION
INTRODUCTION TO HOSPITALS & AND ITS ORGANIZATION INTRODUCTION TO HOSPITALS & AND ITS ORGANIZATION
INTRODUCTION TO HOSPITALS & AND ITS ORGANIZATION
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
 
220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science
 
How to Setup Default Value for a Field in Odoo 17
How to Setup Default Value for a Field in Odoo 17How to Setup Default Value for a Field in Odoo 17
How to Setup Default Value for a Field in Odoo 17
 
Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...
 
Simple-Present-Tense xxxxxxxxxxxxxxxxxxx
Simple-Present-Tense xxxxxxxxxxxxxxxxxxxSimple-Present-Tense xxxxxxxxxxxxxxxxxxx
Simple-Present-Tense xxxxxxxxxxxxxxxxxxx
 
A Free 200-Page eBook ~ Brain and Mind Exercise.pptx
A Free 200-Page eBook ~ Brain and Mind Exercise.pptxA Free 200-Page eBook ~ Brain and Mind Exercise.pptx
A Free 200-Page eBook ~ Brain and Mind Exercise.pptx
 

C2.3

  • 1. Machine Learning Specialization: Assessing Performance. Emily Fox & Carlos Guestrin, University of Washington. ©2015 Emily Fox & Carlos Guestrin
  • 2. Make predictions, get $, right?? Model + algorithm → fitted function f; predictions → decisions → outcome
  • 3. Or, how much am I losing? Example: lost $ due to an inaccurate listing price. Too low → low offers; too high → few lookers + no/low offers. How much am I losing compared to perfection? Perfect predictions: loss = 0. My predictions: loss = ???
  • 4. Measuring loss. Loss function L(y, fŵ(x)) = cost of using ŵ at x when y is the true value (y = actual value, fŵ(x) = predicted value ŷ). Examples (assuming the loss for underpredicting equals the loss for overpredicting): absolute error L(y, fŵ(x)) = |y − fŵ(x)|; squared error L(y, fŵ(x)) = (y − fŵ(x))²
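The two example losses can be sketched directly; a minimal illustration (the function names and dollar figures are mine, not the course's):

```python
def absolute_error(y, y_hat):
    """L(y, fw(x)) = |y - fw(x)|: cost grows linearly with the size of the miss."""
    return abs(y - y_hat)

def squared_error(y, y_hat):
    """L(y, fw(x)) = (y - fw(x))^2: large misses are penalized much more heavily."""
    return (y - y_hat) ** 2

# Hypothetical listing: true sale price $510,000, predicted $500,000.
print(absolute_error(510_000, 500_000))  # 10000
print(squared_error(510_000, 500_000))   # 100000000
```

Both losses are symmetric, matching the slide's assumption that under- and over-predicting cost the same.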
  • 5. "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." George Box, 1987.
  • 6. Assessing the loss
  • 7. Assessing the loss. Part 1: Training error
  • 8. Define training data [plot: price ($) vs. square feet (sq.ft.); x = sq.ft., y = price]
  • 9. Define training data [same plot, with the training observations marked]
  • 10. Example: fit a quadratic to minimize RSS [plot: ŵ minimizes RSS of the training data]
  • 11. Compute training error: 1. Define a loss function L(y, fŵ(x)), e.g., squared error, absolute error, … 2. Training error = avg. loss on houses in the training set = (1/N) Σ_{i=1}^{N} L(yᵢ, fŵ(xᵢ)), with ŵ fit using the training data
  • 12. Example: use squared error loss (y − fŵ(x))². Training error(ŵ) = 1/N · [($train1 − fŵ(sq.ft.train1))² + ($train2 − fŵ(sq.ft.train2))² + ($train3 − fŵ(sq.ft.train3))² + … include all training houses]
  • 13. Training error(ŵ) = (1/N) Σ_{i=1}^{N} (yᵢ − fŵ(xᵢ))²; RMSE(ŵ) = sqrt( (1/N) Σ_{i=1}^{N} (yᵢ − fŵ(xᵢ))² ), which is in the same units as y
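Slides 11–13 in code form; a small sketch assuming squared-error loss (the toy data and the quadratic f are made up for illustration):

```python
import math

def training_error(xs, ys, f):
    """(1/N) * sum over training points of (y_i - f(x_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def rmse(xs, ys, f):
    """Root-mean-squared error: sqrt of the avg. squared error, in the units of y."""
    return math.sqrt(training_error(xs, ys, f))

f = lambda x: 2 * x ** 2                   # a fitted quadratic (stand-in for f_ŵ)
xs, ys = [1.0, 2.0, 3.0], [2.0, 9.0, 18.0]
print(training_error(xs, ys, f))           # (0^2 + 1^2 + 0^2) / 3
print(rmse(xs, ys, f))
```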
  • 14.–18. Training error vs. model complexity [sequence of plots: fits of increasing complexity to the price vs. sq.ft. data; training error decreases as model complexity increases]
  • 19. Is training error a good measure of predictive performance? How do we expect to perform on a new house?
  • 20. Is training error a good measure of predictive performance? Is there something particularly bad about having xₜ square feet???
  • 21. Issue: training error is overly optimistic, because ŵ was fit to the training data. Small training error does not imply good predictions, unless the training data includes everything you might ever see.
  • 22. Assessing the loss. Part 2: Generalization (true) error
  • 23. Generalization error: really want an estimate of the loss over all possible (house, $) pairs. Lots of houses in the neighborhood, but not in the dataset.
  • 24. Distribution over houses: in our neighborhood, houses of what # sq.ft. are we likely to see?
  • 25. Distribution over sales prices: for houses with a given # sq.ft., what house prices $ are we likely to see?
  • 26. Generalization error definition: generalization error = E_{x,y}[L(y, fŵ(x))], i.e., the average over all possible (x, y) pairs, weighted by how likely each is, with ŵ fit using the training data
  • 27.–32. Generalization error vs. model complexity [sequence of plots: generalization error first falls, then rises as complexity grows; but we can't compute it!]
  • 33. Assessing the loss. Part 3: Test error
  • 34. Approximating generalization error: wanted an estimate of the loss over all possible (house, $) pairs. Approximate it by looking at houses not in the training set.
  • 35. Forming a test set: hold out some (house, $) pairs that are not used for fitting the model. [Training set | Test set]
  • 36. Forming a test set: the held-out test set is a proxy for "everything you might see."
  • 37. Compute test error: test error = avg. loss on houses in the test set = (1/N_test) Σ_{i in test set} L(yᵢ, fŵ(xᵢ)), where ŵ was fit using the training data and has never seen the test data!
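A sketch of the whole train/test procedure on synthetic house-like data (all numbers invented); the key point is that the test houses are held out before fitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: price is a noisy linear function of square feet.
x = rng.uniform(500, 4000, size=200)
y = 100.0 * x + 50_000 + rng.normal(0, 20_000, size=200)

# Hold out 40 houses that are never used for fitting.
mask = np.zeros(200, dtype=bool)
mask[rng.choice(200, size=40, replace=False)] = True
x_tr, y_tr, x_te, y_te = x[~mask], y[~mask], x[mask], y[mask]

# Fit a line by least squares on the training houses only.
f = np.poly1d(np.polyfit(x_tr, y_tr, deg=1))

train_error = np.mean((y_tr - f(x_tr)) ** 2)  # avg. loss on training houses
test_error = np.mean((y_te - f(x_te)) ** 2)   # avg. loss on houses f has never seen
print(train_error, test_error)
```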
  • 38. Example: as before, fit a quadratic to the training data (ŵ minimizes RSS of the training data)
  • 39. Example: as before, use squared error loss (y − fŵ(x))². Test error(ŵ) = 1/N_test · [($test1 − fŵ(sq.ft.test1))² + ($test2 − fŵ(sq.ft.test2))² + ($test3 − fŵ(sq.ft.test3))² + … include all test houses]
  • 40. Training, true, & test error vs. model complexity. Overfitting if: training error is small but true (generalization) error is large.
  • 41. Training/test split
  • 42. Training/test splits: how many points in the training set vs. how many in the test set?
  • 43. Training set too small → ŵ poorly estimated
  • 44. Test set too small → test error is a bad approximation of generalization error
  • 45. Typically: just enough test points to form a reasonable estimate of generalization error. If this leaves too few for training, use other methods like cross validation (will see later…)
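The split itself is just a random partition of the data; a minimal pure-Python sketch (the helper name and fraction are my choices):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the data as a test set; returns (train, test)."""
    rnd = random.Random(seed)
    shuffled = data[:]                    # copy so the caller's list is untouched
    rnd.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

houses = list(range(100))                 # stand-ins for (sq.ft., $) pairs
train, test = train_test_split(houses, test_fraction=0.2)
print(len(train), len(test))              # 80 20
```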
  • 46. 3 sources of error + the bias-variance tradeoff
  • 47. In forming predictions, there are 3 sources of error: 1. Noise 2. Bias 3. Variance
  • 48. Data are inherently noisy: yᵢ = f_{w(true)}(xᵢ) + εᵢ. The variance of the noise is irreducible error.
  • 49. Bias contribution: assume we fit a constant function. N house sales vs. N other house sales give two different fits, fŵ(train1) and fŵ(train2).
  • 50. Bias contribution: over all possible size-N training sets, what do I expect my fit to be? [plot: f_{w(true)}, the average fit f_w̄, and specific fits fŵ(train1), fŵ(train2), fŵ(train3)]
  • 51. Bias(x) = f_{w(true)}(x) − f_w̄(x). Is our approach flexible enough to capture f_{w(true)}? If not, there is error in the predictions. Low complexity → high bias.
  • 52.–53. Variance contribution: how much do specific fits vary from the expected fit f_w̄?
  • 54. Variance contribution: can specific fits vary widely? If so, predictions are erratic. Low complexity → low variance.
  • 55. Variance of high-complexity models: assume we fit a high-order polynomial; fits on different training sets differ wildly [plots: fŵ(train1), fŵ(train2)].
  • 56. Variance of high-complexity models [plot: f_w̄ with widely varying fŵ(train1), fŵ(train2), fŵ(train3)]
  • 57. Variance of high-complexity models: high complexity → high variance
  • 58. Bias of high-complexity models: high complexity → low bias [plot: f_w̄ close to f_{w(true)}]
  • 59. Bias-variance tradeoff [plot: as model complexity increases, bias² falls and variance rises; total error is minimized at an intermediate complexity]
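The tradeoff can be seen empirically by refitting on many simulated training sets, as in slides 49–58; a sketch using a sine stand-in for f_{w(true)} and np.polyfit (the degrees, sample sizes, and noise level are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    return np.sin(x)                      # stand-in for f_w(true)

def simulate(degree, n_sets=200, n=20, sigma=0.3, xt=1.5):
    """Fit a degree-`degree` polynomial on many random training sets and
    look at the spread of predictions at a target input xt."""
    preds = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(0, np.pi, size=n)
        y = f_true(x) + rng.normal(0, sigma, size=n)
        preds[s] = np.polyval(np.polyfit(x, y, deg=degree), xt)
    bias = f_true(xt) - preds.mean()      # f_w(true)(xt) - f_w̄(xt)
    var = preds.var()                     # spread of specific fits around the mean fit
    return bias, var

bias_lo, var_lo = simulate(degree=0)      # constant fit: high bias, low variance
bias_hi, var_hi = simulate(degree=9)      # high-order polynomial: low bias, high variance
print(abs(bias_lo), var_lo)
print(abs(bias_hi), var_hi)
```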
  • 60. Error vs. amount of data [plot: as the # of data points in the training set grows, training error rises and true error falls, both approaching the same limit]
  • 61. More in depth on the 3 sources of error… (OPTIONAL)
  • 62. Accounting for training set randomness: the training set was just a random sample of N houses sold. What if N other houses had been sold and recorded? [plots: fŵ(1) and fŵ(2)]
  • 63. Accounting for training set randomness: each training set yields its own fit and its own generalization error (generalization error of ŵ(1), of ŵ(2)).
  • 64. Ideally, want performance averaged over all possible training sets of size N.
  • 65. Expected prediction error = E_{training set}[generalization error of ŵ(training set)], averaging over all training sets (weighted by how likely each is); ŵ(training set) = parameters fit on a specific training set.
  • 66. Prediction error at a target input: start by considering 1. the loss at a target xₜ (e.g., 2640 sq.ft.) and 2. squared error loss L(y, fŵ(x)) = (y − fŵ(x))².
  • 67. Sum of 3 sources of error: average prediction error at xₜ = σ² + [bias(fŵ(xₜ))]² + var(fŵ(xₜ))
  • 68. Error variance of the model: y = f_{w(true)}(x) + ε, with σ² = the "variance" of the noise; this is the irreducible error.
  • 69. Bias of the function estimator [plots: fits fŵ(train1), fŵ(train2) on two training sets]
  • 70. Average estimated function: f_w̄(x) = E_train[fŵ(train)(x)], averaging over all training sets of size N; true function = f_w(x).
  • 71. bias(fŵ(xₜ)) = f_w(xₜ) − f_w̄(xₜ)
  • 72. Bias of the function estimator: average prediction error at xₜ = σ² + [bias(fŵ(xₜ))]² + var(fŵ(xₜ))
  • 73. Variance of the function estimator [plot: f_w̄ with specific fits fŵ(train1), fŵ(train2), fŵ(train3)]
  • 74. Variance of the function estimator [plot: spread of fits around f_w̄ at xₜ]
  • 75. var(fŵ(xₜ)) = E_train[(fŵ(train)(xₜ) − f_w̄(xₜ))²]: the deviation of a specific fit (fit on a specific training dataset) from the expected fit (what I expect to learn over all training sets) at xₜ, averaged over all training sets of size N.
  • 76. Why 3 sources of error? A formal derivation (OPTIONAL)
  • 77. Deriving expected prediction error: expected prediction error = E_train[generalization error of ŵ(train)] = E_train[E_{x,y}[L(y, fŵ(train)(x))]]. 1. Look at a specific xₜ. 2. Consider L(y, fŵ(x)) = (y − fŵ(x))². Then: expected prediction error at xₜ = E_{train,yₜ}[(yₜ − fŵ(train)(xₜ))²]
  • 78. Expected prediction error at xₜ = E_{train,yₜ}[((yₜ − f_{w(true)}(xₜ)) + (f_{w(true)}(xₜ) − fŵ(train)(xₜ)))²]; the cross term has zero expectation since E[yₜ − f_{w(true)}(xₜ)] = E[ε] = 0, leaving σ² + MSE[fŵ(train)(xₜ)].
  • 79. Equating MSE with bias and variance: MSE[fŵ(train)(xₜ)] = E_train[(f_{w(true)}(xₜ) − fŵ(train)(xₜ))²] = E_train[((f_{w(true)}(xₜ) − f_w̄(xₜ)) + (f_w̄(xₜ) − fŵ(train)(xₜ)))²]; the cross term again vanishes in expectation, leaving [bias(fŵ(xₜ))]² + var(fŵ(xₜ)).
  • 80. Putting it all together: expected prediction error at xₜ = σ² + MSE[fŵ(xₜ)] = σ² + [bias(fŵ(xₜ))]² + var(fŵ(xₜ)): the 3 sources of error
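The decomposition on slide 80 can be checked by Monte Carlo with a deliberately-too-simple model (a constant fit to linear data; every number here is invented for the check):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, n_sets, xt = 0.3, 30, 2000, 1.8

def f_true(x):
    return 1.0 + 2.0 * x                  # stand-in true function f_w(true)

preds = np.empty(n_sets)                  # f_ŵ(xt) for each simulated training set
sq_err = np.empty(n_sets)                 # (yt - f_ŵ(xt))^2 for a fresh yt each time
for s in range(n_sets):
    x = rng.uniform(0, 2, size=n)
    y = f_true(x) + rng.normal(0, sigma, size=n)
    preds[s] = np.polyval(np.polyfit(x, y, deg=0), xt)   # constant fit = mean of y
    yt = f_true(xt) + rng.normal(0, sigma)               # fresh observation at xt
    sq_err[s] = (yt - preds[s]) ** 2

lhs = sq_err.mean()                                      # expected prediction error at xt
bias = f_true(xt) - preds.mean()
rhs = sigma**2 + bias**2 + preds.var()                   # sigma^2 + bias^2 + variance
print(lhs, rhs)                                          # the two should nearly agree
```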
  • 81. Summary of tasks
  • 82. The regression/ML workflow: 1. Model selection: often need to choose tuning parameters λ controlling model complexity (e.g., degree of polynomial). 2. Model assessment: having selected a model, assess its generalization error.
  • 83. Hypothetical implementation: 1. Model selection: for each considered model complexity λ: i. estimate parameters ŵλ on training data; ii. assess performance of ŵλ on test data; iii. choose λ* to be the λ with lowest test error. 2. Model assessment: compute the test error of ŵλ* (the fitted model for the selected complexity λ*) to approximate generalization error.
  • 84. Hypothetical implementation (continued): the resulting estimate is overly optimistic!
  • 85. Issue: just like fitting ŵ and assessing its performance both on training data. λ* was selected to minimize test error (i.e., λ* was fit on the test data). If the test data are not representative of the whole world, then ŵλ* will typically perform worse than the test error indicates.
  • 86. Practical implementation. Solution: create two "test" sets, splitting the data into training, validation, and test sets. 1. Select λ* such that ŵλ* minimizes error on the validation set. 2. Approximate the generalization error of ŵλ* using the test set.
  • 87. Practical implementation: training set → fit ŵλ; validation set → test performance of ŵλ to select λ*; test set → assess generalization error of ŵλ*.
  • 88. Typical splits: training / validation / test = 80% / 10% / 10%, or 50% / 25% / 25%.
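Slides 86–88 as a runnable sketch: an 80/10/10 split where the validation set picks the polynomial degree (my stand-in for λ) and the test set is touched only once, at the end (all data synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data and an 80% / 10% / 10% split as on slide 88.
n = 300
x = rng.uniform(0, 3, size=n)
y = np.sin(x) + rng.normal(0, 0.2, size=n)
idx = rng.permutation(n)
tr, va, te = idx[:240], idx[240:270], idx[270:]

def fit_and_score(degree, fit_idx, score_idx):
    """Fit a polynomial on one subset, return avg. squared error on another."""
    w = np.polyfit(x[fit_idx], y[fit_idx], deg=degree)
    pred = np.polyval(w, x[score_idx])
    return np.mean((y[score_idx] - pred) ** 2)

# 1. Model selection: choose the degree λ* with lowest validation error.
val_errs = {d: fit_and_score(d, tr, va) for d in range(0, 9)}
best = min(val_errs, key=val_errs.get)

# 2. Model assessment: report the test error of the selected model only.
test_err = fit_and_score(best, tr, te)
print(best, test_err)
```

Reusing the validation error of the winner as the final performance number would repeat the mistake of slide 85, which is why the test set enters only in step 2.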
  • 89. Summary of assessing performance
  • 90. What you can do now… • Describe what a loss function is and give examples • Contrast training, generalization, and test error • Compute training and test error given a loss function • Discuss the issue of assessing performance on the training set • Describe tradeoffs in forming training/test splits • List and interpret the 3 sources of avg. prediction error: irreducible error, bias, and variance • Discuss the issue of selecting model complexity on test data and then using test error to assess generalization error • Motivate the use of a validation set for selecting tuning parameters (e.g., model complexity) • Describe the overall regression workflow