What is Statistical Learning?
[Figure: three scatterplots of Sales (5 to 25) against TV (0 to 300), Radio (0 to 50), and Newspaper (0 to 100) advertising budgets.]
Shown are Sales vs TV, Radio and Newspaper, with a blue
linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model
Sales ≈ f(TV, Radio, Newspaper)
1 / 30
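As a concrete sketch (not part of the slides), the snippet below fits the linear approximation Sales ≈ β0 + β1·TV + β2·Radio + β3·Newspaper by least squares in Python. The file name Advertising.csv and its column names are assumptions about how the data might be stored.

# Minimal sketch: least-squares fit of Sales on TV, Radio, Newspaper.
# The file name "Advertising.csv" and the column names TV, radio, newspaper,
# sales are assumptions, not taken from the slides.
import numpy as np
import pandas as pd

ads = pd.read_csv("Advertising.csv")
X = np.column_stack([np.ones(len(ads)),           # intercept column
                     ads["TV"], ads["radio"], ads["newspaper"]])
y = ads["sales"].to_numpy()

beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary least squares
print("intercept and coefficients:", beta)

# Predicted sales for a hypothetical budget (TV=100, radio=20, newspaper=10).
x_new = np.array([1.0, 100.0, 20.0, 10.0])
print("predicted sales:", x_new @ beta)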
Notation
Here Sales is a response or target that we wish to predict. We
generically refer to the response as Y .
TV is a feature, or input, or predictor; we name it X1.
Likewise, we name Radio X2, and so on.
We can refer to the input vector collectively as
X = (X1, X2, X3).
Now we write our model as
Y = f(X) + ε
where ε captures measurement errors and other discrepancies.
2 / 30
What is f(X) good for?
• With a good f we can make predictions of Y at new points
X = x.
• We can understand which components of
X = (X1, X2, . . . , Xp) are important in explaining Y , and
which are irrelevant. e.g. Seniority and Years of
Education have a big impact on Income, but Marital
Status typically does not.
• Depending on the complexity of f, we may be able to
understand how each component Xj of X affects Y .
3 / 30
[Figure: scatterplot of simulated (x, y) data, with x running from about 1 to 7 and y from about −2 to 6, used to motivate the value of f(X) at X = 4.]
Is there an ideal f(X)? In particular, what is a good value for
f(X) at any selected value of X, say X = 4? There can be
many Y values at X = 4. A good value is
f(4) = E(Y |X = 4)
E(Y |X = 4) means expected value (average) of Y given X = 4.
This ideal f(x) = E(Y |X = x) is called the regression function.
4 / 30
The regression function f(x)
• Is also defined for vector X; e.g.
f(x) = f(x1, x2, x3) = E(Y |X1 = x1, X2 = x2, X3 = x3)
• Is the ideal or optimal predictor of Y with regard to
mean-squared prediction error: f(x) = E(Y |X = x) is the
function that minimizes E[(Y − g(X))2|X = x] over all
functions g at all points X = x.
• ε = Y − f(x) is the irreducible error — i.e. even if we knew
f(x), we would still make errors in prediction, since at each
X = x there is typically a distribution of possible Y values.
• For any estimate f̂(x) of f(x), we have
E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ε),
where the first term is the reducible error and Var(ε) is the irreducible error.
5 / 30
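A small simulation (all settings invented for illustration) makes the decomposition concrete: at a fixed point x0, the Monte Carlo estimate of E[(Y − f̂(X))² | X = x0] matches the reducible term [f(x0) − f̂(x0)]² plus Var(ε).

# Sketch of the reducible/irreducible decomposition at one point x0.
# The true f, the noise level sigma, and the (deliberately imperfect)
# estimate fhat are all assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) + 0.5 * x          # assumed true regression function
sigma = 0.5                                # standard deviation of epsilon
fhat = lambda x: 0.6 * x                   # some fixed estimate of f

x0 = 4.0
y = f(x0) + sigma * rng.normal(size=1_000_000)    # many draws of Y at X = x0

mse_at_x0 = np.mean((y - fhat(x0)) ** 2)
reducible = (f(x0) - fhat(x0)) ** 2
irreducible = sigma ** 2
print(mse_at_x0, reducible + irreducible)          # the two numbers agree closely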
How to estimate f
• Typically we have few if any data points with X = 4
exactly.
• So we cannot compute E(Y |X = x)!
• Relax the definition and let
f̂(x) = Ave(Y | X ∈ N(x))
where N(x) is some neighborhood of x.
[Figure: scatterplot of simulated (x, y) data (x roughly 1 to 6, y roughly −2 to 3), illustrating the neighborhood average at a target point.]
6 / 30
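A minimal sketch of the relaxed estimate f̂(x) = Ave(Y | X ∈ N(x)), taking N(x) to be the k nearest observed xi; the simulated data and the choice k = 10 are assumptions for illustration.

# Nearest-neighbor averaging: fhat(x0) = average of y_i over the k nearest x_i.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(1, 6, size=n)                      # simulated inputs (assumed)
y = np.sin(x) + rng.normal(scale=0.3, size=n)      # simulated responses (assumed)

def fhat(x0, k=10):
    idx = np.argsort(np.abs(x - x0))[:k]           # indices of the k nearest neighbors
    return y[idx].mean()                           # average their responses

print(fhat(4.0))                                   # rough estimate of E(Y | X = 4)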
• Nearest neighbor averaging can be pretty good for small p
— i.e. p ≤ 4 and large-ish N.
• We will discuss smoother versions, such as kernel and
spline smoothing later in the course.
• Nearest neighbor methods can be lousy when p is large.
Reason: the curse of dimensionality. Nearest neighbors
tend to be far away in high dimensions.
• We need to get a reasonable fraction of the N values of yi
to average to bring the variance down—e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be
local, so we lose the spirit of estimating E(Y |X = x) by
local averaging.
7 / 30
The curse of dimensionality
[Figure: left, a 10% nearest-neighborhood of a target point in two dimensions (x1, x2); right, the radius needed to capture a given fraction of the volume, plotted for p = 1, 2, 3, 5, 10.]
8 / 30
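The right-hand panel can be approximated with a one-line calculation: if the data are uniform on [−1, 1]^p and the neighborhood is a cube of half-width r around the target, it captures a fraction r^p of the data, so the radius needed for a given fraction is fraction^(1/p). This cube version is a rough stand-in for the figure, not a reconstruction of it.

# Radius of a cubical neighborhood needed to capture a fraction of uniform data
# on [-1, 1]^p (a rough stand-in for the right-hand panel of the slide).
import numpy as np

fractions = np.array([0.01, 0.05, 0.10, 0.25, 0.50])
for p in (1, 2, 3, 5, 10):
    radius = fractions ** (1.0 / p)     # fraction = r**p  =>  r = fraction**(1/p)
    print(f"p={p:2d}", np.round(radius, 2))
# For p = 10, capturing 10% of the data already needs r of about 0.79 on a
# [-1, 1] scale: the "neighborhood" is no longer local.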
Parametric and structured models
The linear model is an important example of a parametric
model:
fL(X) = β0 + β1X1 + β2X2 + · · · + βpXp.
• A linear model is specified in terms of p + 1 parameters
β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to
training data.
• Although it is almost never correct, a linear model often
serves as a good and interpretable approximation to the
unknown true function f(X).
9 / 30
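To make "estimating the p + 1 parameters from training data" concrete, the sketch below solves the least-squares problem through the normal equations on simulated training data; the data-generating model is an assumption for illustration.

# Estimating beta_0, ..., beta_p of the linear model by least squares.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))                       # simulated predictors (assumed)
beta_true = np.array([1.0, 2.0, -1.0, 0.5])       # intercept plus p slopes (assumed)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=n)

Xd = np.column_stack([np.ones(n), X])             # design matrix with intercept
beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)   # normal equations (X'X) beta = X'y
print(beta_hat)                                   # close to beta_true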
A linear model f̂L(X) = β̂0 + β̂1X gives a reasonable fit here.
[Figure: the simulated (x, y) data with the fitted straight line.]
A quadratic model f̂Q(X) = β̂0 + β̂1X + β̂2X² fits slightly better.
[Figure: the same data with the fitted quadratic curve.]
10 / 30
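The two fits on this slide differ only in whether an X² column is added to the model; a quick way to compare them is a polynomial least-squares fit, sketched below on simulated data (the data-generating curve is an assumption).

# Comparing linear and quadratic fits by their training residual sum of squares.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 6, size=70)
y = -1.5 + 1.2 * x - 0.1 * x**2 + rng.normal(scale=0.4, size=x.size)  # assumed truth

for degree in (1, 2):
    coef = np.polyfit(x, y, deg=degree)           # least-squares polynomial fit
    resid = y - np.polyval(coef, x)
    print(degree, np.sum(resid ** 2))             # the quadratic has the smaller RSS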
[Figure: simulated Income data plotted against Years of Education and Seniority.]
Simulated example. Red points are simulated values for income
from the model
income = f(education, seniority) + ε
f is the blue surface.
11 / 30
[Figure: linear regression surface fit to the Income data, plotted against Years of Education and Seniority.]
Linear regression model fit to the simulated data.
f̂L(education, seniority) = β̂0 + β̂1 × education + β̂2 × seniority
12 / 30
[Figure: thin-plate spline surface fit to the Income data.]
More flexible regression model f̂S(education, seniority) fit to
the simulated data. Here we use a technique called a thin-plate
spline to fit a flexible surface. We control the roughness of the
fit (chapter 7).
13 / 30
[Figure: very rough spline surface that passes through every training point.]
Even more flexible spline regression model
f̂S(education, seniority) fit to the simulated data. Here the
fitted model makes no errors on the training data! Also known
as overfitting.
14 / 30
Some trade-offs
• Prediction accuracy versus interpretability.
— Linear models are easy to interpret; thin-plate splines
are not.
• Good fit versus over-fit or under-fit.
— How do we know when the fit is just right?
• Parsimony versus black-box.
— We often prefer a simpler model involving fewer
variables over a black-box predictor involving them all.
15 / 30
[Figure 2.7. Statistical learning methods arranged from low flexibility and high interpretability to high flexibility and low interpretability: subset selection and the lasso; least squares; generalized additive models and trees; bagging, boosting, and support vector machines. In general, as the flexibility of a method increases, its interpretability decreases.]
16 / 30
Assessing Model Accuracy
Suppose we fit a model f̂(x) to some training data
Tr = {xi, yi}, i = 1, . . . , N, and we wish to see how well it performs.
• We could compute the average squared prediction error
over Tr:
MSETr = Ave_{i∈Tr} [yi − f̂(xi)]²
This may be biased toward more overfit models.
• Instead we should, if possible, compute it using fresh test
data Te = {xi, yi}, i = 1, . . . , M:
MSETe = Ave_{i∈Te} [yi − f̂(xi)]²
17 / 30
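A minimal sketch of the two quantities: fit models of increasing flexibility on a training set and evaluate MSE on both the training data and fresh test data. Here polynomial degree stands in for "flexibility", and the true f and noise level are assumptions.

# Training versus test mean squared error as flexibility (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(4)
def truth(x): return np.sin(1.5 * x)               # assumed true f

x_tr = rng.uniform(0, 5, 50);  y_tr = truth(x_tr) + rng.normal(0, 0.3, 50)
x_te = rng.uniform(0, 5, 200); y_te = truth(x_te) + rng.normal(0, 0.3, 200)

for degree in (1, 3, 10):
    coef = np.polyfit(x_tr, y_tr, deg=degree)
    mse_tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(degree, round(mse_tr, 3), round(mse_te, 3))
# MSE_Tr keeps falling as the degree grows; MSE_Te typically bottoms out and rises.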
[Figure 2.9. Left: data simulated from f, shown in black, with three estimates of f: the linear regression line (orange curve) and two smoothing-spline fits (blue and green curves). Right: training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line), plotted against flexibility; squares represent the training and test MSEs for the three fits shown in the left-hand panel.]
Black curve is truth. Red curve on right is MSETe, grey curve is
MSETr. Orange, blue and green curves/squares correspond to fits of
different flexibility.
18 / 30
[Figure 2.10: same layout as Figure 2.9, for the second simulated data set.]
FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is
much closer to linear. In this setting, linear regression provides a very good fit to
the data.
Here the truth is smoother, so the smoother fit and linear model do
really well.
19 / 30
[Figure 2.11: same layout as Figure 2.9, for the third simulated data set.]
FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from
linear. In this setting, linear regression provides a very poor fit to the data.
Here the truth is wiggly and the noise is low, so the more flexible fits
do the best.
20 / 30
Bias-Variance Trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and
let (x0, y0) be a test observation drawn from the population. If
the true model is Y = f(X) + ε (with f(x) = E(Y |X = x)),
then
E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε).
The expectation averages over the variability of y0 as well as
the variability in Tr. Note that Bias(f̂(x0)) = E[f̂(x0)] − f(x0).
Typically as the flexibility of f̂ increases, its variance increases,
and its bias decreases. So choosing the flexibility based on
average test error amounts to a bias-variance trade-off.
21 / 30
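The decomposition can be checked numerically by drawing many training sets, refitting f̂ each time, and averaging. All simulation settings below (true f, noise level, sample size, flexibility) are assumptions for illustration.

# Monte Carlo check of E[(y0 - fhat(x0))^2] = Var(fhat(x0)) + Bias^2 + Var(eps).
import numpy as np

rng = np.random.default_rng(5)
def f(x): return np.sin(1.5 * x)                  # assumed true regression function
sigma, n, x0, degree = 0.3, 40, 2.5, 3            # noise sd, training size, test point, flexibility

preds = []
for _ in range(2000):                             # many independent training sets Tr
    x = rng.uniform(0, 5, n)
    y = f(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, deg=degree)
    preds.append(np.polyval(coef, x0))            # fhat(x0) for this training set
preds = np.array(preds)

variance = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2
y0 = f(x0) + rng.normal(0, sigma, preds.size)     # independent test responses at x0
expected_sq_err = np.mean((y0 - preds) ** 2)
print(expected_sq_err, variance + bias_sq + sigma ** 2)   # the two agree closely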
Bias-variance trade-off for the three examples
[Figure: for each of the three examples, squared bias, variance, and test MSE plotted against flexibility (curves labelled MSE, Bias, Var).]
FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ǫ)
(dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11.
The vertical dashed line indicates the flexibility level corresponding to the smallest
test MSE.
22 / 30
Classification Problems
Here the response variable Y is qualitative — e.g. email is one
of C = (spam, ham) (ham=good email), digit class is one of
C = {0, 1, . . . , 9}. Our goals are to:
• Build a classifier C(X) that assigns a class label from C to
a future unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among
X = (X1, X2, . . . , Xp).
23 / 30
[Figure: simulated two-class data shown as tick marks along x (roughly 1 to 7), with the conditional class probabilities (0 to 1) on the vertical axis and a small bar plot at x = 5.]
Is there an ideal C(X)? Suppose the K elements in C are
numbered 1, 2, . . . , K. Let
pk(x) = Pr(Y = k|X = x), k = 1, 2, . . . , K.
These are the conditional class probabilities at x; e.g. see little
barplot at x = 5. Then the Bayes optimal classifier at x is
C(x) = j if pj(x) = max{p1(x), p2(x), . . . , pK(x)}
24 / 30
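Given the true conditional class probabilities, the Bayes classifier is just an argmax over classes. The two-class p_k(x) below is invented for illustration.

# Bayes classifier C(x) = argmax_k p_k(x), sketched for two classes.
import numpy as np

def p2(x):                                        # assumed Pr(Y = class 2 | X = x)
    return 1.0 / (1.0 + np.exp(-(x - 4.0)))

def bayes_classifier(x):
    probs = np.array([1.0 - p2(x), p2(x)])        # conditional probabilities of classes 1 and 2
    return probs.argmax() + 1                     # class label with the largest probability

print(bayes_classifier(3.0), bayes_classifier(5.0))   # prints 1, then 2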
[Figure: the same style of plot with nearest-neighbor estimates of the conditional class probabilities (x roughly 2 to 6).]
Nearest-neighbor averaging can be used as before.
Also breaks down as dimension grows. However, the impact on
Ĉ(x) is less than on p̂k(x), k = 1, . . . , K.
25 / 30
Classification: some details
• Typically we measure the performance of Ĉ(x) using the
misclassification error rate:
ErrTe = Ave_{i∈Te} I[yi ≠ Ĉ(xi)]
• The Bayes classifier (using the true pk(x)) has smallest
error (in the population).
• Support-vector machines build structured models for C(x).
• We will also build structured models for representing the
pk(x). e.g. Logistic regression, generalized additive models.
26 / 30
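Computing ErrTe is a one-line average once predictions are in hand; the labels and predictions below are made-up placeholders.

# Test misclassification error rate: ErrTe = Ave I[y_i != Chat(x_i)].
import numpy as np

y_test = np.array([0, 1, 1, 0, 1, 0, 0, 1])       # made-up true test labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])       # made-up classifier output Chat(x_i)

err_te = np.mean(y_test != y_pred)                # fraction of misclassified test points
print(err_te)                                     # 0.25 for these placeholders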
Example: K-nearest neighbors in two dimensions
[Figure: simulated two-class training data plotted against X1 and X2.]
27 / 30
[Figure: the KNN decision boundary with K = 10, plotted against X1 and X2.]
28 / 30
FIGURE 2.15. The black curve indicates the KNN decision boundary on the
data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as
a purple dashed line. The KNN and Bayes decision boundaries are very similar.
[Figure: KNN decision boundaries for K = 1 (left) and K = 100 (right).]
FIGURE 2.16. A comparison of the KNN decision boundaries (solid black
curves) obtained using K = 1 and K = 100 on the data from Figure 2.13. With
K = 1, the decision boundary is overly flexible, while with K = 100 it is not
sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line.
29 / 30
[Figure: training and test error rates for KNN plotted against 1/K (from 0.01 to 1.00), with error rates between 0.00 and 0.20.]
30 / 30
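A compact KNN classifier written from scratch and evaluated for several K reproduces the qualitative shape of the final plot: training error falls to 0 as K approaches 1, while test error is roughly U-shaped in flexibility. All simulation settings are assumptions.

# K-nearest-neighbors classification; training and test error for several K.
import numpy as np

rng = np.random.default_rng(6)

def make_data(n):
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.7, size=n) > 0).astype(int)  # assumed truth
    return X, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(500)

def knn_predict(X_new, K):
    dists = np.linalg.norm(X_new[:, None, :] - X_tr[None, :, :], axis=2)  # pairwise distances
    nearest = np.argsort(dists, axis=1)[:, :K]        # indices of the K nearest training points
    votes = y_tr[nearest].mean(axis=1)                # fraction of neighbors in class 1
    return (votes > 0.5).astype(int)                  # majority vote

for K in (1, 10, 100):
    err_tr = np.mean(knn_predict(X_tr, K) != y_tr)
    err_te = np.mean(knn_predict(X_te, K) != y_te)
    print(K, round(err_tr, 3), round(err_te, 3))      # training error is 0 when K = 1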