The Elements of Statistical Learning
Ch.2: Overview of Supervised Learning
4/13/2017 坂間 毅
• Supervised Learning
• Predict outputs from inputs
• Other names for inputs
• Predictors
• Independent variables
• Features
• Other names for outputs
• Responses
• Dependent variables
2.1 Introduction
• Outputs
1. Quantitative variable
• Continuous values, e.g. atmospheric measurements
• Quantitative prediction = Regression
2. Qualitative variable
• Also called categorical or discrete variables
• Values from a finite set, e.g. iris species
• Qualitative prediction = Classification
• Types of inputs
1. Quantitative variable
2. Qualitative variable
3. Ordered categorical variable (e.g. small, medium, large)
※ Interval and ratio scales both appear to be lumped together as quantitative variables?
2.2 Variable Types and Terminology
• Notation
• Input
• Vector: 𝑋
• Component of vector: 𝑋_j
• i-th observation: 𝑥_i (lowercase)
• Matrix: 𝐗 (bold)
• All the observations on the j-th variable: 𝐱_j (bold)
• Output
• Quantitative output: 𝑌
• Prediction of 𝑌: Ŷ
• Qualitative output: 𝐺
• Prediction of 𝐺: Ĝ
2.2 Variable Types and Terminology (contd.)
• Linear Model
• With the bias (intercept) term absorbed into the coefficient vector, Ŷ = X^T β̂
• Most popular fitting method: least squares
• RSS(β) = (𝐲 − 𝐗β)^T (𝐲 − 𝐗β)
(RSS: residual sum of squares)
• Differentiating RSS w.r.t. β and setting it to 0:
• 𝐗^T (𝐲 − 𝐗β) = 0
• If 𝐗^T 𝐗 is nonsingular, the inverse exists and
• β̂ = (𝐗^T 𝐗)^{-1} 𝐗^T 𝐲  (a NumPy sketch follows this slide)
2.3.1 Linear Models and Least Squares
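A minimal NumPy sketch of the closed-form least-squares fit above, β̂ = (X^T X)^{-1} X^T y; the synthetic data, the noise level, and all variable names are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3

# Design matrix with a leading column of ones so the intercept is part of beta.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Closed-form least-squares solution beta_hat = (X^T X)^{-1} X^T y,
# computed with a linear solve rather than an explicit matrix inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares at the minimizer.
rss = float(np.sum((y - X @ beta_hat) ** 2))
print(beta_hat, rss)
```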
• Linear Model (Classification)
• Ĝ = ORANGE if Ŷ > 0.5, BLUE if Ŷ ≤ 0.5
• The two classes are separated by the decision boundary
• {x : x^T β̂ = 0.5}
• Two cases for generating 2-class data
1. Each class is generated from a bivariate Gaussian with uncorrelated components and its own mean
⇒ a linear decision boundary is optimal (see Ch.4)
2. Each class is generated from a mixture of 10 low-variance Gaussians whose means are themselves drawn from a Gaussian
⇒ a nonlinear decision boundary is optimal (the example in this chapter is of this kind)
2.3.1 Linear Models and Least Squares (contd.)
• k-Nearest Neighbor
• Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i  (a sketch follows this slide)
where N_k(x) is the set of the k closest points to x (in Euclidean distance) in the training set
• 𝑘 = 1: Voronoi tessellation
• Notice
• Effective number of parameters of k-NN = N/k
• “we will see” why later in the book
• Training RSS is useless for choosing k
• With k = 1 the training data are fit with zero error, so k = 1 would always minimize training RSS
2.3.2 Nearest-Neighbor Methods
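A minimal sketch of the k-nearest-neighbor average Ŷ(x) above; the toy data, the choice of k, and the function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(200, 2))                        # training inputs
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=200)   # noisy responses

def knn_predict(x0, X, y, k=15):
    """Average the responses of the k Euclidean-nearest training points to x0."""
    dist = np.linalg.norm(X - x0, axis=1)   # distances to every training point
    idx = np.argsort(dist)[:k]              # indices of N_k(x0)
    return y[idx].mean()                    # (1/k) * sum of y_i over the neighborhood

print(knn_predict(np.array([0.2, -0.3]), X_train, y_train, k=15))
```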
• Today’s popular techniques are variants of Linear model
or k-Nearest Neighbor (or both)
2.3.3 From Least Squares to Nearest Neighbors
                      Variance   Bias
Linear Model          low        high
k-Nearest Neighbors   high       low
• Theoretical Framework
• Joint distribution Pr 𝑋, 𝑌
• Squared error loss function: L(Y, f(X)) = (Y − f(X))²
• Expected (squared) prediction error
• EPE(f) = E[(Y − f(X))²]
         = ∫ [y − f(x)]² Pr(dx, dy)
         = ∫∫ [y − f(x)]² Pr(x, y) dy dx
         = ∫∫ [y − f(x)]² Pr(y|x) Pr(x) dy dx   (since Pr(X, Y) = Pr(Y|X) Pr(X))
         = ∫ E_{Y|X}[(Y − f(X))² | X = x] Pr(x) dx
         = E_X E_{Y|X}[(Y − f(X))² | X]
2.4 Statistical Decision Theory
• The minimizing f is the regression function
• The best prediction of Y at any point X = x is the conditional mean, when best is
measured by average squared error.
• f(x) = argmin_c E_{Y|X}[(Y − c)² | X = x]
⇒ ∂/∂f E_{Y|X}[(Y − f(X))² | X = x] = 0
⇒ ∂/∂f ∫ [y − f(x)]² Pr(y|x) dy = 0
⇒ ∫ [−2y + 2f(x)] Pr(y|x) dy = 0
⇒ 2 f(x) ∫ Pr(y|x) dy = 2 ∫ y Pr(y|x) dy
⇒ f(x) = E(Y|X = x)
2.4 Statistical Decision Theory (contd.)
• How to estimate the conditional mean E(𝑌|𝑋 = 𝑥)
• k-Nearest Neighbor
• f̂(x) = Ave(y_i | x_i ∈ N_k(x))
• Two approximations: expectation is replaced by averaging, and conditioning at a point is relaxed to conditioning on the neighborhood N_k(x)
• Under mild regularity conditions on Pr(X, Y),
• if N, k → ∞ with k/N → 0, then f̂(x) → E(Y|X = x)
• However, the curse of dimensionality becomes severe
2.4 Statistical Decision Theory (contd.)
• How to estimate the conditional mean E(𝑌|𝑋 = 𝑥)
• Linear Regression
• Assume f(x) ≈ x^T β (or is it f(x) = x^T β?)
• Then,
• ∂EPE/∂β = ∂/∂β ∫∫ (y − x^T β)² Pr(x, y) dx dy
          = ∫∫ 2 (y − x^T β)(−x) Pr(x, y) dx dy
          = −2 ∫∫ (y − x^T β) x Pr(x, y) dx dy
          = −2 ∫∫ (yx − x x^T β) Pr(x, y) dx dy
⇒ ∫∫ yx Pr(x, y) dx dy = ∫∫ x x^T β Pr(x, y) dx dy
⇒ β = [E(X X^T)]^{-1} E(XY)
• This is not conditioned on X.
• With the L1 loss function instead,
• EPE(f) = E|Y − f(X)|
• f̂(x) = median(Y|X = x)
2.4 Statistical Decision Theory (contd.)
• In classification
• The zero-one loss function L is represented by a K × K matrix 𝐋:
• 𝐋 =  ⎡ 0      ⋯   δ_1K ⎤
       ⎢ δ_21   ⋱   δ_2K ⎥
       ⎢ ⋮           ⋮   ⎥
       ⎣ δ_K1   ⋯   0    ⎦
where δ_ij ∈ {0, 1} and K = card(ℊ)
• The Expected prediction error:
• EPE(Ĝ) = E[L(G, Ĝ(X))]
         = E_X Σ_{k=1}^{K} L(ℊ_k, Ĝ(X)) Pr(ℊ_k | X)
2.4 Statistical Decision Theory (contd.)
• In classification
• The minimizing Ĝ (pointwise, at X = x) is the Bayes classifier (see the sketch after this slide).
• Ĝ(x) = argmin_{g ∈ ℊ} Σ_{k=1}^{K} L(ℊ_k, g) Pr(ℊ_k | X = x)
       = argmin_{g ∈ ℊ} [1 − Pr(g | X = x)]
       = ℊ_k  if  Pr(ℊ_k | X = x) = max_{g ∈ ℊ} Pr(g | X = x)
• This classifies to the most probable class, using the
conditional distribution Pr(𝐺|𝑋).
• Many approaches to modeling Pr 𝐺 𝑋 are discussed in Ch.4.
2.4 Statistical Decision Theory (contd.)
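A minimal sketch of the Bayes classifier: classify to the class with the largest posterior Pr(ℊ_k | X = x). The two-class Gaussian setup, the priors, and the means are made-up assumptions so that the posterior can be computed exactly.

```python
import numpy as np
from scipy.stats import multivariate_normal

priors = np.array([0.5, 0.5])                         # Pr(g_k)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # class-conditional means
cov = np.eye(2)                                       # shared covariance

def bayes_classify(x):
    """Return the index of the most probable class, argmax_k Pr(g_k | X = x)."""
    likelihoods = np.array([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])
    posteriors = priors * likelihoods      # proportional to Pr(g_k | X = x)
    return int(np.argmax(posteriors))

print(bayes_classify(np.array([1.5, 0.5])))
```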
• The curse of dimensionality
1. To capture 10% of the data in a hypercube neighborhood, the expected edge length needed in 10 dimensions is e_10(0.1) = 0.1^{1/10} ≈ 0.80, i.e. 80% of the range of each input variable
2. Consider a nearest-neighbor estimate at the origin, with N data points uniformly distributed in the p-dimensional unit ball
• The median distance from the origin to the closest data point is
• d(p, N) = (1 − (1/2)^{1/N})^{1/p}   (evaluated in the sketch after this slide)
• If N = 500 and p = 10, then d(p, N) ≈ 0.52
• so the closest point is typically more than halfway to the boundary; most data points are closer to the boundary of the sample space than to the target point
2.5 Local Methods in High Dimensions
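A small sketch that evaluates the median-distance formula d(p, N) above and checks it by simulating points uniformly in the p-dimensional unit ball; the simulation settings are assumptions for illustration.

```python
import numpy as np

def median_nn_distance(p, N):
    """d(p, N) = (1 - (1/2)^(1/N))^(1/p): median distance from the origin to the closest point."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

def simulate_median(p, N, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    closest = np.empty(reps)
    for r in range(reps):
        # Uniform sample in the unit ball: random direction times radius U^(1/p).
        g = rng.normal(size=(N, p))
        direction = g / np.linalg.norm(g, axis=1, keepdims=True)
        radius = rng.uniform(size=(N, 1)) ** (1.0 / p)
        points = direction * radius
        closest[r] = np.linalg.norm(points, axis=1).min()
    return float(np.median(closest))

print(median_nn_distance(10, 500))   # ~0.52 for N = 500, p = 10
print(simulate_median(10, 500))      # should roughly agree with the formula
```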
• The curse of dimensionality
3. The sampling density is proportional to N^{1/p}
• If N_1 = 100 is a dense sample in 1 dimension, N_10 = 100^{10} points are needed for the same density in 10 dimensions
• Data are sparse in high dimensions
4. Examples x_i drawn uniformly from [−1, 1]^p
• Assume Y = f(X) = e^{−8‖X‖²}
• Use the 1-nearest-neighbor rule to estimate f at x_0 = 0
• The estimate f̂(x_0) equals f at the nearest training point, which is below f(x_0) = 1 whenever that point is not exactly at the origin, so the estimate is biased downward
• As the dimension increases, the nearest neighbor moves further from the target point (see the simulation after this slide)
2.5 Local Methods in High Dimensions (contd.)
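A minimal simulation of the 1-nearest-neighbor example above: x_i uniform on [−1, 1]^p, f(x) = exp(−8‖x‖²), and the estimate at x_0 = 0 taken from the nearest training point. N, the number of repetitions, and the dimensions tried are assumptions.

```python
import numpy as np

def one_nn_estimate_at_origin(p, N=1000, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1, 1, size=(N, p))
        norms_sq = np.sum(X ** 2, axis=1)
        nearest = np.argmin(norms_sq)                      # training point closest to 0
        estimates[r] = np.exp(-8.0 * norms_sq[nearest])    # 1-NN estimate of f(0) = 1
    return float(estimates.mean())

# The average estimate drifts toward 0 as p grows: the nearest neighbor drifts away from 0.
for p in (1, 2, 5, 10):
    print(p, one_nn_estimate_at_origin(p))
```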
• The curse of dimensionality
5. In the linear model Y = X^T β + ε, ε ~ N(0, σ²)
• For an arbitrary test point x_0,
• EPE(x_0) = E_{y_0|x_0} E_T (y_0 − ŷ_0)²
           = σ² + E_T [x_0^T (𝐗^T 𝐗)^{-1} x_0] σ² + 0²
• If N is large, T is selected at random, and E(X) = 0, then
E_{x_0} EPE(x_0) ≈ σ² (p/N) + σ²
• EPE grows linearly in p with slope σ²/N, so if N is large and/or σ² is small this growth is negligible
⇒ By restricting to (correctly specified) linear models we can avoid the curse of dimensionality in this setting.
2.5 Local Methods in High Dimensions (contd.)
• Additive model
• 𝑌 = 𝑓 𝑋 + 𝜀
• Deterministic: 𝑓 𝑥 = E(𝑌|𝑋 = 𝑥)
• Anything non-deterministic goes to the random error 𝜀
• E 𝜀 = 0
• 𝜀 is independent of 𝑋
• The additive error model is typically not used for qualitative outputs (classification)
• There the target function is p(X) = Pr(G|X), the conditional probabilities themselves
2.6.1 A Statistical Model for the Joint Distribution Pr(𝑋, 𝑌)
• Learn 𝑓 𝑋 by example through teacher
• The training set is a set of input-output pairs
• T = {(x_i, y_i)}, i = 1, …, N
• Learning by example
1. Produce f̂(x_i)
2. Compute the differences y_i − f̂(x_i)
3. Modify f̂ based on these differences
※ This idea has implicitly been in use all along, so why is it introduced only here?
2.6.2 Supervised Learning
• Each data point (x_i, y_i) is viewed as a point in (p + 1)-dimensional Euclidean space
• The approximating function f_θ(x) has a parameter vector θ
• Examples: the linear model, and linear basis expansions f_θ(x) = Σ_{k=1}^{K} h_k(x) θ_k
• Criteria for approximation
1. The residual sum-of-squares
• RSS(θ) = Σ_{i=1}^{N} [y_i − f_θ(x_i)]²   (see the sketch after this slide)
• For the linear model this gives a simple closed-form solution
2.6.3 Function Approximation
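A minimal sketch of fitting a linear basis expansion by minimizing RSS(θ); the polynomial basis functions h_k(x) = x^k and the toy data are assumptions chosen only to make the example concrete.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=80)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=80)

# Design matrix whose columns are the basis functions h_k(x) = x^k, k = 0..3.
H = np.vander(x, N=4, increasing=True)

# Because f_theta is linear in theta, minimizing RSS(theta) has a closed-form solution.
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

rss = float(np.sum((y - H @ theta_hat) ** 2))
print(theta_hat, rss)
```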
• Criterion for approximation
2. Maximum likelihood estimation
• L(θ) = Σ_{i=1}^{N} log Pr_θ(y_i)
• The principle of maximum likelihood:
• the most reasonable values of θ are those for which the probability of the observed sample is largest
• In classification, use the cross-entropy, with Pr(G = ℊ_k | X = x) = p_{k,θ}(x)
• L(θ) = Σ_{i=1}^{N} log p_{g_i,θ}(x_i)   (evaluated in the sketch after this slide)
2.6.3 Function Approximation (contd.)
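A minimal sketch of the classification log-likelihood L(θ) = Σ_i log p_{g_i,θ}(x_i); the softmax model p_{k,θ}(x) and the random data are assumptions, included only so there is a concrete p_{k,θ} to evaluate.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, K = 50, 2, 3
X = rng.normal(size=(N, p))
g = rng.integers(0, K, size=N)          # observed class labels g_i in {0, ..., K-1}
theta = rng.normal(size=(p, K))         # parameters of the assumed softmax model

def class_probs(X, theta):
    """p_{k,theta}(x): softmax of linear scores, one probability per class."""
    scores = X @ theta
    scores -= scores.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

def log_likelihood(theta, X, g):
    P = class_probs(X, theta)
    return float(np.sum(np.log(P[np.arange(len(g)), g])))   # sum_i log p_{g_i,theta}(x_i)

print(log_likelihood(theta, X, g))   # maximum likelihood picks the theta maximizing this
```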
• Infinitely many functions fit the training data
• The training set {(x_i, y_i)} is finite, so infinitely many f fit it exactly
• Constraints must come from considerations outside the data
• The strength of the constraint (the complexity of the model) can be viewed in terms of the neighborhood size
• The nature of the constraint also depends on the metric used to define neighborhoods
• In particular, to overcome the curse of dimensionality we need neighborhoods that are not small and isotropic in every direction
2.7.1 Difficulty of the Problem
• Variety of nonparametric regression techniques
• Add a roughness penalty (regularization) term to RSS
• PRSS(f; λ) = RSS(f) + λ J(f)   (a simple penalized fit is sketched after this slide)
• The penalty functional J can be used to impose special structure
• Additive models with smooth coordinate (feature) functions:
• f(X) = Σ_{j=1}^{p} f_j(X_j), with penalty J(f) = Σ_{j=1}^{p} J(f_j)
• Projection pursuit regression:
• f(X) = Σ_{m=1}^{M} g_m(α_m^T X)
• For more on penalty, see Ch.5
• For Bayesian approach, see Ch.8
2.8.1 Roughness Penalty and Bayesian methods
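A minimal sketch of the penalized criterion PRSS(f; λ) = RSS(f) + λJ(f). As a deliberately simple stand-in for a roughness functional, J is taken here to be the squared norm of the coefficients of a linear f (a ridge penalty); this substitution and the toy data are my own illustration, not the chapter's example.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.3 * rng.normal(size=N)

lam = 1.0   # lambda trades off the fit (RSS) against the penalty J

# Minimizer of ||y - X beta||^2 + lam * ||beta||^2, in closed form: (X^T X + lam I)^{-1} X^T y.
beta_pen = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

prss = float(np.sum((y - X @ beta_pen) ** 2) + lam * np.sum(beta_pen ** 2))
print(beta_pen, prss)
```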
• Kernel methods specify the nature of local neighborhood
• The local neighborhood is specified by a kernel function
• The Gaussian kernel: K_λ(x_0, x) = (1/λ) exp(−‖x − x_0‖² / (2λ))
• In general, a local regression estimate is f̂_θ̂(x_0), where
• θ̂ = argmin_θ RSS(f_θ, x_0)
    = argmin_θ Σ_{i=1}^{N} K_λ(x_0, x_i) (y_i − f_θ(x_i))²   (see the sketch after this slide)
• For more on this, see Ch.6
2.8.2 Kernel Methods and Local Regression
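A minimal sketch of kernel-weighted local regression with the Gaussian kernel above. Taking f_θ to be a constant θ at each target point x_0 reduces the weighted criterion to a kernel-weighted average (the Nadaraya-Watson estimator); the toy data and λ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = np.sort(rng.uniform(0, 1, size=100))
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=100)

def gaussian_kernel(x0, x, lam):
    """K_lambda(x0, x) = (1/lambda) exp(-(x - x0)^2 / (2 lambda))."""
    return (1.0 / lam) * np.exp(-((x - x0) ** 2) / (2.0 * lam))

def local_constant_fit(x0, x, y, lam=0.05):
    """argmin_theta sum_i K_lambda(x0, x_i) (y_i - theta)^2 is the kernel-weighted mean of y."""
    w = gaussian_kernel(x0, x, lam)
    return float(np.sum(w * y) / np.sum(w))

print(local_constant_fit(0.3, x_train, y_train))
```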
• This class includes a wide variety of methods
1. The model for f is a linear expansion of basis functions h_m(x)
• f_θ(x) = Σ_{m=1}^{M} θ_m h_m(x)
• For more, see Sec.5.2, Ch.9
2. Radial basis functions are symmetric p-dimensional kernels centered at μ_m
• f_θ(x) = Σ_{m=1}^{M} K_{λ_m}(μ_m, x) θ_m
• For more, see Sec.6.7
3. Feed-forward neural network with a single hidden layer
• f_θ(x) = Σ_{m=1}^{M} β_m σ(α_m^T x + b_m), where σ is the sigmoid function (a forward-pass sketch follows this slide)
• For more, see Ch.11
• Dictionary methods choose the basis functions adaptively from a large candidate set (a dictionary)
2.8.3 Basis Functions and Dictionary methods
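A minimal forward-pass sketch of the single-hidden-layer network f_θ(x) = Σ_m β_m σ(α_m^T x + b_m); the weights here are random placeholders rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(6)
p, M = 3, 5                       # input dimension and number of hidden units
alpha = rng.normal(size=(M, p))   # alpha_m, one row per hidden unit
b = rng.normal(size=M)            # b_m
beta = rng.normal(size=M)         # beta_m

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_forward(x):
    """Evaluate f_theta(x) = sum_m beta_m * sigmoid(alpha_m^T x + b_m)."""
    hidden = sigmoid(alpha @ x + b)    # M hidden-unit activations
    return float(beta @ hidden)

print(nn_forward(np.array([0.5, -1.0, 2.0])))
```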
• Many models have a smoothing or complexity parameter
• We cannot determine it from the residual sum-of-squares on the training data
• residuals would shrink toward zero and the model would overfit
• The expected prediction error at x_0 (test or generalization error), here for k-NN regression:
• EPE_k(x_0) = E[(Y − f̂_k(x_0))² | X = x_0]
            = σ² + Bias²(f̂_k(x_0)) + Var_T(f̂_k(x_0))
            = σ² + [f(x_0) − (1/k) Σ_{l=1}^{k} f(x_{(l)})]² + σ²/k
            = T1 + T2 + T3    (x_{(l)}: the l-th nearest neighbor of x_0; simulated after this slide)
• 𝑇1: irreducible error, beyond our control
• 𝑇2: (Squared) Bias term of mean squared error
• 𝑇2 increases with 𝑘
• 𝑇3: Variance term of mean squared error
• 𝑇3 decreases with 𝑘
2.9 Model Selection and the Bias-Variance Tradeoff
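A minimal simulation of the bias-variance decomposition above for k-NN regression at a single point x_0, averaging over freshly drawn training sets T; the target function f, the noise level σ, N, and the grid of k values are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(4 * x)            # assumed true regression function
sigma, N, x0, reps = 0.3, 100, 0.5, 2000

for k in (1, 5, 20, 50):
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(0, 1, size=N)                # a fresh training set T each repetition
        Y = f(X) + sigma * rng.normal(size=N)
        idx = np.argsort(np.abs(X - x0))[:k]         # the k nearest neighbors of x0
        preds[r] = Y[idx].mean()                     # f_hat_k(x0)
    bias2 = (preds.mean() - f(x0)) ** 2              # T2: squared bias over training sets
    var = preds.var()                                # T3: variance over training sets
    print(k, round(bias2, 4), round(var, 4), round(sigma ** 2 + bias2 + var, 4))
```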
• Model Complexity
• If model complexity increases,
• (Squared) Bias Term 𝑇2 decreases
• Variance Term 𝑇3 increases
• There is a trade-off between Bias and Variance
• The training error is not a good estimate of test error
• For more, see Ch.7.
2.9 Model Selection and the Bias-Variance Tradeoff (contd.)