# Elements of Statistical Learning Reading Group: Chapter 2

1. **The Elements of Statistical Learning, Ch. 2: Overview of Supervised Learning** (4/13/2017, 坂間 毅)
2. **2.1 Introduction**
   - Supervised learning: predict outputs from inputs.
   - Other names for inputs: predictors, independent variables, features.
   - Other names for outputs: responses, dependent variables.
3. **2.2 Variable Types and Terminology**
   - Output types:
     1. Quantitative variables: continuous values, e.g. atmospheric measurements. Quantitative prediction = regression.
     2. Qualitative variables (also called categorical or discrete): values from a finite set, e.g. iris species. Qualitative prediction = classification.
   - Input types:
     1. Quantitative variables
     2. Qualitative variables
     3. Ordered categorical variables (e.g. small, medium, large)
   - (Presenter's note: are interval and ratio scales both lumped under "quantitative"?)
4. **2.2 Variable Types and Terminology (contd.)**
   - Notation for inputs:
     - Random input vector: X; its j-th component: X_j
     - i-th observed value: x_i (lowercase)
     - Matrix of all observations: **X** (bold); all observations of the j-th variable: **x**_j (bold)
   - Notation for outputs:
     - Quantitative output: Y; its prediction: Ŷ
     - Qualitative output: G; its prediction: Ĝ
5. **2.3.1 Linear Models and Least Squares**
   - Linear model, with the bias term absorbed into the coefficient vector: Ŷ = Xᵀβ̂
   - Most popular fitting method: least squares
     - RSS(β) = (y − Xβ)ᵀ(y − Xβ)  (RSS: residual sum of squares)
     - Differentiating RSS w.r.t. β and setting it to zero: Xᵀ(y − Xβ) = 0
     - If XᵀX is nonsingular, its inverse exists and β̂ = (XᵀX)⁻¹Xᵀy
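The closed-form solution β̂ = (XᵀX)⁻¹Xᵀy can be checked numerically. A minimal NumPy sketch on synthetic data (the coefficients and noise level are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Design matrix with the bias term absorbed as a leading column of ones
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
beta_true = np.array([3.0, 2.0, -1.0])          # illustrative coefficients
y = X @ beta_true + 0.1 * rng.normal(size=N)    # small additive noise

# Normal equations: solve (X^T X) beta = X^T y  (valid when X^T X is nonsingular)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With little noise, `beta_hat` recovers the generating coefficients closely; `np.linalg.lstsq` computes the same fit more stably when XᵀX is close to singular.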
6. **2.3.1 Linear Models and Least Squares (contd.)**
   - Linear model for classification:
     - Ĝ = ORANGE if Ŷ > 0.5, BLUE if Ŷ ≤ 0.5
   - The two classes are separated by the decision boundary {x : xᵀβ̂ = 0.5}
   - Two scenarios for generating two-class data:
     1. Each class drawn from an uncorrelated bivariate Gaussian with a different mean ⇒ a linear decision boundary is best (Ch. 4)
     2. Each class drawn from a mixture of 10 low-variance Gaussians whose means are themselves Gaussian-distributed ⇒ a nonlinear decision boundary is best (the example in this chapter is of this kind)
7. **2.3.2 Nearest-Neighbor Methods**
   - k-nearest neighbors: Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i, where N_k(x) is the set of the k (Euclidean-)closest points to x in the training set
   - k = 1 yields a Voronoi tessellation of the input space
   - Notes:
     - The effective number of parameters of k-NN is N/k ("as we will see")
     - Training RSS is useless for choosing k: with k = 1 the training data are fit with zero error, so k = 1 always has the smallest RSS
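The k-NN average above fits in a few lines. A minimal sketch (the tiny 1-D training set is illustrative):

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k):
    """k-NN regression: average the y values of the k training points
    nearest to x0 in Euclidean distance."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dists)[:k]
    return y_train[neighbors].mean()

# Tiny 1-D illustration
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
```

For example, at x0 = 1.1 the two nearest training points are x = 1 and x = 2, so the k = 2 prediction is (1 + 2)/2 = 1.5.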
8. **2.3.3 From Least Squares to Nearest Neighbors**
   - Most of today's popular techniques are variants of the linear model or k-nearest neighbors (or both)

   | | Variance | Bias |
   |---|---|---|
   | Linear model | low | high |
   | k-nearest neighbors | high | low |
9. **2.4 Statistical Decision Theory**
   - Theoretical framework:
     - Joint distribution Pr(X, Y)
     - Squared-error loss function L(Y, f(X)) = (Y − f(X))²
   - Expected (squared) prediction error:
     - EPE(f) = E(Y − f(X))²
       = ∫ (y − f(x))² Pr(dx, dy)
       = ∫∫ (y − f(x))² Pr(y|x) Pr(x) dy dx   (since Pr(X, Y) = Pr(Y|X) Pr(X))
       = ∫ E_{Y|X}[(Y − f(X))² | X = x] Pr(x) dx
       = E_X E_{Y|X}[(Y − f(X))² | X]
10. **2.4 Statistical Decision Theory (contd.)**
    - The minimizing f is the regression function: the best prediction of Y at any point X = x is the conditional mean, when "best" is measured by average squared error
    - f(x) = argmin_c E_{Y|X}[(Y − c)² | X = x]
      ⇒ ∂/∂c ∫ (y − c)² Pr(y|x) dy = 0
      ⇒ ∫ (−2y + 2c) Pr(y|x) dy = 0
      ⇒ 2c ∫ Pr(y|x) dy = 2 ∫ y Pr(y|x) dy
      ⇒ f(x) = E(Y | X = x)
11. **2.4 Statistical Decision Theory (contd.)**
    - How to estimate the conditional mean E(Y | X = x):
    - k-nearest neighbors: f̂(x) = Ave(y_i | x_i ∈ N_k(x))
      - Two approximations: expectation is approximated by averaging, and conditioning at a point by conditioning on the neighborhood N_k(x)
      - Under mild regularity conditions on Pr(X, Y): if N, k → ∞ with k/N → 0, then f̂(x) → E(Y | X = x)
      - However, the curse of dimensionality becomes severe in high dimensions
12. **2.4 Statistical Decision Theory (contd.)**
    - Linear regression: assume f(x) ≈ xᵀβ (or f(x) = xᵀβ?)
    - Then
      ∂EPE/∂β = ∂/∂β ∫∫ (y − xᵀβ)² Pr(x, y) dx dy
      = −2 ∫∫ (y − xᵀβ) x Pr(x, y) dx dy
      = −2 ∫∫ (yx − xxᵀβ) Pr(x, y) dx dy = 0
      ⇒ E(XY) = E(XXᵀ)β
      ⇒ β = [E(XXᵀ)]⁻¹ E(XY)
    - Note this solution is not conditioned on X
    - With the L1 loss, EPE(f) = E|Y − f(X)|, the solution is f̂(x) = median(Y | X = x)
13. **2.4 Statistical Decision Theory (contd.)**
    - In classification:
    - The zero-one loss L is represented by a K×K matrix **L** with zeros on the diagonal, where K = card(𝒢):
      L(𝒢_k, 𝒢_l) ∈ {0, 1} is the price paid for classifying an observation of class 𝒢_k as 𝒢_l
    - The expected prediction error:
      EPE(Ĝ) = E[L(G, Ĝ(X))] = E_X Σ_{k=1}^{K} L(𝒢_k, Ĝ(X)) Pr(𝒢_k | X)
14. **2.4 Statistical Decision Theory (contd.)**
    - In classification, the minimizing Ĝ (pointwise at X = x) is the Bayes classifier:
      Ĝ(x) = argmin_{g ∈ 𝒢} Σ_{k=1}^{K} L(𝒢_k, g) Pr(𝒢_k | X = x)
      = argmin_{g ∈ 𝒢} [1 − Pr(g | X = x)]
      = 𝒢_k  if  Pr(𝒢_k | X = x) = max_{g ∈ 𝒢} Pr(g | X = x)
    - This classifies to the most probable class, using the conditional distribution Pr(G | X)
    - Many approaches to modeling Pr(G | X) are discussed in Ch. 4
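When Pr(G | X) is actually known, the Bayes classifier can be written down directly. A minimal sketch for two classes with known 1-D Gaussian class-conditionals (the means, standard deviations, and priors are illustrative assumptions, not from the text):

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, params, priors):
    """Return the class index maximizing Pr(g | X = x); by Bayes' rule
    the posterior is proportional to Pr(x | g) * Pr(g)."""
    posteriors = [p * gauss_pdf(x, mu, sigma) for p, (mu, sigma) in zip(priors, params)]
    return max(range(len(posteriors)), key=posteriors.__getitem__)

params = [(-1.0, 1.0), (1.0, 1.0)]   # (mean, sd) per class -- illustrative
priors = [0.5, 0.5]
```

With equal priors and equal variances, the decision boundary sits at the midpoint of the two means (here x = 0).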
15. **2.5 Local Methods in High Dimensions**
    - The curse of dimensionality:
    1. To capture a fraction r = 0.1 of the data in a hypercubical neighborhood, the expected edge length is e_p(r) = r^{1/p}; in 10 dimensions, e_10(0.1) = 0.1^{1/10} ≈ 0.8, i.e. 80% of the range of each input
    2. Consider a nearest-neighbor estimate at the origin, with N data points uniformly distributed in the p-dimensional unit ball:
       - The median distance from the origin to the closest data point is d(p, N) = (1 − (1/2)^{1/N})^{1/p}
       - For N = 500, p = 10: d(p, N) ≈ 0.52, so the closest point is more than halfway to the boundary, and most data points are closer to the boundary than to the origin
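The median-distance formula is easy to evaluate; the sketch below reproduces the d(10, 500) ≈ 0.52 figure:

```python
def median_nn_distance(p, N):
    """Median distance from the origin to the nearest of N points drawn
    uniformly in the p-dimensional unit ball: (1 - (1/2)**(1/N))**(1/p)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)
```

Even with N fixed at 500, the median distance jumps from about 0.04 at p = 2 to about 0.52 at p = 10: the nearest neighbor stops being "local".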
16. **2.5 Local Methods in High Dimensions (contd.)**
    3. The sampling density is proportional to N^{1/p}: to match the density of N_1 = 100 points in one dimension, ten dimensions require N_10 = 100^10 points ⇒ data are sparse in high dimensions
    4. Example: x_i sampled uniformly from [−1, 1]^p
       - Assume Y = f(X) = e^{−8‖X‖²} (no noise)
       - Using the 1-nearest-neighbor estimate at x_0 = 0: f̂(x_0) < f(x_0) = 1 whenever the nearest neighbor is not exactly at 0, so the estimate is biased downward
       - As the dimension increases, the nearest neighbor gets farther from the target point and the bias grows
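The 1-NN example in item 4 can be simulated directly. A sketch, assuming the same f(x) = exp(−8‖x‖²) and uniform design as the slide (sample sizes and repeat counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def one_nn_estimate_at_origin(p, N):
    """1-NN estimate of f(0) = 1 for f(x) = exp(-8 * ||x||^2),
    with x_i uniform on [-1, 1]^p and no output noise."""
    X = rng.uniform(-1.0, 1.0, size=(N, p))
    sq_dists = (X ** 2).sum(axis=1)
    return np.exp(-8.0 * sq_dists.min())

# Average over repeats: the estimate collapses toward 0 as p grows
est_p2 = np.mean([one_nn_estimate_at_origin(2, 1000) for _ in range(50)])
est_p10 = np.mean([one_nn_estimate_at_origin(10, 1000) for _ in range(50)])
```

In low dimensions the nearest neighbor is very close to 0 and the estimate is near 1; in ten dimensions the nearest of 1000 points is typically far away and the estimate is near 0.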
17. **2.5 Local Methods in High Dimensions (contd.)**
    5. In the linear model Y = Xᵀβ + ε, ε ~ N(0, σ²):
       - For an arbitrary test point x_0:
         EPE(x_0) = E_{y_0|x_0} E_T (y_0 − ŷ_0)² = σ² + E_T[x_0ᵀ(XᵀX)⁻¹x_0]σ² + 0²
         (irreducible error, plus variance of the estimate, plus zero squared bias)
       - If N is large and T is drawn at random with E(X) = 0:
         E_{x_0} EPE(x_0) ≈ σ²(p/N) + σ²
       - EPE grows linearly in p with slope σ²/N, so if N is large or σ² is small the growth is negligible ⇒ under this restriction the curse of dimensionality is avoided
18. **2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)**
    - Additive error model: Y = f(X) + ε
      - Deterministic part: f(x) = E(Y | X = x)
      - Everything non-deterministic goes into the random error ε, with E(ε) = 0 and ε independent of X
    - The additive error model is not used for classification
      - There the target function is the conditional probability p(X) = Pr(G | X) directly
19. **2.6.2 Supervised Learning**
    - Learn f(X) by example, through a "teacher"
    - The training set consists of input-output pairs T = (x_i, y_i), i = 1, …, N
    - Learning by example:
      1. Produce outputs f̂(x_i)
      2. Compute the differences y_i − f̂(x_i)
      3. Modify f̂ accordingly
    - (Presenter's note: this view has implicitly been used all along; why introduce it only now?)
20. **2.6.3 Function Approximation**
    - Each data point (x_i, y_i) is viewed as a point in a (p + 1)-dimensional Euclidean space
    - Approximate f by a function f_θ with parameters θ:
      - Linear model: f_θ(x) = xᵀθ
      - Linear basis expansion: f_θ(x) = Σ_{k=1}^{K} h_k(x) θ_k
    - Criteria for approximation:
      1. Residual sum of squares: RSS(θ) = Σ_{i=1}^{N} (y_i − f_θ(x_i))²
         - For the linear model this has a simple closed-form solution
21. **2.6.3 Function Approximation (contd.)**
    2. Maximum likelihood estimation:
       - L(θ) = Σ_{i=1}^{N} log Pr_θ(y_i)
       - Principle of maximum likelihood: the most reasonable values of θ are those for which the probability of the observed sample is largest
       - For classification, use the cross-entropy with Pr(G = 𝒢_k | X = x) = p_{k,θ}(x):
         L(θ) = Σ_{i=1}^{N} log p_{g_i,θ}(x_i)
22. **2.7.1 Difficulty of the Problem**
    - Infinitely many functions fit the training data: the training set (x_i, y_i) is finite, so infinitely many f pass through it exactly
    - Constraints must therefore come from considerations outside the data
    - The strength of a constraint (the model's complexity) can be viewed as the size of the neighborhoods over which f is restricted
    - Constraints also come from the metric used to define those neighborhoods
    - In particular, to overcome the curse of dimensionality we need non-isotropic neighborhoods
23. **2.8.1 Roughness Penalty and Bayesian Methods**
    - A variety of nonparametric regression techniques add a roughness-penalty (regularization) term to the RSS:
      PRSS(f; λ) = RSS(f) + λJ(f)
    - The penalty functional J can also be used to impose special structure, e.g.:
      - Additive models with smooth coordinate (feature) functions: f(X) = Σ_{j=1}^{p} f_j(X_j), penalized by Σ_{j=1}^{p} J(f_j)
      - Projection pursuit regression: f(X) = Σ_{m=1}^{M} g_m(α_mᵀX)
    - For more on penalties, see Ch. 5; for the Bayesian view, see Ch. 8
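One of the simplest concrete instances of PRSS is a quadratic penalty J(β) = ‖β‖² on a linear fit, i.e. ridge regression; that specific choice is not on the slide, but it shows the penalized criterion keeping the same closed form as least squares. A minimal sketch on a tiny illustrative dataset:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X beta||^2 + lam * ||beta||^2.
    Closed form: beta = (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Tiny illustrative dataset with exact solution beta = (1, 2) at lam = 0
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
```

Setting λ = 0 recovers ordinary least squares, while a large λ shrinks β̂ toward zero: the penalty trades variance for bias.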
24. **2.8.2 Kernel Methods and Local Regression**
    - Kernel methods specify the nature of the local neighborhood explicitly, via a kernel function
    - Gaussian kernel: K_λ(x_0, x) = (1/λ) exp(−‖x − x_0‖² / (2λ))
    - In general, a local regression estimate is f̂_θ̂(x_0), where
      θ̂ = argmin_θ RSS(f_θ, x_0) = argmin_θ Σ_{i=1}^{N} K_λ(x_0, x_i) (y_i − f_θ(x_i))²
    - For more on this, see Ch. 6
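Taking f_θ to be locally constant makes the weighted criterion above solvable in one line: the minimizer is the kernel-weighted average of the y_i (the Nadaraya-Watson estimate). A 1-D sketch with the Gaussian kernel (the noiseless sine test function and the bandwidth are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """Gaussian kernel K_lam(x0, x) = (1/lam) * exp(-(x - x0)^2 / (2*lam))."""
    return np.exp(-((x - x0) ** 2) / (2.0 * lam)) / lam

def local_constant_fit(x0, x, y, lam):
    """Minimizing sum_i K_lam(x0, x_i) * (y_i - c)^2 over a constant c
    gives the kernel-weighted average of the y_i."""
    w = gaussian_kernel(x0, x, lam)
    return float((w * y).sum() / w.sum())

x = np.linspace(0.0, 1.0, 201)
y = np.sin(2.0 * np.pi * x)        # noiseless test function, illustrative
```

At x_0 = 0.25 the true value is sin(π/2) = 1; with a small bandwidth the local fit lands close to it, with a slight downward smoothing bias at the peak.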
25. **2.8.3 Basis Functions and Dictionary Methods**
    - This class includes a wide variety of methods:
    1. A linear expansion of basis functions h_m(x): f_θ(x) = Σ_{m=1}^{M} θ_m h_m(x) (see Sec. 5.2, Ch. 9)
    2. Radial basis functions, symmetric p-dimensional kernels centered at μ_m: f_θ(x) = Σ_{m=1}^{M} K_{λ_m}(μ_m, x) θ_m (see Sec. 6.7)
    3. A single-hidden-layer feed-forward neural network: f_θ(x) = Σ_{m=1}^{M} β_m σ(α_mᵀx + b_m), where σ is the sigmoid function (see Ch. 11)
    - "Dictionary methods" are those that choose the basis functions adaptively from a candidate set (the dictionary)
26. **2.9 Model Selection and the Bias-Variance Tradeoff**
    - Many models have a smoothing or complexity parameter
    - It cannot be chosen by minimizing the residual sum of squares on the training data: the residuals could be driven to zero and the model would overfit
    - The expected prediction error at x_0 (test or generalization error), for the k-NN fit:
      EPE_k(x_0) = E[(Y − f̂_k(x_0))² | X = x_0]
      = σ² + Bias²(f̂_k(x_0)) + Var_T(f̂_k(x_0))
      = σ² + [f(x_0) − (1/k) Σ_{l=1}^{k} f(x_(l))]² + σ²/k
      = T_1 + T_2 + T_3   (x_(l) is the l-th nearest neighbor of x_0)
    - T_1: irreducible error, beyond our control
    - T_2: squared-bias term of the mean squared error; it typically increases with k, as larger neighborhoods reach farther from x_0
    - T_3: variance term of the mean squared error; it decreases with k, as σ²/k
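The variance term T_3 = σ²/k can be checked by simulation. A sketch with a flat truth f(x) = 0, so that T_2 = 0 and the variance of the k-NN fit at x_0 = 0 should sit near σ²/k (the sample sizes and repeat counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_fit_at_zero(k, N, sigma):
    """k-NN estimate at x0 = 0 when the true f is identically 0, so the
    estimate is the mean of k noise terms and has variance sigma^2 / k."""
    x = rng.uniform(-1.0, 1.0, size=N)
    y = sigma * rng.normal(size=N)            # Y = f(X) + eps with f = 0
    nearest = np.argsort(np.abs(x))[:k]
    return y[nearest].mean()

sigma, N, reps = 1.0, 200, 4000
var_k1 = np.var([knn_fit_at_zero(1, N, sigma) for _ in range(reps)])
var_k20 = np.var([knn_fit_at_zero(20, N, sigma) for _ in range(reps)])
```

The simulated variances come out near σ² = 1 for k = 1 and near σ²/20 = 0.05 for k = 20, matching the T_3 term.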
27. **2.9 Model Selection and the Bias-Variance Tradeoff (contd.)**
    - As model complexity increases, the squared-bias term T_2 decreases and the variance term T_3 increases
    - There is a trade-off between bias and variance
    - Consequently, the training error is not a good estimate of the test error
    - For more, see Ch. 7