
# Data Mining and Statistical Learning - 2008



**1. Kernel methods - overview**

- Kernel smoothers
- Local regression
- Kernel density estimation
- Radial basis functions

**2. Introduction**

- Kernel methods are regression techniques used to estimate a response function from noisy data
- Properties:
  - Different models are fitted at each query point, and only those observations close to that point are used to fit the model
  - The resulting function is smooth
  - The models require only a minimum of training

**3. A simple one-dimensional kernel smoother**

- The estimate at a query point $x_0$ is a weighted average of nearby responses:

  $$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}$$

  where

  $$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

**4. Kernel methods, splines and ordinary least squares regression (OLS)**

- OLS: a single model is fitted to all data
- Splines: different models are fitted to different subintervals (cuboids) of the input domain
- Kernel methods: different models are fitted at each query point

**5. Kernel-weighted averages and moving averages**

- The Nadaraya-Watson kernel-weighted average

  $$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}$$

  where $\lambda$ indicates the window size and the function $D$ shows how the weights change with distance within this window
- The estimated function is smooth!
- K-nearest neighbours: $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
- The estimated function is piecewise constant!
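
The Nadaraya-Watson average can be sketched in a few lines. A minimal example; the Epanechnikov kernel and the toy data below are illustrative choices, not taken from the slides:

```python
# Nadaraya-Watson kernel-weighted average at a single query point.

def epanechnikov(t):
    """D(t) = 3/4 * (1 - t^2) for |t| <= 1, else 0."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def nadaraya_watson(x0, xs, ys, lam):
    """Kernel-weighted average of ys at query point x0 with window size lam."""
    weights = [epanechnikov(abs(x - x0) / lam) for x in xs]
    total = sum(weights)
    if total == 0.0:          # no observations fall inside the window
        return float("nan")
    return sum(w * y for w, y in zip(weights, ys)) / total

xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.1, 0.3, 0.35, 0.6, 0.8, 1.05]
print(nadaraya_watson(0.5, xs, ys, 0.3))
```

Because the weights vary smoothly with $x_0$, sliding the query point produces a smooth curve, unlike the piecewise-constant k-nearest-neighbour average.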

**6. Examples of one-dimensional kernel smoothers**

- Epanechnikov kernel: $D(t) = \tfrac{3}{4}(1 - t^2)$ for $|t| \le 1$, 0 otherwise
- Tri-cube kernel: $D(t) = (1 - |t|^3)^3$ for $|t| \le 1$, 0 otherwise
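
The two compact-support kernels can be compared directly; these are the standard textbook forms, and the evaluation points below are arbitrary:

```python
# Epanechnikov and tri-cube kernels side by side.

def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def tricube(t):
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) <= 1.0 else 0.0

for t in (0.0, 0.5, 1.0, 1.5):
    print(f"t={t}: Epanechnikov={epanechnikov(t):.4f}, tri-cube={tricube(t):.4f}")
```

Both vanish outside $|t| \le 1$; the tri-cube is flatter near $t = 0$ and has continuous first and second derivatives at the boundary of its support.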

**7. Issues in kernel smoothing**

- The smoothing parameter $\lambda$ has to be defined
- When there are ties at $x_i$: compute an average $y$ value and introduce weights representing the number of points
- Boundary issues
- Varying density of observations:
  - the bias is constant
  - the variance is inversely proportional to the density

**8. Boundary effects of one-dimensional kernel smoothers**

- Locally weighted averages can be badly biased on the boundaries if the response function has a significant slope → apply local linear regression

**9. Local linear regression**

- Find the intercept and slope parameters solving

  $$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0) - \beta(x_0) x_i\big]^2$$

- The solution is a linear combination of $y_i$:

  $$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0 = \sum_{i=1}^N l_i(x_0)\, y_i$$
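
In one dimension the weighted least-squares problem has a small closed form. A minimal sketch; the Epanechnikov weights and the toy linear data are illustrative choices:

```python
# Local linear regression at a single query point via weighted least squares.

def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def local_linear(x0, xs, ys, lam):
    """Weighted least squares fit of alpha + beta*x around x0; returns f_hat(x0)."""
    w = [epanechnikov(abs(x - x0) / lam) for x in xs]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = sw * swxx - swx * swx
    beta = (sw * swxy - swx * swy) / det
    alpha = (swy - beta * swx) / sw
    return alpha + beta * x0          # evaluate the local line at the query point

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]        # exactly linear: y = 1 + 2x
print(local_linear(0.9, xs, ys, 0.6)) # recovers 1 + 2*0.9 = 2.8 near the edge
```

On exactly linear data the local fit reproduces the line even at the boundary, which is precisely the bias correction that plain kernel-weighted averaging lacks.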

**10. Kernel smoothing vs local linear regression**

- Kernel smoothing: solve the minimization problem

  $$\min_{\alpha(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0)\big]^2$$

- Local linear regression: solve the minimization problem

  $$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0) - \beta(x_0) x_i\big]^2$$

**11. Properties of local linear regression**

- Automatically modifies the kernel weights to correct for bias
- The bias depends only on the terms of order higher than one in the expansion of $f$

**12. Local polynomial regression**

- Fit local polynomials of degree $d$ instead of straight lines:

  $$\min_{\alpha(x_0),\,\beta_j(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,\Big[y_i - \alpha(x_0) - \sum_{j=1}^d \beta_j(x_0)\, x_i^j\Big]^2$$

- Behavior of the estimated response function

**13. Polynomial vs local linear regression**

- Advantages:
  - Reduces the "trimming of hills and filling of valleys"
- Disadvantages:
  - Higher variance (the tails are more wiggly)

**14. Selecting the width of the kernel**

- Bias-variance tradeoff: selecting a narrow window leads to high variance and low bias, whilst selecting a wide window leads to high bias and low variance

**15. Selecting the width of the kernel**

- Automatic selection (cross-validation)
- Fixing the degrees of freedom
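
Automatic selection can be sketched with leave-one-out cross-validation over a candidate grid. A sketch assuming the Nadaraya-Watson smoother from the earlier slides; the candidate grid and the noiseless quadratic data are illustrative choices:

```python
# Leave-one-out cross-validation for the smoothing window lambda.

def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def nw(x0, xs, ys, lam):
    w = [epanechnikov(abs(x - x0) / lam) for x in xs]
    s = sum(w)
    return sum(wi * y for wi, y in zip(w, ys)) / s if s > 0 else float("nan")

def loocv_error(xs, ys, lam):
    """Mean squared leave-one-out prediction error for window size lam."""
    err = 0.0
    for i in range(len(xs)):
        pred = nw(xs[i], xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        err += (ys[i] - pred) ** 2
    return err / len(xs)

xs = [i / 10 for i in range(11)]
ys = [x * x for x in xs]              # smooth, noiseless test function
best_err, best_lam = min((loocv_error(xs, ys, lam), lam)
                         for lam in (0.15, 0.3, 0.6, 1.2))
print("selected window:", best_lam)
```

With noiseless data the criterion favours the narrowest candidate window (low bias costs nothing here); with noisy data the variance term pushes the minimum toward wider windows, which is the bias-variance tradeoff of the previous slide.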

**16. Local regression in $\mathbb{R}^p$**

- The one-dimensional approach is easily extended to $p$ dimensions by
  - using the Euclidean norm as a measure of distance in the kernel
  - modifying the polynomial

**17. Local regression in $\mathbb{R}^p$**

- "The curse of dimensionality":
  - the fraction of points close to the boundary of the input domain increases with its dimension
  - observed data do not cover the whole input domain

**18. Structured local regression models**

- Structured kernels (standardize each variable):

  $$K_{\lambda,A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\, (x - x_0)}{\lambda}\right)$$

- Note: $A$ is positive semidefinite

**19. Structured local regression models**

- Structured regression functions
- ANOVA decompositions (e.g., additive models); backfitting algorithms can be used
- Varying coefficient models (partition $X$):

  $$f(X) = \alpha(Z) + \beta_1(Z) X_1 + \dots + \beta_q(Z) X_q$$

**20. Structured local regression models**

- Varying coefficient models (example)

**21. Local methods**

- Assumption: the model is locally linear → maximize the log-likelihood locally at $x_0$:

  $$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\, l\big(y_i, x_i^T \beta(x_0)\big)$$

- Autoregressive time series: $y_t = \beta_0 + \beta_1 y_{t-1} + \dots + \beta_k y_{t-k} + \varepsilon_t$, i.e. $y_t = z_t^T \beta + \varepsilon_t$ with $z_t = (1, y_{t-1}, \dots, y_{t-k})$; fit by local least squares with kernel $K(z_0, z_t)$
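
The autoregressive case can be illustrated with an AR(1) fitted by locally weighted least squares, with the kernel applied to the lagged value $z_t = y_{t-1}$. The noiseless series, the query point $z_0$, and the bandwidth below are all illustrative choices:

```python
# Locally weighted AR(1) estimation: y_t = b0 + b1*y_{t-1} + e_t, with
# kernel weights K(z0, z_t) on the lagged value z_t = y_{t-1}.

def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def local_ar1(series, z0, lam):
    """Weighted least squares for (b0, b1) over the pairs (y_{t-1}, y_t)."""
    zs = series[:-1]                  # lagged values z_t
    ys = series[1:]                   # responses y_t
    w = [epanechnikov(abs(z - z0) / lam) for z in zs]
    sw = sum(w)
    swz = sum(wi * z for wi, z in zip(w, zs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swzz = sum(wi * z * z for wi, z in zip(w, zs))
    swzy = sum(wi * z * y for wi, z, y in zip(w, zs, ys))
    det = sw * swzz - swz * swz
    b1 = (sw * swzy - swz * swy) / det
    b0 = (swy - b1 * swz) / sw
    return b0, b1

series = [1.0]
for _ in range(7):
    series.append(0.9 * series[-1])   # exact AR(1) with b0 = 0, b1 = 0.9

b0, b1 = local_ar1(series, z0=0.8, lam=1.0)
print(b0, b1)
```

Because the fit is local in $z$, the estimated coefficients may vary with $z_0$, which lets the model capture autoregressions whose dynamics change with the level of the series.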

**22. Kernel density estimation**

- Straightforward estimates of the density are bumpy
- Instead, Parzen's smooth estimate is preferred:

  $$\hat f_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^N K_\lambda(x_0, x_i)$$

- Normally, Gaussian kernels are used
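
Parzen's estimate with a Gaussian kernel takes only a few lines; the sample and the bandwidth below are illustrative choices:

```python
# Parzen (kernel) density estimate with a Gaussian kernel.
import math

def parzen_density(x0, xs, lam):
    """f_hat(x0) = (1/(N*lam)) * sum_i phi((x0 - x_i)/lam)."""
    total = sum(math.exp(-0.5 * ((x0 - x) / lam) ** 2) / math.sqrt(2 * math.pi)
                for x in xs)
    return total / (len(xs) * lam)

xs = [-0.4, -0.1, 0.0, 0.2, 0.5]
print(parzen_density(0.0, xs, 0.25))
```

Each observation contributes a small Gaussian bump of width $\lambda$, so the estimate is smooth everywhere and integrates to one, unlike a histogram of the same sample.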

**23. Radial basis functions and kernels**

- Using the idea of basis expansions, we treat kernel functions as basis functions:

  $$f(x) = \sum_{j=1}^M K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^M D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right) \beta_j$$

  where $\xi_j$ is a prototype parameter and $\lambda_j$ a scale parameter

**24. Radial basis functions and kernels**

- Choosing the parameters:
  - estimate $\{\lambda_j, \xi_j\}$ separately from $\beta_j$ (often by using the distribution of $X$ alone), then solve least squares for $\beta_j$
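
A minimal end-to-end sketch of this two-stage recipe: fix the prototypes $\xi_j$ at the data points, then solve least squares for $\beta_j$ via the normal equations. The Gaussian basis, the toy data, and the scale $\lambda$ are illustrative choices:

```python
# Radial basis function fit with fixed centers and least-squares coefficients.
import math

def gauss_rbf(x, center, lam):
    return math.exp(-((x - center) / lam) ** 2)

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_rbf(xs, ys, centers, lam):
    """Least squares for f(x) = sum_j beta_j * exp(-((x - xi_j)/lam)^2)."""
    phi = [[gauss_rbf(x, c, lam) for c in centers] for x in xs]
    k = len(centers)
    ata = [[sum(phi[i][a] * phi[i][b] for i in range(len(xs)))
            for b in range(k)] for a in range(k)]
    aty = [sum(phi[i][a] * ys[i] for i in range(len(xs))) for a in range(k)]
    return solve(ata, aty)

xs = [0.0, 0.5, 1.0]
ys = [1.0, 0.0, 1.0]
beta = fit_rbf(xs, ys, centers=xs, lam=0.4)
f = lambda x: sum(b * gauss_rbf(x, c, 0.4) for b, c in zip(beta, xs))
print([round(f(x), 6) for x in xs])   # interpolates the data when centers = xs
```

With one center per observation the least-squares fit interpolates the data; in practice $M \ll N$ prototypes are chosen (e.g., by clustering the $x_i$), trading interpolation for smoothness.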