Predicting Real-valued Outputs: An introduction to regression
Published on SlideShare: May 14, 2009 (Technology; Economy & Finance).
Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

(This lecture is re-ordered material from the "Neural Nets" lecture and the "Favorite Regression Algorithms" lecture.)

Copyright © 2001, 2003, Andrew W. Moore

Single-Parameter Linear Regression
Linear Regression

DATASET:
  inputs      outputs
  x1 = 1      y1 = 1
  x2 = 3      y2 = 2.2
  x3 = 2      y3 = 2
  x4 = 1.5    y4 = 1.9
  x5 = 4      y5 = 3.1

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear. Simplest case: Out(x) = wx for some unknown w. Given the data, we can estimate w.

1-parameter linear regression

Assume that the data is formed by y_i = w x_i + noise_i, where:
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

Then p(y|w,x) has a normal distribution with mean wx and variance σ².
Bayesian Linear Regression

p(y|w,x) = Normal(mean wx, variance σ²). We have a set of datapoints (x1,y1), (x2,y2), …, (xn,yn) which are EVIDENCE about w. We want to infer w from the data: p(w | x1, x2, …, xn, y1, y2, …, yn).
• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w

Asks the question: "For which value of w is this data most likely to have happened?"
⇔ For what w is p(y1, y2, …, yn | x1, x2, …, xn, w) maximized?
⇔ For what w is Π_{i=1}^n p(y_i | w, x_i) maximized?
For what w is Π_{i=1}^n p(y_i | w, x_i) maximized?
⇔ For what w is Π_{i=1}^n exp(−½ ((y_i − w x_i)/σ)²) maximized?
⇔ For what w is Σ_{i=1}^n −½ ((y_i − w x_i)/σ)² maximized?
⇔ For what w is Σ_{i=1}^n (y_i − w x_i)² minimized?

Linear Regression

The maximum likelihood w is the one that minimizes the sum of squares of residuals:
  E(w) = Σ_i (y_i − w x_i)² = Σ_i y_i² − (2 Σ_i x_i y_i) w + (Σ_i x_i²) w²
So we want to minimize a quadratic function of w.
Linear Regression (con't)

Easy to show the sum of squares is minimized when
  w = (Σ x_i y_i) / (Σ x_i²)
The maximum likelihood model is Out(x) = wx, and we can use it for prediction.

Note: in Bayesian stats you'd have ended up with a probability distribution over w, and predictions would have given a probability distribution over the expected output. It is often useful to know your confidence; maximum likelihood can give some kinds of confidence too.
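The closed form above can be checked numerically. A minimal sketch in plain Python, using the five-point dataset from the slides:

```python
# Single-parameter linear regression, Out(x) = w*x: the maximum-likelihood w
# minimizes the sum of squared residuals, giving the closed form
# w = sum(x_i * y_i) / sum(x_i^2). Dataset taken from the slides.
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

def fit_w(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

w = fit_w(xs, ys)   # about 0.83
```

For this dataset the numerator is 26.85 and the denominator 32.25, so w ≈ 0.833.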
Multivariate Linear Regression

Multivariate Regression

What if the inputs are vectors? (Figure: a 2-d input example, with datapoints scattered in the (x1, x2) plane.) The dataset has the form x1 → y1, x2 → y2, …, xR → yR, where each input x_k is now a vector.
Multivariate Regression (con't)

Write the matrix X and vector Y thus:

  X = [ …x1… ]   [ x11 x12 … x1m ]      Y = [ y1 ]
      [ …x2… ] = [ x21 x22 … x2m ]          [ y2 ]
      [  :   ]   [  :           ]           [  : ]
      [ …xR… ]   [ xR1 xR2 … xRm ]          [ yR ]

(There are R datapoints. Each input has m components.)

The linear regression model assumes a vector w such that
  Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]
The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY).

IMPORTANT EXERCISE: PROVE IT!!!!!
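A minimal numerical sketch of the formula, with made-up noiseless data so the fit recovers the generating weights. Solving the normal equations directly is numerically preferable to forming the inverse (XᵀX)⁻¹ explicitly:

```python
import numpy as np

# Multivariate linear regression: Out(x) = w^T x. The maximum-likelihood w
# satisfies the normal equations (X^T X) w = (X^T Y); we solve the system
# rather than explicitly inverting X^T X.
# The data are synthetic and noiseless, so the fit recovers true_w exactly.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 1.0]])
true_w = np.array([2.0, -1.0])
y = X @ true_w

w = np.linalg.solve(X.T @ X, X.T @ y)
```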
Multivariate Regression (con't)

The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY).
• XᵀX is an m × m matrix: its (i,j)'th element is Σ_{k=1}^R x_ki x_kj
• XᵀY is an m-element vector: its i'th element is Σ_{k=1}^R x_ki y_k

Constant Term in Linear Regression
What about a constant term?

We may expect linear data that does not go through the origin. Statisticians and Neural Net folks all agree on a simple obvious hack. Can you guess?

The constant term

The trick is to create a fake input "X0" that always takes the value 1:

  Before:          After:
  X1  X2  Y        X0  X1  X2  Y
   2   4  16        1   2   4  16
   3   4  17        1   3   4  17
   5   5  20        1   5   5  20

Before: Y = w1 X1 + w2 X2 has to be a poor model. After: Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 has a fine constant term. In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
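The trick is one line of code: prepend a column of ones and run ordinary multivariate regression. A sketch using the slide's three-row table (by inspection Y = 10 + X1 + X2):

```python
import numpy as np

# The constant-term trick: prepend a fake input X0 that always takes the
# value 1, then run ordinary multivariate regression; the weight on X0 is
# the intercept. Data from the slide, where by inspection Y = 10 + X1 + X2.
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
y = np.array([16.0, 17.0, 20.0])

Z = np.column_stack([np.ones(len(X)), X])   # add the X0 = 1 column
w = np.linalg.solve(Z.T @ Z, Z.T @ y)       # w = [w0, w1, w2]
```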
Linear Regression with varying noise ("heteroscedasticity")

Regression with varying noise

Suppose you know the variance of the noise that was added to each datapoint:

  x_i   y_i   σ_i²
  ½     ½     4
  1     1     1
  2     1     ¼
  2     3     4
  3     2     ¼

Assume y_i ~ N(w x_i, σ_i²). What's the MLE of w?
MLE estimation with varying noise

argmax_w log p(y1, …, yR | x1, …, xR, σ1², …, σR², w)
= argmin_w Σ_{i=1}^R (y_i − w x_i)² / σ_i²    (assuming independence among the noise and then plugging in the equation for a Gaussian and simplifying)
= the w such that Σ_{i=1}^R x_i (y_i − w x_i) / σ_i² = 0    (setting dLL/dw equal to zero)
= (Σ_{i=1}^R x_i y_i / σ_i²) / (Σ_{i=1}^R x_i² / σ_i²)    (trivial algebra)

This is Weighted Regression

We are asking to minimize the weighted sum of squares
  argmin_w Σ_{i=1}^R (y_i − w x_i)² / σ_i²
where the weight for the i'th datapoint is 1/σ_i².
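The closed form from the derivation above, evaluated on the slide's five-point dataset:

```python
# Weighted (heteroscedastic) regression for the model y_i ~ N(w*x_i, sigma_i^2).
# The MLE is w = sum(x_i*y_i/sigma_i^2) / sum(x_i^2/sigma_i^2): each
# datapoint is weighted by 1/sigma_i^2. Dataset from the slide.
data = [  # (x_i, y_i, sigma_i^2)
    (0.5, 0.5, 4.0),
    (1.0, 1.0, 1.0),
    (2.0, 1.0, 0.25),
    (2.0, 3.0, 4.0),
    (3.0, 2.0, 0.25),
]

num = sum(x * y / s2 for x, y, s2 in data)
den = sum(x * x / s2 for x, y, s2 in data)
w = num / den
```

The low-variance points at x = 2 and x = 3 dominate the fit, exactly as the 1/σ² weighting intends.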
Non-linear Regression

Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g. assume y_i ~ N(√(w + x_i), σ²):

  x_i   y_i
  ½     ½
  1     2.5
  2     3
  3     2
  3     3

What's the MLE of w?
Non-linear MLE estimation

argmax_w log p(y1, …, yR | x1, …, xR, σ, w)
= argmin_w Σ_{i=1}^R (y_i − √(w + x_i))²    (assuming i.i.d. and then plugging in the equation for a Gaussian and simplifying)
= the w such that Σ_{i=1}^R (y_i − √(w + x_i)) / √(w + x_i) = 0    (setting dLL/dw equal to zero)

We're down the algebraic toilet. So guess what we do?
Non-linear MLE estimation (con't)

Common (but not only) approach: numerical solutions, e.g.
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

Polynomial Regression
Polynomial Regression

So far we've mainly been dealing with linear regression:

  X1  X2  Y
   3   2  7      x1 = (3,2), y1 = 7, …
   1   1  3
   :   :  :

Form Z with rows z_k = (1, x_k1, x_k2), then β = (ZᵀZ)⁻¹(Zᵀy) and y_est = β0 + β1 x1 + β2 x2.

Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions. Take z = (1, x1, x2, x1², x1 x2, x2²), so e.g. z1 = (1, 3, 2, 9, 6, 4). Then β = (ZᵀZ)⁻¹(Zᵀy) and
  y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2²
Quadratic Regression (con't)

Each component of the z vector is called a term. Each column of the Z matrix is called a term column. How many terms in a quadratic regression with m inputs?
• 1 constant term
• m linear terms
• (m+1)-choose-2 = m(m+1)/2 quadratic terms

That's (m+2)-choose-2 terms in total = O(m²). Note that solving β = (ZᵀZ)⁻¹(Zᵀy) is thus O(m⁶).

Qth-degree polynomial Regression

Same trick with z = (all products of powers of the inputs in which the sum of powers is q or less), giving y_est = β0 + β1 x1 + …
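The quadratic-basis construction can be sketched directly. The data here are synthetic and noiseless, generated from known coefficients, so least squares recovers them exactly:

```python
import numpy as np

# Quadratic regression as a linear fit of fixed nonlinear basis functions:
# z = (1, x1, x2, x1^2, x1*x2, x2^2), then beta = (Z^T Z)^-1 (Z^T y),
# solved here via least squares. Synthetic noiseless data, so the fitted
# coefficients match the generating ones.
def quad_features(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1*x2, x2**2])

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(20, 2))
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -0.5])
y = quad_features(X) @ beta_true

beta, *_ = np.linalg.lstsq(quad_features(X), y, rcond=None)
```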
m inputs, degree Q: how many terms?

= the number of unique terms of the form x1^q1 x2^q2 … xm^qm where Σ_{i=1}^m q_i ≤ Q
= the number of unique terms of the form 1^q0 x1^q1 x2^q2 … xm^qm where Σ_{i=0}^m q_i = Q
= the number of lists of non-negative integers [q0, q1, q2, …, qm] in which Σ q_i = Q
= the number of ways of placing Q red disks on a row of squares of length Q+m
= (Q+m)-choose-Q

Example with Q=11, m=4: q0=2, q1=2, q2=0, q3=4, q4=3.

Radial Basis Functions
Radial Basis Functions (RBFs)

Same setup as before, but now z = (list of radial basis function evaluations), and again β = (ZᵀZ)⁻¹(Zᵀy).

1-d RBFs

(Figure: three bumps centered at c1, c2, c3 along the x axis.)
y_est = β1 φ1(x) + β2 φ2(x) + β3 φ3(x), where φ_i(x) = KernelFunction(|x − c_i| / KW).
Example

y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x), where φ_i(x) = KernelFunction(|x − c_i| / KW).

RBFs with Linear Regression

All the c_i's are held constant (initialized randomly, or on a grid in m-dimensional input space). KW is also held constant (initialized to be large enough that there's decent overlap between basis functions; usually much better than the crappy overlap on my diagram).
RBFs with Linear Regression (con't)

Given Q basis functions, define the matrix Z such that Z_kj = KernelFunction(|x_k − c_j| / KW), where x_k is the kth vector of inputs. And as before, β = (ZᵀZ)⁻¹(Zᵀy).

RBFs with NonLinear Regression

Now allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space), and allow KW to adapt to the data too. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.) But how do we now find all the β_j's, c_i's and KW?
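The fixed-center, fixed-width case is ordinary linear regression on the Z matrix. A sketch using a Gaussian kernel (one common choice of KernelFunction; the slides do not fix one), with y generated from known weights so least squares recovers them:

```python
import numpy as np

# RBFs with linear regression: the centers c_i and kernel width KW are held
# fixed, phi_i(x) = KernelFunction(|x - c_i| / KW), and the weights beta
# come from least squares on Z with Z_ki = phi_i(x_k).
# A Gaussian kernel is used here as an illustrative choice.
def rbf_features(x, centers, kw):
    d = np.abs(x[:, None] - centers[None, :]) / kw
    return np.exp(-d ** 2)

centers = np.linspace(0.0, 1.0, 5)   # c_i on a grid
x = np.linspace(0.0, 1.0, 40)        # 1-d inputs
Z = rbf_features(x, centers, kw=0.3)

# Sanity check: generate y from known weights, then recover them by LSQ.
beta_true = np.array([1.0, -0.5, 2.0, 0.0, 1.5])
y = Z @ beta_true
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

KW = 0.3 against a knot spacing of 0.25 gives the "decent overlap" the slide asks for.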
RBFs with NonLinear Regression (con't)

Answer: Gradient Descent. (But I'd like to see, or hope someone's already done, a hybrid, where the c_i's and KW are updated with gradient descent while the β_j's use matrix inversion.)
Radial Basis Functions in 2-d

Two inputs; outputs (heights sticking out of the page) not shown. (Figure: each center has a sphere of significant influence in the (x1, x2) plane.)

Happy RBFs in 2-d

(Figure: blue dots denote coordinates of input vectors; the centers' spheres of influence cover them well.)
Crabby RBFs in 2-d

What's the problem in this example? (Figure: blue dots denote coordinates of input vectors; much of the data falls outside the centers' spheres of significant influence.)

More crabby RBFs

And what's the problem in this example? (Figure: another mismatch between the data and the centers' coverage.)
Hopeless!

Even before seeing the data, you should understand that this is a disaster! (Figure: a hopeless arrangement of centers and kernel widths.)

Unhappy

Even before seeing the data, you should understand that this isn't good either. (Figure: another poor arrangement of centers and kernel widths.)
Robust Regression

(Figure: a 1-d dataset, mostly lying along a curve but with a few gross outliers.)
Robust Regression (con't)

This is the best fit that Quadratic Regression can manage (figure: the fitted quadratic is dragged toward the outliers)… but this is what we'd probably prefer (figure: a fit that follows the bulk of the data and ignores the outliers).
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted: "You are a very good datapoint." "You are not too shabby."
LOESS-based Robust Regression (con't)

"But you are pathetic."

Robust Regression algorithm. For k = 1 to R:
• Let (x_k, y_k) be the kth datapoint.
• Let y_est_k be the predicted value of y_k.
• Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:
  w_k = KernelFn([y_k − y_est_k]²)
Robust Regression (con't)

Then redo the regression using weighted datapoints. Weighted regression was described earlier in the "varying noise" section, and is also discussed in the "Memory-based Learning" lecture. (I taught you how to do this in the "Instance-based" lecture; only there the weights depended on distance in input-space.) Guess what happens next? Repeat the whole thing until converged!
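The fit-score-reweight-refit loop above can be sketched in a few lines. The Gaussian kernel and the scale constant are illustrative choices (the slides leave KernelFn unspecified), and a straight-line model stands in for the quadratic:

```python
import numpy as np

# LOESS-style robust regression: fit, score each datapoint by its residual,
# downweight badly fitted points with a kernel, refit with weighted least
# squares, and repeat until converged. Kernel and scale are illustrative
# choices, not prescribed by the slides.
def robust_fit(x, y, iters=10, scale=1.0):
    Z = np.column_stack([np.ones_like(x), x])   # straight-line model
    w = np.ones_like(y)                          # start with uniform weights
    for _ in range(iters):
        W = np.diag(w)
        beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)  # weighted LSQ
        resid = y - Z @ beta
        w = np.exp(-(resid / scale) ** 2)        # w_k = KernelFn([y_k - yest_k]^2)
    return beta

x = np.linspace(0.0, 1.0, 21)
y = 2.0 * x + 1.0
y[3] = 20.0                      # one gross outlier
beta = robust_fit(x, y)          # recovers intercept 1, slope 2
```

After one reweighting pass the outlier's weight is effectively zero and the fit snaps onto the inliers.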
Robust Regression---what we're doing

What regular regression does: assume y_k was originally generated using the following recipe:
  y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)
The computational task is to find the Maximum Likelihood β0, β1 and β2.

What LOESS robust regression does: assume y_k was originally generated using the following recipe:
  With probability p:  y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²)
  But otherwise:       y_k ~ N(µ, σ_huge²)
The computational task is to find the Maximum Likelihood β0, β1, β2, p, µ and σ_huge.
Robust Regression---what we're doing (con't)

Mysteriously, the reweighting procedure does this computation for us. Your first glimpse of two spectacular letters: E.M.

Regression Trees
Regression Trees

• "Decision trees for regression"

A regression tree leaf

Predict age = 47 (the mean age of records matching this leaf node).
A one-split regression tree

Gender?
• Female: Predict age = 39
• Male: Predict age = 36

Choosing the attribute to split on

  Gender  Rich?  Num.Children  Num.BeanyBabies  Age
  Female  No     2             1                38
  Male    No     0             0                24
  Male    Yes    0             5+               72
  :       :      :             :                :

• We can't use information gain.
• What should we use?
Choosing the attribute to split on (con't)

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value. If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity µ_{y|x=j}. Then:

  MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} (y_k − µ_{y|x=j})²

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X). Guess what we do about real-valued inputs? Guess how we prevent overfitting.
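The split score is easy to compute directly. A sketch on a toy table in the spirit of the slide's (one hypothetical row added so every attribute value appears at least twice):

```python
from collections import defaultdict

# Split scoring for regression trees: MSE(Y|X) is the expected squared error
# when predicting Y from X alone, predicting mu_{y|x=j} whenever x = j.
# The records are illustrative toy data, not the slide's full table.
def mse_given_x(records, attr, target="age"):
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attr]].append(rec[target])
    total = 0.0
    for ys in groups.values():
        mu = sum(ys) / len(ys)                   # mu_{y|x=j}
        total += sum((y - mu) ** 2 for y in ys)
    return total / len(records)

records = [
    {"gender": "Female", "rich": "No",  "age": 38},
    {"gender": "Male",   "rich": "No",  "age": 24},
    {"gender": "Male",   "rich": "Yes", "age": 72},
    {"gender": "Female", "rich": "Yes", "age": 70},  # hypothetical extra row
]

scores = {a: mse_given_x(records, a) for a in ("gender", "rich")}
best = min(scores, key=scores.get)               # greedy split choice
```

Here splitting on Rich? separates the ages far better than splitting on Gender, so it wins the greedy selection.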
Pruning Decision

…property-owner = Yes. Gender? Female: Predict age = 39; Male: Predict age = 36. Do I deserve to live?

  # property-owning females = 56712    # property-owning males = 55800
  Mean age among POFs = 39             Mean age among POMs = 36
  Age std dev among POFs = 12          Age std dev among POMs = 11.5

Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.

Linear Regression Trees

Also known as "Model Trees". …property-owner = Yes. Gender?
• Female: Predict age = 26 + 6 * NumChildren − 2 * YearsEducation
• Male: Predict age = 24 + 7 * NumChildren − 2.5 * YearsEducation

Leaves contain linear functions (trained using linear regression on all records matching that leaf). The split attribute is chosen to minimize the MSE of the regressed children. Pruning uses a different Chi-squared test.
Linear Regression Trees (con't)

Detail: you typically ignore any categorical attribute that has already been tested higher up in the tree, but you keep using the real-valued attributes in the leaf regressions, even if they've been tested above.

Test your understanding

Assuming regular regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram? (Figure: a 1-d scatter of datapoints.)
Test your understanding (con't)

Assuming linear regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?

Multilinear Interpolation
Multilinear Interpolation

Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data. Create a set of knot points: selected x-coordinates (usually equally spaced) that cover the data, q1, q2, q3, q4, q5.
Multilinear Interpolation (con't)

We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. (Figure: 3 example functions; none fits the data well.)

How to find the best fit?

Idea 1: simply perform a separate regression in each segment, for each part of the curve. What's the problem with this idea?
How to find the best fit? (con't)

Let's look at what goes on in the red segment [q2, q3], where w = q3 − q2 and h2, h3 are the function's heights at q2 and q3:
  y_est(x) = ((q3 − x)/w) h2 + ((x − q2)/w) h3
Equivalently, in the red segment:
  y_est(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 − (x − q2)/w and φ3(x) = 1 − (q3 − x)/w
How to find the best fit? (con't)

The same thing written with absolute values, valid on either side of each knot:
  y_est(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 − |x − q2|/w and φ3(x) = 1 − |x − q3|/w
How to find the best fit? (con't)

In general:
  y_est(x) = Σ_{i=1}^{N_K} h_i φ_i(x), where φ_i(x) = 1 − |x − q_i|/w if |x − q_i| < w, and 0 otherwise
How to find the best fit? (con't)

And this is simply a basis function regression problem! We know how to find the least squares h_i's.

In two dimensions…

(Figure: blue dots show locations of input vectors; outputs not depicted.)
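The 1-d triangular ("hat") basis can be sketched directly. Knots and heights below are made up for illustration; in practice the h_i come from the least-squares fit just described:

```python
# Multilinear interpolation in 1-d as basis-function regression: triangular
# hat functions phi_i(x) = 1 - |x - q_i|/w for |x - q_i| < w (else 0), one
# per knot. With heights h_i, y_est(x) = sum_i h_i * phi_i(x) linearly
# interpolates between neighbouring knot heights.
def hat(x, q, w):
    return max(0.0, 1.0 - abs(x - q) / w)

knots = [0.0, 1.0, 2.0]      # illustrative knot locations q_i
heights = [0.0, 2.0, 1.0]    # illustrative knot heights h_i
w = 1.0                       # knot spacing

def predict(x):
    return sum(h * hat(x, q, w) for q, h in zip(knots, heights))
```

At a knot the prediction equals that knot's height; between knots it is the straight line joining the two neighbouring heights.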
In two dimensions… (con't)

Each purple dot is a knot point. It will contain the height of the estimated surface. (Figure: knot heights 9, 3, 7, 8 at the corners of a grid cell.) But how do we do the interpolation to ensure that the surface is continuous?
In two dimensions… (con't)

To predict the value at a query point, first interpolate its value on two opposite edges of the surrounding cell (7.33 on one edge in the figure)…
In two dimensions… (con't)

…then interpolate between those two edge values to get the prediction (7.05 in the figure).

Notes:
• This can easily be generalized to m dimensions.
• It should be easy to see that it ensures continuity.
• The patches are not linear.
Doing the regression

Given the data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.) What's the problem in higher dimensions?

MARS: Multivariate Adaptive Regression Splines
MARS

• Multivariate Adaptive Regression Splines
• Invented by Jerry Friedman (one of Andrew's heroes)
• Simplest version: let's assume the function we are learning is of the following form:

  y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs. Idea: each g_k is one of the knot-based piecewise-linear functions from the previous section.
MARS (con't)

  y_est(x) = Σ_{k=1}^m Σ_{j=1}^{N_K} h_j^k φ_j^k(x_k)
  where φ_j^k(x) = 1 − |x_k − q_j^k| / w_k if |x_k − q_j^k| < w_k, and 0 otherwise

• q_j^k: the location of the j'th knot in the k'th dimension
• h_j^k: the regressed height of the j'th knot in the k'th dimension
• w_k: the spacing between knots in the k'th dimension

That's not complicated enough!

Okay, now let's get serious. We'll allow arbitrary "two-way interactions":

  y_est(x) = Σ_{k=1}^m g_k(x_k) + Σ_{k=1}^m Σ_{t=k+1}^m g_kt(x_k, x_t)

The function we're learning is allowed to be a sum of non-linear functions over all one-d and 2-d subsets of attributes. It can still be expressed as a linear combination of basis functions, and is thus learnable by linear regression. Full MARS: uses cross-validation to choose a subset of subspaces, knot resolution and other parameters.
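The simplest additive version is again just linear regression on a stacked basis. A sketch with synthetic data generated from the model itself: the per-dimension hats each sum to one, so the individual coefficients are not identifiable, but the fitted predictions reproduce y exactly:

```python
import numpy as np

# Simplest MARS-style additive model: y_est(x) = sum_k g_k(x_k), each g_k a
# piecewise-linear function built from triangular hat basis functions on a
# per-dimension grid of knots. Fitting is plain linear regression on the
# stacked basis matrix Z. All data here are synthetic, for illustration.
def hats(v, knots, w):
    return np.maximum(0.0, 1.0 - np.abs(v[:, None] - knots[None, :]) / w)

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(60, 2))
knots = np.linspace(0.0, 1.0, 4)          # same knot grid in each dimension
w = knots[1] - knots[0]                   # knot spacing

Z = np.hstack([hats(X[:, k], knots, w) for k in range(2)])  # 60 x 8
beta_true = rng.normal(size=Z.shape[1])
y = Z @ beta_true                         # generated from the model itself

beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred = Z @ beta
```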
If you like MARS…

…see also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).
• Many of the same gut-level intuitions
• But entirely in a neural-network, biologically plausible way
• (All the low dimensional functions are by means of lookup tables, trained with a delta-rule and using a clever blurred update and hash-tables)

Where are we now?

• Inputs → Inference Engine → p(E1|E2): Joint DE
• Inputs → Classifier → predicted category: Dec Tree, Gauss/Joint BC, Gauss Naïve BC
• Inputs → Density Estimator → probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE
• Inputs → Regressor → predicted real no.: Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS
Citations

Radial Basis Functions: T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, 247, 978-982, 1989.

LOESS: W. S. Cleveland, Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74(368), 829-836, December 1979.

Regression Trees etc.: L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984; J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine Learning: Proceedings of the Tenth International Conference, 1993.

MARS: J. H. Friedman, Multivariate Adaptive Regression Splines, Department of Statistics, Stanford University, Technical Report No. 102, 1988.