Predicting Real-valued Outputs: An introduction to regression
Transcript

  • 1. Predicting Real-valued Outputs: an introduction to Regression
    Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
    www.cs.cmu.edu/~awm — awm@cs.cmu.edu — 412-268-7599
    Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
    (This lecture is re-ordered material from the "Favorite Regression Algorithms" lecture and the Neural Nets lecture.)
    Copyright © 2001, 2003, Andrew W. Moore
    Single-Parameter Linear Regression
  • 2. Linear Regression
    DATASET: inputs x1 = 1, x2 = 3, x3 = 2, x4 = 1.5, x5 = 4; outputs y1 = 1, y2 = 2.2, y3 = 2, y4 = 1.9, y5 = 3.1.
    Linear regression assumes that the expected value of the output given an input, E[y|x], is linear. Simplest case: Out(x) = wx for some unknown w. Given the data, we can estimate w.
    1-parameter linear regression
    Assume that the data is formed by yi = w xi + noise_i, where the noise terms are independent and normally distributed with mean 0 and unknown variance σ². Then p(y|w,x) is a normal distribution with mean wx and variance σ².
  • 3. Bayesian Linear Regression
    p(y|w,x) = Normal(mean wx, variance σ²). We have a set of datapoints (x1,y1), (x2,y2), …, (xn,yn) which are EVIDENCE about w. We want to infer w from the data: p(w | x1, x2, …, xn, y1, y2, …, yn).
    • You can use BAYES rule to work out a posterior distribution for w given the data.
    • Or you could do Maximum Likelihood Estimation.
    Maximum likelihood estimation of w
    Asks the question: "For which value of w is this data most likely to have happened?"
    ⇔ For what w is p(y1, y2, …, yn | x1, x2, …, xn, w) maximized?
    ⇔ For what w is ∏_{i=1}^n p(yi | w, xi) maximized?
  • 4. For what w is ∏_{i=1}^n p(yi | w, xi) maximized?
    ⇔ For what w is ∏_{i=1}^n exp(−½ ((yi − w xi)/σ)²) maximized?
    ⇔ For what w is ∑_{i=1}^n −½ ((yi − w xi)/σ)² maximized?
    ⇔ For what w is ∑_{i=1}^n (yi − w xi)² minimized?
    Linear Regression
    The maximum likelihood w is the one that minimizes the sum of squares of residuals:
    E = ∑_i (yi − w xi)² = ∑_i yi² − (2 ∑_i xi yi) w + (∑_i xi²) w²
    We want to minimize a quadratic function of w.
  • 5. Linear Regression
    It is easy to show that the sum of squares is minimized when
    w = (∑_i xi yi) / (∑_i xi²)
    The maximum likelihood model is Out(x) = wx, and we can use it for prediction.
    Note: in Bayesian stats you'd have ended up with a probability distribution over w, and predictions would have given a probability distribution over the expected output. It is often useful to know your confidence; maximum likelihood can give some kinds of confidence too.
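As a quick illustration, here is a minimal NumPy sketch (not from the slides) that computes this closed-form single-parameter estimate on the dataset above.

```python
import numpy as np

# Dataset from the slides (single-input, no-intercept case)
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum likelihood estimate for Out(x) = w*x:
# w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)
print(f"w = {w:.4f}")                 # slope of the fitted line through the origin
print("prediction at x=2.5:", w * 2.5)
```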
  • 6. Multivariate Linear Regression
    Multivariate Regression
    What if the inputs are vectors? (2-d input example: each datapoint is a point in the (x1, x2) plane with an associated output.)
    The dataset has the form: x1 → y1, x2 → y2, x3 → y3, …, xR → yR, where each xk is an input vector and yk is a scalar output.
  • 7. Multivariate Regression
    Write the matrix X and the vector Y thus:

    X = [ x11 x12 … x1m ]      Y = [ y1 ]
        [ x21 x22 … x2m ]          [ y2 ]
        [  ⋮    ⋮      ⋮  ]          [  ⋮ ]
        [ xR1 xR2 … xRm ]          [ yR ]

    (there are R datapoints; each input has m components). The linear regression model assumes a vector w such that
    Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]
    The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY).
    IMPORTANT EXERCISE: PROVE IT !!!!!
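For concreteness, a small sketch (my own, not from the slides) of this normal-equations solution; it uses np.linalg.solve rather than forming the inverse explicitly, which is numerically preferable.

```python
import numpy as np

def fit_linear(X, Y):
    """Maximum-likelihood weights for Out(x) = w^T x (no constant term).

    Solves the normal equations (X^T X) w = X^T Y.
    X is an (R, m) matrix of inputs, Y an (R,) vector of outputs.
    """
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Tiny synthetic check: recover known weights from noiseless data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w
print(fit_linear(X, Y))   # ~ [ 2.  -1.   0.5]
```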
  • 8. Multivariate Regression (con't)
    The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY).
    XᵀX is an m × m matrix whose (i,j)'th element is ∑_{k=1}^R xki xkj.
    XᵀY is an m-element vector whose i'th element is ∑_{k=1}^R xki yk.
    Constant Term in Linear Regression
  • 9. What about a constant term?
    We may expect linear data that does not go through the origin. Statisticians and Neural Net folks all agree on a simple, obvious hack. Can you guess??
    The constant term
    • The trick is to create a fake input "X0" that always takes the value 1.

    Before:            After:
    X1  X2  Y          X0  X1  X2  Y
     2   4  16          1   2   4  16
     3   4  17          1   3   4  17
     5   5  20          1   5   5  20

    Before: Y = w1 X1 + w2 X2 …has to be a poor model.
    After: Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 …has a fine constant term.
    In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
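A short sketch (mine, not from the slides) of the same trick on the table above: prepend a column of ones and reuse the normal equations.

```python
import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# The "fake input X0 = 1" trick: prepend a column of ones
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.linalg.solve(X1.T @ X1, X1.T @ Y)
print(w)   # [w0, w1, w2]; for this table the fit is exact: Y = 10 + 1*X1 + 1*X2
```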
  • 10. Linear Regression with varying noise (heteroscedasticity)
    Regression with varying noise
    • Suppose you know the variance of the noise that was added to each datapoint:

      xi   yi   σi²
      ½    ½    4
      1    1    1
      2    1    1/4
      2    3    4
      3    2    1/4

    Assume yi ~ N(w xi, σi²). What's the MLE estimate of w?
  • 11. MLE estimation with varying noise
    argmax_w log p(y1, …, yR | x1, …, xR, σ1², …, σR², w)
    = argmin_w ∑_{i=1}^R (yi − w xi)² / σi²     (assuming independence among the noise terms, then plugging in the equation for the Gaussian and simplifying)
    = the w such that ∑_{i=1}^R xi (yi − w xi) / σi² = 0     (setting dLL/dw equal to zero)
    = ( ∑_{i=1}^R xi yi / σi² ) / ( ∑_{i=1}^R xi² / σi² )     (trivial algebra)
    This is Weighted Regression
    • We are asking to minimize the weighted sum of squares argmin_w ∑_{i=1}^R (yi − w xi)² / σi², where the weight for the i'th datapoint is 1/σi².
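A minimal sketch (not from the slides) of this closed-form weighted estimate on the small table above.

```python
import numpy as np

x      = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y      = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
sigma2 = np.array([4.0, 1.0, 0.25, 4.0, 0.25])   # known noise variance per datapoint

# Weighted MLE for y_i ~ N(w*x_i, sigma_i^2):
# w = sum(x_i*y_i/sigma_i^2) / sum(x_i^2/sigma_i^2)
w = np.sum(x * y / sigma2) / np.sum(x ** 2 / sigma2)
print(f"weighted MLE w = {w:.4f}")
```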
  • 12. Non-linear Regression
    • Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

      xi   yi
      ½    ½
      1    2.5
      2    3
      3    2
      3    3

    Assume yi ~ N(√(w + xi), σ²). What's the MLE estimate of w?
  • 13. Non-linear MLE estimation
    argmax_w log p(y1, …, yR | x1, …, xR, σ, w)
    = argmin_w ∑_{i=1}^R (yi − √(w + xi))²     (assuming i.i.d. noise, then plugging in the equation for the Gaussian and simplifying)
    = the w such that ∑_{i=1}^R (yi − √(w + xi)) / √(w + xi) = 0     (setting dLL/dw equal to zero)
    We're down the algebraic toilet. So guess what we do?
  • 14. Non-linear MLE estimation
    argmax_w log p(y1, …, yR | x1, …, xR, σ, w)
    = argmin_w ∑_{i=1}^R (yi − √(w + xi))²
    = the w such that ∑_{i=1}^R (yi − √(w + xi)) / √(w + xi) = 0
    We're down the algebraic toilet. So guess what we do?
    Common (but not the only) approach: numerical solutions, e.g.
    • Line Search
    • Simulated Annealing
    • Gradient Descent
    • Conjugate Gradient
    • Levenberg-Marquardt
    • Newton's Method
    Also, special-purpose statistical-optimization tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction). A gradient-descent sketch follows below.
    Polynomial Regression
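To make the "numerical solutions" idea concrete, here is a small gradient-descent sketch (my own, with an arbitrary starting point and learning rate) for the √(w + x) example above; it minimizes the sum of squared residuals directly.

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):
    return np.sum((y - np.sqrt(w + x)) ** 2)

def grad(w):
    # d/dw of sum (y_i - sqrt(w + x_i))^2 is proportional to -sum (y_i - sqrt(w+x_i)) / sqrt(w+x_i)
    r = np.sqrt(w + x)
    return -np.sum((y - r) / r)

w, lr = 1.0, 0.05          # arbitrary starting point and learning rate
for _ in range(500):
    w -= lr * grad(w)
print(f"w ≈ {w:.3f}, SSE = {sse(w):.3f}")
```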
  • 15. Polynomial Regression
    So far we've mainly been dealing with linear regression:
    X = [3 2; 1 1; …],  y = [7; 3; …]     (x1 = (3,2), y1 = 7, …)
    Z = [1 3 2; 1 1 1; …]     (zk = (1, xk1, xk2))
    β = (ZᵀZ)⁻¹(Zᵀy),   yest = β0 + β1 x1 + β2 x2
    Quadratic Regression
    It's trivial to do linear fits of fixed nonlinear basis functions:
    Z = [1 3 2 9 6 4; 1 1 1 1 1 1; …]     (z = (1, x1, x2, x1², x1 x2, x2²))
    β = (ZᵀZ)⁻¹(Zᵀy),   yest = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2²
  • 16. Quadratic Regression
    Each component of a z vector is called a term. Each column of the Z matrix is called a term column.
    How many terms in a quadratic regression with m inputs?
    • 1 constant term
    • m linear terms
    • (m+1)-choose-2 = m(m+1)/2 quadratic terms
    (m+2)-choose-2 terms in total = O(m²).
    Note that solving β = (ZᵀZ)⁻¹(Zᵀy) is thus O(m⁶).
    Qth-degree polynomial Regression
    z = (all products of powers of inputs in which the sum of powers is Q or less)
    β = (ZᵀZ)⁻¹(Zᵀy),   yest = β0 + β1 x1 + …
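A small sketch (not from the slides) of the quadratic basis expansion for 2-d inputs; the column ordering mirrors the z vector above.

```python
import numpy as np

def quadratic_terms(X):
    """Map each 2-d input (x1, x2) to z = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1*x2, x2**2])

# Fit beta = (Z^T Z)^{-1} (Z^T y) on some synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 1 + 2*X[:, 0] - X[:, 1] + 0.5*X[:, 0]**2 + rng.normal(scale=0.01, size=40)

Z = quadratic_terms(X)
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(np.round(beta, 2))   # ≈ [1, 2, -1, 0.5, 0, 0]
```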
  • 17. m inputs, degree Q: how many terms?
    = the number of unique terms of the form x1^q1 x2^q2 … xm^qm where ∑_{i=1}^m qi ≤ Q
    = the number of unique terms of the form 1^q0 x1^q1 x2^q2 … xm^qm where ∑_{i=0}^m qi = Q
    = the number of lists of non-negative integers [q0, q1, q2, …, qm] in which ∑ qi = Q
    = the number of ways of placing Q red disks on a row of squares of length Q+m
    = (Q+m)-choose-Q
    Example: Q = 11, m = 4: q0 = 2, q1 = 2, q2 = 0, q3 = 4, q4 = 3.
    Radial Basis Functions
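A quick check of this count in Python (mine, not from the slides): enumerate the exponent lists directly and compare with the binomial formula.

```python
from itertools import product
from math import comb

def num_terms(m, Q):
    """Count monomials x1^q1 ... xm^qm with q1 + ... + qm <= Q by brute force."""
    return sum(1 for qs in product(range(Q + 1), repeat=m) if sum(qs) <= Q)

m, Q = 4, 3
print(num_terms(m, Q), comb(Q + m, Q))   # both print 35
```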
  • 18. Radial Basis Functions (RBFs)
    As before: Z = (list of radial basis function evaluations for each datapoint), β = (ZᵀZ)⁻¹(Zᵀy), yest = β0 + β1 x1 + …
    1-d RBFs
    With three basis functions centred at c1, c2, c3:
    yest = β1 φ1(x) + β2 φ2(x) + β3 φ3(x), where φi(x) = KernelFunction( |x − ci| / KW )
  • 19. Example
    yest = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x), where φi(x) = KernelFunction( |x − ci| / KW ).
    RBFs with Linear Regression
    All the ci's are held constant (initialized randomly or on a grid in m-dimensional input space). KW is also held constant (initialized to be large enough that there's decent overlap between basis functions — usually much better than the crappy overlap on my diagram).
    yest = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x), where φi(x) = KernelFunction( |x − ci| / KW ).
  • 20. RBFs with Linear Regression (continued)
    With all ci's and KW held constant, and given Q basis functions, define the matrix Z such that Zkj = KernelFunction( |xk − cj| / KW ), where xk is the kth vector of inputs. And as before, β = (ZᵀZ)⁻¹(Zᵀy).
    RBFs with NonLinear Regression
    Now allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space), and allow KW to adapt to the data too. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)
    But how do we now find all the βj's, ci's and KW?
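A compact sketch (my own; it assumes a Gaussian kernel, which the slides leave unspecified) of RBF regression with fixed centres and fixed kernel width.

```python
import numpy as np

def rbf_design(X, centers, kw):
    """Z[k, j] = KernelFunction(|x_k - c_j| / kw), here a Gaussian kernel."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-0.5 * (d / kw) ** 2)

# 1-d toy problem: noisy sine
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

centers = np.linspace(0, 10, 12)[:, None]   # fixed grid of centres
kw = 1.0                                     # fixed kernel width, wide enough to overlap

Z = rbf_design(X, centers, kw)
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)     # beta = (Z^T Z)^{-1} (Z^T y)
print("fit at x=5:", rbf_design(np.array([[5.0]]), centers, kw) @ beta)
```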
  • 21. RBFs with NonLinear Regression (continued)
    But how do we now find all the βj's, ci's and KW? Answer: Gradient Descent.
    (But I'd like to see, or hope someone's already done, a hybrid, where the ci's and KW are updated with gradient descent while the βj's use matrix inversion.)
  • 22. Radial Basis Functions in 2-d
    Two inputs; outputs (heights sticking out of the page) not shown. Each centre has a sphere of significant influence.
    Happy RBFs in 2-d
    Blue dots denote coordinates of input vectors; here the centres cover the data well.
  • 23. Crabby RBFs in 2-d
    What's the problem in this example? (Blue dots denote coordinates of input vectors; each centre has a sphere of significant influence.)
    More crabby RBFs
    And what's the problem in this example?
  • 24. Hopeless!
    Even before seeing the data, you should understand that this is a disaster!
    Unhappy
    Even before seeing the data, you should understand that this isn't good either.
  • 25. Robust Regression
    (Scatterplot of the example dataset: y against x.)
  • 26. Robust Regression
    This is the best fit that Quadratic Regression can manage…
    …but this is what we'd probably prefer.
  • 27. LOESS-based Robust Regression
    After the initial fit, score each datapoint according to how well it's fitted…
    "You are a very good datapoint." "You are not too shabby."
  • 28. LOESS-based Robust Regression
    After the initial fit, score each datapoint according to how well it's fitted…
    "You are a very good datapoint." "You are not too shabby." "But you are pathetic."
    Robust Regression
    For k = 1 to R…
    • Let (xk, yk) be the kth datapoint.
    • Let yestk be the predicted value of yk.
    • Let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:
      wk = KernelFn([yk − yestk]²)
  • 29. Robust Regression (continued)
    Then redo the regression using weighted datapoints. Weighted regression was described earlier in the "varying noise" section, and is also discussed in the "Memory-based Learning" lecture.
    Guess what happens next? I taught you how to do this in the "Instance-based" lecture (only then the weights depended on distance in input-space).
    Repeat the whole thing until converged!
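The slides leave KernelFn unspecified; the sketch below (mine) uses a Gaussian-shaped weight with an arbitrary scale, and quadratic regression as the base fit, to show the iterative reweighting loop.

```python
import numpy as np

def weighted_polyfit(x, y, w, degree=2):
    """Weighted least squares for a 1-d polynomial: minimize sum_k w_k (y_k - Z_k beta)^2."""
    Z = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    W = np.diag(w)
    return np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)

def robust_fit(x, y, n_iters=10, scale=1.0):
    """Iteratively reweighted fit: downweight points with large residuals."""
    w = np.ones_like(y)
    for _ in range(n_iters):
        beta = weighted_polyfit(x, y, w)
        resid = y - np.vander(x, 3, increasing=True) @ beta
        w = np.exp(-(resid / scale) ** 2)           # assumed Gaussian-style KernelFn
    return beta

# Toy data: a quadratic trend plus two gross outliers
x = np.linspace(0, 4, 30)
y = 1 + 0.5 * x + 0.25 * x**2
y[5] += 8.0
y[20] -= 6.0
print(np.round(robust_fit(x, y), 3))   # close to [1, 0.5, 0.25]
```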
  • 30. Robust Regression — what we're doing
    What regular regression does: assume yk was originally generated using the following recipe:
    yk = β0 + β1 xk + β2 xk² + N(0, σ²)
    The computational task is to find the Maximum Likelihood β0, β1 and β2.
    What LOESS robust regression does: assume yk was originally generated using the following recipe:
    With probability p: yk = β0 + β1 xk + β2 xk² + N(0, σ²)
    But otherwise: yk ~ N(µ, σhuge²)
    The computational task is to find the Maximum Likelihood β0, β1, β2, p, µ and σhuge.
  • 31. Robust Regression — what we're doing
    Mysteriously, the reweighting procedure does this computation for us.
    Your first glimpse of two spectacular letters: E.M.
    Regression Trees
  • 32. Regression Trees
    • "Decision trees for regression"
    A regression tree leaf: Predict age = 47 (the mean age of records matching this leaf node).
  • 33. A one-split regression tree
    Gender? Female → Predict age = 39. Male → Predict age = 36.
    Choosing the attribute to split on

    Gender   Rich?   Num. Children   Num. Beany Babies   Age
    Female   No      2               1                   38
    Male     No      0               0                   24
    Male     Yes     0               5+                  72
    :        :       :               :                   :

    • We can't use information gain.
    • What should we use?
  • 34. Choosing the attribute to split on
    MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.
    If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity µ_{y|x=j}. Then
    MSE(Y|X) = (1/R) ∑_{j=1}^{N_X} ∑_{k such that xk = j} (yk − µ_{y|x=j})²
    Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).
    Guess what we do about real-valued inputs? Guess how we prevent overfitting.
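A small sketch (mine, on a made-up toy table) of this selection rule for categorical attributes: compute MSE(Y|X) for each candidate attribute and pick the smallest.

```python
import numpy as np

def mse_given_attribute(x, y):
    """MSE(Y|X) = (1/R) * sum over values j of sum over records with x=j of (y - mean_j)^2."""
    total = 0.0
    for j in np.unique(x):
        yj = y[x == j]
        total += np.sum((yj - yj.mean()) ** 2)
    return total / len(y)

# Toy table: which attribute best predicts Age?
gender = np.array(["F", "M", "M", "F", "M", "F"])
rich   = np.array(["N", "N", "Y", "N", "Y", "Y"])
age    = np.array([38, 24, 72, 41, 65, 70])

candidates = {"Gender": gender, "Rich?": rich}
scores = {name: mse_given_attribute(col, age) for name, col in candidates.items()}
print(scores)
print("split on:", min(scores, key=scores.get))   # attribute with the smallest MSE(Y|X)
```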
  • 35. Pruning decision
    …property-owner = Yes → Gender? Female: Predict age = 39. Male: Predict age = 36. (Does this split deserve to live?)
    # property-owning females = 56712; mean age among POFs = 39; age std dev among POFs = 12.
    # property-owning males = 55800; mean age among POMs = 36; age std dev among POMs = 11.5.
    Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.
    Linear Regression Trees
    Also known as "Model Trees". …property-owner = Yes → Gender? Female: Predict age = 26 + 6·NumChildren − 2·YearsEducation. Male: Predict age = 24 + 7·NumChildren − 2.5·YearsEducation.
    Leaves contain linear functions (trained using linear regression on all records matching that leaf). The split attribute is chosen to minimize the MSE of the regressed children. Pruning uses a different Chi-squared test.
  • 36. Linear Regression Trees (continued)
    Detail: you typically ignore any categorical attribute that has been tested higher up in the tree, but use all real-valued attributes in the leaf regressions even if they've been tested on above.
    Test your understanding
    Assuming regular regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?
  • 37. Test your understanding
    Assuming linear regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?
    Multilinear Interpolation
  • 38. Multilinear Interpolation
    Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.
    Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data, q1, q2, q3, q4, q5.
  • 39. Multilinear Interpolation
    We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are 3 examples (none fits the data well).
    How to find the best fit?
    Idea 1: simply perform a separate regression in each segment, for each part of the curve. What's the problem with this idea?
  • 40. How to find the best fit?
    Let's look at what goes on in the red segment (between knots q2 and q3, with heights h2 and h3):
    yest(x) = ((q3 − x)/w) h2 + ((x − q2)/w) h3, where w = q3 − q2.
    Equivalently, in the red segment:
    yest(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 − (x − q2)/w and φ3(x) = 1 − (q3 − x)/w.
  • 41. How to find the best fit?
    In the red segment:
    yest(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 − |x − q2|/w and φ3(x) = 1 − |x − q3|/w.
    (Writing the basis functions with absolute values makes the same formula work on both sides of each knot.)
  • 42. How to find the best fit?
    In general, yest(x) = ∑_{i=1}^{N_K} hi φi(x), where
    φi(x) = 1 − |x − qi|/w if |x − qi| < w, and 0 otherwise.
    And this is simply a basis function regression problem! We know how to find the least squares hi's!
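A sketch (mine, not from the slides) of exactly this: build the triangular "hat" basis at the knots and solve the usual least-squares problem for the knot heights.

```python
import numpy as np

def hat_basis(x, knots):
    """phi_i(x) = 1 - |x - q_i| / w if |x - q_i| < w else 0, for equally spaced knots."""
    w = knots[1] - knots[0]                       # knot spacing
    d = np.abs(x[:, None] - knots[None, :])       # |x - q_i| for every (point, knot) pair
    return np.where(d < w, 1.0 - d / w, 0.0)

# Noisy data and equally spaced knots covering it
rng = np.random.default_rng(3)
x = rng.uniform(0, 4, 100)
y = np.sin(1.5 * x) + rng.normal(scale=0.1, size=100)
knots = np.linspace(0, 4, 5)                      # q1..q5

Phi = hat_basis(x, knots)
h = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)       # least-squares knot heights
print(np.round(h, 2))                             # the fitted curve is piecewise linear, bending only at the knots
```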
  • 43. How to find the best fit?
    This is simply a basis function regression problem — we know how to find the least squares hi's.
    In two dimensions…
    Blue dots show locations of input vectors (outputs not depicted).
  • 44. In two dimensions…
    Each purple dot is a knot point; it will contain the height of the estimated surface. Blue dots show locations of input vectors (outputs not depicted). Example knot heights: 9, 3, 7, 8.
    But how do we do the interpolation to ensure that the surface is continuous?
  • 45. In two dimensions…
    To predict the value at a query point inside a cell of knots (here with corner heights 9, 3, 7, 8):
    first interpolate its value on two opposite edges (here 7 on the top edge and 7.33 on the bottom edge)…
  • 46. In two dimensions…
    …then interpolate between those two edge values (7 and 7.33) to get the prediction (7.05).
    Notes: this can easily be generalized to m dimensions. It should be easy to see that it ensures continuity. The patches are not linear.
  • 47. Doing the regression
    Given data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)
    What's the problem in higher dimensions?
    MARS: Multivariate Adaptive Regression Splines
  • 48. MARS
    • Multivariate Adaptive Regression Splines.
    • Invented by Jerry Friedman (one of Andrew's heroes).
    • Simplest version: let's assume the function we are learning is of the following form:
      yest(x) = ∑_{k=1}^m gk(xk)
    Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.
    Idea: each gk is a piecewise-linear function with knots, like the 1-d fits above.
  • 49. MARS
    yest(x) = ∑_{k=1}^m ∑_{j=1}^{N_K} h^k_j φ^k_j(xk), where
    φ^k_j(x) = 1 − |xk − q^k_j| / wk if |xk − q^k_j| < wk, and 0 otherwise.
    q^k_j: the location of the j'th knot in the k'th dimension. h^k_j: the regressed height of the j'th knot in the k'th dimension. wk: the spacing between knots in the k'th dimension.
    That's not complicated enough!
    • Okay, now let's get serious. We'll allow arbitrary "two-way interactions":
      yest(x) = ∑_{k=1}^m gk(xk) + ∑_{k=1}^m ∑_{t=k+1}^m gkt(xk, xt)
    The function we're learning is allowed to be a sum of non-linear functions over all one-d and 2-d subsets of attributes. This can still be expressed as a linear combination of basis functions, and is thus learnable by linear regression.
    Full MARS: uses cross-validation to choose a subset of subspaces, knot resolution and other parameters.
  • 50. If you like MARS…
    …see also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).
    • Many of the same gut-level intuitions.
    • But entirely in a neural-network, biologically plausible way.
    • (All the low-dimensional functions are by means of lookup tables, trained with a delta rule and using a clever blurred update and hash-tables.)
    Where are we now?
    Inputs → Inference Engine: learn p(E1|E2) — Joint DE.
    Inputs → Classifier: predict category — Dec Tree, Gauss/Joint BC, Gauss Naïve BC.
    Inputs → Density Estimator: probability — Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE.
    Inputs → Regressor: predict real no. — Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS.
  • 51. Citations
    Radial Basis Functions: T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, 247, 978-982, 1989.
    LOESS: W. S. Cleveland, Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74(368), 829-836, December 1979.
    Regression Trees etc.: L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984. J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine Learning: Proceedings of the Tenth International Conference, 1993.
    MARS: J. H. Friedman, Multivariate Adaptive Regression Splines, Department of Statistics, Stanford University, 1988, Technical Report No. 102.
