Introduction to mars_2009



  1. 1. Introduction to MARS Dan Steinberg Mykhaylo Golovnya [email_address] August, 2009
  2. 2. <ul><li>MARS is a highly-automated tool for regression </li></ul><ul><li>Developed by Jerome H. Friedman of Stanford University </li></ul><ul><ul><li>Annals of Statistics, 1991 dense 65 page article </li></ul></ul><ul><ul><li>Takes some inspiration from its ancestor CART® </li></ul></ul><ul><ul><li>Produces smooth curves and surfaces, not the step-functions of CART </li></ul></ul><ul><li>Appropriate target variables are continuous </li></ul><ul><li>End result of a MARS run is a regression model </li></ul><ul><ul><li>MARS automatically chooses which variables to use </li></ul></ul><ul><ul><li>variables are optimally transformed </li></ul></ul><ul><ul><li>interactions are detected </li></ul></ul><ul><ul><li>model is self-tested to protect against over-fitting </li></ul></ul><ul><li>Can also perform well on binary dependent variables </li></ul><ul><ul><li>censored survival model (waiting time models as in churn) </li></ul></ul>Introduction
  3. 3. <ul><li>Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand For Clean Air. Journal of Environmental Economics and Management, v5, 81-102 , 1978 </li></ul><ul><li>506 census tracts in City of Boston for the year 1970 </li></ul><ul><li>Goal: study relationship between quality of life variables and property values </li></ul><ul><ul><li>MV median value of owner-occupied homes in tract (‘000s) </li></ul></ul><ul><ul><li>CRIM per capita crime rates </li></ul></ul><ul><ul><li>NOX concentration of nitrogen oxides (pphm) </li></ul></ul><ul><ul><li>AGE percent built before 1940 </li></ul></ul><ul><ul><li>DIS weighted distance to centers of employment </li></ul></ul><ul><ul><li>RM average number of rooms per house </li></ul></ul><ul><ul><li>LSTAT percent neighborhood ‘lower socio-economic status’ </li></ul></ul><ul><ul><li>RAD accessibility to radial highways </li></ul></ul><ul><ul><li>CHAS borders Charles River (0/1) </li></ul></ul><ul><ul><li>INDUS percent non-retail business </li></ul></ul><ul><ul><li>TAX tax rate </li></ul></ul><ul><ul><li>PT pupil teacher ratio </li></ul></ul>Boston Housing Dataset
  4. 4. <ul><li>The dataset poses significant challenges to conventional regression modeling </li></ul><ul><ul><li>Clear departures from normality, non-linear relationships, and skewed distributions </li></ul></ul><ul><ul><li>Multicollinearity, mutual dependency, and outlying observations </li></ul></ul>Scatter Matrix
  5. 5. <ul><li>A typical MARS solution (univariate for simplicity) is shown above </li></ul><ul><ul><li>Essentially a piece-wise linear regression model with the continuity requirement at the transition points called knots </li></ul></ul><ul><ul><li>The locations and number of knots were determined automatically to ensure the best possible model fit </li></ul></ul><ul><ul><li>The solution can be analytically expressed as conventional regression equations </li></ul></ul>MARS Model
  6. 6. <ul><li>Finding the one best knot in a simple regression is a straightforward search problem </li></ul><ul><ul><li>try a large number of potential knots and choose the one with the best R-squared </li></ul></ul><ul><ul><li>computation can be implemented efficiently using update algorithms; the entire regression does not have to be rerun for every possible knot (just update the X’X matrices) </li></ul></ul><ul><li>Finding K knots simultaneously would require on the order of N^K computations, assuming N observations </li></ul><ul><li>To preserve linear problem complexity, multiple knot placement is implemented in a step-wise manner: </li></ul><ul><ul><li>Need a forward/backward procedure </li></ul></ul><ul><ul><li>The forward procedure adds knots sequentially one at a time </li></ul></ul><ul><ul><ul><li>The resulting model will have many knots and overfit the training data </li></ul></ul></ul><ul><ul><li>The backward procedure removes the least contributing knots one at a time </li></ul></ul><ul><ul><ul><li>This produces a list of models of varying complexity </li></ul></ul></ul><ul><ul><li>Using an appropriate evaluation criterion, identify the optimal model </li></ul></ul><ul><li>Resulting model will have approximately correct knot locations </li></ul>Challenge: Searching for Multiple Knots
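The single-knot search described on this slide can be sketched in a few lines. This is a minimal illustration, not the MARS implementation: it refits a one-regressor least-squares model for every candidate knot instead of using the fast update formulas the slide mentions, and the function names and the noise-free synthetic data are assumptions made for the example.

```python
def fit_simple(h, y):
    """OLS of y on an intercept and one regressor h; returns (a, b, sse)."""
    n = len(y)
    mean_h = sum(h) / n
    mean_y = sum(y) / n
    s_hh = sum((v - mean_h) ** 2 for v in h)
    s_hy = sum((v - mean_h) * (w - mean_y) for v, w in zip(h, y))
    b = s_hy / s_hh if s_hh > 0 else 0.0   # an all-constant regressor gets slope 0
    a = mean_y - b * mean_h
    sse = sum((w - a - b * v) ** 2 for v, w in zip(h, y))
    return a, b, sse


def best_single_knot(x, y):
    """Try every distinct x value as a knot c in max(0, x - c); keep the lowest SSE."""
    best = None
    for c in sorted(set(x)):
        h = [max(0.0, v - c) for v in x]
        a, b, sse = fit_simple(h, y)
        if best is None or sse < best[3]:
            best = (c, a, b, sse)
    return best


# Hypothetical noise-free data: flat at 10 up to x = 20, then slope -2.
x = [float(v) for v in range(41)]
y = [10.0 - 2.0 * max(0.0, v - 20.0) for v in x]
knot, intercept, slope, sse = best_single_knot(x, y)
```

Minimizing the residual sum of squares is equivalent to maximizing R-squared here, since the total sum of squares does not depend on the knot. The brute-force loop costs on the order of N^2 operations for N observations, which is the cost the update algorithms are meant to avoid.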
  7. 7. <ul><li>True conditional mean has two knots at X = 30 and X = 60; the observed data include additional random error </li></ul><ul><li>Best single knot will be at X = 45; subsequent best locations are the true knots around 30 and 60 </li></ul><ul><li>The backward elimination step is needed to remove the redundant knot at X = 45 </li></ul>Example: Flat Top Function
  8. 8. <ul><li>Thinking in terms of knot selection works very well to illustrate splines in one dimension but is unwieldy for working with a large number of variables simultaneously </li></ul><ul><ul><li>Need a concise notation that is easy to program and extend to multiple dimensions </li></ul></ul><ul><ul><li>Need to support interactions, categorical variables, and missing values </li></ul></ul><ul><li>Basis Functions (BF) provide analytical machinery to express the knot placement strategy </li></ul><ul><li>A basis function is a continuous univariate transform that reduces predictor influence to a smaller range of values controlled by a parameter c (20 in the example below) </li></ul><ul><ul><li>Direct BF : max(X-c, 0) – the original range is cut below c </li></ul></ul><ul><ul><li>Mirror BF : max(c-X, 0) – the original range is cut above c </li></ul></ul>Basis Functions
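The two transforms above are simple enough to write out directly. The following sketch uses c = 20 as in the slide; the function names and the sample X values are assumptions chosen for illustration.

```python
def direct_bf(x, c):
    """Direct basis function: zero below the knot c, rises linearly above it."""
    return max(x - c, 0.0)


def mirror_bf(x, c):
    """Mirror basis function: zero above the knot c, rises linearly below it."""
    return max(c - x, 0.0)


xs = [0, 10, 20, 30, 40]                  # hypothetical predictor values
direct = [direct_bf(x, 20) for x in xs]   # nonzero only where x > 20
mirror = [mirror_bf(x, 20) for x in xs]   # nonzero only where x < 20
```

One useful identity follows immediately: for any x, `direct_bf(x, c) - mirror_bf(x, c)` equals `x - c`, so a pair of basis functions at the same knot can reproduce an ordinary linear term.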
  9. 9. <ul><li>MARS constructs basis functions for each unique value present in a continuous variable </li></ul><ul><li>Each new BF results in a different number of zeroes in the transformed variable – hence the set of all BFs is linearly independent </li></ul><ul><li>The resulting collection is naturally resistant to multicollinearity issues </li></ul><ul><li>This is further reinforced by requiring a minimum number of observations between two consecutive knots </li></ul>The Set of All Basis Functions
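The zero-counting argument can be checked on a toy example. The sketch below builds one direct basis function per distinct value of a small hypothetical predictor and counts the zeroes in each transformed column; the largest value is skipped as a knot because it would produce an all-zero column.

```python
values = [1.0, 2.5, 4.0, 7.0, 9.0]     # hypothetical distinct predictor values
knots = sorted(set(values))[:-1]       # drop the largest value (all-zero column)

# One transformed column per knot: max(0, value - knot) for every observation.
columns = {c: [max(0.0, v - c) for v in values] for c in knots}

# Number of zeroes in each column; a higher knot zeroes out more observations.
zero_counts = {c: col.count(0.0) for c, col in columns.items()}
```

With the observations sorted, each column has its first nonzero entry one row later than the previous column, so the columns form a triangular pattern and none can be written as a combination of the others.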
  10. 10. Step-Wise Model Development using BFs <ul><li>Define a basis function BF1 on the variable INDUS: BF1 = max (0, INDUS - 4) </li></ul><ul><li>Use this function instead of INDUS in a regression y = constant + β1*BF1 + error </li></ul><ul><li>This fits a model in which the effect of INDUS on the dependent variable is 0 for all values below 4 and has slope β1 for values above 4 </li></ul><ul><li>Suppose we added a second basis function BF2 to the model: BF2 = max (0, INDUS - 8) </li></ul><ul><li>Then our regression function would be: y = constant + β1*BF1 + β2*BF2 + error </li></ul>
  11. 11. Solution with 1 Basis Function <ul><li>MV = 27.395 - 0.659*max(0, INDUS - 4) </li></ul>
  12. 12. Solution with 2 Basis Functions <ul><li>MV = 30.290 - 2.439*max(0, INDUS - 4) + 2.215*max(0, INDUS - 8) </li></ul><ul><li>Slope starts at 0 and then becomes -2.439 after INDUS=4 </li></ul><ul><li>Slope on third portion (after INDUS=8) is (-2.439 + 2.215) = -0.224 </li></ul>
  13. 13. <ul><li>The following model represents a univariate solution for the Boston Housing dataset using three basis functions (two direct and one mirror) with knots at INDUS = 4 and INDUS = 8 </li></ul><ul><li>BF1 = max(0, 4-INDUS) BF2 = max(0, INDUS-4) BF3 = max(0, INDUS-8) </li></ul><ul><li>MV = 29.433 + 0.925*max(0, 4-INDUS) - 2.180*max(0, INDUS-4) + 1.939*max(0, INDUS-8) </li></ul><ul><ul><li>All three line segments have negative slope even though two coefficients are above zero </li></ul></ul>Example: Solution with 3 Basis Functions
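The claim about segment slopes can be verified by adding up the contribution of each active basis function on each segment. The helper below is a sketch written for this purpose, not part of MARS; the `(coefficient, knot, kind)` tuple layout is an assumption of the example, while the coefficients are those quoted on the slide.

```python
def segment_slopes(terms):
    """Slope of a sum of hinge terms on each segment between consecutive knots.

    terms: list of (coef, knot, kind) with kind 'direct' for max(0, x - knot)
    or 'mirror' for max(0, knot - x).
    """
    knots = sorted({k for _, k, _ in terms})
    # One probe point per segment: below the first knot, between knots, above the last.
    probes = ([knots[0] - 1.0]
              + [(a + b) / 2 for a, b in zip(knots, knots[1:])]
              + [knots[-1] + 1.0])
    slopes = []
    for p in probes:
        s = 0.0
        for coef, knot, kind in terms:
            if kind == "direct" and p > knot:
                s += coef      # d/dx of coef * (x - knot)
            elif kind == "mirror" and p < knot:
                s -= coef      # d/dx of coef * (knot - x)
        slopes.append(round(s, 3))
    return slopes


three_bf = [(0.925, 4.0, "mirror"), (-2.180, 4.0, "direct"), (1.939, 8.0, "direct")]
slopes = segment_slopes(three_bf)   # slopes below 4, between 4 and 8, above 8
```

The positive coefficient on the mirror function contributes a negative slope below INDUS = 4, because max(0, 4 - INDUS) decreases as INDUS grows. Above INDUS = 8 the two direct terms combine to -2.180 + 1.939 = -0.241, so all three slopes are negative.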
  14. 14. MARS Creates Basis Functions in Pairs <ul><li>To fully emulate the geometric concept of a knot, MARS creates basis functions in pairs </li></ul><ul><ul><li>thus twice as many basis functions possible as there are distinct data values </li></ul></ul><ul><ul><li>reminiscent of CART (left and right sides of a split) </li></ul></ul><ul><ul><li>mirror image is needed to ultimately find right model </li></ul></ul><ul><ul><li>not all linearly independent but increases flexibility of model </li></ul></ul><ul><li>For a given set of knots only a subset of mirror image basis functions will be linearly independent of the standard basis functions – MARS is clever enough to identify such cases and discard redundant pieces </li></ul><ul><li>However, using the mirror image INSTEAD of the standard basis function at any knot will change the model and is important for interaction detection </li></ul>
  15. 15. <ul><li>MARS core technology: </li></ul><ul><ul><li>Forward step: add basis function pairs one at a time in conventional step-wise forward manner until the largest model size (specified by the user) is reached </li></ul></ul><ul><ul><ul><li>The pairs are needed to fully implement the geometric sense of a knot </li></ul></ul></ul><ul><ul><ul><li>Possible collinearity due to redundancy in pairs must be detected and eliminated </li></ul></ul></ul><ul><ul><ul><li>For categorical predictors define basis functions as indicator variables for all possible subsets of levels </li></ul></ul></ul><ul><ul><ul><li>To support interactions, allow cross products between a new candidate pair and basis functions already present in the model </li></ul></ul></ul><ul><ul><li>Backward step: remove basis functions one at a time in conventional step-wise backward manner to obtain a sequence of candidate models </li></ul></ul><ul><ul><li>Use test sample or cross-validation to identify the optimal model size </li></ul></ul><ul><li>Missing values are treated by constructing missing value indicator (MVI) variables and nesting the basis functions within the corresponding MVIs </li></ul><ul><li>Fast update formulae and smart computational shortcuts exist to make the MARS process as fast and efficient as possible </li></ul>MARS Process
  16. 16. Example of Categorical Predictors <ul><li>Where RAD is declared categorical, MARS reports in classic output: </li></ul><ul><li>Basis Functions found: </li></ul><ul><li>BF1 = max(0, INDUS - 8.140); </li></ul><ul><li>BF3 = ( RAD = 1 OR RAD = 4 OR RAD = 6 OR RAD = 24); </li></ul><ul><li>BF13 = max(0, INDUS - 3.970); </li></ul><ul><li>BF3 is essentially a dummy indicator for the {1, 4, 6, 24} subset of RAD levels </li></ul><ul><li>MARS looks at all possible 2^(K-1) - 1 groupings of the K levels and ultimately chooses the one showing the greatest error reduction </li></ul><ul><li>A different grouping can enter the model at subsequent iterations </li></ul><ul><li>This machinery mimics CART and is vastly more powerful than the conventional regression approach of replacing categorical variables by a set of dummies </li></ul>
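To see where the count of groupings comes from, the sketch below enumerates every way to split a set of levels into a subset and a non-empty complement. Fixing the first level inside the subset avoids counting each split twice. The five-level list is a hypothetical example, not the full set of RAD levels, and the function names are assumptions.

```python
from itertools import combinations


def binary_groupings(levels):
    """All splits of levels into a subset and its non-empty complement."""
    first, rest = levels[0], levels[1:]
    groupings = []
    for r in range(len(rest)):            # stopping before len(rest) keeps the complement non-empty
        for extra in combinations(rest, r):
            groupings.append({first, *extra})
    return groupings


def subset_indicator(value, subset):
    """Dummy variable of the kind shown in BF3: 1 if the level is in the subset."""
    return 1 if value in subset else 0


levels = [1, 2, 4, 6, 24]                 # hypothetical categorical levels, K = 5
groupings = binary_groupings(levels)      # 2^(K-1) - 1 candidate subsets
bf3 = [subset_indicator(v, {1, 4, 6, 24}) for v in levels]
```

The count doubles with every additional level, which is why an exhaustive search of this kind is only practical for predictors with a modest number of categories.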
  17. 17. Missing Value Handling <ul><li>In one of the choice models we encountered the following MARS code: </li></ul><ul><ul><li>BF10 = ( INCGT5 > .); </li></ul></ul><ul><ul><li>BF11 = ( INCGT5 = .); </li></ul></ul><ul><ul><li>BF12 = max(0, INCGT5 + .110320E-07) * BF10; </li></ul></ul><ul><ul><li>BF13 = max(0, ENVIRON - 12.000) * BF11; </li></ul></ul><ul><ul><li>BF14 = max(0, 12.000 - ENVIRON ) * BF11; </li></ul></ul><ul><li>This uses income when it is available and uses the ENVIRON variable when INCGT5 is missing </li></ul><ul><ul><li>Effectively this creates a surrogate variable for INCGT5 </li></ul></ul><ul><ul><li>No guarantee that MARS will find a surrogate; however, MARS will search all possible surrogates in basis function generation stage </li></ul></ul><ul><ul><li>Unlike CART, this machinery is turned on only when missing values are present in the LEARN sample – hence, care must be exercised when scoring new data </li></ul></ul>
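The nesting of basis functions inside missing value indicators can be mirrored in ordinary code. In this sketch `None` stands in for the missing value written as `.` in the MARS output, ENVIRON is assumed to be always present, and the function name is hypothetical.

```python
def missing_value_bfs(incgt5, environ):
    """Evaluate BF10 to BF14 from the slide for one record (None = missing)."""
    present = incgt5 is not None
    bf10 = 1.0 if present else 0.0        # INCGT5 > .  (value is present)
    bf11 = 0.0 if present else 1.0        # INCGT5 = .  (value is missing)
    # BF12 is nested in BF10, so it only contributes when INCGT5 is present.
    bf12 = max(0.0, incgt5 + 0.110320e-07) * bf10 if present else 0.0
    # BF13 and BF14 are nested in BF11, so ENVIRON only matters when INCGT5 is missing.
    bf13 = max(0.0, environ - 12.0) * bf11
    bf14 = max(0.0, 12.0 - environ) * bf11
    return bf10, bf11, bf12, bf13, bf14


with_income = missing_value_bfs(2.0, 15.0)      # INCGT5 observed
without_income = missing_value_bfs(None, 15.0)  # INCGT5 missing, ENVIRON takes over
```

Because exactly one of BF10 and BF11 is nonzero for any record, the income terms and the ENVIRON terms never contribute at the same time, which is what makes ENVIRON act as a surrogate.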
  18. 18. Interaction Support in MARS <ul><li>MARS builds up its interactions by combining a SINGLE previously-entered basis function with a PAIR of new basis functions </li></ul><ul><li>The “new pair” of basis functions (a standard and a mirror image) could coincide with a previously entered pair or could be a new pair in an already specified variable or a new pair in a new variable </li></ul><ul><li>Interactions are thus built by accretion </li></ul><ul><ul><li>first one of the members of the interaction must appear as a main effect </li></ul></ul><ul><ul><li>then an interaction can be created involving this term </li></ul></ul><ul><ul><li>the second member of the interaction does NOT need to enter as a main effect (modeler might wish to require otherwise via ex post modification of model) </li></ul></ul><ul><li>Generally a MARS interaction will be region-specific and look like </li></ul><ul><ul><li>max(0, PT - 18.6) * max(0, RM - 6.431) </li></ul></ul><ul><li>This is not the familiar interaction of PT*RM because the term is nonzero only in the data region where PT > 18.6 and RM > 6.431; everywhere else it is zero </li></ul>
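A product of two basis functions is zero unless both factors are active, which is what makes the interaction region-specific. The sketch below uses the knots from the slide (18.6 for PT and 6.431 for RM); which quadrant is active depends on whether each factor is a direct or a mirror basis function, so both options are exposed through hypothetical `mirror` flags.

```python
def hinge(x, knot, mirror=False):
    """Direct basis function max(0, x - knot), or its mirror max(0, knot - x)."""
    return max(0.0, knot - x) if mirror else max(0.0, x - knot)


def interaction(pt, rm, pt_mirror=False, rm_mirror=False):
    """Product of one basis function in PT and one in RM."""
    return hinge(pt, 18.6, pt_mirror) * hinge(rm, 6.431, rm_mirror)


both_direct_inside = interaction(20.0, 7.0)               # PT and RM both above their knots
both_direct_outside = interaction(17.0, 7.0)              # PT below its knot: product is zero
both_mirror_inside = interaction(17.0, 6.0, True, True)   # PT and RM both below their knots
both_mirror_outside = interaction(20.0, 7.0, True, True)  # above the knots: product is zero
```

An ordinary PT*RM cross product, by contrast, is nonzero almost everywhere and changes the fitted surface across the whole data range.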
  19. 19. Boston Housing – Conventional Regression <ul><li>We compare the results of classical linear regression and MARS </li></ul><ul><ul><li>Top three significant predictors are shown for each model </li></ul></ul><ul><ul><li>Linear regression provides global insights </li></ul></ul><ul><ul><li>MARS regression provides local insights and has superior accuracy </li></ul></ul><ul><ul><ul><li>All cut points were automatically discovered by MARS </li></ul></ul></ul><ul><ul><ul><li>MARS model can be presented as a linear regression model in the BF space </li></ul></ul></ul>OLS Regression (R-squared 73%) MARS Regression (R-squared 87%)
  20. 20. Further Reading <ul><li>To the best of our knowledge, as of 2008 the Salford Systems MARS tutorial and the documentation for the MARS™ software constitute the sum total of any extended discussion of MARS. MARS is referenced in several hundred scientific publications appearing since 1994, but the reader is assumed to have read Friedman’s articles. </li></ul><ul><li>Friedman, J. H. (1991a). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1-141 (March). </li></ul><ul><li>Friedman, J. H. (1991b). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Department of Statistics, Stanford University, Tech. Report LCS108. </li></ul><ul><li>Friedman, J. H. and Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling (with discussion). Technometrics, 31, 3-39 (February). </li></ul><ul><li>De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). A comparison of two nonparametric estimation schemes: MARS and neural networks. Computers & Chemical Engineering, Vol. 17, No. 8. </li></ul>