Evolution of Regression: OLS to GPS to MARS



  1. Evolution of Regression: From Classical Least Squares to Regularized Regression to Machine Learning Ensembles
     Covering MARS®, Generalized PathSeeker®, TreeNet® Gradient Boosting and Random Forests®
     A Brief Overview of the 4-Part Webinar at www.salford-systems.com
     May 2013
     Dan Steinberg and Mikhail Golovnya, Salford Systems
  2. Full Webinar Outline
     Webinar Part 1:
     • Regression Problem – quick overview
     • Classical Least Squares – the starting point
     • RIDGE/LASSO/GPS – regularized regression
     • MARS – adaptive non-linear regression splines
     Webinar Part 2:
     • CART regression tree – quick overview
     • Random Forest decision tree ensembles
     • TreeNet stochastic gradient boosted trees
     • Hybrid TreeNet/GPS (trees and regularized regression)
  3. Regression
     • Regression analysis is at least 200 years old
       o The most used predictive modeling technique (including logistic regression)
     • The American Statistical Association reports 18,900 members
       o The Bureau of Labor Statistics reported more than 22,000 statisticians in 2008
     • Many other professionals involved in sophisticated data analysis are not included in these counts
       o Statistical specialists in marketing, economics, psychology, bioinformatics
       o Machine learning specialists and "data scientists"
       o Database professionals involved in data analysis
       o Web analytics, social media analytics, text analytics
     • Few of these other researchers will call themselves statisticians
       o But many make extensive use of variations of regression
     • One reason for the popularity of regression: it is effective
  4. Regression Challenges
     • Preparation of data – errors, missing values, etc.
       o The largest part of a typical data analysis (modelers often report 80% of their time)
       o Missing values are a huge headache (listwise deletion of rows)
     • Determining which predictors to include in the model
       o Textbook examples typically have 10 predictors available
       o In practice, hundreds, thousands, even tens or hundreds of thousands are available
     • Transformation or coding of predictors
       o Conventional approaches: logarithm, power, inverse, etc.
       o Required to obtain a good model
     • High correlation among predictors
       o With increasing numbers of predictors this complication becomes more serious
  5. More Regression Challenges
     • Obtaining "sensible" results (correct signs, no wild outcomes)
     • Detecting and modeling important interactions
       o Typically never done because it is too difficult
     • "Wide" data has more columns than rows
     • Lack of external knowledge or theory to guide modeling as more topics are modeled
  6. Boston Housing Data Set
     • Concerns housing values in the Boston area
     • Harrison, D. and D. Rubinfeld, "Hedonic Prices and the Demand for Clean Air"
       o Journal of Environmental Economics and Management, v5, 81-102, 1978
     • Combined information from 10 separate governmental and educational sources to produce the data set
     • 506 census tracts in the City of Boston for the year 1970
       o Goal: study the relationship between quality-of-life variables and property values
       o MV    median value of owner-occupied homes in tract ($1,000s)
       o CRIM  per capita crime rate
       o NOX   concentration of nitric oxides (parts per 10 million), a proxy for air pollution generally
       o AGE   percent built before 1940
       o DIS   weighted distance to centers of employment
       o RM    average number of rooms per house
       o LSTAT % lower status of population (without some high school, and male laborers)
       o RAD   index of accessibility to radial highways
       o CHAS  borders Charles River (0/1)
       o INDUS percent of acreage in non-retail business
       o TAX   property tax rate per $10,000
       o PT    pupil/teacher ratio
       o ZN    proportion of neighborhood zoned for large lots (>25K sq ft)
  7. Ten Data Sources Organized
     • US Census (1970)
     • FBI (1970)
     • MIT Boston Project
     • Metropolitan Area Planning Commission (1972)
     • Voigt, Ivers, and Associates (1965) (Land Use Survey)
     • US Census Tract Maps
     • Massachusetts Dept. of Education (1971-1972)
     • Massachusetts Taxpayers Foundation (1970)
     • Transportation and Air Shed Simulation Model, Ingram et al., Harvard University Dept. of City and Regional Planning (1974)
     • A. Schnare: An Empirical Analysis of the Dimensions of Neighborhood Quality. Ph.D. Thesis, Harvard (1974)
     • An excellent example of creative data blending
     • Also an excellent example of careful model construction
     • The authors emphasize the quality (completeness) of their data
  8. Least Squares Regression
     • LS – ordinary least squares regression
       o Discovered by Legendre (1805) and Gauss (1809)
       o Used to solve problems in astronomy with pen and paper
       o Statistical foundation laid by Fisher in the 1920s
       o 1950s – use of electro-mechanical calculators
     • The model is always of the form:
         Response = A + B1*X1 + B2*X2 + B3*X3 + …
     • The response surface is a hyper-plane!
     • A – the intercept term
     • B1, B2, B3, … – parameter estimates
     • A unique combination of values usually exists which minimizes the mean squared error of predictions on the learn sample
     • Experimental approach to model building
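     A minimal sketch of that least squares fit in Python with numpy. The data here is synthetic and only stands in for the Boston housing table used in the webinar; the coefficient values are arbitrary illustrations.

     ```python
     import numpy as np

     rng = np.random.default_rng(0)
     n = 506
     X = rng.normal(size=(n, 3))                 # stand-ins for predictors such as RM, LSTAT, NOX
     y = 22 + 5.0 * X[:, 0] - 0.9 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(scale=3, size=n)

     X1 = np.column_stack([np.ones(n), X])       # prepend a column of ones for the intercept A
     beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes learn-sample MSE
     pred = X1 @ beta
     print("coefficients:", np.round(beta, 3))
     print("learn MSE:", round(float(np.mean((y - pred) ** 2)), 3))
     ```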
  9. Transformations in the Original Paper (For Historical Reference)
     • RM, the number of rooms in the house, entered as RM^2
     • NOX raised to a power p, with experiments on its value: NOX^p
     • DIS, RAD, LSTAT entered as logarithms of the predictor
     • The regression in the paper is run on ln(MV)
     • Considerable experimentation undertaken
     • No train/test methodology
     • Classical regression agrees very closely with the paper on reported coefficients and R^2 = 0.81 (same without logging MV)
     • Converting predictions back from logs yields MSE = 15.77
     • Note that this is learn sample only; no testing performed
  10. Classical Regression Results
     • 20% random test partition
     • Out-of-the-box regression
     • No attempt to perfect the model
     • Test MSE = 27.069
  11. BATTERY PARTITION: Rerun 80/20 Learn/Test Split 100 Times
     • Note that partition sizes are constant
     • All three partitions change each cycle
     • Mean MSE = 23.80
  12. Least Squares Regression on Raw Boston Data
     • 414 records in the learn sample
     • 92 records in the test sample
     • Good agreement learn/test:
       o LEARN MSE = 27.455
       o TEST MSE = 26.147
     • Used MARS in forward stepwise LS mode to generate this model
     • 3-variable solution (coefficients shown on the slide: +5.247, -0.858, -0.597)
  13. Motivation for Regularized Regression: 1960s and 1970s
     • Unsatisfactory results when modeling physical processes
       o Coefficients changed dramatically with small changes in the data
       o Some coefficients judged to be too large
       o Appearance of coefficients with the "wrong sign"
       o Severe when there are substantial correlations among predictors (multicollinearity)
     • Solution (1970): Hoerl and Kennard, "Ridge Regression"
       o An earlier 1962 version was intended just for stabilization of coefficients
       o Initially poorly received by the statistics profession
  14. Regression Formulas
     • X: matrix of potential predictors (N x K)
     • Y column: the target or dependent variable (N x 1)
     • Estimated coefficients: B = (X'X)^-1 X'y  (standard formula)
     • Ridge: B = (X'X + rI)^-1 X'y
     • Simplest version: a constant r added to the diagonal elements of the X'X matrix
     • r = 0 yields the usual LS solution
     • r = ∞ yields a degenerate model
     • Need to find the r that yields the best generalization error
     • Observe that there is a potentially distinct "solution" for every value of the penalty term r
     • Varying r traces a path of solutions
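     A small numpy sketch of the ridge formula (X'X + rI)^-1 X'y. Centering X and y so the intercept is left unpenalized is one common convention and an assumption here, not something specified on the slide; the data is synthetic.

     ```python
     import numpy as np

     def ridge_coefs(X, y, r):
         Xc = X - X.mean(axis=0)                # center so the intercept is not shrunk
         yc = y - y.mean()
         k = X.shape[1]
         beta = np.linalg.solve(Xc.T @ Xc + r * np.eye(k), Xc.T @ yc)
         intercept = y.mean() - X.mean(axis=0) @ beta
         return intercept, beta

     rng = np.random.default_rng(0)
     X = rng.normal(size=(200, 5))
     y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=200)

     for r in (0.0, 1.0, 100.0):                # r=0 reproduces OLS; large r shrinks toward zero
         _, b = ridge_coefs(X, y, r)
         print(r, np.round(b, 3))
     ```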
  15. Ridge Regression
     • "Shrinkage" of regression coefficients towards zero
     • If there is zero correlation among all predictors then shrinkage will be uniform over all coefficients (same percentage)
     • If predictors are correlated then, while the length of the coefficient vector decreases, some coefficients might increase (in absolute value)
     • Coefficients are intentionally biased, but this yields both more satisfactory estimates and superior generalization
       o Better performance (test MSE) on previously unseen data
     • Coefficients are much less variable even if biased
     • Coefficients will typically be closer to the "truth"
  16. Ridge Regression Features
     • Ridge frequently fixes the wrong-sign problem
     • Suppose you have K predictors which happen to be exact copies of each other
     • RIDGE will give each a coefficient equal to 1/K times the coefficient that would be given to just one copy in a model
  17. Ridge Regression vs OLS
     • (Figure: ridge regression fit vs classical regression fit)
     • Ridge: worse on training data but much better on test data
     • Without test data, cross-validation must be used to determine how much to shrink
     • RIDGE TEST MSE = 21.36
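     One way to pick the shrinkage by cross-validation, as the slide suggests: scikit-learn's RidgeCV searches a grid of penalty values. Synthetic data again stands in for Boston, and the grid and fold count are illustrative choices, not the settings used in the webinar.

     ```python
     import numpy as np
     from sklearn.linear_model import RidgeCV
     from sklearn.model_selection import train_test_split

     rng = np.random.default_rng(0)
     X = rng.normal(size=(506, 13))
     y = 5 * X[:, 0] - 0.9 * X[:, 1] + rng.normal(size=506)

     X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
     model = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=10).fit(X_learn, y_learn)
     print("chosen penalty:", model.alpha_)
     print("test MSE:", round(float(np.mean((y_test - model.predict(X_test)) ** 2)), 3))
     ```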
  18. Lasso Regularized Regression
     • Tibshirani (1996): an alternative to RIDGE regression
     • Least Absolute Shrinkage and Selection Operator
     • Desire to gain the stability and lower variance of ridge regression while also performing variable selection
     • Especially in the context of many possible predictors, looking for a simple, stable, low-predictive-variance model
     • Historical note: the Lasso was inspired by related work (1993) by Leo Breiman (of CART and RandomForests fame), the 'non-negative garrote'
     • Breiman's simulation studies showed the potential for improved prediction via selection and shrinkage
  19. Regularized Regression – Concepts
     • Any regularized regression approach tries to balance model performance and model complexity
     • LS regression: minimize the Mean Squared Error
     • Regularized regression: minimize the Mean Squared Error + λ × Model Complexity, where complexity is
       o Ridge: sum of squared coefficients
       o Lasso: sum of absolute coefficients
       o Compact: number of nonzero coefficients
     • λ – regularization parameter, to be estimated
       o λ = ∞ : null model, all coefficients zero (maximum possible penalty)
       o λ = 0 : LS solution (no penalty)
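     A small illustration of the selection effect of the lasso form of this penalized objective (MSE + λ × sum of |coefficients|). scikit-learn's Lasso is used as a stand-in, with its alpha parameter playing the role of λ; the data is synthetic with only two informative predictors.

     ```python
     import numpy as np
     from sklearn.linear_model import Lasso, LinearRegression

     rng = np.random.default_rng(0)
     X = rng.normal(size=(300, 20))
     y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=300)    # only 2 of 20 predictors matter

     ols = LinearRegression().fit(X, y)
     lasso = Lasso(alpha=0.1).fit(X, y)                      # alpha plays the role of lambda
     print("OLS nonzero coefficients:  ", int(np.sum(np.abs(ols.coef_) > 1e-8)))
     print("Lasso nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
     ```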
  20. Regularized Regression: Penalized Loss Functions
     • RIDGE penalty: sum of squared coefficients, Σ βj^2
     • LASSO penalty: sum of absolute coefficients, Σ |βj|
     • COMPACT penalty: count of nonzero βs
     • RIDGE does no selection, but Lasso and Compact select
     • The power on |β| is called the "elasticity" (0, 1, or 2)
     • The penalty to be estimated is a constant multiplying one of the above functions of the coefficient vector β
     • Intermediate elasticities can be created: e.g. we could have a 50/50 mix of RIDGE and LASSO yielding an elasticity of 1.5
  21. LASSO Features
     • With highly correlated predictors the LASSO will tend to pick just one of them for model inclusion
     • Dispersion of the β coefficients is greater than for RIDGE
     • Unlike AIC and BIC model selection methods, which penalize after the model is built, these penalties influence the βs during estimation
     • A convenient trick for estimating models with regularization is a weighted average of any two of the major elasticities 0, 1, and 2, e.g.:
       o w × LASSO penalty + (1 − w) × RIDGE penalty (the "elastic net")
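     A sketch of that mixing idea with scikit-learn's ElasticNet, whose l1_ratio parameter sets the lasso/ridge blend (1.0 is pure lasso, values toward 0.0 approach ridge). The near-duplicate predictor pair below is a toy construction to show how the lasso concentrates weight on one copy while the mixture spreads it.

     ```python
     import numpy as np
     from sklearn.linear_model import ElasticNet

     rng = np.random.default_rng(0)
     X = rng.normal(size=(300, 10))
     X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=300)   # two nearly identical predictors
     y = 3 * X[:, 0] + rng.normal(size=300)

     for l1_ratio in (1.0, 0.5):
         m = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
         print(l1_ratio, np.round(m.coef_[:2], 3))     # lasso picks one copy, the mix spreads weight
     ```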
  22. Computational Challenge
     • For a given regularization (e.g. LASSO), find the optimal strength of the penalty term
     • Find the best regularization from the family
     • Potentially very many models to fit
  23. Computing Regularized Regressions – 1
     • The earliest versions of regularized regression required considerable computation, as the penalty parameter is unknown and must be estimated
     • The Lasso was originally computed by starting with no penalty and gradually increasing the penalty
       o So start with ALL variables in the model
       o Gradually tighten the noose to squeeze predictors out
       o Infeasible for problems with thousands of possible predictors
     • Need to solve a quadratic programming problem to optimize the Lasso solution for every penalty value
  24. Computing Regularized Regressions – 2
     • Work by Friedman and others introduced very fast forward-stepping approaches
     • Start with the maximum penalty (no predictors)
     • Progress forward with a stopping rule
       o Dealing with millions of predictors becomes possible
     • Coordinate gradient descent methods (next slides)
     • Will still want a test sample or cross-validation for optimization
     • Generalized PathSeeker: full range of regularization from Compact to Ridge (elasticities from 0 through 2)
     • glmnet in R: partial range of regularization from Lasso to Ridge (elasticities from 1 to 2)
  25. GPS Algorithm
     • Start with NO predictors in the model
     • Seek the path β(λ) of solutions as a function of penalty strength λ
     • Define pj(β) = ∂P/∂βj, the marginal change in Penalty
     • Define gj(β) = ∂R/∂βj, the marginal change in Loss
     • Define λj(β) = gj(β)/pj(β), the benefit/cost ratio
     • Find max |λj(β)| to identify the coefficient to update (j*)
     • Update βj* in the direction of sign(λj*)
     • ∂R/∂βj requires computing inner products of the current residual with the available predictors
       o Easily parallelizable
  26. How to Forward Step
     • At any stage of model development choose between:
       o Adding a new variable to the model
       o Updating an existing variable's coefficient
     • Step sizes are small; initial coefficients for any model are very small and are updated in very small increments
     • This explains why the Ridge elasticity can have solutions with fewer than all the variables
       o Technically ridge does not select variables, it only shrinks
       o In practice it can only add one variable per step
  27. Regularized Regression – Practical Algorithm
     • Start with the zero-coefficient solution
     • Look for the best first step, which moves one coefficient away from zero
       o Reduces learn sample MSE
       o Increases the penalty, as the model has become more complex
     • Next step: update one of the coefficients by a small amount
       o If the selected coefficient was zero, a new variable effectively enters the model
       o If the selected coefficient was not zero, the model is simply updated
     • (The slide illustrates this with two coefficient tables: introducing a new variable nudges X4 from 0.0 to 0.1; updating an existing variable nudges X3 from 0.2 to 0.3.)
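     A toy forward-stagewise sketch of this small-steps scheme: at each step pick the predictor most correlated with the current residual and nudge its coefficient by a tiny amount, which either introduces a new variable or updates an existing one. This is an illustration only, not Friedman's GPS code, and it assumes roughly standardized predictors.

     ```python
     import numpy as np

     def forward_stagewise(X, y, step=0.01, n_steps=2000):
         n, k = X.shape
         beta = np.zeros(k)
         residual = y - y.mean()                      # intercept handled by centering
         for _ in range(n_steps):
             corr = X.T @ residual                    # inner products with the current residual
             j = int(np.argmax(np.abs(corr)))         # coefficient giving the biggest loss reduction
             delta = step * np.sign(corr[j])
             beta[j] += delta                         # introduces X_j or updates its coefficient
             residual -= delta * X[:, j]
         return beta

     rng = np.random.default_rng(0)
     X = rng.normal(size=(400, 8))
     y = 2.0 * X[:, 2] + 1.0 * X[:, 5] + rng.normal(size=400)
     print(np.round(forward_stagewise(X, y), 2))      # weight concentrates on X3 and X6
     ```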
  28. Path Building Process
     • Elasticity parameter – controls the variable selection strategy along the path (using the LEARN sample only); it can be between 0 and 2, inclusive
       o Elasticity = 2 – fast approximation of Ridge Regression; introduces variables as quickly as possible and then jointly varies the magnitude of coefficients – lowest degree of compression
       o Elasticity = 1 – fast approximation of Lasso Regression; introduces variables sparingly, letting the current active variables develop their coefficients – good degree of compression versus accuracy
       o Elasticity = 0 – fast approximation of Best Subset Regression; introduces new variables only after the current active variables are fully developed – excellent degree of compression but may lose accuracy
     • Variable selection strategy along a path (from λ = ∞ to λ = 0): zero-coefficient model → a variable is added → sequence of 1-variable models → a variable is added → sequence of 2-variable models → a variable is added → sequence of 3-variable models → … → final OLS solution
  29. Points Versus Steps
     • Each path (elasticity) will have a different number of steps
     • To facilitate model comparison among different paths, the Point Selection Strategy extracts a fixed collection of models into a points grid
       o This eliminates some of the original irregularity among individual paths and facilitates model extraction and comparison
     • (The slide shows three paths with differing numbers of steps, running from the zero solution to the OLS solution, each mapped onto a common grid of 10 points.)
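     One open-source way to see a grid of models along a single path: scikit-learn's lasso_path evaluates the coefficient vector at a fixed grid of penalty values. This covers only the lasso elasticity, not GPS's full family of paths, and the grid size of 10 is just an illustration of the "points" idea.

     ```python
     import numpy as np
     from sklearn.linear_model import lasso_path

     rng = np.random.default_rng(0)
     X = rng.normal(size=(300, 10))
     y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=300)

     alphas, coefs, _ = lasso_path(X, y, n_alphas=10)       # 10 "points" along the path
     for a, c in zip(alphas, coefs.T):
         print(f"penalty={a:7.3f}  active variables={int(np.sum(np.abs(c) > 1e-8))}")
     ```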
  30. LS versus GPS
     • GPS (Generalized Path Seeker) was introduced by Jerome Friedman in 2008 ("Fast Sparse Regression and Classification")
     • Dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitude of coefficients
     • The optimal model of any desirable size can then be selected based on its performance on the TEST sample
     • OLS regression on the learn sample produces a single sequence of linear models (1-variable, 2-variable, 3-variable, …); GPS produces a large collection of linear models (paths) – 1-variable, 2-variable, 3-variable, … models with varying coefficients – all evaluated on the test sample
  31. Paths Produced by SPM GPS
     • Example of 21 paths with different variable selection strategies
  32. Path Points on Boston Data
     • Each path uses a different variable selection strategy and separate coefficient updates
     • (The slide shows path development at points 30, 100, 150, and 190.)
  33. GPS on Boston Data
     • 414 records in the learn sample
     • 92 records in the test sample
     • 15% performance improvement on the test sample for the 3-variable solution
       o GPS TEST MSE = 22.669
       o LS TEST MSE = 26.147
  34. Sentinel Solutions Detail
     • Along the path followed by GPS, for every elasticity we identify the solution (coefficient vector) that is best for each performance measure
     • No attention is paid to model size here, so you might still prefer to select a model from the graphical display
  35. Regularized Logistic Regression
     • All the same GPS ideas apply
     • Specify Logistic Binary analysis
     • Specify the optimality criterion
  36. How To Select a Best Model
     • Regularized regression was originally invented to help modelers obtain more intuitively acceptable models
     • Can think of the process as a search engine generating predictive models
     • The user can decide based on:
       o Complexity of the model
       o Acceptability of coefficients (magnitude, signs, predictors included)
     • Clearly can be set to automatic mode
     • The criterion could well be performance on test data
  37. Key Problems with GPS
     • Still a linear regression!
     • The response surface is still a global hyper-plane
     • Incapable of discovering local structure in the data
     • Remedy: develop non-linear algorithms that build the response surface locally based on the data itself
       o By trying all possible data cuts as local boundaries
       o By fitting first-order adaptive splines locally
       o By exploiting regression trees and their ensembles
  38. From Linear to Non-linear
     • Classical regression and regularized regression build globally linear models
     • Further accuracy can be achieved by building locally linear models connected to each other at boundary points called knots
     • The resulting function is known as a spline
     • Each separate region of data is represented by a "basis function" (BF)
     • (Figure: MV versus LSTAT, a single global line versus a localized fit with knots.)
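     A spline basis function of this kind is just a hinge: max(0, x − knot) and its mirror max(0, knot − x). A minimal numpy illustration with LSTAT-like values and one arbitrarily chosen knot at 10:

     ```python
     import numpy as np

     def hinge_pair(x, knot):
         return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

     lstat = np.linspace(1, 40, 9)            # stand-in values for LSTAT
     bf1, bf2 = hinge_pair(lstat, knot=10.0)
     for v, a, b in zip(lstat, bf1, bf2):
         print(f"LSTAT={v:5.1f}  max(0, LSTAT-10)={a:5.1f}  max(0, 10-LSTAT)={b:5.1f}")
     ```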
  39. Finding Knots Automatically
     • Stage-wise knot placement process on a flat-top function
     • (Figure: data and true function, with knots 1 through 6 placed successively around the true knots.)
  40. MARS Algorithm
     • Multivariate Adaptive Regression Splines
     • Introduced by Jerome Friedman in 1991 (Annals of Statistics 19 (1): 1-67; earlier discussion papers from 1988)
     • Forward stage:
       o Add pairs of BFs (a direct and mirror pair of basis functions represents a single knot) in a step-wise regression manner
       o The process stops once a user-specified upper limit is reached
     • Backward stage:
       o Remove BFs one at a time in a step-wise regression manner
       o This creates a sequence of candidate models of declining complexity
     • Selection stage:
       o Select the optimal model based on TEST performance (modern approach)
       o Select the optimal model based on the GCV criterion (legacy approach)
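     A toy sketch of the forward stage only, for a single predictor: try every candidate knot, fit least squares on the direct/mirror hinge pair, and keep the knot with the lowest learn-sample MSE. Real MARS searches all predictors, adds many basis-function pairs, and then prunes; this is only an illustration on synthetic flat-top-style data.

     ```python
     import numpy as np

     def best_single_knot(x, y):
         best = (None, np.inf)
         for knot in np.unique(x)[1:-1]:                    # candidate knots at observed values
             B = np.column_stack([np.ones_like(x),
                                  np.maximum(0, x - knot),
                                  np.maximum(0, knot - x)])
             beta, *_ = np.linalg.lstsq(B, y, rcond=None)
             mse = float(np.mean((y - B @ beta) ** 2))
             if mse < best[1]:
                 best = (knot, mse)
         return best

     rng = np.random.default_rng(0)
     x = rng.uniform(0, 40, 300)
     y = np.where(x < 15, 35 - 1.5 * x, 12.5) + rng.normal(size=300)   # kink near x = 15
     knot, mse = best_single_knot(x, y)
     print("chosen knot:", round(knot, 2), " learn MSE:", round(mse, 2))
     ```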
  41. MARS on Boston Data: TEST MSE = 14.66
     • 9-BF (7-variable) solution
  42. Non-linear Response Surface
     • MARS automatically determined the transition points between the various local regions
     • This model provides major insights into the nature of the relationship
     • Observe that in this model NOX appears linearly
  43. 200 Replications of the Learn/Test Partition
     • Models were repeated with 200 randomly selected 20% test partitions
     • GPS shows a marginal performance improvement but a much smaller model
     • MARS shows a dramatic performance improvement
     • (Figure: distribution of TEST MSE across runs for Regression, GPS, and MARS.)
  44. Combining MARS and GPS
     • Use MARS as a search engine to break predictors into ranges reflecting differences in the relationship between target and predictors
     • MARS also handles missing values with missing-value indicators and interactions for conditional use of a predictor (only when not missing)
     • Allow the MARS model to be large
     • GPS can then select basis functions and shrink coefficients
     • We will see that this best-of-both-worlds combination also applies to ensembles of decision trees
  45. Running Score: Test Sample MSE

     Method                      20% random   Parametric Bootstrap   Battery Partition
     Regression                  27.069       27.97                  23.80
     MARS Regression Splines     14.663       15.91                  14.12
     GPS Lasso / Regularized     21.361       21.11                  23.15
  46. Regression Tree
     • Out-of-the-box results, no tuning of controls
     • 9 regions (terminal nodes)
     • Test MSE = 17.296
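     A small regression-tree sketch with scikit-learn standing in for CART; capping the tree at 9 leaves mirrors the "9 regions (terminal nodes)" on the slide. The data is synthetic, so the MSE values will not match the Boston figures.

     ```python
     import numpy as np
     from sklearn.tree import DecisionTreeRegressor
     from sklearn.model_selection import train_test_split

     rng = np.random.default_rng(0)
     X = rng.uniform(0, 40, size=(506, 2))                   # stand-ins for LSTAT and NOX
     y = 50 - X[:, 0] + 5 * np.sin(X[:, 1] / 4) + rng.normal(size=506)

     X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
     tree = DecisionTreeRegressor(max_leaf_nodes=9).fit(X_learn, y_learn)
     print("terminal nodes:", tree.get_n_leaves())
     print("test MSE:", round(float(np.mean((y_test - tree.predict(X_test)) ** 2)), 3))
     ```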
  47. Regression Tree Representation of a Surface
     • A high-dimensional step function
     • Should be at a disadvantage relative to other tools; it can never be smooth
     • But always worth checking
  48. Regression Tree Partial Dependency Plot
     • Use the model to simulate the impact of a change in a predictor
     • Here we simulate separately for every training data record and then average
     • For CART trees the result is essentially a step function
     • May only get one "knot" in the graph if the variable appears only once in the tree
     • See the appendix to learn how to get these plots
     • (Figure: partial dependency plots for LSTAT and NOX.)
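     Outside SPM, one way to produce this kind of plot is scikit-learn's PartialDependenceDisplay (recent versions), which sweeps one predictor while averaging predictions over the training records – the same simulate-and-average recipe described on the slide. Same synthetic two-predictor setup as the tree sketch above.

     ```python
     import numpy as np
     import matplotlib.pyplot as plt
     from sklearn.tree import DecisionTreeRegressor
     from sklearn.inspection import PartialDependenceDisplay

     rng = np.random.default_rng(0)
     X = rng.uniform(0, 40, size=(506, 2))                   # columns 0 and 1 stand in for LSTAT, NOX
     y = 50 - X[:, 0] + 5 * np.sin(X[:, 1] / 4) + rng.normal(size=506)

     tree = DecisionTreeRegressor(max_leaf_nodes=9).fit(X, y)
     PartialDependenceDisplay.from_estimator(tree, X, features=[0, 1])   # step-function curves
     plt.show()
     ```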
  49. Running Score

     Method        20% random   Parametric Bootstrap   Repeated 100 20% Partitions
     Regression    27.069       27.97                  23.80
     MARS          14.663       15.91                  14.12
     GPS Lasso     21.361       21.11                  23.15
     CART          17.296       17.26                  20.66
  50. Bagger Mechanism
     • Generate a reasonable number of bootstrap samples
       o Breiman started with numbers like 50, 100, 200
     • Grow a standard CART tree on each sample
     • Use the unpruned tree to make predictions
       o Pruned trees yield inferior predictive accuracy for the ensemble
     • Simple voting for classification
       o Majority-rule voting for binary classification
       o Plurality-rule voting for multi-class classification
       o Average the predicted target for regression models
     • Will result in a much smoother range of predictions
       o A single tree gives the same prediction for all records in a terminal node
       o In the bagger, records will have different patterns of terminal node results
     • Each record is likely to have a unique score from the ensemble
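     A bagger sketch: bootstrap samples, one unpruned tree per sample, predictions averaged. scikit-learn's BaggingRegressor (whose default base learner is an unpruned decision tree) is the closest open-source analogue to the recipe above; the data is synthetic.

     ```python
     import numpy as np
     from sklearn.ensemble import BaggingRegressor
     from sklearn.model_selection import train_test_split

     rng = np.random.default_rng(0)
     X = rng.normal(size=(506, 13))
     y = 22 + 5 * X[:, 0] - 3 * X[:, 1] ** 2 + rng.normal(size=506)

     X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
     bagger = BaggingRegressor(n_estimators=100,      # 100 bootstrap samples / unpruned trees
                               bootstrap=True, random_state=0)
     bagger.fit(X_learn, y_learn)
     print("test MSE:", round(float(np.mean((y_test - bagger.predict(X_test)) ** 2)), 3))
     ```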
  51. Bagger Partial Dependency Plot
     • Averaging over many trees allows for a more complex dependency
     • Opportunity for many splits of a variable (100 large trees)
     • Jaggedness may reflect the existence of interactions
     • (Figure: bagger partial dependency plots for LSTAT and NOX.)
  52. Running Score

     Method        20% random   Parametric Bootstrap   Battery Partition
     Regression    27.069       27.97                  23.80
     MARS          14.663       15.91                  14.12
     GPS Lasso     21.361       21.11                  23.15
     CART          17.296       17.26                  20.66
     Bagged CART   9.545        12.79
  53. RandomForests: Bagger on Steroids
     • Leo Breiman was frustrated that the bagger did not perform better; he was convinced there was a better way
     • He observed that the trees generated by bagging across different bootstrap samples were surprisingly similar
     • How to make them more different?
     • The bagger induces randomness in how the rows of the data are used for model construction
     • Why not also introduce randomness in how the columns are used?
     • Pick a random subset of predictors as candidate predictors – a new random subset for every node
     • Breiman was inspired by earlier research that experimented with variations on these ideas
     • Breiman perfected the bagger to make RandomForests
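     The same synthetic setup with column randomness added: in scikit-learn's RandomForestRegressor, max_features controls how many predictors are candidates at each split, which is roughly what the "RF PREDS=6" row on the later score slide refers to.

     ```python
     import numpy as np
     from sklearn.ensemble import RandomForestRegressor
     from sklearn.model_selection import train_test_split

     rng = np.random.default_rng(0)
     X = rng.normal(size=(506, 13))
     y = 22 + 5 * X[:, 0] - 3 * X[:, 1] ** 2 + rng.normal(size=506)

     X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
     rf = RandomForestRegressor(n_estimators=200,
                                max_features=6,       # 6 candidate predictors per split
                                random_state=0)
     rf.fit(X_learn, y_learn)
     print("test MSE:", round(float(np.mean((y_test - rf.predict(X_test)) ** 2)), 3))
     ```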
  54. Running Score

     Method        20% random   Parametric Bootstrap   Battery Partition
     Regression    27.069       27.97                  23.80
     MARS          14.663       15.91                  14.12
     GPS Lasso     21.361       21.11                  23.15
     CART          17.296       17.26                  20.66
     Bagged CART   9.545        12.79
     RF Defaults   8.286        12.84
  55. Stochastic Gradient Boosting (TreeNet)
     • SGB is a revolutionary data mining methodology first introduced by Jerome H. Friedman in 1999
     • The seminal paper defining SGB was released in 2001
       o Google Scholar reports more than 1,600 references to this paper and a further 3,300 references to a companion paper
     • Extended further by Friedman in major papers in 2004 and 2008 (model compression and rule extraction)
     • Ongoing development and refinement by Salford Systems
       o Latest version released in 2013 as part of SPM 7.0
     • TreeNet/gradient boosting has emerged as one of the most used learning machines and has been successfully applied across many industries
     • Friedman's proprietary code is in TreeNet
  56. Trees Incrementally Revise Predictions
     • Tree 1: grown on the original target; intentionally a "weak" model
     • Tree 2: grown on the residuals from the first; predictions made to improve the first tree
     • Tree 3: grown on the residuals from the model consisting of the first two trees
     • Every tree produces at least one positive and at least one negative node (in the slide graphic, red reflects a relatively large positive node and deep blue a relatively large negative node)
     • The total "score" for a given record is obtained by finding the relevant terminal node in every tree in the model and summing across all trees
  57. Gradient Boosting Methodology: Key Points
     • Trees are usually kept small (2-6 nodes common)
       o However, one should experiment with larger trees (12, 20, 30 nodes)
       o Sometimes larger trees are surprisingly good
     • Updates are small (downweighted); update factors can be as small as 0.01, 0.001, 0.0001
       o Do not accept the full learning of a tree (small step size, also GPS style)
       o Larger trees should be coupled with slower learn rates
     • Use random subsets of the training data in each cycle; never train on all the training data in any one cycle
       o Typical is to use a random half of the learn data to grow each tree
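     A boosting sketch following the three key points above – small trees, a small learning rate, and a random half of the learn data per tree – using scikit-learn's GradientBoostingRegressor as a stand-in for TreeNet, on synthetic data. The specific settings are illustrative, not the webinar's.

     ```python
     import numpy as np
     from sklearn.ensemble import GradientBoostingRegressor
     from sklearn.model_selection import train_test_split

     rng = np.random.default_rng(0)
     X = rng.normal(size=(506, 13))
     y = 22 + 5 * X[:, 0] - 3 * X[:, 1] ** 2 + rng.normal(size=506)

     X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
     gb = GradientBoostingRegressor(max_leaf_nodes=6,      # small trees (2-6 terminal nodes)
                                    learning_rate=0.01,    # small, downweighted updates
                                    subsample=0.5,         # random half of the learn data per tree
                                    n_estimators=2000,
                                    random_state=0)
     gb.fit(X_learn, y_learn)
     print("test MSE:", round(float(np.mean((y_test - gb.predict(X_test)) ** 2)), 3))
     ```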
  58. Running Score

     Method             20% random   Parametric Bootstrap   Battery Partition
     Regression         27.069       27.97                  23.80
     MARS               14.663       15.91                  14.12
     GPS Lasso          21.361       21.11                  23.15
     CART               17.296       17.26                  20.66
     Bagged CART        9.545        12.79
     RF Defaults        8.286        12.84
     RF PREDS=6         8.002        12.05
     TreeNet Defaults   7.417        8.67                   11.02

     Using cross-validation on the learn partition to determine the optimal number of trees and then scoring the test partition with that model: TreeNet MSE = 8.523
  59. Vary HUBER Threshold: Best MSE = 6.71
     • Vary the threshold at which the loss switches from squared errors to absolute errors
     • The optimum here is when the 5% largest errors are not squared in the loss computation
     • Yields the best MSE on test data; sometimes LAD yields the best test sample MSE
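     A hedged sketch of that threshold experiment in scikit-learn: with loss="huber", alpha=0.95 treats the largest 5% of residuals with absolute-error loss instead of squaring them. Again synthetic data (with a few injected outliers), so the MSE will not reproduce the 6.71 on the slide.

     ```python
     import numpy as np
     from sklearn.ensemble import GradientBoostingRegressor
     from sklearn.model_selection import train_test_split

     rng = np.random.default_rng(0)
     X = rng.normal(size=(506, 13))
     y = 22 + 5 * X[:, 0] - 3 * X[:, 1] ** 2 + rng.normal(size=506)
     y[::25] += 30                                    # a few outliers, where Huber should help

     X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
     gb = GradientBoostingRegressor(loss="huber", alpha=0.95,   # largest 5% of errors not squared
                                    max_leaf_nodes=6, learning_rate=0.01,
                                    subsample=0.5, n_estimators=2000, random_state=0)
     gb.fit(X_learn, y_learn)
     print("test MSE:", round(float(np.mean((y_test - gb.predict(X_test)) ** 2)), 3))
     ```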
  60. Gradient Boosting Partial Dependency Plots
     • (Figure: gradient boosting partial dependency plots for LSTAT and NOX.)
  61. Running Score

     Method             20% random   Parametric Bootstrap   Battery Partition
     Regression         27.069       27.97                  23.80
     MARS               14.663       15.91                  14.12
     GPS Lasso          21.361       21.11                  23.15
     CART               17.296       17.26                  20.66
     Bagged CART        9.545        12.79
     RF Defaults        8.286        12.84
     RF PREDS=6         8.002        12.05
     TreeNet Defaults   7.417        8.67                   11.02
     TreeNet Huber      6.682        7.86                   11.46
     TN Additive        9.897        10.48

     If we had used cross-validation to determine the optimal number of trees and then used those to score the test partition, the TreeNet Default model MSE = 8.523
  62. References: MARS
     • Friedman, J. H. (1991a). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1-141 (March).
     • Friedman, J. H. (1991b). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Department of Statistics, Stanford University, Tech. Report LCS108.
     • De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). A Comparison of Two Nonparametric Estimation Schemes: MARS and Neural Networks. Computers & Chemical Engineering, Vol. 17, No. 8.
  63. References: Regularized Regression
     • Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, Vol. 12, 55-67.
     • Friedman, J. H. Fast Sparse Regression and Classification. http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf
     • Friedman, J. H. and Popescu, B. E. (2003). Importance Sampled Learning Ensembles. Stanford University, Department of Statistics, Technical Report. http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
     • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58, 267-288.
  64. References: Regression via Trees
     • Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. CRC Press.
     • Breiman, L. (1996). Bagging Predictors. Machine Learning, 24, 123-140.
     • Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.
     • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist., Vol. 29, No. 5, 1189-1232. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf
     • Friedman, J. H. and Popescu, B. E. (2003). Importance Sampled Learning Ensembles. Stanford University, Department of Statistics, Technical Report. http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
  65. What's Next
     • Visit our website for the full 4-hour video series: https://www.salford-systems.com/videos/tutorials/the-evolution-of-regression-modeling
       o 2 hours of methodology
       o 2 hours of hands-on running of examples
       o Also other tutorials on CART and TreeNet gradient boosting
     • Download a no-cost 60-day evaluation
       o Just let the Unlock Department know you participated in the on-demand webinar series
     • Contains many capabilities not present in open source renditions
       o Largely the source code of the inventor of today's most important data mining methods: Jerome H. Friedman
       o We started working with Friedman in 1990 when very few people were interested in his work
  66. Salford Predictive Modeler (SPM)
     • Download a current version from our website: http://www.salford-systems.com
     • The version will run without a license key for 10 days
     • For more time, request a license key from unlock@salford-systems.com
     • Request a configuration to meet your needs
       o Data handling capacity
       o Data mining engines made available