Tree net and_randomforests_2009


Published in: Technology, Education

  1. 1. Introduction to Random Forests and Stochastic Gradient Boosting Dan Steinberg Mykhaylo Golovnya [email_address] August, 2009
  2. 2. Initial Ideas on Combining Trees <ul><li>Idea that combining good methods could yield promising results was suggested by researchers more than a decade ago </li></ul><ul><ul><li>In tree-structured analysis, suggestion stems from: </li></ul></ul><ul><ul><ul><li>Wray Buntine (1991) </li></ul></ul></ul><ul><ul><ul><li>Kwok and Carter (1990) </li></ul></ul></ul><ul><ul><ul><li>Heath, Kasif and Salzberg (1993) </li></ul></ul></ul><ul><li>Notion is that if the trees can somehow get at different aspects of the data, the combination will be “better” </li></ul><ul><ul><li>Better in this context means more accurate in classification and prediction for future cases </li></ul></ul><ul><li>The original implementation of CART already included bagging ( B ootstrap A ggregation) and ARCing ( A daptive R esampling and C ombining) approaches to build tree ensembles </li></ul>
  3. 3. Past Decade Development <ul><li>The original bagging and boosting approaches relied on sampling with replacement techniques to obtain a new modeling dataset </li></ul><ul><li>Subsequent approaches focused on refining the sampling machinery or changing the modeling emphasis from the original dependent variable to current model generalized residuals </li></ul><ul><li>Most important variants (and dates of published articles) are: </li></ul><ul><ul><li>Bagging (Breiman, 1996, “ B ootstrap Ag gregation”) </li></ul></ul><ul><ul><li>Boosting (Freund and Schapire, 1995) </li></ul></ul><ul><ul><li>M ultiple A dditive R egression T rees (Friedman, 1999, aka MART™ or TreeNet™) </li></ul></ul><ul><ul><li>RandomForests™ (Breiman, 2001) </li></ul></ul><ul><li>Work continues with major refinements underway (Friedman in collaboration with Salford Systems) </li></ul>
  4. 4. Multi Tree Methods <ul><li>Simplest example: </li></ul><ul><ul><li>Grow a tree on training data </li></ul></ul><ul><ul><li>Find a way to grow another tree, different from currently available (change something in set up) </li></ul></ul><ul><ul><li>Repeat many times, say 500 replications </li></ul></ul><ul><ul><li>Average results or create voting scheme; for example, relate PD to fraction of trees predicting default for a given </li></ul></ul><ul><li>Beauty of the method is that every new tree starts with a complete set of data </li></ul><ul><li>Any one tree can run out of data, but when that happens we just start again with a new tree and all the data (before sampling) </li></ul>Prediction Via Voting
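The voting scheme on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the function names and the 500-tree vote counts are hypothetical, and each tree is assumed to emit 1 for "default" and 0 otherwise.

```python
def vote_fraction(tree_predictions):
    """Fraction of trees that predict the positive class (e.g. default)."""
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    return sum(1 for p in tree_predictions if p == 1) / len(tree_predictions)


def majority_vote(tree_predictions):
    """Class label chosen by simple majority of the trees."""
    return 1 if vote_fraction(tree_predictions) > 0.5 else 0


# Hypothetical forest: 500 trees, 120 of which predict default for one case.
votes = [1] * 120 + [0] * 380
pd_estimate = vote_fraction(votes)   # used as a pseudo-probability of default
label = majority_vote(votes)
```

The vote fraction is only a score; whether it is a calibrated probability of default depends on the data and is not something this sketch checks.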
  5. 5. Random Forest <ul><li>A random forest is a collection of single trees grown in a special way </li></ul><ul><li>The overall prediction is determined by voting (in classification) or averaging (in regression) </li></ul><ul><li>Accuracy is achieved by using a large number of trees </li></ul><ul><ul><li>The Law of Large Numbers ensures convergence </li></ul></ul><ul><ul><li>The key to accuracy is low correlation and bias </li></ul></ul><ul><ul><li>To keep bias and correlation low, trees are grown to maximum depth </li></ul></ul><ul><ul><li>Using more trees does not lead to overfitting, because each tree is grown independently </li></ul></ul><ul><li>Correlation is kept low through explicitly introduced randomness </li></ul><ul><li>RandomForests™ often works well when other methods work poorly </li></ul><ul><ul><li>The reasons for this are poorly understood </li></ul></ul><ul><ul><li>Sometimes other methods work well and RandomForests™ doesn’t </li></ul></ul>
  6. 6. Randomness is introduced in order to keep correlation low <ul><li>Randomness is introduced in two distinct ways </li></ul><ul><li>Each tree is grown on a bootstrap sample from the learning set </li></ul><ul><ul><li>Default bootstrap sample size equals original sample size </li></ul></ul><ul><ul><li>Smaller bootstrap sample sizes are sometimes useful </li></ul></ul><ul><li>A number R is specified (square root by default) such that it is noticeably smaller than the total number of available predictors </li></ul><ul><li>During tree growing phase, at each node only R predictors are randomly selected and tried </li></ul><ul><li>Randomness also reduces the signal to noise ratio in a single tree </li></ul><ul><ul><li>A low correlation between trees is more important than a high signal when many trees contribute to forming the model </li></ul></ul><ul><ul><li>RandomForests™ trees often have very low signal strength, even when the signal strength of the forest is high </li></ul></ul>
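The two sources of randomness described on this slide can be sketched as follows. The helper names are hypothetical, and truncating the square root to an integer is an assumption (other rounding conventions are possible).

```python
import math
import random


def bootstrap_indices(n_rows, rng, sample_size=None):
    """Row indices drawn with replacement; default size equals n_rows."""
    size = n_rows if sample_size is None else sample_size
    return [rng.randrange(n_rows) for _ in range(size)]


def default_r(n_predictors):
    """Square-root rule for the number of predictors tried at each node."""
    return max(1, int(math.sqrt(n_predictors)))


def eligible_predictors(n_predictors, r, rng):
    """Indices of the r predictors tried at one node (no repeats)."""
    return rng.sample(range(n_predictors), r)


rng = random.Random(0)
rows = bootstrap_indices(772, rng)        # one tree's learning sample
r = default_r(111)                        # predictors searched per node
tried = eligible_predictors(111, r, rng)  # redrawn at every node
```

Because rows are drawn with replacement, some rows appear several times in `rows` and others not at all; a fresh set of `tried` predictors would be drawn at each node of the tree.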
  7. 7. Important to Keep Correlation Low <ul><li>Averaging many base learners improves the signal to noise ratio dramatically provided that the correlation of errors is kept low </li></ul><ul><li>Hundreds of base learners are needed for the most noticeable effect </li></ul>
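A standard variance identity makes the role of correlation concrete. The sketch below assumes base learners whose errors share a common variance and a common pairwise correlation; these assumptions and the function name are illustrative and do not come from the slides.

```python
def variance_of_average(sigma2, rho, n_learners):
    """Variance of the mean of n equally correlated errors.

    Var = rho * sigma2 + (1 - rho) * sigma2 / n, so the first term is a
    floor that averaging cannot remove, while the second shrinks with n.
    """
    if n_learners < 1:
        raise ValueError("need at least one learner")
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_learners


uncorrelated = variance_of_average(1.0, 0.0, 100)  # shrinks like 1/n
correlated = variance_of_average(1.0, 0.5, 100)    # stuck near rho * sigma2
```

With zero correlation the error variance keeps falling as learners are added; with correlation 0.5 it cannot drop below half the single-learner variance however many learners are averaged.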
  8. 8. Randomness in Split Selection <ul><li>Topic discussed by several Machine Learning researchers </li></ul><ul><li>Possibilities: </li></ul><ul><ul><li>Select splitter, split point, or both at random </li></ul></ul><ul><ul><li>Choose splitter at random from the top K splitters </li></ul></ul><ul><li>Random Forests: Suppose we have M available predictors </li></ul><ul><ul><li>Select R eligible splitters at random and let the best of them split the node </li></ul></ul><ul><ul><li>If R =1 this is just random splitter selection </li></ul></ul><ul><ul><li>If R=M this becomes Breiman’s bagger </li></ul></ul><ul><ul><li>If R << M then we get Breiman’s Random Forests </li></ul></ul><ul><ul><ul><li>Breiman suggests R=sqrt( M ) as a good rule of thumb </li></ul></ul></ul>
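The split-selection rule can be sketched as below. It assumes that a split-quality score per predictor has already been computed for the node (the `split_scores` list is hypothetical input); the function only shows how R changes which predictor wins.

```python
import random


def choose_splitter(split_scores, r, rng):
    """Pick r candidate predictors at random and return the best one.

    split_scores[j] is an assumed, precomputed quality of the best split
    on predictor j at this node (higher is better).
    """
    m = len(split_scores)
    if not 1 <= r <= m:
        raise ValueError("r must be between 1 and the number of predictors")
    candidates = rng.sample(range(m), r)
    return max(candidates, key=lambda j: split_scores[j])


scores = [0.1, 0.7, 0.3, 0.9, 0.2]        # hypothetical node-level scores
rng = random.Random(1)
bagger_choice = choose_splitter(scores, len(scores), rng)  # R = M
random_choice = choose_splitter(scores, 1, rng)            # R = 1
```

With R = M every predictor is a candidate, so the globally best splitter always wins, as in bagging; with R = 1 the single candidate is used regardless of its score.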
  9. 9. Performance as a Function of R <ul><li>In this experiment, we ran RF with 100 trees on sample data (772x111) using different values for the number of variables R (N Vars) searched at each split </li></ul><ul><li>Combining trees always improves performance, with the optimal number of sampled predictors settling at around 11 </li></ul>
  10. 10. Usage Notes <ul><li>RF does not require an explicit test sample </li></ul><ul><li>Capable of capturing high-order interactions </li></ul><ul><li>Both running speed and resources consumed depend, for the most part, on the row dimension of the data </li></ul><ul><ul><li>Trees are grown in as simple a way as feasible to keep run times low (no surrogates, no priors, etc.) </li></ul></ul><ul><li>Classification models produce pseudo-probability scores (percent of votes) </li></ul><ul><li>Capable of matching the performance of modern boosting techniques, including MART (described later) </li></ul><ul><li>Naturally allows parallel processing </li></ul><ul><li>The final model code is usually bulky, voluminous, and impossible to interpret directly </li></ul><ul><li>Current stable implementations include multinomial classification and least squares regression, with ongoing research in the more advanced fields of predictive modeling (survival, choice, etc.) </li></ul>
  11. 11. Proximity Matrix – Raw Material for Further Advances <ul><li>RF introduces a novel way to define proximity between two observations: </li></ul><ul><ul><li>For a dataset of size N define an N x N matrix of proximities </li></ul></ul><ul><ul><li>Initialize all proximities to zeroes </li></ul></ul><ul><ul><li>For any given tree, apply the tree to the dataset </li></ul></ul><ul><ul><li>If case i and case j both end up in the same terminal node, increase the proximity Prox(i, j) between i and j by one </li></ul></ul><ul><ul><li>Accumulate over all trees in RF and normalize by twice the number of trees in RF </li></ul></ul><ul><li>The resulting matrix provides an intrinsic measure of proximity </li></ul><ul><ul><li>Observations that are “alike” will have proximities close to one </li></ul></ul><ul><ul><li>The closer the proximity to 0, the more dissimilar cases i and j are </li></ul></ul><ul><ul><li>The measure is invariant to monotone transformations </li></ul></ul><ul><ul><li>The measure is clearly defined for any type of independent variables, including categorical </li></ul></ul>
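The accumulation step can be sketched as follows. The input layout (one list of terminal-node ids per tree) is a hypothetical convention. The sketch divides by the number of trees, which is an assumption made so that two cases sharing a terminal node in every tree score exactly one; an implementation that counts each pair in both orders would divide by twice the number of trees instead.

```python
def proximity_matrix(leaf_ids_per_tree):
    """Share of trees in which each pair of cases lands in the same leaf.

    leaf_ids_per_tree[t][i] is the terminal-node id of case i in tree t
    (an assumed input format; any hashable id works).
    """
    n_trees = len(leaf_ids_per_tree)
    n_cases = len(leaf_ids_per_tree[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids_per_tree:
        for i in range(n_cases):
            for j in range(n_cases):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    # Assumption: normalize by the number of trees so values lie in [0, 1].
    return [[value / n_trees for value in row] for row in prox]


# Hypothetical forest of 3 trees applied to 3 cases.
leaves = [[0, 0, 1], [2, 2, 2], [5, 6, 6]]
prox = proximity_matrix(leaves)
```

Only terminal-node membership enters the calculation, which is why the measure is unaffected by monotone transformations of the predictors and works for categorical ones.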
  12. 12. Post Processing and Interpretation <ul><li>Based on proximities one can: </li></ul><ul><ul><li>Proceed with a well-defined clustering solution </li></ul></ul><ul><ul><ul><li>Note: the solution is guided by the target variable used in the RF model </li></ul></ul></ul><ul><ul><li>Detect outliers </li></ul></ul><ul><ul><ul><li>By computing average proximity between the current observation and all the remaining observations sharing the same class </li></ul></ul></ul><ul><ul><li>Generate informative data views/projections using scaling coordinates </li></ul></ul><ul><ul><ul><li>Non-metric multidimensional scaling produces most satisfactory results here </li></ul></ul></ul><ul><ul><li>Do missing value imputation using current proximities as weights in the nearest neighbor imputation techniques </li></ul></ul><ul><li>Ongoing work on possible expansion of the above to the unsupervised learning area of data mining </li></ul>
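One way to read the imputation bullet is as a proximity-weighted average over the cases whose value is observed. The sketch below assumes a continuous variable and uses `None` to mark missing values; both choices, and the function name, are illustrative.

```python
def impute_by_proximity(values, prox_row):
    """Proximity-weighted mean of the observed values for one case.

    values[k] is the variable's value for case k (None if missing) and
    prox_row[k] is the proximity between the case being imputed and case k.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, weight in zip(values, prox_row):
        if value is not None and weight > 0:
            weighted_sum += weight * value
            weight_total += weight
    if weight_total == 0:
        raise ValueError("no observed neighbour with positive proximity")
    return weighted_sum / weight_total


# Hypothetical data: the third case is missing; its proximities to the
# first two cases are 0.75 and 0.25.
filled = impute_by_proximity([10.0, 20.0, None], [0.75, 0.25, 1.0])
```

Cases that rarely share a terminal node with the incomplete case contribute little, so the filled value is dominated by its nearest neighbours in the forest's sense.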
  13. 13. Introduction to Stochastic Gradient Boosting <ul><li>TreeNet (TN) is a new approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University </li></ul><ul><ul><li>Co-author of CART® with Breiman, Olshen and Stone </li></ul></ul><ul><ul><li>Author of MARS®, PRIM, Projection Pursuit, COSA, RuleFit™ and more </li></ul></ul><ul><li>Also known as Stochastic Gradient Boosting and MART ( Multiple Additive Regression Trees ) </li></ul><ul><li>Naturally supports the following classes of predictive models </li></ul><ul><ul><li>Regression (continuous target, LS and LAD loss functions) </li></ul></ul><ul><ul><li>Binary classification (binary target, logistic likelihood loss function) </li></ul></ul><ul><ul><li>Multinomial classification (multiclass target, multinomial likelihood loss function) </li></ul></ul><ul><ul><li>Poisson regression (counting target, Poisson likelihood loss function) </li></ul></ul><ul><ul><li>Exponential survival (positive target with censoring) </li></ul></ul><ul><ul><li>Proportional hazard Cox survival model </li></ul></ul><ul><li>TN builds on the notions of committees of experts and boosting but is substantially different in key implementation details </li></ul>
  14. 14. Predictive Modeling <ul><li>We are interested in studying the conditional distribution of the dependent variable Y given X in the predictor space </li></ul><ul><li>We assume that some quantity f can be used to fully or partially describe such distribution </li></ul><ul><ul><li>In regression problems f is usually the mean or the median </li></ul></ul><ul><ul><li>In binary classification problems f is the log-odds of Y =1 </li></ul></ul><ul><ul><li>In Cox survival problems f is the scaling factor in the unknown hazard function </li></ul></ul><ul><li>Thus we want to construct a “nice” function f ( X ) which in turn can be used to study the behavior of Y at the given point in the predictor space </li></ul><ul><ul><li>Function f ( X ) is sometimes referred to as “ response surface ” </li></ul></ul><ul><li>We need to define how “nice” can be measured </li></ul>[Diagram labels: Model, X, f]
  15. 15. Loss Functions <ul><li>In predictive modeling the problem is usually attacked by introducing a well chosen loss function L ( Y , X , f ( X )) </li></ul><ul><ul><li>In stochastic gradient boosting we need a loss function for which gradients can easily be computed and used to construct good base learners </li></ul></ul><ul><ul><li>The loss function used on the test data does not need the same properties </li></ul></ul><ul><li>Practical ways of constructing loss functions </li></ul><ul><ul><li>Direct interpretation of f ( X i ) as an estimate Y i or a population statistic of the distribution of Y conditional on X </li></ul></ul><ul><ul><ul><li>Least Squares Loss (LS), f i is an estimate of E( Y| X i ) </li></ul></ul></ul><ul><ul><ul><li>Least Absolute Deviation Loss (LAD), f i is an estimate of median( Y | X i ) </li></ul></ul></ul><ul><ul><ul><li>Huber-M Loss, f i is an estimate of Y i </li></ul></ul></ul><ul><ul><li>Choosing a conditional distribution for Y | X, defining f ( X ) as a parameter of that distribution and using the negative log-likelihood as the loss function </li></ul></ul><ul><ul><ul><li>Logistic Loss (conditional Bernoulli, f ( X ) is the half log-odds of Y =1) </li></ul></ul></ul><ul><ul><ul><li>Poisson Loss (conditional Poisson, f ( X ) is the log( λ )) </li></ul></ul></ul><ul><ul><ul><li>Exponential Loss (conditional Exponential, f ( X ) is the log( λ )) </li></ul></ul></ul><ul><ul><li>More general likelihood functions, for example, multinomial discrete choice, the Cox model </li></ul></ul>
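Several of the losses listed on this slide can be written down directly. The sketch below is illustrative rather than TreeNet's code: the factor 1/2 in the LS loss, the Huber threshold `delta`, and the coding of the binary target as -1/+1 are all assumptions, and the Poisson loss drops the term that does not depend on f.

```python
import math


def ls_loss(y, f):
    """Least squares loss; the 1/2 factor is a common convention."""
    return 0.5 * (y - f) ** 2


def lad_loss(y, f):
    """Least absolute deviation loss."""
    return abs(y - f)


def huber_loss(y, f, delta):
    """Quadratic for small residuals, linear beyond the threshold delta."""
    r = abs(y - f)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def logistic_loss(y, f):
    """Negative Bernoulli log-likelihood; y in {-1, +1}, f = half log-odds."""
    return math.log1p(math.exp(-2.0 * y * f))


def poisson_loss(y, f):
    """Negative Poisson log-likelihood up to a constant; f = log(rate)."""
    return math.exp(f) - y * f
```

For a residual of 2 and `delta = 1`, the Huber loss is 1.5, between the LAD value of 2 and the LS value of 2 under the conventions above, which is the compromise the next slide refers to.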
  16. 16. Regression and Classification Losses <ul><li>Huber-M regression loss is a reasonable compromise between the classical LS loss and robust LAD loss </li></ul><ul><li>Logistic log-likelihood based loss strikes the middle ground between the extremely sensitive exponential loss on one side and conventional LS and LAD losses on the other side </li></ul>
  17. 17. Practical Estimate <ul><li>In reality, we have a set of N observed pairs ( X i , y i ) from the population, not the entire population </li></ul><ul><li>Hence, we use sample-based estimates of L ( Y , X , f ( X )) </li></ul><ul><li>To avoid biased estimates, one usually partitions the data into independent learn and test samples using the latter to compute an unbiased estimate of the population loss </li></ul><ul><li>In stochastic gradient boosting the problem is attacked by acting as if we were trying to minimize the loss function on the learn sample, but doing so in a slow, constrained way </li></ul><ul><li>This results in a series of models that move closer and closer to the f ( X ) function that minimizes the loss on the learn sample. Eventually new models become overfit to the learn sample </li></ul><ul><li>From this sequence the function f ( X ) with the lowest loss on the test sample is chosen </li></ul><ul><ul><li>By choosing from a fixed set of models, overfitting to the test data is avoided </li></ul></ul><ul><ul><li>Sometimes the loss functions used on the test data and learn data differ </li></ul></ul>
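Choosing from the sequence of models amounts to taking the position of the smallest test-sample loss. A minimal sketch, with a hypothetical loss sequence that falls and then rises as overfitting sets in:

```python
def best_model_index(test_losses):
    """Index of the model in the sequence with the lowest test loss."""
    if not test_losses:
        raise ValueError("need at least one model")
    return min(range(len(test_losses)), key=test_losses.__getitem__)


# Hypothetical test losses after 1, 2, ..., 6 boosting stages.
test_losses = [0.90, 0.60, 0.45, 0.41, 0.43, 0.50]
chosen = best_model_index(test_losses)   # 0-based position in the sequence
```

Ties are resolved in favour of the earliest, and therefore smallest, model.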
  18. 18. Parametric Approach <ul><li>The function f ( X ) is introduced as a known function of a fixed set of unknown parameters </li></ul><ul><li>The problem then reduces to finding a set of optimal parameter estimates using classical optimization techniques </li></ul><ul><li>In linear regression and logistic regression: f ( X ) is a linear combination of fixed predictors; the parameters are the intercept and the slope coefficients </li></ul><ul><li>Major problem: the function and predictors need to be specified beforehand – this can result in a lengthy specification search process by trial and error </li></ul><ul><ul><li>If this trial-and-error process uses the same data as the final model, that model will be overfit. This is the classical overfitting problem </li></ul></ul><ul><ul><li>If new data are used to estimate the final model and the model performs poorly, the specification search process must be repeated </li></ul></ul><ul><li>This approach shows most benefits on small datasets where only simple specifications can be justified, or on datasets where there is strong a priori knowledge of the correct specification </li></ul>
  19. 19. Non-parametric Approach <ul><li>Construct f ( X ) using a data-driven incremental approach </li></ul><ul><li>Start with a constant, then at each stage adjust the values of f ( X ) by small increments in various regions of data </li></ul><ul><li>It is important to keep the adjustment rate low – the resulting model will become smoother and be less subject to overfitting </li></ul><ul><li>Treating f_i = f ( X_i ) at all individual observed data points as separate parameters, the negative of the gradient points in the direction of change in f ( X ) that results in the steepest reduction of the loss </li></ul><ul><li> G = { g_i = −∂R/∂f_i ; i = 1, …, N }, where R denotes the loss evaluated on the learn sample </li></ul><ul><ul><li>The components of the negative gradient will be called generalized residuals </li></ul></ul><ul><li>We want to limit the number of currently allowed separate adjustments to a small number M – a natural way to proceed then is to find an orthogonal partition of the X -space into M mutually exclusive regions such that the variance of the residuals within each region is minimized </li></ul><ul><ul><li>This job is accomplished by building a fixed size M-node regression tree using the generalized residuals as the current target variable </li></ul></ul>
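The generalized residuals for three common losses can be sketched as below. The loss conventions are assumptions chosen for illustration: LS is taken as half the squared error, and the logistic case codes the target as -1/+1 with f as the half log-odds.

```python
import math


def generalized_residuals_ls(y, f):
    """Negative gradient of 0.5 * (y - f)^2: the ordinary residual."""
    return [yi - fi for yi, fi in zip(y, f)]


def generalized_residuals_lad(y, f):
    """Negative gradient of |y - f|: the sign of the residual."""
    return [(yi > fi) - (yi < fi) for yi, fi in zip(y, f)]


def generalized_residuals_logistic(y, f):
    """Negative gradient of log(1 + exp(-2 * y * f)), with y in {-1, +1}."""
    return [2.0 * yi / (1.0 + math.exp(2.0 * yi * fi)) for yi, fi in zip(y, f)]


ls_res = generalized_residuals_ls([3.0, 1.0], [1.0, 1.0])
lad_res = generalized_residuals_lad([3.0, 0.0, 1.0], [1.0, 1.0, 1.0])
logit_res = generalized_residuals_logistic([1, -1], [0.0, 0.0])
```

Under LAD every case pulls with the same unit strength, which is why that loss is robust to outliers; under the logistic loss confidently correct cases produce residuals near zero and stop influencing new trees.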
  20. 20. TreeNet Process <ul><li>Begin with the sample mean (e.g., logit - for all observations set p=sample share) </li></ul><ul><li>Add one very small tree as initial model based on gradients </li></ul><ul><ul><li>For regression and logit, residuals are gradients </li></ul></ul><ul><ul><li>Could be as small as ONE split generating 2 terminal nodes </li></ul></ul><ul><ul><li>Typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes </li></ul></ul><ul><ul><li>Output is a continuous response surface (e.g. log-odds for binary classification) </li></ul></ul><ul><ul><li>Model is intentionally “weak” </li></ul></ul><ul><ul><li>Multiply contribution by a learning factor before adding it to model </li></ul></ul><ul><ul><li>Model is now: mean + (learning factor × Tree 1) </li></ul></ul><ul><li>Compute new gradients (residuals) </li></ul><ul><ul><li>The actual definition of the residual is driven by the type of the loss function </li></ul></ul><ul><li>Grow second small tree to predict the residuals from the first tree </li></ul><ul><li>New model is now: mean + (learning factor × Tree 1) + (learning factor × Tree 2) </li></ul><ul><li>Repeat iteratively while checking performance on an independent test sample </li></ul>
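The loop on this slide can be sketched end to end in plain Python. This is a toy under stated assumptions, not TreeNet itself: it uses least-squares loss (so the residuals are the gradients), a single numeric predictor, 2-node trees, no subsampling, and no test-sample monitoring. All function names and the toy data are hypothetical.

```python
def fit_stump(x, target):
    """Best 2-node regression tree (one split) on a single predictor."""
    cuts = sorted(set(x))
    if len(cuts) < 2:
        raise ValueError("need at least two distinct predictor values")
    best = None
    for lo, hi in zip(cuts, cuts[1:]):
        cut = (lo + hi) / 2.0
        left = [t for xi, t in zip(x, target) if xi <= cut]
        right = [t for xi, t in zip(x, target) if xi > cut]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, cut, left_mean, right_mean)
    _, cut, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= cut else right_mean


def boost(x, y, n_trees=200, learn_rate=0.1):
    """Model = mean + learn_rate * (Tree 1 + Tree 2 + ...)."""
    mean = sum(y) / len(y)
    prediction = [mean] * len(y)
    trees = []
    for _ in range(n_trees):
        # For least squares, the generalized residuals are y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        prediction = [pi + learn_rate * tree(xi)
                      for pi, xi in zip(prediction, x)]
    return lambda xi: mean + learn_rate * sum(tree(xi) for tree in trees)


def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy data: a smooth curve that a single stump cannot represent.
x = list(range(10))
y = [float(xi * xi) for xi in x]
model = boost(x, y, n_trees=500, learn_rate=0.1)
```

Each stump explains only a small part of the current residuals, and the learning rate shrinks that contribution further, so accuracy comes from the accumulation of many weak steps rather than from any single tree.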
  21. 21. Benefits of TreeNet <ul><li>Built on CART trees and thus </li></ul><ul><ul><li>immune to outliers </li></ul></ul><ul><ul><li>selects variables </li></ul></ul><ul><ul><li>results invariant to monotone transformations of variables </li></ul></ul><ul><ul><li>handles missing values automatically </li></ul></ul><ul><li>Resistant to mislabeled target data </li></ul><ul><ul><li>In medicine, cases are commonly misdiagnosed </li></ul></ul><ul><ul><li>In business, non-responders are occasionally flagged as “responders” </li></ul></ul><ul><li>Resistant to overtraining – generalizes very well </li></ul><ul><li>Can be remarkably accurate with little effort </li></ul><ul><li>Trains very rapidly; comparable to CART </li></ul>
  22. 22. TN Successes <ul><li>2009 KDD Cup 2nd place “Fast Scoring on Large Database” </li></ul><ul><li>2007 PAKDD competition: home loans up-sell to credit card owners 2nd place </li></ul><ul><ul><li>Model built in half a day using previous year submission as a blueprint </li></ul></ul><ul><li>2006 PAKDD competition: customer type discrimination 3rd place </li></ul><ul><ul><li>Model built in one day. 1st place accuracy 81.9%, TreeNet accuracy 81.2% </li></ul></ul><ul><li>2005 BI-CUP Sponsored by University of Chile attracted 60 competitors </li></ul><ul><li>2004 KDD Cup “Most Accurate” </li></ul><ul><li>2003 Duke University/NCR Teradata CRM modeling competition </li></ul><ul><ul><li>“Most Accurate” and “Best Top Decile Lift” on both in and out of time samples </li></ul></ul><ul><li>A major financial services company has tested TreeNet across a broad range of targeted marketing and risk models for the past 2 years </li></ul><ul><ul><li>TreeNet consistently outperforms previous best models (around 10% AUROC) </li></ul></ul><ul><ul><li>TreeNet models can be built in a fraction of the time previously devoted </li></ul></ul><ul><ul><li>TreeNet reveals previously undetected predictive power in data </li></ul></ul>
  23. 23. Key Controls <ul><li>Trees are kept small (2-6 nodes common) </li></ul><ul><li>Updates are small – can be as small as .01, .001, .0001 </li></ul><ul><li>Use random subsets of the training data in each cycle </li></ul><ul><ul><li>Never train on all the training data in any one cycle </li></ul></ul><ul><li>Highly problematic cases are IGNORED </li></ul><ul><ul><li>If model prediction starts to diverge substantially from observed data, that data will not be used in further updates </li></ul></ul><ul><li>TN allows very flexible control over interactions: </li></ul><ul><ul><li>Strictly Additive Models (no interactions allowed) </li></ul></ul><ul><ul><li>Low level interactions allowed </li></ul></ul><ul><ul><li>High level interactions allowed </li></ul></ul><ul><ul><li>Constraints: only specific interactions allowed (TN PRO) </li></ul></ul>
  24. 24. Interpreting TN Models <ul><li>As TN models consist of hundreds or even thousands of trees there is no useful way to represent the model via a display of one or two trees </li></ul><ul><li>However, the model can be summarized in a variety of ways </li></ul><ul><ul><li>Partial Dependency Plots : These exhibit the relationship between the target and any predictor – as captured by the model </li></ul></ul><ul><ul><li>Variable Importance Rankings : These stable rankings give an excellent assessment of the relative importance of predictors </li></ul></ul><ul><ul><li>ROC and Gains Curves : TN models produce scores that are typically unique for each scored record </li></ul></ul><ul><ul><li>Confusion Matrix : Using an adjustable score threshold this matrix displays the model false positive and false negative rates </li></ul></ul><ul><li>TreeNet models based on 2-node trees by definition EXCLUDE interactions </li></ul><ul><ul><li>Model may be highly nonlinear but is by definition strictly additive </li></ul></ul><ul><ul><li>Every term in the model is based on a single variable (single split) </li></ul></ul><ul><li>Build TreeNet on a larger tree (default is 6 nodes) </li></ul><ul><ul><li>Permits up to 5-way interaction but in practice is more like 3-way interaction </li></ul></ul><ul><li>Can conduct informal likelihood ratio test TN(2-node) versus TN(6-node) </li></ul><ul><li>Large differences signal important interactions </li></ul>
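The confusion-matrix summary with an adjustable threshold can be sketched as follows. The scores, labels, and the rule "predict positive when the score is at least the threshold" are hypothetical choices for illustration.

```python
def confusion_at_threshold(scores, labels, threshold):
    """Counts of true/false positives/negatives at one score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["fn" if label == 1 else "tn"] += 1
    return counts


def error_rates(counts):
    """False positive and false negative rates from the four counts."""
    fpr = counts["fp"] / (counts["fp"] + counts["tn"])
    fnr = counts["fn"] / (counts["fn"] + counts["tp"])
    return fpr, fnr


# Hypothetical model scores and true 0/1 labels for five records.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
counts = confusion_at_threshold(scores, labels, 0.5)
fpr, fnr = error_rates(counts)
```

Lowering the threshold trades false negatives for false positives; sweeping it across all observed scores traces out the ROC curve mentioned on the slide.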
  25. 25. Example: Boston Housing <ul><li>The results of running TN on the Boston Housing dataset are shown </li></ul><ul><li>All of the key insights agree with similar findings by MARS and CART </li></ul>
      Variable   Score
      LSTAT     100.00
      RM         83.71
      DIS        45.45
      CRIM       31.91
      NOX        30.69
      AGE        28.62
      PT         22.81
      TAX        19.74
      INDUS      12.19
      CHAS       11.93
  26. 26. References <ul><li>Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth. </li></ul><ul><li>Breiman, L. (1996). Bagging predictors. Machine Learning , 24, 123-140. </li></ul><ul><li>Hastie, T., Tibshirani, R. and Friedman, J.H. (2001). The Elements of Statistical Learning. Springer. </li></ul><ul><li>Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth International Conference , Morgan Kaufmann, pp. 148-156. </li></ul><ul><li>Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University. </li></ul><ul><li>Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University. </li></ul>