Tree net and_randomforests_2009



    1. 1. Introduction to Random Forests and Stochastic Gradient Boosting Dan Steinberg Mykhaylo Golovnya [email_address] August, 2009
    2. 2. Initial Ideas on Combining Trees <ul><li>The idea that combining good methods could yield promising results was suggested by researchers more than a decade ago </li></ul><ul><ul><li>In tree-structured analysis, the suggestion stems from: </li></ul></ul><ul><ul><ul><li>Wray Buntine (1991) </li></ul></ul></ul><ul><ul><ul><li>Kwok and Carter (1990) </li></ul></ul></ul><ul><ul><ul><li>Heath, Kasif and Salzberg (1993) </li></ul></ul></ul><ul><li>The notion is that if the trees can somehow get at different aspects of the data, the combination will be “better” </li></ul><ul><ul><li>Better in this context means more accurate in classification and prediction for future cases </li></ul></ul><ul><li>The original implementation of CART already included bagging (Bootstrap Aggregation) and ARCing (Adaptive Resampling and Combining) approaches to build tree ensembles </li></ul>
    3. 3. Past Decade Development <ul><li>The original bagging and boosting approaches relied on sampling with replacement techniques to obtain a new modeling dataset </li></ul><ul><li>Subsequent approaches focused on refining the sampling machinery or changing the modeling emphasis from the original dependent variable to current model generalized residuals </li></ul><ul><li>Most important variants (and dates of published articles) are: </li></ul><ul><ul><li>Bagging (Breiman, 1996, “Bootstrap Aggregation”) </li></ul></ul><ul><ul><li>Boosting (Freund and Schapire, 1995) </li></ul></ul><ul><ul><li>Multiple Additive Regression Trees (Friedman, 1999, aka MART™ or TreeNet™) </li></ul></ul><ul><ul><li>RandomForests™ (Breiman, 2001) </li></ul></ul><ul><li>Work continues with major refinements underway (Friedman in collaboration with Salford Systems) </li></ul>
    4. 4. <ul><li>Simplest example: </li></ul><ul><ul><li>Grow a tree on training data </li></ul></ul><ul><ul><li>Find a way to grow another tree, different from those currently available (change something in the set up) </li></ul></ul><ul><ul><li>Repeat many times, say 500 replications </li></ul></ul><ul><ul><li>Average results or create a voting scheme; for example, relate the probability of default (PD) to the fraction of trees predicting default for a given case </li></ul></ul>Multi Tree Methods <ul><li>Beauty of the method is that every new tree starts with a complete set of data </li></ul><ul><li>Any one tree can run out of data, but when that happens we just start again with a new tree and all the data (before sampling) </li></ul>Prediction Via Voting
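The voting scheme above can be sketched in a few lines. This is a toy: the per-tree 0/1 "default" votes are hard-coded stand-ins for real fitted trees.

```python
# Toy illustration of prediction via voting. Each entry is one tree's
# vote for a single case: 1 = "default", 0 = "no default".
# (Hypothetical votes standing in for, say, 500 fitted trees.)
tree_predictions = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# PD estimate: the fraction of trees predicting default for this case
pd_estimate = sum(tree_predictions) / len(tree_predictions)

# Classification by majority vote
predicted_class = 1 if pd_estimate >= 0.5 else 0
```

With hundreds of trees, the vote fraction becomes a reasonably fine-grained pseudo-probability score.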
    5. 5. Random Forest <ul><li>A random forest is a collection of single trees grown in a special way </li></ul><ul><li>The overall prediction is determined by voting (in classification) or averaging (in regression) </li></ul><ul><li>Accuracy is achieved by using a large number of trees </li></ul><ul><ul><li>The Law of Large Numbers ensures convergence </li></ul></ul><ul><ul><li>The key to accuracy is low correlation and bias </li></ul></ul><ul><ul><li>To keep bias and correlation low, trees are grown to maximum depth </li></ul></ul><ul><ul><li>Using more trees does not lead to overfitting, because each tree is grown independently </li></ul></ul><ul><li>Correlation is kept low through explicitly introduced randomness </li></ul><ul><li>RandomForests™ often works well when other methods work poorly </li></ul><ul><ul><li>The reasons for this are poorly understood </li></ul></ul><ul><ul><li>Sometimes other methods work well and RandomForests™ doesn’t </li></ul></ul>
    6. 6. Randomness is introduced in order to keep correlation low <ul><li>Randomness is introduced in two distinct ways </li></ul><ul><li>Each tree is grown on a bootstrap sample from the learning set </li></ul><ul><ul><li>Default bootstrap sample size equals original sample size </li></ul></ul><ul><ul><li>Smaller bootstrap sample sizes are sometimes useful </li></ul></ul><ul><li>A number R is specified (the square root of the number of predictors, by default) such that it is noticeably smaller than the total number of available predictors </li></ul><ul><li>During the tree growing phase, at each node only R predictors are randomly selected and tried </li></ul><ul><li>Randomness also reduces the signal to noise ratio in a single tree </li></ul><ul><ul><li>A low correlation between trees is more important than a high signal when many trees contribute to forming the model </li></ul></ul><ul><ul><li>RandomForests™ trees often have very low signal strength, even when the signal strength of the forest is high </li></ul></ul>
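Both randomization devices can be sketched with the standard library; the row and predictor counts below are illustrative, and no actual tree is grown.

```python
import math
import random

random.seed(0)  # for a reproducible sketch

n_rows, n_predictors = 100, 25
R = int(math.sqrt(n_predictors))   # Breiman's rule of thumb: R = sqrt(M)

# (1) Bootstrap sample: draw n_rows row indices with replacement;
# rows never drawn are the tree's "out-of-bag" cases.
bootstrap_rows = [random.randrange(n_rows) for _ in range(n_rows)]
out_of_bag = set(range(n_rows)) - set(bootstrap_rows)

# (2) At each node, only R randomly chosen predictors are eligible to split.
candidate_predictors = random.sample(range(n_predictors), R)
```

Because sampling is with replacement, roughly a third of the rows end up out-of-bag for any given tree, which is what lets RF estimate test error without an explicit test sample.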
    7. 7. Important to Keep Correlation Low <ul><li>Averaging many base learners improves the signal to noise ratio dramatically provided that the correlation of errors is kept low </li></ul><ul><li>Hundreds of base learners are needed for the most noticeable effect </li></ul>
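The claim can be made concrete with the standard identity for the variance of an average of correlated, identically distributed learners; here σ² = 1 and the ρ values are illustrative.

```python
# Variance of the average of k identically distributed base learners with
# individual variance sigma2 and pairwise error correlation rho:
#     rho*sigma2 + (1 - rho)*sigma2/k
# As k grows, only the correlated part rho*sigma2 survives -- which is why
# keeping correlation low matters more than adding ever more learners.

def ensemble_variance(sigma2, rho, k):
    return rho * sigma2 + (1.0 - rho) * sigma2 / k

low_corr = ensemble_variance(1.0, 0.05, 500)   # hundreds of weakly correlated learners
high_corr = ensemble_variance(1.0, 0.90, 500)  # same count, strongly correlated
```

With 500 learners, the weakly correlated ensemble's error variance drops to about 5% of a single learner's, while the strongly correlated one barely improves at all.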
    8. 8. Randomness in Split Selection <ul><li>Topic discussed by several Machine Learning researchers </li></ul><ul><li>Possibilities: </li></ul><ul><ul><li>Select splitter, split point, or both at random </li></ul></ul><ul><ul><li>Choose splitter at random from the top K splitters </li></ul></ul><ul><li>Random Forests: Suppose we have M available predictors </li></ul><ul><ul><li>Select R eligible splitters at random and let the best of them split the node </li></ul></ul><ul><ul><li>If R =1 this is just random splitter selection </li></ul></ul><ul><ul><li>If R=M this becomes Breiman’s bagger </li></ul></ul><ul><ul><li>If R << M then we get Breiman’s Random Forests </li></ul></ul><ul><ul><ul><li>Breiman suggests R=sqrt( M ) as a good rule of thumb </li></ul></ul></ul>
    9. 9. Performance as a Function of R <ul><li>In this experiment, we ran RF with 100 trees on sample data (772x111) using different values for the number of variables R (N Vars) searched at each split </li></ul><ul><li>Combining trees always improves performance, with the optimal number of sampled predictors stabilizing at around 11 </li></ul>
    10. 10. Usage Notes <ul><li>RF does not require an explicit test sample </li></ul><ul><li>Capable of capturing high-order interactions </li></ul><ul><li>Both running speed and resources consumed depend for the most part on the row dimension of the data </li></ul><ul><ul><li>Trees are grown in as simple a way as feasible to keep run times low (no surrogates, no priors, etc.) </li></ul></ul><ul><li>Classification models produce pseudo-probability scores (percent of votes) </li></ul><ul><li>Performance-wise, RF is capable of matching the performance of modern boosting techniques, including MART (described later) </li></ul><ul><li>Naturally allows parallel processing </li></ul><ul><li>The final model code is usually bulky, voluminous, and impossible to interpret directly </li></ul><ul><li>Current stable implementations include multinomial classification and least squares regression, with ongoing research in the more advanced fields of predictive modeling (survival, choice, etc.) </li></ul>
    11. 11. Proximity Matrix – Raw Material for Further Advances <ul><li>RF introduces a novel way to define proximity between two observations: </li></ul><ul><ul><li>For a dataset of size N define an N x N matrix of proximities </li></ul></ul><ul><ul><li>Initialize all proximities to zeroes </li></ul></ul><ul><ul><li>For any given tree, apply the tree to the dataset </li></ul></ul><ul><ul><li>If case i and case j both end up in the same terminal node, increase the proximity Prox( i , j ) between i and j by one </li></ul></ul><ul><ul><li>Accumulate over all trees in RF and normalize by twice the number of trees in RF </li></ul></ul><ul><li>The resulting matrix provides an intrinsic measure of proximity </li></ul><ul><ul><li>Observations that are “alike” will have proximities close to one </li></ul></ul><ul><ul><li>The closer the proximity to 0, the more dissimilar cases i and j are </li></ul></ul><ul><ul><li>The measure is invariant to monotone transformations </li></ul></ul><ul><ul><li>The measure is clearly defined for any type of independent variables, including categorical </li></ul></ul>
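The proximity accumulation can be sketched as follows. Each hypothetical tree is represented only by the terminal-node id it assigns to every case (hard-coded here); for simplicity this toy divides by the number of trees, so a case always has proximity 1.0 with itself — a different normalization constant only rescales the matrix.

```python
# Toy sketch of the RF proximity computation over a 4-case dataset.
n_cases = 4
leaf_ids = [
    [0, 0, 1, 1],   # tree 1: cases 0,1 share a leaf; cases 2,3 share a leaf
    [0, 1, 1, 1],   # tree 2: cases 1,2,3 share a leaf
]

# accumulate co-occurrence counts in the same terminal node
prox = [[0.0] * n_cases for _ in range(n_cases)]
for leaves in leaf_ids:
    for i in range(n_cases):
        for j in range(n_cases):
            if leaves[i] == leaves[j]:
                prox[i][j] += 1.0

# normalize so proximities lie in [0, 1]
n_trees = len(leaf_ids)
for i in range(n_cases):
    for j in range(n_cases):
        prox[i][j] /= n_trees
```

Cases 2 and 3 land together in every tree (proximity 1.0); cases 0 and 2 never do (proximity 0.0).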
    12. 12. <ul><li>Based on proximities one can: </li></ul><ul><ul><li>Proceed with a well-defined clustering solution </li></ul></ul><ul><ul><ul><li>Note: the solution is guided by the target variable used in the RF model </li></ul></ul></ul><ul><ul><li>Detect outliers </li></ul></ul><ul><ul><ul><li>By computing average proximity between the current observation and all the remaining observations sharing the same class </li></ul></ul></ul><ul><ul><li>Generate informative data views/projections using scaling coordinates </li></ul></ul><ul><ul><ul><li>Non-metric multidimensional scaling produces most satisfactory results here </li></ul></ul></ul><ul><ul><li>Do missing value imputation using current proximities as weights in the nearest neighbor imputation techniques </li></ul></ul><ul><li>Ongoing work on possible expansion of the above to the unsupervised learning area of data mining </li></ul>Post Processing and Interpretation
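The proximity-based outlier check described above, as a toy; the proximity matrix and the single-class labels are hard-coded assumptions, not output of a real forest.

```python
# Flag as an outlier the case with the lowest average proximity to the
# other cases of its own class. Proximities and labels are hypothetical.
prox = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.9, 0.1],
    [0.7, 0.9, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
labels = [0, 0, 0, 0]   # one class, for simplicity

def avg_within_class_prox(i):
    others = [j for j in range(len(labels))
              if j != i and labels[j] == labels[i]]
    return sum(prox[i][j] for j in others) / len(others)

scores = [avg_within_class_prox(i) for i in range(len(labels))]
outlier = min(range(len(labels)), key=lambda i: scores[i])   # case 3 here
```

The same per-case scores can serve as weights in nearest-neighbor style missing-value imputation.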
    13. 13. Introduction to Stochastic Gradient Boosting <ul><li>TreeNet (TN) is a new approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University </li></ul><ul><ul><li>Co-author of CART® with Breiman, Olshen and Stone </li></ul></ul><ul><ul><li>Author of MARS®, PRIM, Projection Pursuit, COSA, RuleFit™ and more </li></ul></ul><ul><li>Also known as Stochastic Gradient Boosting and MART ( Multiple Additive Regression Trees ) </li></ul><ul><li>Naturally supports the following classes of predictive models </li></ul><ul><ul><li>Regression (continuous target, LS and LAD loss functions) </li></ul></ul><ul><ul><li>Binary classification (binary target, logistic likelihood loss function) </li></ul></ul><ul><ul><li>Multinomial classification (multiclass target, multinomial likelihood loss function) </li></ul></ul><ul><ul><li>Poisson regression (counting target, Poisson likelihood loss function) </li></ul></ul><ul><ul><li>Exponential survival (positive target with censoring) </li></ul></ul><ul><ul><li>Proportional hazard Cox survival model </li></ul></ul><ul><li>TN builds on the notions of committees of experts and boosting but is substantially different in key implementation details </li></ul>
    14. 14. Predictive Modeling <ul><li>We are interested in studying the conditional distribution of the dependent variable Y given X in the predictor space </li></ul><ul><li>We assume that some quantity f can be used to fully or partially describe such a distribution </li></ul><ul><ul><li>In regression problems f is usually the mean or the median </li></ul></ul><ul><ul><li>In binary classification problems f is the log-odds of Y =1 </li></ul></ul><ul><ul><li>In Cox survival problems f is the scaling factor in the unknown hazard function </li></ul></ul><ul><li>Thus we want to construct a “nice” function f ( X ) which in turn can be used to study the behavior of Y at a given point in the predictor space </li></ul><ul><ul><li>Function f ( X ) is sometimes referred to as the “ response surface ” </li></ul></ul><ul><li>We need to define how “nice” can be measured </li></ul>(Diagram: X → Model → f )
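For the binary-classification case above, where f is the log-odds of Y = 1, the mapping between f and the probability scale is a two-line affair:

```python
import math

def logit(p):
    # probability -> log-odds (the f of the binary classification case)
    return math.log(p / (1.0 - p))

def inv_logit(f):
    # log-odds -> probability of Y = 1
    return 1.0 / (1.0 + math.exp(-f))
```

So f = 0 corresponds to a 50/50 probability, and the two functions invert each other.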
    15. 15. Loss Functions <ul><li>In predictive modeling the problem is usually attacked by introducing a well-chosen loss function L ( Y , X , f ( X )) </li></ul><ul><ul><li>In stochastic gradient boosting we need a loss function for which gradients can easily be computed and used to construct good base learners </li></ul></ul><ul><ul><li>The loss function used on the test data does not need the same properties </li></ul></ul><ul><li>Practical ways of constructing loss functions </li></ul><ul><ul><li>Direct interpretation of f ( X i ) as an estimate of Y i or a population statistic of the distribution of Y conditional on X </li></ul></ul><ul><ul><ul><li>Least Squares Loss (LS), f i is an estimate of E( Y| X i ) </li></ul></ul></ul><ul><ul><ul><li>Least Absolute Deviation Loss (LAD), f i is an estimate of median( Y | X i ) </li></ul></ul></ul><ul><ul><ul><li>Huber-M Loss, f i is an estimate of Y i </li></ul></ul></ul><ul><ul><li>Choosing a conditional distribution for Y | X , defining f ( X ) as a parameter of that distribution and using the negative log-likelihood as the loss function </li></ul></ul><ul><ul><ul><li>Logistic Loss (conditional Bernoulli, f ( X ) is the half log-odds of Y =1) </li></ul></ul></ul><ul><ul><ul><li>Poisson Loss (conditional Poisson, f ( X ) is log( λ )) </li></ul></ul></ul><ul><ul><ul><li>Exponential Loss (conditional Exponential, f ( X ) is log( λ )) </li></ul></ul></ul><ul><ul><li>More general likelihood functions, for example, multinomial discrete choice, the Cox model </li></ul></ul>
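The regression losses above are easy to write out directly. A hedged sketch: the 0.5 factor on the LS loss is a common convention rather than something the slide specifies, and the Huber-M threshold `delta` is a tuning parameter (1.0 is an arbitrary illustrative choice).

```python
def ls_loss(y, f):
    # least squares: quadratic everywhere, sensitive to outliers
    return 0.5 * (y - f) ** 2

def lad_loss(y, f):
    # least absolute deviation: linear everywhere, robust
    return abs(y - f)

def huber_loss(y, f, delta=1.0):
    # quadratic for small residuals (like LS), linear for large (like LAD)
    r = abs(y - f)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
```

Within the threshold, Huber-M agrees exactly with LS; beyond it, the loss grows only linearly, which is the compromise slide 16 describes.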
    16. 16. Regression and Classification Losses <ul><li>Huber-M regression loss is a reasonable compromise between the classical LS loss and robust LAD loss </li></ul><ul><li>Logistic log-likelihood based loss strikes the middle ground between the extremely sensitive exponential loss on one side and conventional LS and LAD losses on the other side </li></ul>
    17. 17. Practical Estimate <ul><li>In reality, we have a set of N observed pairs ( X i , y i ) from the population, not the entire population </li></ul><ul><li>Hence, we use sample-based estimates of L ( Y , X , f ( X )) </li></ul><ul><li>To avoid biased estimates, one usually partitions the data into independent learn and test samples, using the latter to compute an unbiased estimate of the population loss </li></ul><ul><li>In stochastic gradient boosting the problem is attacked by acting as if we are trying to minimize the loss function on the learn sample, but doing so in a slow, constrained way </li></ul><ul><li>This results in a series of models that move closer and closer to the f ( X ) function that minimizes the loss on the learn sample; eventually new models become overfit to the learn sample </li></ul><ul><li>From this sequence the function f ( X ) with the lowest loss on the test sample is chosen </li></ul><ul><ul><li>By choosing from a fixed set of models, overfitting to the test data is avoided </li></ul></ul><ul><ul><li>Sometimes the loss functions used on the test data and learn data differ </li></ul></ul>
    18. 18. Parametric Approach <ul><li>The function f ( X ) is introduced as a known function of a fixed set of unknown parameters </li></ul><ul><li>The problem then reduces to finding a set of optimal parameter estimates using classical optimization techniques </li></ul><ul><li>In linear regression and logistic regression: f ( X ) is a linear combination of fixed predictors; the parameters are the intercept and the slope coefficients </li></ul><ul><li>Major problem: the function and predictors need to be specified beforehand – this can result in a lengthy specification search by trial and error </li></ul><ul><ul><li>If this trial-and-error process uses the same data as the final model, that model will be overfit. This is the classical overfitting problem </li></ul></ul><ul><ul><li>If new data are used to estimate the final model and the model performs poorly, the specification search process must be repeated </li></ul></ul><ul><li>This approach shows most benefits on small datasets where only simple specifications can be justified, or on datasets where there is strong a priori knowledge of the correct specification </li></ul>
    19. 19. Non-parametric Approach <ul><li>Construct f ( X ) using a data-driven incremental approach </li></ul><ul><li>Start with a constant, then at each stage adjust the values of f ( X ) by small increments in various regions of data </li></ul><ul><li>It is important to keep the adjustment rate low – the resulting model will become smoother and be less subject to overfitting </li></ul><ul><li>Treating f i = f ( X i ) at all individual observed data points as separate parameters, the negative of the gradient points in the direction of change in f ( X ) that results in the steepest reduction of the loss </li></ul><ul><li> G = { g i = −∂ R /∂ f i ; i = 1,…, N } </li></ul><ul><ul><li>The components of the negative gradient will be called generalized residuals </li></ul></ul><ul><li>We want to limit the number of currently allowed separate adjustments to a small number M – a natural way to proceed then is to find an orthogonal partition of the X -space into M mutually exclusive regions such that the variance of the residuals within each region is minimized </li></ul><ul><ul><li>This job is accomplished by building a fixed-size M -node regression tree using the generalized residuals as the current target variable </li></ul></ul>
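The generalized residuals can be written out for two common losses. A sketch following Friedman's gradient-boosting formulation; the y ∈ {−1, +1} coding for the logistic case, with f the half log-odds, is an assumption of this sketch.

```python
import math

# Generalized residuals: the negative gradient of the loss with respect
# to f_i, evaluated pointwise.

def ls_residual(y, f):
    # loss 0.5*(y - f)^2  ->  negative gradient is the ordinary residual
    return y - f

def logistic_residual(y, f):
    # y in {-1, +1}, loss log(1 + exp(-2*y*f))  ->  negative gradient
    return 2.0 * y / (1.0 + math.exp(2.0 * y * f))
```

A fixed-size M-node regression tree is then fit with these residuals as the target, exactly as described above; note the logistic residual shrinks toward zero for cases the model already classifies confidently.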
    20. 20. TreeNet Process <ul><li>Begin with the sample mean (e.g., logit - for all observations set p=sample share) </li></ul><ul><li>Add one very small tree as initial model based on gradients </li></ul><ul><ul><li>For regression and logit, residuals are gradients </li></ul></ul><ul><ul><li>Could be as small as ONE split generating 2 terminal nodes </li></ul></ul><ul><ul><li>Typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes </li></ul></ul><ul><ul><li>Output is a continuous response surface (e.g. log-odds for binary classification) </li></ul></ul><ul><ul><li>Model is intentionally “weak” </li></ul></ul><ul><ul><li>Multiply contribution by a learning factor λ before adding it to the model </li></ul></ul><ul><ul><li>Model is now: mean + λ · Tree 1 </li></ul></ul><ul><li>Compute new gradients (residuals) </li></ul><ul><ul><li>The actual definition of the residual is driven by the type of the loss function </li></ul></ul><ul><li>Grow a second small tree to predict the residuals from the first tree </li></ul><ul><li>New model is now: mean + λ · Tree 1 + λ · Tree 2 </li></ul><ul><li>Repeat iteratively while checking performance on an independent test sample </li></ul>
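The whole loop can be sketched in plain Python on hard-coded toy data. This is a least-squares sketch only: the base learner is a single-split regression "stump" (2 terminal nodes), there is no subsampling and no test-sample monitoring, so it is an illustration of the process above rather than the TreeNet implementation.

```python
# Toy 1-D data (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]

def fit_stump(xs, residuals):
    """Best single split on x: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1], best[2], best[3]

learn_rate = 0.1                      # the small factor multiplying each tree
mean_y = sum(ys) / len(ys)            # begin with the sample mean
preds = [mean_y] * len(xs)

for _ in range(200):                  # many small, weak trees
    residuals = [y - p for y, p in zip(ys, preds)]  # LS gradient = residual
    t, lm, rm = fit_stump(xs, residuals)
    preds = [p + learn_rate * (lm if x < t else rm)
             for x, p in zip(xs, preds)]

final_sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
initial_sse = sum((y - mean_y) ** 2 for y in ys)
```

Each pass fits a stump to the current residuals and adds λ times its prediction; the learn-sample SSE falls steadily from that of the initial mean-only model.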
    21. 21. Benefits of TreeNet <ul><li>Built on CART trees and thus </li></ul><ul><ul><li>immune to outliers </li></ul></ul><ul><ul><li>selects variables </li></ul></ul><ul><ul><li>results invariant under monotone transformations of variables </li></ul></ul><ul><ul><li>handles missing values automatically </li></ul></ul><ul><li>Resistant to mislabeled target data </li></ul><ul><ul><li>In medicine cases are commonly misdiagnosed </li></ul></ul><ul><ul><li>In business, occasionally non-responders flagged as “responders” </li></ul></ul><ul><li>Resistant to overtraining – generalizes very well </li></ul><ul><li>Can be remarkably accurate with little effort </li></ul><ul><li>Trains very rapidly; comparable to CART </li></ul>
    22. 22. <ul><li>2009 KDD Cup: 2nd place, “Fast Scoring on Large Database” </li></ul><ul><li>2007 PAKDD competition: home loans up-sell to credit card owners, 2nd place </li></ul><ul><ul><li>Model built in half a day using previous year’s submission as a blueprint </li></ul></ul><ul><li>2006 PAKDD competition: customer type discrimination, 3rd place </li></ul><ul><ul><li>Model built in one day; 1st place accuracy 81.9%, TreeNet accuracy 81.2% </li></ul></ul><ul><li>2005 BI-CUP sponsored by the University of Chile attracted 60 competitors </li></ul><ul><li>2004 KDD Cup: “Most Accurate” </li></ul><ul><li>2003 Duke University/NCR Teradata CRM modeling competition </li></ul><ul><ul><li>“Most Accurate” and “Best Top Decile Lift” on both in- and out-of-time samples </li></ul></ul><ul><li>A major financial services company has tested TreeNet across a broad range of targeted marketing and risk models for the past 2 years </li></ul><ul><ul><li>TreeNet consistently outperforms previous best models (by around 10% AUROC) </li></ul></ul><ul><ul><li>TreeNet models can be built in a fraction of the time previously devoted </li></ul></ul><ul><ul><li>TreeNet reveals previously undetected predictive power in data </li></ul></ul>TN Successes
    23. 23. <ul><li>Trees are kept small (2-6 nodes common) </li></ul><ul><li>Updates are small – can be as small as .01, .001, .0001 </li></ul><ul><li>Use random subsets of the training data in each cycle </li></ul><ul><ul><li>Never train on all the training data in any one cycle </li></ul></ul><ul><li>Highly problematic cases are IGNORED </li></ul><ul><ul><li>If model prediction starts to diverge substantially from observed data, that data will not be used in further updates </li></ul></ul><ul><li>TN allows very flexible control over interactions: </li></ul><ul><ul><li>Strictly Additive Models (no interactions allowed) </li></ul></ul><ul><ul><li>Low level interactions allowed </li></ul></ul><ul><ul><li>High level interactions allowed </li></ul></ul><ul><ul><li>Constraints: only specific interactions allowed (TN PRO) </li></ul></ul>Key Controls
    24. 24. <ul><li>As TN models consist of hundreds or even thousands of trees, there is no useful way to represent the model via a display of one or two trees </li></ul><ul><li>However, the model can be summarized in a variety of ways </li></ul><ul><ul><li>Partial Dependency Plots : These exhibit the relationship between the target and any predictor – as captured by the model </li></ul></ul><ul><ul><li>Variable Importance Rankings : These stable rankings give an excellent assessment of the relative importance of predictors </li></ul></ul><ul><ul><li>ROC and Gains Curves : TN models produce scores that are typically unique for each scored record </li></ul></ul><ul><ul><li>Confusion Matrix : Using an adjustable score threshold, this matrix displays the model false positive and false negative rates </li></ul></ul><ul><li>TreeNet models based on 2-node trees by definition EXCLUDE interactions </li></ul><ul><ul><li>Model may be highly nonlinear but is by definition strictly additive </li></ul></ul><ul><ul><li>Every term in the model is based on a single variable (single split) </li></ul></ul><ul><li>Build TreeNet on a larger tree (default is 6 nodes) </li></ul><ul><ul><li>Permits up to 5-way interaction but in practice is more like 3-way interaction </li></ul></ul><ul><li>Can conduct an informal likelihood ratio test: TN(2-node) versus TN(6-node) </li></ul><ul><li>Large differences signal important interactions </li></ul>Interpreting TN Models
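A partial dependence curve of the kind described above can be computed by averaging the model's predictions over the data while sweeping one predictor; the `model` function and `data` below are hypothetical stand-ins for a fitted TN scorer and its training sample.

```python
# (x1, x2) rows of a hypothetical training sample
data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

def model(x1, x2):
    # hypothetical fitted model standing in for a tree ensemble
    return 2.0 * x1 + 0.1 * x2

def partial_dependence(x1_value):
    # replace x1 by the sweep value in every row, average the predictions
    return sum(model(x1_value, x2) for _, x2 in data) / len(data)

pd_curve = [partial_dependence(v) for v in (1.0, 2.0, 3.0)]
```

For this additive toy model the curve recovers the x1 effect (slope 2) up to a constant offset, which is exactly the behavior partial dependence plots are meant to expose.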
    25. 25. Example: Boston Housing <ul><li>The results of running TN on the Boston Housing dataset are shown </li></ul><ul><li>All of the key insights agree with similar findings by MARS and CART </li></ul>Variable importance scores: LSTAT 100.00, RM 83.71, DIS 45.45, CRIM 31.91, NOX 30.69, AGE 28.62, PT 22.81, TAX 19.74, INDUS 12.19, CHAS 11.93
    26. 26. References <ul><li>Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth </li></ul><ul><li>Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140 </li></ul><ul><li>Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning. Springer </li></ul><ul><li>Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, pp. 148-156 </li></ul><ul><li>Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University </li></ul><ul><li>Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University </li></ul>