Introduction to RandomForests 2004


Published in: Technology, Education

  1. 1. An Introduction to RandomForests™ Salford Systems Dan Steinberg, Mikhail Golovnya, N. Scott Cardell
  2. 2.  New approach for many data analytical tasks developed by Leo Breiman of University of California, Berkeley ◦ Co-author of CART® with Friedman, Olshen, and Stone ◦ Author of Bagging and Arcing approaches to combining trees  Good for classification and regression problems ◦ Also for clustering, density estimation ◦ Outlier and anomaly detection ◦ Explicit missing value imputation  Builds on the notions of committees of experts but is substantially different in key implementation details
  3. 3.  The term “data mining” usually refers to pattern discovery in large databases  Initially appeared in the late twentieth century and directly associated with the PC boom ◦ Spread of data collection devices ◦ Dramatically increased data storage capacity ◦ Exponential growth in computational power of CPUs  The necessity to go way beyond standard statistical techniques in data analysis ◦ Dealing with extremely large numbers of variables ◦ Dealing with highly non-linear dependency structures ◦ Dealing with missing values and dirty data
  4. 4.  The following major classes of problems are usually considered: ◦ Supervised Learning (interested in predicting some outcome variable based on observed predictors)  Regression (quantitative outcome)  Classification (nominal or categorical outcome) ◦ Unsupervised Learning (no single target variable available; interested in partitioning data into clusters, finding association rules, etc.)
  5. 5.  Relating gene expressions to the presence of a certain disease based upon microarray data  Identifying potential fraud cases in credit card transactions (binary target)  Predicting level of user satisfaction as poor, average, good, excellent (4-level target)  Optical Digit Recognition (10-level target)  Predicting consumer preferences towards different kinds of vehicles (the target could have as many as several hundred levels)
  6. 6.  Predicting efficacy of a drug based upon demographic factors  Predicting the amount of sales (target) based on current observed conditions  Predicting user energy consumption (target) depending on the season, business type, location, etc.  Predicting median house value (target) based on the crime rate, pollution level, proximity, age, industrialization level, etc.
  7. 7.  DNA Microarray Data- which samples cluster together? Which genes cluster together?  Market Basket Analysis- which products do customers tend to buy together?  Clustering For Classification- Handwritten zip code problem: can we find prototype digits for 1,2, etc. to use for classification?
  8. 8.  The answer usually has two sides: ◦ Understanding the relationship ◦ Predictive accuracy  Some algorithms dominate one side (understanding) ◦ Classical methods ◦ Single trees ◦ Nearest neighbor ◦ MARS  Others dominate the other side (predicting) ◦ Neural nets ◦ TreeNet ◦ Random Forests
  9. 9.  Leo Breiman says: ◦ Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is  The goal is NOT interpretability, but accurate information  Nature’s mechanisms are generally complex and cannot be summarized by a relatively simple stochastic model, even as a first approximation  The better the model fits the data, the more sound the inferences about the phenomenon are
  10. 10.  The only way to attain the best predictive accuracy on real-life data is to build a complex model  Analyzing this model will also provide the most accurate insight!  At the same time, the model complexity makes it far more difficult to analyze ◦ A random forest may contain 3,000 trees jointly contributing to the overall prediction ◦ There could be 5,000 association rules found in a typical unsupervised learning algorithm
  11. 11.  (Insert table)  Example of a classification tree for UCSD heart disease study
  12. 12.  Relatively fast  Requires minimal supervision by analyst  Produces easy to understand models  Conducts automatic variable selection  Handles missing values via surrogate splits  Invariant to monotonic transformations of predictors  Impervious to outliers
  13. 13.  Piece-wise constant models  “Sharp” decision boundaries  Exponential data exhaustion  Difficulties capturing global linear patterns  Models tend to revolve around the strongest effects  Not the best predictive accuracy
  14. 14.  A random forest is a collection of single trees grown in a special way  The overall prediction is determined by voting (in classification) or averaging (in regression)  The Law of Large Numbers ensures convergence  The key to accuracy is low correlation and bias  To keep bias low, trees are grown to maximum depth
  15. 15.  Each tree is grown on a bootstrap sample from the learning set  A number R is specified (square root of the number of predictors by default) such that it is noticeably smaller than the total number of available predictors  During the tree growing phase, at each node only R predictors are randomly selected and tried
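The two sources of randomness described on this slide can be sketched in a few lines of standard-library Python. This is only an illustration, not the RandomForests implementation: the function names are hypothetical, rounding the square root to the nearest integer is an assumption, and the sizes 772 and 111 are borrowed from the prostate example used later in the deck.

```python
import math
import random

def bootstrap_indices(n_records, rng):
    """Draw n_records row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_records) for _ in range(n_records)]

def candidate_predictors(n_predictors, rng, r=None):
    """Pick R distinct predictor indices to try at one node.

    Assumption: the default R is the square root of the number of
    predictors, rounded to the nearest integer.
    """
    if r is None:
        r = max(1, round(math.sqrt(n_predictors)))
    return rng.sample(range(n_predictors), r)

rng = random.Random(0)
rows = bootstrap_indices(772, rng)      # rows used to grow one tree
cols = candidate_predictors(111, rng)   # predictors tried at one node
print(len(rows), "rows drawn,", len(set(rows)), "distinct")
print("candidate predictors at this node:", sorted(cols))
```

A fresh call to `candidate_predictors` would be made at every node, while `bootstrap_indices` is called once per tree.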
  16. 16.  All major advantages of a single tree are automatically preserved  Since each tree is grown on a bootstrap sample, one can ◦ Use out of bag samples to compute an unbiased estimate of the accuracy ◦ Use out of bag samples to determine variable importances  There is no overfitting as the number of trees increases
  17. 17.  It is possible to compute generalized proximity between any pair of cases  Based on proximities one can ◦ Proceed with a well-defined clustering solution ◦ Detect outliers ◦ Generate informative data views/projections using scaling coordinates ◦ Do missing value imputation  Easy expansion into the unsupervised learning domain
  18. 18.  High levels of predictive accuracy delivered automatically ◦ Only a few control parameters to experiment with ◦ Strong for both regression and classification  Resistant to overtraining (overfitting)- generalizes well to new data  Trains rapidly even with thousands of potential predictors ◦ No need for prior feature (variable) selection  Diagnostics pinpoint multivariate outliers  Offers a revolutionary new approach to clustering using tree-based between-record distance measures  Built on CART® inspired trees and thus ◦ Results invariant to monotone transformations of variables
  19. 19.  Method intended to generate a large number of substantially different models ◦ Randomness introduced in two simultaneous ways ◦ By row: records selected for training at random with replacement (as in bootstrap resampling of the bagger) ◦ By column: candidate predictors at any node are chosen at random and best splitter selected from the random subset  Each tree is grown out to maximal size and left unpruned ◦ Trees are deliberately overfit, becoming a form of nearest neighbor predictor ◦ Experiments convincingly show that pruning these trees hurts performance ◦ Overfit individual trees combine to yield properly fit ensembles
  20. 20.  Self-testing possible even if all data is used for training ◦ Only 63% of available training data will be used to grow any one tree ◦ A 37% portion of training data always unused  The unused portion of the training data is known as Out-Of-Bag (OOB) data and can be used to provide an ongoing dynamic assessment of model performance ◦ Allows fitting to small data sets without explicitly holding back any data for testing ◦ All training data is used cumulatively in training, but only a 63% portion used at any one time  Similar to cross-validation but unstructured
  21. 21.  Intensive post processing of data to extract more insight into data ◦ Most important is introduction of distance metric between any two data records ◦ The more similar two records are the more often they will land in same terminal node of a tree ◦ With a large number of different trees simply count the number of times they co-locate in same leaf nodes ◦ Distance metric can be used to construct dissimilarity matrix input into hierarchical clustering
  22. 22.  Ultimately in modeling our goal is to produce a single score, prediction, forecast, or class assignment  The motivation for generating multiple models is the hope that by somehow combining models the results will be better than if we relied on a single model  When multiple models are generated they are normally combined by ◦ Voting in classification problems, perhaps weighted ◦ Averaging in regression problems, perhaps weighted
  23. 23.  Combining trees via averaging or voting will only be beneficial if the trees are different from each other  In the original bootstrap aggregation paper Breiman noted that bagging worked best for high variance (unstable) techniques ◦ If results of each model are near identical there is little to be gained by averaging  Resampling of the bagger from the training data is intended to induce differences in trees ◦ Accomplished by essentially varying the weight on any data record
  24. 24.  Bootstrap sample is fairly similar to taking a 65% sample from the original training data  If you grow many trees each based on a different 65% random sample of your data you expect some variation in the trees produced  Bootstrap sample goes a bit further in ensuring that the new sample is of the same size as the original by allowing some records to be selected multiple times  In practice the different samples induce different trees but trees are not that different
  25. 25.  The bagger was limited by the fact that even with resampling trees are likely to be somewhat similar to each other, particularly with strong data structure  Random Forests induces vastly more between tree differences by forcing splits to be based on different predictors ◦ Accomplished by introducing randomness into split selection
  26. 26.  Breiman points out tradeoff: ◦ As R increases strength of individual tree should increase ◦ However, correlation between trees also increases reducing advantage of combining  Want to select R to optimally balance the two effects ◦ Can only be determined via experimentation  Breiman has suggested three values to test: ◦ R = (1/2)·sqrt(M) ◦ R = sqrt(M) ◦ R = 2·sqrt(M) ◦ For M = 100 test values for R: 5, 10, 20 ◦ For M = 400 test values for R: 10, 20, 40
  27. 27.  Random Forests machinery unlike CART in that ◦ Only one splitting rule: Gini ◦ Class weight concept but no explicit priors or costs ◦ No surrogates: Missing values imputed for data first automatically  Default fast imputation just uses means  Compute intensive method uses tree-based nearest neighbors to base imputation on (discussed later) ◦ None of the display and reporting machinery or tree refinement services of CART  Does follow CART in that all splits are binary
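The three suggested test values are easy to compute for any number of predictors M. The helper below is a hypothetical sketch; rounding to the nearest integer is an assumption, since the slide only gives examples where the square root is a whole number.

```python
import math

def candidate_r_values(m):
    """Return the three values of R suggested for testing with M predictors:
    half the square root, the square root, and twice the square root.
    Assumption: non-integer results are rounded to the nearest integer."""
    root = math.sqrt(m)
    return [max(1, round(0.5 * root)), round(root), round(2 * root)]

for m in (100, 400):
    print(m, candidate_r_values(m))
```

For M = 100 and M = 400 this reproduces the values listed on the slide.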
  28. 28.  Trees combined via voting (classification) or averaging (regression)  Classification trees “vote” ◦ Recall that classification trees classify  Assign each case to ONE class only ◦ With 50 trees, 50 class assignments for each case ◦ Winner is the class with the most votes ◦ Votes could be weighted- say by accuracy of individual trees  Regression trees assign a real predicted value for each case ◦ Predictions are combined via averaging ◦ Results will be much smoother than from a single tree
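A minimal sketch of the two combination rules, with optional weights. The function names are hypothetical, and the tie-breaking behavior (the class seen first wins) is an assumption of this sketch rather than something the slides specify.

```python
from collections import Counter

def vote(class_predictions, weights=None):
    """Return the class with the largest (optionally weighted) vote total.
    Ties go to the class that appeared first among the predictions."""
    if weights is None:
        weights = [1.0] * len(class_predictions)
    tally = Counter()
    for label, weight in zip(class_predictions, weights):
        tally[label] += weight
    return max(tally, key=tally.get)

def average(values, weights=None):
    """Return the (optionally weighted) mean of the tree predictions."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

print(vote(["a", "b", "a"]))              # unweighted majority
print(vote(["a", "b", "a"], [1, 5, 1]))   # a heavily weighted tree can win
print(average([1.0, 2.0, 3.0]))
```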
  29. 29.  Probability of being omitted in a single draw is (1-1/n)  Probability of being omitted in all n draws is (1-1/n)^n  Limit of the sequence as n increases is 1/e = 0.368 ◦ Approximately 36.8% of the sample is excluded and contributes 0% of the resample ◦ 36.8% of the sample is included once and makes up 36.8% of the resample ◦ 18.4% of the sample is included twice and thus represents 36.8% of the resample ◦ 6.1% of the sample is included three times and makes up 18.4% of the resample ◦ 1.9% of the sample is included four or more times and makes up the remaining 8% of the resample (100% in total) ◦ Example: distribution of weights in a 2,000 record resample: ◦ (insert table)
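The percentages on this slide can be checked numerically. The sketch below uses the exact binomial probability that a given record is drawn k times in n draws with replacement; the function name is hypothetical, and n = 2,000 matches the example resample size mentioned above.

```python
import math

def inclusion_probability(n, k):
    """P(a given record is drawn exactly k times in n draws with replacement)."""
    p = 1.0 / n
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n = 2000
for k in range(4):
    share = inclusion_probability(n, k)
    # A record drawn k times contributes k copies to the resample.
    print(f"drawn {k} times: {share:.3f} of sample, {k * share:.3f} of resample")
print("limit of (1-1/n)^n:", round(math.exp(-1), 3))
```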
  30. 30.  Want to use mass spectrometer data to classify different types of prostate cancer ◦ 772 observations available  398- healthy samples  178- 1st type of cancer samples  196- 2nd type of cancer samples ◦ 111 mass spectra measurements are recorded for each sample
  31. 31.  (insert table)  The above table shows cross-validated prediction success results of a single CART tree for the prostate data  The run was conducted under PRIORS DATA to facilitate comparisons with subsequent RF run ◦ The relative error corresponds to the absolute error of 30.4%
  32. 32.  Topic discussed by several Machine Learning researchers  Possibilities: ◦ Select splitter, split point, or both at random ◦ Choose splitter at random from the top K splitters  Random Forests: Suppose we have M available predictors ◦ Select R eligible splitters at random and let the best one split the node ◦ If R=1 this is just random splitter selection ◦ If R=M this becomes Breiman’s bagger ◦ If R<< M then we get Breiman’s Random Forests  Breiman suggests R=sqrt(M) as a good rule of thumb
  33. 33.  The performance of a single tree will be somewhat driven by the number of candidate predictors allowed at each node  Consider R=1: the splitter is always chosen at random and performance could be quite weak  As relevant splitters get into the tree and the tree is allowed to grow massively, a single tree can be predictive even if R=1  As R is allowed to increase, the quality of splits can improve as there will be better (and more relevant) splitters
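The node-level rule "select R eligible splitters at random and let the best one split the node" might look roughly like the following. This is a simplified sketch for numeric predictors that scores splits by weighted Gini impurity (the splitting rule named earlier in the deck); the function names and the exhaustive threshold search are assumptions, not the product's actual algorithm.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, r, rng):
    """Try R randomly chosen predictors and return the best binary split
    as (weighted_gini, predictor_index, threshold), or None if no split exists."""
    n_predictors = len(rows[0])
    best = None
    for j in rng.sample(range(n_predictors), r):
        # Candidate thresholds: every observed value except the largest.
        for threshold in sorted({row[j] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best

rows = [[0, 5], [1, 5], [2, 5], [3, 5]]
labels = [0, 0, 1, 1]
print(best_split(rows, labels, r=2, rng=random.Random(0)))
```

With `r=1` the predictor is chosen purely at random, and with `r` equal to the number of predictors every predictor is searched, which corresponds to the bagger case on the slide.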
  34. 34.  (insert graph)  In this experiment, we ran RF with 100 trees on the prostate data using different values for the number of variables Nvars searched at each split
  35. 35.  RF clearly outperforms a single tree for any number of Nvars ◦ We saw above that a properly pruned tree gives cross-validated absolute error of 30.4% (the far right end of the red curve)  The performance of a single tree tends to deviate substantially with the number of predictors allowed to be searched (a single tree is a high variance object)  The RF reaches the nearly stable error rate of about 20% when only 10 variables are searched in each node (marked by the blue color)  Discounting the minor fluctuations, the error rate also remains stable for Nvars above 10 ◦ This generally agrees with Breiman’s suggestion to use the square root of the number of predictors (M=111, so about 10.5) as a rough estimate of the optimal value for Nvars  The performance for small Nvars can be usually further improved by increasing the number of runs
  36. 36.  (insert graph)
  37. 37.  (insert table)  The above results correspond to a standard RF run with 500 trees, Nvars=15, and unit class weights  Note that the overall error rate is 19.4% which is 2/3 of the baseline CART error of 30.4%
  38. 38.  RF does not use a test dataset to report accuracy  For every tree grown, about 37% of the data are left out-of-bag (OOB)  This means that these cases can be safely used in place of the test data to evaluate the performance of the current tree  For any tree in RF, its own OOB sample is used- hence no bias is ever introduced into the estimates  The final OOB estimate for the entire RF can be simply obtained by averaging individual OOB estimates  Consequently, this estimate is unbiased and behaves as if we had an independent test sample of the same size as the learn sample
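One common way to turn the OOB idea into a single error rate is to let each record be voted on only by the trees that did not see it. The sketch below shows that per-record formulation, which is a slightly different bookkeeping from averaging per-tree estimates; the data layout (a list of in-bag index sets paired with prediction functions) and all names are hypothetical.

```python
from collections import Counter

def oob_error(records, labels, trees):
    """Estimate the forest error from out-of-bag votes.

    trees is a list of (in_bag_indices, predict) pairs, where in_bag_indices
    is the set of row indices used to grow the tree and predict maps a
    record to a class. Records that are in-bag for every tree are skipped.
    """
    errors = 0
    scored = 0
    for i, (record, label) in enumerate(zip(records, labels)):
        votes = Counter(
            predict(record) for in_bag, predict in trees if i not in in_bag
        )
        if votes:
            scored += 1
            errors += votes.most_common(1)[0][0] != label
    return errors / scored if scored else float("nan")

records = [0, 1, 2, 3]
labels = [0, 0, 1, 1]
trees = [({0, 1}, lambda x: int(x >= 2)), ({2, 3}, lambda x: 0)]
print(oob_error(records, labels, trees))
```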
  39. 39.  (insert table)
  40. 40.  The prostate dataset is somewhat unbalanced- class 1 contains fewer records than the remaining classes  Under the default RF settings, the minority classes will have higher misclassification rates than the dominant classes  Imbalance in the individual class error rates may also be caused by other data specific issues  Class weights are used in RF to boost the accuracy of the specified classes  General Rule of Thumb: to increase accuracy in the given class, one should increase the corresponding class weight  In many ways this is similar to the PRIORS control used in CART for the same purpose
  41. 41.  Our next run sets the weight for class one to 2  As a result, class 1 is classified with a much better accuracy at the cost of slightly reduced accuracy in the remaining classes
  42. 42.  At the end of an RF run, the proportion of votes for each class is recorded  We can define Margin of a case simply as the proportion of votes for the true class minus the maximum proportion of votes for the other classes  The larger the margin, the higher the confidence of classification
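The margin definition translates directly into code. The sketch assumes the vote proportions for one case are available as a dictionary keyed by class; that layout and the function name are illustrative choices.

```python
def margin(vote_proportions, true_class):
    """Proportion of votes for the true class minus the largest proportion
    of votes received by any other class."""
    others = [p for cls, p in vote_proportions.items() if cls != true_class]
    return vote_proportions.get(true_class, 0.0) - max(others, default=0.0)

votes = {0: 0.7, 1: 0.2, 2: 0.1}
print(margin(votes, true_class=0))   # positive: classified correctly
print(margin(votes, true_class=1))   # negative: misclassified
```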
  43. 43.  (insert table)  This extract shows percent votes for the top 30 records in the dataset along with the corresponding margins  The green lines have high margins and therefore high confidence of predictions  The pink lines have negative margins, which means that these observations are not classified correctly
  44. 44.  The concept of margin allows a new “unbiased” definition of variable importance  To estimate the importance of the mth variable: ◦ Take the OOB cases for the kth tree and assume that we already know the margin M for those cases ◦ Randomly permute all values of the variable m ◦ Apply the kth tree to the OOB cases with the permuted values ◦ Compute the new margin M' ◦ Compute the difference M - M'  The variable importance is defined as the average lowering of the margin across all OOB cases and all trees in the RF  This procedure is fundamentally different from the intrinsic variable importance scores computed by CART- the latter are always based on the LEARN data and are subject to overfitting issues
  45. 45.  The top portion of the variable importance list for the data is shown here  Analysis of the complete list reveals that all 111 variables contribute nearly equally strongly to the model predictions  This is in striking contrast with the single CART tree, which by construction has no choice but to use a limited subset of variables  The above explains why the RF model has a significantly lower error rate (20%) when compared to a single CART tree (30%)
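The permutation step can be illustrated for a single tree. To keep the sketch short it uses the drop in the tree's OOB accuracy as a simplified stand-in for the drop in margin described above, so it is not the exact measure from the slide; the function name and the list-of-lists data layout are assumptions.

```python
import random

def permutation_drop(predict, oob_records, oob_labels, m, rng):
    """Drop in one tree's OOB accuracy after shuffling the values of predictor m.

    predict maps a record (a list of predictor values) to a class.
    """
    def accuracy(records):
        hits = sum(predict(r) == y for r, y in zip(records, oob_labels))
        return hits / len(oob_labels)

    shuffled = [record[m] for record in oob_records]
    rng.shuffle(shuffled)
    permuted = [
        record[:m] + [value] + record[m + 1:]
        for record, value in zip(oob_records, shuffled)
    ]
    return accuracy(oob_records) - accuracy(permuted)

oob_records = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
oob_labels = [0, 1, 0, 1]
tree = lambda record: int(record[0] >= 0.5)   # uses predictor 0 only
rng = random.Random(0)
print(permutation_drop(tree, oob_records, oob_labels, 0, rng))
print(permutation_drop(tree, oob_records, oob_labels, 1, rng))
```

Averaging this quantity over all trees gives a forest-level score; a predictor the tree never uses always yields a drop of zero.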
  46. 46.  RF introduces a novel way to define proximity between two observations ◦ Initialize proximities to zeroes ◦ For any given tree, apply the tree to all cases ◦ If cases i and j both end up in the same terminal node, increase proximity prox(ij) between i and j by one ◦ Accumulate over all trees in RF and normalize by twice the number of trees in RF  The resulting matrix of size NxN provides an intrinsic measure of proximity ◦ The measure is invariant to monotone transformations ◦ The measure is clearly defined for any type of independent variables, including categorical
  47. 47.  (insert graph)  The above extract shows the proximity matrix for the top 10 records of the prostate dataset ◦ Note ones on the main diagonal- any case has “perfect” proximity to itself ◦ Observations that are “alike” will have proximities close to one  These cells have green background ◦ The closer proximity to 0, the more dissimilar cases i and j are  These cells have pink background
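A small sketch of the counting procedure, assuming the terminal-node assignment of every case is already known for every tree. The input layout and function name are hypothetical. Note one deliberate assumption: the sketch divides by the number of trees, which yields ones on the diagonal; an implementation that increments both prox(ij) and prox(ji) for each co-location would divide by twice that number instead.

```python
def proximity_matrix(leaf_ids):
    """Build an N x N proximity matrix from terminal-node assignments.

    leaf_ids[t][i] is the terminal node that tree t assigns to case i.
    """
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    # Normalize so that a case has proximity 1.0 to itself.
    return [[value / n_trees for value in row] for row in prox]

leaf_ids = [
    [0, 0, 1],   # tree 1: cases 0 and 1 share a leaf
    [0, 1, 1],   # tree 2: cases 1 and 2 share a leaf
]
for row in proximity_matrix(leaf_ids):
    print(row)
```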
  48. 48.  Having the full intrinsic proximity matrix opens new horizons ◦ Informative data views using metric scaling ◦ Missing value imputation ◦ Outlier detection  Unfortunately, things get out of control when dataset size exceeds 5,000 observations (25,000,000+ cells are needed)  RF switches to “compressed” form of the proximity matrix to handle large datasets- for any case, only M closest cases are recorded. M is usually less than 100.
  49. 49.  The values 1-prox(ij) can be treated as Euclidean distances in a high dimensional space  The theory of metric scaling solves the problem of finding the most representative projections of the underlying data “cloud” onto low dimensional space using the data proximities ◦ The theory is similar in spirit to the principal components analysis and discriminant analysis  The solution is given in the form of ordered “scaling coordinates”  Looking at the scatter plots of the top scaling coordinates provides informative views of the data
  50. 50.  (insert graph)  This extract shows five initial scaling coordinates for the top 30 records of the prostate data  We will look at the scatter plots among the first, second, and third scaling coordinates  The following color codes will be used for the target classes: ◦ Green- class 0 ◦ Red- class 1 ◦ Blue- class 2
  51. 51.  (insert graphs)  A nearly perfect separation of all three classes is clearly seen  From this we conclude that the outcome variable admits clear prediction using RF model which utilizes 111 original predictors  The residual error is mostly due to the presence of the “focal” point where all the three rays meet
  52. 52.  (insert graph)
  53. 53.  (insert graphs)  Again, three distinct target classes show up as separate clusters  The “focal” point represents a cluster of records that can’t be distinguished from each other
  54. 54.  Outliers are defined as cases having small proximities to all other cases belonging to the same target class  The following algorithm is used: ◦ For a case n, compute the sum of the squares of prox(nk) for all k in the same class as n ◦ Take the inverse- it will be large if the case is “far away” from the rest ◦ Standardize using the median and standard deviation ◦ Look at the cases with the largest values- those are potential outliers  Generally, a value above 10 is reason to suspect the case of being an outlier
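The outlier steps can be sketched as follows, given a proximity matrix and class labels. Several details are assumptions of this sketch: the sum includes the case's own proximity of one (which also avoids division by zero), standardization is done within each class, and the population standard deviation is used.

```python
import statistics

def outlier_scores(prox, labels):
    """Return one standardized outlier score per case.

    prox is an N x N proximity matrix; labels gives each case's class.
    """
    scores = [0.0] * len(labels)
    for cls in set(labels):
        members = [i for i, y in enumerate(labels) if y == cls]
        # Inverse of the summed squared proximities to same-class cases.
        raw = [1.0 / sum(prox[i][k] ** 2 for k in members) for i in members]
        center = statistics.median(raw)
        spread = statistics.pstdev(raw)
        for i, value in zip(members, raw):
            scores[i] = (value - center) / spread if spread > 0 else 0.0
    return scores

prox = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],   # case 3 is far from the rest of its class
]
print(outlier_scores(prox, [0, 0, 0, 0]))
```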
  55. 55.  This extract shows top 30 records of the prostate dataset sorted descending by the outlier measure  Clearly the top 6 cases (class 2 with IDs: 771, 683, 539, and class 0 with IDs 127, 281, 282) are suspicious  All of these seem to be located at the “focal point” on the corresponding scaling coordinate plots
  56. 56.  (insert graph)
  57. 57.  RF offers two ways of missing value imputation  The Cheap Way- conventional median imputation for continuous variables and mode imputation for categorical variables  The Right Way: ◦ Suppose case n has x coordinate missing ◦ Do the Cheap Way imputation for starters ◦ Grow a full size RF ◦ We can now re-estimate the missing value by a weighted average over all cases k with non-missing x using weights prox(nk) ◦ Repeat steps 2 and 3 several times to ensure convergence
  58. 58.  An alternative display to view how the target classes are different with respect to the individual predictors ◦ Recall, at the end of an RF run all cases in the dataset obtain K separate votes for class membership (assuming K target classes) ◦ Take any target class and sort all observations by the count of votes for this class descending ◦ Take the top 50 observations and the bottom 50 observations, those are correspondingly the most likely and the least likely members of the given target class ◦ Parallel coordinate plots report uniformly (0,1) scaled values of all predictors for the top 50 and bottom 50 sorted records, along with the 25th, 50th and 75th percentiles within each predictor
  59. 59.  (insert graph)  This is a detailed display of the normalized values of the initial 20 predictors for the top voted 50 records in each target class (this gives 50x3=150 graphs)  Class 0 generally has normalized values of the initial 20 predictors close to 0 (left side) except perhaps M9X11
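The re-estimation step of the proximity-based approach for a continuous variable reduces to a weighted average. In this sketch missing values are represented by `None`, and the function name and argument layout are hypothetical.

```python
def impute(values, prox_row):
    """Proximity-weighted average of the non-missing values of one variable.

    values holds the variable for every case (None where missing);
    prox_row holds the proximities of the case being imputed to every case.
    """
    pairs = [
        (weight, value)
        for weight, value in zip(prox_row, values)
        if value is not None and weight > 0
    ]
    total = sum(weight for weight, _ in pairs)
    if total == 0:
        return None   # no informative neighbors; keep the initial fill
    return sum(weight * value for weight, value in pairs) / total

# Case 0 is missing x; it is much closer to case 1 than to case 2.
print(impute([None, 10.0, 20.0], [1.0, 0.75, 0.25]))
```

In the full procedure this update would alternate with regrowing the forest and recomputing proximities.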
  60. 60.  (insert graph)  It is easier to see this when looking at the quartile plots only  Note that class 2 tends to have the largest values of the corresponding predictors  The graph can be scrolled forward to view all of the 111 predictors
  61. 61.  (insert graph)  The least likely plots lead to roughly similar conclusions: small predictor values are the least likely for class 2, etc.
  62. 62.  RF admits an interesting possibility to solve unsupervised learning problems, in particular, clustering problems and missing value imputation in the general sense  Recall that in unsupervised learning the concept of target is not defined  RF generates a synthetic target variable in order to proceed with a regular run: ◦ Give class label 1 to the original data ◦ Create a copy of the data such that each variable is sampled independently from the values available in the original dataset ◦ Give class label 2 to the copy of the data ◦ Note that the second copy has marginal distributions identical to the first copy, whereas the possible dependency among predictors is completely destroyed ◦ A necessary drawback is that the resulting dataset is twice as large as the original
  63. 63.  We now have a clear binary supervised learning problem  Running an RF on this dataset may provide the following insights: ◦ When the resulting misclassification error is high (near 50%), the variables are basically independent- no interesting structure exists ◦ Otherwise, the dependency structure can be further studied by looking at the scaling coordinates and exploiting the proximity matrix in other ways ◦ For instance, the resulting proximity matrix can be used as an important starting point for the subsequent hierarchical clustering analysis  Recall that the proximity measures are invariant to monotone transformations and naturally support categorical variables  The same missing value imputation procedure as before can now be employed  These techniques work extremely well for small datasets
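The synthetic-target construction can be sketched as below. Each column of the copy is sampled independently, with replacement, from the values observed in that column, so the marginals match only approximately in a finite sample; the function names are hypothetical.

```python
import random

def synthetic_copy(rows, rng):
    """Build a same-sized copy in which every variable is sampled
    independently from the values observed in the original data."""
    columns = list(zip(*rows))
    return [[rng.choice(column) for column in columns] for _ in rows]

def make_two_class(rows, rng):
    """Stack the original data (label 1) on top of the synthetic copy (label 2)."""
    data = [list(row) for row in rows] + synthetic_copy(rows, rng)
    labels = [1] * len(rows) + [2] * len(rows)
    return data, labels

rows = [[1, "a"], [2, "b"], [3, "c"]]
data, labels = make_two_class(rows, random.Random(0))
for record, label in zip(data, labels):
    print(label, record)
```

In the original rows the two columns move together, while in the synthetic half that pairing is broken, which is what a classifier can pick up on.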
  64. 64.  We generated a synthetic dataset based on the prostate data  The resulting dataset still has 111 predictors but twice the number of records- the first half being the exact replica of the original data  The final error is only 0.2% which is an indication of a very strong dependency among the predictors
  65. 65.  (insert graph)  The resulting plots resemble what we had before  However, this distance is in terms of how dependent the predictors are, whereas previously it was in terms of having the same target class  In view of this, the non cancerous tissue (green) appears to stand apart from the cancerous
  66. 66.  Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.  Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics Department, University of California.  Buntine, W. (1991). Learning classification trees. In D.J. Hand, ed., Artificial Intelligence Frontiers in Statistics, Chapman and Hall: London, 182-201.  Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, Boosting, and Randomization. Machine Learning, 40, 139-158.  Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, 148-156.  Friedman, J.H. (1999). RandomForests. Stanford: Statistics Department, Stanford University.  Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.  Heath, D., Kasif, S., and Salzberg, S. (1993). k-dt: A multi-tree learning method. Proceedings of the Second International Workshop on Multistrategy Learning, 1002-1007, Morgan Kaufmann: Chambery, France.  Kwok, S., and Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt, T., Kanal, L., and Lemmer, J., eds. Uncertainty in Artificial Intelligence 4, North-Holland, 327-335.