An Introduction to RandomForests™
Salford Systems
http://www.salford-systems.com
golomi@salford-systems.com
Dan Steinberg, Mikhail Golovnya, N. Scott Cardell
 New approach for many data analytical tasks developed by
Leo Breiman of University of California, Berkeley
◦ Co-author of CART® with Friedman, Olshen, and Stone
◦ Author of Bagging and Arcing approaches to combining trees
 Good for classification and regression problems
◦ Also for clustering, density estimation
◦ Outlier and anomaly detection
◦ Explicit missing value imputation
 Builds on the notions of committees of experts but is
substantially different in key implementation details
 The term usually refers to pattern discovery in large data bases
 Initially appeared in the late twentieth century and directly
associated with the PC boom
◦ Spread of data collection devices
◦ Dramatically increased data storage capacity
◦ Exponential growth in computational power of CPUs
 The necessity to go way beyond standard statistical techniques
in data analysis
◦ Dealing with extremely large numbers of variables
◦ Dealing with highly non-linear dependency structures
◦ Dealing with missing values and dirty data
 The following major classes of problems are
usually considered:
◦ Supervised Learning (interested in predicting some
outcome variable based on observed predictors)
 Regression (quantitative outcome)
 Classification (nominal or categorical outcome)
◦ Unsupervised Learning (no single target variable
available- interested in partitioning data into clusters,
finding association rules, etc.)
 Relating gene expressions to the presence of a
certain disease based upon microarray data
 Identifying potential fraud cases in credit card
transactions (binary target)
 Predicting level of user satisfaction as poor, average,
good, excellent (4-level target)
 Optical Digit Recognition (10-level target)
 Predicting consumer preferences towards different
kinds of vehicles (could be as many as several
hundred level target)
 Predicting efficacy of a drug based upon demographic factors
 Predicting the amount of sales (target) based on current
observed conditions
 Predicting user energy consumption (target) depending on
the season, business type, location, etc.
 Predicting median house value (target) based on the crime
rate, pollution level, proximity, age, industrialization level,
etc.
 DNA Microarray Data- which samples cluster together? Which
genes cluster together?
 Market Basket Analysis- which products do customers tend to
buy together?
 Clustering For Classification- Handwritten zip code problem:
can we find prototype digits for 1,2, etc. to use for
classification?
 The answer usually has two sides:
◦ Understanding the relationship
◦ Predictive accuracy
 Some algorithms dominate one side (understanding)
◦ Classical methods
◦ Single trees
◦ Nearest neighbor
◦ MARS
 Others dominate the other side (predicting)
◦ Neural nets
◦ TreeNet
◦ Random Forests
 Leo Breiman says:
◦ Framing the question as the choice between accuracy
and interpretability is an incorrect interpretation of what
the goal of a statistical analysis is
 The goal is NOT interpretability, but accurate information
 Nature’s mechanisms are generally complex and cannot be
summarized by a relatively simple stochastic model, even as
a first approximation
 The better the model fits the data, the more sound the
inferences about the phenomenon are
 The only way to attain the best predictive accuracy on
real life data is to build a complex model
 Analyzing this model will also provide the most
accurate insight!
 At the same time, the model complexity makes it far
more difficult to analyze it
◦ A random forest may contain 3,000 trees jointly
contributing to the overall prediction
◦ There could be 5,000 association rules found in a typical
unsupervised learning algorithm
 (Insert table)
 Example of a classification tree for UCSD
heart disease study
 Relatively fast
 Requires minimal supervision by analyst
 Produces easy to understand models
 Conducts automatic variable selection
 Handles missing values via surrogate splits
 Invariant to monotonic transformations of predictors
 Impervious to outliers
 Piece-wise constant models
 “Sharp” decision boundaries
 Exponential data exhaustion
 Difficulties capturing global linear patterns
 Models tend to evolve around the strongest effects
 Not the best predictive accuracy
 A random forest is a collection of single trees grown in a
special way
 The overall prediction is determined by voting (in
classification) or averaging (in regression)
 The Law of Large Numbers ensures convergence
 The key to accuracy is low correlation and bias
 To keep bias low, trees are grown to maximum depth
 Each tree is grown on a bootstrap sample from the learning
set
 A number R is specified (the square root of the number of
predictors by default) such that it is noticeably smaller than the
total number of available predictors
 During tree growing phase, at each node only R predictors are
randomly selected and tried
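A minimal Python sketch of the node-level predictor sampling described above. The function name candidate_predictors and the rounding of the square root are assumptions made for illustration; they are not details taken from the RandomForests implementation.

```python
import math
import random

def candidate_predictors(num_predictors, r=None, rng=random):
    """Return the indices of R predictors drawn at random for one node.

    If r is not given, use the square root of the number of predictors
    (rounded, and at least 1), following the default named above.
    """
    if r is None:
        r = max(1, round(math.sqrt(num_predictors)))
    # Sampling without replacement: each predictor is tried at most once per node.
    return sorted(rng.sample(range(num_predictors), r))

# With 111 predictors, 11 are tried at each node under this rounding rule.
subset = candidate_predictors(111, rng=random.Random(0))
print(len(subset), subset)
```

A fresh subset would be drawn at every node of every tree, which is what keeps the trees different from each other.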
 All major advantages of a single tree are automatically
preserved
 Since each tree is grown on a bootstrap sample, one can
◦ Use out of bag samples to compute an unbiased estimate of
the accuracy
◦ Use out of bag samples to determine variable importances
 There is no overfitting as the number of trees increases
 It is possible to compute generalized proximity between any pair
of cases
 Based on proximities one can
◦ Proceed with a well-defined clustering solution
◦ Detect outliers
◦ Generate informative data views/projections using scaling
coordinates
◦ Do missing value imputation
 Easy expansion into the unsupervised learning domain
 High levels of predictive accuracy delivered automatically
◦ Only a few control parameters to experiment with
◦ Strong for both regression and classification
 Resistant to overtraining (overfitting)- generalizes well to new data
 Trains rapidly even with thousands of potential predictors
◦ No need for prior feature (variable) selection
 Diagnostics pinpoint multivariate outliers
 Offers a revolutionary new approach to clustering using tree-based
between-record distance measures
 Built on CART® inspired trees and thus
◦ Results invariant to monotone transformations of variables
 Method intended to generate a large number of substantially
different models
◦ Randomness introduced in two simultaneous ways
◦ By row: records selected for training at random with replacement (as in
bootstrap resampling of the bagger)
◦ By column: candidate predictors at any node are chosen at random and
best splitter selected from the random subset
 Each tree is grown out to maximal size and left unpruned
◦ Trees are deliberately overfit, becoming a form of nearest neighbor
predictor
◦ Experiments convincingly show that pruning these trees hurts performance
◦ Overfit individual trees combine to yield properly fit ensembles
 Self-testing possible even if all data is used for training
◦ Only 63% of available training data will be used to grow any one
tree
◦ A 37% portion of training data always unused
 The unused portion of the training data is known as Out-Of-Bag (OOB)
data and can be used to provide an ongoing dynamic assessment of
model performance
◦ Allows fitting to small data sets without explicitly holding back
any data for testing
◦ All training data is used cumulatively in training, but only a 63%
portion used at any one time
 Similar to cross-validation but unstructured
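The 63% / 37% split can be checked with a small simulation. This is only an illustrative sketch: the record count of 2,000, the number of repetitions, and the function names are assumptions.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n record indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def oob_fraction(n, rng):
    """Fraction of the n records that never appear in one bootstrap sample."""
    in_bag = set(bootstrap_indices(n, rng))
    return 1 - len(in_bag) / n

rng = random.Random(42)
fractions = [oob_fraction(2000, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)
print(f"average OOB share: {mean_oob:.3f}")
```

The average should land close to 0.37, so each tree sees roughly 63% of the distinct training records.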
 Intensive post processing of data to extract more
insight into data
◦ Most important is introduction of distance metric
between any two data records
◦ The more similar two records are the more often they
will land in same terminal node of a tree
◦ With a large number of different trees simply count the
number of times they co-locate in same leaf nodes
◦ Distance metric can be used to construct dissimilarity
matrix input into hierarchical clustering
 Ultimately in modeling our goal is to produce a single
score, prediction, forecast, or class assignment
 The motivation generating multiple models is the
hope that by somehow combining models results will
be better than if we relied on a single model
 When multiple models are generated they are
normally combined by
◦ Voting in classification problems, perhaps weighted
◦ Averaging in regression problems, perhaps weighted
 Combining trees via averaging or voting will only be
beneficial if the trees are different from each other
 In original bootstrap aggregation paper Breiman noted
bagging worked best for high variance (unstable)
techniques
◦ If results of each model are near identical little to be
gained by averaging
 Resampling of the bagger from the training data
intended to induce differences in trees
◦ Accomplished essentially by varying the weight on any
data record
 Bootstrap sample is fairly similar to taking a 65% sample from
the original training data
 If you grow many trees each based on a different 65% random
sample of your data you expect some variation in the trees
produced
 Bootstrap sample goes a bit further in ensuring that the new
sample is of the same size as the original by allowing some
records to be selected multiple times
 In practice the different samples induce different trees but
trees are not that different
 The bagger was limited by the fact that even with resampling
trees are likely to be somewhat similar to each other,
particularly with strong data structure
 Random Forests induces vastly more between tree differences
by forcing splits to be based on different predictors
◦ Accomplished by introducing randomness into split
selection
 Breiman points out tradeoff:
◦ As R increases strength of individual tree should increase
◦ However, correlation between trees also increases reducing advantage of
combining
 Want to select R to optimally balance the two effects
◦ Can only be determined via experimentation
 Breiman has suggested three values to test:
◦ R= 1/2sqrt(M)
◦ R= sqrt(M)
◦ R= 2sqrt(M)
◦ For M= 100 test values for R: 5,10,20
◦ For M= 400 test values for R: 10, 20, 40
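The three test values can be computed directly. A short sketch, assuming the results are rounded to the nearest integer and kept at 1 or above (the slide does not say how fractional values are handled); the function name is hypothetical.

```python
import math

def candidate_r_values(m):
    """Three values of R to test: half, one, and two times sqrt(M)."""
    root = math.sqrt(m)
    return [max(1, round(0.5 * root)), max(1, round(root)), max(1, round(2 * root))]

print(candidate_r_values(100))  # [5, 10, 20]
print(candidate_r_values(400))  # [10, 20, 40]
```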
 Random Forests machinery unlike CART in that
◦ Only one splitting rule: Gini
◦ Class weight concept but no explicit priors or costs
◦ No surrogates: Missing values imputed for data first automatically
 Default fast imputation just uses means
 Compute intensive method uses tree-based nearest neighbors to base
imputation on (discussed later)
◦ None of the display and reporting machinery or tree refinement
services of CART
 Does follow CART in that all splits are binary
 Trees combined via voting (classification) or averaging
(regression)
 Classification trees “vote”
◦ Recall that classification trees classify
 Assign each case to ONE class only
◦ With 50 trees, 50 class assignments for each case
◦ Winner is the class with the most votes
◦ Votes could be weighted- say by accuracy of individual trees
 Regression trees assign a real predicted value for each case
◦ Predictions are combined via averaging
◦ Results will be much smoother than from a single tree
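Both combination rules fit in a few lines of Python. This is a sketch only: the function names, the optional weights argument, and the tie-breaking rule are assumptions for illustration.

```python
from collections import Counter

def vote(class_assignments, weights=None):
    """Return the class with the largest (optionally weighted) vote total."""
    if weights is None:
        weights = [1.0] * len(class_assignments)
    totals = Counter()
    for label, weight in zip(class_assignments, weights):
        totals[label] += weight
    # Ties go to the first label in sorted order; that choice is arbitrary.
    return max(sorted(totals), key=lambda label: totals[label])

def average(predictions):
    """Combine regression tree predictions by simple averaging."""
    return sum(predictions) / len(predictions)

print(vote(["a", "b", "a", "c", "a"]))        # a
print(vote(["a", "b"], weights=[0.6, 0.9]))   # b
print(average([2.0, 3.0, 4.0]))               # 3.0
```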
 Probability of being omitted in a single draw is (1-1/n)
 Probability of being omitted in all n draws is (1-1/n)^n
 Limit of the sequence as n increases is 1/e = 0.368
◦ About 36.8% of the sample is excluded: 0% of the resample
◦ 36.8% of the sample is included once: 36.8% of the resample
◦ 18.4% of the sample is included twice: 36.8% of the resample
◦ 6.1% of the sample is included three times: 18.4% of the resample
◦ 1.9% of the sample is included four or more times: about 8% of the resample (100% in total)
◦ Example: distribution of weights in a 2,000 record resample:
◦ (insert table)
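The shares quoted on this slide follow from the fact that, for large n, the number of times a record is drawn into a bootstrap sample is approximately Poisson distributed with mean 1. The sketch below recomputes them; the function name is hypothetical.

```python
import math

def share_of_sample(k):
    """Limiting probability that a record is drawn exactly k times (Poisson, mean 1)."""
    return math.exp(-1) / math.factorial(k)

for k in range(4):
    p = share_of_sample(k)
    # A record drawn k times contributes k copies to the resample.
    print(f"{k} times: {p:.1%} of sample, {k * p:.1%} of resample")

# Everything not covered by k = 0..3 belongs to "four or more times".
four_plus = 1 - sum(share_of_sample(k) for k in range(4))
four_plus_resample = 1 - sum(k * share_of_sample(k) for k in range(4))
print(f"4+ times: {four_plus:.1%} of sample, {four_plus_resample:.1%} of resample")
```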
 Want to use mass spectrometer data to classify
different types of prostate cancer
◦ 772 observations available
 398- healthy samples
 178- 1st type of cancer samples
 196- 2nd type of cancer samples
◦ 111 mass spectra measurements are recorded for each
sample
 (insert table)
 The above table shows cross-validated prediction success
results of a single CART tree for the prostate data
 The run was conducted under PRIORS DATA to facilitate
comparisons with subsequent RF run
◦ The relative error corresponds to the absolute error of
30.4%
 Topic discussed by several Machine Learning researchers
 Possibilities:
◦ Select splitter, split point, or both at random
◦ Choose splitter at random from the top K splitters
 Random Forests: Suppose we have M available predictors
◦ Select R eligible splitters at random and let the best one split the node
◦ If R=1 this is just random splitter selection
◦ If R=M this becomes Breiman’s bagger
◦ If R<< M then we get Breiman’s Random Forests
 Breiman suggests R=sqrt(M) as a good rule of thumb
 The performance of a single tree will be somewhat driven by the
number of candidate predictors allowed at each node
 Consider R=1: the splitter is always chosen at random and
performance could be quite weak
 As relevant splitters get into tree and tree is allowed to grow
massively, single tree can be predictive even if R=1
 As R is allowed to increase quality of splits can improve as
there will be better (and more relevant) splitters
 (insert graph)
 In this experiment, we ran RF with 100 trees on the
prostate data using different values for the number
of variables Nvars searched at each split
 RF clearly outperforms single tree for any number of Nvars
◦ We saw above that a properly pruned tree gives cross-validated absolute
error of 30.4% (the very right end of the red curve)
 The performance of a single tree tends to deviate substantially
with the number of predictors allowed to be searched (a single
tree is a high variance object)
 The RF reaches the nearly stable error rate of about 20% when
only 10 variables are searched in each node (marked by the blue
color)
 Discounting the minor fluctuations, the error rate also remains
stable for Nvars above 10
◦ This generally agrees with Breiman’s suggestion to use the square root of N=111 (about 10.5)
as a rough estimate of the optimal value for Nvars
 The performance for small Nvars can be usually further improved
by increasing the number of runs
 (insert graph)
 (insert table)
 The above results correspond to a standard RF run
with 500 trees, Nvars=15, and unit class weights
 Note that the overall error rate is 19.4% which is
about two-thirds of the baseline CART error of 30.4%
 RF does not use a test dataset to report accuracy
 For every tree grown, about 37% of data are left out-of-bag
(OOB)
 This means that these cases can be safely used in place of the
test data to evaluate the performance of the current tree
 For any tree in RF, its own OOB sample is used- hence no bias is
ever introduced into the estimates
 The final OOB estimate for the entire RF can be simply obtained
by averaging individual OOB estimates
 Consequently, this estimate is unbiased and behaves as if we had
an independent test sample of the same size as the learn sample
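The averaging of per-tree OOB estimates can be sketched as follows. The representation of a tree as a (predict, oob_indices) pair is a hypothetical stand-in for a fitted forest, and the toy records are made up for illustration.

```python
def oob_error(trees, records, labels):
    """Average, over trees, of each tree's error rate on its own OOB cases.

    `trees` is a list of (predict, oob_indices) pairs, where `predict`
    maps one record to a class label (an assumed interface).
    """
    per_tree = []
    for predict, oob_indices in trees:
        if not oob_indices:
            continue  # a tree with no OOB cases contributes nothing
        wrong = sum(predict(records[i]) != labels[i] for i in oob_indices)
        per_tree.append(wrong / len(oob_indices))
    return sum(per_tree) / len(per_tree)

records = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]
trees = [
    (lambda x: int(x > 0.5), [0, 2]),   # both OOB cases right: error 0.0
    (lambda x: int(x > 0.7), [2, 3]),   # misses record 2: error 0.5
]
print(oob_error(trees, records, labels))  # 0.25
```

No tree is ever scored on a record it was grown on, which is why no separate test set is needed.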
 (insert table)
 The prostate dataset is somewhat unbalanced- class 1
contains fewer records than the remaining classes
 Under the default RF settings, the minority classes will have
higher misclassification rates than the dominant classes
 Imbalance in the individual class error rates may also be caused
by other data specific issues
 Class weights are used in RF to boost the accuracy of the
specified classes
 General Rule of Thumb: to increase accuracy in the given class,
one should increase the corresponding class weight
 In many ways this is similar to the PRIORS control used in CART
for the same purpose
 Our next run sets the weight for class one to
2
 As a result, class 1 is classified with a much
better accuracy at the cost of slightly reduced
accuracy in the remaining classes
 At the end of an RF run, the proportion of votes for
each class is recorded
 We can define Margin of a case simply as the
proportion of votes for the true class minus the
maximum proportion of votes for the other classes
 The larger the margin, the higher the confidence of
classification
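The margin definition translates directly into code. A small sketch, assuming the vote proportions for one case are held in a dictionary keyed by class label (a hypothetical layout) and that there are at least two classes.

```python
def margin(vote_shares, true_class):
    """Share of votes for the true class minus the largest share for any other class."""
    others = [share for cls, share in vote_shares.items() if cls != true_class]
    return vote_shares.get(true_class, 0.0) - max(others)

# A confidently correct case and a misclassified case.
print(round(margin({0: 0.80, 1: 0.15, 2: 0.05}, true_class=0), 2))  # 0.65
print(round(margin({0: 0.30, 1: 0.60, 2: 0.10}, true_class=0), 2))  # -0.3
```

A negative margin means some other class collected more votes than the true class.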
 (insert table)
 This extract shows percent votes for the top 30
records in the dataset along with the
corresponding margins
 The green lines have high margins and therefore
high confidence of predictions
 The pink lines have negative margins, which means
that these observations are not classified correctly
 The concept of margin allows new “unbiased” definition of variable
importance
 To estimate the importance of the mth variable:
◦ Take the OOB cases for the kth tree; assume that we already know the margin M for
those cases
◦ Randomly permute all values of the variable m
◦ Apply the kth tree to the OOB cases with the permuted values
◦ Compute the new margin M'
◦ Compute the difference M - M'
 The variable importance is defined as the average lowering of the margin
across all OOB cases and all trees in the RF
 This procedure is fundamentally different from the intrinsic variable
importance scores computed by CART- the latter are always based on
the LEARN data and are subject to the overfitting issues
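The permutation procedure on this slide can be sketched in Python. This is a simplified illustration, not the RandomForests code: each tree is a hypothetical (predict, oob_indices) pair, and the margin of a single tree on a case is taken to be +1 when the tree votes for the true class and -1 otherwise.

```python
import random

def tree_margin(predict, record, label):
    """Margin of one tree on one case: +1 for a correct vote, -1 otherwise (assumed)."""
    return 1.0 if predict(record) == label else -1.0

def variable_importance(trees, records, labels, m, rng):
    """Average lowering of the margin after permuting variable m on OOB cases."""
    drops = []
    for predict, oob_indices in trees:
        # Shuffle the values of variable m among this tree's OOB cases.
        permuted_values = [records[i][m] for i in oob_indices]
        rng.shuffle(permuted_values)
        for i, value in zip(oob_indices, permuted_values):
            before = tree_margin(predict, records[i], labels[i])
            permuted_record = list(records[i])
            permuted_record[m] = value
            after = tree_margin(predict, permuted_record, labels[i])
            drops.append(before - after)
    return sum(drops) / len(drops)

records = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
labels = [0, 0, 1, 1]
stump = lambda r: int(r[0] > 0.5)   # toy tree that uses only variable 0
trees = [(stump, [0, 1, 2, 3])]
rng = random.Random(1)
unused = variable_importance(trees, records, labels, m=1, rng=rng)
used = variable_importance(trees, records, labels, m=0, rng=rng)
print(unused, used)
```

Permuting a variable the tree never splits on leaves every prediction unchanged, so its importance is zero; permuting the variable the toy tree relies on can only lower the margin.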
 The top portion of the variable importance list for the
data is shown here
 Analysis of the complete list reveals that all 111
variables are nearly equally strongly contributing to
the model predictions
 This is in a striking contrast with the single CART tree
that has no choice but to use a limited subset of
variables by tree’s construction
 The above explains why the RF model has a
significantly lower error rate (20%) when compared to
a single CART tree (30%)
 RF introduces a novel way to define proximity between two
observations
◦ Initialize proximities to zeroes
◦ For any given tree, apply the tree to all cases
◦ If cases i and j both end up in the same node, increase proximity prox(ij)
between i and j by one
◦ Accumulate over all trees in RF and normalize by twice the number of trees
in RF
 The resulting matrix of size NxN provides intrinsic measure of
proximity
◦ The measure is invariant to monotone transformations
◦ The measure is clearly defined for any type of independent variables,
including categorical
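The proximity computation can be sketched from a table of terminal-node assignments. The input layout leaf_ids[t][i] is hypothetical. This sketch divides by the number of trees so that every case has proximity one to itself; an implementation that counts each pair in both directions would divide by twice that number.

```python
def proximity_matrix(leaf_ids):
    """Build proximities from leaf assignments.

    leaf_ids[t][i] is the terminal node that tree t assigns to case i.
    """
    num_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for tree_leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if tree_leaves[i] == tree_leaves[j]:
                    prox[i][j] += 1.0
    # Assumption: normalize by the number of trees so that prox[i][i] == 1.
    return [[value / num_trees for value in row] for row in prox]

leaf_ids = [
    [0, 0, 1],   # tree 1: cases 0 and 1 share a leaf
    [2, 2, 2],   # tree 2: all cases share a leaf
    [0, 1, 1],   # tree 3: cases 1 and 2 share a leaf
    [0, 1, 2],   # tree 4: no shared leaf
]
prox = proximity_matrix(leaf_ids)
for row in prox:
    print(row)
```

Only leaf membership enters the calculation, which is why the measure does not depend on the scale or type of the predictors.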
 (insert graph)
 The above extract shows the proximity matrix for the
top 10 records of the prostate dataset
◦ Note ones on the main diagonal- any case has
“perfect” proximity to itself
◦ Observations that are “alike” will have proximities
close to one
 These cells have green background
◦ The closer proximity to 0, the more dissimilar cases i
and j are
 These cells have pink background
 Having the full intrinsic proximity matrix opens new horizons
◦ Informative data views using metric scaling
◦ Missing value imputation
◦ Outlier detection
 Unfortunately, things get out of control when dataset size
exceeds 5,000 observations (25,000,000+ cells are needed)
 RF switches to “compressed” form of the proximity matrix to
handle large datasets- for any case, only M closest cases are
recorded. M is usually less than 100.
 The values 1-prox(ij) can be treated as Euclidean distances
in a high dimensional space
 The theory of metric scaling solves the problem of finding
the most representative projections of the underlying data
“cloud” onto low dimensional space using the data
proximities
◦ The theory is similar in spirit to the principal components analysis
and discriminant analysis
 The solution is given in the form of ordered “scaling
coordinates”
 Looking at the scatter plots of the top scaling coordinates
provides informative views of the data
 (insert graph)
 This extract shows five initial scaling coordinates for
the top 30 records of the prostate data
 We will look at the scatter plots among the first,
second, and third scaling coordinates
 The following color codes will be used for the target
classes:
◦ Green- class 0
◦ Red- class 1
◦ Blue- class 2
 (insert graphs)
 A nearly perfect separation of all three classes is clearly seen
 From this we conclude that the outcome variable admits clear
prediction using RF model which utilizes 111 original
predictors
 The residual error is mostly due to the presence of the “focal”
point where all the three rays meet
 (insert graph)
 (insert graphs)
 Again, three distinct target classes show up as
separate clusters
 The “focal” point represents a cluster of records
that can’t be distinguished from each other
 Outliers are defined as cases having small proximities to
all other cases belonging to the same target class
 The following algorithm is used:
◦ For a case n, compute the sum of the squares of prox(nk) for all k
in the same class as n
◦ Take the inverse- it will be large if the case is “far away” from the
rest
◦ Standardize using the median and standard deviation
◦ Look at the cases with the largest values- those are potential
outliers
 Generally, a value above 10 is reason to suspect the case
of being an outlier
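A sketch of the outlier measure under stated assumptions: a case's proximity to itself is left out of the sum, the standardization is done within each class, and the population standard deviation is used. None of these details is confirmed by the slide, and the toy proximity matrix is made up.

```python
import statistics

def outlier_scores(prox, classes):
    """Outlier measure per case, standardized within each target class."""
    n = len(prox)
    raw = []
    for i in range(n):
        same = [k for k in range(n) if k != i and classes[k] == classes[i]]
        total = sum(prox[i][k] ** 2 for k in same)
        # Inverse of the summed squared proximities; the floor avoids division by zero.
        raw.append(1.0 / max(total, 1e-12))
    scores = [0.0] * n
    for cls in set(classes):
        members = [i for i in range(n) if classes[i] == cls]
        values = [raw[i] for i in members]
        center = statistics.median(values)
        spread = statistics.pstdev(values) or 1.0   # fall back to 1 if all values are equal
        for i in members:
            scores[i] = (raw[i] - center) / spread
    return scores

# Three mutually similar cases and one case far from the rest, all in one class.
prox = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.8, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
scores = outlier_scores(prox, classes=[0, 0, 0, 0])
print([round(s, 2) for s in scores])
```

The last case gets by far the largest score because its proximities to the other members of its class are all small.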
 This extract shows top 30 records of the prostate
dataset sorted descending by the outlier measure
 Clearly the top 6 cases (class 2 with IDs: 771, 683,
539, and class 0 with IDs 127, 281, 282) are
suspicious
 All of these seem to be located at the “focal point”
on the corresponding scaling coordinate plots
 (insert graph)
 RF offers two ways of missing value imputation
 The Cheap Way- conventional median imputation for continuous
variables and mode imputation for categorical variables
 The Right Way:
◦ Suppose case n has its x coordinate missing
◦ Do the Cheap Way imputation for starters
◦ Grow a full size RF
◦ Re-estimate the missing value by a weighted average over all cases k
with non-missing x, using weights prox(nk)
◦ Repeat the previous two steps several times to ensure convergence
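The re-estimation step of the Right Way can be sketched for a continuous variable. The function name and the list-based data layout are assumptions for illustration; growing the forest and recomputing proximities between passes is not shown.

```python
def impute_from_proximities(values, prox, n):
    """Re-estimate the missing value of case n as a proximity-weighted average.

    values[k] is the variable's value for case k, or None if missing.
    prox[n][k] is the proximity between cases n and k.
    """
    donors = [k for k, v in enumerate(values) if v is not None and k != n]
    total_weight = sum(prox[n][k] for k in donors)
    if total_weight == 0:
        return None  # no similar case with an observed value
    return sum(prox[n][k] * values[k] for k in donors) / total_weight

values = [10.0, 20.0, None]
prox = [[1.0, 0.2, 0.75], [0.2, 1.0, 0.25], [0.75, 0.25, 1.0]]
print(impute_from_proximities(values, prox, n=2))  # 12.5
```

Case 2 is three times closer to case 0 than to case 1, so the imputed value sits much nearer to 10 than to 20.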
 An alternative display to view how the target classes are
different with respect to the individual predictors
◦ Recall that at the end of an RF run every case in the dataset obtains K
separate vote counts for class membership (assuming K target
classes)
◦ Take any target class and sort all observations by the count of
votes for this class descending
◦ Take the top 50 observations and the bottom 50 observations,
those are correspondingly the most likely and the least likely
members of the given target class
◦ Parallel coordinate plots report uniformly (0,1) scaled values of all
predictors for the top 50 and bottom 50 sorted records, along
with the 25th, 50th and 75th percentiles within each predictor
 (insert graph)
 This is a detailed display of the normalized values
of the initial 20 predictors for the top voted 50
records in each target class (this gives 50x3=150
graphs)
 Class 0 generally has normalized values of the
initial 20 predictors close to 0 (left side of the plot),
except perhaps M9X11
 (insert graph)
 It is easier to see this when looking at the quartile
plots only
 Note that class 2 tends to have the largest values
of the corresponding predictors
 The graph can be scrolled forward to view all of the
111 predictors
 (insert graph)
 The least likely plots lead to roughly similar
conclusions: small predictor values are the least
likely for class 2, etc.
 RF admits an interesting possibility to solve unsupervised learning
problems, in particular, clustering problems and missing value
imputation in the general sense
 Recall that in the unsupervised learning the concept of target is not
defined
 RF generates a synthetic target variable in order to proceed with a
regular run:
◦ Give class label 1 to the original data
◦ Create a copy of the data such that each variable is sampled independently from the
values available in the original dataset
◦ Give class label 2 to the copy of the data
◦ Note that the second copy has marginal distributions identical to the first copy,
whereas the possible dependency among predictors is completely destroyed
◦ A necessary drawback is that the resulting dataset is twice as large as the original
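The construction of the synthetic two-class dataset can be sketched as follows. Sampling each column with replacement is an assumption; shuffling each column separately would preserve the marginal distributions exactly. The function names are hypothetical.

```python
import random

def make_synthetic_copy(rows, rng):
    """Sample each column independently (with replacement) from its observed values."""
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in rows]

def build_two_class_dataset(rows, rng):
    """Label the original rows 1 and the synthetic rows 2."""
    synthetic = make_synthetic_copy(rows, rng)
    return [(row, 1) for row in rows] + [(row, 2) for row in synthetic]

original = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
dataset = build_two_class_dataset(original, random.Random(0))
print(len(dataset))  # 8
for row, label in dataset:
    print(label, row)
```

In the original rows the two columns move together; in the synthetic rows that pairing is broken, which is what the forest then tries to detect.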
 We now have a clear binary supervised learning problem
 Running an RF on this dataset may provide the following
insights:
◦ When the resulting misclassification error is high (above 50%), the
variables are basically independent- no interesting structure exists
◦ Otherwise, the dependency structure can be further studied by looking at
the scaling coordinates and exploiting the proximity matrix in other ways
◦ For instance, the resulting proximity matrix can be used as an important
starting point for the subsequent hierarchical clustering analysis
 Recall that the proximity measures are invariant to monotone
transformations and naturally support categorical variables
 The same missing value imputation procedure as before can now
be employed
 These techniques work extremely well for small datasets
 We generated a synthetic dataset based on the
prostate data
 The resulting dataset still has 111 predictors but
twice the number of records- the first half being
the exact replica of the original data
 The final error is only 0.2% which is an indication of
a very strong dependency among the predictors
 (insert graph)
 The resulting plots resemble what we had before
 However, this distance is in terms of how
dependent the predictors are, whereas previously it
was in terms of having the same target class
 In view of this, the non cancerous tissue (green)
appears to stand apart from the cancerous
 Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
 Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics
Department, University of California.
 Buntine, W. (1991). Learning classification trees. In D.J. Hand, ed., Artificial
Intelligence Frontiers in Statistics, Chapman and Hall: London, 182-201.
 Dietterich, T. (1998). An experimental comparison of three methods for
constructing ensembles of decision trees: Bagging, Boosting, and Randomization.
Machine Learning, 40, 139-158.
 Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm.
In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National
Conference, Morgan Kaufmann, pp. 148-156.
 Friedman, J.H. (1999). RandomForests. Stanford: Statistics Department, Stanford
University.
 Friedman, J.H. (1999). Greedy function approximation: a gradient boosting
machine. Stanford: Statistics Department, Stanford University.
 Heath, D., Kasif, S., and Salzberg, S. (1993) k-dt: A multi-tree learning method.
Proceedings of the Second International Workshop on Multistrategy Learning,
1002-1007, Morgan Kaufman: Chambery, France.
 Kwok, S., and Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt,
T., Kanal, L., and Lemmer, J., eds. Uncertainty in Artificial Intelligence 4, North-
Holland, 327-335.

Introduction to RandomForests 2004

  • 1.
    An Introduction toRandomForests™ Salford Systems http://www.salford-systems.com golomi@salford-systems.com Dan Steinberg, Mikhail Golovnya, N. Scott Cardell
  • 2.
     New approachfor many data analytical tasks developed by Leo Breiman of University of California, Berkeley ◦ Co-author of CART® with Friedman, Olshen, and Stone ◦ Author of Bagging and Arcing approaches to combining trees  Good for classification and regression problems ◦ Also for clustering, density estimation ◦ Outlier and anomaly detection ◦ Explicit missing value imputation  Builds on the notions of committees of experts but is substantially different in key implementation details
  • 3.
     The termusually refers to pattern discovery in large data bases  Initially appeared in the late twentieth century and directly associated with the PC boom ◦ Spread of data collection devices ◦ Dramatically increased data storage capacity ◦ Exponential growth in computational power of CPUs  The necessity to go way beyond standard statistical techniques in data analysis ◦ Dealing with extremely large numbers of variables ◦ Dealing with highly non-linear dependency structures ◦ Dealing with missing values and dirty data
  • 4.
     The followingmajor classes of problems are usually considered: ◦ Supervised Learning (interested in predicting some outcome variable based on observed predictors)  Regression (quantitative outcome)  Classification (nominal or categorical outcome) ◦ Unsupervised Learning (no single target variable available- interested in partitioning data into cluster, finding association rules, etc.)
  • 5.
     Relating geneexpressions to the presence of a certain decease based upon microarray data  Indentifying potential fraud cases in credit card transactions (binary target)  Predicting level of user satisfaction as poor, average, good, excellent (4-level target)  Optical Digit Recognition (10-level target)  Predicting consumer preferences towards different kinds of vehicles (could be as many as several hundred level target)
  • 6.
     Predicting efficacyof a drug based upon demographic factors  Predicting the amount of sales (target) based on current observed conditions  Predicting user energy consumption (target) depending on the season, business type, location, etc.  Predicting medium house value (target) based on the crime rate, pollution level, proximity, age, industrialization level, etc.
  • 7.
     DNA MicroarrayData- which samples cluster together? Which genes cluster together?  Market Basket Analysis- which products do customers tend to buy together?  Clustering For Classification- Handwritten zip code problem: can we find prototype digits for 1,2, etc. to use for classification?
  • 8.
     The answerusually has two sides: ◦ Understanding the relationship ◦ Predictive accuracy  Some algorithms dominate one side (understanding) ◦ Classical methods ◦ Single trees ◦ Nearest neighbor ◦ MARS  Others dominate the other side (predicting) ◦ Neural nets ◦ TreeNet ◦ Random Forests
  • 9.
     Leo Breimansays: ◦ Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is  The goal is NOT interpretability, but accurate information  Nature’s mechanisms are generally complex and cannot be summarized by a relatively simple stochastic model, even as a first approximation  The better the model fits the data, the more sound the inferences about the phenomenon are
  • 10.
     The onlyway to attain the best predictive accuracy o real life data is to build a complex model  Analyzing this model will also provide the most accurate insight!  At the same time, the model complexity makes it far more difficult to analyze it ◦ A random forest may contain 3,000 trees jointly contributing to the overall prediction ◦ There could be 5,000 association rules found in a typical unsupervised learning algorithm
  • 11.
     (Insert table) Example of a classification tree for UCSD heart decease study
  • 12.
     Relatively fast Requires minimal supervision by analyst  Produces easy to understand models  Conducts automatic variable selection  Handles missing values via surrogate splits  Invariant to monotonic transformations of predictors  Impervious to outliers
  • 13.
     Piece-wise constantmodels  “Sharp” decision boundaries  Exponential data exhaustion  Difficulties capturing global linear patterns  Models tend to evolve around the strongest effects  Not the best predictive accuracy
  • 14.
     A randomforest is a collection of single trees grown in a special way  The overall prediction is determined by voting (in classification) or averaging (in regression)  The law of Large Numbers ensures convergence  The key to accuracy is low correlation and bias  To keep bias low, trees are grown to maximum depth
  • 15.
     Each treeis grown on a bootstrap sample from the learning set  A number R us specified (square root by default) such that it is noticeably smaller than the total number of available predictors  During tree growing phase, at each node only R predictors are randomly selected and tried
  • 16.
     All majoradvantages of a single tree are automatically preserved  Since each tree is grown on a bootstrap sample, one can ◦ Use out of bag samples to compute an unbiased estimate of the accuracy ◦ Use out of bag samples to determine variable importances  There is no overfitting as the number of trees increases
  • 17.
     It is possible to compute generalized proximity between any pair of cases  Based on proximities one can ◦ Proceed with a well-defined clustering solution ◦ Detect outliers ◦ Generate informative data views/projections using scaling coordinates ◦ Do missing value imputation  Easy expansion into the unsupervised learning domain
  • 18.
     High levels of predictive accuracy delivered automatically ◦ Only a few control parameters to experiment with ◦ Strong for both regression and classification  Resistant to overtraining (overfitting)- generalizes well to new data  Trains rapidly even with thousands of potential predictors ◦ No need for prior feature (variable) selection  Diagnostics pinpoint multivariate outliers  Offers a revolutionary new approach to clustering using tree-based between-record distance measures  Built on CART® inspired trees and thus ◦ Results invariant to monotone transformations of variables
  • 19.
     Method intended to generate a large number of substantially different models ◦ Randomness introduced in two simultaneous ways ◦ By row: records selected for training at random with replacement (as in bootstrap resampling of the bagger) ◦ By column: candidate predictors at any node are chosen at random and the best splitter is selected from the random subset  Each tree is grown out to maximal size and left unpruned ◦ Trees are deliberately overfit, becoming a form of nearest neighbor predictor ◦ Experiments convincingly show that pruning these trees hurts performance ◦ Overfit individual trees combine to yield properly fit ensembles
  • 20.
     Self-testing possible even if all data is used for training ◦ Only 63% of available training data will be used to grow any one tree ◦ A 37% portion of training data always unused  The unused portion of the training data is known as Out-Of-Bag (OOB) data and can be used to provide an ongoing dynamic assessment of model performance ◦ Allows fitting to small data sets without explicitly holding back any data for testing ◦ All training data is used cumulatively in training, but only a 63% portion used at any one time  Similar to cross-validation but unstructured
  • 21.
     Intensive postprocessing of data to extract more insight into data ◦ Most important is introduction of distance metric between any two data records ◦ The more similar two records are the more often they will land in same terminal node of a tree ◦ With a large number of different trees simply count the number of times they co-locate in same leaf nodes ◦ Distance metric can be used to construct dissimilarity matrix input into hierarchical clustering
  • 22.
     Ultimately in modeling our goal is to produce a single score, prediction, forecast, or class assignment  The motivation for generating multiple models is the hope that by somehow combining models results will be better than if we relied on a single model  When multiple models are generated they are normally combined by ◦ Voting in classification problems, perhaps weighted ◦ Averaging in regression problems, perhaps weighted
  • 23.
     Combining trees via averaging or voting will only be beneficial if the trees are different from each other  In the original bootstrap aggregation paper Breiman noted that bagging worked best for high variance (unstable) techniques ◦ If results of each model are near identical there is little to be gained by averaging  Resampling of the bagger from the training data is intended to induce differences in trees ◦ Accomplished essentially by varying the weight on any data record
  • 24.
     Bootstrap sample is fairly similar to taking a 63% sample from the original training data  If you grow many trees each based on a different 63% random sample of your data you expect some variation in the trees produced  Bootstrap sample goes a bit further in ensuring that the new sample is of the same size as the original by allowing some records to be selected multiple times  In practice the different samples induce different trees but the trees are not that different
  • 25.
     The bagger was limited by the fact that even with resampling trees are likely to be somewhat similar to each other, particularly with strong data structure  Random Forests induces vastly more between tree differences by forcing splits to be based on different predictors ◦ Accomplished by introducing randomness into split selection
  • 26.
     Breiman points out a tradeoff: ◦ As R increases, the strength of each individual tree should increase ◦ However, correlation between trees also increases, reducing the advantage of combining  Want to select R to optimally balance the two effects ◦ Can only be determined via experimentation  Breiman has suggested three values to test: ◦ R = (1/2)·sqrt(M) ◦ R = sqrt(M) ◦ R = 2·sqrt(M) ◦ For M = 100 test values for R: 5, 10, 20 ◦ For M = 400 test values for R: 10, 20, 40
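The three suggested test values are easy to compute for any predictor count M. The helper below is a small sketch; the name `candidate_r_values`, the rounding to the nearest integer, and the clamping to the range 1..M are assumptions added for illustration.

```python
import math

def candidate_r_values(m):
    """Return the three values of R to try: half, one, and two times sqrt(M)."""
    root = math.sqrt(m)
    # Round to whole predictors and keep each value within 1..M.
    return [min(m, max(1, round(factor * root))) for factor in (0.5, 1.0, 2.0)]

print(candidate_r_values(100))  # [5, 10, 20]
print(candidate_r_values(400))  # [10, 20, 40]
```

Each candidate would then be tried in a separate RF run and the value with the lowest out-of-bag error kept.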
  • 27.
     Random Forests machinery is unlike CART in that ◦ Only one splitting rule: Gini ◦ Class weight concept but no explicit priors or costs ◦ No surrogates: missing values are imputed first, automatically  Default fast imputation just uses means  Compute intensive method uses tree-based nearest neighbors to base imputation on (discussed later) ◦ None of the display and reporting machinery or tree refinement services of CART  Does follow CART in that all splits are binary
  • 28.
     Trees combined via voting (classification) or averaging (regression)  Classification trees “vote” ◦ Recall that classification trees classify  Assign each case to ONE class only ◦ With 50 trees, 50 class assignments for each case ◦ Winner is the class with the most votes ◦ Votes could be weighted- say by accuracy of individual trees  Regression trees assign a real predicted value for each case ◦ Predictions are combined via averaging ◦ Results will be much smoother than from a single tree
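The combining step can be illustrated with a short sketch. The functions `vote` and `average` are hypothetical helpers, and the optional weights stand in for the "weighted by accuracy" idea mentioned on the slide; with no weights every tree counts equally.

```python
from collections import Counter

def vote(class_predictions, weights=None):
    """Return the class with the largest (optionally weighted) vote total."""
    if weights is None:
        weights = [1.0] * len(class_predictions)
    tally = Counter()
    for label, weight in zip(class_predictions, weights):
        tally[label] += weight
    # On a tie, the class that was seen first wins.
    return max(tally.items(), key=lambda item: item[1])[0]

def average(value_predictions, weights=None):
    """Return the (optionally weighted) mean of the per-tree predictions."""
    if weights is None:
        weights = [1.0] * len(value_predictions)
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(value_predictions, weights)) / total_weight

print(vote(["a", "b", "a"]))                   # unweighted majority
print(vote(["a", "b", "a"], [0.2, 0.9, 0.3]))  # weighted vote
print(average([1.0, 2.0, 3.0]))
```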
  • 29.
     Probability of being omitted in a single draw is (1 - 1/n)  Probability of being omitted in all n draws is (1 - 1/n)^n  Limit as n increases is 1/e = 0.368
     ◦ Approximately 36.8% of the sample is excluded: 0% of resample
     ◦ 36.8% of the sample is included once: 36.8% of resample
     ◦ 18.4% of the sample is included twice: 36.8% of resample
     ◦ 6.1% of the sample is included three times: 18.4% of resample
     ◦ 1.9% of the sample is included four or more times: 8% of resample
     ◦ Total: 100% of resample
     ◦ Example: distribution of weights in a 2,000 record resample:
     ◦ (insert table)
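The percentages on this slide can be reproduced numerically. The sketch below relies on the standard large-n approximation that the number of times a record is drawn follows a Poisson distribution with mean 1; the function names are hypothetical.

```python
import math

def omitted_probability(n):
    """Exact probability that one record is never drawn in n draws."""
    return (1.0 - 1.0 / n) ** n

def inclusion_shares(max_k=3):
    """Rows of (times included, share of sample, share of resample).

    Uses the Poisson(1) approximation P(k) = exp(-1) / k!; a record drawn
    k times contributes k copies, so its share of the resample is k * P(k).
    """
    rows = []
    for k in range(max_k + 1):
        p = math.exp(-1.0) / math.factorial(k)
        rows.append((str(k), p, k * p))
    # Everything not listed explicitly falls into the final "or more" row.
    rest_sample = 1.0 - sum(p for _, p, _ in rows)
    rest_resample = 1.0 - sum(share for _, _, share in rows)
    rows.append((f"{max_k + 1}+", rest_sample, rest_resample))
    return rows

print(f"omitted for n=2000: {omitted_probability(2000):.4f}")
for times, sample_share, resample_share in inclusion_shares():
    print(f"{times:>2}  {sample_share:6.1%}  {resample_share:6.1%}")
```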
  • 30.
     Want to use mass spectrometer data to classify different types of prostate cancer ◦ 772 observations available  398- healthy samples  178- 1st type of cancer samples  196- 2nd type of cancer samples ◦ 111 mass spectra measurements are recorded for each sample
  • 31.
     (insert table) The above table shows cross-validated prediction success results of a single CART tree for the prostate data  The run was conducted under PRIORS DATA to facilitate comparisons with subsequent RF run ◦ The relative error corresponds to the absolute error of 30.4%
  • 32.
     Topic discussed by several Machine Learning researchers  Possibilities: ◦ Select splitter, split point, or both at random ◦ Choose splitter at random from the top K splitters  Random Forests: Suppose we have M available predictors ◦ Select R eligible splitters at random and let the best one split the node ◦ If R=1 this is just random splitter selection ◦ If R=M this becomes Breiman’s bagger ◦ If R<<M then we get Breiman’s Random Forests  Breiman suggests R=sqrt(M) as a good rule of thumb
  • 33.
     The performance of a single tree will be somewhat driven by the number of candidate predictors allowed at each node  Consider R=1: the splitter is always chosen at random, so performance could be quite weak  As relevant splitters get into the tree and the tree is allowed to grow massively, a single tree can be predictive even if R=1  As R is allowed to increase, the quality of splits can improve as there will be better (and more relevant) splitters
  • 34.
     (insert graph) In this experiment, we ran RF with 100 trees on the prostate data using different values for the number of variables Nvars searched at each split
  • 35.
     RF clearly outperforms a single tree for any value of Nvars ◦ We saw above that a properly pruned tree gives a cross-validated absolute error of 30.4% (the very right end of the red curve)  The performance of a single tree tends to vary substantially with the number of predictors allowed to be searched (a single tree is a high variance object)  The RF reaches a nearly stable error rate of about 20% when only 10 variables are searched in each node (marked by the blue color)  Discounting the minor fluctuations, the error rate also remains stable for Nvars above 10 ◦ This generally agrees with Breiman’s suggestion to use the square root of the number of predictors (111 here, so about 10.5) as a rough estimate of the optimal value for Nvars  The performance for small Nvars can usually be further improved by increasing the number of runs
  • 36.
  • 37.
     (insert table) The above results correspond to a standard RF run with 500 trees, Nvars=15, and unit class weights  Note that the overall error rate is 19.4% which is 2/3 of the baseline CART error of 30.4%
  • 38.
     RF does not use a test dataset to report accuracy  For every tree grown, about 37% of the data are left out-of-bag (OOB)  This means that these cases can be safely used in place of the test data to evaluate the performance of the current tree  For any tree in RF, its own OOB sample is used- hence no bias is ever introduced into the estimates  The final OOB estimate for the entire RF can be simply obtained by averaging individual OOB estimates  Consequently, this estimate is unbiased and behaves as if we had an independent test sample of the same size as the learn sample
  • 39.
  • 40.
     The prostate dataset is somewhat unbalanced- class 1 contains fewer records than the remaining classes  Under the default RF settings, the minority classes will have higher misclassification rates than the dominant classes  Imbalance in the individual class error rates may also be caused by other data specific issues  Class weights are used in RF to boost the accuracy of the specified classes  General Rule of Thumb: to increase accuracy in a given class, one should increase the corresponding class weight  In many ways this is similar to the PRIORS control used in CART for the same purpose
  • 41.
     Our next run sets the weight for class one to 2  As a result, class 1 is classified with a much better accuracy at the cost of slightly reduced accuracy in the remaining classes
  • 42.
     At the end of an RF run, the proportion of votes for each class is recorded  We can define the margin of a case simply as the proportion of votes for the true class minus the maximum proportion of votes for the other classes  The larger the margin, the higher the confidence of classification
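The margin definition translates directly into code. In this sketch the vote proportions are passed as a dictionary keyed by class label, which is an assumed representation, and `margin` is a hypothetical helper name.

```python
def margin(vote_fractions, true_class):
    """Vote share of the true class minus the largest share among other classes."""
    best_other = max(
        share for label, share in vote_fractions.items() if label != true_class
    )
    return vote_fractions.get(true_class, 0.0) - best_other

# A confidently correct case and a misclassified case (made-up vote shares).
print(margin({0: 0.7, 1: 0.2, 2: 0.1}, true_class=0))  # positive margin
print(margin({0: 0.2, 1: 0.7, 2: 0.1}, true_class=0))  # negative margin
```

A negative result means some other class collected more votes than the true class, so the case is misclassified.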
  • 43.
     (insert table) This extract shows percent votes for the top 30 records in the dataset along with the corresponding margins  The green lines have high margins and therefore high confidence of predictions  The pink lines have negative margins, which means that these observations are not classified correctly
  • 44.
     The concept of margin allows a new “unbiased” definition of variable importance  To estimate the importance of the m-th variable: ◦ Take the OOB cases for the k-th tree, and assume that we already know the margin M for those cases ◦ Randomly permute all values of variable m ◦ Apply the k-th tree to the OOB cases with the permuted values ◦ Compute the new margin M' ◦ Compute the difference M - M'  The variable importance is defined as the average lowering of the margin across all OOB cases and all trees in the RF  This procedure is fundamentally different from the intrinsic variable importance scores computed by CART- the latter are always based on the LEARN data and are subject to overfitting issues
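The permutation procedure can be sketched as follows. This is one possible reading of the slide, not the RandomForests implementation: each tree is assumed to be a plain function from a data row to a class label, and the margin of a case under a single tree is taken to be +1 when the tree votes for the true class and -1 otherwise. All names are hypothetical.

```python
import random

def tree_margin(predicted, actual):
    """Margin of one case under one tree: +1 for a correct vote, -1 otherwise."""
    return 1.0 if predicted == actual else -1.0

def margin_importance(trees, oob_rows, X, y, var, rng):
    """Average drop in margin over all OOB cases and trees after permuting `var`."""
    drops = []
    for tree, rows in zip(trees, oob_rows):
        # Shuffle the values of the chosen variable among this tree's OOB cases.
        permuted = [X[i][var] for i in rows]
        rng.shuffle(permuted)
        for i, new_value in zip(rows, permuted):
            before = tree_margin(tree(X[i]), y[i])
            altered = list(X[i])
            altered[var] = new_value
            after = tree_margin(tree(altered), y[i])
            drops.append(before - after)
    return sum(drops) / len(drops)

def stump(row):
    """Toy tree that looks only at predictor 0."""
    return int(row[0] > 0.5)

X = [[0.1, 5], [0.9, 3], [0.2, 8], [0.8, 1], [0.3, 7], [0.7, 2]]
y = [0, 1, 0, 1, 0, 1]
trees = [stump, stump]
oob_rows = [[0, 1, 2, 3], [2, 3, 4, 5]]
print(margin_importance(trees, oob_rows, X, y, 0, random.Random(1)))
print(margin_importance(trees, oob_rows, X, y, 1, random.Random(1)))
```

Predictor 1 is never used by the toy tree, so permuting it cannot change any vote and its importance is exactly zero.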
  • 45.
     The top portion of the variable importance list for the data is shown here  Analysis of the complete list reveals that all 111 variables are nearly equally strongly contributing to the model predictions  This is in a striking contrast with the single CART tree that has no choice but to use a limited subset of variables by tree’s construction  The above explains why the RF model has a significantly lower error rate (20%) when compared to a single CART tree (30%)
  • 46.
     RF introduces a novel way to define proximity between two observations ◦ Initialize proximities to zeroes ◦ For any given tree, apply the tree to all cases ◦ If cases i and j both end up in the same node, increase the proximity prox(ij) between i and j by one ◦ Accumulate over all trees in RF and normalize by twice the number of trees in RF  The resulting matrix of size NxN provides an intrinsic measure of proximity ◦ The measure is invariant to monotone transformations ◦ The measure is clearly defined for any type of independent variables, including categorical
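The accumulation step can be sketched with the forest reduced to a table of terminal-node assignments: `leaf_ids[t][i]` is the leaf that case i reaches in tree t. That representation and the function name are assumptions. The slide normalizes by twice the number of trees; this sketch divides by the number of trees instead, an assumption made so that each case has proximity 1 to itself. A different constant only rescales the matrix.

```python
def proximity_matrix(leaf_ids):
    """Fraction of trees in which each pair of cases shares a terminal node."""
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    # Normalize by the number of trees (assumed; see the note above).
    return [[count / n_trees for count in row] for row in prox]

# Two toy trees, three cases.
leaf_ids = [
    [0, 0, 1],  # tree 1: cases 0 and 1 share a leaf
    [0, 1, 1],  # tree 2: cases 1 and 2 share a leaf
]
for row in proximity_matrix(leaf_ids):
    print(row)
```

The double loop costs N x N work per tree, which is why the full matrix becomes impractical for large datasets.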
  • 47.
     (insert graph) The above extract shows the proximity matrix for the top 10 records of the prostate dataset ◦ Note ones on the main diagonal- any case has “perfect” proximity to itself ◦ Observations that are “alike” will have proximities close to one  these cells have green background ◦ The closer proximity to 0, the more dissimilar cases i and j are  These cells have pink B
  • 48.
     Having the full intrinsic proximity matrix opens new horizons ◦ Informative data views using metric scaling ◦ Missing value imputation ◦ Outlier detection  Unfortunately, things get out of control when dataset size exceeds 5,000 observations (25,000,000+ cells are needed)  RF switches to “compressed” form of the proximity matrix to handle large datasets- for any case, only M closest cases are recorded. M is usually less than 100.
  • 49.
     The values 1-prox(ij) can be treated as Euclidean distances in a high dimensional space  The theory of metric scaling solves the problem of finding the most representative projections of the underlying data “cloud” onto low dimensional space using the data proximities ◦ The theory is similar in spirit to the principal components analysis and discriminant analysis  The solution is given in the form of ordered “scaling coordinates”  Looking at the scatter plots of the top scaling coordinates provides informative views of the data
  • 50.
     (insert graph) This extract shows five initial scaling coordinates for the top 30 records of the prostate data  We will look at the scatter plots among the first, second, and third scaling coordinates  The following color codes will be used for the target classes: ◦ Green- class 0 ◦ Red- class 1 ◦ Blue- class 2
  • 51.
     (insert graphs) A nearly perfect separation of all three classes is clearly seen  From this we conclude that the outcome variable admits clear prediction using RF model which utilizes 111 original predictors  The residual error is mostly due to the presence of the “focal” point where all the three rays meet
  • 52.
  • 53.
     (insert graphs) Again, three distinct target classes show up as separate clusters  The “focal” point represents a cluster of records that can’t be distinguished from each other
  • 54.
     Outliers are defined as cases having small proximities to all other cases belonging to the same target class  The following algorithm is used: ◦ For a case n, compute the sum of the squares of prox(nk) for all k in the same class as n ◦ Take the inverse- it will be large if the case is “far away” from the rest ◦ Standardize using the median and standard deviation ◦ Look at the cases with the largest values- those are potential outliers  Generally, a value above 10 is reason to suspect the case of being an outlier
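The outlier algorithm can be sketched as below. Several details are assumptions, since the slide does not spell them out: the sum includes the case itself (which also prevents division by zero), the standardization is done within each target class, and the population standard deviation is used. The function name is hypothetical.

```python
import statistics

def outlier_scores(prox, labels):
    """Standardized inverse of the summed squared within-class proximities."""
    n = len(labels)
    raw = []
    for i in range(n):
        # Sum over all cases in the same class as i (case i itself included).
        total = sum(prox[i][k] ** 2 for k in range(n) if labels[k] == labels[i])
        raw.append(1.0 / total)
    scores = [0.0] * n
    for cls in set(labels):
        members = [i for i in range(n) if labels[i] == cls]
        values = [raw[i] for i in members]
        center = statistics.median(values)
        spread = statistics.pstdev(values)
        for i in members:
            scores[i] = (raw[i] - center) / spread if spread > 0 else 0.0
    return scores

# Four cases in one class: the first three are close, the last one is isolated.
prox = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(outlier_scores(prox, [0, 0, 0, 0]))
```

With only four toy cases the scores stay small; the threshold of 10 quoted on the slide applies to the measure as computed by RF on real data, not to this sketch.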
  • 55.
     This extract shows top 30 records of the prostate dataset sorted descending by the outlier measure  Clearly the top 6 cases (class 2 with IDs: 771, 683, 539, and class 0 with IDs 127, 281, 282) are suspicious  All of these seem to be located at the “focal point” on the corresponding scaling coordinate plots
  • 56.
  • 57.
     RF offers two ways of missing value imputation  The Cheap Way- conventional median imputation for continuous variables and mode imputation for categorical variables  The Right Way: ◦ Suppose case n has its x coordinate missing ◦ Do the Cheap Way imputation for starters ◦ Grow a full size RF ◦ We can now re-estimate the missing value by a weighted average over all cases k with non-missing x using weights prox(nk) ◦ Repeat the growing and re-estimation steps several times to ensure convergence
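Both imputation styles can be sketched for a single continuous variable. The helpers below are hypothetical and cover only the arithmetic: missing values are marked with `None`, and the proximities are assumed to come from an RF grown on the median-filled data.

```python
import statistics

def median_fill(values):
    """The Cheap Way: replace each missing value with the column median."""
    center = statistics.median(v for v in values if v is not None)
    return [center if v is None else v for v in values]

def proximity_weighted_estimate(values, prox_row):
    """The Right Way, one step: average the known values weighted by prox(nk).

    `values` holds the original column (None marks missing entries) and
    `prox_row` holds the proximities of the case being imputed to every case.
    """
    numerator = 0.0
    denominator = 0.0
    for value, weight in zip(values, prox_row):
        if value is not None:
            numerator += weight * value
            denominator += weight
    return numerator / denominator if denominator > 0 else None

column = [1.0, 3.0, None]
print(median_fill(column))
print(proximity_weighted_estimate(column, [0.75, 0.25, 1.0]))
```

In the full procedure the estimate would replace the median fill, a new RF would be grown, and the two steps would be repeated until the imputed values settle.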
  • 58.
     An alternative display to view how the target classes are different with respect to the individual predictors ◦ Recall that at the end of an RF run all cases in the dataset obtain K separate vote counts for class membership (assuming K target classes) ◦ Take any target class and sort all observations by the count of votes for this class descending ◦ Take the top 50 observations and the bottom 50 observations; those are correspondingly the most likely and the least likely members of the given target class ◦ Parallel coordinate plots report uniformly (0,1) scaled values of all predictors for the top 50 and bottom 50 sorted records, along with the 25th, 50th and 75th percentiles within each predictor
  • 59.
     (insert graph) This is a detailed display of the normalized values of the initial 20 predictors for the top voted 50 records in each target class (this gives 50x3=150 graphs)  Class 0 generally has normalized values of the initial 20 predictors close to 0 (left side 0tt, lw, y, o, ragg, wp) except perhaps M9X11
  • 60.
     (insert graph) It is easier to see this when looking at the quartile plots only  Note that class 2 tends to have the largest values of the corresponding predictors  The graph can be scrolled forward to view all of the 111 predictors
  • 61.
     (insert graph) The least likely plots roughly result to the similar conclusions: small predictor values are the least likely for class 2, etc.
  • 62.
     RF admits an interesting possibility to solve unsupervised learning problems, in particular, clustering problems and missing value imputation in the general sense  Recall that in unsupervised learning the concept of target is not defined  RF generates a synthetic target variable in order to proceed with a regular run: ◦ Give class label 1 to the original data ◦ Create a copy of the data such that each variable is sampled independently from the values available in the original dataset ◦ Give class label 2 to the copy of the data ◦ Note that the second copy has marginal distributions identical to the first copy, whereas the possible dependency among predictors is completely destroyed ◦ A necessary drawback is that the resulting dataset is twice as large as the original
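The construction of the synthetic two-class dataset can be sketched as follows. The function name is hypothetical, and drawing each value with replacement from its column is an assumption about how "sampled independently" is carried out; independently shuffling each column would be an alternative that keeps the marginals exactly identical.

```python
import random

def make_synthetic(data, rng):
    """Stack the original rows (label 1) on an independently resampled copy (label 2)."""
    n_rows = len(data)
    n_cols = len(data[0])
    columns = [[row[j] for row in data] for j in range(n_cols)]
    # Each cell of the copy is drawn from its own column, ignoring the other
    # columns, which destroys any dependency among the predictors.
    copy = [[rng.choice(columns[j]) for j in range(n_cols)] for _ in range(n_rows)]
    rows = [list(row) for row in data] + copy
    labels = [1] * n_rows + [2] * n_rows
    return rows, labels

# Toy data with two perfectly dependent predictors.
original = [[1, 10], [2, 20], [3, 30], [4, 40]]
rows, labels = make_synthetic(original, random.Random(0))
for row, label in zip(rows, labels):
    print(label, row)
```

An RF trained to separate label 1 from label 2 can only succeed by exploiting the dependency structure that the copy no longer has.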
  • 63.
     We now have a clear binary supervised learning problem  Running an RF on this dataset may provide the following insights: ◦ When the resulting misclassification error is high (above 50%), the variables are basically independent- no interesting structure exists ◦ Otherwise, the dependency structure can be further studied by looking at the scaling coordinates and exploiting the proximity matrix in other ways ◦ For instance, the resulting proximity matrix can be used as an important starting point for the subsequent hierarchical clustering analysis  Recall that the proximity measures are invariant to monotone transformations and naturally support categorical variables  The same missing value imputation procedure as before can now be employed  These techniques work extremely well for small datasets
  • 64.
     We generated a synthetic dataset based on the prostate data  The resulting dataset still has 111 predictors but twice the number of records- the first half being the exact replica of the original data  The final error is only 0.2% which is an indication of a very strong dependency among the predictors
  • 65.
     (insert graph) The resulting plots resemble what we had before  However, this distance is in terms of how dependent the predictors are, whereas previously it was in terms of having the same target class  In view of this, the non cancerous tissue (green) appears to stand apart from the cancerous
  • 66.
     + Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
     + Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics Department, University of California.
     + Buntine, W. (1991). Learning classification trees. In D.J. Hand, ed., Artificial Intelligence Frontiers in Statistics, Chapman and Hall: London, 182-201.
     + Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, Boosting, and Randomization. Machine Learning, 40, 139-158.
     + Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
     + Friedman, J.H. (1999). RandomForests. Stanford: Statistics Department, Stanford University.
     + Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
     + Heath, D., Kasif, S., and Salzberg, S. (1993). k-dt: A multi-tree learning method. Proceedings of the Second International Workshop on Multistrategy Learning, 1002-1007, Morgan Kaufmann: Chambery, France.
     + Kwok, S., and Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt, T., Kanal, L., and Lemmer, J., eds. Uncertainty in Artificial Intelligence 4, North-Holland, 327-335.