Why You Should Watch: Learn the fundamentals of tree-based machine learning algorithms and how to easily fine-tune and improve your Random Forest regression models.
Abstract: In this webinar we'll introduce you to two tree-based machine learning algorithms, CART® decision trees and RandomForests®. We will discuss the advantages of tree-based techniques, including their ability to automatically handle variable selection, variable interactions, nonlinear relationships, outliers, and missing values. We'll explore the CART algorithm, bootstrap sampling, and the Random Forest algorithm (all with animations) and compare their predictive performance using a real-world dataset.
2. Outline
Applications of CART and Random Forests
Ordinary Least Squares Regression
– A review
– Common issues in standard linear regression
Data Description
Improving your regression with an applied example
โ CART decision tree
โ Random Forest
Conclusions
3. Applications
In this webinar we use CART® software and RandomForests® software to predict
concrete strength, but as we will see these techniques can be applied to any field
Quantitative Targets: Number of Cavities, Blood Pressure, Income etc.
Qualitative Targets: Disease or No Disease; Buy or Not Buy; Lend or Do Not Lend;
Buy Product A vs. Product B vs. Product C vs. Product D
Examples
Credit Risk
Glaucoma Screening
Insurance Fraud
Customer Loyalty
Drug Discovery
Early Identification of Reading Disabilities
Biodiversity and Wildlife Conservation
4. Preview: CART and Random Forest Advantages
As we will see in this presentation both CART and Random
Forests have desirable properties that allow you to build
accurate predictive models with dirty data (i.e. missing values,
lots of variables, nonlinear relationships, outliers etc.)
[Figures: Geometry of a CART tree with 1 split; geometry of a CART tree with 2 splits]
5. Preview: Model Performance
Method                                          Test MSE
Linear Regression                               109.04
Linear Regression with interactions             67.35
Min 1 SE CART (default settings)                65.05
CART (default settings)                         55.99
RandomForests® (default settings)               37.570
Improved RandomForests® using an SPM Automate   36.02

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
6. What is OLS?
OLS – ordinary least squares regression
– Discovered by Legendre (1805) and Gauss (1809) to solve problems in astronomy using pen and paper
The model is of the form
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$
– $\beta_0$: the intercept term
– $\beta_1, \beta_2, \beta_3, \dots$: coefficient estimates
– $X_1, X_2, X_3, \dots, X_p$: predictor variables (i.e. columns in the dataset)
Example: Income = 20,000 + 2,500*WorkExperience + 1,000*EducationYears
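To make the form above concrete, here is a minimal sketch of fitting an OLS model in Python with scikit-learn (scikit-learn is an assumption here, not part of SPM; the data is a made-up stand-in for the income example):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    work_exp = rng.uniform(0, 20, 100)       # X1: years of work experience
    edu_years = rng.uniform(8, 20, 100)      # X2: years of education
    income = 20_000 + 2_500*work_exp + 1_000*edu_years + rng.normal(0, 5_000, 100)

    X = np.column_stack([work_exp, edu_years])
    ols = LinearRegression().fit(X, income)  # estimates B0 (intercept), B1, B2
    print(ols.intercept_, ols.coef_)         # should land near 20000, [2500, 1000]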
7. Common Issues in Regression
Missing values
– Require imputation OR
– Result in record deletion
Nonlinearities and Local Effects
– Example: $Y = 10 + 3x_1 + x_2 - 0.3x_1^2$
– Modeled via manual transformations, or the transformed terms are automatically added and then selected via forward, backward, stepwise, or regularized selection
– Local effects are ignored unless specified by the analyst, which is very difficult (or impossible) in practice without subject matter expertise or prior knowledge
Interactions
– Example: $Y = 10 + 3x_1 - 2x_2 + 0.25x_1x_2$
– Manually added to the model (or through some automated procedure)
– Add interactions, then use variable selection (e.g. regularized regression or forward, backward, or stepwise selection); see the sketch after this list
Variable selection
– Usually accomplished manually or in combination with automated selection procedures
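A sketch of the manual feature engineering described above, assuming Python with scikit-learn and synthetic data (PolynomialFeatures stands in for whatever expansion procedure the analyst chooses):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))                          # x1, x2
    y = 10 + 3*X[:, 0] - 2*X[:, 1] + 0.25*X[:, 0]*X[:, 1]  # the interaction example above

    # degree=2 adds x1^2, x1*x2, x2^2; the analyst must pick the degree and then
    # run a selection procedure (stepwise, regularization, ...) over the new columns.
    X_exp = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    print(X_exp.shape)                                     # (200, 5)
    print(LinearRegression().fit(X_exp, y).coef_.round(2)) # recovers 3, -2, 0, 0.25, 0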
8. Solutions to OLS Problems
Two methods that do not suffer from the
drawbacks of linear regression are CART and
Random Forests
These methods automatically
– Handle missing values
– Model nonlinear relationships and local effects
– Select variables
– Model variable interactions
9. Concrete Strength
Target:
– STRENGTH: compressive strength of concrete in megapascals
Predictors:
– CEMENT
– BLAST_FURNACE_SLAG
– FLY_ASH
– WATER
– SUPERPLASTICIZER
– COARSE_AGGREGATE
– FINE_AGGREGATE
– AGE
I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)
10. Why predict concrete strength?
Concrete is one of the most important materials in our society
and is a key ingredient in important infrastructure projects like
bridges, roads, buildings, and dams (MATSE)
Predicting the strength of concrete is important because concrete strength is a
key component of the overall stability of these structures
Source: http://matse1.matse.illinois.edu/concrete/prin.html
12. Regression Results
Method                                  Test MSE
Linear Regression                       109.04
Linear Regression with interactions     67.35

Strength = -9.70 + .115*Cement + .01*BlastFurnaceSlag + .014*FlyAsh - .172*Water +
.10*Superplasticizer + .01*CoarseAggregate + .01*FineAggregate + .11*Age

$\text{Test MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
**Test sample: 20% of observations were randomly selected for the testing dataset
**This same test dataset was used to evaluate all models for the purpose of comparisons
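A hedged sketch of this evaluation protocol in Python with scikit-learn; the data here is a synthetic stand-in (1030 rows, like the concrete data) since the real file is not reproduced in these slides:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(1030, 8))            # 8 predictors: CEMENT ... AGE
    y = 20 + 30*X[:, 0] - 10*X[:, 3] + 5*X[:, 7]**2 + rng.normal(0, 5, 1030)

    # Hold out a random 20% once and score every model on this same test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)
    lr = LinearRegression().fit(X_tr, y_tr)
    print("Test MSE:", mean_squared_error(y_te, lr.predict(X_te)))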
13. Classification And Regression Trees
Authors: Breiman, Friedman, Olshen, and Stone (1984)
CART is a decision tree algorithm used for both regression and classification problems
1. Classification: tries to separate classes by choosing variables and points that best
separate them
2. Regression: chooses the best variables and split points for reducing the squared or
absolute error criterion
CART is available exclusively in the SPM® 8 Software Suite and was developed in close
consultation with the original authors
14. CART Introduction
Main Idea: divide the predictor space (often people say "partition" instead of "divide")
into different regions so that the dependent variable can be predicted more accurately.
The following figure shows the predicted values from a CART tree (i.e. the red horizontal
bars) fit to the curve $Y = x^2 + \text{noise}$, where the noise is drawn from a N(0,1) distribution.
[Figure: scatter of Y against x with the CART step-function fit overlaid]
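A small sketch reproducing this kind of fit with an open-source CART-style tree (scikit-learn's DecisionTreeRegressor, an analogue rather than the CART® software itself):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(-2, 2, 300))
    y = x**2 + rng.normal(0, 1, 300)             # Y = x^2 + N(0,1) noise

    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(x.reshape(-1, 1), y)
    y_hat = tree.predict(x.reshape(-1, 1))       # piecewise-constant approximation
    print(sorted(set(np.round(y_hat, 3))))       # the handful of "red bar" levels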
15. CART: Terminology
A tree split occurs when
a variable is partitioned (in-depth example starts
after the next slide). This tree has two splits:
1. AGE_DAY <= 21
2. CEMENT_AMT <= 355.95
The node at the top of the
tree is called the root node
A node that has no sub-branch is
a terminal node
This tree has three terminal
nodes (i.e. red boxes in the tree)
[Tree diagram]
Node 1 (root): split on AGE_DAY <= 21.00 (STD = 16.661, Avg = 36.036, N = 825)
  AGE_DAY <= 21.00 -> Terminal Node 1: STD = 12.441, Avg = 23.944, N = 260
  AGE_DAY > 21.00 -> Node 2: split on CEMENT_AMT <= 355.95 (STD = 15.358, Avg = 41.600, N = 565)
    CEMENT_AMT <= 355.95 -> Terminal Node 2: STD = 12.683, Avg = 37.036, N = 436
    CEMENT_AMT > 355.95 -> Terminal Node 3: STD = 13.452, Avg = 57.026, N = 129
The predicted value in a CART regression model is the average of the target
variable (i.e. "Y") for the records that fall into a given terminal node
Example: If Age = 26 days and the amount of cement is 400,
then the predicted strength is 57.026 megapascals (Terminal Node 3)
16. CART: Algorithm
Step 1: Grow a large tree
This is done for you automatically
All variables are considered
at each split in the tree
Each split is made using one
variable and a specific value or set of values.
Splits are chosen so as to minimize model
error
The tree is grown until either a user-specified
criterion is met or until the tree cannot be
grown further
Step 2: Prune the large tree
This is also done for you automatically
Use either a test sample or cross validation to prune subtrees
17. CART: Splitting Procedure
Consider the following CART tree grown on this dataset
How exactly do we get this tree?
Y        X1   X2
79.9861  162  28
61.8874  162  28
40.2695  228  270
41.0528  228  365
44.2961  192  360
47.0298  228  90
43.6983  228  365
36.4478  228  28
45.8543  228  28
39.2898  228  28
38.0742  192  90
28.0217  192  28
43.0130  228  270
42.3269  228  90
47.8138  228  28
52.9083  228  90
[Tree diagram]
Node 1 (root): split on X1 <= 177.00 (STD = 11.371, Avg = 45.748, N = 16)
  X1 <= 177.00 -> Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2
  X1 > 177.00 -> Node 2: split on X1 <= 210.00 (STD = 5.700, Avg = 42.150, N = 14)
    X1 <= 210.00 -> Terminal Node 2: STD = 6.705, Avg = 36.797, N = 3
    X1 > 210.00 -> Terminal Node 3: STD = 4.375, Avg = 43.609, N = 11
20. CART: Splitting Procedure
At this point CART has evaluated all possible split points for our two variables,
X1 and X2, and determined the optimal split point for each.
Splitting on either X1 or X2 will yield a different tree, so which is the best split?
The one with the largest split improvement.
Best split for X1: X1 <= 177 (improvement value: 90.64)
Best split for X2: X2 <= 59 (improvement value: 5.77)
[Tree diagrams]
Split on X1 - Node 1 (root): X1 <= 177.00 (STD = 11.371, Avg = 45.748, N = 16)
  X1 <= 177.00 -> Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2
  X1 > 177.00 -> Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14
Split on X2 - Node 1 (root): X2 <= 59.00 (STD = 11.371, Avg = 45.748, N = 16)
  X2 <= 59.00 -> Terminal Node 1: STD = 16.158, Avg = 48.472, N = 7
  X2 > 59.00 -> Terminal Node 2: STD = 4.069, Avg = 43.630, N = 9
Split improvement (least squares):
$\Delta R(s, t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_i \in t}\left(y_i - \bar{y}(t)\right)^2$ and $N$ is the total sample size
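The improvement values quoted on this slide can be checked directly. A sketch in Python (numpy assumed) that scans the candidate split points for X1 on the 16-row dataset above using this formula:

    import numpy as np

    y  = np.array([79.9861, 61.8874, 40.2695, 41.0528, 44.2961, 47.0298, 43.6983,
                   36.4478, 45.8543, 39.2898, 38.0742, 28.0217, 43.0130, 42.3269,
                   47.8138, 52.9083])
    x1 = np.array([162, 162, 228, 228, 192, 228, 228, 228, 228, 228,
                   192, 192, 228, 228, 228, 228])

    N = len(y)
    R = lambda mask: ((y[mask] - y[mask].mean())**2).sum() / N   # node risk R(t)
    root = np.ones(N, dtype=bool)

    # try every candidate threshold for X1 and keep the largest improvement
    best = max(((s, R(root) - R(x1 <= s) - R(x1 > s)) for s in np.unique(x1)[:-1]),
               key=lambda p: p[1])
    print(best)   # (162, 90.63...); CART reports the midpoint (162+192)/2 = 177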
21. CART: 1st Split
Our best first split in the tree is X1 <= 177, which leads to the
following tree and partitioned dataset
Partition 1 (X1 <= 177):
Y      X1   X2
79.99  162  28
61.89  162  28
Partition 2 (X1 > 177):
Y      X1   X2
28.02  192  28
38.07  192  90
44.3   192  360
36.45  228  28
45.85  228  28
39.29  228  28
47.81  228  28
47.03  228  90
42.33  228  90
52.91  228  90
40.27  228  270
43.01  228  270
41.05  228  365
43.7   228  365
[Tree diagram]
Node 1 (root): X1 <= 177.00 (STD = 11.371, Avg = 45.748, N = 16)
  X1 <= 177.00 -> Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2
  X1 > 177.00 -> Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14
Note: the predicted values for this tree are the respective averages in each terminal node.
Terminal Node 1 predicted value: (79.99 + 61.89) / 2 ≈ 70.94
22. CART Geometry
[Figures: left, Y plotted against X1 with the split at X1 = 177 and the two predicted
levels (70.937 and 42.150) drawn as horizontal bars; right, the (X1, X2) predictor space
cut into two regions at X1 = 177, shown next to the one-split tree]
23. CART: Splitting Procedure
So how do we get to our final tree?
[Tree diagrams: the one-split tree (root: X1 <= 177.00; Terminal Node 1: Avg = 70.937, N = 2;
Terminal Node 2: Avg = 42.150, N = 14) beside the final two-split tree, in which Node 2 is
split again at X1 <= 210.00 into Terminal Node 2 (STD = 6.705, Avg = 36.797, N = 3) and
Terminal Node 3 (STD = 4.375, Avg = 43.609, N = 11)]
24. CART: Splitting Procedure
We now perform the same procedure again, but this time for each partition
of the data (we can only split one partition at a time)
Partition 1 (X1 <= 177): the two rows with X1 = 162 (unchanged from the previous slide)
Partition 2 (X1 > 177), split again at X1 <= 210:
Y      X1   X2
28.02  192  28
38.07  192  90
44.3   192  360
and X1 > 210:
Y      X1   X2
36.45  228  28
45.85  228  28
39.29  228  28
47.81  228  28
47.03  228  90
42.33  228  90
52.91  228  90
40.27  228  270
43.01  228  270
41.05  228  365
43.7   228  365
Best split: split Partition 2 at X1 <= 210.
[Tree diagrams: the one-split tree (X1 <= 177.00) and the resulting two-split tree, in which
Node 2 splits at X1 <= 210.00 into Terminal Node 2 (STD = 6.705, Avg = 36.797, N = 3) and
Terminal Node 3 (STD = 4.375, Avg = 43.609, N = 11)]
25. CART Geometry
[Figure: geometry of the two-split tree. The X1 axis is now cut at 177 and again at 210,
giving three regions of the (X1, X2) predictor space; the predicted Y in each region is
that region's average (70.937, 36.797, and 43.609)]
26. Where are we?
✓ CART Splitting Process
❑ CART Pruning
❑ Advantages of CART
❑ Interpreting CART Output
❑ Applied Example using CART
❑ Random Forest Section
27. CART: Algorithm
Step 1: Grow a large tree
Step 2: Prune the large tree
This is also done for you automatically
Use either a test sample or cross validation to
prune subtrees
28. CART: Pruning with a Test Sample
Test sample: randomly select a certain percentage of the data
(often ~20%-30%) to be used to assess the model error
Prune the CART tree
1. Run the test data down the large tree and the
smaller trees (the smaller trees are called
โsubtreesโ)
2. Compute the test error for each tree
3. The final tree shown to the user is the tree with
the smallest test error
Subtree  Test Error
1        200
2        125
3        100
4        83
5        113
6        137
(Here the final tree is subtree 4, which has the smallest test error.)
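A rough analogue of this pruning loop with scikit-learn (synthetic data; SPM does all of this automatically): grow one large tree, enumerate its nested subtrees via cost-complexity pruning, and keep the subtree with the smallest test error:

    import numpy as np
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0]**2 + 2*X[:, 1] + rng.normal(0, 0.5, 1000)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    big = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)        # Step 1: grow
    best_leaves, best_err = None, np.inf
    for a in big.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas:  # nested subtrees
        sub = DecisionTreeRegressor(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
        err = mean_squared_error(y_te, sub.predict(X_te))              # Step 2: test error
        if err < best_err:                                             # Step 3: keep best
            best_leaves, best_err = sub.get_n_leaves(), err
    print(best_leaves, round(best_err, 3))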
29. Where are we?
✓ CART Splitting Process
✓ CART Pruning
❑ Advantages of CART (this is what allows you to build
models with dirty data)
❑ Interpreting CART Output
❑ Applied Example using CART
❑ Random Forest Section
30. CART Advantages
In practice, you can build CART models with dirty data (i.e.
missing values, lots of variables, nonlinear relationships, outliers,
and numerous local effects)
This is due to CART's desirable properties:
1. Easy to interpret
2. Automatic handling of the following:
a) Variable selection
b) Variable interaction modeling
c) Local effect modeling
d) Nonlinear relationship modeling
e) Missing values
f) Outliers
3. Not affected by monotonic transformations of variables
31. CART: Interpretation and Automatic
Variable Selection
Interpretation: CART trees have a simple
interpretation and only require that someone ask
themselves a series of "yes or no" questions like "Is
AGE_DAY <= 21?" etc.
Variable Selection:
All variables will be considered for each split
but not all variables will be used. Some
variables will be used more than others.
Only one variable is used for each split
The variables chosen are those that reduce
the error the most
[Tree diagram]
Node 1 (root): split on AGE_DAY <= 21.00 (STD = 16.661, Avg = 36.036, N = 825)
  AGE_DAY <= 21.00 -> Terminal Node 1: STD = 12.441, Avg = 23.944, N = 260
  AGE_DAY > 21.00 -> Node 2: split on CEMENT_AMT <= 355.95 (STD = 15.358, Avg = 41.600, N = 565)
    CEMENT_AMT <= 355.95 -> Terminal Node 2: STD = 12.683, Avg = 37.036, N = 436
    CEMENT_AMT > 355.95 -> Terminal Node 3: STD = 13.452, Avg = 57.026, N = 129
32. CART: Automatic Variable Interactions
and Local Effects
In regression, interaction terms are modeled globally in
the form x1*x2 or x1*x2*x3 (global means that
the interaction is present everywhere).
In CART, interactions are automatically modeled
over certain regions of the data (i.e. locally), so
you do not have to worry about adding interaction
terms or local terms to your model
Example: Notice how the prediction changes for different
amounts of cement given that the Age is over 21 days (i.e. this
is the interaction)
1. If Age > 21 and Cement Amount <= 355.95 then the
average strength is 37 megapascals
2. If Age > 21 and Cement Amount > 355.95 then the
average strength is 57 megapascals
[Tree diagram]
Node 1 (root): split on AGE_DAY <= 21.00 (STD = 16.661, Avg = 36.036, N = 825)
  AGE_DAY <= 21.00 -> Terminal Node 1: STD = 12.441, Avg = 23.944, N = 260
  AGE_DAY > 21.00 -> Node 2: split on CEMENT_AMT <= 355.95 (STD = 15.358, Avg = 41.600, N = 565)
    CEMENT_AMT <= 355.95 -> Terminal Node 2: STD = 12.683, Avg = 37.036, N = 436
    CEMENT_AMT > 355.95 -> Terminal Node 3: STD = 13.452, Avg = 57.026, N = 129
33. CART: Automatic Nonlinear Modeling
Nonlinear functions (and linear ones) are approximated via step functions, so in practice
you do not need to worry about adding terms like $x^2$ or $\log x$ to capture
nonlinear relationships. The picture below is the CART fit to $Y = x^2 + \text{noise}$. CART
modeled this data automatically. No data pre-processing. Just CART.
[Figure: CART step-function fit to Y = x^2 + noise]
34. CART: Automatic Missing Value Handling
CART automatically handles missing values while building the
model, so you do not need to impute missing values yourself
The missing values are handled using a surrogate split
Surrogate split: find another variable whose split is "similar" to the split on the
variable with the missing values, and split on the variable that does
not have missing values
Reference: see Section 5.3 in Breiman, Friedman, Olshen, and
Stone for more information
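A toy sketch of the surrogate idea only (not SPM's exact algorithm; the data and variable names are invented): among the other variables, pick the split that agrees most often with the primary split, and use it to route records whose primary variable is missing:

    import numpy as np

    rng = np.random.default_rng(0)
    age = rng.uniform(1, 365, 500)
    cement = rng.uniform(100, 500, 500)
    primary = age <= 21                  # left/right labels from the primary split

    def best_surrogate(z, primary):
        # threshold on z whose left/right assignment agrees most often with
        # the primary split (checking both orientations)
        scored = [(max(np.mean((z <= c) == primary), np.mean((z > c) == primary)), c)
                  for c in np.unique(z)]
        return max(scored)

    score, cut = best_surrogate(cement, primary)
    print(f"surrogate: CEMENT <= {cut:.1f}, agreement {score:.2%}")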
35. CART: Outliers in the Target Variable
Two types of outliers are
1. Outliers in the target variable (i.e. "Y")
2. Outliers in the predictor variables (i.e. "X")
CART is more sensitive to outliers with
respect to the target variable
1. More severe in a regression context
than a classification context
2. CART may treat target variable outliers
by isolating them in small terminal
nodes which can limit their effect
Reference: Pages 197-200 and 253 in
Breiman, Friedman, Olshen, and Stone (1984)
[Figure: the one-split tree and its geometry in (X1, X2, Y) space. Here the target
outliers (Y = 79.99 and 61.89) are isolated in Terminal Node 1 (STD = 9.049,
Avg = 70.937, N = 2), while Terminal Node 2 holds the remaining records
(Avg = 42.150, N = 14), limiting the outliers' effect]
36. CART: Outliers in the Predictor Variables
CART is more robust to outliers in the predictor variables, partly due to the nature of the splitting process
Reference: Pages 197-200 and 253 in Breiman, Friedman, Olshen, and Stone (1984)
[Figure: the same example with an added predictor-variable outlier (N = 17). The tree
still splits at X1 <= 177.00: Terminal Node 1 becomes STD = 8.532, Avg = 73.957, N = 3
and Terminal Node 2 is essentially unchanged (Avg = 42.149, N = 14), compared with the
original tree (Terminal Node 1: Avg = 70.937, N = 2; Terminal Node 2: Avg = 42.150, N = 14)]
37. CART: Monotonic Transformations of Variables
A monotonic transformation is a transformation that does not change the order of a variable's values.
– CART, unlike linear regression, is not affected by such transformations, so if a transformation does not change the order of a variable's values
then you do not need to worry about adding it to a CART model
– Example: Our best first split in the example tree was X1 <= 177. What happens if we square X1? The split point value
changes, but nothing else does, including the predicted values. This happens because the same Y values fall into the
same partition (i.e. their order has not changed after we squared and sorted X1)
Original data, split at X1 <= 177 (same partitions as before):
Partition 1: the two rows with X1 = 162 (Y = 79.99, 61.89)
Partition 2: the fourteen rows with X1 = 192 or 228
[Tree diagram: root splits at X1 <= 177.00; Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2;
Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14]
With X1 squared (X1_SQ), the values become 26,244 (was 162), 36,864 (was 192), and
51,984 (was 228). Splitting at X1_SQ <= 31554 produces exactly the same two partitions:
Partition 1: the two rows with X1_SQ = 26,244
Partition 2: the fourteen rows with X1_SQ = 36,864 or 51,984
[Tree diagram: root splits at X1_SQ <= 31554.00; Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2;
Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14; identical statistics to the original tree]
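The invariance claim is easy to check with any CART-style implementation; a sketch with scikit-learn trees on synthetic data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(100, 300, 200).reshape(-1, 1)
    y = np.where(x[:, 0] <= 177, 70, 42) + rng.normal(0, 5, 200)

    t1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)
    t2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x**2, y)
    print(np.allclose(t1.predict(x), t2.predict(x**2)))   # True: same partitions,
                                                          # only the printed split
                                                          # point changes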
38. Where are we?
✓ CART Splitting Process
✓ CART Pruning
✓ Advantages of CART (this is what allows you to build
models with dirty data)
❑ Interpreting CART Output
❑ Applied Example using CART
❑ Random Forest Section
39. CART: Relative Error
Relative error: used to determine the optimal model complexity for CART models
GOOD: Relative error values close to zero mean that CART is doing a better job than predicting only the
overall average (or median) for all records in the data
BAD: A relative error value equal to one means that CART is no better than predicting the overall
average (or median) of the target variable for every record. Note: the relative error can be greater than one,
which is especially bad.
The relative error can be computed for both least squares, $LS = \sum_i (y_i - \hat{y}_i)^2$,
and least absolute deviation, $LAD = \sum_i |y_i - \hat{y}_i|$.
$\text{Relative Error} = \frac{\text{CART model error using the method (LS or LAD)}}{\text{Error for predicting the overall average (or median) for all records}}$
[Screenshot: SPM reports a LAD relative error of 0.129 for this model]
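A minimal sketch of the LS version of this ratio in Python (numpy and scikit-learn assumed; the 0.129 above comes from SPM, not from this code):

    import numpy as np
    from sklearn.metrics import mean_squared_error

    def relative_error(y_true, y_pred):
        baseline = np.full_like(y_true, y_true.mean())   # predict the overall average
        return mean_squared_error(y_true, y_pred) / mean_squared_error(y_true, baseline)

    y_true = np.array([23.0, 41.0, 37.0, 57.0, 30.0])
    y_pred = np.array([25.0, 40.0, 36.0, 55.0, 31.0])
    print(relative_error(y_true, y_pred))   # close to 0 -> much better than the mean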
40. CART: Variable Importance
CART Variable Importance: sum each
variable's split improvement score across the
splits in the tree. The importance score for a
variable is increased in two ways:
1) When the variable is actually used
to split a node
2) When the variable is used as the
surrogate split (i.e. the backup
splitting variable when the primary
splitting variable has a missing
value)
"Consider Only Primary Splitters" (green
rectangle on the right) removes the
surrogate splitting variables from the
variable importance calculation
"Discount Surrogates" allows you to
discount surrogates in a more specific
manner
42. CART Performance
Method                                          Test MSE
Linear Regression                               109.04
Linear Regression with interactions             67.35
Min 1 SE CART (default settings)                65.05
CART (default settings)                         55.99
Random Forest (default settings)                –
Improved Random Forest using an SPM Automate    –

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
***More information
about the 1 Standard
Error Rule for CART
can be found in the
appendix
43. Where are we?
✓ CART Splitting Process
✓ CART Pruning
✓ Advantages of CART (this is what allows you to build
models with dirty data)
✓ Interpreting CART Output
✓ Applied Example using CART
❑ Random Forest Section
44. Introduction to Random Forests
Main Idea: fit multiple CART trees to independent "bootstrap"
samples of the data and then combine the predictions
Leo Breiman, one of the co-creators of CART,
also created Random Forests and published a
paper on this method in 2001
Our RandomForests® software was developed in close consultation
with Breiman himself
45. What is a bootstrap sample?
A bootstrap sample is a random sample drawn with replacement
Steps:
1. Randomly select an observation from the original data
2. "Write it down"
3. "Put it back" (i.e. any observation can be selected more than once)
Repeat steps 1-3 N times, where N is the number of observations in the original sample
FINAL RESULT: One "bootstrap sample" with N observations (see the sketch at the end of this slide)
[Figure: an "Original Data" table (columns Y, X1, X2) next to a "Bootstrap Sample" table of
the same size; some original rows, e.g. (0, 37, 1), appear multiple times in the bootstrap
sample while others do not appear at all]
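A sketch of steps 1-3 in Python with numpy (synthetic stand-in data):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(10, 3))                    # stand-in for Y, X1, X2
    idx = rng.choice(len(data), size=len(data), replace=True)
    bootstrap = data[idx]                              # some rows repeat, some are absent
    print(sorted(idx))                                 # duplicated indices are visible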
46. Random Forests: The Algorithm
1. Draw a bootstrap sample from the original data
2. Fit a large, unpruned CART tree to this bootstrap sample
   – At each split in the tree consider only k randomly selected variables instead of all of them
3. Repeat steps 1-2 at least 200 times (Bootstrap 1, Bootstrap 2, ..., Bootstrap 200, one tree per sample)
Predict a new record: run the record down each tree, each time computing a prediction,
then take the average of the individual predictions
Example with 200 trees: Tree 1: 10.5, Tree 2: 9.8, ..., Tree 199: 10.73, Tree 200: 12
Final prediction for the new record: $(10.5 + 9.8 + \dots + 10.73 + 12)/200$
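A hedged sketch of these steps using scikit-learn trees as the base learners (a rough open-source analogue, not the RandomForests® software; data is synthetic):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    y = X[:, 0]**2 + X[:, 1] + rng.normal(0, 0.5, 500)

    B, k, trees = 200, 3, []
    for _ in range(B):
        idx = rng.choice(len(X), size=len(X), replace=True)   # 1. bootstrap sample
        tree = DecisionTreeRegressor(max_features=k)          # 2. large tree; only k
        trees.append(tree.fit(X[idx], y[idx]))                #    variables per split

    x_new = rng.normal(size=(1, 8))
    print(np.mean([t.predict(x_new)[0] for t in trees]))      # 3. average to predict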
47. CART and Random Forests
When you build a Random Forest model just keep this picture in
the back of your mind:
$\text{Forest prediction} = \frac{1}{B}\left[\text{Tree}_1 + \text{Tree}_2 + \text{Tree}_3 + \dots + \text{Tree}_B\right]$
The reason is that a Random Forest is really just an average of
CART trees constructed on bootstrap samples of the original data
48. CART and Random Forests
Random Forests generally have superior predictive performance versus CART trees
because Random Forests have lower variance than a single CART tree
Since Random Forests are a combination of CART trees, they inherit many of CART's
properties:
Automatic
Variable selection
Variable interaction detection
Nonlinear relationship detection
Missing value handling
Outlier handling
Modeling of local effects
Invariant to monotone transformations of predictors
One drawback is that a Random Forest is not as interpretable as a single CART tree
49. Random Forests: Tuning Parameters
The performance of a Random Forest is
dependent upon the values of certain model
parameters
Two of these parameters are
1. Number of trees
2. Random number of variables chosen at each split
50. Random Forests: Number of Trees
Number of trees
Default: 200
The number of trees should be large enough so that the model error no
longer meaningfully declines as the number of trees increases
(experimentation will be required)
In Random Forests the optimal number of trees tends to be the
maximum value allotted (due to the Law of Large Numbers)
[Figures: error curves for the default setting of 200 trees and for 400 trees]
There is not much of a difference between the error for a forest with
200 trees and one with 400 trees, so, at least for this dataset, a
larger number of trees will not improve the model meaningfully
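A sketch of this check with scikit-learn's RandomForestRegressor on synthetic data (in SPM you would read it off the error curves instead):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 8))
    y = X[:, 0]**2 + X[:, 1] + rng.normal(0, 0.5, 800)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for n in (200, 400):   # does doubling the trees still help?
        rf = RandomForestRegressor(n_estimators=n, random_state=0).fit(X_tr, y_tr)
        print(n, mean_squared_error(y_te, rf.predict(X_te)))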
51. Random Forests: Tuning Parameters
The performance of a Random Forest is
dependent upon the values of certain model
parameters
Two of these parameters are
1. Number of trees
2. Random number of variables chosen at each
split
52. Random Forest Parameters: Random Variable Subset Size
Random number of variables k chosen
at each split in each tree in the forest
Default: k = 3
Experimentation will be required to
find the optimal value, and this can be
done using Automate RFNPREDS
Automate RFNPREDS: automatically
builds multiple Random Forests; each
time the forest is the same except
that the number of randomly selected
variables at each split in each CART
tree changes
This allows us to conveniently
determine the optimal number
of variables to randomly select
at each split in each tree
This output tells you that the optimal number of
randomly selected variables at each split in each
tree in the forest is 5.
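A rough analogue of Automate RFNPREDS with scikit-learn (synthetic data; max_features plays the role of k):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 8))
    y = X[:, 0]**2 + X[:, 1] + rng.normal(0, 0.5, 800)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # refit the same forest while varying only k, then keep the best k
    errs = {k: mean_squared_error(
                y_te,
                RandomForestRegressor(n_estimators=200, max_features=k,
                                      random_state=0).fit(X_tr, y_tr).predict(X_te))
            for k in range(1, 9)}
    print(min(errs, key=errs.get), errs)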
53. Interpreting a Random Forest
Since a Random Forest is a collection of hundreds or
even thousands of CART trees, the simple
interpretation is lost because we now have hundreds of
trees and are averaging the predictions
One method used to interpret a Random Forest is
variable importance
54. Random Forest for Regression: Variable Importance
CART Variable Importance: sum each variable's split
improvement score across the splits in the tree
The importance score for a variable is increased in two ways: 1)
when the variable is actually used to split a node and 2) when the
variable is used as the surrogate split (i.e. the backup splitting variable
when the primary splitting variable has a missing value)
Random Forest Variable Importance for Regression:
1. Compute a score for every split the variable generates, then sum the
scores across all splits made
– Relative Importance: divide all variable importance scores by the
maximum variable importance score (i.e. the most important variable
has a relative importance value of 100)
Note: For classification models, the preferred method is the random permutation method (see appendix for more details)
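A sketch of the relative-importance rescaling with scikit-learn, whose impurity-based importances are an analogue of the split-improvement sums described above (no surrogate credit):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = 3*X[:, 0] + X[:, 1]**2 + rng.normal(0, 0.5, 500)

    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    rel = 100 * rf.feature_importances_ / rf.feature_importances_.max()
    print(np.round(rel, 1))   # relative importance; most important variable = 100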
56. Random Forests: Model Performance
Method                                          Test MSE
Linear Regression                               109.04
Linear Regression with interactions             67.35
Min 1 SE CART (default settings)                65.05
CART (default settings)                         55.99
Random Forest (default settings)                37.70
Improved Random Forest using an SPM Automate    36.02

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
57. Conclusion
CART produces an interpretable model that is more resistant to
outliers, predicts future data well, and automatically handles
1. Variable interactions
2. Missing values
3. Nonlinear relationships
4. Local effects
Random Forests are fundamentally a combination of individual
CART trees and thus inherit all of the advantages of CART above
(except the nice interpretation)
A Random Forest is generally superior to a single CART tree in terms
of predictive accuracy
58. Next in the seriesโฆ
Improve Your Regression with TreeNet® Gradient Boosting
59. Try CART and Random Forest
Download SPM® 8 now to start building CART and
Random Forest models on your data
We will be more than happy to personally help you
if you have any questions or need assistance
My Email: charrison@salford-systems.com
Support Email: support@salford-systems.com
The appendix follows this slide
61. 1SE Rule in SPM
[Figures: the optimal tree (upper left) and the 1 Standard Error Rule tree (upper right)]
1SE Rule Tree: the smallest tree whose error is within one standard error of the minimum error
Figures: Upper Left: the optimal tree has 188 terminal nodes and a relative error of .177; Upper Right: the 1SE
tree has 85 terminal nodes and a relative error of .209
Smaller trees are preferred because they are less likely to overfit the data (i.e. the 1SE tree in this case is competitive
in terms of accuracy and is much less complex) and they are easier to interpret
$\text{Relative Error} = \frac{\text{CART model error using the method (LS or LAD)}}{\text{Error for predicting the overall average (or median if using LAD) for all records}}$
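A hedged sketch of the 1SE selection with cross-validation in scikit-learn (an analogue of SPM's built-in computation; data is synthetic):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 5))
    y = X[:, 0]**2 + 2*X[:, 1] + rng.normal(0, 0.5, 600)

    full = DecisionTreeRegressor(random_state=0).fit(X, y)
    alphas = full.cost_complexity_pruning_path(X, y).ccp_alphas[::5]  # thinned grid

    stats = []
    for a in alphas:   # CV error and its standard error for each candidate subtree
        scores = -cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                                  X, y, cv=5, scoring="neg_mean_squared_error")
        stats.append((a, scores.mean(), scores.std() / np.sqrt(len(scores))))

    best_mean = min(m for _, m, _ in stats)
    limit = best_mean + next(se for _, m, se in stats if m == best_mean)
    # largest alpha (hence smallest tree) whose CV error is within one SE of the best
    print(max(a for a, m, _ in stats if m <= limit))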