Why You Should Watch: Learn the fundamentals of tree-based machine learning algorithms and how to easily fine-tune and improve your Random Forest regression models.
Abstract: In this webinar we'll introduce you to two tree-based machine learning algorithms, CART® decision trees and RandomForests®. We will discuss the advantages of tree-based techniques, including their ability to automatically handle variable selection, variable interactions, nonlinear relationships, outliers, and missing values. We'll explore the CART algorithm, bootstrap sampling, and the Random Forest algorithm (all with animations) and compare their predictive performance using a real-world dataset.
2. Outline
Applications of CART and Random Forests
Ordinary Least Squares Regression
– A review
– Common issues in standard linear regression
Data Description
Improving your regression with an applied example
โ CART decision tree
โ Random Forest
Conclusions
3. Applications
In this webinar we use CART® software and RandomForests® software to predict
concrete strength, but as we will see these techniques can be applied to any field
Quantitative Targets: Number of Cavities, Blood Pressure, Income etc.
Qualitative Targets: Disease or No Disease; Buy or Not Buy; Lend or Do Not Lend;
Buy Product A vs. Product B vs. Product C vs. Product D
Examples
Credit Risk
Glaucoma Screening
Insurance Fraud
Customer Loyalty
Drug Discovery
Early Identification of Reading Disabilities
Biodiversity and Wildlife Conservation
4. Preview: CART and Random Forest Advantages
As we will see in this presentation both CART and Random
Forests have desirable properties that allow you to build
accurate predictive models with dirty data (i.e. missing values,
lots of variables, nonlinear relationships, outliers etc.)
[Figures: Geometry of a CART tree with 1 split; geometry of a CART tree with 2 splits]
5. Preview: Model Performance
Method                                          Test MSE
Linear Regression                               109.04
Linear Regression with interactions             67.35
Min 1 SE CART (default settings)                65.05
CART (default settings)                         55.99
RandomForests® (default settings)               37.570
Improved RandomForests® using an SPM Automate   36.02

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
6. What is OLS?
OLS – ordinary least squares regression
– Discovered by Legendre (1805) and Gauss (1809) to solve problems in astronomy using pen and paper
The model is of the form
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$
– $\beta_0$: the intercept term
– $\beta_1, \beta_2, \beta_3, \dots$: coefficient estimates
– $X_1, X_2, X_3, \dots, X_p$: predictor variables (i.e. columns in the dataset)
Example: Income = 20,000 + 2,500*WorkExperience + 1,000*EducationYears
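To make the form above concrete, here is a minimal sketch of fitting an OLS model in Python with scikit-learn (scikit-learn is an assumption here, not part of SPM; the data is a made-up stand-in for the income example):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    work_exp = rng.uniform(0, 20, 100)       # X1: years of work experience
    edu_years = rng.uniform(8, 20, 100)      # X2: years of education
    income = 20_000 + 2_500*work_exp + 1_000*edu_years + rng.normal(0, 5_000, 100)

    X = np.column_stack([work_exp, edu_years])
    ols = LinearRegression().fit(X, income)  # estimates B0 (intercept), B1, B2
    print(ols.intercept_, ols.coef_)         # should land near 20000, [2500, 1000]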
7. Common Issues in Regression
Missing values
– Require imputation OR
– Result in record deletion
Nonlinearities and Local Effects
– Example: $Y = 10 + 3x_1 + x_2 - 0.3x_1^2$
– Modeled via manual transformations, or the transformed terms are automatically added and then selected via forward, backward, stepwise, or regularized selection
– Local effects are ignored unless specified by the analyst, which is very difficult (or impossible) in practice without subject matter expertise or prior knowledge
Interactions
– Example: $Y = 10 + 3x_1 - 2x_2 + 0.25x_1x_2$
– Manually added to the model (or through some automated procedure)
– Add interactions, then use variable selection (e.g. regularized regression or forward, backward, or stepwise selection); see the sketch after this list
Variable selection
– Usually accomplished manually or in combination with automated selection procedures
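A sketch of the manual feature engineering described above, assuming Python with scikit-learn and synthetic data (PolynomialFeatures stands in for whatever expansion procedure the analyst chooses):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))                          # x1, x2
    y = 10 + 3*X[:, 0] - 2*X[:, 1] + 0.25*X[:, 0]*X[:, 1]  # the interaction example above

    # degree=2 adds x1^2, x1*x2, x2^2; the analyst must pick the degree and then
    # run a selection procedure (stepwise, regularization, ...) over the new columns.
    X_exp = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    print(X_exp.shape)                                     # (200, 5)
    print(LinearRegression().fit(X_exp, y).coef_.round(2)) # recovers 3, -2, 0, 0.25, 0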
8. Solutions to OLS Problems
Two methods that do not suffer from the
drawbacks of linear regression are CART and
Random Forests
These methods automatically
– Handle missing values
– Model nonlinear relationships and local effects
– Select variables
– Model variable interactions
9. Concrete Strength
Target:
– STRENGTH: compressive strength of concrete in megapascals
Predictors:
– CEMENT
– BLAST_FURNACE_SLAG
– FLY_ASH
– WATER
– SUPERPLASTICIZER
– COARSE_AGGREGATE
– FINE_AGGREGATE
– AGE
I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)
10. Why predict concrete strength?
Concrete is one of the most important materials in our society
and is a key ingredient in important infrastructure projects like
bridges, roads, buildings, and dams (MATSE)
Predicting the strength of concrete is important because concrete strength is a
key component of the overall stability of these structures
Source: http://matse1.matse.illinois.edu/concrete/prin.html
12. Regression Results
Method                                  Test MSE
Linear Regression                       109.04
Linear Regression with interactions     67.35

Strength = -9.70 + .115*Cement + .01*BlastFurnaceSlag + .014*FlyAsh - .172*Water +
.10*Superplasticizer + .01*CoarseAggregate + .01*FineAggregate + .11*Age

$\text{Test MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
**Test sample: 20% of observations were randomly selected for the testing dataset
**This same test dataset was used to evaluate all models for the purpose of comparisons
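A hedged sketch of this evaluation protocol in Python with scikit-learn; the data here is a synthetic stand-in (1030 rows, like the concrete data) since the real file is not reproduced in these slides:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(1030, 8))            # 8 predictors: CEMENT ... AGE
    y = 20 + 30*X[:, 0] - 10*X[:, 3] + 5*X[:, 7]**2 + rng.normal(0, 5, 1030)

    # Hold out a random 20% once and score every model on this same test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)
    lr = LinearRegression().fit(X_tr, y_tr)
    print("Test MSE:", mean_squared_error(y_te, lr.predict(X_te)))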
13. Classification And Regression Trees
Authors: Breiman, Friedman, Olshen, and Stone (1984)
CART is a decision tree algorithm used for both regression and classification problems
1. Classification: tries to separate classes by choosing variables and points that best
separate them
2. Regression: chooses the best variables and split points for reducing the squared or
absolute error criterion
CART is available exclusively in the SPM® 8 Software Suite and was developed in close
consultation with the original authors
14. CART Introduction
Main Idea: divide the predictor space (often people say "partition" instead of "divide")
into different regions so that the dependent variable can be predicted more accurately.
The following figure shows the predicted values from a CART tree (i.e. the red horizontal
bars) fit to the curve $Y = x^2 + \text{noise}$, where the noise is drawn from a N(0,1) distribution.
[Figure: scatter of Y against x with the CART step-function fit overlaid]
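A small sketch reproducing this kind of fit with an open-source CART-style tree (scikit-learn's DecisionTreeRegressor, an analogue rather than the CART® software itself):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(-2, 2, 300))
    y = x**2 + rng.normal(0, 1, 300)             # Y = x^2 + N(0,1) noise

    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(x.reshape(-1, 1), y)
    y_hat = tree.predict(x.reshape(-1, 1))       # piecewise-constant approximation
    print(sorted(set(np.round(y_hat, 3))))       # the handful of "red bar" levels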
15. CART: Terminology
A tree split occurs when
a variable is partitioned (in-depth example starts
after the next slide). This tree has two splits:
1. AGE_DAY <= 21
2. CEMENT_AMT <= 355.95
The node at the top of the
tree is called the root node
A node that has no sub-branch is
a terminal node
This tree has three terminal
nodes (i.e. red boxes in the tree)
[Tree diagram]
Node 1 (root): split on AGE_DAY <= 21.00 (STD = 16.661, Avg = 36.036, N = 825)
  AGE_DAY <= 21.00 -> Terminal Node 1: STD = 12.441, Avg = 23.944, N = 260
  AGE_DAY > 21.00 -> Node 2: split on CEMENT_AMT <= 355.95 (STD = 15.358, Avg = 41.600, N = 565)
    CEMENT_AMT <= 355.95 -> Terminal Node 2: STD = 12.683, Avg = 37.036, N = 436
    CEMENT_AMT > 355.95 -> Terminal Node 3: STD = 13.452, Avg = 57.026, N = 129
The predicted value in a CART regression model is the average of the target
variable (i.e. "Y") for the records that fall into a given terminal node
Example: If Age = 26 days and the amount of cement is 400,
then the predicted strength is 57.026 megapascals (Terminal Node 3)
16. CART: Algorithm
Step 1: Grow a large tree
This is done for you automatically
All variables are considered
at each split in the tree
Each split is made using one
variable and a specific value or set of values.
Splits are chosen so as to minimize model
error
The tree is grown until either a user-specified
criterion is met or until the tree cannot be
grown further
Step 2: Prune the large tree
This is also done for you automatically
Use either a test sample or cross validation to prune subtrees
17. CART: Splitting Procedure
Consider the following CART tree grown on this dataset
How exactly do we get this tree?
Y        X1   X2
79.9861  162  28
61.8874  162  28
40.2695  228  270
41.0528  228  365
44.2961  192  360
47.0298  228  90
43.6983  228  365
36.4478  228  28
45.8543  228  28
39.2898  228  28
38.0742  192  90
28.0217  192  28
43.0130  228  270
42.3269  228  90
47.8138  228  28
52.9083  228  90
[Tree diagram]
Node 1 (root): split on X1 <= 177.00 (STD = 11.371, Avg = 45.748, N = 16)
  X1 <= 177.00 -> Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2
  X1 > 177.00 -> Node 2: split on X1 <= 210.00 (STD = 5.700, Avg = 42.150, N = 14)
    X1 <= 210.00 -> Terminal Node 2: STD = 6.705, Avg = 36.797, N = 3
    X1 > 210.00 -> Terminal Node 3: STD = 4.375, Avg = 43.609, N = 11
20. CART: Splitting Procedure
At this point CART has evaluated all possible split points for our two variables,
X1 and X2, and determined the optimal split point for each.
Splitting on either X1 or X2 will yield a different tree, so which is the best split?
The one with the largest split improvement.
Best split for X1: X1 <= 177 (improvement value: 90.64)
Best split for X2: X2 <= 59 (improvement value: 5.77)
[Tree diagrams]
Split on X1 - Node 1 (root): X1 <= 177.00 (STD = 11.371, Avg = 45.748, N = 16)
  X1 <= 177.00 -> Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2
  X1 > 177.00 -> Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14
Split on X2 - Node 1 (root): X2 <= 59.00 (STD = 11.371, Avg = 45.748, N = 16)
  X2 <= 59.00 -> Terminal Node 1: STD = 16.158, Avg = 48.472, N = 7
  X2 > 59.00 -> Terminal Node 2: STD = 4.069, Avg = 43.630, N = 9
Split improvement (least squares):
$\Delta R(s, t) = R(t) - R(t_L) - R(t_R)$, where $R(t) = \frac{1}{N}\sum_{x_i \in t}\left(y_i - \bar{y}(t)\right)^2$ and $N$ is the total sample size
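The improvement values quoted on this slide can be checked directly. A sketch in Python (numpy assumed) that scans the candidate split points for X1 on the 16-row dataset above using this formula:

    import numpy as np

    y  = np.array([79.9861, 61.8874, 40.2695, 41.0528, 44.2961, 47.0298, 43.6983,
                   36.4478, 45.8543, 39.2898, 38.0742, 28.0217, 43.0130, 42.3269,
                   47.8138, 52.9083])
    x1 = np.array([162, 162, 228, 228, 192, 228, 228, 228, 228, 228,
                   192, 192, 228, 228, 228, 228])

    N = len(y)
    R = lambda mask: ((y[mask] - y[mask].mean())**2).sum() / N   # node risk R(t)
    root = np.ones(N, dtype=bool)

    # try every candidate threshold for X1 and keep the largest improvement
    best = max(((s, R(root) - R(x1 <= s) - R(x1 > s)) for s in np.unique(x1)[:-1]),
               key=lambda p: p[1])
    print(best)   # (162, 90.63...); CART reports the midpoint (162+192)/2 = 177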
21. CART: 1st Split
Our best first split in the tree is X1 <= 177, which leads to the
following tree and partitioned dataset
Partition 1 (X1 <= 177):
Y      X1   X2
79.99  162  28
61.89  162  28
Partition 2 (X1 > 177):
Y      X1   X2
28.02  192  28
38.07  192  90
44.3   192  360
36.45  228  28
45.85  228  28
39.29  228  28
47.81  228  28
47.03  228  90
42.33  228  90
52.91  228  90
40.27  228  270
43.01  228  270
41.05  228  365
43.7   228  365
[Tree diagram]
Node 1 (root): X1 <= 177.00 (STD = 11.371, Avg = 45.748, N = 16)
  X1 <= 177.00 -> Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2
  X1 > 177.00 -> Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14
Note: the predicted values for this tree are the respective averages in each terminal node.
Terminal Node 1 predicted value: (79.99 + 61.89) / 2 ≈ 70.94
22. CART Geometry
[Figures: left, Y plotted against X1 with the split at X1 = 177 and the two predicted
levels (70.937 and 42.150) drawn as horizontal bars; right, the (X1, X2) predictor space
cut into two regions at X1 = 177, shown next to the one-split tree]
23. CART: Splitting Procedure
So how do we get to our final tree?
[Tree diagrams: the one-split tree (root: X1 <= 177.00; Terminal Node 1: Avg = 70.937, N = 2;
Terminal Node 2: Avg = 42.150, N = 14) beside the final two-split tree, in which Node 2 is
split again at X1 <= 210.00 into Terminal Node 2 (STD = 6.705, Avg = 36.797, N = 3) and
Terminal Node 3 (STD = 4.375, Avg = 43.609, N = 11)]
24. CART: Splitting Procedure
We now perform the same procedure again, but this time for each partition
of the data (we can only split one partition at a time)
Partition 1 (X1 <= 177): the two rows with X1 = 162 (unchanged from the previous slide)
Partition 2 (X1 > 177), split again at X1 <= 210:
Y      X1   X2
28.02  192  28
38.07  192  90
44.3   192  360
and X1 > 210:
Y      X1   X2
36.45  228  28
45.85  228  28
39.29  228  28
47.81  228  28
47.03  228  90
42.33  228  90
52.91  228  90
40.27  228  270
43.01  228  270
41.05  228  365
43.7   228  365
Best split: split Partition 2 at X1 <= 210.
[Tree diagrams: the one-split tree (X1 <= 177.00) and the resulting two-split tree, in which
Node 2 splits at X1 <= 210.00 into Terminal Node 2 (STD = 6.705, Avg = 36.797, N = 3) and
Terminal Node 3 (STD = 4.375, Avg = 43.609, N = 11)]
25. CART Geometry
[Figure: geometry of the two-split tree. The X1 axis is now cut at 177 and again at 210,
giving three regions of the (X1, X2) predictor space; the predicted Y in each region is
that region's average (70.937, 36.797, and 43.609)]
26. Where are we?
✓ CART Splitting Process
❑ CART Pruning
❑ Advantages of CART
❑ Interpreting CART Output
❑ Applied Example using CART
❑ Random Forest Section
27. CART: Algorithm
Step 1: Grow a large tree
Step 2: Prune the large tree
This is also done for you automatically
Use either a test sample or cross validation to
prune subtrees
28. CART: Pruning with a Test Sample
Test sample: randomly select a certain percentage of the data
(often ~20%-30%) to be used to assess the model error
Prune the CART tree
1. Run the test data down the large tree and the
smaller trees (the smaller trees are called
โsubtreesโ)
2. Compute the test error for each tree
3. The final tree shown to the user is the tree with
the smallest test error
Subtree  Test Error
1        200
2        125
3        100
4        83
5        113
6        137
(Here the final tree is subtree 4, which has the smallest test error.)
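A rough analogue of this pruning loop with scikit-learn (synthetic data; SPM does all of this automatically): grow one large tree, enumerate its nested subtrees via cost-complexity pruning, and keep the subtree with the smallest test error:

    import numpy as np
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0]**2 + 2*X[:, 1] + rng.normal(0, 0.5, 1000)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    big = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)        # Step 1: grow
    best_leaves, best_err = None, np.inf
    for a in big.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas:  # nested subtrees
        sub = DecisionTreeRegressor(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
        err = mean_squared_error(y_te, sub.predict(X_te))              # Step 2: test error
        if err < best_err:                                             # Step 3: keep best
            best_leaves, best_err = sub.get_n_leaves(), err
    print(best_leaves, round(best_err, 3))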
29. Where are we?
✓ CART Splitting Process
✓ CART Pruning
❑ Advantages of CART (this is what allows you to build
models with dirty data)
❑ Interpreting CART Output
❑ Applied Example using CART
❑ Random Forest Section
30. CART Advantages
In practice, you can build CART models with dirty data (i.e.
missing values, lots of variables, nonlinear relationships, outliers,
and numerous local effects)
This is due to CART's desirable properties:
1. Easy to interpret
2. Automatic handling of the following:
a) Variable selection
b) Variable interaction modeling
c) Local effect modeling
d) Nonlinear relationship modeling
e) Missing values
f) Outliers
3. Not affected by monotonic transformations of variables
31. CART: Interpretation and Automatic
Variable Selection
Interpretation: CART trees have a simple
interpretation and only require that someone ask
themselves a series of "yes or no" questions like "Is
AGE_DAY <= 21?" etc.
Variable Selection:
All variables will be considered for each split
but not all variables will be used. Some
variables will be used more than others.
Only one variable is used for each split
The variables chosen are those that reduce
the error the most
[Tree diagram]
Node 1 (root): split on AGE_DAY <= 21.00 (STD = 16.661, Avg = 36.036, N = 825)
  AGE_DAY <= 21.00 -> Terminal Node 1: STD = 12.441, Avg = 23.944, N = 260
  AGE_DAY > 21.00 -> Node 2: split on CEMENT_AMT <= 355.95 (STD = 15.358, Avg = 41.600, N = 565)
    CEMENT_AMT <= 355.95 -> Terminal Node 2: STD = 12.683, Avg = 37.036, N = 436
    CEMENT_AMT > 355.95 -> Terminal Node 3: STD = 13.452, Avg = 57.026, N = 129
32. CART: Automatic Variable Interactions
and Local Effects
In regression, interaction terms are modeled globally in
the form x1*x2 or x1*x2*x3 (global means that
the interaction is present everywhere).
In CART, interactions are automatically modeled
over certain regions of the data (i.e. locally), so
you do not have to worry about adding interaction
terms or local terms to your model
Example: Notice how the prediction changes for different
amounts of cement given that the Age is over 21 days (i.e. this
is the interaction)
1. If Age > 21 and Cement Amount <= 355.95 then the
average strength is 37 megapascals
2. If Age > 21 and Cement Amount > 355.95 then the
average strength is 57 megapascals
[Tree diagram]
Node 1 (root): split on AGE_DAY <= 21.00 (STD = 16.661, Avg = 36.036, N = 825)
  AGE_DAY <= 21.00 -> Terminal Node 1: STD = 12.441, Avg = 23.944, N = 260
  AGE_DAY > 21.00 -> Node 2: split on CEMENT_AMT <= 355.95 (STD = 15.358, Avg = 41.600, N = 565)
    CEMENT_AMT <= 355.95 -> Terminal Node 2: STD = 12.683, Avg = 37.036, N = 436
    CEMENT_AMT > 355.95 -> Terminal Node 3: STD = 13.452, Avg = 57.026, N = 129
33. CART: Automatic Nonlinear Modeling
Nonlinear functions (and linear ones) are approximated via step functions, so in practice
you do not need to worry about adding terms like $x^2$ or $\log x$ to capture
nonlinear relationships. The picture below is the CART fit to $Y = x^2 + \text{noise}$. CART
modeled this data automatically. No data pre-processing. Just CART.
[Figure: CART step-function fit to Y = x^2 + noise]
34. CART: Automatic Missing Value Handling
CART automatically handles missing values while building the
model, so you do not need to impute missing values yourself
The missing values are handled using a surrogate split
Surrogate split: find another variable whose split is "similar" to the split on the
variable with the missing values, and split on the variable that does
not have missing values
Reference: see Section 5.3 in Breiman, Friedman, Olshen, and
Stone for more information
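A toy sketch of the surrogate idea only (not SPM's exact algorithm; the data and variable names are invented): among the other variables, pick the split that agrees most often with the primary split, and use it to route records whose primary variable is missing:

    import numpy as np

    rng = np.random.default_rng(0)
    age = rng.uniform(1, 365, 500)
    cement = rng.uniform(100, 500, 500)
    primary = age <= 21                  # left/right labels from the primary split

    def best_surrogate(z, primary):
        # threshold on z whose left/right assignment agrees most often with
        # the primary split (checking both orientations)
        scored = [(max(np.mean((z <= c) == primary), np.mean((z > c) == primary)), c)
                  for c in np.unique(z)]
        return max(scored)

    score, cut = best_surrogate(cement, primary)
    print(f"surrogate: CEMENT <= {cut:.1f}, agreement {score:.2%}")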
35. CART: Outliers in the Target Variable
Two types of outliers are
1. Outliers in the target variable (i.e. "Y")
2. Outliers in the predictor variables (i.e. "X")
CART is more sensitive to outliers with
respect to the target variable
1. More severe in a regression context
than a classification context
2. CART may treat target variable outliers
by isolating them in small terminal
nodes which can limit their effect
Reference: Pages 197-200 and 253 in
Breiman, Friedman, Olshen, and Stone (1984)
[Figure: the one-split tree and its geometry in (X1, X2, Y) space. Here the target
outliers (Y = 79.99 and 61.89) are isolated in Terminal Node 1 (STD = 9.049,
Avg = 70.937, N = 2), while Terminal Node 2 holds the remaining records
(Avg = 42.150, N = 14), limiting the outliers' effect]
36. CART: Outliers in the Predictor Variables
CART is more robust to outliers in the predictor variables, partly due to the nature of the splitting process
Reference: Pages 197-200 and 253 in Breiman, Friedman, Olshen, and Stone (1984)
[Figure: the same example with an added predictor-variable outlier (N = 17). The tree
still splits at X1 <= 177.00: Terminal Node 1 becomes STD = 8.532, Avg = 73.957, N = 3
and Terminal Node 2 is essentially unchanged (Avg = 42.149, N = 14), compared with the
original tree (Terminal Node 1: Avg = 70.937, N = 2; Terminal Node 2: Avg = 42.150, N = 14)]
37. CART: Monotonic Transformations of Variables
A monotonic transformation is a transformation that does not change the order of a variable's values.
– CART, unlike linear regression, is not affected by such transformations, so if a transformation does not change the order of a variable's values
then you do not need to worry about adding it to a CART model
– Example: Our best first split in the example tree was X1 <= 177. What happens if we square X1? The split point value
changes, but nothing else does, including the predicted values. This happens because the same Y values fall into the
same partition (i.e. their order has not changed after we squared and sorted X1)
Original data, split at X1 <= 177 (same partitions as before):
Partition 1: the two rows with X1 = 162 (Y = 79.99, 61.89)
Partition 2: the fourteen rows with X1 = 192 or 228
[Tree diagram: root splits at X1 <= 177.00; Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2;
Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14]
With X1 squared (X1_SQ), the values become 26,244 (was 162), 36,864 (was 192), and
51,984 (was 228). Splitting at X1_SQ <= 31554 produces exactly the same two partitions:
Partition 1: the two rows with X1_SQ = 26,244
Partition 2: the fourteen rows with X1_SQ = 36,864 or 51,984
[Tree diagram: root splits at X1_SQ <= 31554.00; Terminal Node 1: STD = 9.049, Avg = 70.937, N = 2;
Terminal Node 2: STD = 5.700, Avg = 42.150, N = 14; identical statistics to the original tree]
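The invariance claim is easy to check with any CART-style implementation; a sketch with scikit-learn trees on synthetic data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(100, 300, 200).reshape(-1, 1)
    y = np.where(x[:, 0] <= 177, 70, 42) + rng.normal(0, 5, 200)

    t1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)
    t2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x**2, y)
    print(np.allclose(t1.predict(x), t2.predict(x**2)))   # True: same partitions,
                                                          # only the printed split
                                                          # point changes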
38. Where are we?
✓ CART Splitting Process
✓ CART Pruning
✓ Advantages of CART (this is what allows you to build
models with dirty data)
❑ Interpreting CART Output
❑ Applied Example using CART
❑ Random Forest Section
39. CART: Relative Error
Relative error: used to determine the optimal model complexity for CART models
GOOD: Relative error values close to zero mean that CART is doing a better job than predicting only the
overall average (or median) for all records in the data
BAD: A relative error value equal to one means that CART is no better than predicting the overall
average (or median) of the target variable for every record. Note: the relative error can be greater than one,
which is especially bad.
The relative error can be computed for both least squares, $LS = \sum_i (y_i - \hat{y}_i)^2$,
and least absolute deviation, $LAD = \sum_i |y_i - \hat{y}_i|$.
$\text{Relative Error} = \frac{\text{CART model error using the method (LS or LAD)}}{\text{Error for predicting the overall average (or median) for all records}}$
[Screenshot: SPM reports a LAD relative error of 0.129 for this model]
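A minimal sketch of the LS version of this ratio in Python (numpy and scikit-learn assumed; the 0.129 above comes from SPM, not from this code):

    import numpy as np
    from sklearn.metrics import mean_squared_error

    def relative_error(y_true, y_pred):
        baseline = np.full_like(y_true, y_true.mean())   # predict the overall average
        return mean_squared_error(y_true, y_pred) / mean_squared_error(y_true, baseline)

    y_true = np.array([23.0, 41.0, 37.0, 57.0, 30.0])
    y_pred = np.array([25.0, 40.0, 36.0, 55.0, 31.0])
    print(relative_error(y_true, y_pred))   # close to 0 -> much better than the mean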
40. CART: Variable Importance
CART Variable Importance: sum each
variable's split improvement score across the
splits in the tree. The importance score for a
variable is increased in two ways:
1) When the variable is actually used
to split a node
2) When the variable is used as the
surrogate split (i.e. the backup
splitting variable when the primary
splitting variable has a missing
value)
"Consider Only Primary Splitters" (green
rectangle on the right) removes the
surrogate splitting variables from the
variable importance calculation
"Discount Surrogates" allows you to
discount surrogates in a more specific
manner
42. CART Performance
Method                                          Test MSE
Linear Regression                               109.04
Linear Regression with interactions             67.35
Min 1 SE CART (default settings)                65.05
CART (default settings)                         55.99
Random Forest (default settings)                –
Improved Random Forest using an SPM Automate    –

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
***More information
about the 1 Standard
Error Rule for CART
can be found in the
appendix
43. Where are we?
✓ CART Splitting Process
✓ CART Pruning
✓ Advantages of CART (this is what allows you to build
models with dirty data)
✓ Interpreting CART Output
✓ Applied Example using CART
❑ Random Forest Section
44. Introduction to Random Forests
Main Idea: fit multiple CART trees to independent "bootstrap"
samples of the data and then combine the predictions
Leo Breiman, one of the co-creators of CART,
also created Random Forests and published a
paper on this method in 2001
Our RandomForests® software was developed in close consultation
with Breiman himself
45. What is a bootstrap sample?
A bootstrap sample is a random sample drawn with replacement
Steps:
1. Randomly select an observation from the original data
2. "Write it down"
3. "Put it back" (i.e. any observation can be selected more than once)
Repeat steps 1-3 N times, where N is the number of observations in the original sample
FINAL RESULT: One "bootstrap sample" with N observations (see the sketch at the end of this slide)
[Figure: an "Original Data" table (columns Y, X1, X2) next to a "Bootstrap Sample" table of
the same size; some original rows, e.g. (0, 37, 1), appear multiple times in the bootstrap
sample while others do not appear at all]
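A sketch of steps 1-3 in Python with numpy (synthetic stand-in data):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(10, 3))                    # stand-in for Y, X1, X2
    idx = rng.choice(len(data), size=len(data), replace=True)
    bootstrap = data[idx]                              # some rows repeat, some are absent
    print(sorted(idx))                                 # duplicated indices are visible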
46. Random Forests: The Algorithm
1. Draw a bootstrap sample from the original data
2. Fit a large, unpruned CART tree to this bootstrap sample
   – At each split in the tree consider only k randomly selected variables instead of all of them
3. Repeat steps 1-2 at least 200 times (Bootstrap 1, Bootstrap 2, ..., Bootstrap 200, one tree per sample)
Predict a new record: run the record down each tree, each time computing a prediction,
then take the average of the individual predictions
Example with 200 trees: Tree 1: 10.5, Tree 2: 9.8, ..., Tree 199: 10.73, Tree 200: 12
Final prediction for the new record: $(10.5 + 9.8 + \dots + 10.73 + 12)/200$
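A hedged sketch of these steps using scikit-learn trees as the base learners (a rough open-source analogue, not the RandomForests® software; data is synthetic):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    y = X[:, 0]**2 + X[:, 1] + rng.normal(0, 0.5, 500)

    B, k, trees = 200, 3, []
    for _ in range(B):
        idx = rng.choice(len(X), size=len(X), replace=True)   # 1. bootstrap sample
        tree = DecisionTreeRegressor(max_features=k)          # 2. large tree; only k
        trees.append(tree.fit(X[idx], y[idx]))                #    variables per split

    x_new = rng.normal(size=(1, 8))
    print(np.mean([t.predict(x_new)[0] for t in trees]))      # 3. average to predict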
47. CART and Random Forests
When you build a Random Forest model just keep this picture in
the back of your mind:
$\text{Forest prediction} = \frac{1}{B}\left[\text{Tree}_1 + \text{Tree}_2 + \text{Tree}_3 + \dots + \text{Tree}_B\right]$
The reason is that a Random Forest is really just an average of
CART trees constructed on bootstrap samples of the original data
48. CART and Random Forests
Random Forests generally have superior predictive performance versus CART trees
because Random Forests have lower variance than a single CART tree
Since Random Forests are a combination of CART trees, they inherit many of CART's
properties:
Automatic
Variable selection
Variable interaction detection
Nonlinear relationship detection
Missing value handling
Outlier handling
Modeling of local effects
Invariant to monotone transformations of predictors
One drawback is that a Random Forest is not as interpretable as a single CART tree
49. Random Forests: Tuning Parameters
The performance of a Random Forest is
dependent upon the values of certain model
parameters
Two of these parameters are
1. Number of trees
2. Random number of variables chosen at each split
50. Random Forests: Number of Trees
Number of trees
Default: 200
The number of trees should be large enough so that the model error no
longer meaningfully declines as the number of trees increases
(experimentation will be required)
In Random Forests the optimal number of trees tends to be the
maximum value allotted (due to the Law of Large Numbers)
[Figures: error curves for the default setting of 200 trees and for 400 trees]
There is not much of a difference between the error for a forest with
200 trees and one with 400 trees, so, at least for this dataset, a
larger number of trees will not improve the model meaningfully
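A sketch of this check with scikit-learn's RandomForestRegressor on synthetic data (in SPM you would read it off the error curves instead):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 8))
    y = X[:, 0]**2 + X[:, 1] + rng.normal(0, 0.5, 800)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for n in (200, 400):   # does doubling the trees still help?
        rf = RandomForestRegressor(n_estimators=n, random_state=0).fit(X_tr, y_tr)
        print(n, mean_squared_error(y_te, rf.predict(X_te)))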
51. Random Forests: Tuning Parameters
The performance of a Random Forest is
dependent upon the values of certain model
parameters
Two of these parameters are
1. Number of trees
2. Random number of variables chosen at each
split
52. Random Forest Parameters: Random Variable Subset Size
Random number of variables k chosen
at each split in each tree in the forest
Default: k = 3
Experimentation will be required to
find the optimal value, and this can be
done using Automate RFNPREDS
Automate RFNPREDS: automatically
builds multiple Random Forests; each
time the forest is the same except
that the number of randomly selected
variables at each split in each CART
tree changes
This allows us to conveniently
determine the optimal number
of variables to randomly select
at each split in each tree
This output tells you that the optimal number of
randomly selected variables at each split in each
tree in the forest is 5.
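A rough analogue of Automate RFNPREDS with scikit-learn (synthetic data; max_features plays the role of k):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 8))
    y = X[:, 0]**2 + X[:, 1] + rng.normal(0, 0.5, 800)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # refit the same forest while varying only k, then keep the best k
    errs = {k: mean_squared_error(
                y_te,
                RandomForestRegressor(n_estimators=200, max_features=k,
                                      random_state=0).fit(X_tr, y_tr).predict(X_te))
            for k in range(1, 9)}
    print(min(errs, key=errs.get), errs)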
53. Interpreting a Random Forest
Since a Random Forest is a collection of hundreds or
even thousands of CART trees, the simple
interpretation is lost because we now have hundreds of
trees and are averaging the predictions
One method used to interpret a Random Forest is
variable importance
54. Random Forest for Regression: Variable Importance
CART Variable Importance: sum each variable's split
improvement score across the splits in the tree
The importance score for a variable is increased in two ways: 1)
when the variable is actually used to split a node and 2) when the
variable is used as the surrogate split (i.e. the backup splitting variable
when the primary splitting variable has a missing value)
Random Forest Variable Importance for Regression:
1. Compute a score for every split the variable generates, then sum the
scores across all splits made
– Relative Importance: divide all variable importance scores by the
maximum variable importance score (i.e. the most important variable
has a relative importance value of 100)
Note: For classification models, the preferred method is the random permutation method (see appendix for more details)
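A sketch of the relative-importance rescaling with scikit-learn, whose impurity-based importances are an analogue of the split-improvement sums described above (no surrogate credit):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = 3*X[:, 0] + X[:, 1]**2 + rng.normal(0, 0.5, 500)

    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    rel = 100 * rf.feature_importances_ / rf.feature_importances_.max()
    print(np.round(rel, 1))   # relative importance; most important variable = 100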
56. Random Forests: Model Performance
Method                                          Test MSE
Linear Regression                               109.04
Linear Regression with interactions             67.35
Min 1 SE CART (default settings)                65.05
CART (default settings)                         55.99
Random Forest (default settings)                37.70
Improved Random Forest using an SPM Automate    36.02

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
57. Conclusion
CART produces an interpretable model that is more resistant to
outliers, predicts future data well, and automatically handles
1. Variable interactions
2. Missing values
3. Nonlinear relationships
4. Local effects
Random Forests are fundamentally a combination of individual
CART trees and thus inherit all of the advantages of CART above
(except the nice interpretation)
A Random Forest is generally superior to a single CART tree in terms
of predictive accuracy
58. Next in the seriesโฆ
Improve Your Regression with TreeNet® Gradient Boosting
59. Try CART and Random Forest
Download SPM® 8 now to start building CART and
Random Forest models on your data
We will be more than happy to personally help you
if you have any questions or need assistance
My Email: charrison@salford-systems.com
Support Email: support@salford-systems.com
The appendix follows this slide
61. 1SE Rule in SPM
[Figures: the optimal tree (upper left) and the 1 Standard Error Rule tree (upper right)]
1SE Rule Tree: the smallest tree whose error is within one standard error of the minimum error
Figures: Upper Left: the optimal tree has 188 terminal nodes and a relative error of .177; Upper Right: the 1SE
tree has 85 terminal nodes and a relative error of .209
Smaller trees are preferred because they are less likely to overfit the data (i.e. the 1SE tree in this case is competitive
in terms of accuracy and is much less complex) and they are easier to interpret
$\text{Relative Error} = \frac{\text{CART model error using the method (LS or LAD)}}{\text{Error for predicting the overall average (or median if using LAD) for all records}}$
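A hedged sketch of the 1SE selection with cross-validation in scikit-learn (an analogue of SPM's built-in computation; data is synthetic):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 5))
    y = X[:, 0]**2 + 2*X[:, 1] + rng.normal(0, 0.5, 600)

    full = DecisionTreeRegressor(random_state=0).fit(X, y)
    alphas = full.cost_complexity_pruning_path(X, y).ccp_alphas[::5]  # thinned grid

    stats = []
    for a in alphas:   # CV error and its standard error for each candidate subtree
        scores = -cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                                  X, y, cv=5, scoring="neg_mean_squared_error")
        stats.append((a, scores.mean(), scores.std() / np.sqrt(len(scores))))

    best_mean = min(m for _, m, _ in stats)
    limit = best_mean + next(se for _, m, se in stats if m == best_mean)
    # largest alpha (hence smallest tree) whose CV error is within one SE of the best
    print(max(a for a, m, _ in stats if m <= limit))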