1.
Tree World.
By Leonardo Auslender.
Copyright 2019.
Leonardo ‘dot’ auslender ‘at’ gmail ‘dot’ com
2.
Contents
Varieties of trees.
CART algorithm.
Tree variable selection.
Tree pruning.
Tree variable importance.
Tree model diagnostics.
Sections marked with *** can be skipped at first reading.
3.
Varieties of Tree Methodologies.
CART
Tree (S+)
AID
THAID
CHAID
ID3
C4.5
C5.0
We’ll focus on the CART methodology.
4.
Basic References.
Breiman L. et al, 1984.
Quinlan J. (1993).
“Easy reading”: Auslender L. (1998, 1999, 2000a, 2001).
Bayesian perspective: Chipman et al. (1998).
Many, many other references.
5.
Basic CART Algorithm: binary dependent
variable or target (0,1): Classification Trees.
[Figure: distribution of ‘0’s and ‘1’s along the range of continuous variable A. At the splitting point, the original 50% mix of the dependent variable becomes, e.g., 70% vs. 20% in the two resulting nodes.]
With a continuous dependent variable, the splitting criterion is the decrease in variance from root to child nodes: Regression Trees.
6.
Divide and Conquer: recursive
partitioning.
[Figure: root node, n = 5,000 with 10% event rate, split on Debits < 19; ‘yes’ branch n = 3,350 with 5% event rate, ‘no’ branch n = 1,650 with 21% event rate.]
7.
Ideal SAS code to find splits in the binary case (for those who dare):
proc summary data = …. nway;
   class (all independent vars);
   var depvar;
   output out = ….. sum= ;
run;
For large data sets (large N, large p),
hardware and software constraints may
prevent completion.
8.
Fitted Decision Tree: Interpretation
and structure.
[Figure: fitted tree. VAR A splits at 19 (< 19: 5% event node); the ≥ 19 branch splits on VAR B at 52 (> 52: 21%); the 0–52 branch splits on VAR C (values 0, 1: 45%; > 1: 25%).]
9.
Cultivation of Trees.
• Split Search
– Which splits are to be considered?
• Splitting Criterion
– Which split is best?
• Stopping Rule
– When should splitting stop?
• Pruning Rule
– Should some branches be lopped off?
10.
Splitting criteria: Gini, twoing, misclassification, entropy, chi-square, etc.
A) Minimize the Gini impurity criterion (favors node homogeneity).
B) Maximize the twoing criterion (favors class separation).
Empirical results: for binary dependent variables, Gini and twoing are
equivalent. For trinomial targets, Gini provides more accurate trees; beyond
three categories, twoing performs better.
Gini impurity: $i(t) = 1 - \sum_{k=1}^{K} p(k|t)^{2}$, where $p(k|t)$ is the conditional probability of class $k$ in node $t$.

Twoing criterion: $i(t) = \frac{P_l P_r}{4}\left[\sum_{k=1}^{K} \lvert p(k|t_l) - p(k|t_r)\rvert\right]^{2}$, where $t_l$ and $t_r$ are the left and right nodes, respectively, and $P_l$, $P_r$ their shares of observations.
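For the binary case the Gini impurity reduces to 2·p1·(1 − p1), so a candidate split can be scored from node event rates alone. Below is a minimal SAS sketch, assuming a training set TRAIN with 0/1 target DEPVAR and a hypothetical cutpoint of 19 on a variable DEBITS (all names illustrative, not from the slides):

proc sql;
   /* event rate and Gini impurity within each child node */
   create table node_gini as
   select (debits < 19) as goes_left,
          count(*)      as n,
          mean(depvar)  as p1,
          2 * calculated p1 * (1 - calculated p1) as gini
   from train
   group by 1;

   /* impurity of the split = node Ginis weighted by node shares */
   select sum(n * gini) / sum(n) as split_gini
   from node_gini;
quit;

The candidate variable/cutpoint with the lowest weighted impurity wins; that comparison is what drives the No_claims vs. Dr. Visits choice on the next slide.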
11.
Choosing between No_claims and Dr. Visits: No_claims yields the lower impurity (0.237), and its split ‘at or below 0’ is chosen. The Dr. Visits impurity is 0.280.
14.
Tree Prediction.
Let {R1, …, RJ} be the J disjoint regions (final nodes).
Classification:
Y ∈ {c1, c2, …, cK}, i.e., Y has K categories ➔ node class proportions {F1, …, FK}.
T(X) = arg max (F1, …, FK) (the modal category is the predicted value).
Regression:
Prediction rule: obs X ∈ Rj ➔ T(X) = avg(yi) over Rj.
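A minimal scoring sketch of the regression rule, with a hypothetical cutpoint and node means taken from training (data set and variable names are illustrative):

data scored;
   set test;                         /* observations to predict   */
   if debits < 19 then t_x = 0.05;   /* avg(y) in final node R_1  */
   else                t_x = 0.21;   /* avg(y) in final node R_2  */
run;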
15.
Benefits of Trees.
• Interpretability: tree-structured presentation, easy to conceptualize but crowded for large trees.
• Mixed measurement scales:
– Nominal, ordinal, interval variables.
– Regression trees for continuous target variables.
• Robustness: outliers just become additional possible split values.
• Missing values: treated as one more possible split value.
• Automatic variable selection, and even ‘coefficients’ (i.e., splitting points), because a splitter can be understood as a selected variable, though not in the linear-model sense.
16.
…Benefits.
• Automatically detects interactions (as in AID) through its hierarchical conditioning search, i.e., the level in the hierarchy is all-important.
• Invariant under monotonic transformations of the inputs: all that matters is the ranking of the values.
[Figure: fitted probability as a multivariate step function of two inputs.]
17.
Drawbacks of Trees.
• Unstable: small perturbations in the data can lead to big changes in the tree, because splitting points can change.
• Linear structures are approximated only in very rough form.
• Applications may require that rule descriptions for different categories not share the same attributes (e.g., in finance, a splitter may be allowed to appear just once).
18.
Drawbacks of Trees (cont.).
• Tend to over-fit ➔ overly optimistic accuracy (even when pruned).
• Large trees are very difficult to interpret.
• Tree size is conditioned by data set size.
• No valid inferential procedures at present (does it matter?).
• Greedy search algorithm (one variable at a time, one step ahead).
• Difficulty in accepting the final fit, especially for data near the boundaries.
• Difficulties when the data contain many missing values (but other methods could be far worse in this case).
19.
/* PROGRAM ALGOR8.PGM WITH 8 FINAL NODES*/
/* METHOD MISSCL ALACART TEST */
RETAIN ROOT 1;
IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE <= 12
THEN DO;
NODE = '4_1 ';
PRED = 0 ;
/* % NODE IMPURITY = 0.0399 ; */
/* BRANCH # = 1 ; */
/* NODE FREQ = 81 ; */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE > 12
THEN DO;
NODE = '4_2 ';
PRED = 1 ;
/* % NODE IMPURITY = 0.4478 ; */
/* BRANCH # = 2 ; */
/* NODE FREQ = 212 ; */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE > 90.36
THEN DO;
NODE = '3_2 ';
PRED = 0 ;
END;
/* ... REMAINING NODES OF THE 8-FINAL-NODE TREE OMITTED ... */
Scoring recipe: example of the scoring code generated by tree programs (here, ALACART).
23.
Tree Pruning.
The trained tree can be quite large and attain a seemingly low overall
misclassification rate due to over-fitting. Pruning (Breiman et al., 1984)
aims at remedying this problem.
It starts from the tree originally created and selectively recombines nodes,
obtaining a decreasing sequence of sub-trees from the bottom up.
The decision as to which final nodes to recombine depends on comparing the
loss in accuracy from not splitting an intermediate node with the number of
final nodes that the split generates.
The comparison is made across all possible intermediate-node splits, and the
‘minimal cost-complexity’ loss in accuracy is the pruning rule (formalized below).
The sequence of sub-trees generated ends at the root node. The decision as to
which sub-tree to use is based on one of two methods: 1) cross-validation, or
2) a test data set.
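In Breiman et al.’s (1984) notation, for a sub-tree $T$ with $|\tilde{T}|$ final nodes and resubstitution error $R(T)$, pruning minimizes the cost-complexity measure, and the ‘weakest link’ recombined at each step is the internal node $t$ with the smallest $g(t)$:

$$R_{\alpha}(T) = R(T) + \alpha\,|\tilde{T}|, \qquad g(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1},$$

where $T_t$ is the branch rooted at $t$; increasing $\alpha$ traces out the decreasing sequence of sub-trees described above.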
24.
Tree Pruning.
1) Cross-validation.
Preferred when the original data set is not ‘large’. ‘v’ samples, stratified
on the dependent variable, are created without replacement. From them, create
‘v’ training data sets, each containing (v − 1) of the samples, and ‘v’ test
data sets, each consisting of the left-out sample. ‘v’ maximal trees are
trained on the ‘v’ training sets and pruned.
For instance, let v = 10 and obtain 10 samples from original data set
without replacement. Then from the 10 samples, create 10 additional
data sets combining 9 of the 10 samples, and skipping a different one
each time. The left out sample is used as test data. Thus we obtain 10
training and 10 test samples. Create 10 maximal trees and prune them.
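A minimal SAS sketch of the fold construction, assuming a data set TRAIN with target DEPVAR (names illustrative); PROC SURVEYSELECT’s GROUPS= option assigns each observation to one of v random groups within each stratum:

proc surveyselect data = train out = folds groups = 10 seed = 2019;
   strata depvar;   /* stratify the folds on the dependent variable */
run;

/* FOLDS carries GroupID = 1..10: training set i = GroupID ne i, */
/* test set i  = GroupID eq i.                                   */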
25.
Tree Pruning.
2) Test data set.
The test data set method is preferred when the size of the data set does not
constrain the estimation process. Split the original data set into training
and test subsets.
Once the maximal tree and the sequence of pruned sub-trees are obtained,
‘score’ the different sub-trees with the test data set and obtain the
corresponding misclassification rates.
Choose the sub-tree that minimizes the misclassification rate. While this
rate decreases with the number of final nodes at the tree-development stage,
on the test data it typically plateaus at some number of final nodes smaller
than the maximal number, as sketched below.
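A one-step sketch of that selection, assuming a summary table SUBTREE_RATES with one row per pruned sub-tree and columns N_FINAL_NODES and MISCL_RATE (hypothetical names):

proc sql;
   /* keep the sub-tree size(s) with the lowest test-set misclassification */
   select n_final_nodes, miscl_rate
   from subtree_rates
   having miscl_rate = min(miscl_rate);
quit;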
26.
Tree Pruning.
The test data sets are then used to obtain misclassification rates for each
pruning subsequence. Index each pruning subsequence and its corresponding
misclassification rate by the number of final nodes, and obtain an array of
misclassification rates by pruned sub-tree. Choose the tree size that
minimizes the overall misclassification rate.
The final tree is taken from the original pruning sequence, derived with the
entire sample, at the number of final nodes just described.
27.
Variable Importance
Variable importance can be defined in many ways.
It can be considered a measure of the actual splitting capability, or of the
actual and potential splitting capability, of all variables.
By ‘actual’ we mean variables that were used to create splits; by ‘potential’
we mean variables that mimic the primary splitter, i.e., surrogates. It
involves calculating, for each primary splitter and each surrogate, the
improvement in the Gini (or entropy, or chi-square) index over all internal
nodes, weighted by the size of the node.
The final result is scaled so that the maximum value is 1.00.
$$\mathrm{Importance}(x_j) = \sum_{i=1}^{I} \frac{N_i}{N}\,\Delta\mathrm{Gini}_i(x_j),$$

where the sum runs over the internal nodes $i = 1, \dots, I$, $N_i$ is the size of node $i$, $N$ the total sample size, and $\Delta\mathrm{Gini}_i(x_j)$ the improvement in Gini attributable to $x_j$ at node $i$.
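A minimal sketch of the aggregation and rescaling, assuming a table NODE_IMPROVE with one row per (internal node, variable) pair and columns VARIABLE, NODE_N, and GINI_IMPROVE (all hypothetical names):

proc sql;
   /* node-size-weighted sum of Gini improvements per variable */
   create table raw_imp as
   select variable, sum(node_n * gini_improve) as raw
   from node_improve
   group by variable;

   /* rescale so the most important variable scores 1.00 */
   select max(raw) into :max_raw from raw_imp;

   create table importance as
   select variable, raw / &max_raw as importance
   from raw_imp
   order by importance desc;
quit;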
43.
[Figure: precision and classification diagnostics on the training data; the VALidation data shows a similar pattern.]
44.
Comparing gains-chart information with precision-recall.
The gains chart provides information on the cumulative # of events per
descending percentile/bin; these bins contain a fixed number of observations.
Precision-recall is instead at the probability level, not at the bin level,
and thus the # of observations along the curve is not uniform. Selecting a
cutoff point from a gains chart therefore invariably selects from within a
range of probabilities, while selecting from precision-recall selects a
specific probability point, as the sketch below illustrates.
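A minimal SAS sketch of the point-level construction, assuming a scored data set SCORED with 0/1 target DEPVAR and predicted probability P1 (hypothetical names); after sorting by descending probability, each observation contributes one precision-recall point:

proc sql noprint;
   select sum(depvar) into :tot_events from scored;   /* total events */
quit;

proc sort data = scored;
   by descending p1;
run;

data pr_curve;
   set scored;
   cum_events + depvar;                   /* running event count       */
   precision = cum_events / _n_;          /* events among top _n_ obs  */
   recall    = cum_events / &tot_events;  /* share of all events found */
run;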
45.
References
Auslender L. (1998): Alacart, poor man’s classification trees, NESUG.
Breiman L., Friedman J., Olshen R., Stone C. (1984): Classification and Regression Trees, Wadsworth.
Chipman H., George E., McCulloch R. (2010): BART: Bayesian additive regression trees, The Annals of Applied Statistics.
Friedman J. (2001): Greedy function approximation: a gradient boosting machine, Annals of Statistics, 29, 1189–1232. doi:10.1214/aos/1013203451.
Quinlan J. Ross (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.