Data Science - Part V - Decision Trees & Random Forests (Derek Kane)
This lecture provides an overview of decision tree machine learning algorithms and random forest ensemble techniques. The practical examples include diagnosing Type II diabetes and evaluating customer churn in the telecommunications industry.
The decision tree is one of the topics of Big Data analytics, a subject for 8th-semester CSE students. The book referred to is Data Analytics by Anil Maheshwari.
Basics of Decision Tree Learning. This slide deck includes the definition of a decision tree, a basic example, the basic construction of a decision tree, and a MATLAB example.
Decision Trees for Classification: A Machine Learning Algorithm (Palin Analytics)
Decision Trees in Machine Learning - The decision tree method is a commonly used data mining method for establishing classification systems based on several covariates or for developing prediction algorithms for a target variable.
Machine Learning Session 6 (Decision Trees, Random Forests) (Abhimanyu Dwivedi)
Concepts include the decision tree with examples; measures used for splitting in decision trees, such as the Gini index, entropy, and information gain; pros and cons; and validation. Also covered are the basics of random forests, with examples and uses.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Get to know in detail the terminologies of Random Forest, the types of algorithms used in the workflow, and their advantages and disadvantages relative to their predecessors.
Thanks for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work from home ("WFH"), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
While competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value by 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
4_1_Tree World.pdf
1. Tree World.
By Leonardo Auslender.
Copyright 2019.
Leonardo ‘dot’ auslender ‘at’ gmail ‘dot’ com
2. Contents
Varieties of trees.
CART algorithm.
Tree variable selection.
Tree pruning.
Tree variable importance.
Tree model diagnostics.
Sections marked with *** can be skipped at first reading.
3. Varieties of Tree Methodologies.
CART
Tree (S+)
AID
THAID
CHAID
ID3
C4.5
C5.0
We’ll focus on the CART methodology.
4. Basic References.
Breiman L. et al. (1984).
Quinlan J. (1993).
“Easy Reading”: Auslender L. (1998, 1999, 2000a, 2001).
Bayesian perspective: Chipman et al. (1998).
Many, many other references.
5. Basic CART Algorithm: binary dependent variable or target (0,1): Classification Trees.
[Figure: the range of continuous variable A is divided at a splitting point; the original mix of ‘0’s and ‘1’s of the dependent variable (50%) separates into purer regions (roughly 70% and 20%).]
With a continuous dependent variable, the criterion is the decrease in variance from root to nodes: Regression Trees.
6. Divide and Conquer: recursive partitioning.
[Figure: root node with n = 5,000 and a 10% event rate splits on Debits < 19 into child nodes with n = 3,350 (5% event rate) and n = 1,650 (21% event rate).]
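To make the recursive step concrete, here is a minimal Python sketch (not from the deck, whose code examples are in SAS) of the core CART operation: exhaustively searching one continuous variable for the split that minimizes the weighted Gini impurity of the two child nodes. The data are simulated to mimic the figure's split.

import numpy as np

def best_split(x, y):
    # Gini impurity of a 0/1 label vector.
    def gini(labels):
        if labels.size == 0:
            return 0.0
        p = labels.mean()
        return 1.0 - p ** 2 - (1.0 - p) ** 2
    best_t, best_imp = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        imp = (left.size * gini(left) + right.size * gini(right)) / y.size
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

rng = np.random.default_rng(0)
debits = rng.uniform(0, 40, size=5000)
y = (rng.uniform(size=5000) < np.where(debits < 19, 0.05, 0.21)).astype(int)
print(best_split(debits, y))  # the recovered threshold lands near 19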
7. Ideal SAS code to find splits (for those who dare).

proc summary data = …. nway;
   class (all independent vars);
   var depvar;
   output out = ….. sum = ;
run;

For large data sets (large N, large p), hardware and software constraints may prevent completion. (Binary case.)
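For comparison, a rough pandas equivalent of that summary step (illustrative only; the data and column names are invented): group by every independent variable at once, mirroring PROC SUMMARY with NWAY, and keep the cell count plus the sum of the 0/1 target (the number of events per cell).

import pandas as pd

# Toy data: two predictors and a 0/1 target (names are hypothetical).
df = pd.DataFrame({
    "var_a": ["x", "x", "y", "y", "y"],
    "var_b": [1, 2, 1, 1, 2],
    "depvar": [0, 1, 0, 1, 1],
})
# One row per observed combination of predictor levels, with the
# cell count and the number of events in each cell.
summary = (
    df.groupby(["var_a", "var_b"])["depvar"]
      .agg(n="size", events="sum")
      .reset_index()
)
print(summary)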
8. Fitted Decision Tree: Interpretation and structure.
[Figure: fitted tree. The root splits on VAR A at 19; further splits use VAR B (values 0,1 vs. > 1) and VAR C (0-52 vs. > 52), yielding final nodes with event rates of 5%, 21%, 25%, and 45%.]
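For readers who want to reproduce this kind of interpretation outside SAS, scikit-learn can dump a fitted tree as nested if/else rules (a sketch on synthetic data; the variable names are invented):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Prints the tree as nested if/else rules, one line per node,
# analogous to reading the slide's diagram from root to leaves.
print(export_text(tree, feature_names=["var_a", "var_b", "var_c"]))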
9. Cultivation of Trees.
• Split Search
  – Which splits are to be considered?
• Splitting Criterion
  – Which split is best?
• Stopping Rule
  – When should splitting stop?
• Pruning Rule
  – Should some branches be lopped off?
10. Splitting Criterion: Gini, twoing, misclassification, entropy, chi-square, etc.

A) Minimize the Gini impurity criterion (favors node homogeneity):

$$i(t) = 1 - \sum_{k=1}^{K} p(k|t)^{2}, \qquad p(k|t) = \text{cond. prob. of class } k \text{ in node } t.$$

B) Maximize the twoing impurity criterion (favors class separation):

$$\frac{P_{l}\,P_{r}}{4}\left[\sum_{k=1}^{K}\bigl|p(k|t_{l}) - p(k|t_{r})\bigr|\right]^{2},$$

where $t_{l}$ and $t_{r}$ are the left and right nodes, respectively, and $P_{l}$, $P_{r}$ are the proportions of observations sent to each.

Empirical results: for binary dependent variables, Gini and twoing are equivalent. For trinomial targets, Gini provides more accurate trees. Beyond three categories, twoing performs better.
11. Choosing between No_claims and Dr. Visits: No_claims yields the lower impurity (0.237), and a split at values at or below 0 is chosen. The Dr. Visits impurity is 0.280.
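A worked instance of the Gini formula (invented numbers, not the deck's data): a node holding 90% ‘0’s and 10% ‘1’s has impurity $i(t) = 1 - (0.9^{2} + 0.1^{2}) = 0.18$, while a 50/50 node attains the binary maximum $1 - (0.5^{2} + 0.5^{2}) = 0.5$. The split search picks the variable and cutpoint whose children minimize the weighted average of such impurities, which is exactly how No_claims (0.237) beats Dr. Visits (0.280) here.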
14. Tree Prediction.
Let there be $J$ disjoint regions (final nodes) $\{R_{1}, \ldots, R_{J}\}$.
Classification: $Y \in \{c_{1}, \ldots, c_{K}\}$, i.e., $Y$ has $K$ categories, with per-class node frequencies $\{F_{1}, \ldots, F_{K}\}$; then $T(X) = \arg\max_{k}(F_{1}, \ldots, F_{K})$ (the modal category is the predicted value).
Regression: prediction rule: for an observation $X \in R_{j}$, $T(X) = \operatorname{avg}(y_{j})$, the average of the target over region $R_{j}$.
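As a small invented illustration: if final node $R_{3}$ holds 40 training observations of class $c_{1}$ and 10 of $c_{2}$, every new observation falling in $R_{3}$ is classified as $c_{1}$; with a continuous target whose values in $R_{3}$ average 4.2, every such observation receives the prediction $T(X) = 4.2$.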
15. Benefits of Trees.
• Interpretability: tree-structured presentation, easy to conceptualize, but gets crowded with large trees.
• Mixed measurement scales
  – Nominal, ordinal, interval variables.
  – Regression trees for continuous target variables.
• Robustness: outliers just become additional possible split values.
• Missing values: treated as one more possible split value.
• Automatic variable selection, and even ‘coefficients’ (i.e., splitting points), because a splitter can be understood as a selected variable, though not in the linear-model sense.
16. …Benefits.
• Automatically detects interactions (AID) in a hierarchical conditioning search, i.e., hierarchy level is all-important.
• Invariance under monotonic transformations: all that matters is the ranking of values.
[Figure: a multivariate step function of probability over two inputs, illustrating the piecewise-constant surface a tree fits.]
17. Drawbacks of Trees.
• Unstable: small perturbations in the data can lead to big changes in trees, because splitting points can change.
• Linear structures are approximated only in very rough form.
• Applications may require that rule descriptions for different categories not share the same attributes (e.g., in finance, splitters may appear just once).
18. Drawbacks of Trees (cont.).
• Tend to over-fit, giving overly optimistic accuracy (even when pruned).
• Large trees are very difficult to interpret.
• Tree size is conditioned by data set size.
• No valid inferential procedures at present (does it matter?).
• Greedy search algorithm (one variable at a time, one step ahead).
• Difficulty in accepting the final fit, especially for data near boundaries.
• Difficulties when data contains a lot of missing values (though other methods can be far worse in this case).
19. Scoring Recipe: example of scoring output generated by TREE-like programs.

/* PROGRAM ALGOR8.PGM WITH 8 FINAL NODES */
/* METHOD MISSCL ALACART TEST */
RETAIN ROOT 1;
IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE <= 12
THEN DO;
   NODE = '4_1 ';
   PRED = 0;
   /* % NODE IMPURITY = 0.0399; BRANCH # = 1; NODE FREQ = 81 */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE <= 90.36 & CURRDUE > 12
THEN DO;
   NODE = '4_2 ';
   PRED = 1;
   /* % NODE IMPURITY = 0.4478; BRANCH # = 2; NODE FREQ = 212 */
END;
ELSE IF ROOT & CURRDUE <= 105.38 & PASTDUE > 90.36
THEN DO;
   NODE = '3_2 ';
   PRED = 0;
21. With the same data set, a partial picture of the tree found: example with the HMEQ data set.
23. Tree Pruning.
The trained tree can be quite large and can attain a seemingly low overall misclassification rate due to over-fitting. Pruning (Breiman et al., 1984) aims at remedying the fitting problem.
It starts from the tree originally created and selectively recombines nodes, obtaining a decreasing sequence of sub-trees from the bottom up. The decision as to which final nodes to recombine depends on comparing the loss in accuracy from not splitting an intermediate node against the number of final nodes that that split generates. The comparison is made across all possible intermediate-node splits, and ‘minimal cost-complexity’ loss in accuracy is the rule for pruning.
The sequence of sub-trees generated ends with the root node. The decision as to which tree among the sub-trees to use is based on one of two methods: 1) cross-validation, or 2) a test data set.
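In Breiman et al.'s notation, the ‘minimal cost-complexity’ rule referenced above minimizes a penalized error measure (a standard CART result, stated here for completeness):

$$R_{\alpha}(T) = R(T) + \alpha\,|\widetilde{T}|,$$

where $R(T)$ is the tree's misclassification cost, $|\widetilde{T}|$ is its number of final nodes, and $\alpha \ge 0$ prices complexity. Pruning repeatedly collapses the internal node whose subtree yields the smallest loss in accuracy per final node removed (the "weakest link"), generating the nested sequence of sub-trees described above.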
24. Tree Pruning
1) Cross-validation.
Preferred when the original data set is not ‘large’. ‘v’ samples, stratified on the dependent variable, are created without replacement. Create ‘v’ training data sets, each containing (v - 1) of the samples, and ‘v’ test data sets, each consisting of the left-out sample. ‘v’ maximal trees are trained on the ‘v’ training sets and pruned.
For instance, let v = 10 and obtain 10 samples from the original data set without replacement. Then from the 10 samples create 10 additional data sets, each combining 9 of the 10 samples and skipping a different one each time. The left-out sample is used as test data. Thus we obtain 10 training and 10 test samples. Create 10 maximal trees and prune them.
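A minimal scikit-learn sketch of this selection procedure on stand-in data (sklearn's ccp_alpha implements the same cost-complexity pruning, and cross_val_score stratifies folds on the target for classifiers):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# Enumerate the cost-complexity pruning sequence of the maximal tree:
# each alpha corresponds to one sub-tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Score each sub-tree with v = 10 cross-validation and keep the alpha
# (i.e., the sub-tree) with the best mean accuracy.
scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                    X, y, cv=10).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha,
                                    random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())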
25. Tree Pruning.
2) Test data set.
The test data set method is preferred when the size of the data set is not a constraint on the estimation process. Split the original data set into training and test subsets.
Once the maximal tree and the sequence of sub-trees due to pruning are obtained, ‘score’ the different sub-trees with the test data set and obtain the corresponding misclassification rates.
Choose the sub-tree that minimizes the misclassification rate. While this rate decreases with the number of final nodes during tree development, on the test data set it typically plateaus at some number of final nodes smaller than the maximal number of final nodes.
26. Tree Pruning
The test data sets are then used to obtain misclassification rates for each of the pruning subsequences. Index each pruning subsequence and its corresponding misclassification rate by the number of final nodes, and obtain an array of misclassification rates by pruned sub-tree. Choose the tree size that minimizes the overall misclassification rate.
The final tree is taken from the original pruning sequence (the tree derived from the entire sample) at the number of final nodes just described.
28. Variable Importance
Variable importance can be defined in many ways. It can be considered a measure of the actual splitting capability, or of the actual and potential splitting capability, of all variables. By actual we mean variables that were used to create splits, and by potential we mean variables which mimic the primary splitter, e.g., surrogates. It involves calculating, for each primary splitter and each surrogate, the improvement in the Gini or entropy index or the chi-square over all internal nodes, weighted by the size of the node. The final result is scaled so that the maximum value is 1.00.

$$\mathrm{Importance}(x_{j}) = \sum_{i=1}^{N} \Delta\mathrm{Gini}_{i}(x_{j}),$$

i.e., the improvement in Gini for variable $x_{j}$, summed over the $N$ internal nodes (weighted by node size) and then scaled so the maximum over variables is 1.00.
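For reference, scikit-learn exposes the analogous Gini importance as feature_importances_. Note two differences from the slide: sklearn normalizes importances to sum to 1 rather than scaling the maximum to 1, and it has no surrogate splits, so only actual splitters contribute. A sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Node-size-weighted total Gini improvement per variable, normalized.
for name, imp in zip([f"x{j}" for j in range(5)], tree.feature_importances_):
    print(f"{name}: {imp:.3f}")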
44. Precision + classification.
[Figure/table: precision and classification results; the note “Similar for VAL.” indicates the validation data set behaves similarly.]
45. Comparing gains-chart information with precision-recall.
The gains chart provides information on the cumulative number of events per descending percentile bin. These bins contain a fixed number of observations.
The precision-recall curve is instead at the probability level, not at the bin level, so the number of observations along the curve is not uniform. Thus, selecting a cutoff point from a gains chart invariably selects from within a range of probabilities, whereas selecting from the precision-recall curve selects a specific probability point.
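A sketch of the probability-level view using scikit-learn's precision_recall_curve (toy numbers): every distinct predicted probability becomes a candidate cutoff, unlike the fixed-size bins of a gains chart.

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])           # toy labels
p_hat = np.array([.1, .2, .35, .4, .6, .7, .75, .9])  # toy scores

precision, recall, thresholds = precision_recall_curve(y_true, p_hat)
# One (precision, recall) pair per distinct score threshold: the curve is
# indexed by probability cutoffs, not by equal-sized observation bins.
for t, p, r in zip(thresholds, precision, recall):
    print(f"cutoff {t:.2f}: precision {p:.2f}, recall {r:.2f}")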
46. References
Auslender, L. (1998): Alacart, Poor Man’s Classification Trees, NESUG.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984): Classification and Regression Trees, Wadsworth.
Chipman, H., George, E., McCulloch, R.: BART: Bayesian Additive Regression Trees, The Annals of Statistics.
Friedman, J. (2001): Greedy function approximation: a gradient boosting machine, Ann. Stat. 29, 1189-1232. doi:10.1214/aos/1013203451
Quinlan, J. Ross (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.