1. Conditional Trees
or
Unbiased recursive partitioning: A conditional inference framework
Christoph Molnar
Supervisor: Stephanie Möst
Department of Statistics, LMU
18 December 2012
2. Overview
Introduction and Motivation
Algorithm for unbiased trees
Conditional inference with permutation tests
Examples
Properties
Summary
3. CART trees
Model: Y = f(X)
Structure of decision trees
Recursive partitioning of covariable space X
Split optimizes a criterion (Gini, information gain, sum of squares) depending on the scale of Y
Split point search: exhaustive search procedure
Avoid overfitting: Early stopping or pruning
Usage: prediction and explanation
Other tree types: ID3, C4.5, CHAID, ...
4. What are conditional trees?
Special kind of trees
Recursive partitioning with binary splits and early stopping
Constant models in terminal nodes
Variable selection, early stopping and split point search based
on conditional inference
Uses permutation tests for inference
Solves problems of CART trees
5. Why conditional trees?
Helps to overcome problems of trees:
overfitting (can be solved with other techniques as well)
Selection bias towards covariables with many possible splits (e.g. numeric or multi-categorial covariables)
Difficult interpretation due to selection bias
Variable selection: No concept of statistical significance
Not all scales of Y and X covered (ID3, C4.5, ...)
6. Simulation: selection bias
Variable selection is unbiased ⇔ the probability of selecting a covariable that is independent of Y is the same for all independent covariables
Measurement scale of covariable shouldn’t play a role
Simulation illustrating the selection bias:
$Y \sim N(0, 1)$
$X_1 \sim M\left(n, \left(\tfrac{1}{2}, \tfrac{1}{2}\right)\right)$
$X_2 \sim M\left(n, \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)\right)$
$X_3 \sim M\left(n, \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right)\right)$
7. Simulation: results
Selection frequencies for the first split:
X1 : 0.128, X2 : 0.302, X3 : 0.556, none: 0.014
Strongly biased towards variables with many possible splits
Example of a tree:
[Tree plot: root split x3 ∈ {1, 2}; right child split on x2 ∈ {1, 3}; node means −0.19, −0.098, 0.36]
Overfitting! (Note: complexity parameter not cross-validated)
Desirable here: No split at all
Problem source: Exhaustive search through all variables and all
possible split points
Numeric/multi-categorial covariables have more split options ⇒ multiple comparison problem
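A minimal R sketch of this simulation (sample size n = 100 and 1000 replications are assumptions, not stated on the slides):

library("rpart")
set.seed(1)
first_split <- replicate(1000, {
  d <- data.frame(
    y  = rnorm(100),                                # Y ~ N(0, 1)
    x1 = factor(sample(1:2, 100, replace = TRUE)),  # 2 categories
    x2 = factor(sample(1:3, 100, replace = TRUE)),  # 3 categories
    x3 = factor(sample(1:4, 100, replace = TRUE))   # 4 categories
  )
  fit <- rpart(y ~ ., data = d)
  ## variable of the first split, or "none" if the tree is a stump
  if (nrow(fit$frame) == 1) "none" else as.character(fit$frame$var[1])
})
prop.table(table(first_split))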
8. Idea of conditional trees
Variable selection and search for split point ⇒ two steps
Embed all decisions into hypothesis tests
All tests with conditional inference (permutation tests)
9. Ctree algorithm
1 Stop criterion
Test global null hypothesis H0 of independence between Y and all Xj with
$H_0 = \bigcap_{j=1}^{m} H_0^j$ and $H_0^j: D(Y \mid X_j) = D(Y)$
If H0 not rejected ⇒ Stop
2 Select variable Xj∗ with strongest association
3 Search best split point for Xj∗ and partition the data
4 Repeat steps 1–3 for both of the new partitions
10. How can we test hypothesis of independence?
Parametric tests depend on distribution assumptions
Problem: Unknown conditional distribution
$D(Y \mid X) = D(Y \mid X_1, \ldots, X_m) = D(Y \mid f(X_1), \ldots, f(X_m))$
Need for a general framework that can handle arbitrary scales
Let the data speak ⇒ permutation tests!
12. Permutation tests: simple example
Possible treatments for disease: A or B
Numeric measurement (blood value)
Question: Different blood values between treatment A and B?
⇔ $\mu_A = \mu_B$?
Test statistic: $T_0 = \hat{\mu}_A - \hat{\mu}_B$
$H_0: \mu_A - \mu_B = 0$, $H_1: \mu_A - \mu_B \neq 0$
Distribution unknown ⇒ Permutation test
[Dot plot of the blood values y by treatment A/B]
$T_0 = \hat{\mu}_A - \hat{\mu}_B = 2.06 - 1.2 = 0.86$
13. Permute
Original data:
B B B B A A A A B A
0.5 0.9 1.1 1.2 1.5 1.9 2.0 2.1 2.3 2.8
One possible permutation:
B B B B A A A A B A
2.8 2.3 1.1 1.9 1.2 2.1 1.5 0.5 0.9 2.0
Permute the pairing of the labels (A and B) and the numeric measurements
Calculate test statistic T for each permutation
Do this with all possible permutations
Result: Distribution of test statistic conditioned on sample
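The example data as R vectors (a sketch; they are reused in the snippets below):

y   <- c(0.5, 0.9, 1.1, 1.2, 1.5, 1.9, 2.0, 2.1, 2.3, 2.8)
trt <- factor(c("B", "B", "B", "B", "A", "A", "A", "A", "B", "A"))
mean(y[trt == "A"]) - mean(y[trt == "B"])  # T0 = 2.06 - 1.2 = 0.86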
14. P-value and decision
$k = \#\left\{\text{permutations}: \; |\hat{\mu}_{A,\text{perm}} - \hat{\mu}_{B,\text{perm}}| > |\hat{\mu}_A - \hat{\mu}_B|\right\}$
$\text{p-value} = \frac{k}{\#\text{permutations}}$
p-value < α = 0.05? ⇒ If yes, H0 can be rejected
[Density of the permutation distribution of the test statistic (difference of means per treatment), original test statistic marked]
15. General algorithm for permutation tests
Requirement: Under H0 response and covariables are
exchangeable
Do the following:
1 Calculate test statistic $T_0$
2 Calculate test statistic T for all permutations of the pairs (Y, X)
3 Compute $n_{\text{extreme}}$: count the number of T which are more extreme than $T_0$
4 p-value: $p = \frac{n_{\text{extreme}}}{n_{\text{permutations}}}$
5 Reject $H_0$ if p < α, with significance level α
If the number of possible permutations is too big, draw random permutations in step 2 (Monte Carlo sampling)
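A minimal sketch of these steps in R for the two-sample example above (Monte Carlo version; the function name perm_test is illustrative):

perm_test <- function(y, trt, B = 9999) {
  t0 <- mean(y[trt == "A"]) - mean(y[trt == "B"])   # step 1: T0
  t_perm <- replicate(B, {                          # step 2: permute pairing
    y_s <- sample(y)
    mean(y_s[trt == "A"]) - mean(y_s[trt == "B"])
  })
  n_extreme <- sum(abs(t_perm) >= abs(t0))          # step 3: count extremes
  list(statistic = t0, p.value = n_extreme / B)     # step 4: p-value
}
perm_test(y, trt)  # y, trt as defined for the example above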
16. Framework by Strasser and Weber
General test statistic:
$T_j(L_n, w) = \text{vec}\left(\sum_{i=1}^{n} w_i \, g_j(X_{ji}) \, h(Y_i, (Y_1, \ldots, Y_n))^T\right) \in \mathbb{R}^{p_j q}$
h is called the influence function, gj is a transformation of Xj
Choose gj and h depending on the scale
It’s possible to calculate µ and Σ of T
Standardized test statistic: $c(t, \mu, \Sigma) = \max_{k=1,\ldots,pq} \frac{(t - \mu)_k}{\sqrt{(\Sigma)_{kk}}}$
Why so complex? ⇒ It covers all cases: multi-categorial X or Y, different scales
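This framework is implemented in the coin package (also by Hothorn et al.); a sketch on the blood-value example, assuming coin's formula interface:

library("coin")
d <- data.frame(y = y, trt = trt)  # example data from the permutation-test slides
## Monte Carlo approximation of the permutation distribution
independence_test(y ~ trt, data = d, distribution = "approximate")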
18. Ctree algorithm with permutation tests
1 Stop criterion
Test global null hypothesis H0 of independence between Y and all Xj with
$H_0 = \bigcap_{j=1}^{m} H_0^j$ and $H_0^j: D(Y \mid X_j) = D(Y)$ (permutation test for each Xj)
If H0 not rejected (no significant Xj) ⇒ Stop
2 Select variable Xj∗ with strongest association (smallest p-value)
3 Search best split point for Xj∗ (maximal test statistic c) and partition the data
4 Repeat steps 1–3 for both of the new partitions
19. Permutation tests for stop criterion
Choose influence function h for Y
Choose transformation function g for each Xj
Test each variable Xj separately for association with Y
($H_0^j: D(Y \mid X_j) = D(Y)$, i.e. variable Xj has no influence on Y)
Global $H_0 = \bigcap_{j=1}^{m} H_0^j$: no variable has influence on Y
Testing the global H0 involves multiple testing ⇒ adjust α (Bonferroni correction, ...)
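In the party package, α and the correction enter via ctree_control; a sketch (the bodyfat data is loaded on the R-code slide below; argument names as in party):

library("party")
## mincriterion = 1 - alpha: split only if 1 - (adjusted) p-value exceeds it
ct <- ctree(DEXfat ~ ., data = bodyfat,
            controls = ctree_control(testtype = "Bonferroni",
                                     mincriterion = 0.95))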
20. Permutation tests for variable selection
Choose variable with smallest p-value for split
Note: Comparing on the p-value scale removes the dependence on the covariable's scale and number of possible splits
21. Test statistic for best split point
Use test statistic instead of Gini/SSE for split point search
$T_j^A(L_n, w) = \text{vec}\left(\sum_{i=1}^{n} w_i \, I(X_{ji} \in A) \cdot h(Y_i, (Y_1, \ldots, Y_n))^T\right)$
Standardized test statistic: $c = \max_k \frac{(T_j^A - \mu)_k}{\sqrt{(\Sigma)_{kk}}}$
Measures discrepancy between $\{Y_i \mid X_{ji} \in A\}$ and $\{Y_i \mid X_{ji} \notin A\}$
Calculate c for all possible splits; Choose split point with
maximal c
Covers different scales of Y and X
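A toy sketch of this search in R for one numeric covariable and numeric Y (unweighted case with h = Y; µ and Σ follow from the formulas in the appendix):

split_statistic <- function(x, y) {
  n     <- length(y)
  y_bar <- mean(y)
  v_h   <- mean((y - y_bar)^2)                 # V(h | S), see appendix
  cand  <- sort(unique(x))
  cand  <- cand[-length(cand)]                 # candidate splits A = {x <= s}
  c_stat <- sapply(cand, function(s) {
    n_A    <- sum(x <= s)
    t_A    <- sum(y[x <= s])                   # T^A: sum of Y in left partition
    mu     <- n_A * y_bar                      # conditional mean of T^A
    sigma2 <- v_h * n_A * (n - n_A) / (n - 1)  # conditional variance
    abs(t_A - mu) / sqrt(sigma2)
  })
  cand[which.max(c_stat)]                      # split point with maximal c
}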
23. Bodyfat: example for continuous regression
Example: bodyfat data
Predict body fat with anthropometric measurements
Data: Measurements of 71 healthy women
Response Y : body fat measured by DXA (numeric)
Covariables X : different body measurements (numeric)
For example: waist circumference, breadth of the knee, ...
$h = Y_i$, $g = X_{ij}$
$T_j(L_n, w) = \sum_{i=1}^{n} w_i X_{ij} Y_i$
$c = \left|\frac{t - \mu}{\sigma}\right| \propto \left|\frac{\sum_{i:\text{node}} X_{ij} Y_i - n_{\text{node}} \bar{X}_j \bar{Y}}{\sqrt{\sum_{i:\text{node}} (Y_i - \bar{Y})^2 \cdot \sum_{i:\text{node}} (X_{ij} - \bar{X}_j)^2}}\right|$ (Pearson correlation coefficient)
24. Bodyfat: R-code
library("party")
library("rpart")
library("rpart.plot")
data(bodyfat, package = "mboost")
## conditional tree
cond_tree <- ctree(DEXfat ~ ., data = bodyfat)
## normal tree
classic_tree <- rpart(DEXfat ~ ., data = bodyfat)
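The fitted trees can then be inspected with the packages' plot methods (sketch):

plot(cond_tree)              # party's plot method for ctree objects
rpart.plot(classic_tree)     # plot the CART tree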
27. Glaucoma: example for classification
Predict Glaucoma (an eye disease) based on laser scanning measurements
Response Y : Binary, y ∈ {Glaucoma, normal}
Covariables X : Different volumes and areas of the eye (all
numeric)
$h = e_J(Y_i) = \begin{cases} (1, 0)^T & \text{Glaucoma} \\ (0, 1)^T & \text{normal} \end{cases}$
$g(X_{ij}) = X_{ij}$
$T_j(L_n, w) = \text{vec}\left(\sum_{i=1}^{n} w_i X_{ij} \, e_J(Y_i)^T\right) = \begin{pmatrix} n_{\text{Glaucoma}} \cdot \bar{X}_{j,\text{Glaucoma}} \\ n_{\text{normal}} \cdot \bar{X}_{j,\text{normal}} \end{pmatrix}$
$c \propto \max_{\text{group}} \; n_{\text{group}} \cdot (\bar{X}_{j,\text{group}} - \bar{X}_{j,\text{node}})$, group ∈ {Glaucoma, normal}
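A sketch of the corresponding R code (that the GlaucomaM data ships with the TH.data package is an assumption; its home package has changed over the years):

data("GlaucomaM", package = "TH.data")
## conditional tree
glaucoma_ctree <- ctree(Class ~ ., data = GlaucomaM)
## CART tree, plotted on the next slide
classic_tree <- rpart(Class ~ ., data = GlaucomaM)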
30. Glaucoma: CART tree
rpart.plot(classic_tree, cex = 1.5)
[CART tree: root split varg < 0.21; further splits mhcg ≥ 0.17, vars < 0.064, tms ≥ −0.066, eas < 0.45; leaves labeled glaucoma/normal]
31. Appendix: Examples of other scales
Y categorial, X categorial
$h = e_J(Y_i)$, $g = e_K(X_{ij})$
⇒ T is the vectorized contingency table of Xj and Y
[Mosaic plot of the contingency table of Y and Xj (3 × 3) with Pearson residuals; p-value = 0.009]
Y and Xj numeric, $h = rg(Y_i)$, $g = rg(X_{ij})$ (ranks) ⇒ Spearman's ρ
Flexible T for different situations: multivariate regression, ordinal regression, censored regression, ...
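The rank case, for instance, is directly available in coin (a sketch, reusing the bodyfat data; spearman_test is assumed to follow coin's usual interface):

library("coin")
## permutation test of Spearman's rho between two numeric variables
spearman_test(DEXfat ~ waistcirc, data = bodyfat,
              distribution = "approximate")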
32. Properties
Prediction accuracy: Not better than normal trees, but not
worse either
Computational considerations: Same speed as normal trees.
Two possible interpretations of significance level α:
1. Pre-specified nominal level of underlying association tests
2. Simple hyperparameter determining the tree size
Low α yields smaller trees
33. Summary conditional trees
Not heuristics, but non-parametric models with well-defined
theoretical background
Suitable for regression with arbitrary scales of Y and X
Unbiased variable selection
No overfitting
Conditional trees are structurally different from trees partitioned with exhaustive search procedures
34. Literature and Software
J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, 2001.
T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.
H. Strasser and C. Weber. On the asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8:220–250, 1999.
R-packages:
rpart: Recursive partitioning
rpart.plot: Plot function for rpart
party: A Laboratory for Recursive Partytioning
All available on CRAN
35. Appendix: Competitors
Other partitioning algorithms in this area:
CHAID: Nominal response, χ2 test, multiway splits, nominal
covariables
GUIDE: Continuous response only, p-value from χ2 test,
categorizes continuous covariables
QUEST: ANOVA F-Test for continuous response, χ2 test for
nominal, compare on p-scale ⇒ reduces selection bias
CRUISE: Multiway splits, discriminant analysis in each node,
unbiased variable selection
36. Appendix: Properties of test statistic T
$\mu_j = E(T_j(L_n, w) \mid S(L_n, w)) = \text{vec}\left(\left(\sum_{i=1}^{n} w_i \, g_j(X_{ji})\right) E(h \mid S(L_n, w))^T\right)$

$\Sigma_j = V(T_j(L_n, w) \mid S(L_n, w))$
$= \frac{w_\cdot}{w_\cdot - 1} \, V(h \mid S(L_n, w)) \otimes \left(\sum_i w_i \, g_j(X_{ji}) \otimes w_i \, g_j(X_{ji})^T\right)$
$\quad - \frac{1}{w_\cdot - 1} \, V(h \mid S(L_n, w)) \otimes \left(\sum_i w_i \, g_j(X_{ji})\right) \otimes \left(\sum_i w_i \, g_j(X_{ji})\right)^T$

with
$w_\cdot = \sum_{i=1}^{n} w_i$
$E(h \mid S(L_n, w)) = w_\cdot^{-1} \sum_i w_i \, h(Y_i, (Y_1, \ldots, Y_n)) \in \mathbb{R}^q$
$V(h \mid S(L_n, w)) = w_\cdot^{-1} \sum_i w_i \left(h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S(L_n, w))\right) \left(h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S(L_n, w))\right)^T$