1. Conditional Trees
or
Unbiased recursive partitioning: A conditional inference framework
Christoph Molnar
Supervisor: Stephanie Möst
Department of Statistics, LMU
18 December 2012
2. Overview
Introduction and Motivation
Algorithm for unbiased trees
Conditional inference with permutation tests
Examples
Properties
Summary
3. CART trees
Model: Y = f(X)
Structure of decision trees
Recursive partitioning of covariable space X
Split optimizes a criterion (Gini, information gain, sum of squares) depending on the scale of Y
Split point search: exhaustive search procedure
Avoid overfitting: Early stopping or pruning
Usage: prediction and explanation
Other tree types: ID3, C4.5, CHAID, ...
4. What are conditional trees?
Special kind of trees
Recursive partitioning with binary splits and early stopping
Constant models in terminal nodes
Variable selection, early stopping and split point search based
on conditional inference
Uses permutation tests for inference
Solves problems of CART trees
5. Why conditional trees?
Helps to overcome problems of trees:
overfitting (can be solved with other techniques as well)
Selection bias towards covariables with many possible splits (e.g. numeric or multi-categorial covariables)
Difficult interpretation due to selection bias
Variable selection: No concept of statistical significance
Not all scales of Y and X covered (ID3, C4.5, ...)
6. Simulation: selection bias
Variable selection is unbiased ⇔ the probability of selecting a covariable that is independent of Y is the same for all independent covariables
Measurement scale of covariable shouldn’t play a role
Simulation illustrating the selection bias:
$Y \sim N(0, 1)$
$X_1 \sim M\left(n, \left(\tfrac{1}{2}, \tfrac{1}{2}\right)\right)$
$X_2 \sim M\left(n, \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)\right)$
$X_3 \sim M\left(n, \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right)\right)$
7. Simulation: results
Selection frequencies for the first split:
X1 : 0.128, X2 : 0.302, X3 : 0.556, none: 0.014
Strongly biased towards variables with many possible splits
Example of a tree:
[Tree plot: root split x3 ∈ {1, 2}; right child split on x2 ∈ {1, 3}; node means −0.19, −0.098, 0.36]
Overfitting! (Note: complexity parameter not cross-validated)
Desirable here: No split at all
Problem source: Exhaustive search through all variables and all
possible split points
Numeric/multi-categorial covariables have more split options ⇒ multiple comparison problem
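A minimal R sketch of this simulation (sample size n = 100 and 1000 replications are assumptions, not stated on the slides):

library("rpart")
set.seed(1)
first_split <- replicate(1000, {
  d <- data.frame(
    y  = rnorm(100),                                # Y ~ N(0, 1)
    x1 = factor(sample(1:2, 100, replace = TRUE)),  # 2 categories
    x2 = factor(sample(1:3, 100, replace = TRUE)),  # 3 categories
    x3 = factor(sample(1:4, 100, replace = TRUE))   # 4 categories
  )
  fit <- rpart(y ~ ., data = d)
  ## variable of the first split, or "none" if the tree is a stump
  if (nrow(fit$frame) == 1) "none" else as.character(fit$frame$var[1])
})
prop.table(table(first_split))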
8. Idea of conditional trees
Variable selection and search for split point ⇒ two steps
Embed all decisions into hypothesis tests
All tests with conditional inference (permutation tests)
9. Ctree algorithm
1 Stop criterion
Test global null hypothesis H0 of independence between Y and all Xj with
$H_0 = \bigcap_{j=1}^{m} H_0^j$ and $H_0^j: D(Y \mid X_j) = D(Y)$
If H0 not rejected ⇒ Stop
2 Select variable Xj∗ with strongest association
3 Search best split point for Xj∗ and partition the data
4 Repeat steps 1–3 for both of the new partitions
10. How can we test hypothesis of independence?
Parametric tests depend on distribution assumptions
Problem: Unknown conditional distribution
$D(Y \mid X) = D(Y \mid X_1, \ldots, X_m) = D(Y \mid f(X_1), \ldots, f(X_m))$
Need for a general framework that can handle arbitrary scales
Let the data speak ⇒ permutation tests!
12. Permutation tests: simple example
Possible treatments for disease: A or B
Numeric measurement (blood value)
Question: Different blood values between treatment A and B?
⇔ $\mu_A = \mu_B$?
Test statistic: $T_0 = \hat{\mu}_A - \hat{\mu}_B$
$H_0: \mu_A - \mu_B = 0$, $H_1: \mu_A - \mu_B \neq 0$
Distribution unknown ⇒ Permutation test
[Dot plot of the blood values y by treatment A/B]
$T_0 = \hat{\mu}_A - \hat{\mu}_B = 2.06 - 1.2 = 0.86$
13. Permute
Original data:
B B B B A A A A B A
0.5 0.9 1.1 1.2 1.5 1.9 2.0 2.1 2.3 2.8
One possible permutation:
B B B B A A A A B A
2.8 2.3 1.1 1.9 1.2 2.1 1.5 0.5 0.9 2.0
Permute the pairing of the labels (A and B) and the numeric measurements
Calculate test statistic T for each permutation
Do this with all possible permutations
Result: Distribution of test statistic conditioned on sample
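The example data as R vectors (a sketch; they are reused in the snippets below):

y   <- c(0.5, 0.9, 1.1, 1.2, 1.5, 1.9, 2.0, 2.1, 2.3, 2.8)
trt <- factor(c("B", "B", "B", "B", "A", "A", "A", "A", "B", "A"))
mean(y[trt == "A"]) - mean(y[trt == "B"])  # T0 = 2.06 - 1.2 = 0.86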
14. P-value and decision
$k = \#\left\{\text{permutations}: \; |\hat{\mu}_{A,\text{perm}} - \hat{\mu}_{B,\text{perm}}| > |\hat{\mu}_A - \hat{\mu}_B|\right\}$
$\text{p-value} = \frac{k}{\#\text{permutations}}$
p-value < α = 0.05? ⇒ If yes, H0 can be rejected
[Density of the permutation distribution of the test statistic (difference of means per treatment), original test statistic marked]
15. General algorithm for permutation tests
Requirement: Under H0 response and covariables are
exchangeable
Do the following:
1 Calculate test statistic $T_0$
2 Calculate test statistic T for all permutations of the pairs (Y, X)
3 Compute $n_{\text{extreme}}$: count the number of T which are more extreme than $T_0$
4 p-value: $p = \frac{n_{\text{extreme}}}{n_{\text{permutations}}}$
5 Reject $H_0$ if p < α, with significance level α
If the number of possible permutations is too big, draw random permutations in step 2 (Monte Carlo sampling)
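A minimal sketch of these steps in R for the two-sample example above (Monte Carlo version; the function name perm_test is illustrative):

perm_test <- function(y, trt, B = 9999) {
  t0 <- mean(y[trt == "A"]) - mean(y[trt == "B"])   # step 1: T0
  t_perm <- replicate(B, {                          # step 2: permute pairing
    y_s <- sample(y)
    mean(y_s[trt == "A"]) - mean(y_s[trt == "B"])
  })
  n_extreme <- sum(abs(t_perm) >= abs(t0))          # step 3: count extremes
  list(statistic = t0, p.value = n_extreme / B)     # step 4: p-value
}
perm_test(y, trt)  # y, trt as defined for the example above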
16. Framework by Strasser and Weber
General test statistic:
$T_j(L_n, w) = \text{vec}\left(\sum_{i=1}^{n} w_i \, g_j(X_{ji}) \, h(Y_i, (Y_1, \ldots, Y_n))^T\right) \in \mathbb{R}^{p_j q}$
h is called the influence function, gj is a transformation of Xj
Choose gj and h depending on the scale
It’s possible to calculate µ and Σ of T
Standardized test statistic: $c(t, \mu, \Sigma) = \max_{k=1,\ldots,pq} \frac{(t - \mu)_k}{\sqrt{(\Sigma)_{kk}}}$
Why so complex? ⇒ It covers all cases: multi-categorial X or Y, different scales
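This framework is implemented in the coin package (also by Hothorn et al.); a sketch on the blood-value example, assuming coin's formula interface:

library("coin")
d <- data.frame(y = y, trt = trt)  # example data from the permutation-test slides
## Monte Carlo approximation of the permutation distribution
independence_test(y ~ trt, data = d, distribution = "approximate")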
18. Ctree algorithm with permutation tests
1 Stop criterion
Test global null hypothesis H0 of independence between Y and all Xj with
$H_0 = \bigcap_{j=1}^{m} H_0^j$ and $H_0^j: D(Y \mid X_j) = D(Y)$ (permutation test for each Xj)
If H0 not rejected (no significant Xj) ⇒ Stop
2 Select variable Xj∗ with strongest association (smallest p-value)
3 Search best split point for Xj∗ (maximal test statistic c) and partition the data
4 Repeat steps 1–3 for both of the new partitions
19. Permutation tests for stop criterion
Choose influence function h for Y
Choose transformation function g for each Xj
Test each variable Xj separately for association with Y
($H_0^j: D(Y \mid X_j) = D(Y)$, i.e. variable Xj has no influence on Y)
Global $H_0 = \bigcap_{j=1}^{m} H_0^j$: no variable has influence on Y
Testing the global H0 involves multiple testing ⇒ adjust α (Bonferroni correction, ...)
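In the party package, α and the correction enter via ctree_control; a sketch (the bodyfat data is loaded on the R-code slide below; argument names as in party):

library("party")
## mincriterion = 1 - alpha: split only if 1 - (adjusted) p-value exceeds it
ct <- ctree(DEXfat ~ ., data = bodyfat,
            controls = ctree_control(testtype = "Bonferroni",
                                     mincriterion = 0.95))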
20. Permutation tests for variable selection
Choose variable with smallest p-value for split
Note: Comparing on the p-value scale removes the dependence on the covariable's scale and number of possible splits
21. Test statistic for best split point
Use test statistic instead of Gini/SSE for split point search
$T_j^A(L_n, w) = \text{vec}\left(\sum_{i=1}^{n} w_i \, I(X_{ji} \in A) \cdot h(Y_i, (Y_1, \ldots, Y_n))^T\right)$
Standardized test statistic: $c = \max_k \frac{(T_j^A - \mu)_k}{\sqrt{(\Sigma)_{kk}}}$
Measures discrepancy between $\{Y_i \mid X_{ji} \in A\}$ and $\{Y_i \mid X_{ji} \notin A\}$
Calculate c for all possible splits; Choose split point with
maximal c
Covers different scales of Y and X
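A toy sketch of this search in R for one numeric covariable and numeric Y (unweighted case with h = Y; µ and Σ follow from the formulas in the appendix):

split_statistic <- function(x, y) {
  n     <- length(y)
  y_bar <- mean(y)
  v_h   <- mean((y - y_bar)^2)                 # V(h | S), see appendix
  cand  <- sort(unique(x))
  cand  <- cand[-length(cand)]                 # candidate splits A = {x <= s}
  c_stat <- sapply(cand, function(s) {
    n_A    <- sum(x <= s)
    t_A    <- sum(y[x <= s])                   # T^A: sum of Y in left partition
    mu     <- n_A * y_bar                      # conditional mean of T^A
    sigma2 <- v_h * n_A * (n - n_A) / (n - 1)  # conditional variance
    abs(t_A - mu) / sqrt(sigma2)
  })
  cand[which.max(c_stat)]                      # split point with maximal c
}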
23. Bodyfat: example for continuous regression
Example: bodyfat data
Predict body fat with anthropometric measurements
Data: Measurements of 71 healthy women
Response Y : body fat measured by DXA (numeric)
Covariables X : different body measurements (numeric)
For example: waist circumference, breadth of the knee, ...
$h = Y_i$, $g = X_{ij}$
$T_j(L_n, w) = \sum_{i=1}^{n} w_i X_{ij} Y_i$
$c = \left|\frac{t - \mu}{\sigma}\right| \propto \left|\frac{\sum_{i:\text{node}} X_{ij} Y_i - n_{\text{node}} \bar{X}_j \bar{Y}}{\sqrt{\sum_{i:\text{node}} (Y_i - \bar{Y})^2 \cdot \sum_{i:\text{node}} (X_{ij} - \bar{X}_j)^2}}\right|$ (Pearson correlation coefficient)
24. Bodyfat: R-code
library("party")
library("rpart")
library("rpart.plot")
data(bodyfat, package = "mboost")
## conditional tree
cond_tree <- ctree(DEXfat ~ ., data = bodyfat)
## normal tree
classic_tree <- rpart(DEXfat ~ ., data = bodyfat)
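The fitted trees can then be inspected with the packages' plot methods (sketch):

plot(cond_tree)              # party's plot method for ctree objects
rpart.plot(classic_tree)     # plot the CART tree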
27. Glaucoma: example for classification
Predict Glaucoma (an eye disease) based on laser scanning measurements
Response Y : Binary, y ∈ {Glaucoma, normal}
Covariables X : Different volumes and areas of the eye (all
numeric)
$h = e_J(Y_i) = \begin{cases} (1, 0)^T & \text{Glaucoma} \\ (0, 1)^T & \text{normal} \end{cases}$
$g(X_{ij}) = X_{ij}$
$T_j(L_n, w) = \text{vec}\left(\sum_{i=1}^{n} w_i X_{ij} \, e_J(Y_i)^T\right) = \begin{pmatrix} n_{\text{Glaucoma}} \cdot \bar{X}_{j,\text{Glaucoma}} \\ n_{\text{normal}} \cdot \bar{X}_{j,\text{normal}} \end{pmatrix}$
$c \propto \max_{\text{group}} \; n_{\text{group}} \cdot (\bar{X}_{j,\text{group}} - \bar{X}_{j,\text{node}})$, group ∈ {Glaucoma, normal}
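A sketch of the corresponding R code (that the GlaucomaM data ships with the TH.data package is an assumption; its home package has changed over the years):

data("GlaucomaM", package = "TH.data")
## conditional tree
glaucoma_ctree <- ctree(Class ~ ., data = GlaucomaM)
## CART tree, plotted on the next slide
classic_tree <- rpart(Class ~ ., data = GlaucomaM)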
30. Glaucoma: CART tree
rpart.plot(classic_tree, cex = 1.5)
[CART tree: root split varg < 0.21; further splits mhcg ≥ 0.17, vars < 0.064, tms ≥ −0.066, eas < 0.45; leaves labeled glaucoma/normal]
31. Appendix: Examples of other scales
Y categorial, X categorial
$h = e_J(Y_i)$, $g = e_K(X_{ij})$
⇒ T is the vectorized contingency table of Xj and Y
[Mosaic plot of the contingency table of Y and Xj (3 × 3) with Pearson residuals; p-value = 0.009]
Y and Xj numeric, $h = rg(Y_i)$, $g = rg(X_{ij})$ (ranks) ⇒ Spearman's ρ
Flexible T for different situations: multivariate regression, ordinal regression, censored regression, ...
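The rank case, for instance, is directly available in coin (a sketch, reusing the bodyfat data; spearman_test is assumed to follow coin's usual interface):

library("coin")
## permutation test of Spearman's rho between two numeric variables
spearman_test(DEXfat ~ waistcirc, data = bodyfat,
              distribution = "approximate")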
32. Properties
Prediction accuracy: Not better than normal trees, but not
worse either
Computational considerations: Same speed as normal trees.
Two possible interpretations of significance level α:
1. Pre-specified nominal level of underlying association tests
2. Simple hyperparameter determining the tree size
Low α yields smaller trees
33. Summary conditional trees
Not heuristics, but non-parametric models with well-defined
theoretical background
Suitable for regression with arbitrary scales of Y and X
Unbiased variable selection
No overfitting
Conditional trees are structurally different from trees partitioned with exhaustive search procedures
34. Literature and Software
J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, 2001.
T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.
H. Strasser and C. Weber. On the asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8:220–250, 1999.
R-packages:
rpart: Recursive partitioning
rpart.plot: Plot function for rpart
party: A Laboratory for Recursive Partytioning
All available on CRAN
35. Appendix: Competitors
Other partitioning algorithms in this area:
CHAID: Nominal response, χ2 test, multiway splits, nominal
covariables
GUIDE: Continuous response only, p-value from χ2 test,
categorizes continuous covariables
QUEST: ANOVA F-Test for continuous response, χ2 test for
nominal, compare on p-scale ⇒ reduces selection bias
CRUISE: Multiway splits, discriminant analysis in each node,
unbiased variable selection
36. Appendix: Properties of test statistic T
$\mu_j = E(T_j(L_n, w) \mid S(L_n, w)) = \text{vec}\left(\left(\sum_{i=1}^{n} w_i \, g_j(X_{ji})\right) E(h \mid S(L_n, w))^T\right)$

$\Sigma_j = V(T_j(L_n, w) \mid S(L_n, w))$
$= \frac{w_\cdot}{w_\cdot - 1} \, V(h \mid S(L_n, w)) \otimes \left(\sum_i w_i \, g_j(X_{ji}) \otimes w_i \, g_j(X_{ji})^T\right)$
$\quad - \frac{1}{w_\cdot - 1} \, V(h \mid S(L_n, w)) \otimes \left(\sum_i w_i \, g_j(X_{ji})\right) \otimes \left(\sum_i w_i \, g_j(X_{ji})\right)^T$

with
$w_\cdot = \sum_{i=1}^{n} w_i$
$E(h \mid S(L_n, w)) = w_\cdot^{-1} \sum_i w_i \, h(Y_i, (Y_1, \ldots, Y_n)) \in \mathbb{R}^q$
$V(h \mid S(L_n, w)) = w_\cdot^{-1} \sum_i w_i \left(h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S(L_n, w))\right) \left(h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S(L_n, w))\right)^T$