# Conditional trees

Slides of my presentation on conditional trees


### Conditional trees

1. Conditional Trees, or Unbiased Recursive Partitioning: A Conditional Inference Framework. Christoph Molnar. Supervisor: Stephanie Möst. Department of Statistics, LMU. 18 December 2012.
2. Overview: introduction and motivation; algorithm for unbiased trees; conditional inference with permutation tests; examples; properties; summary.
3. CART trees
    - Model: $Y = f(X)$
    - Structure of decision trees: recursive partitioning of the covariable space $X$
    - Split optimizes a criterion (Gini, information gain, sum of squares) depending on the scale of $Y$
    - Split point search: exhaustive search procedure
    - Avoid overfitting: early stopping or pruning
    - Usage: prediction and explanation
    - Other tree types: ID3, C4.5, CHAID, ...
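The exhaustive split point search can be sketched in a few lines. The following is a minimal Python stand-in (not the rpart implementation; names are illustrative) for a numeric covariable with the sum-of-squares criterion:

```python
import numpy as np

def best_sse_split(x, y):
    """Exhaustive search: try every threshold on a numeric covariable
    and return the one minimizing the total within-node sum of squares."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:          # no split between tied values
            continue
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + \
              ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = (xs[i - 1] + xs[i]) / 2, sse
    return best_threshold, best_sse

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 10.0, 10.0])
t, s = best_sse_split(x, y)
print(float(t), float(s))   # → 2.5 0.0
```

Because every threshold on every covariable is tried, variables offering more candidate splits get more chances to look good by accident, which is exactly the selection bias discussed next.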
4. What are conditional trees?
    - A special kind of tree: recursive partitioning with binary splits and early stopping
    - Constant models in the terminal nodes
    - Variable selection, early stopping and split point search based on conditional inference
    - Uses permutation tests for inference
    - Solves problems of CART trees
5. Why conditional trees? They help overcome problems of trees:
    - Overfitting (can be solved with other techniques as well)
    - Selection bias towards covariables with many possible splits (i.e. numeric or multi-categorical ones)
    - Difficult interpretation due to selection bias
    - Variable selection without any concept of statistical significance
    - Not all scales of $Y$ and $X$ covered (ID3, C4.5, ...)
6. Simulation: selection bias. Variable selection is unbiased ⇔ the probability of selecting a covariable that is independent of $Y$ is the same for all independent covariables; the measurement scale of a covariable shouldn't play a role. Simulation illustrating the selection bias: $Y \sim N(0, 1)$, $X_1 \sim M(n, (\tfrac12, \tfrac12))$, $X_2 \sim M(n, (\tfrac13, \tfrac13, \tfrac13))$, $X_3 \sim M(n, (\tfrac14, \tfrac14, \tfrac14, \tfrac14))$.
7. Simulation: results. Selection frequencies for the first split: $X_1$: 0.128, $X_2$: 0.302, $X_3$: 0.556, none: 0.014 (bar plot omitted). The selection is strongly biased towards variables with many possible splits. An example tree splits first on $x_3$ and then on $x_2$ although $Y$ is independent of all covariables: overfitting! (Note: the complexity parameter was not cross-validated.) Desirable here: no split at all. Problem source: exhaustive search through all variables and all possible split points; numeric and multi-categorical covariables have more split options ⇒ multiple comparison problem.
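The bias can be reproduced directly. Below is a Python stand-in for the simulation (the R original is not shown on the slides; function names and the SSE criterion are illustrative): an independent response, three categorical covariables with 2, 3 and 4 levels, and an exhaustive search over all binary level partitions:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def best_split_var(y, xs):
    """Exhaustive search: for each categorical covariable, try every
    binary partition of its levels; return index of the best SSE gain."""
    best_var, best_gain = None, 0.0
    sse_root = ((y - y.mean()) ** 2).sum()
    for j, x in enumerate(xs):
        levels = np.unique(x)
        for r in range(1, len(levels)):
            for left_levels in combinations(levels, r):
                mask = np.isin(x, left_levels)
                if mask.all() or not mask.any():
                    continue
                sse = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                      ((y[~mask] - y[~mask].mean()) ** 2).sum()
                if sse_root - sse > best_gain:
                    best_var, best_gain = j, sse_root - sse
    return best_var

counts = np.zeros(3, dtype=int)
for _ in range(300):
    y = rng.normal(size=50)                        # Y independent of all X
    xs = [rng.integers(0, k, size=50) for k in (2, 3, 4)]
    counts[best_split_var(y, xs)] += 1
print(counts / 300)   # X3 (four levels) wins far more often than X1
```

Although no covariable carries information, the four-level covariable offers seven binary partitions against one for the two-level covariable, so it wins the exhaustive search most of the time.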
8. Idea of conditional trees: separate variable selection and the search for the split point into two steps; embed all decisions into hypothesis tests; perform all tests with conditional inference (permutation tests).
9. Ctree algorithm
    1. Stop criterion: test the global null hypothesis $H_0$ of independence between $Y$ and all $X_j$, with $H_0 = \bigcap_{j=1}^m H_0^j$ and $H_0^j : D(Y \mid X_j) = D(Y)$. If $H_0$ is not rejected ⇒ stop.
    2. Select the variable $X_{j^*}$ with the strongest association.
    3. Search the best split point for $X_{j^*}$ and partition the data.
    4. Repeat steps 1–3 for both of the new partitions.
10. How can we test the hypothesis of independence? Parametric tests depend on distribution assumptions. Problem: the conditional distribution $D(Y \mid X) = D(Y \mid X_1, \ldots, X_m) = D(Y \mid f(X_1), \ldots, f(X_m))$ is unknown. We need a general framework that can handle arbitrary scales. Let the data speak ⇒ permutation tests!
11. Excursion: permutation tests
12. Permutation tests: simple example. Possible treatments for a disease: A or B, with a numeric measurement (blood value). Question: do blood values differ between treatments A and B, i.e. is $\mu_A = \mu_B$? Test statistic: $T_0 = \hat\mu_A - \hat\mu_B$, with $H_0 : \mu_A - \mu_B = 0$ and $H_1 : \mu_A - \mu_B \neq 0$. The distribution of $T_0$ is unknown ⇒ permutation test. (Strip chart of the two treatment groups omitted.) Here $T_0 = \hat\mu_A - \hat\mu_B = 2.06 - 1.2 = 0.86$.
13. Permute.
    - Original data: labels B B B B A A A A B A with values 0.5 0.9 1.1 1.2 1.5 1.9 2.0 2.1 2.3 2.8.
    - One possible permutation: the same labels with values 2.8 2.3 1.1 1.9 1.2 2.1 1.5 0.5 0.9 2.0.
    - Permute the assignment between the labels (A and B) and the numeric measurements; calculate the test statistic $T$ for each permutation; do this for all possible permutations. Result: the distribution of the test statistic conditioned on the sample.
14. P-value and decision. $k = \#\{\text{permutation samples} : |\hat\mu_{A,perm} - \hat\mu_{B,perm}| > |\hat\mu_A - \hat\mu_B|\}$, and p-value $= k / \#\text{permutations}$. Is the p-value $< \alpha = 0.05$? If yes, $H_0$ can be rejected. (Density plot of the permutation distribution of the difference of means, with the original test statistic marked, omitted.)
15. General algorithm for permutation tests. Requirement: under $H_0$, response and covariables are exchangeable. Do the following:
    1. Calculate the test statistic $T_0$.
    2. Calculate the test statistic $T$ for all permutations of the pairs $(Y, X)$.
    3. Compute $n_{extreme}$: the number of $T$ that are more extreme than $T_0$.
    4. p-value $p = n_{extreme} / n_{permutations}$.
    5. Reject $H_0$ if $p < \alpha$, with significance level $\alpha$.
    If the number of possible permutations is too big, draw random permutations in step 2 (Monte Carlo sampling).
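The steps above, applied to the blood-value example from the previous slides, can be sketched as follows (a minimal Monte Carlo Python sketch, not the coin/party implementation):

```python
import numpy as np

def perm_test_mean_diff(a, b, n_perm=10_000, seed=0):
    """Monte Carlo permutation test for a difference in group means:
    shuffle the pooled values, re-split into the original group sizes,
    and count statistics at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    t0 = abs(a.mean() - b.mean())
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        if t >= t0 - 1e-9:        # tolerance so float ties count as extreme
            n_extreme += 1
    return n_extreme / n_perm

# blood values from the slides: treatment A (mean 2.06), treatment B (mean 1.2)
a = np.array([1.5, 1.9, 2.0, 2.1, 2.8])
b = np.array([0.5, 0.9, 1.1, 1.2, 2.3])
print(perm_test_mean_diff(a, b))   # p-value for the observed T0 = 0.86
```

With only ten observations all 252 label arrangements could be enumerated exactly; the Monte Carlo variant is what scales to realistic sample sizes.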
16. Framework by Strasser and Weber. General test statistic: $T_j(\mathcal{L}_n, w) = \mathrm{vec}\left( \sum_{i=1}^n w_i\, g_j(X_{ji})\, h(Y_i, (Y_1, \ldots, Y_n))^T \right) \in \mathbb{R}^{p_j q}$. Here $h$ is called the influence function and $g_j$ is a transformation of $X_j$; choose $g_j$ and $h$ depending on the scale. It is possible to calculate $\mu$ and $\Sigma$ of $T$. Standardized test statistic: $c(t, \mu, \Sigma) = \max_{k=1,\ldots,pq} \frac{|(t - \mu)_k|}{\sqrt{(\Sigma)_{kk}}}$. Why so complex? ⇒ To cover all cases: multi-categorical $X$ or $Y$, different scales.
17. End of excursion. Let's get back to business.
18. Ctree algorithm with permutation tests
    1. Stop criterion: test the global null hypothesis $H_0$ of independence between $Y$ and all $X_j$, with $H_0 = \bigcap_{j=1}^m H_0^j$ and $H_0^j : D(Y \mid X_j) = D(Y)$ (a permutation test for each $X_j$). If $H_0$ is not rejected (no significance for any $X_j$) ⇒ stop.
    2. Select the variable $X_{j^*}$ with the strongest association (smallest p-value).
    3. Search the best split point for $X_{j^*}$ (maximal test statistic $c$) and partition the data.
    4. Repeat steps 1–3 for both of the new partitions.
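The recursion can be sketched end to end. The Python sketch below is a deliberately simplified stand-in for party's ctree: it assumes numeric $Y$ and numeric covariables, uses |Pearson correlation| as the permutation test statistic, and lets a plain |mean difference| over quantile candidates stand in for the standardized statistic $c$ in the split search; all names are illustrative:

```python
import numpy as np

def perm_pvalue(x, y, n_perm=500, seed=0):
    """Permutation p-value for association between x and y,
    using |Pearson correlation| as the test statistic."""
    rng = np.random.default_rng(seed)
    t0 = abs(np.corrcoef(x, y)[0, 1])
    t = np.array([abs(np.corrcoef(rng.permutation(x), y)[0, 1])
                  for _ in range(n_perm)])
    return float((t >= t0).mean())

def ctree_sketch(y, xs, alpha=0.05, depth=0):
    """Recursive partitioning with permutation-test stopping for
    numeric y and numeric covariables xs. Returns a nested dict."""
    if len(y) < 20 or depth >= 3:                 # minimal sample-size guard
        return {"leaf": True, "mean": float(y.mean()), "n": len(y)}
    m = len(xs)
    pvals = [perm_pvalue(x, y) for x in xs]
    if min(pvals) * m >= alpha:                   # Bonferroni-adjusted global test
        return {"leaf": True, "mean": float(y.mean()), "n": len(y)}
    j = int(np.argmin(pvals))                     # strongest association
    x = xs[j]
    # split point search: |mean difference| over candidate thresholds
    # stands in for the standardized statistic c
    def stat(s):
        l, r = y[x <= s], y[x > s]
        return abs(l.mean() - r.mean()) if len(l) and len(r) else -1.0
    s = max(np.quantile(x, np.linspace(0.1, 0.9, 9)), key=stat)
    left = x <= s
    return {"leaf": False, "var": j, "split": float(s),
            "left":  ctree_sketch(y[left],  [v[left]  for v in xs], alpha, depth + 1),
            "right": ctree_sketch(y[~left], [v[~left] for v in xs], alpha, depth + 1)}

rng = np.random.default_rng(42)
x0 = rng.uniform(size=120)                        # informative covariable
x1 = rng.uniform(size=120)                        # independent noise covariable
y = np.where(x0 > 0.5, 2.0, 0.0) + rng.normal(scale=0.3, size=120)
tree = ctree_sketch(y, [x0, x1])
print(tree["var"], round(tree["split"], 2))       # splits on x0 near 0.5
```

Note how stopping falls out of the hypothesis test: if no covariable reaches significance after the Bonferroni adjustment, the node simply becomes a leaf, with no pruning step afterwards.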
19. Permutation tests for the stop criterion. Choose the influence function $h$ for $Y$ and a transformation function $g$ for each $X_j$. Test each variable $X_j$ separately for association with $Y$ ($H_0^j : D(Y \mid X_j) = D(Y)$, i.e. variable $X_j$ has no influence on $Y$). Global $H_0 = \bigcap_{j=1}^m H_0^j$: no variable has influence on $Y$. Testing the global $H_0$ means multiple testing ⇒ adjust $\alpha$ (Bonferroni correction, ...).
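With a Bonferroni correction, the global decision reduces to comparing the smallest adjusted per-variable p-value against $\alpha$ (a minimal sketch; the function name is illustrative):

```python
def reject_global_null(pvals, alpha=0.05):
    """Bonferroni-adjusted test of the global H0 (intersection of the
    per-variable hypotheses): reject when min_j (m * p_j) < alpha."""
    m = len(pvals)
    return min(min(p * m, 1.0) for p in pvals) < alpha

print(reject_global_null([0.010, 0.80, 0.90]))   # → True  (3 * 0.01 = 0.03)
print(reject_global_null([0.030, 0.50]))         # → False (2 * 0.03 = 0.06)
```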
20. Permutation tests for variable selection. Choose the variable with the smallest p-value for the split. Note: switching to a comparison of p-values gets rid of the scaling problem.
21. Test statistic for the best split point. Use the test statistic instead of Gini/SSE for the split point search: $T_j^A(\mathcal{L}_n, w) = \mathrm{vec}\left( \sum_{i=1}^n w_i\, I(X_{ji} \in A)\, h(Y_i, (Y_1, \ldots, Y_n))^T \right)$. Standardized test statistic: $c = \max_k \frac{|(T_j^A - \mu)_k|}{\sqrt{(\Sigma)_{kk}}}$. It measures the discrepancy between $\{Y_i \mid X_{ji} \in A\}$ and $\{Y_i \mid X_{ji} \notin A\}$. Calculate $c$ for all possible splits and choose the split point with maximal $c$. This covers different scales of $Y$ and $X$.
22. Usage examples with R. Let's get the party started.
23. Bodyfat: example of continuous regression. Predict body fat with anthropometric measurements. Data: measurements of 71 healthy women. Response $Y$: body fat measured by DXA (numeric). Covariables $X$: different body measurements (numeric), for example waist circumference, breadth of the knee, ... Choose $h(Y_i) = Y_i$ and $g(X_i) = X_i$, so $T_j(\mathcal{L}_n, w) = \sum_{i=1}^n w_i X_{ji} Y_i$ and $c = \frac{|t - \mu|}{\sigma} \propto \frac{\sum_{i \in node} X_{ji} Y_i - n_{node} \bar{X}_j \bar{Y}}{\sqrt{\sum_{i \in node} (Y_i - \bar{Y})^2 \sum_{i \in node} (X_{ji} - \bar{X}_j)^2}}$ (the Pearson correlation coefficient).
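That the standardized linear statistic is proportional to the Pearson correlation can be checked numerically: the permutation z-score of $T = \sum_i x_i y_i$ equals $r\sqrt{n-1}$. A Python check on synthetic data standing in for the bodyfat measurements:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)                        # stand-in covariable
y = 0.8 * x + rng.normal(scale=0.6, size=n)   # stand-in response

t0 = (x * y).sum()                            # linear statistic T_j
perms = np.array([(rng.permutation(x) * y).sum() for _ in range(5000)])
z = (t0 - perms.mean()) / perms.std()         # standardized statistic c

r = np.corrcoef(x, y)[0, 1]                   # Pearson correlation
print(z / np.sqrt(n - 1), r)                  # the two agree closely
```

So for numeric $Y$ and numeric $X$, ranking variables by the standardized statistic is equivalent to ranking them by |Pearson correlation|.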
24. Bodyfat: R code

    ```r
    library("party")
    library("rpart")
    library("rpart.plot")
    data(bodyfat, package = "mboost")
    ## conditional tree
    cond_tree <- ctree(DEXfat ~ ., data = bodyfat)
    ## normal tree
    classic_tree <- rpart(DEXfat ~ ., data = bodyfat)
    ```
25. Bodyfat: conditional tree, `plot(cond_tree)` (tree plot omitted). Structure: node 1 splits on hipcirc (p < 0.001, ≤ 108 vs > 108); node 2 splits on anthro3c (p < 0.001, ≤ 3.76); node 3 splits on anthro3c (p = 0.001, ≤ 3.39) into leaves with n = 13 and n = 12; node 6 splits on waistcirc (p = 0.003, ≤ 86) into leaves with n = 13 and n = 7; node 9 splits on kneebreadth (p = 0.006, ≤ 10.6) into leaves with n = 19 and n = 7.
26. Bodyfat: CART tree, `rpart.plot(classic_tree)` (tree plot omitted). The CART tree splits on waistcirc < 88, hipcirc < 110, anthro3c < 3.4 and hipcirc < 101, with leaf predictions 17, 23, 30, 35 and 45. ⇒ Structurally different trees!
27. Glaucoma: example of classification. Predict glaucoma (an eye disease) based on laser scanning measurements. Response $Y$: binary, $y \in \{\text{Glaucoma}, \text{normal}\}$. Covariables $X$: different volumes and areas of the eye (all numeric). Choose $h = e_J(Y_i) = (1, 0)^T$ for Glaucoma and $(0, 1)^T$ for normal, and $g(X_{ji}) = X_{ji}$, so $T_j(\mathcal{L}_n, w) = \mathrm{vec}\left( \sum_{i=1}^n w_i X_{ji}\, e_J(Y_i)^T \right) = (n_{Glaucoma} \cdot \bar{X}_{j,Glaucoma},\; n_{normal} \cdot \bar{X}_{j,normal})^T$ and $c \propto \max_{group} n_{group} \cdot |\bar{X}_{j,group} - \bar{X}_{j,node}|$, with group ∈ {Glaucoma, normal}.
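Plugging these choices in, the linear statistic stacks per-class sums of the covariable, and the discrepancy measure compares class means against the node mean. A small numeric illustration (synthetic numbers, not the GlaucomaM data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x_glaucoma = rng.normal(0.05, 0.02, size=40)   # covariable values, class "glaucoma"
x_normal   = rng.normal(0.10, 0.02, size=60)   # covariable values, class "normal"
x_node = np.concatenate([x_glaucoma, x_normal])

# T_j = vec(sum_i w_i X_ji e_J(Y_i)^T): one per-class sum per coordinate
T = np.array([x_glaucoma.sum(), x_normal.sum()])

# discrepancy proportional to the standardized statistic c
c = max(len(g) * abs(g.mean() - x_node.mean())
        for g in (x_glaucoma, x_normal))
print(T, round(c, 3))
```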
28. Glaucoma: R code

    ```r
    library("rpart")
    library("party")
    data("GlaucomaM", package = "ipred")
    cond_tree <- ctree(Class ~ ., data = GlaucomaM)
    classic_tree <- rpart(Class ~ ., data = GlaucomaM)
    ```
29. Glaucoma: conditional tree (plot with class-proportion bar charts in the terminal nodes omitted; n = 196 in the root, terminal nodes with n = 79, 8, 65 and 44). Printed structure:

    ```
    ## 1) vari <= 0.059; criterion = 1, statistic = 71.475
    ##   2) vasg <= 0.066; criterion = 1, statistic = 29.265
    ##     3)* weights = 79
    ##   2) vasg > 0.066
    ##     4)* weights = 8
    ## 1) vari > 0.059
    ##   5) tms <= -0.066; criterion = 0.951, statistic = 11.221
    ##     6)* weights = 65
    ##   5) tms > -0.066
    ##     7)* weights = 44
    ```
30. Glaucoma: CART tree, `rpart.plot(classic_tree, cex = 1.5)` (tree plot omitted). The CART tree splits on varg < 0.21, mhcg >= 0.17, vars < 0.064, tms >= −0.066 and eas < 0.45, with leaves predicting glaucoma or normal.
31. Appendix: examples of other scales
    - $Y$ categorical, $X$ categorical: $h = e_J(Y_i)$, $g = e_K(X_{ji})$ ⇒ $T$ is the vectorized contingency table of $X_j$ and $Y$. (Mosaic plot of the Pearson residuals for the $Y \times X_j$ table omitted; p-value = 0.009.)
    - $Y$ and $X_j$ numeric: $h = rg(Y_i)$, $g = rg(X_{ji})$ ⇒ Spearman's rho.
    - Flexible $T$ for different situations: multivariate regression, ordinal regression, censored regression, ...
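The rank choice $h = rg(Y_i)$, $g = rg(X_{ji})$ makes the correlation statistic Spearman's rho, which picks up any monotone association. A small Python check on made-up data:

```python
import numpy as np

def rank(a):
    """Ranks 1..n (assumes no ties, as with continuous data)."""
    return np.argsort(np.argsort(a)) + 1.0

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = np.exp(x)                    # strictly monotone, but nonlinear

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(rank(x), rank(y))[0, 1]
print(pearson, spearman)         # Spearman ≈ 1, Pearson clearly below it
```

Because $\exp$ preserves the ordering of $x$, the rank vectors are identical and Spearman's rho is 1, while the Pearson correlation is dragged below 1 by the nonlinearity.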
32. Properties. Prediction accuracy: not better than normal trees, but not worse either. Computational considerations: same speed as normal trees. Two possible interpretations of the significance level $\alpha$: (1) a pre-specified nominal level of the underlying association tests, or (2) a simple hyperparameter determining the tree size; a low $\alpha$ yields smaller trees.
33. Summary: conditional trees
    - Not heuristics, but non-parametric models with a well-defined theoretical background
    - Suitable for regression with arbitrary scales of $Y$ and $X$
    - Unbiased variable selection
    - No overfitting
    - Conditional trees are structurally different from trees partitioned with exhaustive search procedures
34. Literature and software
    - J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.
    - T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3): 651–674, 2006.
    - H. Strasser and C. Weber. On the asymptotic theory of permutation statistics. 1999.
    R packages (all available on CRAN): rpart (recursive partitioning), rpart.plot (plot function for rpart), party (A Laboratory for Recursive Partytioning).
35. Appendix: competitors. Other partitioning algorithms in this area:
    - CHAID: nominal response, χ² test, multiway splits, nominal covariables
    - GUIDE: continuous response only, p-value from a χ² test, categorizes continuous covariables
    - QUEST: ANOVA F-test for a continuous response, χ² test for nominal ones, comparison on the p-value scale ⇒ reduces selection bias
    - CRUISE: multiway splits, discriminant analysis in each node, unbiased variable selection
36. Appendix: properties of the test statistic $T$. With $w_\cdot = \sum_{i=1}^n w_i$:
    - $\mu_j = E(T_j(\mathcal{L}_n, w) \mid S(\mathcal{L}_n, w)) = \mathrm{vec}\left( \left( \sum_{i=1}^n w_i\, g_j(X_{ji}) \right) E(h \mid S(\mathcal{L}_n, w))^T \right)$
    - $\Sigma_j = V(T_j(\mathcal{L}_n, w) \mid S(\mathcal{L}_n, w)) = \frac{w_\cdot}{w_\cdot - 1}\, V(h \mid S(\mathcal{L}_n, w)) \otimes \left( \sum_i w_i\, g_j(X_{ji}) \otimes g_j(X_{ji})^T \right) - \frac{1}{w_\cdot - 1}\, V(h \mid S(\mathcal{L}_n, w)) \otimes \left( \sum_i w_i\, g_j(X_{ji}) \right) \otimes \left( \sum_i w_i\, g_j(X_{ji}) \right)^T$
    - $E(h \mid S(\mathcal{L}_n, w)) = w_\cdot^{-1} \sum_i w_i\, h(Y_i, (Y_1, \ldots, Y_n)) \in \mathbb{R}^q$
    - $V(h \mid S(\mathcal{L}_n, w)) = w_\cdot^{-1} \sum_i w_i \left( h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S(\mathcal{L}_n, w)) \right) \left( h(Y_i, (Y_1, \ldots, Y_n)) - E(h \mid S(\mathcal{L}_n, w)) \right)^T$