Conditional trees
Slides of my presentation on conditional trees

Presentation Transcript

  • Conditional Trees, or: Unbiased recursive partitioning, a conditional inference framework. Christoph Molnar. Supervisor: Stephanie Möst. Department of Statistics, LMU. 18 December 2012
  • Overview
    - Introduction and motivation
    - Algorithm for unbiased trees
    - Conditional inference with permutation tests
    - Examples
    - Properties
    - Summary
  • CART trees
    - Model: Y = f(X), represented as a decision tree
    - Structure: recursive partitioning of the covariable space X
    - Each split optimizes a criterion (Gini, information gain, sum of squares) depending on the scale of Y
    - Split point search: exhaustive search procedure
    - Avoiding overfitting: early stopping or pruning
    - Usage: prediction and explanation
    - Other tree types: ID3, C4.5, CHAID, ...
  • What are conditional trees?
    - A special kind of tree: recursive partitioning with binary splits and early stopping
    - Constant models in the terminal nodes
    - Variable selection, early stopping and split point search are all based on conditional inference
    - Uses permutation tests for inference
    - Solves problems of CART trees
  • Why conditional trees? They help to overcome problems of classical trees:
    - Overfitting (can be solved with other techniques as well)
    - Selection bias towards covariables with many possible splits (i.e. numeric or multi-categorical ones)
    - Difficult interpretation due to the selection bias
    - Variable selection without any concept of statistical significance
    - Not all scales of Y and X covered (ID3, C4.5, ...)
  • Simulation: selection bias. Variable selection is unbiased ⇔ the probability of selecting a covariable that is independent of Y is the same for all independent covariables; the measurement scale of a covariable should not play a role. Simulation illustrating the selection bias:
    - Y ∼ N(0, 1)
    - X1 ∼ M(n; 1/2, 1/2)
    - X2 ∼ M(n; 1/3, 1/3, 1/3)
    - X3 ∼ M(n; 1/4, 1/4, 1/4, 1/4)
    (multinomial covariables with 2, 3 and 4 equally likely categories, all independent of Y)
  • Simulation: results. Selection frequencies for the first split: X1: 0.128, X2: 0.302, X3: 0.556, none: 0.014. [Bar plot of these selection frequencies.] The search is strongly biased towards variables with many possible splits. [Example of a fitted tree: first split on x3, second split on x2.] Overfitting! (Note: the complexity parameter was not cross-validated.) Desirable here: no split at all. Problem source: the exhaustive search through all variables and all possible split points; numeric and multi-categorical covariables offer more split options ⇒ a multiple comparison problem. A sketch of this simulation in R follows below.
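A minimal R sketch of the simulation (the slides do not show the original code; the sample size, the number of replications and the use of rpart are assumptions, so the exact frequencies will differ):

```r
library(rpart)

set.seed(1)
first_split <- replicate(1000, {
  n <- 100
  dat <- data.frame(
    y  = rnorm(n),                               # Y independent of all X
    x1 = factor(sample(1:2, n, replace = TRUE)), # 2 categories
    x2 = factor(sample(1:3, n, replace = TRUE)), # 3 categories
    x3 = factor(sample(1:4, n, replace = TRUE))  # 4 categories
  )
  fit <- rpart(y ~ ., data = dat)
  v <- as.character(fit$frame$var[1])            # variable of the root split
  if (v == "<leaf>") "none" else v
})
prop.table(table(first_split))
```

Under an unbiased procedure the three covariables would be selected about equally often; the exhaustive search picks x3 most frequently.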
  • Idea of conditional trees
    - Separate variable selection and the search for the split point into two steps
    - Embed all decisions into hypothesis tests
    - Carry out all tests with conditional inference (permutation tests)
  • Ctree algorithm
    1. Stop criterion: test the global null hypothesis H0 of independence between Y and all Xj, where H0 = ∩_{j=1}^m H0^j and H0^j : D(Y | Xj) = D(Y). If H0 is not rejected ⇒ stop.
    2. Select the variable Xj* with the strongest association.
    3. Search the best split point for Xj* and partition the data.
    4. Repeat steps 1-3 for both of the new partitions.
  • How can we test the hypothesis of independence? Parametric tests depend on distributional assumptions. Problem: the conditional distribution D(Y | X) = D(Y | X1, ..., Xm) = D(Y | f(X1), ..., f(Xm)) is unknown. We need a general framework that can handle arbitrary scales. Let the data speak ⇒ permutation tests!
  • Excursion: permutation tests
  • Permutation tests: a simple example
    - Possible treatments for a disease: A or B
    - Numeric measurement (a blood value)
    - Question: do the blood values differ between treatments A and B? ⇔ µA = µB?
    - Test statistic: T0 = µ̂A − µ̂B
    - H0 : µA − µB = 0, H1 : µA − µB ≠ 0
    - Distribution of T0 unknown ⇒ permutation test
    [Strip chart of the blood values, grouped by treatment A and B.]
    - T0 = µ̂A − µ̂B = 2.06 − 1.2 = 0.86
  • Permute. Original data:

        Label: B   B   B   B   A   A   A   A   B   A
        Value: 0.5 0.9 1.1 1.2 1.5 1.9 2.0 2.1 2.3 2.8

    One possible permutation:

        Label: B   B   B   B   A   A   A   A   B   A
        Value: 2.8 2.3 1.1 1.9 1.2 2.1 1.5 0.5 0.9 2.0

    - Permute the assignment of labels (A and B) to the numeric measurements
    - Calculate the test statistic T for each permutation
    - Do this for all possible permutations
    - Result: the distribution of the test statistic conditioned on the sample
  • P-value and decision
    - k = #{permutation samples : |µ̂A,perm − µ̂B,perm| > |µ̂A − µ̂B|}
    - p-value = k / #permutations
    - p-value < α = 0.05? ⇒ If yes, H0 can be rejected.
    [Density plot of the permutation distribution of the difference of means per treatment, with the original test statistic marked.]
  • General algorithm for permutation tests. Requirement: under H0, the response and the covariables are exchangeable. Then:
    1. Calculate the test statistic T0.
    2. Calculate the test statistic T for all permutations of the pairs (Y, X).
    3. Count the number n_extreme of statistics T that are more extreme than T0.
    4. Compute the p-value p = n_extreme / n_permutations.
    5. Reject H0 if p < α, for significance level α.
    If the number of possible permutations is too big, draw random permutations in step 2 (Monte Carlo sampling); see the sketch below.
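A minimal R sketch of this algorithm for the two-treatment example above; the ten values and labels are taken from the "Permute" slide, while the number of Monte Carlo draws is an arbitrary choice:

```r
set.seed(1)
y     <- c(0.5, 0.9, 1.1, 1.2, 1.5, 1.9, 2.0, 2.1, 2.3, 2.8)
group <- c("B", "B", "B", "B", "A", "A", "A", "A", "B", "A")

## Step 1: observed statistic T0 = 2.06 - 1.2 = 0.86
t0 <- mean(y[group == "A"]) - mean(y[group == "B"])

## Step 2 (Monte Carlo variant): statistic under random relabelings
t_perm <- replicate(10000, {
  g <- sample(group)                      # permute the labels
  mean(y[g == "A"]) - mean(y[g == "B"])
})

## Steps 3-5: two-sided p-value and decision
p_value <- mean(abs(t_perm) >= abs(t0))
p_value < 0.05
```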
  • Framework by Strasser and Weber. The general test statistic is

        Tj(Ln, w) = vec( Σ_{i=1}^n wi gj(Xji) h(Yi, (Y1, ..., Yn))^T ) ∈ R^{pj·q}

    h is called the influence function; gj is a transformation of Xj. Choose gj and h depending on the scale. The conditional mean µ and covariance Σ of T can be calculated (see the appendix). Standardized test statistic:

        c(t, µ, Σ) = max_{k=1,...,pq} |(t − µ)_k| / √(Σ)_{kk}

    Why so complex? ⇒ To cover all cases: multi-categorical X or Y, different scales.
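This framework is implemented in the coin package (not mentioned on this slide); a minimal sketch applying it to the two-treatment example, using a maximum-type statistic and a Monte Carlo approximation of the permutation distribution:

```r
library(coin)

dat <- data.frame(
  y     = c(0.5, 0.9, 1.1, 1.2, 1.5, 1.9, 2.0, 2.1, 2.3, 2.8),
  group = factor(c("B", "B", "B", "B", "A", "A", "A", "A", "B", "A"))
)

## c_max statistic; null distribution approximated by resampling
independence_test(y ~ group, data = dat,
                  teststat = "maximum",
                  distribution = "approximate")
```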
  • End of the excursion. Let's get back to business.
  • Ctree algorithm with permutation tests
    1. Stop criterion: test the global null hypothesis H0 of independence between Y and all Xj, where H0 = ∩_{j=1}^m H0^j and H0^j : D(Y | Xj) = D(Y) (one permutation test per Xj). If H0 is not rejected (no significance for any Xj) ⇒ stop.
    2. Select the variable Xj* with the strongest association (smallest p-value).
    3. Search the best split point for Xj* (maximal test statistic c) and partition the data.
    4. Repeat steps 1-3 for both of the new partitions.
  • Permutation tests for the stop criterion
    - Choose the influence function h for Y
    - Choose a transformation function gj for each Xj
    - Test each variable Xj separately for association with Y (H0^j : D(Y | Xj) = D(Y), i.e. variable Xj has no influence on Y)
    - Global H0 = ∩_{j=1}^m H0^j : no variable has influence on Y
    - Testing the global H0 is a multiple testing problem ⇒ adjust α (Bonferroni correction, ...); see the sketch below
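In the party package used later in the slides, the adjustment and the stopping rule are configured through ctree_control; a minimal sketch (these option values are party's documented defaults, not taken from the slides):

```r
library(party)

## Bonferroni-adjust the m per-variable permutation tests and stop a
## branch when 1 - (adjusted p-value) falls below mincriterion,
## i.e. a nominal level of alpha = 0.05:
ctl <- ctree_control(testtype = "Bonferroni", mincriterion = 0.95)

## used as, e.g.: ctree(DEXfat ~ ., data = bodyfat, controls = ctl)
```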
  • Permutation tests for variable selection
    - Choose the variable with the smallest p-value for the split
    - Note: switching to the p-value scale removes the scaling problem across covariables of different measurement scales
  • Test statistic for the best split point. Use the test statistic instead of Gini/SSE for the split point search:

        Tj^A(Ln, w) = vec( Σ_{i=1}^n wi I(Xji ∈ A) · h(Yi, (Y1, ..., Yn))^T )

    Standardized test statistic: c = max_k |(Tj^A − µ)_k| / √(Σ)_{kk}. It measures the discrepancy between {Yi | Xji ∈ A} and {Yi | Xji ∉ A}. Calculate c for all possible splits and choose the split point with maximal c. This covers different scales of Y and X. A simplified sketch of the search follows below.
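A simplified R sketch of this search for a numeric covariable and a numeric response; this only illustrates the idea and is not party's implementation. The standardization uses the permutation mean and variance of the left-node sum of Y:

```r
## Return the split point a that maximizes |T - mu| / sigma,
## where T = sum of y over the left node {x <= a}:
best_split <- function(x, y) {
  n    <- length(y)
  cand <- sort(unique(x))
  cand <- cand[-length(cand)]               # candidate split points
  stat <- sapply(cand, function(a) {
    nl <- sum(x <= a)                       # left-node size
    t  <- sum(y[x <= a])                    # linear statistic T
    mu <- nl * mean(y)                      # permutation expectation
    sg <- sqrt(nl * (n - nl) / n * var(y))  # permutation std. deviation
    abs(t - mu) / sg
  })
  cand[which.max(stat)]
}
```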
  • Usage examples with R. Let's get the party started!
  • Bodyfat: example for continuous regression
    - Example: bodyfat data; predict body fat from anthropometric measurements
    - Data: measurements of 71 healthy women
    - Response Y: body fat measured by DXA (numeric)
    - Covariables X: different body measurements (numeric), e.g. waist circumference, breadth of the knee, ...
    - Influence and transformation functions: h(Yi) = Yi, g(Xij) = Xij
    - Tj(Ln, w) = Σ_{i=1}^n wi Xij Yi
    - c = |t − µ| / σ ∝ ( Σ_{i: node} Xij Yi − n_node X̄j Ȳ ) / √( Σ_{i: node} (Xij − X̄j)² · Σ_{i: node} (Yi − Ȳ)² ), the Pearson correlation coefficient
  • Bodyfat: R code

        library("party")
        library("rpart")
        library("rpart.plot")
        data(bodyfat, package = "mboost")
        ## conditional tree
        cond_tree <- ctree(DEXfat ~ ., data = bodyfat)
        ## normal tree
        classic_tree <- rpart(DEXfat ~ ., data = bodyfat)
  • Bodyfat: conditional tree. plot(cond_tree)
    [Tree plot: node 1 splits on hipcirc at 108 (p < 0.001). The left branch splits on anthro3c at 3.76 (p < 0.001), then further on anthro3c at 3.39 (p = 0.001) and on waistcirc at 86 (p = 0.003); the right branch splits on kneebreadth at 10.6 (p = 0.006). Six terminal nodes (n = 13, 12, 13, 7, 19, 7), shown as boxplots of DEXfat.]
  • Bodyfat: CART tree. rpart.plot(classic_tree)
    [Tree plot: root split on waistcirc at 88; further splits on anthro3c at 3.4 and on hipcirc at 110 and 101; leaf predictions 17, 35, 45, 23 and 30.]
    ⇒ Structurally different trees!
  • Glaucoma: example for classification
    - Predict glaucoma (an eye disease) from laser scanning measurements
    - Response Y: binary, y ∈ {glaucoma, normal}
    - Covariables X: different volumes and areas of the eye (all numeric)
    - Influence function: h = eJ(Yi) = (1, 0)^T for glaucoma and (0, 1)^T for normal; transformation: g(Xij) = Xij
    - Tj(Ln, w) = vec( Σ_{i=1}^n wi Xij eJ(Yi)^T ) = ( n_glaucoma · X̄j,glaucoma, n_normal · X̄j,normal )^T
    - c ∝ max_{group ∈ {glaucoma, normal}} n_group · |X̄j,group − X̄j,node|
  • Glaucoma: R code

        library("rpart")
        library("party")
        data("GlaucomaM", package = "ipred")
        cond_tree <- ctree(Class ~ ., data = GlaucomaM)
        classic_tree <- rpart(Class ~ ., data = GlaucomaM)
  • Glaucoma: conditional tree
    [Tree plot: root node (n = 196); inner nodes 2 (n = 87) and 5 (n = 109); terminal nodes 3 (n = 79), 4 (n = 8), 6 (n = 65) and 7 (n = 44), shown as bar plots of the class proportions glaucoma/normal.]

        ## 1) vari <= 0.059; criterion = 1, statistic = 71.475
        ##   2) vasg <= 0.066; criterion = 1, statistic = 29.265
        ##     3)* weights = 79
        ##   2) vasg > 0.066
        ##     4)* weights = 8
        ## 1) vari > 0.059
        ##   5) tms <= -0.066; criterion = 0.951, statistic = 11.221
        ##     6)* weights = 65
        ##   5) tms > -0.066
        ##     7)* weights = 44
  • Glaucoma: CART tree. rpart.plot(classic_tree, cex = 1.5)
    [Tree plot: root split on varg at 0.21; further splits on mhcg (≥ 0.17), vars (< 0.064), tms (≥ −0.066) and eas (< 0.45); leaves labelled glaucoma or normal.]
  • Appendix: examples of other scales
    - Y categorical, X categorical: h = eJ(Yi), g = eK(Xij) ⇒ T is the vectorized contingency table of Xj and Y. [Mosaic plot of Y vs. Xj with Pearson residuals between −2.08 and 1.64; p-value = 0.009.]
    - Y and Xj numeric: h = rg(Yi), g = rg(Xij) (rank transformations) ⇒ Spearman's rho; see the sketch below
    - Flexible T for different situations: multivariate regression, ordinal regression, censored regression, ...
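The rank-based case is available directly in coin as spearman_test; a minimal sketch on made-up toy data (the values are not from the slides):

```r
library(coin)

## Spearman correlation as a permutation test on ranks
d <- data.frame(x = c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3),
                y = c(2.0, 1.8, 3.5, 4.1, 5.9, 5.5))
spearman_test(y ~ x, data = d, distribution = "approximate")
```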
  • Properties
    - Prediction accuracy: not better than normal trees, but not worse either
    - Computational considerations: same speed as normal trees
    - Two possible interpretations of the significance level α: (1) a pre-specified nominal level of the underlying association tests, or (2) a simple hyperparameter determining the tree size
    - A low α yields smaller trees; see the sketch below
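A minimal sketch of α as a tuning knob, reusing the bodyfat example from earlier; in party, mincriterion = 1 − α, and the two α values here are arbitrary illustrations:

```r
library(party)
data(bodyfat, package = "mboost")

## alpha = 0.01: stricter tests, earlier stopping, smaller tree
small_tree <- ctree(DEXfat ~ ., data = bodyfat,
                    controls = ctree_control(mincriterion = 0.99))

## alpha = 0.20: looser tests, later stopping, larger tree
large_tree <- ctree(DEXfat ~ ., data = bodyfat,
                    controls = ctree_control(mincriterion = 0.80))
```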
  • Summary: conditional trees
    - Not heuristics, but non-parametric models with a well-defined theoretical background
    - Suitable for regression with arbitrary scales of Y and X
    - Unbiased variable selection
    - No overfitting
    - Conditional trees are structurally different from trees grown with exhaustive search procedures
  • Literature and software
    - J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, 2001.
    - T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674, 2006.
    - H. Strasser and C. Weber. On the asymptotic theory of permutation statistics. 1999.
    - R packages, all available on CRAN: rpart (recursive partitioning), rpart.plot (plot function for rpart), party (A Laboratory for Recursive Partytioning)
  • Appendix: competitors. Other partitioning algorithms in this area:
    - CHAID: nominal response, χ² test, multiway splits, nominal covariables
    - GUIDE: continuous response only, p-value from a χ² test, categorizes continuous covariables
    - QUEST: ANOVA F-test for continuous responses, χ² test for nominal ones, comparison on the p-value scale ⇒ reduces selection bias
    - CRUISE: multiway splits, discriminant analysis in each node, unbiased variable selection
  • Appendix: properties of the test statistic T. With w. = Σ_{i=1}^n wi, the conditional mean and covariance of Tj, given all permutations S(Ln, w) of the responses, are

        µj = E(Tj(Ln, w) | S(Ln, w)) = vec( ( Σ_{i=1}^n wi gj(Xji) ) E(h | S(Ln, w))^T )

        Σj = V(Tj(Ln, w) | S(Ln, w))
           = w./(w. − 1) · V(h | S(Ln, w)) ⊗ ( Σ_i wi gj(Xji) ⊗ gj(Xji)^T )
             − 1/(w. − 1) · V(h | S(Ln, w)) ⊗ ( Σ_i wi gj(Xji) ) ⊗ ( Σ_i wi gj(Xji) )^T

    where

        E(h | S(Ln, w)) = w.^{−1} Σ_i wi h(Yi, (Y1, ..., Yn)) ∈ R^q

        V(h | S(Ln, w)) = w.^{−1} Σ_i wi ( h(Yi, (Y1, ..., Yn)) − E(h | S(Ln, w)) )( h(Yi, (Y1, ..., Yn)) − E(h | S(Ln, w)) )^T