Conditional Trees
                or
 Unbiased recursive partitioning
A conditional inference framework

        Christoph Molnar
    Supervisor: Stephanie Möst

     Department of Statistics, LMU


        18 December 2012




                                     1 / 36
Overview


   Introduction and Motivation

   Algorithm for unbiased trees

   Conditional inference with permutation tests

   Examples

   Properties

   Summary




                                                  2 / 36
CART trees



      Model: Y = f (X )
      Structure of decision trees
      Recursive partitioning of covariable space X
      Split optimizes criterion (Gini, information gain, sum of
      squares) depending on scale of Y
      Split point search: exhaustive search procedure
      Avoid overfitting: Early stopping or pruning
      Usage: prediction and explanation
      Other tree types: ID3, C4.5, CHAID, . . .




                                                                  3 / 36
What are conditional trees?




       Special kind of trees
       Recursive partitioning with binary splits and early stopping
       Constant models in terminal nodes
       Variable selection, early stopping and split point search based
       on conditional inference
       Uses permutation tests for inference
       Solves problems of CART trees




                                                                         4 / 36
Why conditional trees?



   Helps to overcome problems of trees:
       overfitting (can be solved with other techniques as well)
       Selection bias towards covariables with many possible splits
       (e.g. numeric or multi-categorical covariables)
       Difficult interpretation due to selection bias
       Variable selection: No concept of statistical significance
       Not all scales of Y and X covered (ID3, C4.5, ...)




                                                                      5 / 36
Simulation: selection bias



       Variable selection is unbiased ⇔ the probability of selecting a
       covariable that is independent of Y is the same for all
       independent covariables
       The measurement scale of a covariable shouldn't play a role
       Simulation illustrating the selection bias:
       Y  ∼ N(0, 1)
       X1 ∼ M(n, (1/2, 1/2))
       X2 ∼ M(n, (1/3, 1/3, 1/3))
       X3 ∼ M(n, (1/4, 1/4, 1/4, 1/4))
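
       A minimal R sketch of this setup (an illustration, not the original
       simulation code; sample size, number of replications and the rpart
       control parameters are assumptions; cp = 0 forces a split so that the
       selection frequencies of x1, x2 and x3 can be compared):

       library("rpart")
       set.seed(1)
       n <- 100
       first_split <- replicate(1000, {
         d <- data.frame(y  = rnorm(n),
                         x1 = factor(sample(1:2, n, replace = TRUE)),
                         x2 = factor(sample(1:3, n, replace = TRUE)),
                         x3 = factor(sample(1:4, n, replace = TRUE)))
         fit <- rpart(y ~ ., data = d,
                      control = rpart.control(maxdepth = 1, cp = 0, minsplit = 20))
         as.character(fit$frame$var[1])   ## "<leaf>" means no split was made
       })
       prop.table(table(first_split))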




                                                                     6 / 36
Simulation: results
       Selection frequencies for the first split:
       X1 : 0.128, X2 : 0.302, X3 : 0.556, none: 0.014


       Strongly biased towards variables with many possible splits
       Example of a tree:
       [Tree diagram: root split on x3 ∈ {1, 2}; left terminal node with mean −0.19;
        right branch splits on x2 ∈ {1, 3}, terminal node means −0.098 and 0.36]
       Overfitting! (Note: complexity parameter not cross-validated)
       Desirable here: No split at all
       Problem source: Exhaustive search through all variables and all
       possible split points
       Numeric/multi-categorical covariables have more split options ⇒
       multiple comparison problem                                       7 / 36
Idea of conditional trees




       Variable selection and search for split point ⇒ two steps
       Embed all decisions into hypothesis tests
       All tests with conditional inference (permutation tests)




                                                                   8 / 36
Ctree algorithm



    1   Stop criterion
            Test the global null hypothesis H0 of independence between Y and
            all Xj with
            H0 = ∩_{j=1}^m H0^j and H0^j : D(Y | Xj) = D(Y)
            If H0 is not rejected ⇒ stop
    2   Select the variable Xj* with the strongest association
    3   Search the best split point for Xj* and partition the data
    4   Repeat steps 1), 2) and 3) for both of the new partitions




                                                                           9 / 36
How can we test the hypothesis of independence?




       Parametric tests depend on distribution assumptions
       Problem: Unknown conditional distribution
       D(Y |X ) = D(Y |X1 , ..., Xm ) = D(Y |f (X1 ), ..., f (Xm ))
       Need for a general framework that can handle arbitrary scales
   Let the data speak: ⇒ permutation tests!




                                                                      10 / 36
Excursion: permutation tests




                               11 / 36
Permutation tests: simple example


      Possible treatments for disease: A or B
      Numeric measurement (blood value)
      Question: Do the blood values differ between treatment A and B?
      ⇔ µA ≠ µB ?
      Test statistic: T0 = µ̂A − µ̂B
      H0 : µA − µB = 0,       H1 : µA − µB ≠ 0
      Distribution unknown ⇒ Permutation test

        [Dot plot of the measurements y, coloured by treatment A/B, omitted]

      T0 = µ̂A − µ̂B = 2.06 − 1.2 = 0.86


                                                                   12 / 36
Permute

  Original data:

          B     B     B     B     A     A     A     A     B     A
          0.5   0.9   1.1   1.2   1.5   1.9   2.0   2.1   2.3   2.8

  One possible permutation:

          B     B     B     B     A     A     A     A     B     A
          2.8   2.3   1.1   1.9   1.2   2.1   1.5   0.5   0.9   2.0

      Permute the assignment of the labels (A and B) to the numeric measurements
      Calculate test statistic T for each permutation
      Do this with all possible permutations
      Result: Distribution of test statistic conditioned on sample

                                                                      13 / 36
P-value and decision


                     k = #{permutation samples : |µ̂A,perm − µ̂B,perm| > |µ̂A − µ̂B|}
                     p-value = k / #permutations
                     p-value < α = 0.05? ⇒ If yes, H0 can be rejected




              [Density plot of the permutation distribution of the difference of
               means per treatment, with the original test statistic marked]




                                                                                                         14 / 36
General algorithm for permutation tests


       Requirement: Under H0 response and covariables are
       exchangeable
       Do the following:
         1   Calculate test statistic T0
         2   Calculate test statistic T for all permutations of pairs Y , X
          3   Compute nextreme : count the number of T which are more
              extreme than T0
          4   p-value p = nextreme / npermutations
         5   Reject H0 if p < α, with significance level α
        If the number of possible permutations is too big, draw random
        permutations in step 2 (Monte Carlo sampling)
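
        A small R sketch of this Monte Carlo variant, reusing the ten
        observations from the previous slide (the number of random
        permutations is an arbitrary choice):

        set.seed(1)
        y     <- c(0.5, 0.9, 1.1, 1.2, 1.5, 1.9, 2.0, 2.1, 2.3, 2.8)
        treat <- c("B", "B", "B", "B", "A", "A", "A", "A", "B", "A")
        ## 1) observed test statistic
        t0 <- mean(y[treat == "A"]) - mean(y[treat == "B"])
        ## 2) test statistic for random permutations of the labels
        t_perm <- replicate(10000, {
          g <- sample(treat)
          mean(y[g == "A"]) - mean(y[g == "B"])
        })
        ## 3) + 4) p-value: share of permutations at least as extreme as t0
        p_value <- mean(abs(t_perm) >= abs(t0))
        ## 5) reject H0 if p_value < alpha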




                                                                              15 / 36
Framework by Strasser and Weber



       General test statistic:
       Tj(Ln, w) = vec( Σ_{i=1}^n wi gj(Xij) h(Yi, (Y1, ..., Yn))^T ) ∈ R^(pj·q)
       h is called the influence function, gj is a transformation of Xj
       Choose gj and h depending on the scale
       It is possible to calculate µ and Σ of T (see appendix)
       Standardized test statistic: c(t, µ, Σ) = max_{k=1,...,pq} |(t − µ)_k| / √((Σ)_kk)

       Why so complex? ⇒ To cover all cases: multi-categorical X or Y,
       different scales
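
       The same linear-statistic framework is implemented in the R package coin
       (conditional inference procedures in a permutation test framework); a
       sketch, using the bodyfat data that appears later in this talk:

       library("coin")
       data(bodyfat, package = "mboost")
       ## permutation test of independence between DEXfat and waistcirc,
       ## based on the standardized linear statistic c
       independence_test(DEXfat ~ waistcirc, data = bodyfat)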




                                                                                   16 / 36
End of excursion
Lets get back to business




                            17 / 36
Ctree algorithm with permutation tests


     1   Stop criterion
             Test the global null hypothesis H0 of independence between Y and
             all Xj with
             H0 = ∩_{j=1}^m H0^j and H0^j : D(Y | Xj) = D(Y)
             (one permutation test per Xj)
             If H0 is not rejected (no significance for any Xj) ⇒ stop
     2   Select the variable Xj* with the strongest association (smallest
         p-value)
     3   Search the best split point for Xj* (maximal test statistic c) and
         partition the data
     4   Repeat steps 1), 2) and 3) for both of the new partitions



                                                                            18 / 36
Permutation tests for stop criterion




       Choose an influence function h for Y
       Choose a transformation function g for each Xj
       Test each variable Xj separately for association with Y
       (H0^j : D(Y | Xj) = D(Y), i.e. variable Xj has no influence on Y)
       Global H0 = ∩_{j=1}^m H0^j : no variable has influence on Y
       Testing the global H0 is a multiple-testing problem ⇒ adjust α
       (Bonferroni correction, ...)
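
       A minimal sketch of the stop decision, assuming a hypothetical vector
       p_values holding one permutation-test p-value per covariable in the
       current node (the numbers are made up):

       p_values <- c(x1 = 0.20, x2 = 0.03, x3 = 0.60)          ## made-up p-values
       p_adj    <- p.adjust(p_values, method = "bonferroni")   ## multiplicity adjustment
       if (min(p_adj) > 0.05) {
         message("global H0 not rejected: stop splitting this node")
       } else {
         split_var <- names(which.min(p_adj))                  ## variable for step 2
       }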


                                                                      19 / 36
Permutation tests for variable selection




        Choose the variable with the smallest p-value for the split
        Note: comparing variables on the p-value scale removes the scaling problem




                                                                        20 / 36
Test statistic for best split point




       Use the test statistic instead of Gini/SSE for the split point search
       Tj^A(Ln, w) = vec( Σ_{i=1}^n wi I(Xji ∈ A) · h(Yi, (Y1, . . . , Yn))^T )
       Standardized test statistic: c = max_k |(Tj^A − µ)_k| / √((Σ)_kk)
       Measures the discrepancy between {Yi | Xji ∈ A} and {Yi | Xji ∉ A}
       Calculate c for all possible splits; choose the split point with
       maximal c
       Covers different scales of Y and X
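
       A toy R sketch of this search for numeric Y and X with unit case weights
       and h(Y) = Y (an illustration under these assumptions, not party's
       internal code): c is evaluated for every candidate split "X ≤ a" and the
       split point with maximal c is returned.

       best_split <- function(x, y) {
         n  <- length(y)
         Eh <- mean(y)                          ## conditional expectation of h
         Vh <- mean((y - Eh)^2)                 ## conditional variance of h
         a_cand <- sort(unique(x))
         a_cand <- a_cand[-length(a_cand)]      ## candidate split points "x <= a"
         c_stat <- sapply(a_cand, function(a) {
           nA  <- sum(x <= a)                   ## size of partition A
           t   <- sum(y[x <= a])                ## linear statistic T^A
           mu  <- nA * Eh                       ## conditional mean of T^A
           sig <- Vh * nA * (n - nA) / (n - 1)  ## conditional variance of T^A
           abs(t - mu) / sqrt(sig)
         })
         a_cand[which.max(c_stat)]
       }
       ## example: recovers a split point near 0 for a step function
       set.seed(1)
       x <- runif(200, -1, 1)
       y <- ifelse(x > 0, 1, 0) + rnorm(200, sd = 0.1)
       best_split(x, y)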
                                                                                  21 / 36
Usage examples with R
  - Let’s get the party started -




                                    22 / 36
Bodyfat: example for continuous regression
       Example: bodyfat data
       Predict body fat with anthropometric measurements
       Data: Measurements of 71 healthy women
       Response Y : body fat measured by DXA (numeric)
       Covariables X : different body measurements (numeric)
       For example: waist circumference, breadth of the knee, ...
       h(Yi) = Yi
       g(Xij) = Xij
       Tj(Ln, w) = Σ_{i=1}^n wi Xij Yi
       c = |t − µ| / σ ∝ | Σ_{i: node} Xij Yi − nnode X̄j Ȳ | /
           √( Σ_{i: node} (Yi − Ȳ)² · Σ_{i: node} (Xij − X̄j)² )
       (absolute Pearson correlation coefficient within the node)
                                                                                           23 / 36
Bodyfat: R-code




   library("party")
   library("rpart")
   library("rpart.plot")
   data(bodyfat, package = "mboost")
   ## conditional tree
   cond_tree <- ctree(DEXfat ~ ., data = bodyfat)
   ## normal tree
   classic_tree <- rpart(DEXfat ~ ., data = bodyfat)




                                                       24 / 36
Bodyfat: conditional tree

       plot(cond_tree)
       [Plot of the conditional tree: node 1 splits on hipcirc (p < 0.001) at 108;
        the left branch splits on anthro3c (p < 0.001) at 3.76, then on anthro3c
        (p = 0.001) at 3.39 and on waistcirc (p = 0.003) at 86; the right branch
        splits on kneebreadth (p = 0.006) at 10.6; terminal nodes 4, 5, 7, 8, 10, 11
        with n = 13, 12, 13, 7, 19, 7 and boxplots of DEXfat]
                                                                                                                                                                            25 / 36
Bodyfat: CART tree

   rpart.plot(classic_tree)



    [Plot of the CART tree: root split waistcirc < 88; the left branch splits on
     anthro3c < 3.4 and then hipcirc < 101, the right branch on hipcirc < 110;
     terminal node predictions roughly 17, 23, 30, 35 and 45]




   ⇒ Structurally different trees!
                                                                                  26 / 36
Glaucoma: example for classification

      Predict Glaucoma (= eye disease) based on laser scanning
      measurements
      Response Y : Binary, y ∈ {Glaucoma, normal}
      Covariables X : Different volumes and areas of the eye (all
      numeric)
      h = eJ(Yi) = (1, 0)^T if Yi = Glaucoma, (0, 1)^T if Yi = normal
      g(Xij) = Xij
      Tj(Ln, w) = vec( Σ_{i=1}^n wi Xij eJ(Yi)^T )
                = ( nGlaucoma · X̄j,Glaucoma , nnormal · X̄j,normal )^T
      c ∝ max_group ngroup · (X̄j,group − X̄j,node),   group ∈ {Glaucoma, normal}
                                                                            27 / 36
Glaucoma: R-code




  library("rpart")
  library("party")
  data("GlaucomaM", package = "ipred")
  cond_tree <- ctree(Class ~ ., data = GlaucomaM)
  classic_tree <- rpart(Class ~ ., data = GlaucomaM)




                                                       28 / 36
Glaucoma: conditional tree
    [Plot of the conditional tree: root node 1 (n = 196); inner nodes 2 (n = 87)
     and 5 (n = 109); terminal nodes 3 (n = 79), 4 (n = 8), 6 (n = 65) and
     7 (n = 44), each with a bar chart of the glaucoma/normal proportions.
     The split variables and test statistics appear in the printed tree below.]

   ## 1) vari <= 0.059; criterion = 1, statistic = 71.475
   ##   2) vasg <= 0.066; criterion = 1, statistic = 29.265
   ##     3)* weights = 79
   ##   2) vasg > 0.066
   ##     4)* weights = 8
   ## 1) vari > 0.059
   ##   5) tms <= -0.066; criterion = 0.951, statistic = 11.221
   ##     6)* weights = 65
   ##   5) tms > -0.066
   ##     7)* weights = 44
                                                                                                                                                                                  29 / 36
Glaucoma: CART tree

  rpart.plot(classic_tree, cex = 1.5)

    [Plot of the CART tree: root split varg < 0.21 (yes → glaucoma); then
     mhcg >= 0.17 (yes → glaucoma), vars < 0.064 (yes → glaucoma),
     tms >= −0.066 (no → normal) and eas < 0.45 (yes → glaucoma, no → normal)]


                                                           30 / 36
Appendix: Examples of other scales
       Y categorical, X categorical
           h = eJ(Yi), g = eK(Xij)
           ⇒ T is the vectorized contingency table of Xj and Y
           [Mosaic plot of the 3×3 contingency table of Xj and Y, shaded by
            Pearson residuals (range −2.08 to 1.64); p-value = 0.009]




       Y and Xj numeric: h = rank(Yi), g = rank(Xij) ⇒ Spearman's rho
       (see the sketch below)
      Flexible T for different situations: Multivariate regression,
      ordinal regression, censored regression, . . .
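
       For illustration only: the rank-based case is directly available in the
       coin package; a sketch with the bodyfat data (the choice of variables is
       arbitrary):

       library("coin")
       data(bodyfat, package = "mboost")
       spearman_test(DEXfat ~ waistcirc, data = bodyfat)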
                                                                     31 / 36
Properties




       Prediction accuracy: Not better than normal trees, but not
       worse either
       Computational considerations: Same speed as normal trees.
       Two possible interpretations of significance level α:
             1. Pre-specified nominal level of underlying association tests
             2. Simple hyper parameter determining the tree size
             Low α yields smaller trees
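
             In party, α enters ctree via ctree_control(mincriterion = 1 − α);
             a sketch with the bodyfat data from earlier (0.99 is just an
             example value):

             library("party")
             data(bodyfat, package = "mboost")
             ## alpha = 0.01, i.e. mincriterion = 0.99: stricter tests, smaller tree
             small_tree <- ctree(DEXfat ~ ., data = bodyfat,
                                 controls = ctree_control(mincriterion = 0.99))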




                                                                             32 / 36
Summary conditional trees




      Not heuristics, but non-parametric models with well-defined
      theoretical background
      Suitable for regression with arbitrary scales of Y and X
      Unbiased variable selection
      No overfitting
      Conditional trees structurally different from trees partitioned
      with exhaustive search procedures




                                                                       33 / 36
Literature and Software
       J. Friedman, T. Hastie, and R. Tibshirani.
       The elements of statistical learning, volume 1.
       Springer Series in Statistics, 2001.
       T. Hothorn, K. Hornik, and A. Zeileis.
       Unbiased recursive partitioning: A conditional inference
       framework.
       Journal of Computational and Graphical Statistics, 15(3):
       651–674, 2006.
         H. Strasser and C. Weber.
         On the asymptotic theory of permutation statistics.
         Mathematical Methods of Statistics, 8:220–250, 1999.
   R-packages:
        rpart: Recursive partitioning
        rpart.plot: Plot function for rpart
        party: A Laboratory for Recursive Partytioning
   All available on CRAN
                                                                   34 / 36
Appendix: Competitors



   Other partitioning algorithms in this area:
       CHAID: Nominal response, χ2 test, multiway splits, nominal
       covariables
       GUIDE: Continuous response only, p-value from χ2 test,
       categorizes continuous covariables
       QUEST: ANOVA F-Test for continuous response, χ2 test for
       nominal, compare on p-scale ⇒ reduces selection bias
       CRUISE: Multiway splits, discriminant analysis in each node,
       unbiased variable selection




                                                                      35 / 36
Appendix: Properties of test statistic T
                                                       n
       µj = E(Tj(Ln, w) | S(Ln, w)) = vec( ( Σ_{i=1}^n wi gj(Xji) ) E(h | S(Ln, w))^T )

       Σj = V(Tj(Ln, w) | S(Ln, w))
          =  w. / (w. − 1) · V(h | S(Ln, w)) ⊗ ( Σ_i wi gj(Xji) ⊗ wi gj(Xji)^T )
          −  1 / (w. − 1) · V(h | S(Ln, w)) ⊗ ( Σ_i wi gj(Xji) ) ⊗ ( Σ_i wi gj(Xji) )^T

       w. = Σ_{i=1}^n wi

       E(h | S(Ln, w)) = w.^(−1) Σ_i wi h(Yi, (Y1, . . . , Yn)) ∈ R^q

       V(h | S(Ln, w)) = w.^(−1) Σ_i wi ( h(Yi, (Y1, . . . , Yn)) − E(h | S(Ln, w)) )
                                        ( h(Yi, (Y1, . . . , Yn)) − E(h | S(Ln, w)) )^T
                                                                                               36 / 36
