Introduction to cart_2007

  1. Data Mining with Decision Trees: An Introduction to CART®. Dan Steinberg, Mikhail Golovnya. Introduction to CART® Salford Systems ©2007
  2. Intro CART Outline: historical background; sample CART model rules; binary recursive partitioning; introduction to splitting; fundamentals of tree pruning; competitors and surrogates; missing value handling; variable importance; penalties and structured trees; constrained trees; introduction to priors and costs; cross-validation; stable trees and train-test consistency (TTC); hot-spot analysis; model scoring and translation; batteries of CART runs.
  3. Historical Background
  4. In the Beginning… In 1984, Berkeley and Stanford statisticians announced a remarkable new classification tool which: could separate relevant from irrelevant predictors; did not require any kind of variable transforms (logs, square roots); was impervious to outliers and missing values; could yield relatively simple, easy-to-comprehend models; required little to no supervision by the analyst; and was frequently more accurate than traditional logistic regression, discriminant analysis, and other parametric tools.
  5. The Years of Struggle. CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons: the monograph introducing CART is challenging to read, a brilliant book overflowing with insights into tree-growing methodology but fairly technical and brief; the method was not expounded in any textbooks; it was originally taught only in advanced graduate statistics classes at a handful of leading universities; the original standalone software came with slender documentation, and its output was not self-explanatory; and the method was a radical departure from conventional statistics.
  6. The Final Triumph. Rising interest in data mining: availability of huge data sets requiring analysis; the need to automate, accelerate, and improve the analysis process; comparative performance studies. Advantages of CART over other tree methods: handling of missing values; assistance in interpretation of results (surrogates); performance advantages in speed and accuracy. New software and documentation made the techniques accessible to end users; word of mouth was generated by early adopters; easy-to-use professional software became available from Salford Systems.
  7. Sample CART Model
  8. So What is CART? Best illustrated with a famous example, the UCSD Heart Disease study. Given the diagnosis of a heart attack based on chest pain, indicative EKGs, and elevation of enzymes typically released by damaged heart muscle, predict who is at risk of a second heart attack and early death within 30 days. The prediction will determine the treatment program (intensive care or not). For each patient about 100 variables were available, including demographics, medical history, and lab results; 19 noninvasive variables were used in the analysis.
  9. Typical CART Solution. An example of a CLASSIFICATION tree: the dependent variable is categorical (SURVIVE, DIE) and we want to predict class membership. [Tree diagram: the root node (215 patients: 178 SURVIVE 82.8%, 37 DEAD 17.2%) splits on "Is BP <= 91?". The low-BP side is Terminal Node A (6 SURVIVE 30.0%, 14 DEAD 70.0%, NODE = DEAD). The remaining 195 patients (172 SURVIVE 88.2%, 23 DEAD 11.8%) split on "Is AGE <= 62.5?"; the younger side is Terminal Node B (102 SURVIVE 98.1%, 2 DEAD 1.9%, NODE = SURVIVE). The remaining 91 patients (70 SURVIVE 76.9%, 21 DEAD 23.1%) split on "Is SINUS <= .5?" into Terminal Node C (14 SURVIVE 50.0%, 14 DEAD 50.0%, NODE = DEAD) and Terminal Node D (56 SURVIVE 88.9%, 7 DEAD 11.1%, NODE = SURVIVE).]
  10. How to Read It. The entire tree represents a complete analysis or model in the form of a decision tree. The root of the inverted tree contains all the data; the root gives rise to child nodes, which can in turn give rise to their own children. At some point a given path ends in a terminal node, which classifies the object. The path through the tree is governed by the answers to QUESTIONS or RULES.
  11. Decision Questions. A question has the form: is the statement TRUE? Is continuous variable X ≤ c? Does categorical variable D take on levels i, j, or k (e.g. is geographic region 1, 2, 4, or 7)? Standard split: if the answer to the question is YES, a case goes left; otherwise it goes right. This is the form of all primary splits. The question is formulated so that only two answers are possible, called binary partitioning. In CART the YES answer always goes left.
  12. The Tree is a Classifier. Classification is determined by following a case's path down the tree. Terminal nodes are associated with a single class; any case arriving at a terminal node is assigned to that class. The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later). Other algorithms often use only the majority rule, or a limited set of fixed rules, for node class assignment and are thus less flexible. In standard classification, tree assignment is not probabilistic; another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later). With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes.
  13. The Importance of Binary Splits. Some researchers have suggested using 3-way, 4-way, etc. splits at each node (univariate). There are strong theoretical and practical reasons against such approaches: exhaustive search for a binary split has linear algorithmic complexity, while the alternatives have much higher complexity; choosing the best binary split is "less ambiguous" than the suggested alternatives; most importantly, binary splits result in the lowest rate of data fragmentation, allowing a larger set of predictors to enter a model; and any multi-way split can be replaced by a sequence of binary splits. There are numerous practical illustrations of the validity of these claims.
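The last point is easy to verify on a toy case: a hypothetical 3-way split on a region code and an equivalent pair of binary splits route every case identically. The region names below are illustrative, not from CART.

```python
# A hypothetical 3-way split on a region code {1, 2, 3} versus the
# equivalent pair of binary questions. Both route every case to the
# same bucket, so nothing is lost by restricting CART to binary splits.

def three_way(region):
    # one node with three children
    return {1: "west", 2: "central", 3: "east"}[region]

def binary_pair(region):
    if region == 1:          # first binary question: is region 1?
        return "west"
    # second binary question, asked only on the right branch
    return "central" if region == 2 else "east"

assert all(three_way(r) == binary_pair(r) for r in (1, 2, 3))
```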
  14. Accuracy of a Tree. For classification trees, all objects reaching a terminal node are classified in the same way: e.g. all heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS, regardless of other medical history and lab results. The performance of any classifier can be summarized in terms of the prediction success table (a.k.a. confusion matrix). The key to comparing the performance of different models is how one "wraps" the entire prediction success matrix into a single number: overall accuracy, average within-class accuracy, or, most generally, the expected cost.
  15. Prediction Success Table.
     Classified As →                Class 1 "Survivors"   Class 2 "Early Deaths"   Total   % Correct
     TRUE Class 1 "Survivors"               158                    20               178      88%
     TRUE Class 2 "Early Deaths"              9                    28                37      75%
     Total                                  167                    48               215      86.51%
     A large number of different terms exist to name the various simple ratios based on the table above: sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
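The three summary numbers named above (overall accuracy, within-class accuracy, and their average) can be recomputed directly from the slide's table:

```python
# The prediction success table from the slide, recomputed. Rows are the
# true classes, columns the predicted classes; counts are the slide's
# heart-disease results.
table = [[158, 20],   # true class 1, "survivors"
         [9,   28]]   # true class 2, "early deaths"

total = sum(sum(row) for row in table)
within_class = [row[i] / sum(row) for i, row in enumerate(table)]
overall = (table[0][0] + table[1][1]) / total
average = sum(within_class) / len(within_class)

print(f"overall accuracy:      {overall:.2%}")
print(f"within-class accuracy: {[f'{a:.1%}' for a in within_class]}")
print(f"average within-class:  {average:.2%}")
```

The slide's 88% and 75% row figures are these within-class ratios rounded down; 86.51% is the overall accuracy.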
  16. Tree Interpretation and Use. A practical decision tool: trees like this are used for real-world decision making, as rules for physicians, or as a decision tool for a nationwide team of salesmen needing to classify potential customers. Selection of the most important prognostic variables: a variable screen for parametric model building; the example data set had 100 variables, and the results can be used to find variables to include in a logistic regression. Insight: in our example AGE is not relevant if BP is not high, which suggests which interactions will be important.
  17. Introducing the CART Interface
  18. Classification with CART®. A real-world study from the early 1990s: a fixed-line service provider offering a new mobile phone service wants to identify the customers most likely to accept the new mobile offer. The data set is based on a limited market trial.
  19. Real World Trial. 830 households were offered a mobile phone package. All were offered an identical package, but pricing was varied at random: handset prices ranged from low to high values, and per-minute prices ranged separately from low to high rates. Each household was asked to make a yes or no decision on the offer; 15.2% of the households accepted. Our goal was to answer two key questions: who to make offers to, and how to price?
  20. Setting Up the Model. The only requirement is to select the TARGET (dependent) variable; CART will do everything else automatically.
  21. Model Results. CART completes the analysis and gives access to all results from the NAVIGATOR, shown on the next slide. The upper section displays the tree of a selected size; the lower section displays the error rate for trees of all possible sizes. A green bar marks the most accurate tree. We display a compact 10-node tree for further scrutiny.
  22. Navigator Display. The most accurate tree is marked with the green bar. We select the 10-node tree for the convenience of a more compact display.
  23. Root Node. The root node displays details of the TARGET variable in the overall training data; we see that 15.2% of the 830 households accepted the offer. The goal of the analysis is now to extract patterns characteristic of responders. If we could use only a single piece of information to separate responders from non-responders, CART chooses HANDSET PRICE: those offered the phone at a price > 130 contain only 9.9% responders, while those offered a lower price respond at 21.9%.
  24. Maximal Tree. The maximal tree is the raw material for the best model. The goal is to find the optimal tree embedded inside the maximal tree; we will find it via "pruning," like backwards stepwise regression.
  25. Main Drivers (red = good response, blue = poor response). High values of a split variable always go to the right; low values go left.
  26. Examine Individual Node. Hover the mouse over a node to see inside. Even though this node is on the "high price" side of the tree, it still exhibits the strongest response across all terminal-node segments (43.5% response). Here we select rules expressed in C; the entire tree can also be rendered in Java, XML/PMML, or SAS.
  27. Configure Print Image. A variety of fine controls to position and scale the image nicely.
  28. Variable Importance Ranking. CART offers multiple ways to compute variable importance.
  29. Predictive Accuracy. This model is not very accurate but ranks responders well.
  30. Gains and ROC Curves. In the top decile the model captures about 40% of responders.
  31. The Big Picture
  32. Binary Recursive Partitioning. The data is split into two partitions, hence "binary"; partitions can in turn be split into sub-partitions, hence the procedure is "recursive." A CART tree is generated by repeated dynamic partitioning of the data set, which automatically determines which variables to use and which regions to focus on. The whole process is entirely data driven.
  33. Three Major Stages. Tree growing: consider all possible splits of a node; find the best split according to the given splitting rule (best improvement); continue sequentially until the largest tree is grown. Tree pruning: create a sequence of nested trees (the pruning sequence) by systematic removal of the weakest branches. Optimal tree selection: use a test sample or cross-validation to find the best tree in the pruning sequence.
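The pruning stage hinges on finding the "weakest branch." A minimal sketch on a toy tree, assuming each node records `err`, the misclassification count it would incur if collapsed to a leaf: the internal node whose collapse costs the least extra error per pruned leaf is removed first. This is an illustration of the cost-complexity idea, not Salford's implementation.

```python
# Weakest-link (cost-complexity) pruning sketch. A node is a dict with
# 'err' (errors if the node were collapsed to a leaf) and, for internal
# nodes, 'left'/'right' children.

def leaves(node):
    if "left" not in node:
        return 1
    return leaves(node["left"]) + leaves(node["right"])

def subtree_err(node):
    # errors of the subtree as currently grown (sum over its leaves)
    if "left" not in node:
        return node["err"]
    return subtree_err(node["left"]) + subtree_err(node["right"])

def weakest_link(node, path=()):
    """Return (alpha, path) for the internal node cheapest to prune:
    alpha = extra error from collapsing, per leaf removed."""
    if "left" not in node:
        return None
    alpha = (node["err"] - subtree_err(node)) / (leaves(node) - 1)
    best = (alpha, path)
    for side in ("left", "right"):
        child = weakest_link(node[side], path + (side,))
        if child is not None and child[0] < best[0]:
            best = child
    return best

# Toy tree: pruning the right branch costs 2 extra errors for 1 fewer
# leaf (alpha = 2), cheaper than collapsing the root (alpha = 3).
tree = {"err": 10, "left": {"err": 2},
        "right": {"err": 4, "left": {"err": 1}, "right": {"err": 1}}}
print(weakest_link(tree))
```

Repeating this removal until only the root remains yields exactly the nested pruning sequence the slide describes.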
  34. General Workflow. Stage 1 (Learn): build from historical data. Stage 2 (Test): build a sequence of nested trees and monitor performance to pick the best. Stage 3 (Validate): confirm the findings.
  35. Searching All Possible Splits. For any node, CART will examine ALL possible splits. This is computationally intensive, but there are only a finite number of splits. Consider first the variable AGE, which in our data set has minimum value 40. The rule "Is AGE ≤ 40?" will separate out these cases to the left, the youngest persons. Next increase the AGE threshold to the next youngest person: "Is AGE ≤ 43?" will direct six cases to the left. Continue increasing the splitting threshold value by value.
  36. Split Tables. Split tables are used to locate continuous splits; there will be as many candidate splits as there are unique values minus one.
     Sorted by AGE:                    Sorted by BP:
     AGE  BP  SINUST SURVIVE           AGE  BP  SINUST SURVIVE
     40   91    0    SURVIVE           43   78    1    DEAD
     40  110    0    SURVIVE           40   83    1    DEAD
     40   83    1    DEAD              40   91    0    SURVIVE
     43   99    0    SURVIVE           43   99    0    SURVIVE
     43   78    1    DEAD              40  110    0    SURVIVE
     43  135    0    SURVIVE           49  110    1    SURVIVE
     45  120    0    SURVIVE           48  119    1    DEAD
     48  119    1    DEAD              45  120    0    SURVIVE
     48  122    0    SURVIVE           48  122    0    SURVIVE
     49  150    0    DEAD              43  135    0    SURVIVE
     49  110    1    SURVIVE           49  150    0    DEAD
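The threshold search these tables support can be sketched directly: sort the values, evaluate one candidate split between each pair of adjacent distinct values, and keep the best. This sketch assumes Gini impurity as the criterion (one of CART's splitting rules); `best_split` is an illustrative name, not a CART command.

```python
# Exhaustive search over one continuous variable: one candidate split
# between each pair of adjacent unique values, scored by the drop in
# Gini impurity.

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    parent = gini(ys)
    best = (0.0, None)                      # (improvement, threshold)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                        # same value: no new split
        left, right = ys[:i], ys[i:]
        imp = parent - (len(left) * gini(left)
                        + len(right) * gini(right)) / len(ys)
        if imp > best[0]:
            best = (imp, (xs[i - 1] + xs[i]) / 2)
    return best

print(best_split([1, 2, 3, 4], ["A", "A", "B", "B"]))  # → (0.5, 2.5)
```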
  37. Categorical Predictors. Categorical predictors spawn many more splits: just four levels A, B, C, D, for example, result in 7 splits. In general, there will be 2^(K-1) − 1 splits for K levels. CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15.
     #  Left    Right
     1  A       B, C, D
     2  B       A, C, D
     3  C       A, B, D
     4  D       A, B, C
     5  A, B    C, D
     6  A, C    B, D
     7  A, D    B, C
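The 2^(K-1) − 1 count can be checked by enumeration: fix the first level on the left side (swapping the two sides gives the same split) and take every proper subset of the remaining levels. A sketch:

```python
from itertools import combinations

# Enumerate the distinct binary splits of K category levels. Pinning
# the first level to the left side avoids counting each left/right
# swap twice; excluding the full set keeps both sides non-empty.

def categorical_splits(levels):
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = (first,) + combo
            if len(left) < len(levels):
                splits.append(left)
    return splits

print(len(categorical_splits(["A", "B", "C", "D"])))  # → 7
```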
  38. Binary Target Shortcut. When the target is binary, special algorithms reduce compute time to linear in the number of levels (K − 1 splits instead of 2^(K-1) − 1), so 30 levels takes twice as long as 15, not 10,000+ times as long. When you have high-level categorical variables in a multi-class problem: create a binary classification problem; try different definitions of the DPV (which groups to combine); explore the predictor groupings produced by CART; from a study of all results, decide which are most informative; and create a new grouped predictor variable for the multi-class problem.
  39. Pruning and Variable Importance
  40. GYM Model Example. The original data comes from a financial institution, disguised here as a health club. Problem: we need to understand a market research clustering scheme. The clusters were created externally using 18 variables and conventional clustering software; we need simple rules to describe cluster membership. A CART tree provides a nice way to arrive at an intuitive story.
  41. Variable Definitions.
     ID        Identification # of member (not used)
     CLUSTER   Cluster assigned from clustering scheme (10-level categorical coded 1-10) (target)
     ANYRAQT   Racquetball usage (binary indicator coded 0, 1)
     ONRCT     Number of on-peak racquetball uses
     ANYPOOL   Pool usage (binary indicator coded 0, 1)
     ONPOOL    Number of on-peak pool uses
     PLRQTPCT  Percent of pool and racquetball usage
     TPLRCT    Total number of pool and racquetball uses
     ONAER     Number of on-peak aerobics classes attended
     OFFAER    Number of off-peak aerobics classes attended
     SAERDIF   Difference between number of on- and off-peak aerobics visits
     TANNING   Number of visits to tanning salon
     PERSTRN   Personal trainer (binary indicator coded 0, 1)
     CLASSES   Number of classes taken
     NSUPPS    Number of supplements/vitamins/frozen dinners purchased
     SMALLBUS  Small business discount (binary indicator coded 0, 1)
     OFFER     Terms of offer
     IPAKPRIC  Index variable for package price
     MONFEE    Monthly fee paid
     FIT       Fitness score
     NFAMMEN   Number of family members
     HOME      Home ownership (binary indicator coded 0, 1)
     All variables except ID and CLUSTER serve as predictors.
  42. Things to Do. Dataset: gymsmall.syd. Set the target and predictors; set a 50% test sample; navigate the largest, smallest, and optimal trees; check [Splitters] and [Tree Details]; observe the logic of pruning; check [Summary Reports]; understand competitors and surrogates; understand variable importance.
  43. Pruning Multiple Nodes. Note that Terminal Node 5 has class assignment 1 while all the remaining nodes have class assignment 2. Pruning merges Terminal Nodes 5 and 6, creating a "chain reaction" of redundant split removal.
  44. Pruning the Weaker Split First. Eliminating nodes 3 and 4 results in "lesser" damage: losing one correct dominant-class assignment. Eliminating nodes 1 and 2, or nodes 5 and 6, would result in losing one correct minority-class assignment.
  45. Understanding Variable Importance. Both FIT and CLASSES are reported as equally important, yet the tree only includes CLASSES. ONPOOL is one of the main splitters, yet it has very low importance. We want to understand why.
  46. Competitors and Surrogates. Node details for the root-node split are shown above. Improvement measures the class separation between the two sides. Competitors are the next best splits in the given node, ordered by improvement values. Surrogates are the splits most resembling (case by case) the main split; association measures the degree of resemblance. Looking for competitors is free; finding surrogates is expensive. It is possible to have an empty list of surrogates. FIT is a perfect surrogate (and competitor) for CLASSES.
  47. The Utility of Surrogates. Suppose the main splitter is INCOME and comes from a survey form. People with very low or very high income are likely not to report it: informative missingness in both tails of the income distribution. The splitter itself wants to separate low income on the left from high income on the right; hence it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such a case). Using a surrogate helps resolve the ambiguity and redistribute missing-income records between the two sides. For example, all low-education subjects will join the low-income side, while all high-education subjects will join the high-income side.
  48. Competitors and Surrogates Are Different. Consider a node with three nominal classes A, B, C, equally present, with equal priors, unit costs, etc., and two splits, X and Y, that each cleanly isolate a different class. By symmetry, both splits must result in identical improvements, yet one split cannot be substituted for the other. Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association). The reverse does not hold: a perfect surrogate is always a perfect competitor.
  49. Calculating Association. Ten observations in a node; the main splitter is X. Split Y mismatches the main splitter on 2 observations. The default rule sends all cases to the dominant side, mismatching 3 observations. Association = (Default − Split Y) / Default = (3 − 2) / 3 = 1/3.
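The same arithmetic as a small function. The left/right patterns below are made up to reproduce the slide's example: 7 cases go left under the main splitter, and the candidate surrogate disagrees on 2 cases.

```python
# Association of a candidate surrogate with the main splitter: how much
# better it matches the main split than the "default rule" that sends
# every case to the main splitter's dominant side.

def association(main_left, surr_left):
    n = len(main_left)
    dominant = max(sum(main_left), n - sum(main_left))
    default_miss = n - dominant            # default rule's mismatches
    surr_miss = sum(m != s for m, s in zip(main_left, surr_left))
    return (default_miss - surr_miss) / default_miss

main = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # main splitter: 7 left, 3 right
surr = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1]   # disagrees on 2 of 10 cases

print(association(main, surr))  # (3 - 2) / 3
```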
  50. Notes on Association. The measure is local to a node and non-parametric in nature. Association is negative for splits with very poor resemblance and is not bounded by −1 from below. Any positive association is useful (a better strategy than the default rule); hence surrogates are defined as the splits with the highest positive association. In calculating case mismatch, check the possibility of split reversal (if the rule is met, send cases to the right). CART marks straight splits with "s" and reverse splits with "r".
  51. Calculating Variable Importance. List all main splitters as well as all surrogates along with the resulting improvements; aggregate the list by variable, accumulating improvements; sort the result in descending order; divide each entry by the largest value and scale to 100%. The resulting importance list is based exclusively on the given tree; if a different tree is grown, the importance list will most likely change. Variable importance reveals a deeper structure, in stark contrast to a casual look at the main splitters.
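The four steps translate almost directly into code. The node records below are invented to mirror the GYM example (FIT a perfect surrogate for CLASSES, ONPOOL a weak splitter); the numbers are not from any real run.

```python
# Variable importance as described: sum improvements over every node
# where a variable appears as the main splitter or as a surrogate,
# then scale so the top variable reads 100.

node_records = [
    {"var": "CLASSES", "role": "main",      "improvement": 0.40},
    {"var": "FIT",     "role": "surrogate", "improvement": 0.40},
    {"var": "ONPOOL",  "role": "main",      "improvement": 0.02},
]

def importance(records):
    totals = {}
    for rec in records:
        totals[rec["var"]] = totals.get(rec["var"], 0.0) + rec["improvement"]
    top = max(totals.values())
    return {v: round(100 * t / top, 1)
            for v, t in sorted(totals.items(), key=lambda kv: -kv[1])}

print(importance(node_records))
```

This reproduces the earlier "mystery": FIT never appears as a splitter, yet its surrogate credit makes it as important as CLASSES.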
  52. Mystery Solved. In our example, improvements in the root node dominate the whole tree; consequently, the variable importance reflects the surrogates table in the root node.
  53. Variable Importance Caution. Importance is a function of the OVERALL tree, including the deepest nodes. Suppose you grow a large exploratory tree and review the importances, then find an optimal tree via test set or CV, yielding a smaller tree. The optimal tree may be the SAME as the exploratory tree in the top nodes, YET the importances might be quite different. WHY? Because the larger tree uses more nodes to compute the importance. When comparing results, be sure to compare similar or same-sized trees.
  54. Splitting Rules and Friends
  55. Splitting Rules: Introduction. The splitting rule controls the tree-growing stage. At each node, a finite pool of all possible splits exists. The pool is independent of the target variable; it is based solely on the observations (and corresponding values of predictors) in the node. The task of the splitting rule is to rank all splits in the pool according to their improvements: the winner becomes the main splitter, and the top runners-up become competitors. It is possible to construct different splitting rules emphasizing various approaches to the definition of the "best split."
  56. Splitting Rules Control. Selected in the "Method" tab of the Model Setup dialog. There are 6 alternative rules for classification and 2 rules for regression. Since there is no theoretical justification for a single best splitting rule, we recommend trying all of them and picking the best resulting model.
  57. Further Insights. On a binary classification problem, GINI and TWOING are equivalent; therefore, we recommend setting the "Favor Even Splits" control to 1 when TWOING is used. GINI and Symmetric GINI differ only when an asymmetric cost matrix is used and are identical otherwise. Ordered TWOING is a restricted version of TWOING for multinomial targets coded as a set of contiguous integers. Class Probability is identical to GINI during tree growing but differs during pruning.
  58. Battery RULES. This battery was designed to automatically try all six alternative splitting rules. According to the output above, the Entropy splitting rule produced the most accurate model on the given data. Individual model navigators are also available.
  59. Controlled Tree Growth. CART was originally conceived to be fully automated, protecting the analyst from making erroneous split decisions. In some cases, however, it can be desirable to influence tree evolution to address specific modeling needs. The modern CART engine allows multiple direct and indirect ways of accomplishing this: penalties on variables, missing values, and categories; control over which variables are allowed or disallowed at different levels in the tree, and over node sizes; and direct forcing of splits at the top two levels of the tree.
  60. Penalizing Individual Variables. We use the variable penalty control to penalize CLASSES a little bit; this is enough to give FIT the main splitter position. The improvement associated with CLASSES is now downgraded by 1 minus the penalty factor.
  61. Penalizing Missing and Categorical Predictors. Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage. Introducing a systematic penalty set to the proportion of the variable missing effectively removes the possible bias. The same holds for categorical variables with a large number of distinct levels. We recommend that the user always set both controls to 1.
  62. Forcing Splits. CART allows direct forcing of the split variable (and possibly the split value) in the top three nodes of the tree.
  63. Notes on Forcing Splits. It is not guaranteed that a forced split below the root-node level will be preserved (because of pruning). Do not simply overturn "illogical" automatic splits by forcing something else; look for an explanation of why CART makes a "wrong" choice, as it could indicate a problem with your data.
  64. Constrained Trees. Many predictive models can benefit from Salford's patent-pending "Structured Trees": trees constrained in how they are grown to reflect decision-support requirements. In the mobile phone example, we want the tree to first segment on customer characteristics and then complete using price variables: price variables are under the control of the company, while customer characteristics are not.
  65. Setting Up Constraints. Green indicates where in the tree the variables of a group are allowed to appear.
  66. Constrained Tree. Demographic and spend information at the top of the tree; handset (HANDPRIC) and per-minute pricing (USEPRICE) at the bottom.
  67. Prior Probabilities. An essential component of any CART analysis: a prior probability distribution for the classes is needed to do proper splitting, since prior probabilities are used to calculate the improvements in the Gini formula. Example: suppose the data contains 99% type A (non-responders) and 1% type B (responders). CART might focus on not missing any class A objects; one "solution" is to classify all objects as non-responders. Advantages: only a 1% error rate, and very simple predictive rules, although not informative. But realize that this could be an optimal rule.
  68. Priors EQUAL. The most common priors setting is PRIORS EQUAL: each class gets equal weight regardless of its frequency in the data, preventing CART from favoring more prevalent classes. Suppose we have 900 class A and 100 class B with equal priors, and a split sends 300 A and 10 B to the left child and 600 A and 90 B to the right child. The left child holds 1/3 of all A but only 1/10 of all B, so it is classified as A; the right child holds 2/3 of all A but 9/10 of all B, so it is classified as B. With equal priors, prevalence is measured as a percent of a class's OWN size: measurements are relative to the class, not the entire learning set. With PRIORS DATA, both child nodes would be class A. Equal priors puts both classes on an equal footing.
  69. Class Assignment Formula I. For equal priors, a node t is assigned class 1 if N1(t)/N1 > N2(t)/N2, where Ni is the number of cases in class i at the root node and Ni(t) is the number of cases in class i at node t. Classify any node by the "count ratio in node" relative to the "count ratio in root": the class with the greatest relative richness wins.
  70. Class Assignment Formula II. When priors are not equal, node t is assigned class 1 if π1·N1(t) / (π2·N2(t)) > N1/N2, where πi is the prior for class i. Since π1/π2 always appears as a ratio, just think in terms of boosting the amount Ni(t) by a factor, e.g. 9:1, 100:1, 2:1. If priors are EQUAL, the πi terms cancel out. If priors are DATA, then π1/π2 = N1/N2, which simplifies the formula to the plurality rule (simple counting): classify as class 1 if N1(t) > N2(t). If priors are neither EQUAL nor DATA, the formula provides a boost to the up-weighted classes.
  71. Example. Root: 900 A, 100 B. Left child: 300 A, 10 B; since 300/900 > 10/100, classify as A. Right child: 600 A, 90 B; since 90/100 > 600/900, classify as B. In general, the tests would be weighted by the appropriate priors: classify as A if π(A)·300/900 > π(B)·10/100, and as B if π(B)·90/100 > π(A)·600/900. The tests can be written in terms of A or B as convenient.
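The worked example, under both prior settings, can be checked with a one-line implementation of the formula from slide 70 (written with cross-multiplication so empty nodes cause no division by zero):

```python
# Class-assignment rule: node t is class 1 when
# pi1 * N1(t) / (pi2 * N2(t)) > N1 / N2, i.e. the prior-weighted
# relative richness of class 1 exceeds that of class 2.

def assign(n1_t, n2_t, n1_root, n2_root, pi1=0.5, pi2=0.5):
    return 1 if pi1 * n1_t * n2_root > pi2 * n2_t * n1_root else 2

# The slide's example: root has 900 A (class 1) and 100 B (class 2)
assert assign(300, 10, 900, 100) == 1   # 1/3 of A beats 1/10 of B
assert assign(600, 90, 900, 100) == 2   # 9/10 of B beats 2/3 of A
# PRIORS DATA (priors proportional to class sizes): plurality rule,
# so even the right child reverts to the majority class A
assert assign(600, 90, 900, 100, pi1=0.9, pi2=0.1) == 1
```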
  72. Cross-Validation
  73. Cross-Validation. Cross-validation is a relatively recent computer-intensive development in statistics. Its purpose is to protect against overfitting errors: we don't want to capitalize on chance and track idiosyncrasies of this data, idiosyncrasies which will NOT be observed on fresh data. Ideally we would like large test data sets to evaluate trees, N > 5000; practically, some studies don't have sufficient data to spare for testing. Cross-validation uses the SAME data for learning and for testing.
  74. 10-fold CV: Industry Standard. Begin by growing the maximal tree on ALL the data. Divide the data into 10 portions, stratified on the dependent variable's levels. Reserve the first portion for test and grow a new tree on the remaining 9 portions; use the 1/10 test set to measure the error rate for this 9/10-data tree. The error rate is measured for the maximal tree and for all subtrees. Now rotate in a new 1/10 test data set and grow a new tree on the remaining 9 portions. Repeat until all 10 portions of the data have been used as test sets.
  75. The CV Procedure. Round 1: portion 1 is the test set, portions 2-10 are the learning set. Round 2: portion 2 tests, the rest learn. Round 3: portion 3 tests, the rest learn; and so on, through round 10, where portion 10 tests and portions 1-9 learn.
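A sketch of the rotation (unstratified, for brevity; CART stratifies the portions on the target, as slide 74 notes):

```python
import random

# 10-fold rotation: each observation serves as a test case exactly
# once and as a learning case in the other nine rounds.

def kfold_indices(n, k=10, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]   # k disjoint test folds

folds = kfold_indices(30, k=10)
for test_fold in folds:
    held_out = set(test_fold)
    learn = [i for i in range(30) if i not in held_out]
    # grow a tree on `learn`, measure error on `test_fold`, and
    # accumulate the error rates by tree complexity (slide 76) ...
    assert len(learn) + len(test_fold) == 30
```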
  76. CV Details. Every observation is used as a test case exactly once; in 10-fold CV each observation is used as a learning case 9 times. In 10-fold CV, 10 auxiliary CART trees are grown in addition to the initial tree. When all 10 CV trees are done, the error rates are cumulated (summed); the summing of error (cost) rates is done by TREE complexity, and the summed rates are then attributed to the INITIAL tree. Observe that no two of these trees need be the same; results are subject to random fluctuations, sometimes severe.
  77. CV Replication and Stability. Look at the separate CV tree results of a single CV analysis. CART produces a summary of the results for each tree at the end of the output: the table titled "CV TREE COMPETITOR LISTINGS" reports results for each CV tree at the root (TOP), the left child of the root (LEFT), and the right child of the root (RIGHT). We want to know how stable the results are: do different variables split the root node in different trees? The only difference between CV trees is the random elimination of 1/10 of the data. A special battery called CVR exists to study the effects of repeating the cross-validation process.
  78. Battery CVR
  79. Regression Trees
  80. General Comments. Regression trees produce piecewise-constant models (a multi-dimensional staircase) on an orthogonal partition of the data space, and are thus usually not the best possible performer in terms of conventional regression loss functions. Only a very limited number of controls is available to influence the modeling process: priors and costs are no longer applicable, and there are two splitting rules, LS and LAD. Regression trees are very powerful at capturing high-order interactions but somewhat weak at explaining simple main effects.
  81. Boston Housing Data Set. Harrison, D. and D. Rubinfeld, "Hedonic Housing Prices & Demand For Clean Air," Journal of Environmental Economics and Management, v5, 81-102, 1978. 506 census tracts in the City of Boston for the year 1970. Goal: study the relationship between quality-of-life variables and property values.
     MV     median value of owner-occupied homes in tract ('000s)
     CRIM   per capita crime rate
     NOX    concentration of nitrogen oxides (pphm)
     AGE    percent built before 1940
     DIS    weighted distance to centers of employment
     RM     average number of rooms per house
     LSTAT  percent neighborhood 'lower SES'
     RAD    accessibility to radial highways
     CHAS   borders Charles River (0/1)
     INDUS  percent non-retail business
     TAX    tax rate
     PT     pupil-teacher ratio
  82. 82. Splitting the Root Node Improvement is defined in terms of the greatest reduction in the sum of squared errors when a single constant prediction is replaced by two separate constants on each side Introduction to CART® Salford Systems ©2007 82
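The split-improvement criterion on this slide can be illustrated with a short sketch (not CART itself, just the idea): for a single predictor, try every candidate threshold and keep the one with the greatest reduction in the sum of squared errors when the parent's single constant prediction is replaced by two node means.

```python
# Illustrative sketch of regression-tree splitting: maximize the reduction
# in the sum of squared errors (SSE) over all candidate thresholds.

def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, improvement) for the split x <= t that most reduces SSE."""
    parent_sse = sse(y)
    pairs = sorted(zip(x, y))
    best_t, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical predictor values
        t = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [yy for xx, yy in pairs[:i]]
        right = [yy for xx, yy in pairs[i:]]
        gain = parent_sse - sse(left) - sse(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy data: the response steps up sharply once x crosses 3
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 11.0, 10.5, 20.0, 21.0, 20.5]
t, gain = best_split(x, y)  # the best split lands between x=3 and x=4
```

A real CART run repeats this search over every predictor in every node; the sketch only covers the one-variable core of the calculation.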
  83. 83. Regression Tree Model All cases in the given node are assigned the same predicted response – the node average of the original target Nodes are color-coded according to the predicted response We have a convenient segmentation of the population according to the average response levels Introduction to CART® Salford Systems ©2007 83
  84. 84. The Best and the Worst Segments Introduction to CART® Salford Systems ©2007 84
  85. 85. Scoring and Deployment Introduction to CART® Salford Systems ©2007 85
  86. 86. Notes on Scoring A tree cannot be expressed as a simple analytical expression; instead, a lengthy collection of rules is needed to express the tree logic Nonetheless, the scoring operation is very fast once the rules are compiled into machine code Scoring can be done in two ways: Internally – by using the SCORE command Externally – by using the TRANSLATE command to generate C, SAS, Java, or PMML code A model of any size can be scored or translated Introduction to CART® Salford Systems ©2007 86
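To make the "collection of rules" idea concrete, here is the general shape of TRANSLATE-style output, rendered in Python for readability rather than the C, SAS, Java, or PMML that CART actually emits. The variable names echo the Boston housing example, but the thresholds, node numbers, and class labels are invented for illustration.

```python
# Hypothetical sketch of translated tree logic: the whole model is nested
# if/else rules, which is why compiled scoring is so fast.  Thresholds and
# labels below are made up for illustration only.

def score(case):
    """Return (terminal_node, predicted_class) for one record."""
    if case["LSTAT"] <= 14.4:
        if case["RM"] <= 6.5:
            return 1, "mid_value"
        else:
            return 2, "high_value"
    else:
        return 3, "low_value"

node, pred = score({"LSTAT": 10.0, "RM": 7.1})
```

A deep tree simply produces a longer chain of such rules; nothing in the scoring path requires the original training data.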
  87. 87. Internal Scoring Introduction to CART® Salford Systems ©2007 87
  88. 88. Output File Format
CASEID RESPONSE NODE CORRECT DV PROB_1   PROB_2   PATH_1 PATH_2 PATH_3 PATH_4 PATH_5 PATH_6 XA XB XC
1      0        1    1       0  0.998795 0.001205 1      2      3      4      5      -1     0  0  0
2      1        7    0       0  0.738636 0.261364 1      -7     0      0      0      0      1  0  0
3      0        1    1       0  0.998795 0.001205 1      2      3      4      5      -1     0  0  0
CART creates the following output variables RESPONSE – predicted class assignment NODE – terminal node the case falls into CORRECT – indicates whether the prediction is correct (assuming that the actual target is known) PROB_x – class probability (constant within each node) PATH_x – path traveled by the case in the tree Case 1 starts in the root node, then follows through internal nodes 2, 3, 4, 5 and finally lands in terminal node 1 Case 2 starts in the root node and then lands in terminal node 7 Finally, CART adds all of the original model variables including the target and predictors Introduction to CART® Salford Systems ©2007 88
  89. 89. Model Translation Introduction to CART® Salford Systems ©2007 89
  90. 90. Output Code Introduction to CART® Salford Systems ©2007 90
  91. 91. CART Automation Introduction to CART® Salford Systems ©2007 91
  92. 92. Automated CART Runs Starting with CART 6.0 we have added a new feature called batteries A battery is a collection of runs obtained via a predefined procedure The simplest batteries vary one of the key CART parameters: RULES – run every major splitting rule ATOM – vary minimum parent node size MINCHILD – vary minimum node size DEPTH – vary tree depth NODES – vary maximum number of nodes allowed FLIP – flip the learn and test samples (two runs only) Introduction to CART® Salford Systems ©2007 92
  93. 93. More Batteries DRAW – each run uses a randomly drawn learn sample from the original data CV – varying number of folds used in cross-validation CVR – repeated cross-validation each time with a different random number seed KEEP – select a given number of predictors at random from the initial larger set Supports a core set of predictors that must always be present ONEOFF – one predictor model for each predictor in the given list The remaining batteries are more advanced and will be discussed individually Introduction to CART® Salford Systems ©2007 93
  94. 94. Battery MCT The primary purpose is to test possible model overfitting by running Monte Carlo simulation Randomly permute the target variable Build a CART tree Repeat a number of times The resulting family of profiles measures intrinsic overfitting due to the learning process Compare the actual run profile to the family of random profiles to check the significance of the results This battery is especially useful when the relative error of the original run exceeds 90% Introduction to CART® Salford Systems ©2007 94
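The Monte Carlo idea behind Battery MCT can be sketched outside of CART: permute the target several times, refit a model each time, and compare the real model's performance to the distribution obtained under permutation. The sketch below uses a simple decision stump rather than a full CART tree, so it illustrates the test logic only.

```python
# Sketch of the Battery MCT idea (not the CART implementation): any accuracy
# achievable on a randomly permuted target is pure overfitting, so the real
# model should clearly beat the permuted runs.
import random

def fit_stump(x, y):
    """Pick the threshold on x that best separates the two classes."""
    best_t, best_acc = None, 0.0
    for t in sorted(set(x)):
        preds = [1 if v <= t else 0 for v in x]
        acc = sum(p == yy for p, yy in zip(preds, y)) / len(y)
        acc = max(acc, 1 - acc)  # allow either direction of the rule
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

random.seed(0)
x = list(range(20))
y = [1 if v < 10 else 0 for v in x]        # perfectly separable signal

_, real_acc = fit_stump(x, y)              # accuracy against the true target
perm_accs = []
for _ in range(30):                        # Monte Carlo replications
    y_perm = y[:]
    random.shuffle(y_perm)                 # destroy the x-y relationship
    perm_accs.append(fit_stump(x, y_perm)[1])

# If real_acc does not clearly exceed the permuted accuracies, the original
# model was probably fitting noise.
better_than_noise = real_acc > max(perm_accs)
```

In CART terms, the family of permuted-run profiles plays the role of `perm_accs`, and the actual run's relative-error profile plays the role of `real_acc`.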
  95. 95. Battery MVI Designed to check the impact of missing value indicators (MVIs) on model performance Build a standard model, missing values are handled via the mechanism of surrogates Build a model using MVIs only – measures the amount of predictive information contained in the missing value indicators alone Combine the two Same as above with missing penalties turned on Introduction to CART® Salford Systems ©2007 95
  96. 96. Battery PRIOR Vary the priors for the specified class within a given range For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02 Results in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion) Can be used for hot spot detection Can also be used to construct a very efficient voting ensemble of trees This battery is covered at greater length in our Advanced CART class Introduction to CART® Salford Systems ©2007 96
  97. 97. Battery SHAVING Automates the variable selection process Shaving from the top – at each step the most important variable is eliminated Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable) Shaving from the bottom – at each step the least important variable is eliminated May drastically reduce the number of variables used by the model SHAVING ERROR – at each iteration all current variables are tried for elimination one at a time to determine which predictor contributes the least May result in a very long sequence of runs (quadratic complexity) Introduction to CART® Salford Systems ©2007 97
  98. 98. Battery SAMPLE Build a series of models in which the learn sample is systematically reduced Used to determine the impact of the learn sample size on the model performance Can be combined with BATTERY DRAW to get additional spread Introduction to CART® Salford Systems ©2007 98
  99. 99. Battery TARGET Attempt to build a model for each variable in the given list as a target using the remaining variables as predictors Very effective in capturing inter-variable dependency patterns Can be used as an input to variable clustering and multi-dimensional scaling approaches When run on MVIs only, can reveal missing value patterns in the data Introduction to CART® Salford Systems ©2007 99
  100. 100. Stable Trees and Consistency Introduction to CART® Salford Systems ©2007 100
  101. 101. Notes on Tree Stability Trees are usually stable near the top and become progressively more unstable as the nodes become smaller Therefore, smaller trees in the pruning sequence will tend to be more stable Node instability usually manifests itself when comparing node class distributions between the learn and test samples Ideally we want to identify the largest stable tree in the pruning sequence The optimal tree suggested by CART may contain a few unstable nodes structurally needed to optimize the overall accuracy The Train-Test Consistency (TTC) feature was designed to address this topic at length Introduction to CART® Salford Systems ©2007 101
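The direction check behind train-test consistency can be illustrated with a toy calculation: for each terminal node, compare the learn and test event rates and flag any node that sits above the base rate in one sample but below it in the other. The node numbers and rates below are invented for illustration; the real TTC report also checks the rank-order of the nodes.

```python
# Toy sketch of the TTC direction check.  A node is direction-unstable when
# its learn and test event rates fall on opposite sides of the base rate.
# All numbers are invented for illustration.

# (node_id, learn_event_rate, test_event_rate)
nodes = [
    (1, 0.80, 0.75),
    (2, 0.60, 0.55),
    (3, 0.30, 0.52),   # above the base rate on test, below it on learn
    (4, 0.10, 0.12),
]
base_rate = 0.50

unstable = [n for n, lr, tr in nodes
            if (lr > base_rate) != (tr > base_rate)]
```

Pruning back until no such flags remain is one way to arrive at the largest stable tree in the pruning sequence.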
  102. 102. MJ2000 Run The dataset comes from an undisclosed financial institution Set up a default classification run with a 50% random test sample CART determines a 36-node optimal tree Introduction to CART® Salford Systems ©2007 102
  103. 103. Node Stability Terminal Nodes TTC Report A quick look at the learn and test node distributions reveals a large number of unstable nodes Let’s use the TTC feature by pressing the [T/T Consist] button to identify the largest stable tree Introduction to CART® Salford Systems ©2007 103
  104. 104. The Largest Stable Tree TTC Report The tree with 11 terminal nodes is identified as stable (all nodes match on both direction and rank) While within the specified tolerance, nodes 2, 6, and 8 are somewhat unsatisfactory in terms of rank-order These nodes will emerge as rank-order unstable if we lower the tolerance level to 0.5 Introduction to CART® Salford Systems ©2007 104
  105. 105. Improving Predictor List Predictor Elimination by Shaving Run BATTERY SHAVE to systematically eliminate the least important predictors The resulting optimal tree has even better accuracy than before It also uses a much smaller set of only 6 predictors Let’s study the tree stability for the new model Introduction to CART® Salford Systems ©2007 105
  106. 106. Comparing the Results Original Tree New Tree The tree on the left is the original (unshaved) one The tree on the right is the result of shaving The new tree is far more stable than the original one, except for node 8 Introduction to CART® Salford Systems ©2007 106
  107. 107. Justification Learn Sample Test Sample Node 8 is irrelevant for classification but its presence is required in order to allow a very strong split below Introduction to CART® Salford Systems ©2007 107
  108. 108. Recommended Reading Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984), Classification and Regression Trees, Pacific Grove: Wadsworth Breiman, L. (1996), Bagging Predictors, Machine Learning, 24, 123-140 Breiman, L. (1996), Stacked Regressions, Machine Learning, 24, 49-64 Friedman, J., Hastie, T. and Tibshirani, R. (1998), Additive Logistic Regression: A Statistical View of Boosting Introduction to CART® Salford Systems ©2007 108