
Using CART For Beginners with A Telco Example Dataset


Familiarize yourself with CART Decision Tree technology in this beginner's tutorial using a telecommunications example dataset from the 1990s. By the end of this tutorial you should feel comfortable using CART on your own with sample or real-world data.



  1. Salford Systems Webex Training
  2. CART® Decision Tree Basics
     • We start with a simple analysis of some market research data using CART
     • This introduction assumes no background in data mining or predictive analytics
     • We do assume you have had some experience reviewing data with the purpose of discovering interesting and/or predictive patterns
     © Copyright Salford Systems 2013
  3. Beginning with CART
     • CART is the perfect place to start learning about data mining
     • Widely regarded as one of the most important tools in data mining, and also the easiest to understand and master
       – Decision trees are still the most popular data analysis tool among experienced data miners
     • Delivers easy-to-understand analyses of complex data
       – Allows for very sophisticated analyses, especially when a structured series of trees is developed
       – Effective Exploratory Data Analysis (EDA) to support more conventional modeling (e.g. logistic regression)
  4. Classification with CART®: Real-World Study, Early 1990s
     • Fixed-line service provider offering a new mobile phone service
     • Wants to identify customers most likely to accept the new mobile offer
     • Data set based on a limited market trial
       – 830 usable records
       – 67 attributes and target, including:
         – Demographics
         – Attitudes and needs
         – Pricing for handset & minutes
  5. Mobile Phone Offer
     • Data is a sample of land-line telephone customers of a European telco
     • At the time mobile phones were very rare in the country in question
     • The company realized the time was right to introduce mobile phones on a substantial scale to their existing fixed-line customer base
     • Key questions:
       – WHO to target with the marketing campaigns for the new product
       – HOW MUCH to charge for the handset
  6. Nature of the Research
     • Company arranged to make real-world offers to about 1,000 existing land-line customers
     • Everyone was presented the same offer (only one model of phone and one service plan available)
     • The PRICE of the handset was varied randomly over a large range of prices, from near zero to about $300
     • Goal was to learn who responded positively and at what price points
     • Offers were made in person as part of a one-hour visit in which much was learned about the household (media preferences, number of children, distance to work, etc.)
  7. Nature of the Data
     • Target variable RESPONSE: coded 0 or 1 (YES, NO)
     • 65 available predictors include variables like:
       HANDPRIC  Cost of handset (one-time fee)
       USEPRICE  Usage cost (per month, 100 minutes)
       TELEBILC  Landline home phone bill average
       CITY      Resident in which of 5 major cities
       AGE       Coded in 5-year increments
       HOUSIZ    Possible proxy for income, coded 1-6
       SEX       Male, Female, Unknown
       EDUCATN   Coded 1-7 (up through postgrad)
  8. Analysis File Overview in CART 6.0
  9. Set Up the Model (Select Target, Allowable Predictors)
     The only requirement is to select the TARGET (dependent) variable. CART will do everything else automatically.
  10. CART Does Its Own Variable Selection
     • Embedded variable (feature) selection means that the modeler can let the software make its own choice of predictors
     • The modeler will often want to limit the model to focus on selected inputs:
       – Exclude ID variables and merge keys
       – Exclude clones of the dependent variable
       – Exclude data pertaining to the future (relative to the dependent)
       – E.g. restrict a model to easily available predictors
       – Test the predictive power of purchased external data
     • Modeling automation can allow exploration of a vast space of pre-selected predictors (see later slides)
  11. In This Example We Run a CART Model
     • CART completes the analysis and gives access to all results from the NAVIGATOR (shown on the next slide)
     • Upper section displays a tree of a selected size (number of terminal nodes)
     • Lower section displays the error rate for trees of all possible sizes
     • Green bar marks the most accurate tree
     • We display a compact 10-node tree for further scrutiny
  12. CART Model Viewer
     Access reports and drill into model details. The most accurate tree is marked with the green bar. Above we select the 10-node tree for the convenience of a more compact display. Note the train/test area under the ROC curve.
  13. Root Node: Hover Mouse
     The tree starts with all training data. Hovering displays details of the TARGET variable in the overall training data. Above we see that 15.2% of the 830 households accepted the offer. The goal of the analysis is now to extract patterns characteristic of responders.
  14. Goal Is to Split the Node: Separate Responders
     • Details of the root node split
     • If we could use only a single piece of information to separate responders from non-responders, CART chooses the HANDSET PRICE
       – Those offered the phone at a price > 130 contain only 9.9% responders
       – Those offered a lower price respond at 21.9%
  15. CART Splitting Rules
     • We discuss the details later
     • Here we just point out that the split CART displays is "the best of all possible splits"
       – Subject to the splitting criteria you have chosen and any constraints imposed
     • How do we know this split is "best"? Because CART actually tries all possible splits, looking for the best
       – Exhaustive brute-force search
       – Advanced algorithms are used to make this search fast (as much as 100 times faster than other decision trees)
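The exhaustive search described above can be sketched in a few lines of Python. This is a minimal illustration, not Salford's implementation: it uses the Gini impurity as the splitting criterion (one of CART's standard choices) and tries every distinct value of one continuous predictor.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(x, y):
    """Try every distinct value of x as a threshold; return the
    (threshold, improvement) pair with the largest impurity reduction."""
    n = len(x)
    parent = gini(y)
    best = (None, 0.0)
    for t in sorted(set(x))[:-1]:          # the largest value cannot split
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        improvement = (parent
                       - (len(left) / n) * gini(left)
                       - (len(right) / n) * gini(right))
        if improvement > best[1]:
            best = (t, improvement)
    return best

best_split([1, 2, 3, 4], [0, 0, 1, 1])  # -> (2, 0.5): split at 2 separates the classes perfectly
```

Real implementations make this fast by sorting each predictor once and updating the class counts incrementally as the threshold moves, rather than re-partitioning the data at every candidate value.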
  16. Grow a Progressively Bigger Tree: One Split at a Time
     • Binary recursive partitioning is repeated until further splitting is impossible (e.g. data exhausted)
     • This leads us to the largest possible or "maximal" tree
  17. Maximal Tree Is the Raw Material for the Best Model
     • Goal is to find the optimal tree embedded inside the maximal tree
     • We will find the optimal tree via "pruning", much like backwards stepwise regression
     • Challenge: a tree with 100 terminal nodes can be pruned back to 99 terminal nodes by eliminating any one of the 99 penultimate nodes
     • Now the 99 new terminal nodes can be cut back to 98 by eliminating any one of the surviving 98 penultimate nodes
     • Something like 99! possible trees. How do we find the best?
  18. Pruning Sequence
     • CART automatically generates a pruning sequence: a preferred sequence of progressively smaller trees
     • We can prove that, for a given tree size, the CART tree in the sequence will be the best-performing tree of all possible trees of that size
     • In our sequence, the 10-node tree is guaranteed to be more accurate than any other 10-node tree you could extract from the maximal tree
     • You as the user never need to worry about this
     • "Better" is defined in terms of performance on the training data, as we need the tree sequence before we can test
  19. Error Curve: Plots Accuracy vs. Model Size
     • Requires test data
     • Can use cross-validation (sample reuse) if data is scarce
     • Curve is typically U-shaped: too small is not good, and neither is too large
     • Can look at any tree in the sequence of pruned subtrees
     • Error is what BFOS call an "honest" estimate of model performance
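The selection step itself is simple once the pruning sequence and its test errors exist: pick the tree at the bottom of the U. The (size, error) pairs below are made up purely for illustration.

```python
# Hypothetical pruning sequence: (terminal_nodes, test_error) pairs.
# The error values are invented to show a typical U-shaped curve.
sequence = [(2, 0.42), (4, 0.35), (10, 0.31), (25, 0.33), (81, 0.40)]

# The "optimal" tree is the one minimizing honest (test) error
optimal = min(sequence, key=lambda pair: pair[1])
optimal  # -> (10, 0.31)
```

In practice one often prefers a slightly smaller tree whose error is within a tolerance of the minimum, which is the spirit of the deck's choice of a compact 10-node tree.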
  20. Pick a Modest-Sized Tree to Examine
     Note the high response in this RED-colored node: a response of 38.5% in this segment vs. 15.2% overall. Lift = 2.53.
  21. Navigator Allows Access to All Model Info
     • The terminal nodes are color-coded to represent results
       – RED nodes are "hot" and contain high concentrations of the class of interest (buyers)
       – BLUE nodes are "cold" and contain very low concentrations of the class of interest
       – PINK and WHITE nodes have moderate concentrations
     • We first look to see if we have any RED nodes, exploring them via mouse hover
     • Then we drill down to see a tree schematic revealing the main drivers of the tree
  22. Select "Splitters" View
     Selects a streamlined overview of the tree showing ONLY primary splitters.
  23. Model Overview: Main Drivers (Red = Good Response, Blue = Poor Response)
     High values of a split variable always go to the right; low values go left.
  24. Examine the Extreme Right-Most Terminal Node
     • Hover the mouse over the node to see inside
     • Even though this node is on the "high price" side of the tree, it still exhibits the strongest response across all terminal-node segments (43.5% response)
     • Rules defining this node are shown on the next slide
  25. Rules Can Be Extracted in a Variety of Languages
     Here we select rules expressed in C for one node of interest. The entire tree can also be rendered in Java, XML/PMML, or SAS.
  26. Continuing Down the Tree
     • We note that even if the new product is offered at a high price we can still find prospects who are very interested:
       – Those that have a high average landline bill and own a pager
       – This group displays the greatest probability of response (43.5%)
  27. Classic Detailed Tree Display
     The analyst can select the details to be displayed.
  28. Control Over Details Displayed in Nodes
     At left, an example showing the class bar chart. There are separate controls for internal and terminal nodes.
  29. Configure the Print Image Interactively
     Shrink to one page; include header/footer.
  30. Tree Performance Measures and Principal Message
     In addition to the details of the tree (splits, split values), CART reports:
     • Variable importance ranking
     • Confusion matrix (prediction success matrix)
     • Gains, ROC
  31. Variable Importance Ranking (Relative Impact on Outcomes)
     There are three major ways of computing variable importance. Above, the default display.
  32. Predictive Accuracy (How Often Right, How Often Wrong)
     This model is not very accurate but ranks responders well.
  33. Gains Curve
     In the top decile the model captures about 23% of responders.
  34. Performance Evaluation: ROC Curve
  35. Observations on CART Trees: Contrasts with Conventional Stats
     • CART leverages only the rank order of a predictor to split
       – Transforming predictor X into Log(X) will not change the tree
       – Of course the tree will be expressed in terms of Log(X), but this will not change the location of the split
       – The traditional statistician's experiments with alternative transforms are unnecessary
     • CART is immune to outliers in predictors
       – Suppose X has values 1, 2, 3, …, 100, 900
       – To CART this is the same as 1, 2, 3, …, 100, 101
       – All CART "sees" is the rank order
     • We will see later that CART has built-in missing value handling
     • So no worry about outliers, missing values, or transformations
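The rank-order point can be verified directly: a monotone transform such as log leaves the sort order of a predictor unchanged, and since every candidate split is a cut between consecutive sorted values, the set of candidate partitions is identical. A tiny check, with made-up values including the 900 outlier from the slide:

```python
import math

x = [1, 2, 3, 100, 900]

# Sort order of the raw values and of their logs
order_x = sorted(range(len(x)), key=lambda i: x[i])
order_log = sorted(range(len(x)), key=lambda i: math.log(x[i]))
# Identical orderings -> identical candidate partitions -> same tree

# Pulling the outlier in (900 -> 101) also leaves the order unchanged
x2 = [1, 2, 3, 100, 101]
order_x2 = sorted(range(len(x2)), key=lambda i: x2[i])
```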
  36. CART Methodology: Partition Data into Two Segments
     • Partitioning line is parallel to an axis
     • Root node is split first, at 2.450, isolating all of species 1 from the rest of the sample
     • This gives us two child nodes
       – One is a terminal node containing only species 1
       – The other contains only species 2 and 3
     • Note: the entire data set is divided into two separate parts
  37. Second Split: Partitions Only a Portion of the Data
     • Again, partition with a line parallel to one of the two axes
     • CART selects PETALWID to split this node
       – Split it at 1.75
       – Gives a tree with a misclassification rate of 4%
     • The split applies only to a single partition of the data
     • Each partition is analyzed separately
  38. Discriminant Analysis Uses Oblique Lines
     • Linear combinations are difficult to understand and explain
     • CART does permit "oblique" splits based on linear combinations of small sets of variables, but this is rarely desirable
  39. CART Representation of a Surface
     • Model is clearly non-linear
     • Height of a bar represents the probability of response
     • The remaining axes represent the values of two predictors
     • Greatest probability of response is in the corner to the right
  40. CART Splitting Process
     • Standard splits are based on ONE predictor and take the form of a database RULE
     • A data record goes left if splitter_variable <= split_value
     • Examples: a data record goes left
       – if AGE <= 35
       – if CREDIT_SCORE <= 700
       – if TELEPHONE_BILL <= 50
  41. Searching All Splits Is Facilitated by Sorting
     • On the left we sort by TELEBILC, on the right by TRAVTIMR
     • Test the smallest value first, then the next smallest, etc., moving all the way down the column
     • The arrow shows a split sending 10 cases to the left and all other data to the right
  42. Example Root Node Split: Continuous Splitter
     From our Euro_telco_mini.xls example. The split is TELEBILC <= 50.
  43. Alternative Split Points
     What if we split the data at TELEBILC <= 25? Note that the response rates of the two nodes under this split are very similar; they are much more different after splitting at the optimal value.
  44. Two Splits Separate Quite Differently
     The first pane shows two segments with 14.3% and 15.5% response; the second pane shows two segments with 12.7% and 19.8%. Our goal in CART is to generate substantially different segments, and we accomplish this by experimenting with every possible split value for every predictor.
  45. CART Splitting Process: More
     • Splitter variables need not be numeric; they can be text
     • Splitter variables need not be ordered
     • A data record goes left
       – if CITY$ = "London" OR "Madrid" OR "Paris"
       – if DIAGNOSIS = 111 OR 35 OR 9999
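Both rule forms, the numeric threshold and the categorical level list, can be captured by one small routing function. A sketch (the `goes_left` helper and the sample record are illustrative, not part of the CART software):

```python
def goes_left(record, splitter, split):
    """Apply a CART-style split rule to one record (a dict).
    Numeric split: left if value <= threshold.
    Categorical split: left if value is in the listed left-going levels."""
    value = record[splitter]
    if isinstance(split, set):
        return value in split
    return value <= split

customer = {"CITY": "Madrid", "AGE": 40}
goes_left(customer, "CITY", {"London", "Madrid", "Paris"})  # True
goes_left(customer, "AGE", 35)                              # False
```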
  46. Splits on K-Level Categorical Predictors: 2^(K-1) - 1 Ways to Split
     • CART considers all possible splits based on a categorical predictor
     • Example: four regions A, B, C, D can be split 7 ways (2^3 - 1 = 7)
     • Each partition is a possible split of the node, and each is evaluated
     • Note: A on the left and B, C, D on the right is the same split as its mirror image, A on the right and B, C, D on the left
       – So we only list one version of this split; it is which cases stay together that matters, not which side of the tree they are on

         Left    Right
       1 A       B, C, D
       2 B       A, C, D
       3 C       A, B, D
       4 D       A, B, C
       5 A, B    C, D
       6 A, C    B, D
       7 A, D    B, C
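The enumeration above (mirror images counted once) can be reproduced by fixing one level on the left and taking every subset of the remaining levels. A stdlib sketch:

```python
from itertools import combinations

def categorical_splits(levels):
    """All distinct left/right partitions of a set of levels.
    Mirror images count once, so K levels give 2**(K-1) - 1 splits."""
    levels = sorted(levels)
    anchor = levels[0]                      # fix one level on the left side
    rest = levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(levels) - left
            if right:                       # both sides must be non-empty
                splits.append((left, right))
    return splits

len(categorical_splits(["A", "B", "C", "D"]))  # 7, matching 2**(4-1) - 1
2 ** (33 - 1) - 1                              # K = 33 already allows ~4.29 billion splits
```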
  47. Categorical Split Caution: Dangers of HLCs (High-Level Categoricals)
     • Because a categorical variable generates 2^(K-1) - 1 ways to split the data, high values of K can be problematic
     • K = 33 is not an unusually large number of levels, yet it allows for about 4 billion ways to split the data
     • When the number of possible splits exceeds the number of records in the data, the categorical variable has an advantage over any continuous splitter
       – A continuous variable with a unique value in every row of the data gives us a choice of split points equal to the number of rows of data
     • Later we will discuss several ways to deal with HLCs, including repackaging high-cardinality categoricals into lower-cardinality versions, and penalties
  48. Example Root Node Split: Categorical Splitter
     From our Euro_telco_mini.xls example. Observe that we have to LIST the values that go to each child node.
  49. CART Competitor Splits
     • The CART mechanism for splitting data is always the same
     • We are given a block of data
       – Could be all of our data, starting from scratch
       – Could be a small part of our data obtained after already doing a lot of slicing and dicing
     • When we work with a block of data we do not take into account how we got to that block
     • We do not consider any information which might be available outside of the block
     • The block of data to be analyzed is our entire universe; nothing else exists for us
  50. Getting Ready to Split
     • For a block of data to be split, it must contain a sufficient number of data records (ATOM)
       – We can tell CART what the minimum must be; the default is just TWO records
       – In large-database analysis we might reasonably set the minimum quite a bit higher; ATOM values such as 10, 20, 50, 100, and 200 have cropped up in our practical work
     • If you are working with a small database, such as those encountered in biomedical research (e.g. 200 records total), you will want to allow the ATOM size to be small
     • If you are working with hundreds of thousands or millions of records, there is no harm in trying a minimum size like 200
  51. Still Getting Ready to Split
     • Suppose we have a classification problem, such as modeling response to a marketing offer, where there are two outcomes: Responded and Did Not Respond
     • To be splittable, the block of data cannot be "pure", i.e. composed of all responders or all non-responders
       – True regardless of how large the block of data is
       – Splitting is designed to separate the responders from the non-responders, so we need a mixture to have something to do
     • The data records cannot all have exactly the same values for the predictors
       – CART will be looking for a useful difference in a predictor between responders and non-responders
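The three preconditions above (enough records, an impure target, some variation in the predictors) translate into a simple eligibility check. A sketch with hypothetical names; records are dicts and `target` names the outcome field:

```python
def is_splittable(records, target, atom=2):
    """A node can be split only if it has at least `atom` records,
    is not pure in the target, and the predictors are not all identical."""
    if len(records) < atom:
        return False                        # too few records (ATOM rule)
    if len({r[target] for r in records}) < 2:
        return False                        # pure node: nothing to separate
    predictors = [tuple(v for k, v in sorted(r.items()) if k != target)
                  for r in records]
    return len(set(predictors)) > 1         # need variation in the inputs

pure = [{"y": 1, "x": 3}, {"y": 1, "x": 9}]
mixed = [{"y": 0, "x": 3}, {"y": 1, "x": 9}]
is_splittable(pure, "y")    # False: all records are responders
is_splittable(mixed, "y")   # True
```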
  52. Observation on Dummy Variable Predictors
     • If you split a node using a continuous variable, there is always the chance that this same variable is used again in a subsequent split of descendant nodes
     • Once a node is split with a dummy variable, that variable can never be used again in descendant nodes
       – Because a descendant node will contain either all 0 or all 1 values for this variable; hence it cannot split
     • If a dummy variable is introduced into the tree below the root, it might appear in more than one location in the tree, but one use will never be the ancestor of the other
  53. Making the Split
     • To split the block of data (which we will henceforth refer to as splitting the node) we search each available predictor
     • For every predictor we make a trial split at every distinct value of the predictor
     • For each trial split we compute a goodness-of-split measure, normally referred to as the "improvement"
     • For each predictor we find the split value that yields the best improvement
     • Once every predictor has been searched for its best split point, we rank the splitters in descending order and use the best overall splitter to grow the tree
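The procedure just described is two nested searches: the best split point within each predictor, then a ranking across predictors. A self-contained Gini-based sketch (illustrative only; the variable names echo the telco example but the values are invented):

```python
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_for(x, y):
    """Best threshold and improvement for one predictor."""
    n, parent = len(x), gini(y)
    best_t, best_imp = None, 0.0
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        imp = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if imp > best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

def competitor_list(columns, y):
    """Rank every predictor by the improvement of its own best split."""
    ranked = [(name, *best_split_for(x, y)) for name, x in columns.items()]
    return sorted(ranked, key=lambda r: r[2], reverse=True)

y = [0, 0, 1, 1]
cols = {"HANDPRIC": [200, 150, 40, 10], "AGE": [30, 55, 35, 60]}
competitor_list(cols, y)  # HANDPRIC separates the classes perfectly and ranks first
```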
  54. Ranked List of Splitters
     • The ranked list of splitters is also known as the competitor list
     • CART always computes the entire list, as this is the only way to know for sure which split is best
     • To save space, CART normally displays only the top 5 competitors within a node; you can request a larger number in your options settings
     • The root node at the top of the tree always displays the complete list of competitors, even if there are thousands of predictors
  55. Why Care About Competitor Splits?
     • Useful to know if the best splitter is far better than all the rest or only slightly better
     • Useful to know which predictors show up near the top
       – Are they very different from each other, or are they all reflecting the same underlying information?
     • Useful to know if a strong but perhaps 2nd-best predictor splits the data more evenly than the best
       – We might want to try FORCING that 2nd-best predictor into the root to see what happens; sometimes this yields an overall better tree
     • The pattern of top splitters may reflect problems
       – The top 3 competitors may all be "too good to be true", and we might need to drop them all from the analysis
  56. Surrogate Splits
     • Surrogate splits were first introduced by the authors of CART in their classic monograph Classification and Regression Trees (1984)
     • Surrogate splits are mimics of, or substitutes for, the primary splitter of a node
     • An ideal surrogate splits the data in exactly the same way as the primary split
       – The "association" measure reflects how close to perfect a given surrogate is
  57. Why Surrogates?
     • Surrogates have two primary functions:
       – To split data when the primary splitter is missing
       – To reveal common patterns among predictors in a data set
     • CART searches for surrogate splitters in every node of the tree
       – Surrogates are searched for even when there is no missing data
       – There is no guarantee that useful surrogates can be found
       – CART attempts to find at least five surrogates for every node, but this number can be modified
       – The number of surrogates actually found normally varies from node to node
  58. CART and Missing Values in Deployment
     • CART is the only learning machine that is prepared to deal with any pattern of missing values in future data
     • Even if the training data have no missings, CART develops strategies to deal with the eventuality of any variable or variables being missing
     • Some learning machines cannot handle missing values at all
     • Other learning machines can only deal with missing-value patterns that they have been trained on (seen before)
       – E.g. handle X5 = missing only if X5 was ever missing in the training data
     • CART has no such restrictions and is always ready for any pattern of missings
  59. Surrogates in Action: Euro_telco_mini.xls
     Remember to check off CITY, MARITAL, and RESPONSE as "categorical".
  60. Manually Prune Back to the 10-Node Tree
     Just click on the blue curve in the lower panel to select a smaller, easier-to-manage tree. Then double-click on the left child of the root node (see arrow above).
  61. Look at the Left Child of the "Root"
     The primary splitter predicting subscription to the new mobile phone offer is the monthly telephone bill (TELEBILC), dividing the node into spenders of more or less than $50 per month.
  62. Surrogate for TELEBILC
     • If this variable were missing for any reason (database error, person recently moved, new customer), we would not know whether to move down the tree to the left or to the right
     • A surrogate variable can be used in place of the missing primary splitter; in this case the surrogate is of the form "go to the left if MARITAL = 1"
     • Left is associated with LOW spending on the telephone bill
     • CART suggests that single-person households spend less, while households headed by married or divorced persons spend more
  63. Surrogates and Direction
     • A surrogate is intended to be a substitute for the primary splitter, making similar left/right decisions
     • But surrogates may work in the opposite direction, so every continuous-variable surrogate is supplied with a "tag"
       – The letter "s" after the split point stands for "standard"
       – The letter "r" after the split point stands for "reverse"
     • If a surrogate is negatively correlated with the primary splitter, then it will split in the reverse direction
       – Categorical splitters are always organized so that the levels that go left in the primary splitter also go left in the surrogate
  64. Normally Surrogates Make Sense
     • Our primary splitter is the average monthly spend of a household on a fixed-line telephone account
     • Our surrogates include marital status, commute time to work, age, and city of residence
       – Longer commutes are associated with larger spend on the phone
       – An older head of household is also associated with larger spend
       – We cannot interpret the CITY variable at this point because we don't know the identity of the cities
     • In general, surrogates help us understand the primary splitter; they are especially helpful in survey research
  65. How to Compute Surrogates?
     • This is a technical question which we will not cover in full here
       – The CART monograph contains a wealth of technical information, although it can be a challenging read
     • However, we will discuss the main ideas. The top surrogate is:
       – A single variable
       – A single split (in the same format as any primary splitter)
       – Intended to mimic as closely as possible how the data is partitioned by the primary splitter into LEFT and RIGHT nodes
     • To get a surrogate, think of generating a one-split CART tree where the dependent variable is {LEFT or RIGHT} as defined by the primary splitter (there are many details)
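The "one-split tree predicting LEFT/RIGHT" idea can be sketched directly, together with the association score defined on the next slide: for a candidate variable we try every threshold (in both the standard and the reverse direction) and measure mismatches against the primary splitter's decisions, relative to the go-with-the-majority default rule. This is an illustrative simplification, not Salford's exact computation:

```python
def surrogate_association(x, went_left):
    """Find the split of x that best mimics the primary splitter's
    left/right decisions, and its association: the fractional reduction
    in mismatches relative to the 'go with the majority' default rule."""
    n = len(x)
    default_err = min(sum(went_left), n - sum(went_left))
    best_err, best_t = default_err, None
    for t in sorted(set(x))[:-1]:
        # standard direction: x <= t goes left
        err = sum((xi <= t) != wl for xi, wl in zip(x, went_left))
        err = min(err, n - err)             # allow the 'reverse' direction
        if err < best_err:
            best_err, best_t = err, t
    if default_err == 0:
        return best_t, 0.0
    return best_t, (default_err - best_err) / default_err

# A perfect clone of the primary split (association 1.0) ...
surrogate_association([1, 2, 3, 8, 9], [True, True, True, False, False])
# ... and a weaker mimic of the same primary split
surrogate_association([1, 9, 2, 8, 3], [True, True, True, False, False])
```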
  66. What Is "Association"?
     • Association is a measure of the strength of a surrogate
     • The lowest possible reported score is 0 (useless); the highest possible score is 1 (perfect clone)
     • CART starts from the default rule: if you don't know which way to send a data record down a tree, go with the majority (sometimes the weighted majority)
       – If, when training the tree, most cases went left, then in the absence of other information also go left
     • The default rule makes mistakes, of course, because it always sends every record to the same majority side
       – Association measures how much better the surrogate is than the default rule (percent reduction in errors made)
     • The default rule is the "surrogate of last resort"
  67. Competitors and Surrogates: Different Objectives
     • A competitor yields the best possible split when using that variable
     • A surrogate yields the best possible mimic of the primary splitter; goodness of split may be sacrificed to match some aspect of the primary splitter
     • Note that C2 is a competitor with one split point and a surrogate with a different split point
  68. Grow Another Tree on GB2000.XLS
     • We prefer this data set because it has no missing values, making working through examples much easier
     • Don't forget: CART always computes surrogates, and in this way the CART tree is always prepared for future missings
     • We will not be trying to make sense of this tree; we will look just at the mechanics
     • Note the root node splitter and the top surrogate
  69. Root Node Split
     Root splitter: M1 <= -.04645
     Top surrogate: C2 <= -.10835
  70. Main Splitter vs. Best Surrogate

                 Main Splitter      Surrogate
                 Left     Right     Left     Right
     Class 1     672      328       626      374
     Class 2     252      748       300      700
     Total       924      1076      926      1074

     The best surrogate must closely match not only the record counts in the child nodes but also the distribution of the target variable.
  71. Modeling ROOTSPLIT with CART
     Observation: modeling the root node split (we have to create a new variable to reflect this) will not necessarily match the surrogate report. Other factors must be taken into account. Here we get the right variable but not the right split point.
  72. Main Splitter vs. Best Surrogate vs. Modeling the Root Split as a Binary Target

                 Main Splitter      Surrogate          Alternate
                 Left     Right     Left     Right     Left     Right
     Class 1     672      328       626      374       598      402
     Class 2     252      748       300      700       288      712
     Total       924      1076      926      1074      886      1114

     The best surrogate must closely match the record counts in the child nodes and the distribution of the target variable. Modeling the root split on the available predictors will not match the surrogate exactly.
  73. Variable Importance in CART
     • It is hard to imagine now, but in 1984, when the CART monograph was first published, data analysts did not generally rank variables
     • Although informally researchers would pay attention to t-statistics or p-values associated with regression coefficients, researchers frowned on the practice of ranking predictors
     • Since the advent of modern data-analytic methods, researchers expect to see a variable importance ranking for all models
     • It all started with CART!
  74. CART Concept of Variable Importance
     • Variable importance is intended to measure how much work a variable does in a particular tree
     • Variable importance is thus tied to a specific model
     • A variable might be most important in one model and not important at all in a different model built on the same data
     • The fact that a variable is important does not mean that we need it! If we were deprived of the use of an important variable, other available variables might substitute for it or do the same predictive work
     • Variable importance describes the role of a variable in a specific tree
  75. Variable Importance and Tree Size
     • Every tree in the CART sequence has its own variable importance list
     • A small tree will typically have only a few important variables
     • A large tree will typically have many more important variables, because with more nodes there are more chances for more variables to play a role
     • Usually we focus on the tree CART has identified as optimal, but this should not deter you from selecting another (usually smaller) tree
  76. Splitter Improvement Scores
     • Recall that every splitter (and every surrogate) has an associated "improvement" score which measures how good a splitter it is
     • The improvement score for a splitter in a node is always scaled down by the fraction of the data that actually passes through the node
     • 100% of all data pass through the root node, so the root node splitter is always scaled by 100%
     • But a child node of the root might have, say, 30% of the data pass through it; whatever improvement we compute for the split of that node will be multiplied by 0.30
     • Splits lower in the tree have only a small fraction of the full data passing through, so their adjusted improvement scores tend to be small
  77. Variable Importance Computation
     • To construct a variable importance score for a variable, we start by locating every node that the variable split
     • We add up all of the improvement scores generated by that variable in those nodes
     • Then we go through every node in which the variable acted as a surrogate and add up all of those improvement scores as well
     • The grand total is the raw importance score
     • After obtaining raw importance scores for every variable, we rescale the results so that the best score is always 100
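The computation above, including the node-fraction scaling from the previous slide, fits in a few lines. The node records below are entirely made up for illustration; primary-split work and surrogate work both contribute to a variable's total, as the slide describes:

```python
def variable_importance(splits):
    """splits: list of (variable, improvement, node_fraction, role) where
    role is 'primary' or 'surrogate'. Both roles count toward the total.
    Raw importance sums the node-weighted improvements; scores are then
    rescaled so the best variable gets 100."""
    raw = {}
    for var, imp, frac, role in splits:
        raw[var] = raw.get(var, 0.0) + imp * frac
    top = max(raw.values())
    return {v: round(100 * s / top, 1) for v, s in raw.items()}

# Hypothetical node records for illustration
splits = [
    ("HANDPRIC", 0.10, 1.00, "primary"),    # root split: all data passes through
    ("TELEBILC", 0.08, 0.55, "primary"),    # child node with 55% of the data
    ("TELEBILC", 0.05, 1.00, "surrogate"),  # surrogate work also counts
    ("AGE",      0.06, 0.30, "primary"),
]
variable_importance(splits)  # HANDPRIC scores 100, TELEBILC 94, AGE 18
```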
  78. Variations on Importance Scores
     • Breiman, Friedman, Olshen and Stone discuss one idea they ultimately rejected: including competitor improvement scores as well
     • This turns out to be a bad idea because it leads to double-counting
       – If a variable is the 2nd-best splitter in a node, there is an excellent chance that the same split will score well in the child nodes
       – If we were to give the splitter credit in the parent node for being a competitor, we would probably end up giving the exact same split credit again lower down in the tree
       – Another way to think about this: a split is trying to enter the tree; if we do not accept the split right away, the same split may keep trying to enter the tree lower down
       – We only want to give this split credit once
79. 79. BATTERY LOVO • Leave One Variable Out (LOVO) – Available in SPM PRO EX versions but you can accomplish the process manually as well • Take your best modeling setup including your preferred list of predictors • BATTERY LOVO runs a set of models that are identical to your preferred setup except that one variable has been excluded • To be complete we run a "drop just one variable" model for each variable in your KEEP list • If you have 20 variables then BATTERY LOVO will run 20 models (each of which will have 19 predictors) – Now rank the models from worst to best © Copyright Salford Systems 2013
80. 80. BATTERY LOVO Importance Ranking • Using the LOVO procedure tests how much our model deteriorates if we were to remove a given variable • It is sensible to say that a variable is very important if losing it damages the model substantially • Conversely, if losing a variable does no harm then we could conclude that the variable is useless • CAUTION: the LOVO ranking could be quite different from the CART internal ranking and both rankings are "right" – CART measures how much work a variable actually does – LOVO measures how much it hurts to lose a variable © Copyright Salford Systems 2013
81. 81. Randomization Test • Leo Breiman introduced yet another variable importance measure related to his work on tree ensembles • Start with your test data – Score this data with your preferred model to obtain baseline performance – Take the first predictor in the test data and randomly shuffle its values in the column of data – The values are unchanged but they are relocated to rows they do not belong on – Now score again. We would expect performance to drop because one predictor has been damaged. Repeat say 100 times and average the performance deterioration. – Doing this for all variables will produce performance degradation scores and the larger the score the more important the variable © Copyright Salford Systems 2013
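The shuffle test can be sketched directly; this is an illustrative stand-in for SPM's SCORE VARIMP facility, using scikit-learn and synthetic data (all names and settings below are assumptions):

```python
# Breiman-style randomization test: shuffle one test column, re-score,
# and average the drop in AUROC over several repetitions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=600, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)

baseline = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
rng = np.random.default_rng(1)

def shuffle_drop(j, n_reps=30):
    """Average AUROC loss after shuffling column j of the test data."""
    drops = []
    for _ in range(n_reps):
        X_sh = X_te.copy()
        rng.shuffle(X_sh[:, j])           # relocate values to wrong rows
        drops.append(baseline - roc_auc_score(
            y_te, model.predict_proba(X_sh)[:, 1]))
    return float(np.mean(drops))

degradation = {j: shuffle_drop(j) for j in range(X.shape[1])}
# the larger the degradation score, the more important the variable
```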
82. 82. Randomization Test • As of December 2011 this test is only available from the command line of recent versions of SPM • After growing a CART tree and saving the grove, issue these commands from the command line or an SPM Notepad: SCORE VARIMP=YES NPREPS=100 • You may readily run with NPREPS=30 but the results are more reliable with a larger number of replications © Copyright Salford Systems 2013
83. 83. Results from Random Shuffling: Baseline ROC=.85320 © Copyright Salford Systems 2013
Rank  Score   ROC_After  Variable
1     100     0.82144    M1
2     63.21   0.83312    RES
3     45.57   0.83873    LS
4     25.9    0.84498    CR
5     22.66   0.84601    C2
6     21.29   0.84644    BU
7     5.84    0.85135    DT
8     4.25    0.85185    A1
9     4.23    0.85186    PRE
10    3.49    0.85209    OC
11    3.18    0.85219    MAR
12    2.29    0.85248    YM
13    1.64    0.85268    LT
14    0       0.8532     DP
15    0       0.8532     TRA
16    0       0.8532     GEN
17    0       0.8532     A2
18    0       0.8532     B
19    0       0.8532     CP2
20    0       0.8532     CD2
21    0       0.8532     D1
22    0       0.8532     E
23    0       0.8532     M
24    0       0.8532     CH
25    0       0.8532     TY$
  84. 84. Which Importance Score Should I Use? • The internal CART variable importance scores are the easiest and the fastest to obtain and are a great starting point • LOVO scores are useful when your goal is to assess whether you can live without a predictor © Copyright Salford Systems 2013
85. 85. Variable Importance Caution • Importance is a function of the OVERALL tree including the deepest nodes • Suppose you grow a large exploratory tree and review the importances • Then find an optimal tree via test set or CV, yielding a smaller tree • The optimal tree may be the SAME as the exploratory tree in the top nodes • YET the importances might be quite different. WHY? Because the larger tree uses more nodes to compute the importance • When comparing results be sure to compare similar or same sized trees © Copyright Salford Systems 2013
86. 86. Train/Test Consistency Checks • Unlike classical statistics, data mining models generally do not rely on training data to assess model quality • In the SPM data mining suite we are always focused on test data model performance – This is the only way to reliably protect against overfitting • Every modeling method including our classical statistical models in SPM 7.0 offers test data performance measures • Generally these measures are overall model performance indicators – Measures say nothing about internal model details © Copyright Salford Systems 2013
87. 87. CART Tree Assessment • CART uses test data performance of every tree in the back-pruned sequence of progressively smaller trees to identify the overall best performer on classification accuracy • CART also notes which tree achieves the best test data Area Under the ROC (AUROC) curve on the Navigator © Copyright Salford Systems 2013
  88. 88. What more can we do? • CART performance measures have always been overall-tree scores • No specific attention is paid to node-specific performance • However, in real world applications we often want to pay close attention to individual nodes – Might use the rank order of the nodes in important decisions – Prefer to rely on nodes that are most accurate in their predictions of event rates (response) • Therefore we need an additional tool for assessing CART tree performance at the node level • Provided by the PRO EX feature we call TTC – Train/Test Consistency checks © Copyright Salford Systems 2013
  89. 89. Use the GB2000.XLS data set © Copyright Salford Systems 2013 Model setup to select TARGET as the dependent variable CART as the modeling method On the TEST tab we opt for 50% randomly selected test partition
  90. 90. TTC in CART and SPM PRO EX • The TTC report is available from the navigator which displays for every CART model – Look for the TTC button near the bottom of the navigator • TTC relies on separate train and test data partitions which means that TTC is not available when using cross-validation © Copyright Salford Systems 2013
  91. 91. TTC Display © Copyright Salford Systems 2013 Upper panel of TTC display contains one line in the table for every sized tree Bottom row represents the 2 node tree. Top line is for largest tree grown
92. 92. TTC: Select Target Class © Copyright Salford Systems 2013 In this case TARGET=2 represents BAD which is our focus class You, the modeler, choose which class to focus on; there is no "right" class
93. 93. TTC Upper Panel © Copyright Salford Systems 2013 Rank Match: Do the train and test samples rank order the nodes in the same way (a statistical test allows for insignificant "wobbles") Direction Agreement: Do the train and test samples agree as to whether a node is "above average" or "below average" (response, lift, event rate). Again a statistical test allows for insignificant violations
  94. 94. Click on 14 node tree in TTC upper panel © Copyright Salford Systems 2013 Red curve is training data and shows node specific lift (node response/ overall response) Dark Blue horizontal line is the LIFT=1.0 reference line Light blue line with green triangles displays test data 3rd ranked node in train data would be ranked 1st or 2nd in test data
95. 95. TTC Details © Copyright Salford Systems 2013 For the 14 node tree we are told that agreement on "direction" fails 1 time And the rank order agreement fails 5 times (scroll to the right to see this) The statistical sensitivity of the test is controlled by the z-score selected in the Thresholds area to the right of the display. Defaults are 1.00 Setting this threshold to 2.00 will allow much more train/test divergence
96. 96. Changing TTC Sensitivity Threshold © Copyright Salford Systems 2013 Changing the thresholds to 2.00 permits moderate deviations and treats them as statistical noise. After changing thresholds click "Apply" if the display has not updated We prefer to use the 1.00 threshold as this points us to trees with very high consistency that decision makers like to see. It does point to rather small trees.
  97. 97. TTC: Display for 6 node tree © Copyright Salford Systems 2013 Much more defensible tree as train and test data align very well
  98. 98. Summary • TTC focuses on two types of train-test disagreement • DIRECTION: Is this node a response node or not? – We regard disagreement on this fundamental topic to be fatal • RANK ORDER: Are the richest nodes as identified by the training data confirmed in test data – Without this we cannot defend deployment of a tree • TTC allows us to quickly identify which tree in the pruning sequence is the largest satisfying train/test consistency • TTC optimal tree is often rather close in size to Breiman’s 1 SE rule tree – But 1 SE rule does not look inside nodes at all – 1 SE rule is available for cross-validation while TTC is not © Copyright Salford Systems 2013
  99. 99. Controlling Node Sizes In CART With ATOM and MINCHILD • Today’s topic is on the technical side but very easy to understand • Concepts are relevant to all Salford tree-based tools including TreeNet and Random Forests • Controlling the sizes of terminal nodes is a practical matter • If you are using CART, for example, to segment a database you might want to make it impossible to create segments that are too small • Altering terminal node size can also influence performance details of the optimal tree © Copyright Salford Systems 2013
  100. 100. Background: Obtaining Optimal Trees • CART theory teaches us that we cannot arrive at the optimal tree via a stopping rule • The CART authors devoted quite a bit of energy to researching this topic • For any stopping rule it is possible to construct data sets for which that stopping rule will not work • We will end up stopping too early and we will miss important data structure • Result discovered both by experimentation and via mathematical construction © Copyright Salford Systems 2013
  101. 101. Grow First Then Prune • CART methodology is thus to start with an unlimited growing phase • Grow the largest possible tree first • Think of this as a search engine for discovering possibly valuable trees • THEN use pruning to arrive at the optimal tree or a set of trees that yield both acceptable predictive performance and simplicity • CART also insists that we have a test method to make our final tree selection. That is the topic of another session. © Copyright Salford Systems 2013
102. 102. Maximum Tree Size • CART theory tells us that trees should be grown to their maximum size during the growing phase • Thus, trees should be grown until we either – Run out of data (1 record left and thus there is nothing to split) – Node impossible to split because pure (all GOOD or all BAD) – Node impossible to split because all records have identical values for predictors • Experience tells us that if you start with 1,000 records in a typical binary classification problem you should expect about 500 terminal nodes in the largest possible tree – But there could be many fewer • Let's try for the biggest possible tree with the GB2000.xls data © Copyright Salford Systems 2013
  103. 103. An Unlimited Tree Using GB2000.xls © Copyright Salford Systems 2013 To get 349 nodes we set the test method to EXPLORE, MINCHILD=2, ATOM=1
  104. 104. Terminal Node Sample Sizes © Copyright Salford Systems 2013 We obtain this frequency chart by clicking the graph icon in the center left area of the navigator. We can see that many but not all terminal nodes are small.
105. 105. Bottom Left Most Part of Tree © Copyright Salford Systems 2013 We get a relatively large node to the extreme left (all class 2) Remaining three terminal nodes in this snippet are also all "pure" but much smaller Obvious why the tree has to stop here as there is nothing left to do once a node is pure Obtained by right clicking the node of interest and selecting "Display Tree"
  106. 106. Practical Maximal Trees • In real world practice it may not be necessary to push the tree growth to the literal maximum • Essential to grow a large tree – Large enough to include the optimal tree • We can control the size of the maximal CART tree in a number of ways – Some controls tell CART to stop early – Other controls limit CART’s freedom to produce small nodes © Copyright Salford Systems 2013
107. 107. Key Controls over Splits: ATOM and MINCHILD • ATOM – ATOM terminates splitting along a branch of the tree when the node sample size is too small – If a node contains fewer than ATOM data records then STOP – 10 is commonly used but you might set this much larger • MINCHILD – MINCHILD prevents creation of child nodes that are too small – The smallest possible value is 1 meaning that in splitting a node we would be permitted to send 1 solitary record to a child node and all other records to the other child node – Larger values are sensible and desirable. Values such as 5, 10, 20, 30, 50 could work well depending on the data. We have used values as large as 200 © Copyright Salford Systems 2013
  108. 108. Setting ATOM and MINCHILD © Copyright Salford Systems 2013 On Advanced Tab of Model Setup Parent control (ATOM) Terminal node min (MINCHILD)
  109. 109. Setting ATOM and MINCHILD • ATOM: Minimum size required for a node to be a parent • MINCHILD: Minimum size allowed for a child • We recommend that ATOM be set to three times MINCHILD • ATOM must be at least twice MINCHILD to allow a split consistent with MINCHILD • If you set inconsistent values for ATOM and MINCHILD they will be reset automatically to be consistent • To get the control you want be sure that ATOM is at least twice MINCHILD © Copyright Salford Systems 2013
  110. 110. ATOM and MINCHILD • ATOM controls the right to be a parent • Parent must generate two children • Parent must contain enough data to be able to fill two child nodes • So parent must have at least 2*MINCHILD records © Copyright Salford Systems 2013
111. 111. ATOM and MINCHILD • By allowing ATOM to be three times MINCHILD you give CART some flexibility in finding the split • Suppose ATOM=20 and MINCHILD=10. Then we must split this node into two exactly equal child nodes of 10 records each. There is no flexibility here:
          10 records                 10 records
  Min-------------------------------|-----------------------------------Max
                                  split
• If no such split can be found because of clumping of values of the variable then the node cannot be split on that variable © Copyright Salford Systems 2013
112. 112. ATOM is 3 times MINCHILD
        10 records                10 records                10 records
  Min------------------*------|--------------------*--------------------Max
      left child           split region               right child
• In the example above ATOM=30 and the region of possible splitting points lies in between the two asterisks • There can be just one split point. So long as the smaller side has at least 10 records (in this example of MINCHILD=10) there is freedom to choose • To give CART flexibility as to where to locate this last split (at the bottom of the tree) we need to have ATOM > 2*MINCHILD • Not mandatory but worth keeping in mind. So first choose MINCHILD and then set ATOM sensibly © Copyright Salford Systems 2013
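For readers who want to experiment outside SPM, scikit-learn's tree exposes a loose analogy to these two controls. This is an assumption about a rough correspondence, not SPM's implementation: min_samples_split plays a role similar to ATOM (smallest legal parent) and min_samples_leaf similar to MINCHILD (smallest legal child).

```python
# Rough scikit-learn analogs of ATOM and MINCHILD (illustrative only).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
tree = DecisionTreeClassifier(
    min_samples_split=30,   # ATOM-like: need 30 records to be a parent
    min_samples_leaf=10,    # MINCHILD-like: every child keeps >= 10
    random_state=0,
).fit(X, y)

# Confirm no terminal node violates the MINCHILD-like floor
leaf_sizes = tree.tree_.n_node_samples[tree.tree_.children_left == -1]
assert leaf_sizes.min() >= 10
```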
  113. 113. An Unappealing Node Split: Could be prevented by using a larger MINCHILD © Copyright Salford Systems 2013 Only one record is sent to the right and the remaining 1999 records go left Can prevent such splits with a control which does not allow a child to be created with fewer than the specified number of records
114. 114. Experiment to get Best Settings © Copyright Salford Systems 2013 SPM PRO EX Battery Tab Model Setup Select ATOM and MINCHILD Modify values to be tested, optionally We used a 50% random sample for testing
  115. 115. Choosing ATOM and MINCHILD © Copyright Salford Systems 2013 Settings of ATOM=10 and MINCHILD=5 yield a Rel. error within 1% of the literal best
  116. 116. Direct Control Over Tree Size (Almost) • You also have the option of LIMITing tree in a variety of ways including limiting the DEPTH of the tree • To get to the LIMITS menu item you must first go to the Classic Output © Copyright Salford Systems 2013
117. 117. Growing Limits Dialog © Copyright Salford Systems 2013 DEPTH=1 will allow just one split Controlling tree size via a DEPTH limit may yield inferior results We tend to use it only when wanting extremely small trees such as one split
118. 118. LIMITS Details • A tree of depth=1 can have only two terminal nodes • With each additional depth level we allow for a doubling of the number of terminal nodes • Potential sizes are then 2, 4, 8, 16, etc. • However, depth limits do not guarantee a specific number of terminal nodes, only that no terminal node will be deeper than was allowed © Copyright Salford Systems 2013
  119. 119. LIMIT DEPTH=1 © Copyright Salford Systems 2013 We sometimes want to start a CART analysis by splitting just the ROOT node and then reviewing the entire ranked list of potential splitters Mostly useful for very large data sets as this reduces compute time substantially
  120. 120. LIMIT DEPTH=2 © Copyright Salford Systems 2013 Maximum length of any branch will allow two splits between the root node and any terminal node. But some branches might stop early due to pre-pruning.
121. 121. Depth Limit=3 Method GINI © Copyright Salford Systems 2013 With METHOD GINI you may not get every branch of the tree exhibited to the full depth you wanted (due to a technical matter: "pre-pruning")
  122. 122. Depth Limit=3 METHOD PROB © Copyright Salford Systems 2013 You have a better chance of getting every branch grown out to full depth using METHOD PROB
  123. 123. Concluding Remarks • Setting ATOM (smallest legal parent) and MINCHILD (smallest legal child) can help to speed up large database runs • Modest limitation will not harm performance if we take care with the settings • Can and should use experimentation to find best settings • In some circumstances setting these controls to values larger than their minimums can improve performance on test data © Copyright Salford Systems 2013
124. 124. CART and the PRIORS Parameter • If you are a casual user of CART you probably can get by without knowing anything about PRIORS • The default settings of CART handle PRIORS in a way that is well suited for almost all classification problems • A casual user will probably not want to review or understand the more technical output which is printed to the plain text "classic output" window • BUT there are some very effective uses of CART that require judicious manipulation of the PRIORS settings • Therefore a basic understanding of PRIORS may be helpful and worth the effort © Copyright Salford Systems 2013
125. 125. Classic Reference • The original CART monograph, published in 1984, remains one of the great classics of machine learning • Classification and Regression Trees by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, CRC Press • Available also in paperback and as e-book from Amazon • Not the easiest reading but well worth having as a reference and contains fascinating discussions regarding the decisions the authors made in crafting CART • Contains extensive discussion of priors as well as all major concepts relevant to CART. Still worthwhile reading. © Copyright Salford Systems 2013
  126. 126. CART Monograph Details © Copyright Salford Systems 2013
127. 127. For The Casual User • Thinking about a binary 0/1 classification problem we have two ways of evaluating a CART generated segment – Assign the segment to the majority class (more than 50%) – If there are more 1s than 0s then the segment is labeled "1" – Assign the segment to the class with a LIFT greater than 1 – We start with a baseline event rate (fraction of 1s in the data) – Look at the ratio of the event rate in the node to the event rate in the sample • Ratio of event rate in segment to event rate in root – Any segment with a better than baseline event rate is labeled "1" • CART by default uses the LIFT concept for making decisions (known in CART-speak as PRIORS EQUAL) • You can elect to use the first method via PRIORS DATA © Copyright Salford Systems 2013
128. 128. Example Split: Priors Equal © Copyright Salford Systems 2013 Almost 80% GOOD (Class 0) Remainder BAD (Class 1) Left child is considered a BAD dominant node because 36% BAD > 21.4% BAD Priors equal simply ensures that we think in these "relative to what we started with" terms
129. 129. PRIORS EQUAL or PRIORS DATA • PRIORS EQUAL is almost always the right choice – It is the DEFAULT and almost always yields useful results • PRIORS DATA focuses on absolute majority and not relative counts in the data – Will rarely work with highly unbalanced data (e.g. a 10:1 ratio of 0 to 1) • PRIORS can be expressed as a ratio – Default 1:1 – You can set priors to whatever ratio you like • 1.2:1 as we did in the previous example • 5:1 • 10:1 – Changing priors usually changes results, sometimes dramatically – Extreme priors often make getting any tree impossible © Copyright Salford Systems 2013
130. 130. Setting PRIORS Mechanics © Copyright Salford Systems 2013 To set your own PRIORS first click the SPECIFY option The default settings of 1:1 can now be changed To the left the dialog is allowing me to alter the entry for Class 0 Once entered I will be given the opportunity to make a new entry for Class 1
  131. 131. If PRIORS can change results then what is right? • The results CART gives you are intended to reflect what you consider important and what makes sense given your objectives • PRIORS EQUAL usually reflects what most people want • If tweaking the PRIORS and changing them gives you better results given your objectives then use the tweaked priors © Copyright Salford Systems 2013
  132. 132. Advice on PRIORS • Start with the default of EQUAL – Most users never get beyond this! • BATTERY PRIORS – CART PRO EX runs an automatic sweep across dozens of different settings to display the consequences of tweaking the priors – Results are then summarized in tables and charts – Useful when you want to achieve a specific balance of accuracy across the dependent variable classes – Choose the setting that is practically best • Otherwise, you can experiment manually to measure the impact of a change © Copyright Salford Systems 2013
  133. 133. PRIORS: Under the Hood • To understand how PRIORS affect core CART calculations we need to start with a brief review of splitting rules • We will only discuss the Gini to illustrate the key concepts © Copyright Salford Systems 2013
134. 134. Start With Gini Splitting Rule: Two classes • Very simple formula for the two class (binary) dependent variable • Label the classes as Class 0 and Class 1; in a specific node in a tree we represent the shares of the data for the two classes as p0 and p1. These two must sum to 1 (p0 + p1 = 1) • The measure of diversity (or impurity) in a given subset of data (e.g. a node) is given by Impurity = 1 – p0*p0 – p1*p1 • Impurity will equal 0 if either sample share is equal to 1 (100%) • Impurity will equal 0.50 when both sample shares are equal (50%): 1 – (.5*.5) – (.5*.5) = 1 - .25 - .25 = .50 © Copyright Salford Systems 2013
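The two-class formula is easy to check directly. A minimal helper in plain Python (nothing SPM-specific; the function name is our own):

```python
# Two-class Gini impurity exactly as given on the slide:
# 1 - p0^2 - p1^2, which simplifies to 2 * p0 * p1.
def gini_two_class(p1):
    p0 = 1.0 - p1
    return 1.0 - p0 * p0 - p1 * p1

assert gini_two_class(1.0) == 0.0               # pure node: all Class 1
assert gini_two_class(0.0) == 0.0               # pure node: all Class 0
assert abs(gini_two_class(0.5) - 0.5) < 1e-12   # maximal diversity at 50/50
```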
135. 135. Splitting Criteria and Impurity • The Gini measure is just a sensible way to represent how diverse the data is in a node (for a classification problem) – Extensive experience shows it works well, a good measure – You do have a choice of 6 different splitting methods in CART • Useful because it can be used for any number of classes – Every class has a share – Square the shares and subtract them all from 1 • We use the Gini measure as a way to rank competing splits • Split A will be considered better if it produces child nodes with less diversity (on average) than does split B • We measure the goodness of a split by looking at the reduction in impurity relative to the node being split (the parent) © Copyright Salford Systems 2013
  136. 136. Improvement Calculation • Hypothetical Example © Copyright Salford Systems 2013 Parent Node Impurity = 0.50 Left Child Impurity = .30 Right Child Impurity=.20 20% of data 80% of data Left child improves diversity by 0.20 (0.50 – 0.30) Right child improves diversity by 0.30 (0.50 – 0.20) Weighted average impurity is .2*.3 + .8*.2=.22 Improvement from parent is .5 - .22 = .28
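The hypothetical example above can be verified in a couple of lines (a sketch; the helper name is our own):

```python
# Improvement = parent impurity minus the data-share-weighted average
# of the child impurities, as in the hypothetical example on the slide.
def split_improvement(i_parent, p_left, i_left, i_right):
    p_right = 1.0 - p_left
    weighted = p_left * i_left + p_right * i_right
    return i_parent - weighted

# Parent impurity 0.50; left child 0.30 with 20% of the data;
# right child 0.20 with the remaining 80% of the data.
imp = split_improvement(0.50, 0.20, 0.30, 0.20)
# weighted average = .2*.3 + .8*.2 = .22, improvement = .5 - .22 = .28
```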
137. 137. Graphing Gini Impurity (2 classes) • Impurity formula here simplifies to 2p(1-p) • Impurity is greatest when p=(1-p)=0.5 • Impurity is low when p is near either extreme of 0 or 1 as the node is dominated by one class • Declines slowly near p=.5 and accelerates as it approaches 0 or 1 • (Graph omitted; the plotted curve is 2*[2*p*(1-p)], doubled to make it easier to read) © Copyright Salford Systems 2013
138. 138. Split Improvement Measurement (No Missing Values for Splitter) © Copyright Salford Systems 2013 Parent Impurity = 0.50 Left Child Impurity = 0.3967, fraction of data in left child = 55% Right Child Impurity = 0.3457, fraction of data in right child = 45% Weighted average of child node diversity = .3737 Overall improvement of split = .1262
139. 139. As expressed in the CART monograph • Parent node impurity minus the weighted average of the impurities in each child node: Δi(s, t) = i(t) - pL*i(tL) - pR*i(tR) • pL = probability of case going left (fraction of node going left) • pR = probability of case going right (fraction of node going right) • t = node • s = splitting rule • i = impurity © Copyright Salford Systems 2013
  140. 140. Unbalanced Data and PRIORS EQUAL • Calculations for all key quantities become weighted when we use the CART default and the original data is unbalanced • Weighting is used to calculate – Fraction of the data belonging to each class – Fraction of the data in the left and right child nodes – Gini impurity in each node – Resulting improvement of the split (reduction in impurity) • We no longer can use simple ratios • Good news is that the mechanism for weighting is very simple and easy to remember – All counts are expressed as count in the node divided by the corresponding count in the root node © Copyright Salford Systems 2013
141. 141. Calculations for Priors • Our training sample starts with N0 examples of class 0 and N1 examples of class 1 • Now look at any node t in the CART tree – N0(t) examples of class 0 – N1(t) examples of class 1 • Fraction of class 0 will now be calculated as (simplified): (N0(t)/N0) / [(N0(t)/N0) + (N1(t)/N1)] • In other words we convert every count to the ratio of a count in a node (t) to the corresponding count in the root (sample) • Then the math is the same as usual © Copyright Salford Systems 2013
142. 142. What fraction of the data is in a node • Again we use ratios instead of counts to calculate • For priors equal we just average – Fraction of all the Class 0 in a node – Fraction of all the Class 1 in a node • If the priors are not equal then all ratios are first multiplied by the corresponding prior (which acts as a weight): (P0*N0(t)/N0) / [(P0*N0(t)/N0) + (P1*N1(t)/N1)] • When priors are equal the terms all cancel out © Copyright Salford Systems 2013
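The priors-adjusted share formula can be sketched directly (a plain-Python illustration with our own function name; the root counts 790/214 below echo the roughly 79%/21% example data set used later):

```python
# Priors-adjusted proportion of class 0 in node t: every count becomes
# a ratio to its class's root count, weighted by that class's prior.
def adjusted_share(n0_t, n1_t, n0_root, n1_root, prior0=0.5, prior1=0.5):
    w0 = prior0 * n0_t / n0_root
    w1 = prior1 * n1_t / n1_root
    return w0 / (w0 + w1)

# Under PRIORS EQUAL the root node is always reported as 50/50,
# no matter how unbalanced the raw counts are:
print(adjusted_share(790, 214, 790, 214))   # root node itself
```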
143. 143. Priors Incorporated Into Splitting • pi(t) = proportion of class i in node t • Gini = 1 - Σi pi(t)^2 • If PRIORS DATA then pi(t) = Ni(t)/N(t), the raw proportion of class i in node t • Otherwise proportions are always calculated as weighted shares using the priors-adjusted pi: pi(t) = [π(i)*Ni(t)/Ni] / Σj [π(j)*Nj(t)/Nj] © Copyright Salford Systems 2013
  144. 144. Run a Real World Example 79% Class 0 (Good) 21% Class 1 (Bad) © Copyright Salford Systems 2013 Data set BAD_RARE_X.XLS MODEL BAD = X15 just one predictor
  145. 145. Test method: 20% random sample for test © Copyright Salford Systems 2013 We only want to look at the root node split. But tree is quite predictive!
  146. 146. Root Node Split: Under PRIORS EQUAL © Copyright Salford Systems 2013 Main splitter improvement is reported to be .06264 Observe that the left hand child is considered to be Class 1 because the node Class 1 share of 41% is greater than the root share of 21.4%
  147. 147. Classic Output Typical user rarely consults classic output © Copyright Salford Systems 2013 Start by confirming the total record counts in the parent and child nodes Agrees with previous diagram in GUI
  148. 148. Next Confirm Target Class Breakdown © Copyright Salford Systems 2013 Here we see the same counts for Class 0 and Class 1 as in GUI
  149. 149. Priors Adjusted Computations © Copyright Salford Systems 2013 Note first that the parent node is reported to have 50% class 0 and 50% class 1 This is guaranteed for the root node under priors equal With 2 classes each is treated as if it represented half the data With 3 classes each would be treated as if it represented 1/3 of the data Our calculations of the Gini impurity would be based on these priors adjusted shares of the data (or node) The class breakdowns in the child nodes (left and right) are priors adjusted using the formulas presented earlier
  150. 150. Spreadsheet to Reproduce Results © Copyright Salford Systems 2013 Column C contains the counts for each class in the parent and child nodes Column H at the top records the priors Column G displays the priors adjusted shares (raw shares are in Column D) Column F displays raw and priors adjusted child node probabilities Column J displays the Gini diversity in the parent and child nodes and the improvement generated by the weighted average of the child diversities All we need to input are the class counts and the priors and formulas do the rest
  151. 151. Conclusion • Priors are an advanced control that the casual user need not worry about • The default setting is almost always reasonable and almost always yields valuable results • Tweaking the priors can change the details of the tree and can alter results – Sometimes considerably – Can be worth running some experiments • Further discussion in another tutorial © Copyright Salford Systems 2013
152. 152. Modeling automation Report Develop model using a variety of strategies Here we display results for each of the 6 major tree growing methods. Entropy yields best performance here. This is one of 18 different automation schemes. © Copyright Salford Systems 2013
  153. 153. Summary of Variable Importance Results Across alternative modeling strategies © Copyright Salford Systems 2013
  154. 154. Performance Curves of Alternative Models Error plotted against model complexity Four strategies yield similar results; one yields much worse© Copyright Salford Systems 2013
  155. 155. Alternative Modeling Automation Strategies Analyst Can Run All Strategies if desired © Copyright Salford Systems 2013
  156. 156. Automated Modeling: Vary Penalty on False Positives © Copyright Salford Systems 2013
  157. 157. Accuracy among YES and NO groups As penalty on false positive is varied (automatically) © Copyright Salford Systems 2013
158. 158. Automatic Shaving: Backwards Elimination of Least Important Feature © Copyright Salford Systems 2013
  159. 159. Hot Spot Detection: Search many trees for high value segments Lift in node plotted against sample size: Examination of individual nodes from many different trees to find best segments © Copyright Salford Systems 2013
  160. 160. Tabular detail: Hot spot search for special nodes Tree 18 Node 25 defines a segment with 85.3% of the target class Sample size in this segment is N=265 in the test set Clicking on any row brings up tree for examination and review © Copyright Salford Systems 2013
161. 161. Constrained Trees • Many predictive models can benefit from Salford's patent pending "Structured Trees" • Trees constrained in how they are grown to reflect decision support requirements • In the mobile phone example: we want the tree to first segment on customer characteristics and then complete using price variables – Price variables are under the control of the company – Customer characteristics are not under company control © Copyright Salford Systems 2013
  162. 162. Visualizing separate regions of tree © Copyright Salford Systems 2013
  163. 163. Constraint Dialog Model set up specifying allowable ranges for predictors Green indicates where in the tree variables of group are allowed to appear © Copyright Salford Systems 2013
164. Constrained Tree, Mobile Phone: price variables appear only at the bottom; demographic and spend information at the top of the tree. Handset pricing (HANDPRIC) and per-minute pricing (USEPRICE) are at the bottom.
165. Model Deployment I: translate the model into reusable programming code. The new version supports Java, C, PMML, SQL, and SAS®.
166. Automatically Generated Code: can be deployed directly.
167. Model Deployment II: use the Salford Scoring Engine/Server. Controllable via scripting; can be deployed in batch mode on a server.
168. Cross-Validation: Part 1 • Built-in automatic method of self-testing a model for reliability • Honest assessment of the performance characteristics of a model – Will the model perform as expected on previously unseen (new) data? • Available for all principal Salford data mining engines • The 1984 CART monograph was decisive in introducing cross-validation into data mining • Many important details relevant to decision trees and sequences of models were developed in the monograph for the first time
169. Cross-Validation is a Testing Method • Why go to the special trouble of constructing a sophisticated testing method when we can just hold back some test data? • When working with plentiful data it makes perfect sense to reserve a good portion for testing – E.g. a real-world credit risk data set with 150,000 training records and 100,000 test records – Direct marketing data sets with 300,000 training records and 50,000 test records • Not all analytical projects have access to large volumes of data
170. Principal Reason for Cross-Validation: Data Scarcity • When relevant data is scarce we face a data allocation dilemma – If we reserve sufficient data to conduct a reliable test we find ourselves lacking training data – If we insist on having enough training data to build a good model we will have little or nothing left for testing • Train Test • o---------------------------------------------------------------|-------------o • A common division of data is 80% train, 20% test • With 300 data records in total this amounts to 240 train and 60 test
171. Tough decision: How much data to allocate to test? • Train Test • o---------|-------------------------------------------------------------------o • Train Test • o------------------------------|----------------------------------------------o • Train Test • o-------------------------------------------------|---------------------------o • Train Test • o------------------------------------------------------------------------|----o
172. Unbalanced Target Data • In most classification studies the target (dependent variable) distribution is unbalanced • There is usually one large data segment (the non-event) and a smaller data segment (the event) which is the subject of the analysis – Who purchases on an e-commerce website? – Who clicks on a banner ad? – Who benefits from a given medical treatment? – What conditions lead to a manufacturing flaw? • When the data is substantially unbalanced the sample size problem is magnified dramatically – Think of your sample size as being equal to the size of the smaller class – If you have only 100 clicks, that is your data set size – It does not matter much that you have 1 million non-clicks
173. Cross-Validation Strategy: Sample Re-use • Any one train/test partition of the data that leaves enough data for training will yield weak test results – based on just a fragment of the available data • But what if we were to repeat this process many times – using different test partitions? • Imagine the following: we divide the data into many 90/10 train/test partitions and repeat the modeling and testing • Suppose that in every trial we get at least 75% of the test data events classified correctly • This would increase our confidence in the reliability of the model's performance dramatically – Because we have multiple, at least slightly different, tests
174. Cross-Validation Technical Details • Cross-validation requires a specialized preparation of the data, somewhat different from our example of repeated train/test partitioning • We start by dividing the data into K partitions. In the original CART monograph Breiman, Friedman, Olshen, and Stone set K=10 • K=10 has become an industry standard due both to Breiman et al. and to other studies that followed (see final slides for details) • The K partitions should all have the same distribution of the target variable (same fraction of events) and, if possible, be equal in size – it takes care to get this right when the data cannot be evenly divided into K parts • This is all done automatically for you in the SPM software
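The stratified partitioning described above can be sketched in a few lines of plain Python. This is only an illustration of the principle, not SPM's actual algorithm; the function name and the round-robin dealing scheme are our own. With 704 non-events and 126 events (the 830-record split shown in the fold table a few slides later), every fold receives 70 or 71 class-0 records and 12 or 13 class-1 records.

```python
from collections import Counter

def stratified_folds(labels, k):
    """Deal the indices of each class round-robin across k folds so every
    fold gets (as nearly as possible) the same class distribution.
    A sketch of the principle only -- SPM/CART does this automatically."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds

# 830 records: 704 with TARGET=0 and 126 with TARGET=1, as in the telco data.
labels = [0] * 704 + [1] * 126
folds = stratified_folds(labels, 10)

for f in folds:
    c = Counter(labels[i] for i in f)
    # Every fold holds 70-71 class-0 records and 12-13 class-1 records.
    assert c[0] in (70, 71) and c[1] in (12, 13)
```

Round-robin dealing guarantees that fold sizes for each class differ by at most one record, which is exactly the near-equal balance the slide calls for.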
175. Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 test, K-1 train. [Diagram: the data divided into 10 numbered blocks; in each cycle one block serves as the test sample and the remaining nine as the learn sample.] Each partition is in the train sample 9 times and in the test sample once.
176. Build K Models • Once the data has been partitioned into the K parts we are ready to build K models – If we have 10 data partitions we will build 10 models • Each model is constructed by reserving one part for test and the remaining K-1 parts for training – If K=5 then each model will be based on an 80/20 split of the data – If K=10 then each model will be based on a 90/10 split – There is nothing wrong with considering K=15 or K=20 or more • In this strategy it is important to observe that each of the K blocks of data is used as a test sample exactly once • If we could somehow combine all the test results we would have an aggregated test sample equal in size to the training data
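The bookkeeping above can be verified with a short stdlib-only sketch. The fold assignment here is a simple round-robin stand-in for a real stratified split, used only to show that each block is tested exactly once:

```python
def cv_splits(n, k):
    """Yield K train/test index splits: each fold is the test set exactly
    once, with the remaining K-1 folds pooled as the training set."""
    fold_of = [i % k for i in range(n)]  # stand-in for a stratified assignment
    for test_fold in range(k):
        train = [i for i in range(n) if fold_of[i] != test_fold]
        test = [i for i in range(n) if fold_of[i] == test_fold]
        yield train, test

tested = []
for train, test in cv_splits(830, 10):
    assert len(train) + len(test) == 830  # roughly a 90/10 split each cycle
    tested.extend(test)

# Across the 10 cycles every record lands in a test sample exactly once,
# so the combined test results cover the full data set.
assert sorted(tested) == list(range(830))
```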
177. Euro_Telco_Mini.xls Data Set

CVCycle | Class=0 Learn | Class=0 Test | Class=1 Learn | Class=1 Test | CVW
     1  |      634      |      70      |      113      |      13      | 0.1026161
     2  |      633      |      71      |      114      |      12      | 0.0960758
     3  |      634      |      70      |      113      |      13      | 0.1026161
     4  |      633      |      71      |      114      |      12      | 0.0960758
     5  |      634      |      70      |      113      |      13      | 0.1026161
     6  |      633      |      71      |      114      |      12      | 0.0960758
     7  |      634      |      70      |      113      |      13      | 0.1026161
     8  |      634      |      70      |      113      |      13      | 0.1026161
     9  |      633      |      71      |      114      |      12      | 0.0960758
    10  |      634      |      70      |      113      |      13      | 0.1026161

• Here we see the breakdown of the 830-record data set into the 10 CV folds • The table shows sample counts for the majority and minority classes in the learn and test partitions of each fold • Observe that CART has succeeded in making each fold almost identical in the learn/test division and in the balance between TARGET=0 and TARGET=1 • The last column is the weight (CVW) that CART uses on each fold for certain calculations
178. Confusion Matrix (Prediction Success Matrix) • In two-class (e.g. Yes/No) classification, test results can be represented via the 2x2 confusion matrix

                Predicted Y=0   Predicted Y=1
Actual Y=0            20              4
Actual Y=1             1              5

Hypothetical results for the test set of a single cross-validation fold. Note the test sample is quite small, but there will be a number of these (e.g. 10).
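From any such 2x2 matrix the usual performance measures follow directly; here we read them off the hypothetical fold above:

```python
# Hypothetical single-fold test matrix from the slide:
# rows = actual class, columns = predicted class.
tn, fp = 20, 4   # actual Y=0: 20 correct, 4 false positives
fn, tp = 1, 5    # actual Y=1: 1 false negative, 5 correct

n = tn + fp + fn + tp
accuracy = (tn + tp) / n        # 25/30
sensitivity = tp / (tp + fn)    # share of actual Y=1 caught: 5/6

assert n == 30
assert round(accuracy, 3) == 0.833
assert round(sensitivity, 3) == 0.833
```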
179. Aligning the CV Trees (all automatic; the user never sees this)

                 Main     CV1     CV2     CV3     CV4     CV5     CV6     CV7     CV8     CV9    CV10
Nodes               2       2       3       2       2       2       2       2       2       2       2
Complexity    0.01523 0.11543 0.04915 0.12949 0.08684 0.11780 0.09157 0.11464 0.11911 0.11201 0.10531
Nodes               4       6       4       4       4       5       4       4       5       4       4
Complexity    0.01487 0.01736 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.02285
Nodes               5       7       4       4       4       5       4       4       5       4       7
Complexity    0.01189 0.01455 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.01342
Nodes               9       8       4       8       4       9       4       9       6       8      10
Complexity    0.00893 0.01118 0.02034 0.01042 0.03128 0.01219 0.03642 0.01229 0.01140 0.01259 0.01157

• We would expect the trees to be aligned by number of nodes, and this is approximately what happens • CART aligns the trees by a measure of "complexity" discussed in other sessions • Alignment is required to determine the estimated error rate of the main tree when it has been pruned to a specific size (complexity) • Thus when the main tree is pruned to 4 terminal nodes we align each CV tree appropriately: seven of the CV trees are also pruned to 4 nodes, two are pruned to 5 nodes, and one to 6 nodes
180. Summing the Confusion Matrices • Each CV fold generates a test confusion matrix based on a completely separate subset of data • Summed together, the test partitions equal the entire original training data • Summing the confusion matrices therefore yields an aggregate matrix based on a sample equal in size to the original data set • If we started with 300 records the assembled confusion matrix consists of 300 test records • This is not a "trick": each record was genuinely reserved for test exactly once and was classified correctly or incorrectly in its fold • We have thus arrived at the largest possible test sample we could create: as if 100% of the data were used for test!
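Summing the fold-level matrices is simple element-wise arithmetic. The sketch below uses ten copies of the hypothetical 2x2 matrix from the earlier confusion-matrix slide purely for illustration (ten folds of 30 test records each make 300 records in total):

```python
# Ten hypothetical per-fold test matrices in [[TN, FP], [FN, TP]] layout,
# each covering 30 test records.
fold_matrices = [[[20, 4], [1, 5]] for _ in range(10)]

total = [[0, 0], [0, 0]]
for m in fold_matrices:
    for r in range(2):
        for c in range(2):
            total[r][c] += m[r][c]

n_test = sum(sum(row) for row in total)
assert n_test == 300  # the aggregate matrix rests on 300 genuine test records
print(total)          # [[200, 40], [10, 50]]
```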
181. Test Results Extracted From Cross-Validation • Cross-validation is not a method for building a model • Cross-validation is a method for indirectly testing a model that on its own has no test performance results • In classic cross-validation we throw away the K models built on parts of the data and keep only the test results • Modern options for using these K different models exist, and you can save them in SPM – They could be used in a committee or ensemble of models – One of the CV models might turn out to be more interesting than the main model
182. Does Cross-Validation Really Work? • We have tested CV by extracting a small training sample from a much larger database • We used CV to obtain a "simulated" test performance • We then tested our main model against a genuine large test sample extracted from the larger database • Our results were always remarkably in agreement: CV gave essentially the same results as the true test set method • The CART monograph also discusses similar experiments conducted by Breiman, Friedman, Olshen, and Stone (BFOS) • They come to the same conclusion, while observing that 5-fold cross-validation tends to understate model performance and that 20-fold may be slightly more accurate than 10-fold
183. How Many Folds? • How many folds do we need to run to obtain reliable results? • Think about 2-fold CV – Divide the data into two parts – First train on part 1 and test on part 2 – Then reverse the roles of train and test – Assemble the results • The problem with 2-fold CV is that we train on only half the available data – This is a severe disadvantage to the learning process unless we have a large amount of data • The spirit of CV is to use as much training data as possible
184. How Many CV Folds? • In the original CART monograph the authors Breiman, Friedman, Olshen and Stone discussed some experiments • Using a small number of folds such as 5 was typically pessimistic – The results suggested the model was not as good as it really was • Using a substantial number of folds such as 20 was generally only slightly more accurate than 10-fold – The CART authors suggested 10-fold as a default – These results hold for classification problems • These classification model results were re-confirmed in a 1995 paper by Ron Kohavi – A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In International Joint Conference on Artificial Intelligence (IJCAI 1995)
185. Creating Your Own Folds: needs to be done with care with smaller samples • Suppose you have 100 records divided as – 92 records Y=0 – 8 records Y=1 • Each fold must have at least one record of each target class • The best we can do then is to have 8 folds • But we cannot divide 92 into 8 equal parts – 7 parts with 11 records Y=0 (response rate = .0833) – 1 part with 15 records Y=0 (response rate = .0625) • Better to divide as – 4 parts with 11 records Y=0 (response rate = .0833) – 4 parts with 12 records Y=0 (response rate = .0769) – A more equal balance across the folds yields more stable results
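The response-rate arithmetic on this slide is easy to check exactly: each fold receives one Y=1 record, so a fold holding z records of Y=0 has response rate 1/(z+1). A small verification script:

```python
from fractions import Fraction

def response_rates(zero_counts):
    """Response rate per fold, given one Y=1 record in each fold."""
    return [Fraction(1, z + 1) for z in zero_counts]

uneven = response_rates([11] * 7 + [15])     # 7 folds of 11, 1 fold of 15
even   = response_rates([11] * 4 + [12] * 4) # 4 folds of 11, 4 folds of 12

# Both allocations account for all 92 Y=0 records.
assert sum([11] * 7 + [15]) == sum([11] * 4 + [12] * 4) == 92

assert round(float(uneven[0]), 4) == 0.0833   # 1/12
assert round(float(uneven[-1]), 4) == 0.0625  # 1/16 -- the outlier fold
assert round(float(even[-1]), 4) == 0.0769    # 1/13 -- much closer to 1/12
```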
186. Points to Remember • The "main" model in CV is always built on all the training data – Nothing is held back for testing • If you were to run CV in several different ways – Varying the number of folds – Varying the construction of the CV folds by varying the random number seed • You would always get the exact same main model – Only the estimates of test performance could differ • Are the results sensitive to these parameters? – BATTERY CV re-runs the analysis with different numbers of folds • Larger numbers should converge – BATTERY CVR uses the same number of folds but creates the K partitions from different random number seeds • This is expected to yield reasonably stable results • Unstable results suggest considerable uncertainty regarding your model
187. Cross-Validation: Part II • In Part I we reviewed the main ideas behind cross-validation • We pointed out that CV is a method for testing a model • It is especially useful when there is a shortage of data but can be used in any circumstance • A main model is built on all the training data with nothing held back for testing • An additional set of K different models is built on different partitions of the data, holding back some of the data for test • The test results for the K models are aggregated and then used as an estimate of the test set performance of the "main" model
188. Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 test, K-1 train. [Diagram repeated from slide 175: in each cycle one block serves as the test sample and the remaining nine as the learn sample.] Each partition is in the train sample 9 times and in the test sample once.
189. Alignment of Results • In this session we discuss a somewhat technical topic related to the mechanics of aligning test results from the K CV models and the main model • Recall that CART grows a large tree and then prunes it back • Back-pruning is conducted via "cost-complexity" • Back-pruning might prune off more than one terminal node at a time • Back-pruning might prune back several nodes along the same branch • CV generates K different models, each with its own maximal tree and its own sequence of back-pruned trees
190. CV Mechanics: the main model has no test data; each CV model has test data. [Diagram: the main model alongside CV models 1 through 10.] Combine the test results from all CV folds and attribute them to the main model.
191. CART and CV Details • A CART tree model is actually a family of progressively smaller tree models, one of which is normally deemed "optimal" • So we don't just have a main model and K CV models • We have a main tree sequence and K CV tree sequences • For every tree in the main sequence we need to match it up with its corresponding tree in each CV sequence • The most obvious way to do this is by tree size • To estimate the error rate of the 2-node tree in the main tree sequence, match it up with the K 2-node trees found via CV • Then proceed to match up every other tree size found
192. CART Tree Alignment • Matching up trees from the different sequences is much more complicated than this • Each CV tree has its own sequence and its own maximal size • These sequences may not all contain the same tree sizes • The main tree might contain a subtree with 8 terminal nodes, but not every CV tree will contain an 8-node tree – Back-pruning sometimes skips over certain sizes, jumping directly, say, from 9 terminal nodes to 7 • Not all tree sequences will have the same number of nodes in the maximal tree
193. Alignment via Cost-Complexity • Cost-complexity prunes trees by examining a trade-off between error rate (cost) and size of the tree (complexity) • For this discussion, error rate can be taken to be the misclassification rate on the training data • Suppose our maximal tree has a training-data misclassification rate of .00 (not uncommon on training data) but that the tree is very large (e.g. 1000 terminal nodes) • Suppose we penalized terminal nodes at the rate of .0001 • Then the error rate of 0 would be counterbalanced by a penalty of 1000*(.0001) = 0.10 • If we could prune off 500 nodes we would reduce the penalty to .05, but of course our misclassification rate would probably increase • If the increase in misclassification rate were, say, .04 then the total of misclassification rate + penalty would be only .04 + .05 = .09: a benefit!
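The penalty arithmetic above can be written as a one-line criterion. This is a sketch of the idea only, not SPM's internal code:

```python
def total_cost(misclass_rate, n_leaves, penalty):
    """Cost-complexity criterion: training error plus a per-leaf penalty."""
    return misclass_rate + penalty * n_leaves

alpha = 0.0001
full   = total_cost(0.00, 1000, alpha)  # maximal tree: 0 + 0.10 = 0.10
pruned = total_cost(0.04,  500, alpha)  # after pruning: 0.04 + 0.05 = 0.09

assert abs(full - 0.10) < 1e-9 and abs(pruned - 0.09) < 1e-9
assert pruned < full  # at this penalty the smaller tree is preferred
```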
194. CART Cost-Complexity Pruning • CART automatically tests different penalties to try to induce a smaller tree • We always start with a penalty of 0 and then gradually increase it • To prune back we prune off the so-called "weakest link", the node whose removal increases the misclassification rate of the whole tree the least • This means that the sample size of the node is taken into account • A progressive search algorithm for finding the next penalty is described in the CART monograph
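The weakest-link idea from the monograph can be sketched as follows: for each internal node, take the increase in error from collapsing its branch to a single leaf, divided by the number of terminal nodes eliminated; the node with the smallest value is pruned first. The node names and error figures below are hypothetical.

```python
def link_strength(leaf_error, branch_error, branch_leaves):
    """Error increase per terminal node removed if the branch below this
    node is collapsed into a single leaf (smaller = weaker link)."""
    return (leaf_error - branch_error) / (branch_leaves - 1)

# (node, error if collapsed to a leaf, error of its subtree, leaves below)
candidates = [
    ("A", 0.120, 0.100, 5),   # strength (0.020)/4 = 0.0050
    ("B", 0.095, 0.080, 3),   # strength (0.015)/2 = 0.0075
    ("C", 0.200, 0.050, 4),   # strength (0.150)/3 = 0.0500
]

weakest = min(candidates, key=lambda n: link_strength(n[1], n[2], n[3]))
assert weakest[0] == "A"  # pruning A raises the tree's error rate the least
```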
195. Cost-Complexity is the Key to Alignment • For every CART tree sequence a specific penalty on nodes (e.g. .001) leads immediately to exactly one tree of a specific size • We can only find this tree by going through the pruning sequence (there are no shortcuts) • We align the CART CV trees by the penalty (complexity) rather than by tree size • So for a given penalty we find the tree that corresponds to it, both in the main tree sequence and in each CV tree • These aligned trees are used to extract the performance measures that will finally be assigned to the main tree of that size
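Alignment by penalty can be sketched as a lookup over each pruning sequence. The sequences below are hypothetical, shaped like the alignment table on the next slide: each is a list of (penalty threshold, terminal nodes) pairs, with penalties increasing as the tree shrinks.

```python
def tree_at_penalty(sequence, alpha):
    """Return the terminal-node count of the tree in force at penalty alpha:
    the last pruning step whose threshold does not exceed alpha."""
    size = sequence[0][1]
    for threshold, nodes in sequence:
        if threshold <= alpha:
            size = nodes
        else:
            break
    return size

main = [(0.0, 9), (0.0149, 5), (0.0152, 4), (0.1054, 2)]
cv1  = [(0.0, 8), (0.0174, 6), (0.1154, 2)]

# At the same penalty the main tree and a CV tree may differ in size:
# here the main tree has 4 terminal nodes while CV tree 1 still has 6.
assert tree_at_penalty(main, 0.02) == 4
assert tree_at_penalty(cv1, 0.02) == 6
```

This is why a 4-node main tree can legitimately be matched with 5- or 6-node CV trees: they occupy the same penalty interval, not the same size.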
196. Table of Alignments: special extract report, not automatically generated • The table displays the aligned trees corresponding to each tree in the main sequence • In the first row the main tree has been pruned to 2 nodes, as have all but one of the CV trees • When the main tree is pruned to 7 nodes it is aligned with trees of varying sizes ranging from 4 to 7 terminal nodes • The complexity penalties appear under the terminal node counts • Complexity penalties always increase as the tree becomes smaller © Copyright Salford Systems 2013