Advanced CART 2007


1. Data Mining with Decision Trees: An Introduction to CART®
   Dan Steinberg, Mikhail Golovnya
   [email_address]
   http://www.salford-systems.com

2. In The Beginning…
   • In 1984 Berkeley and Stanford statisticians announced a remarkable new classification tool which:
     - Could separate relevant from irrelevant predictors
     - Did not require any kind of variable transforms (logs, square roots)
     - Was impervious to outliers and missing values
     - Could yield relatively simple and easy-to-comprehend models
     - Required little to no supervision by the analyst
     - Was frequently more accurate than traditional logistic regression, discriminant analysis, and other parametric tools

3. The Years of Struggle
   • CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons:
     - The monograph introducing CART is challenging to read: a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
     - The method was not expounded in any textbooks
     - It was originally taught only in advanced graduate statistics classes at a handful of leading universities
     - The original standalone software came with slender documentation, and its output was not self-explanatory
     - The method was a radical departure from conventional statistics

4. The Final Triumph
   • Rising interest in data mining
     - Availability of huge data sets requiring analysis
     - Need to automate, or to accelerate and improve, the analysis process
     - Comparative performance studies
   • Advantages of CART over other tree methods
     - Handling of missing values
     - Assistance in interpretation of results (surrogates)
     - Performance advantages: speed, accuracy
   • New software and documentation made the techniques accessible to end users
   • Word of mouth generated by early adopters
   • Easy-to-use professional software available from Salford Systems

5. So What is CART?
   • Best illustrated with a famous example: the UCSD Heart Disease study
   • Given the diagnosis of a heart attack based on:
     - Chest pain, indicative EKGs, elevation of enzymes typically released by damaged heart muscle
   • Predict who is at risk of a second heart attack and early death within 30 days
     - The prediction will determine the treatment program (intensive care or not)
   • For each patient about 100 variables were available, including:
     - Demographics, medical history, lab results
     - 19 noninvasive variables were used in the analysis

6. Typical CART Solution
   • Example of a CLASSIFICATION tree
   • Dependent variable is categorical (SURVIVE, DIE)
   • Want to predict class membership
   • The tree from the study (structure recovered from the slide graphic; root at top):
     Root (PATIENTS = 215; SURVIVE 178, 82.8%; DEAD 37, 17.2%), split: Is BP <= 91?
       BP <= 91: Terminal Node A (SURVIVE 6, 30.0%; DEAD 14, 70.0%), NODE = DEAD
       BP > 91 (PATIENTS = 195; SURVIVE 172, 88.2%; DEAD 23, 11.8%), split: Is AGE <= 62.5?
         AGE <= 62.5: Terminal Node B (SURVIVE 102, 98.1%; DEAD 2, 1.9%), NODE = SURVIVE
         AGE > 62.5 (PATIENTS = 91; SURVIVE 70, 76.9%; DEAD 21, 23.1%), split: Is SINUS <= .5?
           SINUS <= .5: Terminal Node D (SURVIVE 56, 88.9%; DEAD 7, 11.1%), NODE = SURVIVE
           SINUS > .5: Terminal Node C (SURVIVE 14, 50.0%; DEAD 14, 50.0%), NODE = DEAD

7. How to Read It
   • The entire tree represents a complete analysis or model
   • It has the form of a decision tree
   • The root of the inverted tree contains all data
   • The root gives rise to child nodes
   • Child nodes can in turn give rise to their own children
   • At some point a given path ends in a terminal node
   • The terminal node classifies the object
   • The path through the tree is governed by the answers to QUESTIONS or RULES

8. Decision Questions
   • A question is of the form: is the statement TRUE?
   • Is continuous variable X <= c?
   • Does categorical variable D take on level i, j, or k?
     - e.g., Is geographic region 1, 2, 4, or 7?
   • Standard split:
     - If the answer to the question is YES, a case goes left; otherwise it goes right
     - This is the form of all primary splits
   • The question is formulated so that only two answers are possible
     - Called binary partitioning
     - In CART the YES answer always goes left

9. The Tree is a Classifier
   • Classification is determined by following a case's path down the tree
   • Terminal nodes are associated with a single class
     - Any case arriving at a terminal node is assigned to that class
   • The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
     - Other algorithms often use only the majority rule or a limited set of fixed rules for node class assignment and are thus less flexible
   • In standard classification, tree assignment is not probabilistic
     - Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
   • With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes

10. The Importance of Binary Splits
   • Some researchers have suggested using 3-way, 4-way, etc. splits at each node (univariate)
   • There are strong theoretical and practical reasons why we argue against such approaches:
     - An exhaustive search for a binary split has linear complexity; the alternatives have much higher complexity
     - Choosing the best binary split is "less ambiguous" than the suggested alternatives
     - Most importantly, binary splits result in the smallest rate of data fragmentation, allowing a larger set of predictors to enter the model
     - Any multi-way split can be replaced by a sequence of binary splits
   • There are numerous practical illustrations of the validity of these claims

11. Accuracy of a Tree
   • For classification trees:
     - All objects reaching a terminal node are classified in the same way
       - e.g., all heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS
     - The classification is the same regardless of other medical history and lab results
   • The performance of any classifier can be summarized in terms of the Prediction Success Table (a.k.a. the Confusion Matrix)
   • The key to comparing the performance of different models is how one "wraps" the entire prediction success matrix into a single number:
     - Overall accuracy
     - Average within-class accuracy
     - Most generally, the Expected Cost

12. Prediction Success Table
   • A large number of different terms exists to name various simple ratios based on the table below:
     - Sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc. (a code sketch follows the table)

   TRUE Class               Classified Class 1     Classified Class 2      Total   % Correct
                            "Survivors"            "Early Deaths"
   Class 1 "Survivors"             158                     20               178      88%
   Class 2 "Early Deaths"            9                     28                37      75%
   Total                           167                     48               215      86.51%

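For concreteness, here is a minimal Python sketch (not Salford CART output) that recomputes the main summary ratios from the counts in the table above; treating "Early Deaths" as the positive class for sensitivity/specificity is an assumption made for illustration.

```python
# Minimal sketch: summary ratios from the prediction success table above.
# Rows = true class, columns = predicted class.
import numpy as np

table = np.array([[158, 20],   # true "Survivors": predicted survivor / early death
                  [  9, 28]])  # true "Early Deaths"

overall_accuracy = np.trace(table) / table.sum()        # (158 + 28) / 215 ≈ 0.865
within_class_acc = np.diag(table) / table.sum(axis=1)   # ≈ [0.888, 0.757]
average_within_class_acc = within_class_acc.mean()

# Treating "Early Deaths" as the positive class (an illustrative choice):
sensitivity = table[1, 1] / table[1].sum()              # 28 / 37
specificity = table[0, 0] / table[0].sum()              # 158 / 178

print(overall_accuracy, within_class_acc, average_within_class_acc,
      sensitivity, specificity)
```
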
13. Tree Interpretation and Use
   • Practical decision tool
     - Trees like this are used for real-world decision making
       - Rules for physicians
       - A decision tool for a nationwide team of salesmen needing to classify potential customers
   • Selection of the most important prognostic variables
     - Variable screen for parametric model building
     - The example data set had 100 variables
     - Use the results to find variables to include in a logistic regression
   • Insight: in our example AGE is not relevant if BP is not high
     - Suggests which interactions will be important

14. Binary Recursive Partitioning
   • Data is split into two partitions
     - thus a "binary" partition
   • Partitions can also be split into sub-partitions
     - hence the procedure is recursive
   • A CART tree is generated by repeated dynamic partitioning of the data set (see the sketch below)
     - Automatically determines which variables to use
     - Automatically determines which regions to focus on
     - The whole process is entirely data driven

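As a sketch of the idea, the following toy Python code performs binary recursive partitioning by exhaustive search, using Gini impurity (introduced later in the deck) as the improvement measure. It is an illustrative approximation, not the Salford CART engine; the function names, the min_node stopping rule, and the demo data are invented for the example.

```python
# Toy sketch of binary recursive partitioning (not the Salford CART engine):
# exhaustively search every variable/threshold, keep the best split, recurse.
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (column, threshold, improvement) of the best binary split, or None."""
    best = None
    parent = gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # one candidate per unique value minus one
            left = X[:, j] <= t                    # YES answers go left, as in CART
            p_left = left.mean()
            gain = parent - p_left * gini(y[left]) - (1 - p_left) * gini(y[~left])
            if best is None or gain > best[2]:
                best = (j, t, gain)
    return best

def grow(X, y, min_node=5):
    """Recursively partition the data; returns a nested dict representing the tree."""
    values, counts = np.unique(y, return_counts=True)
    terminal = {"class": values[np.argmax(counts)]}
    if len(y) < min_node or gini(y) == 0.0:        # node too small or already pure
        return terminal
    split = best_split(X, y)
    if split is None:                              # no usable split (constant predictors)
        return terminal
    j, t, _ = split
    left = X[:, j] <= t
    return {"var": j, "thr": t,
            "left":  grow(X[left],  y[left],  min_node),
            "right": grow(X[~left], y[~left], min_node)}

# Tiny demo with made-up data: column 0 plays the role of "AGE", column 1 of "BP".
X = np.array([[40, 85], [50, 95], [65, 110], [70, 120], [45, 100], [80, 90]])
y = np.array([0, 1, 1, 0, 1, 0])
print(grow(X, y, min_node=2))
```
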
15. Three Major Stages
   • Tree growing
     - Consider all possible splits of a node
     - Find the best split according to the given splitting rule (best improvement)
     - Continue sequentially until the largest tree is grown
   • Tree pruning: creating a sequence of nested trees (the pruning sequence) by systematic removal of the weakest branches
   • Optimal tree selection: using a test sample or cross-validation to find the best tree in the pruning sequence (see the sketch below)

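A rough way to reproduce the three stages with open-source CART-style trees (scikit-learn), assuming a 50% test sample and minimal cost-complexity pruning as the pruning mechanism. The breast-cancer dataset is just a stand-in; this is a sketch, not the Salford implementation.

```python
# Sketch of the three stages with scikit-learn's CART-style trees: grow a large
# tree, derive the nested pruning sequence via cost-complexity pruning, and pick
# the best tree by monitoring performance on a test sample.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Stage 1 + Stage 2: grow the largest tree and get the nested pruning sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_learn, y_learn)

# Stage 3: refit each tree in the sequence and monitor test-sample performance.
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_learn, y_learn)
    scores.append((tree.score(X_test, y_test), alpha))

best_score, best_alpha = max(scores)
print(f"best test accuracy {best_score:.3f} at alpha={best_alpha:.5f}")
```
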
16. General Workflow
   • Historical data is divided into Learn, Test, and Validate samples
   • Stage 1: build a sequence of nested trees on the Learn sample
   • Stage 2: monitor performance on the Test sample to select the best tree
   • Stage 3: confirm the findings on the Validate sample

17. Large Trees and the Nearest Neighbor Classifier
   • Fear of large trees is unwarranted!
   • To reach a terminal node of a large tree many conditions must be met
     - Any new case meeting all these conditions will be very similar to the training case(s) in that terminal node
     - Hence the new case should be similar on the target variable as well
   • Key result: the largest tree possible has a misclassification rate R asymptotically not greater than twice the Bayes rate R*
     - The Bayes rate is the error rate of the most accurate possible classifier given the data
   • Thus it is useful to check the largest tree's error rate
     - For example, if the largest tree has a relative error rate of .58, the theoretically best possible is .29
   • The exact analysis follows

18. Derivation
   • Assume a binary outcome and let p(x) be the conditional probability of the focus class given x
   • Then the conditional Bayes risk at point x is r*(x) = min[ p(x), 1 - p(x) ]
   • Now suppose that the actual prediction at x is governed by the class membership of the nearest neighbor x_n; then the conditional NN risk is
     r(x) = p(x)[1 - p(x_n)] + [1 - p(x)] p(x_n) ≈ 2 p(x)[1 - p(x)]
   • Hence r(x) = 2 r*(x)[1 - r*(x)] on large samples
   • Or in the global sense: R* <= R <= 2 R*(1 - R*)

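A small numeric sketch of the relations above: it checks the pointwise identity r(x) = 2 r*(x)[1 - r*(x)] for a few values of p(x) and restates the rule of thumb quoted on the previous slide (R* is at least R/2). The helper names are invented for illustration.

```python
# Numeric check of the bound above (a sketch, not the exact analysis in the deck).
# Pointwise: r(x) = p(1 - q) + (1 - p) q with q -> p for a nearest neighbour on
# large samples, so r(x) -> 2 p (1 - p) = 2 r*(x) (1 - r*(x)), r*(x) = min(p, 1 - p).
def bayes_risk(p):
    return min(p, 1.0 - p)

def nn_risk(p):
    return 2.0 * p * (1.0 - p)

for p in [0.05, 0.2, 0.4, 0.5]:
    r_star = bayes_risk(p)
    r = nn_risk(p)
    assert r_star <= r <= 2.0 * r_star * (1.0 - r_star) + 1e-12

# Rule of thumb from the previous slide: if the largest tree shows (relative)
# error 0.58, the best achievable is at least 0.58 / 2 = 0.29, since R <= 2 R*.
print(0.58 / 2)   # 0.29
```
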
19. Illustration
   • The error of the largest tree is asymptotically squeezed between the green and red curves

20. Impact on the Relative Error
   • RE<x> stands for the asymptotically largest possible relative error of the largest tree, assuming <x> is the initial proportion of the smallest of the target classes (the smallest of the priors)
   • More unbalanced data can substantially inflate the relative errors of the largest trees

21. Searching All Possible Splits
   • For any node CART will examine ALL possible splits
     - Computationally intensive, but there are only a finite number of splits
   • Consider first the variable AGE, which in our data set has a minimum value of 40
     - The rule "Is AGE <= 40?" will separate out these cases to the left: the youngest persons
   • Next increase the AGE threshold to the next youngest person
     - "Is AGE <= 43?"
     - This will direct six cases to the left
   • Continue increasing the splitting threshold value by value

22. Split Tables
   • Split tables are used to locate continuous splits
     - There will be as many splits as there are unique values minus one
   • Example tables on the slide: the data sorted by Age and sorted by Blood Pressure (a code sketch follows)

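A minimal sketch of the threshold enumeration described above, using made-up AGE values (not the study data):

```python
# Sketch: enumerate candidate splits for a continuous predictor. There are
# (number of unique values - 1) candidate thresholds; AGE values are invented.
import numpy as np

age = np.array([40, 43, 43, 45, 51, 51, 58, 62, 70])
thresholds = np.unique(age)[:-1]         # every unique value except the largest
print(thresholds)                        # [40 43 45 51 58 62] -> 6 candidate splits

for t in thresholds:
    goes_left = age <= t                 # "Is AGE <= t?"  YES goes left in CART
    print(f"AGE <= {t}: {goes_left.sum()} cases left, {(~goes_left).sum()} right")
```
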
23. Categorical Predictors
   • Categorical predictors spawn many more splits
     - For example, only four levels A, B, C, D result in 7 splits
     - In general, there will be 2^(K-1) - 1 splits for K levels
   • CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15

        Left     Right
     1  A        B, C, D
     2  B        A, C, D
     3  C        A, B, D
     4  D        A, B, C
     5  A, B     C, D
     6  A, C     B, D
     7  A, D     B, C

24. Binary Target Shortcut
   • When the target is binary, special algorithms reduce compute time to linear in the number of levels (K - 1 candidate splits instead of 2^(K-1) - 1)
     - 30 levels takes twice as long as 15, not 10,000+ times as long
   • When you have high-level categorical variables in a multi-class problem (see the sketch below):
     - Create a binary classification problem
     - Try different definitions of the DPV (which groups to combine)
     - Explore the predictor groupings produced by CART
     - From a study of all results decide which are most informative
     - Create a new grouped predictor variable for the multi-class problem

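A sketch of the classical shortcut for a binary target, using made-up data: sort the levels by the focus-class proportion, and only the K - 1 contiguous groupings along that ordering need to be evaluated.

```python
# Sketch of the binary-target shortcut: sort levels by the focus-class rate and
# check only the K-1 splits along that ordering (instead of 2**(K-1) - 1).
# The (level, target) records below are invented for illustration.
from collections import defaultdict

records = [("A", 1), ("A", 0), ("B", 1), ("B", 1), ("C", 0), ("C", 0), ("D", 1), ("D", 0)]

counts = defaultdict(lambda: [0, 0])          # level -> [n_class0, n_class1]
for level, target in records:
    counts[level][target] += 1

rate = {lvl: c[1] / sum(c) for lvl, c in counts.items()}
ordered = sorted(rate, key=rate.get)          # levels ordered by class-1 proportion

# Only contiguous left/right groupings along this ordering are candidates:
candidates = [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]
for left, right in candidates:
    print(left, "|", right)                   # K - 1 = 3 candidate splits for 4 levels
```
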
25. GYM Model Example
   • The original data comes from a financial institution and is disguised as a health club
   • Problem: need to understand a market research clustering scheme
   • Clusters were created externally using 18 variables and conventional clustering software
   • Need to find simple rules to describe cluster membership
   • A CART tree provides a nice way to arrive at an intuitive story

26. Variable Definitions
   • ID         Identification # of member
   • CLUSTER    Cluster assigned from clustering scheme (10-level categorical coded 1-10)
   • ANYRAQT    Racquet ball usage (binary indicator coded 0, 1)
   • ONRCT      Number of on-peak racquet ball uses
   • ANYPOOL    Pool usage (binary indicator coded 0, 1)
   • ONPOOL     Number of on-peak pool uses
   • PLRQTPCT   Percent of pool and racquet ball usage
   • TPLRCT     Total number of pool and racquet ball uses
   • ONAER      Number of on-peak aerobics classes attended
   • OFFAER     Number of off-peak aerobics classes attended
   • SAERDIF    Difference between number of on- and off-peak aerobics visits
   • TANNING    Number of visits to tanning salon
   • PERSTRN    Personal trainer (binary indicator coded 0, 1)
   • CLASSES    Number of classes taken
   • NSUPPS     Number of supplements/vitamins/frozen dinners purchased
   • SMALLBUS   Small business discount (binary indicator coded 0, 1)
   • OFFER      Terms of offer
   • IPAKPRIC   Index variable for package price
   • MONFEE     Monthly fee paid
   • FIT        Fitness score
   • NFAMMEN    Number of family members
   • HOME       Home ownership (binary indicator coded 0, 1)
   • CLUSTER is the target; the remaining variables (other than ID, which is not used) serve as candidate predictors

27. Initial Runs
   • Dataset: gymsmall.syd
   • Set the target and predictors
   • Set a 50% test sample
   • Navigate the largest, smallest, and optimal trees
   • Check [Splitters] and [Tree Details]
   • Observe the logic of pruning
   • Check [Summary Reports]
   • Understand competitors and surrogates
   • Understand variable importance

28. Pruning Multiple Nodes
   • Note that Terminal Node 5 has class assignment 1
   • All the remaining nodes have class assignment 2
   • Pruning merges Terminal Nodes 5 and 6, creating a "chain reaction" of redundant split removal

29. Pruning the Weaker Split First
   • Eliminating nodes 3 and 4 results in "lesser" damage: losing one correct dominant-class assignment
   • Eliminating nodes 1 and 2, or 5 and 6, would result in losing one correct minority-class assignment

30. Understanding Variable Importance
   • Both FIT and CLASSES are reported as equally important
   • Yet the tree only includes CLASSES
   • ONPOOL is one of the main splitters, yet it has very low importance
   • We want to understand why

31. Competitors and Surrogates
   • Node details for the root node split are shown on the slide
   • Improvement measures class separation between the two sides
   • Competitors are the next best splits in the given node, ordered by improvement values
   • Surrogates are the splits most resembling (case by case) the main split
     - Association measures the degree of resemblance
     - Looking for competitors is free; finding surrogates is expensive
     - It is possible to have an empty list of surrogates
   • FIT is a perfect surrogate (and competitor) for CLASSES

32. The Utility of Surrogates
   • Suppose that the main splitter is INCOME and comes from a survey form
   • People with very low or very high income are likely not to report it: informative missingness in both tails of the income distribution
   • The splitter itself wants to separate low income on the left from high income on the right; hence it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such cases)
   • Using a surrogate helps resolve the ambiguity and redistribute the missing-income records between the two sides
   • For example, all low-education subjects will join the low-income side while all high-education subjects will join the high-income side

33. Competitors and Surrogates Are Different
   • Consider three nominal classes A, B, C, equally present, with equal priors, unit costs, etc.
   • As pictured on the slide, Split X sends classes A and B left and C right, while Split Y sends classes A and C left and B right
   • By symmetry considerations, both splits must result in identical improvements
   • Yet one split cannot be substituted for the other
   • Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
   • The reverse does not hold: a perfect surrogate is always a perfect competitor

34. Calculating Association
   • Ten observations in a node
   • Main splitter X sends 7 cases left and 3 right
   • Split Y mismatches the main splitter on 2 observations
   • The default rule sends all cases to the dominant side, mismatching 3 observations
   • Association = (Default - Split Y) / Default = (3 - 2) / 3 = 1/3

35. Notes on Association
   • The measure is local to a node and non-parametric in nature
   • Association is negative for splits with very poor resemblance; it is not bounded below by -1
   • Any positive association is useful (a better strategy than the default rule)
   • Hence surrogates are defined as the splits with the highest positive association
   • In calculating case mismatch, check the possibility of split reversal (if the rule is met, send cases to the right); see the sketch below
     - CART marks straight splits as "s" and reverse splits as "r"

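A Python sketch of the association computation, including the reversal check; it reproduces the 10-observation example from the previous slide (3 default mismatches, 2 surrogate mismatches, association 1/3). The particular 7-left/3-right pattern is assumed for illustration.

```python
# Sketch: association of a candidate surrogate with the main splitter,
# including the reversal check described above.
def association(main_goes_left, surrogate_goes_left):
    n = len(main_goes_left)
    n_left = sum(main_goes_left)
    # Default rule: send every case to the dominant side of the main split.
    default_mismatch = min(n_left, n - n_left)
    straight = sum(m != s for m, s in zip(main_goes_left, surrogate_goes_left))
    reversed_ = n - straight                   # surrogate applied with sides flipped
    mismatch = min(straight, reversed_)        # use the better of straight / reverse
    return (default_mismatch - mismatch) / default_mismatch

main      = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]    # 7 cases left, 3 right
surrogate = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]    # mismatches the main split on 2 cases
print(association(main, surrogate))           # (3 - 2) / 3 = 0.333...
```
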
36. Calculating Variable Importance
   • List all main splitters as well as all surrogates, along with the resulting improvements
   • Aggregate the list by variable, accumulating improvements
   • Sort the result in descending order
   • Divide each entry by the largest value and scale to 100%
   • The resulting importance list is based exclusively on the given tree; if a different tree is grown, the importance list will most likely change
   • Variable importance reveals a deeper structure, in stark contrast to a casual look at the main splitters (see the sketch below)

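A minimal sketch of the aggregation just described, using made-up (variable, improvement) records loosely patterned on the GYM example:

```python
# Sketch of the importance calculation: pool improvements from main splitters
# and surrogates, sum per variable, then scale the best variable to 100%.
# The (variable, improvement) records below are invented for illustration.
from collections import defaultdict

split_records = [
    ("CLASSES", 0.40), ("FIT", 0.40),     # root main splitter and its perfect surrogate
    ("ONPOOL", 0.05), ("TANNING", 0.02),
    ("PERSTRN", 0.03), ("HOME", 0.01),
]

importance = defaultdict(float)
for var, improvement in split_records:
    importance[var] += improvement

top = max(importance.values())
for var, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{var:10s} {100 * score / top:6.1f}%")   # CLASSES and FIT both at 100%
```
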
37. Mystery Solved
   • In our example, the improvements in the root node dominate the whole tree
   • Consequently, the variable importance reflects the surrogates table in the root node

38. Penalizing Individual Variables
   • We use the variable penalty control to penalize CLASSES a little bit
   • This is enough to ensure FIT takes the main splitter position
   • The improvement associated with CLASSES is now downgraded, multiplied by (1 minus the penalty factor)

39. Penalizing Missing and Categorical Predictors
   • Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage
   • Introducing a systematic penalty set to the proportion of the variable that is missing effectively removes the possible bias
   • The same holds for categorical variables with a large number of distinct levels
   • We recommend that the user always set both controls to 1

40. Splitting Rules: Introduction
   • The splitting rule controls the tree-growing stage
   • At each node, a finite pool of all possible splits exists
     - The pool is independent of the target variable; it is based solely on the observations (and corresponding values of predictors) in the node
   • The task of the splitting rule is to rank all splits in the pool according to their improvements
     - The winner becomes the main splitter
     - The top runners-up become competitors
   • It is possible to construct different splitting rules emphasizing various approaches to the definition of the "best split"

41. Priors Adjusted Probabilities
   • Assume K target classes
   • Introduce prior probabilities π_i, i = 1, …, K
   • Assume the number of LEARN observations within each class is N_i, i = 1, …, K
   • Consider node t containing n_i, i = 1, …, K LEARN records of each target class
   • Then the priors-adjusted probability that a case lands in node t (the node probability) is:
     p_t = Σ_i (n_i / N_i) π_i
   • The conditional probability of class i given that a case reached node t is:
     p_i(t) = (n_i / N_i) π_i / p_t
   • The conditional node probability of a class is called the within-node probability and is used as input to a splitting rule (see the sketch below)

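The two formulas computed directly in Python (a sketch): the class totals 1867/42 are borrowed from the fraud example later in the deck, while the node counts n are made up for illustration.

```python
# Priors-adjusted node and within-node probabilities, as defined above.
import numpy as np

pi = np.array([0.5, 0.5])        # prior probabilities (PRIORS EQUAL here)
N  = np.array([1867, 42])        # LEARN observations per class (fraud-example totals)
n  = np.array([120, 30])         # made-up LEARN counts of each class inside node t

p_t = np.sum((n / N) * pi)                 # node probability p_t
p_i_given_t = (n / N) * pi / p_t           # within-node probabilities p_i(t)
print(p_t, p_i_given_t)
```
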
42. GINI Splitting Rule
   • Introduce the node t impurity as I_t = 1 - Σ_i p_i²(t) = Σ_{i≠j} p_i(t) p_j(t)
     - Observe that a pure node has zero impurity
     - The impurity function is convex
     - The impurity function is maximized when all within-node probabilities are equal
   • Consider any split and let p_left be the probability that a case goes left given that we are in the parent node, and p_right the probability that a case goes right
     - p_left = p_t_left / p_t_parent and p_right = p_t_right / p_t_parent
   • Then the GINI split improvement is defined as (see the sketch below):
     Δ = I_parent - p_left · I_left - p_right · I_right

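A direct Python transcription of the definitions above (a sketch). The example proportions are made up, and plain within-node proportions are used for brevity, which corresponds to PRIORS DATA; with other priors the within-node probabilities from the previous slide would be plugged in instead.

```python
# Gini impurity and split improvement as defined above.
import numpy as np

def gini_impurity(p):
    """I = 1 - sum_i p_i^2 for a vector of within-node probabilities."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def gini_improvement(p_parent, p_left, p_right, prob_left):
    """Delta = I_parent - p_left * I_left - p_right * I_right."""
    prob_right = 1.0 - prob_left
    return (gini_impurity(p_parent)
            - prob_left * gini_impurity(p_left)
            - prob_right * gini_impurity(p_right))

# Example: a 50/50 parent split into a pure left child holding 40% of the cases
# and a right child with class mix 1/6 vs 5/6.
print(gini_improvement([0.5, 0.5], [1.0, 0.0], [1/6, 5/6], prob_left=0.4))  # ≈ 0.333
```
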
43. Illustration

44. GINI Properties
   • The impurity reduction is weighted according to the size of the node on which it operates
   • The GINI splitting rule sequentially minimizes overall tree impurity (defined as the sum, over all terminal nodes, of node impurities times the node probabilities)
   • The splitting rule favors end-cut splits (where one class is separated from the rest), which follows from its mathematical expression

45. Example
   • GYM cluster example, using the 10-level target
   • The GINI splitting rule separates classes 3 and 8 from the rest in the root node

46. ENTROPY Splitting Rule
   • Similar in spirit to GINI
   • The only difference is in the definition of the impurity function:
     I_t = - Σ_i p_i(t) log p_i(t)
     - The function is convex, zero for pure nodes, and maximized for an equal mix of within-node probabilities
   • Because of the different shape, a tree may evolve differently

47. Example
   • GYM cluster example, using the 10-level target
   • The ENTROPY splitting rule separates six classes on the left from four classes on the right in the root node

48. TWOING Splitting Rule
   • Group the K target classes into two super-classes in the parent node according to the rule:
     C1 = { i : p_i(t_left) >= p_i(t_right) }
     C2 = { i : p_i(t_left) < p_i(t_right) }
   • Compute the GINI improvement as if we had a binary target defined by the super-classes above (see the sketch below)
   • This usually results in more balanced splits
   • Target classes of a "similar" nature will tend to group together
     - All minivans are separated from the pickup trucks, ignoring the fine differences within each group

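A sketch of the twoing evaluation exactly as described above: build the two super-classes from the left/right within-node probabilities, then score the split as a binary Gini improvement on those super-classes. The example probabilities are made up.

```python
# Twoing improvement as described on this slide (a sketch).
import numpy as np

def two_class_gini(p1):
    """Gini impurity of a binary distribution (p1, 1 - p1)."""
    return 2.0 * p1 * (1.0 - p1)

def twoing_improvement(p_left_node, p_right_node, prob_left):
    p_left_node, p_right_node = np.asarray(p_left_node), np.asarray(p_right_node)
    prob_right = 1.0 - prob_left
    c1 = p_left_node >= p_right_node                      # super-class C1 membership
    parent = prob_left * p_left_node + prob_right * p_right_node
    # Collapse to the two super-classes and score as a binary Gini improvement:
    return (two_class_gini(parent[c1].sum())
            - prob_left  * two_class_gini(p_left_node[c1].sum())
            - prob_right * two_class_gini(p_right_node[c1].sum()))

# Four classes; the split sends classes 0 and 1 mostly left, 2 and 3 mostly right.
print(twoing_improvement([0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4], prob_left=0.5))
```
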
49. Example
   • GYM cluster example, using the 10-level target
   • The TWOING splitting rule separates five classes on the left from five classes on the right in the root node, producing a very balanced split

50. Best Possible Split
   • Suppose that we have a K-class multinomial problem, with p_i the within-node probabilities for the parent node and p_left the probability of the left direction
   • Further assume that we have the largest possible pool of binary splits available (say, using a categorical splitter with all unique values)
   • Question: what is the resulting best split under each of the splitting rules?
   • GINI: separate the major class j from the rest, where p_j = max_i(p_i); the end-cut approach!
   • TWOING and ENTROPY: group classes together so that |p_left - 0.5| is minimized; the balanced approach!

51. Example
   • GINI trees usually grow narrow and deep while TWOING trees usually grow shallow and wide
   • For a root node with class counts A 40, B 30, C 20, D 10 (as pictured on the slide):
     - GINI best split: {A} (40) versus {B, C, D} (60)
     - TWOING best split: {A, D} (50) versus {B, C} (50)

52. Symmetric GINI
   • One possible way to introduce the cost matrix actively into the splitting rule:
     I_t = Σ_{i≠j} C_{i,j} p_i(t) p_j(t)
     where C_{i,j} stands for the cost of class i being misclassified as class j
   • Note that the costs are automatically symmetrized, rendering the rule less useful for nominal targets with asymmetric costs
   • The rule is most useful when handling an ordinal outcome with costs proportional to the rank distance between the levels
   • Another "esoteric" use would be to cancel asymmetric costs during the tree-growing stage for some reason while allowing asymmetric costs during pruning and model selection

53. Ordered TWOING
   • Assume an ordered categorical target
   • Original twoing ignores the order of classes during the creation of super-classes (a strictly nominal approach)
   • Ordered twoing restricts super-class creation to ordered sets of levels; the rest remains identical to twoing
   • The rule is most useful when an ordered separation of target classes is desirable
     - But recall that the splitting rule has no control over the pool of splits available; it is only used to evaluate and suggest the best possible split. Hence the "magic" of the ordered twoing splitting rule is somewhat limited
     - Using symmetric GINI with properly introduced costs usually provides more flexibility in this case

54. Class Probability Rule
   • During the pruning stage, any pair of adjacent terminal nodes with identical class assignment is automatically eliminated
   • This is because by default CART builds a classifier, and keeping such a pair of nodes would contribute nothing to the model's classification accuracy
   • In reality, we might be interested in seeing such splits
     - For example, having a "very likely" responder node next to a "likely" responder node could be informative
   • The class probability splitting rule does exactly that
     - However, it comes at a price: for convoluted theoretical reasons, optimal class probability trees become over-pruned
     - Therefore, only experienced CART users should try this

55. Example

56. Favor Even Splits Control
   • It is possible to further refine splitting-rule behavior by introducing the "Favor Even Splits" parameter (a.k.a. Power)
   • This control is most sensitive with the twoing splitting rule and was actually "inspired" by it
   • We recommend that the user always consider at least the following minimal set of modeling runs:
     - GINI with Power = 0
     - ENTROPY with Power = 0
     - TWOING with Power = 1
   • BATTERY RULES can be used to accomplish this automatically

57. Fundamental Tree Building Mechanisms
   • Splitting rules have a somewhat limited impact on the overall tree structure, especially for binary targets
   • We are about to introduce the powerful PRIORS and COSTS controls, which have profound implications for all stages of tree development and evaluation
   • These controls lie at the core of successful model building techniques

58. Key Players
   • Using historical data, build a model for appropriately chosen sets of priors π_i and costs C_{i,j}
   • Slide diagram: Population, Dataset, DM Engine, and Model, with the Analyst supplying the priors and costs

59. Fraud Example
   • There are 1909 observations with 42 outcomes of interest
   • Such an extreme class distribution poses a lot of modeling challenges
   • The effect of key control settings is usually magnified on such data
   • We will use this dataset to illustrate the workings of priors and costs below

60. Fraud Example – PRIORS EQUAL (0.5, 0.5)
   • High accuracy at 93%
   • The richest node has 37%
   • Small relative error of 0.145
   • Low class assignment threshold at 2.2%

61. Fraud Example – PRIORS SPECIFY (0.9, 0.1)
   • Reasonable accuracy at 64%
   • The richest node has 57%
   • Larger relative error of 0.516
   • Class assignment threshold around 15%

62. Fraud Example – PRIORS DATA (0.98, 0.02)
   • Poor accuracy at 26%
   • The richest node has 91%
   • High relative error of 0.857
   • Class assignment threshold at 50%

63. Internal Class Assignment Rule
   • Recall the definition of the within-node probability: p_i(t) = (n_i / N_i) π_i / p_t
   • The internal class assignment rule is always the same: assign the node to the class with the highest p_i(t)
   • For PRIORS DATA this simplifies to the simple node-majority rule: p_i(t) ∝ n_i
   • For PRIORS EQUAL this simplifies to assigning the class with the highest lift with respect to its root node presence: p_i(t) ∝ n_i / N_i

64. Putting It All Together
   • As the prior on a given class increases:
     - The class assignment threshold for that class decreases
     - Node richness goes down
     - But class accuracy goes up
   • PRIORS EQUAL uses the root node class ratio as the class assignment threshold; hence, the most favorable conditions for building a tree
   • PRIORS DATA uses the majority rule as the class assignment threshold; hence, difficult modeling conditions on unbalanced classes

65. The Impact of Priors
   • We have a mixture of two overlapping classes
   • The vertical lines in the slide's chart show root node splits for different sets of priors (the left child is classified as red, the right child as blue): π_r = 0.1, π_b = 0.9; π_r = 0.9, π_b = 0.1; π_r = 0.5, π_b = 0.5
   • Varying the priors provides effective control over the tradeoff between class purity and class accuracy

66. Model Evaluation
   • Knowing the model's Prediction Success (PS) table on a given dataset, calculate the Expected Cost for the given cost matrix, assuming the prior distribution of classes in the population
   • Slide diagram: Population, Dataset, Model, PS table cells n_{i,j}, and Evaluator producing the Expected Cost, with the Analyst supplying the priors π_i and costs C_{i,j}

67. Estimating Expected Cost
   • The following estimator is used:
     R = Σ_i Σ_j C_{i,j} (n_{i,j} / N_i) π_i
   • For PRIORS DATA with unit costs: π_i = N_i / N, so
     R = [ Σ_{i≠j} n_{i,j} ] / N, the overall misclassification rate
   • For PRIORS EQUAL with unit costs: π_i = 1 / K, so
     R = { Σ_i ( Σ_{j≠i} n_{i,j} ) / N_i } / K, the average within-class misclassification rate
   • The relative cost is computed by dividing by the expected cost of the trivial classifier (all cases assigned to the dominant cost-adjusted class); see the sketch below

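A sketch of the estimator applied to the PRIORS EQUAL fraud run whose counts appear two slides below (138 class-1 cases and 3 class-2 cases misclassified); the full 2x2 cell layout is inferred from those counts, so treat the numbers as illustrative.

```python
# Expected cost R = sum_i sum_j C[i,j] * (n[i,j] / N_i) * pi_i  (rows = true class),
# plus the relative cost against the trivial classifier.
import numpy as np

def expected_cost(n, priors, costs):
    n = np.asarray(n, dtype=float)
    N = n.sum(axis=1, keepdims=True)                      # class totals N_i
    return float(np.sum(np.asarray(costs) * (n / N) * np.asarray(priors).reshape(-1, 1)))

n = [[1729, 138],       # true class 1 ("good"),  N_1 = 1867 (138 misclassified)
     [   3,  39]]       # true class 2 ("fraud"), N_2 = 42   (3 misclassified)
unit_costs = [[0.0, 1.0],
              [1.0, 0.0]]

R  = expected_cost(n, priors=[0.5, 0.5], costs=unit_costs)                 # ≈ 0.073
R0 = expected_cost([[1867, 0], [42, 0]], [0.5, 0.5], unit_costs)           # trivial: 0.5
print(R, R0, R / R0)    # relative cost ≈ 0.145, matching the PRIORS EQUAL slide
```
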
68. PRIORS DATA – Relative Cost
   • Expected cost R = (31 + 5) / 1909 = 36/1909 = 1.9%
   • Trivial model R0 = 42/1909 = 2.2%
     - This is already a very small error value
   • Relative cost Rel = R / R0 = 36/42 = 0.857
     - The relative cost will be less than 1.0 only if the total number of cases misclassified is below 42: a very difficult requirement

69. PRIORS EQUAL – Relative Cost
   • Expected cost R = (138/1867 + 3/42) / 2 = 7.3%
   • Trivial model R0 = (0/1867 + 42/42) / 2 = 50%
     - A very large error value independent of the initial class distribution
   • Relative cost Rel = R / R0 = 138/1867 + 3/42 = 0.145
     - The relative cost will be less than 1.0 for a wide variety of modeling outcomes
     - The minimum usually tends to equalize individual class accuracies

70. Using Equivalent Costs
   • Setting the ratio of costs to mimic the ratio of the DATA priors reproduces the PRIORS EQUAL run, as seen both from the tree structure and the relative error value
   • On binary problems both mechanisms produce an identical range of models

71. Automated CART Runs
   • Starting with CART 6.0 we have added a new feature called batteries
   • A battery is a collection of runs obtained via a predefined procedure
   • The simplest batteries vary one of the key CART parameters:
     - RULES: run every major splitting rule
     - ATOM: vary the minimum parent node size
     - MINCHILD: vary the minimum node size
     - DEPTH: vary the tree depth
     - NODES: vary the maximum number of nodes allowed
     - FLIP: flip the learn and test samples (two runs only)

72. More Batteries
   • DRAW: each run uses a randomly drawn learn sample from the original data
   • CV: vary the number of folds used in cross-validation
   • CVR: repeated cross-validation, each time with a different random number seed
   • KEEP: select a given number of predictors at random from the initial larger set
     - Supports a core set of predictors that must always be present
   • ONEOFF: one-predictor model for each predictor in the given list
   • The remaining batteries are more advanced and will be discussed individually

73. Battery MCT
   • The primary purpose is to test possible model overfitting by running a Monte Carlo simulation:
     - Randomly permute the target variable
     - Build a CART tree
     - Repeat a number of times
     - The resulting family of profiles measures intrinsic overfitting due to the learning process
     - Compare the actual run's profile to the family of random profiles to check the significance of the results (see the sketch below)
   • This battery is especially useful when the relative error of the original run exceeds 90%

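A sketch of the idea behind BATTERY MCT using open-source trees (not the Salford battery itself); the dataset and the 20-permutation count are arbitrary stand-ins.

```python
# Permutation sketch: permute the target, rebuild the tree, and compare the real
# test error against the distribution of "random" test errors.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

def test_error(y_fit):
    tree = DecisionTreeClassifier(random_state=0).fit(X_learn, y_fit)
    return 1.0 - tree.score(X_test, y_test)

rng = np.random.default_rng(0)
actual = test_error(y_learn)
random_profile = [test_error(rng.permutation(y_learn)) for _ in range(20)]

# If the actual error is well below every permuted run, the structure found by
# the tree is unlikely to be an artifact of the learning process alone.
print(actual, min(random_profile))
```
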
74. Battery MVI
   • Designed to check the impact of missing value indicators (MVIs) on model performance:
     - Build a standard model in which missing values are handled via the mechanism of surrogates
     - Build a model using MVIs only: this measures the amount of predictive information contained in the missing value indicators alone
     - Combine the two
     - Same as above with missing-value penalties turned on

75. Battery PRIOR
   • Vary the prior for the specified class within a given range
     - For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02
   • This results in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion)
   • Can be used for hot spot detection
   • Can also be used to construct a very efficient voting ensemble of trees

76. Battery SHAVING
   • Automates the variable selection process (see the sketch below)
   • Shaving from the top: at each step the most important variable is eliminated
     - Capable of detecting and eliminating "model hijackers": variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)
   • Shaving from the bottom: at each step the least important variable is eliminated
     - May drastically reduce the number of variables used by the model
   • SHAVING ERROR: at each iteration all current variables are tried for elimination one at a time to determine which predictor contributes the least
     - May result in a very long sequence of runs (quadratic complexity)

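A sketch of shaving from the bottom with open-source trees, as an approximation of the battery (not the Salford implementation); the dataset is a stand-in, and the importance measure is scikit-learn's impurity-based importance rather than CART's surrogate-augmented measure.

```python
# Backward "shaving" sketch: repeatedly drop the least important predictor and
# track test performance as the predictor set shrinks.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

keep = list(range(X.shape[1]))                 # indices of predictors still in play
while len(keep) > 1:
    tree = DecisionTreeClassifier(random_state=0).fit(X_learn[:, keep], y_learn)
    score = tree.score(X_test[:, keep], y_test)
    print(f"{len(keep):2d} predictors, test accuracy {score:.3f}")
    weakest = keep[int(np.argmin(tree.feature_importances_))]   # least important
    keep.remove(weakest)
```

Shaving from the top would instead drop the most important variable (argmax) at each step, which is how "model hijackers" can be flushed out.
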
77. Battery SAMPLE
   • Build a series of models in which the learn sample is systematically reduced
   • Used to determine the impact of learn sample size on model performance
   • Can be combined with BATTERY DRAW to get additional spread

78. Battery TARGET
   • Attempt to build a model for each variable in the given list as the target, using the remaining variables as predictors
   • Very effective in capturing inter-variable dependency patterns
   • Can be used as an input to variable clustering and multi-dimensional scaling approaches
   • When run on MVIs only, can reveal missing value patterns in the data

79. Recommended Reading
   • Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
   • Breiman, L. (1996), Bagging Predictors, Machine Learning, 24, 123-140
   • Breiman, L. (1996), Stacking Regressions, Machine Learning, 24, 49-64
   • Friedman, J. H., Hastie, T., and Tibshirani, R. (1998), "Additive Logistic Regression: A Statistical View of Boosting"
