  • 1. Data Mining with Decision Trees: An Introduction to CART® Dan Steinberg, Mikhail Golovnya [email_address]
  • 2. In The Beginning…
    • In 1984 Berkeley and Stanford statisticians announced a remarkable new classification tool which:
      • Could separate relevant from irrelevant predictors
      • Did not require any kind of variable transformation (logs, square roots)
      • Was impervious to outliers and missing values
      • Could yield relatively simple and easy to comprehend models
      • Required little to no supervision by the analyst
      • Was frequently more accurate than traditional logistic regression, discriminant analysis, and other parametric tools
  • 3. The Years of Struggle
    • CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons
      • The monograph introducing CART is challenging to read
        • a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
      • The method was not expounded in any textbooks
      • Originally taught only in advanced graduate statistics classes at a handful of leading universities
      • The original standalone software came with slender documentation, and its output was not self-explanatory
      • The method was a radical departure from conventional statistics
  • 4. The Final Triumph
    • Rising interest in data mining
      • Availability of huge data sets requiring analysis
      • Need to automate, accelerate, and improve the analysis process
      • Comparative performance studies
    • Advantages of CART over other tree methods
      • handling of missing values
      • assistance in interpretation of results (surrogates)
      • performance advantages: speed, accuracy
    • New software and documentation make techniques accessible to end users
    • Word of mouth generated by early adopters
    • Easy to use professional software available from Salford Systems
  • 5. So What is CART?
    • Best illustrated with a famous example: the UCSD Heart Disease study
    • Given the diagnosis of a heart attack based on
      • Chest pain, indicative EKGs, and elevation of enzymes typically released by damaged heart muscle
    • Predict who is at risk of a second heart attack and early death within 30 days
      • Prediction will determine treatment program (intensive care or not)
    • For each patient about 100 variables were available, including:
      • Demographics, medical history, lab results
      • 19 noninvasive variables were used in the analysis
  • 6. Typical CART Solution
    • Example of a CLASSIFICATION tree
    • Dependent variable is categorical (SURVIVE, DIE)
    • Want to predict class membership
    [Tree diagram] Root node: 215 patients (SURVIVE 178, 82.8%; DEAD 37, 17.2%); first split: Is BP <= 91? Terminal Node A: SURVIVE 6 (30.0%), DEAD 14 (70.0%), classified DEAD. Node with 195 patients (SURVIVE 172, 88.2%; DEAD 23, 11.8%); split: Is AGE <= 62.5? Terminal Node B: SURVIVE 102 (98.1%), DEAD 2 (1.9%), classified SURVIVE. Terminal Node C: SURVIVE 14 (50.0%), DEAD 14 (50.0%), classified DEAD. Node with 91 patients (SURVIVE 70, 76.9%; DEAD 21, 23.1%); split: Is SINUS <= .5? Terminal Node D: SURVIVE 56 (88.9%), DEAD 7 (11.1%), classified SURVIVE.
  • 7. How to Read It
    • Entire tree represents a complete analysis or model
    • Has the form of a decision tree
    • Root of inverted tree contains all data
    • Root gives rise to child nodes
    • Child nodes can in turn give rise to their own children
    • At some point a given path ends in a terminal node
    • Terminal node classifies object
    • Path through the tree governed by the answers to QUESTIONS or RULES
  • 8. Decision Questions
    • Question is of the form: is statement TRUE?
    • Is continuous variable X <= c?
    • Does categorical variable D take on levels i, j, or k?
      • e.g. Is geographic region 1, 2, 4, or 7?
    • Standard split:
      • If answer to question is YES a case goes left; otherwise it goes right
      • This is the form of all primary splits
    • Question is formulated so that only two answers possible
      • Called binary partitioning
      • In CART the YES answer always goes left
  • 9. The Tree is a Classifier
    • Classification is determined by following a case’s path down the tree
    • Terminal nodes are associated with a single class
      • Any case arriving at a terminal node is assigned to that class
    • The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
      • Other algorithms often use only the majority rule or a limited set of fixed rules for node class assignment thus being less flexible
    • In standard classification, tree assignment is not probabilistic
      • Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
    • With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes
  • 10. The Importance of Binary Splits
    • Some researchers suggested using 3-way, 4-way, etc. splits at each node (univariate)
    • There are strong theoretical and practical reasons why we argue strongly against such approaches:
      • Exhaustive search for a binary split has linear algorithm complexity, the alternatives have much higher complexity
      • Choosing the best binary split is “less ambiguous” than the suggested alternatives
      • Most importantly, binary splits result in the smallest rate of data fragmentation, thus allowing a larger set of predictors to enter a model
      • Any multi-way split can be replaced by a sequence of binary splits
    • There are numerous practical illustrations of the validity of the above claims
  • 11. Accuracy of a Tree
    • For classification trees
      • All objects reaching a terminal node are classified in the same way
        • e.g. All heart attack victims with BP>91 and AGE less than 62.5 are classified as SURVIVORS
      • Classification is same regardless of other medical history and lab results
    • The performance of any classifier can be summarized in terms of the Prediction Success Table (a.k.a. Confusion Matrix)
    • The key to comparing the performance of different models is how one “wraps” the entire prediction success matrix into a single number
      • Overall accuracy
      • Average within-class accuracy
      • Most generally – the Expected Cost
  • 12. Prediction Success Table
    • A large number of different terms exist to name various simple ratios based on the table below:
      • Sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
    Classified As:
                                 Class 1        Class 2
    TRUE Class                   "Survivors"    "Early Deaths"    Total    % Correct
    Class 1 "Survivors"              158             20            178       88%
    Class 2 "Early Deaths"             9             28             37       75%
    Total                            167             48            215       86.51%
  • 13. Tree Interpretation and Use
    • Practical decision tool
      • Trees like this are used for real world decision making
        • Rules for physicians
        • Decision tool for a nationwide team of salesmen needing to classify potential customers
    • Selection of the most important prognostic variables
      • Variable screen for parametric model building
      • Example data set had 100 variables
      • Use results to find variables to include in a logistic regression
    • Insight — in our example AGE is not relevant if BP is not high
      • Suggests which interactions will be important
  • 14. Binary Recursive Partitioning
    • Data is split into two partitions
      • thus “binary” partition
    • Partitions can also be split into sub-partitions
      • hence procedure is recursive
    • CART tree is generated by repeated dynamic partitioning of data set
      • Automatically determines which variables to use
      • Automatically determines which regions to focus on
      • The whole process is totally data driven
  • 15. Three Major Stages
    • Tree growing
      • Consider all possible splits of a node
      • Find the best split according to the given splitting rule (best improvement)
      • Continue sequentially until the largest tree is grown
    • Tree Pruning – creating a sequence of nested trees (pruning sequence) by systematic removal of the weakest branches
    • Optimal Tree Selection – using test sample or cross-validation to find the best tree in the pruning sequence
  • 16. General Workflow Stage 1 Stage 2 Stage 3 Historical Data Learn Test Validate Build a Sequence of Nested Trees Monitor Performance Best Confirm Findings
  • 17. Large Trees and Nearest Neighbor Classifier
    • Fear of large trees is unwarranted!
    • To reach a terminal node of a large tree many conditions must be met
      • Any new case meeting all these conditions will be very similar to the training case(s) in terminal node
      • Hence the new case should be similar on the target variable also
    • Key Result: the largest possible tree has a misclassification rate R asymptotically no greater than twice the Bayes rate R*
      • Bayes rate is the error rate of the most accurate possible classifier given the data
    • Thus useful to check largest tree error rate
      • For example, largest tree has relative error rate of .58 so theoretically best possible is .29
    • The exact analysis follows
  • 18. Derivation
    • Assume a binary outcome and let p(x) be the conditional probability of the focus class given x
    • Then the conditional Bayes risk at point x is r*(x) = min[ p(x), 1 - p(x) ]
    • Now suppose that the actual prediction at x is governed by the class membership of the nearest neighbor x_n; then the conditional NN risk is r(x) = p(x)[1 - p(x_n)] + [1 - p(x)] p(x_n) ≈ 2 p(x)[1 - p(x)]
    • Hence r(x) = 2 r*(x)[1 - r*(x)] on large samples
    • Or in the global sense: R* <= R <= 2 R*(1 - R*)
  • 19. Illustration
    • The error of the largest tree is asymptotically squeezed between the green and red curves
  • 20. Impact on the Relative Error
    • RE<x> stands for the asymptotically largest possible relative error of the largest tree, assuming <x> is the initial proportion of the smallest of the target classes (the smallest of the priors)
    • More unbalanced data might substantially inflate the relative errors of the largest trees
  • 21. Searching All Possible Splits
    • For any node CART will examine ALL possible splits
      • Computationally intensive but there are only a finite number of splits
    • Consider first the variable AGE, which in our data set has a minimum value of 40
      • The rule
        • Is AGE <= 40? will separate out these cases to the left (the youngest persons)
    • Next increase the AGE threshold to the next youngest person
        • Is AGE <= 43?
        • This will direct six cases to the left
    • Continue increasing the splitting threshold value by value
  • 22. Split Tables
    • Split tables are used to locate continuous splits
      • There will be as many splits as there are unique values minus one
    [Split tables shown sorted by AGE and sorted by BLOOD PRESSURE]
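    A minimal sketch in Python (not part of the original slides) of how the candidate splits for a continuous variable are enumerated from a split table; the AGE values below are made up for illustration.

      import numpy as np

      def candidate_splits(values):
          """One candidate rule per unique value except the largest:
          'Is X <= v?' -- as many splits as unique values minus one."""
          uniq = np.unique(values)                  # sorted unique values
          return uniq[:-1]

      age = np.array([40, 43, 43, 47, 50, 50, 62, 65])   # toy AGE column
      for v in candidate_splits(age):
          left = age <= v                           # in CART the YES answer goes left
          print(f"Is AGE <= {v}?  left={left.sum()}  right={(~left).sum()}")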
  • 23. Categorical Predictors
    • Categorical predictors spawn many more splits
      • For example, only four levels A, B, C, D result in 7 splits (listed below)
      • In general, there will be 2^(K-1) - 1 splits
    • CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15
         Left      Right
      1  A         B, C, D
      2  B         A, C, D
      3  C         A, B, D
      4  D         A, B, C
      5  A, B      C, D
      6  A, C      B, D
      7  A, D      B, C
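    A small sketch (Python, illustrative only) of enumerating the 2^(K-1) - 1 left/right partitions of a categorical predictor; fixing one level on the left avoids counting each partition twice.

      from itertools import combinations

      def categorical_splits(levels):
          """Enumerate the 2**(K-1) - 1 distinct left/right partitions of K levels."""
          levels = list(levels)
          anchor, rest = levels[0], levels[1:]      # anchor one level on the left side
          splits = []
          for size in range(len(rest) + 1):
              for combo in combinations(rest, size):
                  left = {anchor, *combo}
                  right = set(levels) - left
                  if right:                         # drop the degenerate split with an empty right side
                      splits.append((sorted(left), sorted(right)))
          return splits

      for left, right in categorical_splits("ABCD"):
          print(left, "|", right)                   # 7 splits: the same partitions as the table above,
                                                    # up to swapping the left and right sides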
  • 24. Binary Target Shortcut
    • When the target is binary, special algorithms reduce compute time to linear in the number of levels (K - 1 splits to check instead of 2^(K-1) - 1)
      • 30 levels takes twice as long as 15, not 10,000+ times as long
    • When you have high-level categorical variables in a multi-class problem
      • Create a binary classification problem
      • Try different definitions of DPV (which groups to combine)
      • Explore predictor groupings produced by CART
      • From a study of all results decide which are most informative
      • Create new grouped predicted variable for multi-class problem
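    A rough sketch of the idea behind the binary-target shortcut, using made-up data: sort the levels by the within-level proportion of the focus class; only the K - 1 prefixes of that ordering need to be examined as left groups.

      import numpy as np

      def binary_target_candidate_splits(category, y):
          """category: array of level labels, y: 0/1 target.
          Returns the K-1 left-groups that need to be examined."""
          levels = np.unique(category)
          # proportion of the focus class (y == 1) within each level
          rate = {lv: y[category == lv].mean() for lv in levels}
          ordered = sorted(levels, key=lambda lv: rate[lv])
          # only the K-1 prefixes of the ordering can be optimal left groups
          return [set(ordered[:k]) for k in range(1, len(ordered))]

      cat = np.array(list("AABBBCCDDD"))            # toy categorical predictor
      y   = np.array([0,1,1,1,0,0,0,1,1,1])         # toy binary target
      for left in binary_target_candidate_splits(cat, y):
          print(sorted(left))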
  • 25. GYM Model Example
    • The original data comes from a financial institution and is disguised as a health club
    • Problem: need to understand a market research clustering scheme
    • Clusters were created externally using 18 variables and conventional clustering software
    • Need to find simple rules to describe cluster membership
    • CART tree provides a nice way to arrive at an intuitive story
  • 26. Variable Definitions
      • ID Identification # of member
      • CLUSTER Cluster assigned from clustering scheme (10 level categorical coded 1-10)
      • ANYRAQT Racquet ball usage (binary indicator coded 0, 1)
      • ONRCT Number of on-peak racquet ball uses
      • ANYPOOL Pool usage (binary indicator coded 0, 1)
      • ONPOOL Number of on-peak pool uses
      • PLRQTPCT Percent of pool and racquet ball usage
      • TPLRCT Total number of pool and racquet ball uses
      • ONAER Number of on-peak aerobics classes attended
      • OFFAER Number of off-peak aerobics classes attended
      • SAERDIF Difference between number of on- and off-peak aerobics visits
      • TANNING Number of visits to tanning salon
      • PERSTRN Personal trainer (binary indicator coded 0, 1)
      • CLASSES Number of classes taken
      • NSUPPS Number of supplements/vitamins/frozen dinners purchased
      • SMALLBUS Small business discount (binary indicator coded 0, 1)
      • OFFER Terms of offer
      • IPAKPRIC Index variable for package price
      • MONFEE Monthly fee paid
      • FIT Fitness score
      • NFAMMEN Number of family members
      • HOME Home ownership (binary indicator coded 0, 1)
    [Legend: variables marked on the slide as predictors, not used, or target]
  • 27. Initial Runs
    • Dataset: gymsmall.syd
    • Set the target and predictors
    • Set 50% test sample
    • Navigate the largest, smallest, and optimal trees
    • Check [Splitters] and [Tree Details]
    • Observe the logic of pruning
    • Check [Summary Reports]
    • Understand competitors and surrogates
    • Understand variable importance
  • 28. Pruning Multiple Nodes
    • Note that Terminal Node 5 has class assignment 1
    • All the remaining nodes have class assignment 2
    • Pruning merges Terminal Nodes 5 and 6 thus creating a “chain reaction” of redundant split removal
  • 29. Pruning Weaker Split First
    • Eliminating nodes 3 and 4 results in "lesser" damage: losing one correct dominant class assignment
    • Eliminating nodes 1 and 2 or 5 and 6 would result in losing one correct minority class assignment
  • 30. Understanding Variable Importance
    • Both FIT and CLASSES are reported as equally important
    • Yet the tree only includes CLASSES
    • ONPOOL is one of the main splitters yet it has very low importance
    • Want to understand why
  • 31. Competitors and Surrogates
    • Node details for the root node split are shown above
    • Improvement measures class separation between two sides
    • Competitors are next best splits in the given node ordered by improvement values
    • Surrogates are splits most resembling (case by case) the main split
      • Association measures the degree of resemblance
      • Looking for competitors is free, finding surrogates is expensive
      • It is possible to have an empty list of surrogates
    • FIT is a perfect surrogate (and competitor) for CLASSES
  • 32. The Utility of Surrogates
    • Suppose that the main splitter is INCOME and comes from a survey form
    • People with very low or very high income are likely not to report it – informative missingness in both tails of the income distribution
    • The splitter itself wants to separate low income on the left from high income on the right – hence, it is unreasonable to treat missing INCOME as a separate unique value (usual way to handle such case)
    • Using a surrogate will help to resolve ambiguity and redistribute missing income records between the two sides
    • For example, all low education subjects will join the low income side while all high education subjects will join the high income side
  • 33. Competitors and Surrogates Are Different
    • For symmetry considerations, both splits must result in identical improvements
    • Yet one split can’t be substituted for another
    • Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
    • The reverse does not hold – a perfect surrogate is always a perfect competitor
    [Diagram: a parent node containing three classes A, B, C; Split X and Split Y partition the classes into different left/right groups, yielding identical improvements but poor case-by-case agreement]
    • Three nominal classes
    • Equally present
    • Equal priors, unit costs, etc.
  • 34. Calculating Association
    • Ten observations in a node
    • Main splitter X
    • Split Y mismatches on 2 observations
    • The default rule sends all cases to the dominant side mismatching 3 observations
    [Diagram: the ten observations as routed by Split X, Split Y, and the default rule]
    • Association = (Default – Split Y)/Default = (3-2)/3 = 1/3
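    A minimal sketch of the association calculation for the ten-observation example above; the boolean arrays simply record whether each case goes left. (CART would also try the reversed surrogate split; that check is omitted here.)

      import numpy as np

      def association(main_left, surrogate_left):
          """main_left, surrogate_left: boolean arrays -- does each case go left?
          Association = (default mismatches - surrogate mismatches) / default mismatches."""
          n = len(main_left)
          n_left = main_left.sum()
          default_mismatch = min(n_left, n - n_left)          # send everything to the dominant side
          surrogate_mismatch = (main_left != surrogate_left).sum()
          return (default_mismatch - surrogate_mismatch) / default_mismatch

      # ten observations: the main split sends 7 left; Split Y mismatches on 2 of them
      x_left = np.array([1,1,1,1,1,1,1,0,0,0], dtype=bool)
      y_left = np.array([1,1,1,1,1,1,0,0,0,1], dtype=bool)
      print(association(x_left, y_left))                      # (3 - 2) / 3 = 1/3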
  • 35. Notes on Association
    • Measure local to a node, non-parametric in nature
    • Association is negative for nodes with very poor resemblance; it is not bounded below by -1
    • Any positive association is useful (better strategy than the default rule)
    • Hence define surrogates as splits with the highest positive association
    • In calculating case mismatch, check the possibility for split reversal (if the rule is met, send cases to the right)
      • CART marks straight splits as “s” and reverse splits as “r”
  • 36. Calculating Variable Importance
    • List all main splitters as well as all surrogates along with the resulting improvements
    • Aggregate the above list by variable accumulating improvements
    • Sort the result descending
    • Divide each entry by the largest value and scale to 100%
    • The resulting importance list is based on the given tree exclusively; if a different tree is grown, the importance list will most likely change
    • Variable importance reveals a deeper structure in stark contrast to a casual look at the main splitters
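    A sketch of the aggregation described above; the (variable, improvement) pairs are hypothetical and would normally be collected from every main splitter and surrogate in the tree.

      from collections import defaultdict

      def variable_importance(splits):
          """splits: list of (variable, improvement) pairs from all splitters and surrogates."""
          totals = defaultdict(float)
          for var, imp in splits:
              totals[var] += imp                              # aggregate improvements per variable
          ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
          top = ranked[0][1]
          return [(var, 100.0 * score / top) for var, score in ranked]

      # hypothetical tree: main splitter plus a perfect surrogate in the root, weak splits below
      splits = [("CLASSES", 0.30), ("FIT", 0.30), ("ONPOOL", 0.02), ("TANNING", 0.01)]
      for var, pct in variable_importance(splits):
          print(f"{var:10s} {pct:6.1f}%")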
  • 37. Mystery Solved
    • In our example improvements in the root node dominate the whole tree
    • Consequently, the variable importance reflects the surrogates table in the root node
  • 38. Penalizing Individual Variables
    • We use variable penalty control to penalize CLASSES a little bit
    • This is enough to ensure FIT the main splitter position
    • The improvement associated with CLASSES is now downgraded by 1 minus the penalty factor
  • 39. Penalizing Missing and Categorical Predictors
    • Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage
    • Introducing a systematic penalty set to the proportion of missing values for the variable effectively removes this possible bias
    • The same holds for categorical variables with a large number of distinct levels
    • We recommend that the user always set both controls to 1
  • 40. Splitting Rules - Introduction
    • Splitting rule controls the tree growing stage
    • At each node, a finite pool of all possible splits exists
      • The pool is independent of the target variable; it is based solely on the observations (and corresponding values of predictors) in the node
    • The task of the splitting rule is to rank all splits in the pool according to their improvements
      • The winner will become the main splitter
      • The top runner-ups will become competitors
    • It is possible to construct different splitting rules emphasizing various approaches to the definition of the “best split”
  • 41. Priors Adjusted Probabilities
    • Assume K target classes
    • Introduce prior probabilities πᵢ, i = 1, …, K
    • Assume the number of LEARN observations within each class is Nᵢ, i = 1, …, K
    • Consider node t containing nᵢ, i = 1, …, K, LEARN records of each target class
    • Then the priors-adjusted probability that a case lands in node t (the node probability) is: p(t) = Σᵢ (nᵢ / Nᵢ) πᵢ
    • The conditional probability of class i given that a case reached node t is: pᵢ(t) = (nᵢ / Nᵢ) πᵢ / p(t)
    • The conditional node probability of a class is called the within-node probability and is used as an input to a splitting rule
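    A minimal sketch of these two formulas in Python; the node and root counts in the example call are borrowed from the heart-attack example (Terminal Node B against the root class counts) purely for illustration.

      import numpy as np

      def within_node_probabilities(n, N, priors):
          """n[i]: learn cases of class i in the node, N[i]: learn cases of class i overall,
          priors[i]: prior probability of class i.  Returns (p(t), p_i(t))."""
          n, N, priors = map(np.asarray, (n, N, priors))
          contrib = (n / N) * priors                # (n_i / N_i) * pi_i
          p_t = contrib.sum()                       # node probability p(t)
          return p_t, contrib / p_t                 # within-node probabilities p_i(t)

      # two-class node with PRIORS EQUAL (0.5, 0.5)
      p_t, p = within_node_probabilities(n=[102, 2], N=[178, 37], priors=[0.5, 0.5])
      print(p_t, p)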
  • 42. GINI Splitting Rule
    • Introduce the node t impurity as I(t) = 1 - Σᵢ pᵢ²(t) = Σ_{i≠j} pᵢ(t) pⱼ(t)
      • Observe that a pure node has zero impurity
      • The impurity function is concave
      • The impurity function is maximized when all within-node probabilities are equal
    • Consider any split and let p_left be the probability that a case goes left given that we are in the parent node, and p_right the probability that a case goes right
      • p_left = p(t_left) / p(t_parent) and p_right = p(t_right) / p(t_parent)
    • Then the GINI split improvement is defined as:
      • ΔI = I(parent) - p_left · I(left) - p_right · I(right)
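    A sketch of the GINI impurity and split improvement as defined above; the probabilities in the example call are made up.

      import numpy as np

      def gini(p):
          """Gini impurity I(t) = 1 - sum_i p_i(t)^2 for within-node probabilities p."""
          p = np.asarray(p)
          return 1.0 - np.sum(p ** 2)

      def gini_improvement(p_parent, p_left_node, p_right_node, p_left, p_right):
          """Delta = I(parent) - p_left * I(left) - p_right * I(right)."""
          return gini(p_parent) - p_left * gini(p_left_node) - p_right * gini(p_right_node)

      # toy split: parent 50/50, left child 90/10, right child 10/90, equal child probabilities
      print(gini_improvement([0.5, 0.5], [0.9, 0.1], [0.1, 0.9], 0.5, 0.5))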
  • 43. Illustration
  • 44. GINI Properties
    • The impurity reduction is weighted according to the node size upon which it operates
    • GINI splitting rule sequentially minimizes overall tree impurity (defined as the sum over all terminal nodes of node impurities times the node probabilities)
    • The splitting rule favors end-cut splits (when one class is separated from the rest) – which follows from its mathematical expression
  • 45. Example
    • GYM cluster example, using 10-level target
    • GINI splitting rule separates classes 3 and 8 from the rest in the root node
  • 46. ENTROPY Splitting Rule
    • Similar in spirit to GINI
    • The only difference is in the definition of the impurity function
      • I(t) = - Σᵢ pᵢ(t) log pᵢ(t)
      • The function is concave, zero for pure nodes, and maximized for an equal mix of within-node probabilities
    • Because of the different shape, a tree may evolve differently
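    For comparison, a sketch of the entropy impurity; the improvement formula is the same as for GINI, only the impurity function changes.

      import numpy as np

      def entropy(p):
          """Entropy impurity I(t) = -sum_i p_i(t) log p_i(t); zero terms are dropped."""
          p = np.asarray(p, dtype=float)
          p = p[p > 0]
          return -np.sum(p * np.log(p))

      print(entropy([0.5, 0.5]), entropy([1.0, 0.0]))   # maximal for an even mix, zero when pure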
  • 47. Example
    • GYM cluster example, using 10-level target
    • ENTROPY splitting rule separates six classes on the left from four classes on the right in the root node
  • 48. TWOING Splitting Rule
    • Group the K target classes into two super-classes in the parent node according to the rule: C1 = { i : pᵢ(t_left) >= pᵢ(t_right) }, C2 = { i : pᵢ(t_left) < pᵢ(t_right) }
    • Compute the GINI improvement assuming that we had a binary target defined by the super-classes above
    • This usually results in more balanced splits
    • Target classes of “similar” nature will tend to group together
      • All minivans are separated from the pickup trucks, ignoring the fine differences within each group
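    A sketch following the description above: collapse the classes into the two super-classes and score the split with the two-class GINI improvement. The class distributions in the example call are made up.

      import numpy as np

      def twoing_improvement(p_left_node, p_right_node, p_left, p_right):
          """Form super-class C1 from the classes favoring the left child, C2 from the rest,
          then score the split with the two-class GINI improvement."""
          pl = np.asarray(p_left_node, dtype=float)
          pr = np.asarray(p_right_node, dtype=float)
          c1 = pl >= pr                                     # membership in super-class C1
          collapse = lambda p: np.array([p[c1].sum(), p[~c1].sum()])
          gini = lambda q: 1.0 - np.sum(q ** 2)
          p_parent = p_left * collapse(pl) + p_right * collapse(pr)
          return gini(p_parent) - p_left * gini(collapse(pl)) - p_right * gini(collapse(pr))

      # toy 4-class node: classes 0 and 3 favor the left child, classes 1 and 2 the right child
      print(twoing_improvement([0.5, 0.1, 0.1, 0.3], [0.1, 0.4, 0.4, 0.1], 0.5, 0.5))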
  • 49. Example
    • GYM cluster example, using 10-level target
    • TWOING splitting rule separates five classes on the left from five classes on the right in the root node thus producing a very balanced split
  • 50. Best Possible Split
    • Suppose that we have a K-class multinomial problem, with pᵢ the within-node probabilities for the parent node and p_left the probability of the left direction
    • Further assume that we have the largest possible pool of binary splits available (say, using a categorical splitter with all unique values)
    • Question: what is the resulting best split under each of the splitting rules?
    • GINI: separate the major class j from the rest, where pⱼ = maxᵢ(pᵢ): the end-cut approach!
    • TWOING and ENTROPY: group classes together such that |p_left - 0.5| is minimized: the balanced approach!
  • 51. Example
    • GINI trees usually grow narrow and deep while TWOING trees usually grow shallow and wide
    [Diagram] Parent node: A = 40, B = 30, C = 20, D = 10. GINI best split: {A: 40} vs. {B: 30, C: 20, D: 10}. TWOING best split: {A: 40, D: 10} vs. {B: 30, C: 20}.
  • 52. Symmetric GINI
    • One possible way to introduce the cost matrix actively into the splitting rule: I(t) = Σ_{i≠j} C(i→j) pᵢ(t) pⱼ(t), where C(i→j) stands for the cost of class i being misclassified as class j
    • Note that the costs are automatically symmetrized thus rendering the rule less useful for nominal targets with asymmetric costs
    • The rule is most useful when handling ordinal outcome with costs proportional to the rank distance between the levels
    • Another “esoteric” use would be to cancel asymmetric costs during the tree growing stage for some reason while allowing asymmetric costs during pruning and model selection
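    A minimal sketch of the symmetric GINI impurity with an ordinal-style cost matrix (costs proportional to rank distance); the probabilities are illustrative.

      import numpy as np

      def symmetric_gini_impurity(p, cost):
          """I(t) = sum_{i != j} C(i,j) p_i(t) p_j(t); the cost matrix is symmetrized first."""
          p = np.asarray(p, dtype=float)
          C = np.asarray(cost, dtype=float)
          C = (C + C.T) / 2.0                              # automatic symmetrization
          np.fill_diagonal(C, 0.0)                         # no cost for correct classification
          return float(p @ C @ p)

      # ordinal 3-level target with costs proportional to the rank distance between levels
      cost = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
      print(symmetric_gini_impurity([0.2, 0.5, 0.3], cost))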
  • 53. Ordered TWOING
    • Assume ordered categorical target
    • Original twoing ignores the order of classes during the creation of super-classes (a strictly nominal approach)
    • Ordered twoing restricts super-class creation only to ordered sets of levels, the rest remains identical to twoing
    • The rule is most useful when the ordered separation of target classes is desirable
      • But recall that the splitting rule has no control over the pool of splits available, it is only used to evaluate and suggest the best possible split – hence, the “magic” of the ordered twoing splitting rule is somewhat limited
      • Using symmetric GINI with properly introduced costs usually provides more flexibility in this case
  • 54. Class Probability Rule
    • During the pruning stage, any pair of adjacent terminal nodes with identical class assignment is automatically eliminated
    • This is because by default CART builds a classifier and keeping such a pair of nodes would contribute nothing to the model classification accuracy
    • In reality, we might be interested in seeing such splits
      • For example, having “very likely” responder node next to “likely” responder node could be informative
    • Class probability splitting rule does exactly that
      • However, it comes at a price: due to convoluted theoretical reasons, optimal class probability trees become over-pruned
      • Therefore, only experienced CART users should try this
  • 55. Example
  • 56. Favor Even Splits Control
    • It is possible to further refine splitting rule behavior by introducing “Favor Even Splits” parameter (a.k.a. Power)
    • This control is most sensitive with the twoing splitting rule and was actually “inspired” by it
    • We recommend that the user always considers at least the following minimal set of modeling runs
      • GINI with Power=0
      • ENTROPY with Power=0
      • TWOING with Power=1
    • BATTERY RULES can be used to accomplish this automatically
  • 57. Fundamental Tree Building Mechanisms
    • Splitting rules have somewhat limited impact on the overall tree structure, especially on binary targets
    • We are about to introduce a powerful set of PRIORS and COSTS controls that have profound implications for all stages of tree development and evaluation
    • These controls lie at the core of successful model building techniques
  • 58. Key Players
    • Using historical data, build a model for the appropriately chosen sets of priors and costs
    [Diagram: the analyst supplies priors πᵢ and costs C(i→j); the DM engine uses them together with the dataset, drawn from the population, to build the model]
  • 59. Fraud Example
    • There are 1909 observations with 42 outcomes of interest
    • Having such extreme class distribution poses a lot of modeling challenges
    • The effect of key control settings is usually magnified on such data
    • We will use this dataset to illustrate the workings of priors and costs below
  • 60. Fraud Example – PRIORS EQUAL (0.5, 0.5)
    • High accuracy at 93%
    • Richest node has 37%
    • Small relative error 0.145
    • Low class assignment threshold at 2.2%
  • 61. Fraud Example – PRIORS SPECIFY (0.9, 0.1)
    • Reasonable accuracy at 64%
    • Richest node has 57%
    • Larger relative error 0.516
    • Class assignment threshold around 15%
  • 62. Fraud Example – PRIORS DATA (0.98, 0.02)
    • Poor accuracy at 26%
    • Richest node has 91%
    • High relative error 0.857
    • Class assignment threshold at 50%
  • 63. Internal Class Assignment Rule
    • Recall the definition of the within-node probability pᵢ(t) = (nᵢ / Nᵢ) πᵢ / p(t)
    • The internal class assignment rule is always the same: assign the node to the class with the highest pᵢ(t)
    • For PRIORS DATA this simplifies to the simple node majority rule: pᵢ(t) ∝ nᵢ
    • For PRIORS EQUAL this simplifies to class assignment with the highest lift with respect to the root node presence: pᵢ(t) ∝ nᵢ / Nᵢ
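    A sketch of the internal class assignment rule; the node counts below are hypothetical but shaped like the fraud example (1867 vs. 42 cases overall).

      import numpy as np

      def assign_class(n, N, priors):
          """Assign the node to the class with the highest within-node probability
          p_i(t) = (n_i / N_i) * pi_i / p(t); the denominator does not affect the argmax."""
          n, N, priors = map(np.asarray, (n, N, priors))
          score = (n / N) * priors
          return int(np.argmax(score))

      n, N = [30, 12], [1867, 42]                          # hypothetical node counts / root counts
      print(assign_class(n, N, [0.5, 0.5]))                # PRIORS EQUAL: highest lift wins -> class 1
      print(assign_class(n, N, [1867/1909, 42/1909]))      # PRIORS DATA: reduces to the majority rule -> class 0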
  • 64. Putting It All Together
    • As the prior on a given class increases:
    • The class assignment threshold decreases
    • Node richness goes down
    • But class accuracy goes up
    • PRIORS EQUAL uses the root node class ratio as the class assignment threshold; hence, the most favorable conditions to build a tree
    • PRIORS DATA uses the majority rule as the class assignment threshold; hence, difficult modeling conditions on unbalanced classes
  • 65. The Impact of Priors
    • We have a mixture of two overlapping classes
    • The vertical lines show root node splits for different sets of priors (the left child is classified as red, the right child is classified as blue)
    • Varying priors provides effective control over the tradeoff between class purity and class accuracy
    π_r = 0.1, π_b = 0.9;   π_r = 0.9, π_b = 0.1;   π_r = 0.5, π_b = 0.5
  • 66. Model Evaluation
    • Knowing model Prediction Success (PS) table on a given dataset, calculate the Expected Cost for the given cost matrix assuming the prior distribution of classes in the population
    [Diagram: the analyst supplies priors πᵢ, costs C(i→j), and the model's prediction success table cells n(i→j); the evaluator combines them with the dataset, drawn from the population, to produce the expected cost]
  • 67. Estimating Expected Cost
    • The following estimator is used: R = Σᵢ Σⱼ C(i→j) (n(i→j) / Nᵢ) πᵢ
    • For PRIORS DATA with unit costs: πᵢ = Nᵢ / N, and R = [ Σ_{i≠j} n(i→j) ] / N, the overall misclassification rate
    • For PRIORS EQUAL with unit costs: πᵢ = 1 / K, and R = { Σᵢ [ Σ_{j≠i} n(i→j) ] / Nᵢ } / K, the average within-class misclassification rate
    • The relative cost is computed by dividing by the expected cost of the trivial classifier (all cases are assigned to the dominant cost-adjusted class)
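    A sketch of the estimator; the confusion counts below are the ones quoted on the PRIORS EQUAL slide that follows (138 class-1 and 3 class-2 cases misclassified in the fraud example), and the second call simply re-weights the same errors by DATA priors for comparison.

      import numpy as np

      def expected_cost(confusion, N, priors, cost):
          """R = sum_i sum_j C(i->j) * (n(i->j) / N_i) * pi_i,
          where confusion[i][j] is the number of class-i cases classified as class j."""
          n = np.asarray(confusion, dtype=float)
          N = np.asarray(N, dtype=float)
          priors = np.asarray(priors, dtype=float)
          C = np.asarray(cost, dtype=float)
          return float(np.sum(C * (n / N[:, None]) * priors[:, None]))

      unit = [[0, 1], [1, 0]]                              # unit cost matrix
      confusion = [[1867 - 138, 138], [3, 42 - 3]]         # errors of the PRIORS EQUAL tree
      N = [1867, 42]
      print(expected_cost(confusion, N, [0.5, 0.5], unit))           # ~0.073, avg within-class error
      print(expected_cost(confusion, N, [1867/1909, 42/1909], unit)) # same errors under DATA priors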
  • 68. PRIORS DATA – Relative Cost
    • Expected cost R = (31 + 5)/1909 = 36/1909 = 1.9%
    • Trivial model R₀ = 42/1909 = 2.2%
      • This is already a very small error value
    • Relative cost Rel = R / R₀ = 36/42 = 0.857
      • The relative cost will be less than 1.0 only if the total number of cases misclassified is below 42, a very difficult requirement
  • 69. PRIORS EQUAL – Relative Cost
    • Expected cost R = (138/1867 + 3/42) / 2 = 7.3%
    • Trivial model R₀ = (0/1867 + 42/42) / 2 = 50%
      • A very large error value, independent of the initial class distribution
    • Relative cost Rel = R / R₀ = (138/1867 + 3/42) = 0.145
      • The relative cost will be less than 1.0 for a wide variety of modeling outcomes
      • The minimum usually tends to equalize individual class accuracies
  • 70. Using Equivalent Costs
    • Setting the ratio of costs to mimic the ratio of DATA priors reproduces the PRIORS EQUAL run, as seen both from the tree structure and the relative error value
    • On binary problems both mechanisms produce an identical range of models
  • 71. Automated CART Runs
    • Starting with CART 6.0 we have added a new feature called batteries
    • A battery is a collection of runs obtained via a predefined procedure
    • A trivial set of batteries is varying one of the key CART parameters:
      • RULES – run every major splitting rule
      • ATOM – vary minimum parent node size
      • MINCHILD – vary minimum node size
      • DEPTH – vary tree depth
      • NODES – vary maximum number of nodes allowed
      • FLIP – flipping the learn and test samples (two runs only)
  • 72. More Batteries
    • DRAW – each run uses a randomly drawn learn sample from the original data
    • CV – varying number of folds used in cross-validation
    • CVR – repeated cross-validation each time with a different random number seed
    • KEEP – select a given number of predictors at random from the initial larger set
      • Supports a core set of predictors that must always be present
    • ONEOFF – one predictor model for each predictor in the given list
    • The remaining batteries are more advanced and will be discussed individually
  • 73. Battery MCT
    • The primary purpose is to test possible model overfitting by running Monte Carlo simulation
      • Randomly permute the target variable
      • Build a CART tree
      • Repeat a number of times
      • The resulting family of profiles measures intrinsic overfitting due to the learning process
      • Compare the actual run profile to the family of random profiles to check the significance of the results
    • This battery is especially useful when the relative error of the original run exceeds 90%
  • 74. Battery MVI
    • Designed to check the impact of the missing values indicators on the model performance
      • Build a standard model, missing values are handled via the mechanism of surrogates
      • Build a model using MVIs only – measures the amount of predictive information contained in the missing value indicators alone
      • Combine the two
      • Same as above with missing penalties turned on
  • 75. Battery PRIOR
    • Vary the priors for the specified class within a given range
      • For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02
    • This will result in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion)
    • Can be used for hot spot detection
    • Can also be used to construct a very efficient voting ensemble of trees
  • 76. Battery SHAVING
    • Automates variable selection process
    • Shaving from the top – at each step the most important variable is eliminated
      • Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)
    • Shaving from the bottom – at each step the least important variable is eliminated
      • May drastically reduce the number of variables used by the model
    • SHAVING ERROR – at each iteration all current variables are tried for elimination one at a time to determine which predictor contributes the least
      • May result in a very long sequence of runs (quadratic complexity)
  • 77. Battery SAMPLE
    • Build a series of models in which the learn sample is systematically reduced
    • Used to determine the impact of the learn sample size on the model performance
    • Can be combined with BATTERY DRAW to get additional spread
  • 78. Battery TARGET
    • Attempt to build a model for each variable in the given list as a target using the remaining variables as predictors
    • Very effective in capturing inter-variable dependency patterns
    • Can be used as an input to variable clustering and multi-dimensional scaling approaches
    • When run on MVIs only, can reveal missing value patterns in the data
  • 79. Recommended Reading
    • Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
    • Breiman, L. (1996), Bagging Predictors, Machine Learning, 24, 123-140
    • Breiman, L. (1996), Stacked Regressions, Machine Learning, 24, 49-64
    • Friedman, J. H., Hastie, T. and Tibshirani, R. (Aug. 1998), "Additive Logistic Regression: a Statistical View of Boosting"