Advanced cart 2007
Transcript

  • 1. Data Mining with Decision Trees: An Introduction to CART ® Dan Steinberg Mikhail Golovnya [email_address] http://www.salford-systems.com
  • 2. In The Beginning…
    • In 1984 Berkeley and Stanford statisticians announced a remarkable new classification tool which:
      • Could separate relevant from irrelevant predictors
      • Did not require any kind of variable transforms (logs, square roots)
      • Was impervious to outliers and missing values
      • Could yield relatively simple and easy to comprehend models
      • Required little to no supervision by the analyst
      • Was frequently more accurate than traditional logistic regression or discriminant analysis, and other parametric tools
  • 3. The Years of Struggle
    • CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons
      • Monograph introducing CART is challenging to read
        • a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
      • Method was not expounded in any textbooks
      • Originally taught only in advanced graduate statistics classes at a handful of leading universities
      • Original standalone software came with slender documentation, and output was not self-explanatory
      • Method was such a radical departure from conventional statistics
  • 4. The Final Triumph
    • Rising interest in data mining
      • Availability of huge data sets requiring analysis
      • Need to automate or accelerate and improve the analysis process
      • Comparative performance studies
    • Advantages of CART over other tree methods
      • handling of missing values
      • assistance in interpretation of results (surrogates)
      • performance advantages: speed, accuracy
    • New software and documentation make techniques accessible to end users
    • Word of mouth generated by early adopters
    • Easy to use professional software available from Salford Systems
  • 5. So What is CART?
    • Best illustrated with a famous example: the UCSD Heart Disease study
    • Given the diagnosis of a heart attack based on
      • Chest pain, indicative EKGs, and elevation of enzymes typically released by damaged heart muscle
    • Predict who is at risk of a 2nd heart attack and early death within 30 days
      • Prediction will determine treatment program (intensive care or not)
    • For each patient about 100 variables were available, including:
      • Demographics, medical history, lab results
      • 19 noninvasive variables were used in the analysis
  • 6. Typical CART Solution
    • Example of a CLASSIFICATION tree
    • Dependent variable is categorical (SURVIVE, DIE)
    • Want to predict class membership
    Root: PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%). Split: Is BP <= 91?
      • BP <= 91: Terminal Node A: SURVIVE 6 (30.0%), DEAD 14 (70.0%); NODE = DEAD
      • BP > 91: PATIENTS = 195 (SURVIVE 172, 88.2%; DEAD 23, 11.8%). Split: Is AGE <= 62.5?
        • AGE <= 62.5: Terminal Node B: SURVIVE 102 (98.1%), DEAD 2 (1.9%); NODE = SURVIVE
        • AGE > 62.5: PATIENTS = 91 (SURVIVE 70, 76.9%; DEAD 21, 23.1%). Split: Is SINUS <= .5?
          • SINUS > .5: Terminal Node C: SURVIVE 14 (50.0%), DEAD 14 (50.0%); NODE = DEAD
          • SINUS <= .5: Terminal Node D: SURVIVE 56 (88.9%), DEAD 7 (11.1%); NODE = SURVIVE
  • 7. How to Read It
    • Entire tree represents a complete analysis or model
    • Has the form of a decision tree
    • Root of inverted tree contains all data
    • Root gives rise to child nodes
    • Child nodes can in turn give rise to their own children
    • At some point a given path ends in a terminal node
    • Terminal node classifies object
    • Path through the tree governed by the answers to QUESTIONS or RULES
  • 8. Decision Questions
    • Question is of the form: is statement TRUE?
    • Is continuous variable X ≤ c?
    • Does categorical variable D take on levels i, j, or k?
      • e.g. Is geographic region 1, 2, 4, or 7?
    • Standard split:
      • If answer to question is YES a case goes left; otherwise it goes right
      • This is the form of all primary splits
    • Question is formulated so that only two answers possible
      • Called binary partitioning
      • In CART the YES answer always goes left
  • 9. The Tree is a Classifier
    • Classification is determined by following a case’s path down the tree
    • Terminal nodes are associated with a single class
      • Any case arriving at a terminal node is assigned to that class
    • The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
      • Other algorithms often use only the majority rule or a limited set of fixed rules for node class assignment thus being less flexible
    • In standard classification, tree assignment is not probabilistic
      • Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
    • With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes
  • 10. The Importance of Binary Splits
    • Some researchers suggested using 3-way, 4-way, etc. splits at each node (univariate)
    • There are strong theoretical and practical reasons why we feel strongly against such approaches:
      • Exhaustive search for a binary split has linear algorithmic complexity; the alternatives have much higher complexity
      • Choosing the best binary split is “less ambiguous” than the suggested alternatives
      • Most importantly, binary splits result in the smallest rate of data fragmentation, thus allowing a larger set of predictors to enter the model
      • Any multi-way split can be replaced by a sequence of binary splits
    • There are numerous practical illustrations of the validity of the above claims
  • 11. Accuracy of a Tree
    • For classification trees
      • All objects reaching a terminal node are classified in the same way
        • e.g. All heart attack victims with BP>91 and AGE less than 62.5 are classified as SURVIVORS
      • Classification is same regardless of other medical history and lab results
    • The performance of any classifier can be summarized in terms of the Prediction Success Table (a.k.a. Confusion Matrix)
    • The key to comparing the performance of different models is how one “wraps” the entire prediction success matrix into a single number
      • Overall accuracy
      • Average within-class accuracy
      • Most generally – the Expected Cost
  • 12. Prediction Success Table
    • A large number of different terms exist to name various simple ratios based on this table:
      • Sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
    TRUE Class                 Classified "Survivors"   Classified "Early Deaths"   Total   % Correct
    Class 1 "Survivors"                  158                       20                178      88%
    Class 2 "Early Deaths"                 9                       28                 37      75%
    Total                                167                       48                215      86.51%
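A minimal sketch (an editor's illustration, not from the slides) of how the prediction success table above can be wrapped into the single numbers listed on the previous slide; the counts are the ones shown in the table.

```python
# Prediction success (confusion) matrix from the slide:
# keys are (true class, predicted class), values are case counts.
table = {
    ("Survivors", "Survivors"): 158, ("Survivors", "Early Deaths"): 20,
    ("Early Deaths", "Survivors"): 9, ("Early Deaths", "Early Deaths"): 28,
}

total = sum(table.values())                                            # 215
correct = table[("Survivors", "Survivors")] + table[("Early Deaths", "Early Deaths")]

overall_accuracy = correct / total                                     # ~0.865
per_class_accuracy = {
    c: table[(c, c)] / sum(v for (t, _), v in table.items() if t == c)
    for c in ("Survivors", "Early Deaths")
}                                                                      # ~0.888 and ~0.757
average_within_class = sum(per_class_accuracy.values()) / len(per_class_accuracy)

print(overall_accuracy, per_class_accuracy, average_within_class)
```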
  • 13. Tree Interpretation and Use
    • Practical decision tool
      • Trees like this are used for real world decision making
        • Rules for physicians
        • Decision tool for a nationwide team of salesmen needing to classify potential customers
    • Selection of the most important prognostic variables
      • Variable screen for parametric model building
      • Example data set had 100 variables
      • Use results to find variables to include in a logistic regression
    • Insight — in our example AGE is not relevant if BP is not high
      • Suggests which interactions will be important
  • 14. Binary Recursive Partitioning
    • Data is split into two partitions
      • thus “binary” partition
    • Partitions can also be split into sub-partitions
      • hence procedure is recursive
    • CART tree is generated by repeated dynamic partitioning of data set
      • Automatically determines which variables to use
      • Automatically determines which regions to focus on
      • The whole process is totally data driven
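A bare-bones sketch of binary recursive partitioning as just described (an illustration, not the actual CART implementation); `find_best_split` is a hypothetical helper that returns the best binary question for a node, or None when no split improves it.

```python
# Sketch of binary recursive partitioning over a list of dict-like rows.
def grow(node_data, find_best_split, min_node_size=10):
    """Recursively split `node_data` into a binary tree of nested dicts."""
    if len(node_data) < min_node_size:
        return {"type": "terminal", "data": node_data}
    split = find_best_split(node_data)   # hypothetical: (variable, threshold, improvement) or None
    if split is None:
        return {"type": "terminal", "data": node_data}
    var, threshold, _improvement = split
    left = [row for row in node_data if row[var] <= threshold]   # the YES answer goes left
    right = [row for row in node_data if row[var] > threshold]
    return {
        "type": "internal",
        "question": (var, threshold),
        "left": grow(left, find_best_split, min_node_size),
        "right": grow(right, find_best_split, min_node_size),
    }
```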
  • 15. Three Major Stages
    • Tree growing
      • Consider all possible splits of a node
      • Find the best split according to the given splitting rule (best improvement)
      • Continue sequentially until the largest tree is grown
    • Tree Pruning – creating a sequence of nested trees (pruning sequence) by systematic removal of the weakest branches
    • Optimal Tree Selection – using test sample or cross-validation to find the best tree in the pruning sequence
  • 16. General Workflow
    [Diagram: Historical Data is split into Learn, Test, and Validate samples; Stage 1 builds a sequence of nested trees on the Learn sample, Stage 2 monitors performance on the Test sample to select the best tree, Stage 3 confirms the findings on the Validate sample]
  • 17. Large Trees and Nearest Neighbor Classifier
    • Fear of large trees is unwarranted!
    • To reach a terminal node of a large tree many conditions must be met
      • Any new case meeting all these conditions will be very similar to the training case(s) in terminal node
      • Hence the new case should be similar on the target variable also
    • Key Result: Largest tree possible has a misclassification rate R asymptotically not greater than twice the Bayes rate R *
      • Bayes rate is the error rate of the most accurate possible classifier given the data
    • Thus useful to check largest tree error rate
      • For example, largest tree has relative error rate of .58 so theoretically best possible is .29
    • The exact analysis follows
  • 18. Derivation
    • Assume a binary outcome and let p(x) be the conditional probability of the focus class given x
    • Then the conditional Bayes risk at point x is r*(x) = min[ p(x), 1 - p(x) ]
    • Now suppose that the actual prediction at x is governed by the class membership of the nearest neighbor x_n; then the conditional NN risk is r(x) = p(x)[1 - p(x_n)] + [1 - p(x)] p(x_n) ≈ 2 p(x)[1 - p(x)]
    • Hence r(x) = 2 r*(x)[1 - r*(x)] on large samples
    • Or in the global sense: R* ≤ R ≤ 2 R*(1 - R*)
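A small numeric check (an editor's illustration) of the identity used in the last two bullets: with r*(x) = min[ p(x), 1 - p(x) ], the asymptotic nearest-neighbor risk 2 p(x)[1 - p(x)] equals 2 r*(x)[1 - r*(x)].

```python
# Verify 2 p (1 - p) == 2 r* (1 - r*) for a sample value of p(x).
p = 0.3                        # conditional probability of the focus class at x
r_star = min(p, 1 - p)         # conditional Bayes risk r*(x)
r_nn_limit = 2 * p * (1 - p)   # asymptotic nearest-neighbor risk (p(x_n) -> p(x))

assert abs(r_nn_limit - 2 * r_star * (1 - r_star)) < 1e-12
print(r_star, r_nn_limit)      # 0.3 and 0.42, i.e. r* <= r <= 2 r*(1 - r*)
```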
  • 19. Illustration
    • The error of the largest tree is asymptotically squeezed between the green and red curves
  • 20. Impact on the Relative Error
    • RE<x> stands for the asymptotically largest possible relative error of the largest tree assuming <x> initial proportion of the smallest of the target classes (smallest of the priors)
    • More unbalanced data might substantially inflate the relative errors of the largest trees
  • 21. Searching All Possible Splits
    • For any node CART will examine ALL possible splits
      • Computationally intensive but there are only a finite number of splits
    • Consider first the variable AGE, which in our data set has minimum value 40
      • The rule
        • Is AGE ≤ 40? will separate out these cases to the left — the youngest persons
    • Next increase the AGE threshold to the next youngest person
        • Is AGE ≤ 43?
        • This will direct six cases to the left
    • Continue increasing the splitting threshold value by value (a sketch of this enumeration follows below)
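A sketch of the threshold enumeration just described; the ages below are made up for illustration, and one candidate threshold is tried per unique value (the number of unique values minus one splits in total).

```python
# Enumerate candidate thresholds for a continuous splitter such as AGE.
ages = [40, 43, 43, 45, 51, 62, 62, 70]          # hypothetical values in the node

unique_values = sorted(set(ages))
for threshold in unique_values[:-1]:             # the largest value cannot split
    left = [a for a in ages if a <= threshold]   # "Is AGE <= threshold?" -> YES goes left
    right = [a for a in ages if a > threshold]
    print(f"AGE <= {threshold}: {len(left)} cases left, {len(right)} right")
```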
  • 22. Split Tables
    • Split tables are used to locate continuous splits
      • There will be as many splits as there are unique values minus one
    [Slide shows two split tables for the example data: one sorted by Age, one sorted by Blood Pressure]
  • 23. Categorical Predictors
    • Categorical predictors spawn many more splits
      • For example, only four levels A, B, C, D result in 7 splits
      • In general, there will be 2^(K-1) - 1 splits
    • CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15
         Left     Right
    1    A        B, C, D
    2    B        A, C, D
    3    C        A, B, D
    4    D        A, B, C
    5    A, B     C, D
    6    A, C     B, D
    7    A, D     B, C
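A sketch that enumerates the 2^(K-1) - 1 binary splits of a categorical predictor; for the four levels A, B, C, D it reproduces the seven splits in the table above.

```python
from itertools import combinations

# Fixing one level (here the first) on the left avoids counting mirror-image splits twice.
def categorical_splits(levels):
    first, rest = levels[0], levels[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(levels) - left
            if right:                          # skip the trivial "everything goes left" split
                yield left, right

splits = list(categorical_splits(["A", "B", "C", "D"]))
print(len(splits))                             # 7 == 2**(4 - 1) - 1
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```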
  • 24. Binary Target Shortcut
    • When the target is binary, a special algorithm reduces compute time to linear in the number of levels (K-1 splits instead of 2^(K-1) - 1)
      • 30 levels takes twice as long as 15, not 10,000+ times as long
    • When you have high-level (many-valued) categorical variables in a multi-class problem
      • Create a binary classification problem
      • Try different definitions of DPV (which groups to combine)
      • Explore predictor groupings produced by CART
      • From a study of all results decide which are most informative
      • Create a new grouped predictor variable for the multi-class problem
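A sketch of the binary-target shortcut mentioned at the top of this slide (the class counts are hypothetical): sort the levels of the categorical splitter by the proportion of the focus class and examine only the K-1 ordered partitions of the sorted list instead of all 2^(K-1) - 1 subsets.

```python
# Hypothetical counts per level: level -> (n_focus, n_other).
counts = {"A": (5, 95), "B": (40, 60), "C": (20, 80), "D": (70, 30)}

# Sort levels by the proportion of the focus class, then cut the sorted list.
levels = sorted(counts, key=lambda k: counts[k][0] / sum(counts[k]))
candidate_splits = [(levels[:i], levels[i:]) for i in range(1, len(levels))]

for left, right in candidate_splits:           # only K - 1 = 3 splits to score
    print(left, "|", right)
```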
  • 25. GYM Model Example
    • The original data comes from a financial institution and is disguised as a health club
    • Problem: need to understand a market research clustering scheme
    • Clusters were created externally using 18 variables and conventional clustering software
    • Need to find simple rules to describe cluster membership
    • CART tree provides a nice way to arrive at an intuitive story
  • 26. Variable Definitions
      • ID Identification # of member
      • CLUSTER Cluster assigned from clustering scheme (10 level categorical coded 1-10)
      • ANYRAQT Racquet ball usage (binary indicator coded 0, 1)
      • ONRCT Number of on-peak racquet ball uses
      • ANYPOOL Pool usage (binary indicator coded 0, 1)
      • ONPOOL Number of on-peak pool uses
      • PLRQTPCT Percent of pool and racquet ball usage
      • TPLRCT Total number of pool and racquet ball uses
      • ONAER Number of on-peak aerobics classes attended
      • OFFAER Number of off-peak aerobics classes attended
      • SAERDIF Difference between number of on- and off-peak aerobics visits
      • TANNING Number of visits to tanning salon
      • PERSTRN Personal trainer (binary indicator coded 0, 1)
      • CLASSES Number of classes taken
      • NSUPPS Number of supplements/vitamins/frozen dinners purchased
      • SMALLBUS Small business discount (binary indicator coded 0, 1)
      • OFFER Terms of offer
      • IPAKPRIC Index variable for package price
      • MONFEE Monthly fee paid
      • FIT Fitness score
      • NFAMMEN Number of family members
      • HOME Home ownership (binary indicator coded 0, 1)
    [Slide legend marks each variable above as target, predictor, or not used]
  • 27. Initial Runs
    • Dataset: gymsmall.syd
    • Set the target and predictors
    • Set 50% test sample
    • Navigate the largest, smallest, and optimal trees
    • Check [Splitters] and [Tree Details]
    • Observe the logic of pruning
    • Check [Summary Reports]
    • Understand competitors and surrogates
    • Understand variable importance
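The run above is driven through the CART GUI on the Salford Systems file gymsmall.syd. As a rough open-source analogue (an editor's sketch, not the workflow the slide describes), a similar model can be grown with scikit-learn's DecisionTreeClassifier; the file name gymsmall.csv and the assumption that the predictors are all numeric are hypothetical, and note that scikit-learn computes impurity-based importances without surrogates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("gymsmall.csv")               # assumed CSV export of the gym data
y = df["CLUSTER"]                              # target: externally assigned cluster
X = df.drop(columns=["ID", "CLUSTER"])         # ID is excluded from the predictors

# 50% test sample, as on the slide.
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001)   # cost-complexity pruning
tree.fit(X_learn, y_learn)

print(tree.score(X_test, y_test))                          # test-sample accuracy
print(export_text(tree, feature_names=list(X.columns)))    # splitters / tree details
print(dict(zip(X.columns, tree.feature_importances_)))     # variable importance (no surrogates)
```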
  • 28. Pruning Multiple Nodes
    • Note that Terminal Node 5 has class assignment 1
    • All the remaining nodes have class assignment 2
    • Pruning merges Terminal Nodes 5 and 6 thus creating a “chain reaction” of redundant split removal
  • 29. Pruning Weaker Split First
    • Eliminating nodes 3 and 4 results in “lesser” damage: losing one correct dominant class assignment
    • Eliminating nodes 1 and 2 or 5 and 6 would result in losing one correct minority class assignment
  • 30. Understanding Variable Importance
    • Both FIT and CLASSES are reported as equally important
    • Yet the tree only includes CLASSES
    • ONPOOL is one of the main splitters yet it has very low importance
    • Want to understand why
  • 31. Competitors and Surrogates
    • Node details for the root node split are shown above
    • Improvement measures class separation between two sides
    • Competitors are next best splits in the given node ordered by improvement values
    • Surrogates are splits most resembling (case by case) the main split
      • Association measures the degree of resemblance
      • Looking for competitors is free, finding surrogates is expensive
      • It is possible to have an empty list of surrogates
    • FIT is a perfect surrogate (and competitor) for CLASSES
  • 32. The Utility of Surrogates
    • Suppose that the main splitter is INCOME and comes from a survey form
    • People with very low or very high income are likely not to report it – informative missingness in both tails of the income distribution
    • The splitter itself wants to separate low income on the left from high income on the right – hence, it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such cases)
    • Using a surrogate will help to resolve ambiguity and redistribute missing income records between the two sides
    • For example, all low education subjects will join the low income side while all high education subjects will join the high income side
  • 33. Competitors and Surrogates Are Different
    • For symmetry considerations, both splits must result in identical improvements
    • Yet one split can’t be substituted for another
    • Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
    • The reverse does not hold – a perfect surrogate is always a perfect competitor
    [Figure: a parent node with classes A, B, C; Split X sends A left and B, C right; Split Y sends A, C left and B right]
    • Three nominal classes
    • Equally present
    • Equal priors, unit costs, etc.
  • 34. Calculating Association
    • Ten observations in a node
    • Main splitter X
    • Split Y mismatches on 2 observations
    • The default rule sends all cases to the dominant side mismatching 3 observations
    [Table on slide: left/right direction of each of the ten cases under Split X, Split Y, and the Default rule]
    • Association = (Default – Split Y)/Default = (3-2)/3 = 1/3
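A sketch of the association calculation above. The slide's case-by-case table did not survive extraction, so the ten left/right directions below are hypothetical, but they reproduce the same mismatch counts (3 for the default rule, 2 for Split Y) and hence the same association of 1/3.

```python
# Association of a candidate surrogate (Split Y) with the main splitter (Split X).
split_x = ["L", "L", "L", "L", "L", "L", "L", "R", "R", "R"]   # main splitter directions
split_y = ["L", "L", "L", "L", "L", "R", "L", "R", "L", "R"]   # candidate surrogate directions

def mismatches(candidate, main):
    return sum(1 for c, m in zip(candidate, main) if c != m)

# Default rule: send every case to the dominant side of the main splitter.
dominant = max(set(split_x), key=split_x.count)
default_mismatch = mismatches([dominant] * len(split_x), split_x)    # 3
surrogate_mismatch = mismatches(split_y, split_x)                    # 2

association = (default_mismatch - surrogate_mismatch) / default_mismatch
print(association)                                                   # 1/3
```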
  • 35. Notes on Association
    • Measure local to a node, non-parametric in nature
    • Association is negative for splits with very poor resemblance to the main splitter; it is not bounded below by -1
    • Any positive association is useful (better strategy than the default rule)
    • Hence define surrogates as splits with the highest positive association
    • In calculating case mismatch, check the possibility for split reversal (if the rule is met, send cases to the right)
      • CART marks straight splits as “s” and reverse splits as “r”
  • 36. Calculating Variable Importance
    • List all main splitters as well as all surrogates along with the resulting improvements
    • Aggregate the above list by variable accumulating improvements
    • Sort the result descending
    • Divide each entry by the largest value and scale to 100%
    • The resulting importance list is based on the given tree exclusively; if a different tree is grown, the importance list will most likely change
    • Variable importance reveals a deeper structure in stark contrast to a casual look at the main splitters
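A sketch of the aggregation steps just listed; the (variable, improvement) records below are hypothetical stand-ins for the main splitters and surrogates of a grown tree.

```python
from collections import defaultdict

# Each record is (variable, improvement) for a main splitter or a surrogate.
records = [
    ("CLASSES", 0.40),   # main splitter in the root
    ("FIT", 0.40),       # perfect surrogate in the root
    ("ONPOOL", 0.05),    # main splitter deeper in the tree
    ("FIT", 0.02),       # surrogate deeper in the tree
]

raw = defaultdict(float)
for var, improvement in records:
    raw[var] += improvement                    # accumulate improvements per variable

ranked = sorted(raw.items(), key=lambda kv: -kv[1])
top = ranked[0][1]
importance = {var: 100.0 * total / top for var, total in ranked}   # best variable scaled to 100%
print(importance)                              # FIT 100%, CLASSES ~95%, ONPOOL ~12%
```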
  • 37. Mystery Solved
    • In our example improvements in the root node dominate the whole tree
    • Consequently, the variable importance reflects the surrogates table in the root node
  • 38. Penalizing Individual Variables
    • We use variable penalty control to penalize CLASSES a little bit
    • This is enough to ensure FIT the main splitter position
    • The improvement associated with CLASSES is now downgraded by 1 minus the penalty factor
  • 39. Penalizing Missing and Categorical Predictors
    • Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage
    • Introducing a systematic penalty set to the proportion of missing values in the variable effectively removes this possible bias
    • The same holds for categorical variables with a large number of distinct levels
    • We recommend that the user always set both controls to 1
  • 40. Splitting Rules - Introduction
    • Splitting rule controls the tree growing stage
    • At each node, a finite pool of all possible splits exists
      • The pool is independent of the target variable; it is based solely on the observations (and corresponding values of predictors) in the node
    • The task of the splitting rule is to rank all splits in the pool according to their improvements
      • The winner will become the main splitter
      • The top runner-ups will become competitors
    • It is possible to construct different splitting rules emphasizing various approaches to the definition of the “best split”
  • 41. Priors Adjusted Probabilities
    • Assume K target classes
    • Introduce prior probabilities π_i, i = 1, …, K
    • Assume the number of LEARN observations within each class is N_i, i = 1, …, K
    • Consider node t containing n_i, i = 1, …, K LEARN records of each target class
    • Then the priors-adjusted probability that a case lands in node t (the node probability) is: p(t) = Σ_i (n_i / N_i) π_i
    • The conditional probability of class i given that a case reached node t is: p_i(t) = (n_i / N_i) π_i / p(t)
    • The conditional node probability of a class is called the within-node probability and is used as an input to a splitting rule
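A short sketch of the two probability formulas above, with hypothetical class counts for a two-class node.

```python
# Priors-adjusted node probability p(t) and within-node probabilities p_i(t).
priors = {1: 0.5, 2: 0.5}        # pi_i
N = {1: 1867, 2: 42}             # LEARN observations per class (hypothetical)
n = {1: 120, 2: 10}              # class counts inside node t (hypothetical)

p_t = sum((n[i] / N[i]) * priors[i] for i in priors)                 # node probability
p_within = {i: (n[i] / N[i]) * priors[i] / p_t for i in priors}      # within-node probabilities
print(p_t, p_within)
```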
  • 42. GINI Splitting Rule
    • Introduce the node t impurity as I(t) = 1 - Σ_i p_i²(t) = Σ_{i≠j} p_i(t) p_j(t)
      • Observe that a pure node has zero impurity
      • The impurity function is convex
      • The impurity function is maximized when all within node probabilities are equal
    • Consider any split and let p_left be the probability that a case goes left given that it is in the parent node, and p_right the probability that it goes right
      • p_left = p(t_left) / p(t_parent) and p_right = p(t_right) / p(t_parent)
    • Then the GINI split improvement is defined as:
      • Δ = I(parent) - p_left · I(left) - p_right · I(right)
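A small sketch of the GINI impurity and improvement formulas above, applied to hypothetical within-node probability vectors.

```python
# GINI impurity and split improvement.
def gini(p):                         # p: within-node probabilities, summing to 1
    return 1.0 - sum(pi * pi for pi in p)

def gini_improvement(p_parent, p_left, p_right, frac_left):
    """Delta = I(parent) - p_left * I(left) - p_right * I(right)."""
    frac_right = 1.0 - frac_left
    return gini(p_parent) - frac_left * gini(p_left) - frac_right * gini(p_right)

# Hypothetical split: parent 50/50, left child 90/10, right child 10/90.
print(gini_improvement([0.5, 0.5], [0.9, 0.1], [0.1, 0.9], frac_left=0.5))   # 0.32
```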
  • 43. Illustration
  • 44. GINI Properties
    • The impurity reduction is weighted according to the node size upon which it operates
    • GINI splitting rule sequentially minimizes overall tree impurity (defined as the sum over all terminal nodes of node impurities times the node probabilities)
    • The splitting rule favors end-cut splits (when one class is separated from the rest) – which follows from its mathematical expression
  • 45. Example
    • GYM cluster example, using 10-level target
    • GINI splitting rule separates classes 3 and 8 from the rest in the root node
  • 46. ENTROPY Splitting Rule
    • Similar in spirit to GINI
    • The only difference is in the definition of the impurity function
      • I(t) = - Σ_i p_i(t) log p_i(t)
      • The function is convex, zero for pure nodes, and maximized for equal mix of within node probabilities
    • Because of the different shape, a tree may evolve differently
  • 47. Example
    • GYM cluster example, using 10-level target
    • ENTROPY splitting rule separates six classes on the left from four classes on the right in the root node
  • 48. TWOING Splitting Rule
    • Group the K target classes into two super-classes in the parent node according to the rule: C1 = { i : p_i(t_left) ≥ p_i(t_right) },  C2 = { i : p_i(t_left) < p_i(t_right) }
    • Compute the GINI improvement assuming that we had a binary target defined as the super-classes above
    • This usually results in more balanced splits
    • Target classes of “similar” nature will tend to group together
      • All minivans are separated from the pickup trucks ignoring the fine differences within each group
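A sketch of the twoing computation as described above (form the two super-classes from the left/right within-node probabilities, then score the split with the binary GINI improvement); the probability values are hypothetical.

```python
def gini(p):
    return 1.0 - sum(pi * pi for pi in p)

def twoing_improvement(p_left, p_right, frac_left):
    classes = p_left.keys()
    c1 = {i for i in classes if p_left[i] >= p_right[i]}      # super-class C1
    # Collapse each side onto the binary super-class target.
    q_left = sum(p_left[i] for i in c1)
    q_right = sum(p_right[i] for i in c1)
    frac_right = 1.0 - frac_left
    q_parent = frac_left * q_left + frac_right * q_right
    return (gini([q_parent, 1 - q_parent])
            - frac_left * gini([q_left, 1 - q_left])
            - frac_right * gini([q_right, 1 - q_right]))

p_left = {"A": 0.6, "B": 0.3, "C": 0.1}     # hypothetical within-node probabilities
p_right = {"A": 0.1, "B": 0.3, "C": 0.6}
print(twoing_improvement(p_left, p_right, frac_left=0.5))
```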
  • 49. Example
    • GYM cluster example, using 10-level target
    • TWOING splitting rule separates five classes on the left from five classes on the right in the root node thus producing a very balanced split
  • 50. Best Possible Split
    • Suppose that we have a K-class multinomial problem, where p_i are the within-node probabilities for the parent node and p_left is the probability of the left direction
    • Further assume that we have the largest possible pool of binary splits available (say, using categorical splitter with all unique values)
    • Question: what is the resulting best split under each of the splitting rules?
    • GINI: separate the major class j from the rest, where p_j = max_i(p_i) – the end-cut approach!
    • TWOING and ENTROPY: group classes together such that | p_left - 0.5 | is minimized – the balanced approach!
  • 51. Example
    • GINI trees usually grow narrow and deep while TWOING trees usually grow shallow and wide
    [Figure: with class counts A: 40, B: 30, C: 20, D: 10, the best GINI split separates {A} (40 cases) from {B, C, D} (60 cases), while the best TWOING split separates {A, D} (50 cases) from {B, C} (50 cases)]
  • 52. Symmetric GINI
    • One possible way to introduce the cost matrix actively into the splitting rule: I(t) = Σ_{i≠j} C(i→j) p_i(t) p_j(t), where C(i→j) stands for the cost of class i being misclassified as class j
    • Note that the costs are automatically symmetrized thus rendering the rule less useful for nominal targets with asymmetric costs
    • The rule is most useful when handling ordinal outcome with costs proportional to the rank distance between the levels
    • Another “esoteric” use would be to cancel asymmetric costs during the tree growing stage for some reason while allowing asymmetric costs during pruning and model selection
  • 53. Ordered TWOING
    • Assume ordered categorical target
    • Original twoing ignores the order of classes during the creation of super-classes (a strictly nominal approach)
    • Ordered twoing restricts super-class creation to ordered sets of levels only; the rest remains identical to twoing
    • The rule is most useful when the ordered separation of target classes is desirable
      • But recall that the splitting rule has no control over the pool of splits available; it is only used to evaluate and suggest the best possible split – hence, the “magic” of the ordered twoing splitting rule is somewhat limited
      • Using symmetric GINI with properly introduced costs usually provides more flexibility in this case
  • 54. Class Probability Rule
    • During the pruning stage, any pair of adjacent terminal nodes with identical class assignment is automatically eliminated
    • This is because by default CART builds a classifier and keeping such a pair of nodes would contribute nothing to the model classification accuracy
    • In reality, we might be interested in seeing such splits
      • For example, having “very likely” responder node next to “likely” responder node could be informative
    • Class probability splitting rule does exactly that
      • However, it comes at a price: due to convoluted theoretical reasons, optimal class probability trees become over-pruned
      • Therefore, only experienced CART users should try this
  • 55. Example
  • 56. Favor Even Splits Control
    • It is possible to further refine splitting rule behavior by introducing “Favor Even Splits” parameter (a.k.a. Power)
    • This control is most sensitive with the twoing splitting rule and was actually “inspired” by it
    • We recommend that the user always considers at least the following minimal set of modeling runs
      • GINI with Power=0
      • ENTROPY with Power=0
      • TWOING with Power=1
    • BATTERY RULES can be used to accomplish this automatically
  • 57. Fundamental Tree Building Mechanisms
    • Splitting rules have somewhat limited impact on the overall tree structure, especially on binary targets
    • We are about to introduce a powerful set of PRIORS and COSTS controls that have profound implications for all stages of tree development and evaluation
    • These controls lie at the core of successful model building techniques
  • 58. Key Players
    • Using historical data, build a model for the appropriately chosen sets of priors and costs
    [Diagram: the Analyst feeds the Dataset, Priors π_i, and Costs C(i→j), drawn from the Population, into the DM Engine, which produces the Model]
  • 59. Fraud Example
    • There are 1909 observations with 42 outcomes of interest
    • Having such extreme class distribution poses a lot of modeling challenges
    • The effect of key control settings is usually magnified on such data
    • We will use this dataset to illustrate the workings of priors and costs below
  • 60. Fraud Example – PRIORS EQUAL (0.5, 0.5)
    • High accuracy at 93%
    • Richest node has 37%
    • Small relative error 0.145
    • Low class assignment threshold at 2.2%
  • 61. Fraud Example – PRIORS SPECIFY (0.9, 0.1)
    • Reasonable accuracy at 64%
    • Richest node has 57%
    • Larger relative error 0.516
    • Class assignment threshold around 15%
  • 62. Fraud Example – PRIORS DATA (0.98, 0.02)
    • Poor accuracy at 26%
    • Richest node has 91%
    • High relative error 0.857
    • Class assignment threshold at 50%
  • 63. Internal Class Assignment Rule
    • Recall the definition of the within-node probability: p_i(t) = (n_i / N_i) π_i / p(t)
    • Internal class assignment rule is always the same: assign the node to the class with the highest p_i(t)
    • For PRIORS DATA this simplifies to the simple node majority rule: p_i(t) ∝ n_i
    • For PRIORS EQUAL this simplifies to class assignment with the highest lift with respect to the root node presence: p_i(t) ∝ n_i / N_i
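A small sketch of the internal class assignment rule under different priors; the LEARN class sizes match the fraud example (1867 vs. 42) while the node counts are hypothetical.

```python
# Assign the node to the class with the highest priors-adjusted within-node probability.
N = {"good": 1867, "fraud": 42}    # LEARN class sizes
n = {"good": 60, "fraud": 3}       # hypothetical counts in a terminal node

def node_class(n, N, priors):
    score = {i: (n[i] / N[i]) * priors[i] for i in N}   # proportional to p_i(t)
    return max(score, key=score.get)

print(node_class(n, N, {"good": 0.5, "fraud": 0.5}))                  # PRIORS EQUAL -> 'fraud'
print(node_class(n, N, {"good": 1867 / 1909, "fraud": 42 / 1909}))    # PRIORS DATA  -> 'good'
```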
  • 64. Putting It All Together
    • As the prior on the given class increases, the class assignment threshold decreases
    • Node richness goes down
    • But class accuracy goes up
    • PRIORS EQUAL uses the root node class ratio as the class assignment threshold – hence, the most favorable conditions to build a tree
    • PRIORS DATA uses the majority rule as the class assignment threshold – hence, difficult modeling conditions on unbalanced classes
  • 65. The Impact of Priors
    • We have a mixture of two overlapping classes
    • The vertical lines show root node splits for different sets of priors (the left child is classified as red, the right child is classified as blue)
    • Varying priors provides effective control over the tradeoff between class purity and class accuracy
    [Figure panels labeled: π_r = 0.1, π_b = 0.9;  π_r = 0.9, π_b = 0.1;  π_r = 0.5, π_b = 0.5]
  • 66. Model Evaluation
    • Knowing the model’s Prediction Success (PS) table on a given dataset, calculate the Expected Cost for the given cost matrix, assuming the prior distribution of classes in the population
    [Diagram: the Analyst feeds the Dataset, Priors π_i, Costs C(i→j), and the Model’s PS table cells n(i→j) into the Evaluator, which produces the Expected Cost]
  • 67. Estimating Expected Cost
    • The following estimator is used: R = Σ_i Σ_j C(i→j) (n(i→j) / N_i) π_i
    • For PRIORS DATA with unit costs: π_i = N_i / N, so R = [ Σ_{i≠j} n(i→j) ] / N, the overall misclassification rate
    • For PRIORS EQUAL with unit costs: π_i = 1 / K, so R = { Σ_i ( Σ_{j≠i} n(i→j) ) / N_i } / K, the average within-class misclassification rate
    • The relative cost is computed by dividing by the expected cost of the trivial classifier (all cases are assigned to the dominant cost-adjusted class); a numeric sketch follows below
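A numeric sketch of the estimator above, using the PRIORS EQUAL fraud-example counts from the following slides (unit costs; 138 of 1867 and 3 of 42 cases misclassified).

```python
# R = sum_i sum_j C(i->j) * (n(i->j) / N_i) * pi_i
def expected_cost(n, N, priors, costs):
    return sum(costs[i][j] * (n[i][j] / N[i]) * priors[i]
               for i in n for j in n[i])

N = {"good": 1867, "fraud": 42}
n = {"good": {"good": 1729, "fraud": 138},     # true class -> counts by predicted class
     "fraud": {"good": 3, "fraud": 39}}
unit = {i: {j: (0 if i == j else 1) for j in N} for i in N}

R = expected_cost(n, N, {"good": 0.5, "fraud": 0.5}, unit)
R0 = 0.5                                       # trivial classifier under PRIORS EQUAL
print(R, R / R0)                               # ~0.073 and ~0.145
```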
  • 68. PRIORS DATA – Relative Cost
    • Expected cost R = (31+5)/1909 = 36/1909 = 1.9%
    • Trivial model R 0 = 42/1909 = 2.2%
      • This is already a very small error value
    • Relative cost Rel = R / R 0 = 36/42 = 0.857
      • Relative cost will be less than 1.0 only if the total number of cases misclassified is below 42 – a very difficult requirement
  • 69. PRIORS EQUAL – Relative Cost
    • Expected cost R = (138/1867+3/42) / 2 = 7.3%
    • Trivial model R 0 = (0/1867+42/42) / 2 = 50%
      • A very large error value independent of the initial class distribution
    • Relative cost Rel = R / R 0 = 138/1867 + 3/42 = 0.145
      • Relative cost will be less than 1.0 for a wide variety of modeling outcomes
      • The minimum usually tends to equalize individual class accuracies
  • 70. Using Equivalent Costs
    • Setting the ratio of costs to mimic the ratio of the DATA priors reproduces the PRIORS EQUAL run, as seen both from the tree structure and the relative error value
    • On binary problems both mechanisms produce identical range of models
  • 71. Automated CART Runs
    • Starting with CART 6.0 we have added a new feature called batteries
    • A battery is a collection of runs obtained via a predefined procedure
    • The simplest batteries vary one of the key CART parameters:
      • RULES – run every major splitting rule
      • ATOM – vary minimum parent node size
      • MINCHILD – vary minimum node size
      • DEPTH – vary tree depth
      • NODES – vary maximum number of nodes allowed
      • FLIP – flipping the learn and test samples (two runs only)
  • 72. More Batteries
    • DRAW – each run uses a randomly drawn learn sample from the original data
    • CV – varying number of folds used in cross-validation
    • CVR – repeated cross-validation each time with a different random number seed
    • KEEP – select a given number of predictors at random from the initial larger set
      • Supports a core set of predictors that must always be present
    • ONEOFF – one predictor model for each predictor in the given list
    • The remaining batteries are more advanced and will be discussed individually
  • 73. Battery MCT
    • The primary purpose is to test possible model overfitting by running Monte Carlo simulation
      • Randomly permute the target variable
      • Build a CART tree
      • Repeat a number of times
      • The resulting family of profiles measures intrinsic overfitting due to the learning process
      • Compare the actual run profile to the family of random profiles to check the significance of the results
    • This battery is especially useful when the relative error of the original run exceeds 90%
  • 74. Battery MVI
    • Designed to check the impact of the missing values indicators on the model performance
      • Build a standard model, missing values are handled via the mechanism of surrogates
      • Build a model using MVIs only – measures the amount of predictive information contained in the missing value indicators alone
      • Combine the two
      • Same as above with missing penalties turned on
  • 75. Battery PRIOR
    • Vary the priors for the specified class within a given range
      • For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02
    • This will result in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion)
    • Can be used for hot spot detection
    • Can also be used to construct a very efficient voting ensemble of trees
  • 76. Battery SHAVING
    • Automates variable selection process
    • Shaving from the top – each step the most important variable is eliminated
      • Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)
    • Shaving from the bottom – each step the least important variable is eliminated
      • May drastically reduce the number of variables used by the model
    • SHAVING ERROR – at each iteration all current variables are tried for elimination one at a time to determine which predictor contributes the least
      • May result in a very long sequence of runs (quadratic complexity)
  • 77. Battery SAMPLE
    • Build a series of models in which the learn sample is systematically reduced
    • Used to determine the impact of the learn sample size on the model performance
    • Can be combined with BATTERY DRAW to get additional spread
  • 78. Battery TARGET
    • Attempt to build a model for each variable in the given list as a target using the remaining variables as predictors
    • Very effective in capturing inter-variable dependency patterns
    • Can be used as an input to variable clustering and multi-dimensional scaling approaches
    • When run on MVIs only, can reveal missing value patterns in the data
  • 79. Recommended Reading
    • Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
    • Breiman, L. (1996), Bagging Predictors, Machine Learning, 24, 123-140
    • Breiman, L. (1996), Stacked Regressions, Machine Learning, 24, 49-64
    • Friedman, J. H., Hastie, T. and Tibshirani, R. (1998), “Additive Logistic Regression: a Statistical View of Boosting”
