Classification is determined by following a case’s path down the tree
Terminal nodes are associated with a single class
Any case arriving at a terminal node is assigned to that class
The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
Other algorithms often use only the majority rule, or a limited set of fixed rules, for node class assignment and are therefore less flexible
In standard classification, tree assignment is not probabilistic
Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes
A large number of different terms exists to name various simple ratios based on the table above:
Sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
                          Classified As
TRUE Class                Class 1        Class 2
                          "Survivors"    "Early Deaths"   Total   % Correct
Class 1 "Survivors"          158              20           178       88%
Class 2 "Early Deaths"         9              28            37       75%
Total                        167              48           215     86.51%
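The percentages in the table follow directly from the cell counts; a minimal sketch that recomputes them (counts taken from the table above):

```python
# Prediction success table: rows are true classes, columns are
# predicted classes (counts only; totals and percentages are derived).
table = [
    [158, 20],  # Class 1 "Survivors": 158 correct, 20 called "Early Deaths"
    [9,   28],  # Class 2 "Early Deaths": 9 called "Survivors", 28 correct
]

for i, row in enumerate(table):
    print(f"Class {i + 1}: {100 * row[i] / sum(row):.1f}% correct")

total = sum(sum(row) for row in table)
correct = sum(table[i][i] for i in range(len(table)))
print(f"Overall: {100 * correct / total:.2f}% correct")  # 86.51%
```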
Find the best split according to the given splitting rule (best improvement)
Continue sequentially until the largest tree is grown
Tree Pruning – creating a sequence of nested trees (pruning sequence) by systematic removal of the weakest branches
Optimal Tree Selection – using test sample or cross-validation to find the best tree in the pruning sequence
General Workflow: Stage 1 (Learn) builds a sequence of nested trees on the historical data; Stage 2 (Test) monitors performance to pick the best tree; Stage 3 (Validate) confirms the findings
Assume a binary outcome and let p(x) be the conditional probability of the focus class given x
Then the conditional Bayes risk at point x is r*(x) = min[ p(x), 1 - p(x) ]
Now suppose that the actual prediction at x is governed by the class membership of the nearest neighbor x_n; then the conditional NN risk is r(x) = p(x)[1 - p(x_n)] + [1 - p(x)] p(x_n) ≈ 2 p(x)[1 - p(x)], since p(x_n) → p(x) on large samples
Hence r(x) ≈ 2 r*(x)[1 - r*(x)] on large samples
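The asymptotic identity can be checked numerically; a small sketch (function names are illustrative):

```python
# Check that the NN risk with p(x_n) = p(x) equals 2 r*(x) (1 - r*(x)),
# where r*(x) = min[p(x), 1 - p(x)] is the conditional Bayes risk.
def bayes_risk(p):
    return min(p, 1.0 - p)

def nn_risk(p, p_n):
    # misclassification probability when predicting with the
    # nearest neighbor's class label
    return p * (1.0 - p_n) + (1.0 - p) * p_n

for p in (0.05, 0.3, 0.5, 0.8):
    r_star = bayes_risk(p)
    assert abs(nn_risk(p, p) - 2 * r_star * (1 - r_star)) < 1e-12
```

Note that 2 r*(1 - r*) never exceeds 2 r*, so asymptotically the nearest-neighbor rule is at most twice as bad as the Bayes rule.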
RE<x> stands for the asymptotically largest possible relative error of the largest tree, assuming an initial proportion <x> of the smallest of the target classes (the smallest of the priors)
More unbalanced data might substantially inflate the relative errors of the largest trees
Suppose that the main splitter is INCOME and comes from a survey form
People with very low or very high income are likely not to report it – informative missingness in both tails of the income distribution
The splitter itself wants to separate low income on the left from high income on the right – hence, it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such cases)
Using a surrogate will help to resolve ambiguity and redistribute missing income records between the two sides
For example, all low education subjects will join the low income side while all high education subjects will join the high income side
Introduce node t impurity as I(t) = 1 - Σ_i p_i²(t) = Σ_{i≠j} p_i(t) p_j(t)
Observe that a pure node has zero impurity
The impurity function is convex
The impurity function is maximized when all within node probabilities are equal
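The three properties above are easy to see in a minimal sketch of the Gini impurity, computed from class counts within a node:

```python
# Gini node impurity I(t) = 1 - sum_i p_i(t)^2,
# where p_i(t) are the within-node class probabilities.
def gini_impurity(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini_impurity([50, 0]))    # pure node: impurity 0.0
print(gini_impurity([25, 25]))   # equal classes: maximal, 0.5 for 2 classes
```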
Consider any split and let p left be the probability that a case goes left given that we are in the parent node, p right – probability that a case goes right
p_left = p(t_left) / p(t_parent) and p_right = p(t_right) / p(t_parent)
The impurity reduction is weighted according to the node size upon which it operates
GINI splitting rule sequentially minimizes overall tree impurity (defined as the sum over all terminal nodes of node impurities times the node probabilities)
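The weighted impurity reduction evaluated for each candidate split can be sketched as follows (function names are illustrative):

```python
# Impurity reduction of a split:
# delta I = I(parent) - p_left * I(left) - p_right * I(right)
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_improvement(left_counts, right_counts):
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    p_left = sum(left_counts) / sum(parent)
    return (gini(parent)
            - p_left * gini(left_counts)
            - (1 - p_left) * gini(right_counts))

# A split that cleanly separates the two classes gets the full improvement:
print(gini_improvement([40, 0], [0, 40]))  # 0.5
```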
The splitting rule favors end-cut splits (when one class is separated from the rest) – which follows from its mathematical expression
Group K target classes into two super-classes in the parent node according to the rule: C1 = { i : p_i(t_left) ≥ p_i(t_right) }, C2 = { i : p_i(t_left) < p_i(t_right) }
Compute the GINI improvement assuming that we had a binary target defined as the super-classes above
This usually results in more balanced splits
Target classes of “similar” nature will tend to group together
All minivans are separated from the pickup trucks ignoring the fine differences within each group
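The super-class grouping step can be sketched as follows (a minimal illustration; the split would then be scored as a two-class Gini improvement on the super-classes):

```python
# Twoing super-classes: class i joins C1 if its within-node probability
# is at least as large on the left as on the right, else C2.
def twoing_superclasses(left_counts, right_counts):
    n_left, n_right = sum(left_counts), sum(right_counts)
    c1 = [i for i in range(len(left_counts))
          if left_counts[i] / n_left >= right_counts[i] / n_right]
    c2 = [i for i in range(len(left_counts)) if i not in c1]
    return c1, c2

# Classes 0 and 1 concentrate on the left, classes 2 and 3 on the right:
print(twoing_superclasses([30, 25, 5, 0], [2, 3, 40, 35]))  # ([0, 1], [2, 3])
```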
One possible way to introduce the cost matrix actively into the splitting rule: I(t) = Σ_{i≠j} C_ij p_i(t) p_j(t), where C_ij stands for the cost of class i being misclassified as class j
Note that the costs are automatically symmetrized (since p_i(t) p_j(t) = p_j(t) p_i(t), only the sum C_ij + C_ji matters), thus rendering the rule less useful for nominal targets with asymmetric costs
The rule is most useful when handling ordinal outcome with costs proportional to the rank distance between the levels
Another “esoteric” use would be to cancel asymmetric costs during the tree growing stage for some reason while allowing asymmetric costs during pruning and model selection
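A minimal sketch of the cost-augmented impurity, shown on the ordinal case it suits best (costs proportional to rank distance):

```python
# Cost-augmented Gini impurity: I(t) = sum_{i != j} C[i][j] * p_i(t) * p_j(t).
# Only the symmetric part C[i][j] + C[j][i] affects the value,
# since p_i * p_j = p_j * p_i.
def cost_gini(counts, cost):
    n = sum(counts)
    p = [c / n for c in counts]
    return sum(cost[i][j] * p[i] * p[j]
               for i in range(len(p)) for j in range(len(p)) if i != j)

# Ordinal target with costs proportional to the rank distance between levels:
cost = [[abs(i - j) for j in range(3)] for i in range(3)]
print(cost_gini([10, 10, 10], cost))  # 8/9
```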
Original twoing ignores the order of classes during the creation of super-classes (a strictly nominal approach)
Ordered twoing restricts super-class creation only to ordered sets of levels, the rest remains identical to twoing
The rule is most useful when the ordered separation of target classes is desirable
But recall that the splitting rule has no control over the pool of splits available; it is only used to evaluate and suggest the best possible split – hence, the “magic” of the ordered twoing splitting rule is somewhat limited
Using symmetric GINI with properly introduced costs usually provides more flexibility in this case
Splitting rules have somewhat limited impact on the overall tree structure, especially on binary targets
We are about to introduce a powerful set of PRIORS and COSTS controls that will have profound implications for all stages of tree development and evaluation
These controls lie at the core of successful model building techniques
Knowing the model's Prediction Success (PS) table on a given dataset, calculate the Expected Cost for the given cost matrix, assuming the prior distribution of classes in the population
[Diagram: the analyst supplies the priors π_i and the costs C_ij; the model's PS table on the dataset supplies the cells n_ij; the evaluator combines them into the Expected Cost for the population]
The following estimator is used: R = Σ_i π_i Σ_j C_ij ( n_ij / N_i )
For PRIORS DATA, unit costs: π_i = N_i / N, so R = [ Σ_{i≠j} n_ij ] / N – the overall misclassification rate
For PRIORS EQUAL, unit costs: π_i = 1 / K, so R = Σ_i { ( Σ_{j≠i} n_ij ) / N_i } / K – the average within-class misclassification rate
The relative cost is computed by dividing by the expected cost of the trivial classifier (all classes are assigned to the dominant cost-adjusted class)
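The estimator and its two special cases can be sketched directly from a PS table (counts reused from the survival example above; function names are illustrative):

```python
# Expected cost R = sum_i pi_i * sum_j C[i][j] * n[i][j] / N_i,
# where rows of the PS table are true classes, N_i the row totals.
def expected_cost(table, priors, cost):
    return sum(priors[i] * sum(cost[i][j] * nij for j, nij in enumerate(row)) / sum(row)
               for i, row in enumerate(table))

table = [[158, 20], [9, 28]]   # PS table from the survival example
unit = [[0, 1], [1, 0]]        # unit misclassification costs

# PRIORS DATA -> overall misclassification rate (20 + 9) / 215:
n = sum(map(sum, table))
data_priors = [sum(row) / n for row in table]
print(expected_cost(table, data_priors, unit))

# PRIORS EQUAL -> average within-class misclassification rate:
print(expected_cost(table, [0.5, 0.5], unit))
```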
Setting the ratio of costs to mimic the ratio of DATA priors reproduces the PRIORS EQUAL run, as seen both from the tree structure and the relative error value
On binary problems both mechanisms produce identical range of models
Shaving from the top – each step the most important variable is eliminated
Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)
Shaving from the bottom – each step the least important variable is eliminated
May drastically reduce the number of variables used by the model
SHAVING ERROR – at each iteration all current variables are tried for elimination one at a time to determine which predictor contributes the least
May result in a very long sequence of runs (quadratic complexity)
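The bottom-up variant can be sketched as follows; `build_model` and `importance` are hypothetical stand-ins for the actual tree-building and variable-importance routines:

```python
# "Shaving from the bottom": at each step, rebuild the model and drop
# the variable the current model needs least, producing a sequence of
# progressively smaller models to compare on a test sample.
def shave_from_bottom(variables, build_model, importance):
    sequence = []
    vars_left = list(variables)
    while vars_left:
        model = build_model(vars_left)
        sequence.append((list(vars_left), model))
        # eliminate the least important variable of the current model
        weakest = min(vars_left, key=lambda v: importance(model, v))
        vars_left.remove(weakest)
    return sequence  # evaluate each entry on a test sample to pick the best
```

The SHAVING ERROR variant instead refits the model once per candidate variable at every step, which is what makes its run count quadratic in the number of predictors.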