Decision Tree
-- C4.5 and CART
Xueping Peng
Xueping.peng@uts.edu.au
ID3 (1/2)
 ID3 is a strong system that
 Uses hill-climbing search based on the information gain measure
to search through the space of decision trees
 Outputs a single hypothesis
 Never backtracks. It converges to locally optimal solutions
 Uses all training examples at each step, contrary to methods that
make decisions incrementally
 Uses statistical properties of all examples: the search is less
sensitive to errors in individual training examples
ID3 (2/2)
 However, ID3 has some drawbacks
 It can only deal with nominal attributes (i.e., it cannot handle continuous data)
 It may not be robust in the presence of noise (e.g., it tends to overfit)
 It is not able to deal with incomplete data sets (e.g., missing values)
How C4.5 enhances ID3 to
 Be robust in the presence of noise. Avoid overfitting
 Deal with continuous attributes
 Deal with missing data
 Convert trees to rules
What’s Overfitting?
 Overfitting: given a hypothesis space H, a hypothesis h ∈ H is
said to overfit the training data if there exists some
alternative hypothesis h’ ∈ H, such that
 h has smaller error than h’ over the training examples, but
 h’ has a smaller error than h over the entire distribution of
instances.
Why May my System Overfit?
 In domains with noise or uncertainty
 the system may try to decrease the training error by completely
fitting all the training examples
How to Avoid Overfitting?
 Pre-prune: Stop growing a branch when information
becomes unreliable
 Post-prune: Take a fully-grown decision tree and discard
unreliable parts
Post-pruning
 First, build the full tree
 Then, prune it
 Fully-grown tree shows all attribute interactions
 Two pruning operations:
 Subtree replacement
 Subtree raising
 Possible strategies:
 error estimation
 significance testing
 MDL principle
Subtree Replacement
 Bottom up approach
 Consider replacing a tree after considering all its subtrees
 Ex: labor negotiations
Subtree Replacement
 Algorithm:
 Split the data into training and validation sets
 Do until further pruning is harmful:
 Evaluate impact on the validation set of
pruning each possible node
 Select the node whose removal most
increases the validation set accuracy
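As a minimal illustration of this loop (not C4.5's actual code), the sketch below assumes a toy dict-based node format: leaves carry a "label", internal nodes carry an attribute test, children, and the majority class seen during training.

```python
# Sketch of reduced-error (validation-set) pruning on a toy tree representation.
# Leaves: {"label": c}. Internal nodes: {"attr": a, "children": {value: subtree}, "majority": c}.

def classify(node, x):
    while "children" in node:
        node = node["children"].get(x[node["attr"]], {"label": node["majority"]})
    return node["label"]

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if "children" in node:
        yield node
        for child in node["children"].values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, validation):
    """Repeatedly replace the subtree whose removal most improves validation accuracy."""
    while True:
        best, best_acc = None, accuracy(tree, validation)
        for node in list(internal_nodes(tree)):
            children = node.pop("children")            # trial prune: turn the node into a leaf
            node["label"] = node["majority"]
            acc = accuracy(tree, validation)
            if acc > best_acc:
                best, best_acc = node, acc
            node["children"] = children                # undo the trial prune
            del node["label"]
        if best is None:                               # no prune helps any more: stop
            return tree
        del best["children"]                           # commit the best prune
        best["label"] = best["majority"]
```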
Deal with continuous attributes
 When dealing with nominal data
 We evaluated the gain for each possible value
 In continuous data, we have infinite values
 What should we do?
 Continuous-valued attributes may take infinite values, but we
have a limited number of values in our instances (at most N if we
have N instances)
 Therefore, simulate that you have N nominal values
 Evaluate information gain for every possible split point of the attribute
 Choose the best split point
 The information gain of the attribute is the information gain of the best split
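A short sketch of this split-point search (Python; the data layout is made up and the helper names are not C4.5's):

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Evaluate the midpoint between every pair of adjacent distinct sorted values
    and return the threshold with the lowest expected information (highest gain)."""
    pairs = sorted(zip(values, labels))
    best_t, best_info = None, float("inf")
    for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
        if x1 == x2:
            continue                                   # no threshold between equal values
        t = (x1 + x2) / 2                              # split point halfway between values
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if expected < best_info:
            best_t, best_info = t, expected
    return best_t, best_info                           # gain = info(labels) - best_info
```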
Deal with continuous attributes
 Example
Deal with continuous attributes
 Split on temperature attribute:
 E.g.: temperature < 71.5: yes/4, no/2
temperature ≥ 71.5: yes/5, no/3
 Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
 Place split points halfway between values
 Can evaluate all split points in one pass!
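The 0.939-bit figure can be checked directly from the class counts above (a quick check, not part of the original slide):

```python
import math

def info(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts)

# temperature < 71.5 -> 4 yes / 2 no; temperature >= 71.5 -> 5 yes / 3 no
print(round(6/14 * info([4, 2]) + 8/14 * info([5, 3]), 3))   # 0.939
```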
Deal with continuous attributes
 To speed up
 Entropy only needs to be evaluated between points of different
classes
 Breakpoints between values of the same class cannot be optimal; only
boundaries between different classes are potential optimal breakpoints
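A hedged variant of the earlier best_split sketch that implements this shortcut: candidate thresholds are kept only where the neighbouring values could belong to different classes.

```python
from itertools import groupby

def boundary_split_points(values, labels):
    """Midpoints between adjacent distinct values, skipping boundaries that sit
    inside a run of a single class (those cannot be optimal split points)."""
    pairs = sorted(zip(values, labels))
    groups = [(v, {y for _, y in grp}) for v, grp in groupby(pairs, key=lambda p: p[0])]
    points = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        same_pure_class = len(c1) == 1 and c1 == c2
        if not same_pure_class:
            points.append((v1 + v2) / 2)
    return points
```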
Deal with Missing Data
 Missing values are denoted “?” in C4.X
 Simple idea: treat “missing” as a separate attribute value
 More elaborate idea: split instances with missing values into pieces
 A piece going down a branch receives a weight proportional to the
popularity of the branch
 weights sum to 1
 Info gain works with fractional instances
 Use sums of weights instead of counts
 During classification, split the instance into pieces in the same way
 Merge the resulting probability distributions using the weights
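A rough sketch of the fractional-instance bookkeeping (Python; the (instance, weight) layout is an assumption for illustration, not C4.5's data structures). Information gain is then computed with sums of these weights in place of instance counts.

```python
from collections import defaultdict

def split_with_missing(instances, attr):
    """instances: list of (feature_dict, weight). Instances whose value for attr is
    '?' are sent down every branch with a weight proportional to branch popularity."""
    known = [(x, w) for x, w in instances if x[attr] != "?"]
    missing = [(x, w) for x, w in instances if x[attr] == "?"]
    branches = defaultdict(list)                  # attribute value -> [(instance, weight)]
    for x, w in known:
        branches[x[attr]].append((x, w))
    total = sum(w for _, w in known)
    for value in list(branches):
        share = sum(w for _, w in branches[value]) / total    # popularity of this branch
        branches[value].extend((x, w * share) for x, w in missing)
    return branches
```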
What is CART?
 Classification And Regression Trees
 Developed by Breiman, Friedman, Olshen, and Stone in the early 1980s
 Introduced tree-based modeling into the statistical mainstream
 Rigorous approach involving cross-validation to select the optimal
tree
 One of many tree-based modeling techniques.
 CART -- the classic
 CHAID
 C5.0
 Software package variants (SAS, S-Plus, R…)
 Note: the “rpart” package in “R” is freely available
The Key Idea
Recursive Partitioning
 Take all of your data.
 Consider all possible values of all variables.
 Select the variable/value (X=t1) that produces the greatest
“separation” in the target.
 (X=t1) is called a “split”.
 If X < t1, send the data point to the “left”; otherwise, send it
to the “right”.
 Now repeat same process on these two “nodes”
 You get a “tree”
 Note: CART only uses binary splits.
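A compact sketch of recursive binary partitioning for a continuous target (Python; the nested-dict node format and the min_size stopping rule are illustrative choices, not CART's defaults):

```python
def sse(ys):
    """Sum of squared errors around the node mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def grow(rows, ys, min_size=5):
    """rows: list of numeric feature tuples; ys: target values. Returns a nested-dict tree."""
    if len(ys) <= min_size or len(set(ys)) == 1:
        return {"predict": sum(ys) / len(ys)}              # leaf: node mean (step function)
    best = None
    for j in range(len(rows[0])):                          # every variable ...
        for t in sorted({r[j] for r in rows}):             # ... and every observed value
            left = [i for i, r in enumerate(rows) if r[j] < t]
            right = [i for i, r in enumerate(rows) if r[j] >= t]
            if not left or not right:
                continue
            score = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return {"predict": sum(ys) / len(ys)}
    _, j, t, left, right = best
    return {"var": j, "threshold": t,
            "left": grow([rows[i] for i in left], [ys[i] for i in left], min_size),
            "right": grow([rows[i] for i in right], [ys[i] for i in right], min_size)}
```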
Splitting Rules
 Select the variable value (X=t1) that produces the
greatest “separation” in the target variable.
 “Separation” defined in many ways.
 Regression Trees (continuous target): use sum of
squared errors.
 Classification Trees (categorical target): choice of
entropy, Gini measure, “twoing” splitting rule.
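For completeness, a minimal Gini impurity function to sit alongside the entropy and sum-of-squared-errors sketches above (the "twoing" rule is omitted here):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```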
Regression Trees
 Tree-based modeling for continuous target variable
 most intuitively appropriate method for loss ratio analysis
 Find split that produces greatest separation in
∑[y − E(y)]²
 i.e.: find nodes with minimal within variance
 and therefore greatest between variance
 like credibility theory
 Every record in a node is assigned the same ŷ (the node mean)
 the model is a step function
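To make the step-function point concrete, here is a prediction routine for the nested-dict tree grown in the earlier recursive-partitioning sketch: every record routed to a leaf receives that leaf's mean.

```python
def predict(tree, row):
    """Follow the splits of the earlier grow() sketch; leaves return the node mean."""
    while "predict" not in tree:
        tree = tree["left"] if row[tree["var"]] < tree["threshold"] else tree["right"]
    return tree["predict"]
```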
Classification Trees
 Tree-based modeling for discrete target variable
 In contrast with regression trees, various measures of purity
are used
 Common measures of purity:
 Gini, entropy, “twoing”
 Intuition: an ideal retention model would produce nodes that
contain either defectors only or non-defectors only
 completely pure nodes
Classification Trees vs. Regression Trees
 Classification trees
 Splitting criteria: Gini, entropy, twoing
 Goodness-of-fit measure: misclassification rates
 Prior probabilities and misclassification costs available as model “tuning parameters”
 Regression trees
 Splitting criterion: sum of squared errors
 Goodness of fit: the same measure, sum of squared errors
 No priors or misclassification costs… just let it run
Cost-Complexity Pruning
 Definition: Cost-Complexity Criterion
Rα = MC + αL
 MC = misclassification rate
 Relative to # misclassifications in root node.
 L = # leaves (terminal nodes)
 You get a credit for lower MC.
 But you also get a penalty for more leaves.
 Let T0 be the biggest tree.
 Find the sub-tree Tα of T0 that minimizes Rα.
 Optimal trade-off of accuracy and complexity.
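As a usage illustration (not from the original slides), scikit-learn implements the same criterion through cost_complexity_pruning_path and the ccp_alpha parameter; the sketch below picks alpha with a held-out split rather than CART's cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree and list the alpha values at which subtrees are pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit at each alpha and keep the tree with the best held-out accuracy.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(best.get_n_leaves(), round(best.score(X_test, y_test), 3))
```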
References
 Introduction to Machine Learning, Lecture 6, http://www.slideshare.net/aorriols/lecture6-c45
 CART from A to B, https://www.casact.org/education/specsem/f2005/handouts/cart.ppt
