Decision Tree
-- C4.5 and CART
Xueping Peng
Xueping.peng@uts.edu.au
ID3 (1/2)
 ID3 is a strong system that
 Uses hill-climbing search based on the information gain measure
to search through the space of decision trees
 Outputs a single hypothesis
 Never backtracks. It converges to locally optimal solutions
 Uses all training examples at each step, contrary to methods that
make decisions incrementally
 Uses statistical properties of all examples: the search is less
sensitive to errors in individual training examples
ID3 (2/2)
 However, ID3 has some drawbacks
 It can only deal with nominal attributes (i.e., it cannot handle continuous data)
 It may not be robust in the presence of noise (e.g., it tends to overfit)
 It is not able to deal with incomplete data sets (e.g., missing values)
How C4.5 enhances ID3 to
 Be robust in the presence of noise. Avoid overfitting
 Deal with continuous attributes
 Deal with missing data
 Convert trees to rules
What’s Overfitting?
 Overfitting: given a hypothesis space H, a hypothesis h ∈ H is
said to overfit the training data if there exists some
alternative hypothesis h’ ∈ H, such that
 h has smaller error than h’ over the training examples, but
 h’ has a smaller error than h over the entire distribution of
instances.
Why May my System Overfit?
 In domains with noise or uncertainty
 the system may try to decrease the training error by completely
fitting all the training examples
How to Avoid Overfitting?
 Pre-prune: Stop growing a branch when information
becomes unreliable
 Post-prune: Take a fully-grown decision tree and discard
unreliable parts
Post-pruning
 First, build the full tree
 Then, prune it
 Fully-grown tree shows all attribute interactions
 Two pruning operations:
 Subtree replacement
 Subtree raising
 Possible strategies:
 error estimation
 significance testing
 MDL principle
Subtree Replacement
 Bottom up approach
 Consider replacing a tree after considering all its subtrees
 Ex: labor negotiations
Subtree Replacement
 Algorithm:
 Split the data into training and validation sets
 Do until further pruning is harmful:
 Evaluate impact on the validation set of
pruning each possible node
 Select the node whose removal most
increases the validation set accuracy
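As a minimal illustration of this loop (not C4.5's actual code), the sketch below assumes a toy dict-based node format: leaves carry a "label", internal nodes carry an attribute test, children, and the majority class seen during training.

```python
# Sketch of reduced-error (validation-set) pruning on a toy tree representation.
# Leaves: {"label": c}. Internal nodes: {"attr": a, "children": {value: subtree}, "majority": c}.

def classify(node, x):
    while "children" in node:
        node = node["children"].get(x[node["attr"]], {"label": node["majority"]})
    return node["label"]

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if "children" in node:
        yield node
        for child in node["children"].values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, validation):
    """Repeatedly replace the subtree whose removal most improves validation accuracy."""
    while True:
        best, best_acc = None, accuracy(tree, validation)
        for node in list(internal_nodes(tree)):
            children = node.pop("children")            # trial prune: turn the node into a leaf
            node["label"] = node["majority"]
            acc = accuracy(tree, validation)
            if acc > best_acc:
                best, best_acc = node, acc
            node["children"] = children                # undo the trial prune
            del node["label"]
        if best is None:                               # no prune helps any more: stop
            return tree
        del best["children"]                           # commit the best prune
        best["label"] = best["majority"]
```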
Deal with continuous attributes
 When dealing with nominal data
 We evaluated the gain for each possible value
 In continuous data, we have infinite values
 What should we do?
 Continuous-valued attributes may take infinite values, but we
have a limited number of values in our instances (at most N if we
have N instances)
 Therefore, simulate that you have N nominal values
 Evaluate information gain for every possible split point of the attribute
 Choose the best split point
 The information gain of the attribute is the information gain of the best split
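A short sketch of this split-point search (Python; the data layout is made up and the helper names are not C4.5's):

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Evaluate the midpoint between every pair of adjacent distinct sorted values
    and return the threshold with the lowest expected information (highest gain)."""
    pairs = sorted(zip(values, labels))
    best_t, best_info = None, float("inf")
    for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
        if x1 == x2:
            continue                                   # no threshold between equal values
        t = (x1 + x2) / 2                              # split point halfway between values
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if expected < best_info:
            best_t, best_info = t, expected
    return best_t, best_info                           # gain = info(labels) - best_info
```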
Deal with continuous attributes
 Example
Deal with continuous attributes
 Split on temperature attribute:
 E.g.: temperature < 71.5: yes/4, no/2
temperature ≥ 71.5: yes/5, no/3
 Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
 Place split points halfway between values
 Can evaluate all split points in one pass!
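The 0.939-bit figure can be checked directly from the class counts above (a quick check, not part of the original slide):

```python
import math

def info(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts)

# temperature < 71.5 -> 4 yes / 2 no; temperature >= 71.5 -> 5 yes / 3 no
print(round(6/14 * info([4, 2]) + 8/14 * info([5, 3]), 3))   # 0.939
```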
Deal with continuous attributes
 To speed up
 Entropy only needs to be evaluated between points of different
classes
 Breakpoints between values of the same class cannot be optimal; only
boundaries between different classes are potential optimal breakpoints
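A hedged variant of the earlier best_split sketch that implements this shortcut: candidate thresholds are kept only where the neighbouring values could belong to different classes.

```python
from itertools import groupby

def boundary_split_points(values, labels):
    """Midpoints between adjacent distinct values, skipping boundaries that sit
    inside a run of a single class (those cannot be optimal split points)."""
    pairs = sorted(zip(values, labels))
    groups = [(v, {y for _, y in grp}) for v, grp in groupby(pairs, key=lambda p: p[0])]
    points = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        same_pure_class = len(c1) == 1 and c1 == c2
        if not same_pure_class:
            points.append((v1 + v2) / 2)
    return points
```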
Deal with Missing Data
 Missing values are denoted “?” in C4.X
 Simple idea: treat “missing” as a separate attribute value
 More elaborate idea: split instances with missing values into pieces
 A piece going down a branch receives a weight proportional to the
popularity of the branch
 weights sum to 1
 Info gain works with fractional instances
 Use sums of weights instead of counts
 During classification, split the instance into pieces in the same way
 Merge the resulting probability distributions using the weights
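A rough sketch of the fractional-instance bookkeeping (Python; the (instance, weight) layout is an assumption for illustration, not C4.5's data structures). Information gain is then computed with sums of these weights in place of instance counts.

```python
from collections import defaultdict

def split_with_missing(instances, attr):
    """instances: list of (feature_dict, weight). Instances whose value for attr is
    '?' are sent down every branch with a weight proportional to branch popularity."""
    known = [(x, w) for x, w in instances if x[attr] != "?"]
    missing = [(x, w) for x, w in instances if x[attr] == "?"]
    branches = defaultdict(list)                  # attribute value -> [(instance, weight)]
    for x, w in known:
        branches[x[attr]].append((x, w))
    total = sum(w for _, w in known)
    for value in list(branches):
        share = sum(w for _, w in branches[value]) / total    # popularity of this branch
        branches[value].extend((x, w * share) for x, w in missing)
    return branches
```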
What is CART?
 Classification And Regression Trees
 Developed by Breiman, Friedman, Olshen, and Stone in the early 1980s
 Introduced tree-based modeling into the statistical mainstream
 Rigorous approach involving cross-validation to select the optimal
tree
 One of many tree-based modeling techniques.
 CART -- the classic
 CHAID
 C5.0
 Software package variants (SAS, S-Plus, R…)
 Note: the “rpart” package in “R” is freely available
The Key Idea
Recursive Partitioning
 Take all of your data.
 Consider all possible values of all variables.
 Select the variable/value (X=t1) that produces the greatest
“separation” in the target.
 (X=t1) is called a “split”.
 If X < t1, send the data point to the “left”; otherwise, send it
to the “right”.
 Now repeat same process on these two “nodes”
 You get a “tree”
 Note: CART only uses binary splits.
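A compact sketch of recursive binary partitioning for a continuous target (Python; the nested-dict node format and the min_size stopping rule are illustrative choices, not CART's defaults):

```python
def sse(ys):
    """Sum of squared errors around the node mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def grow(rows, ys, min_size=5):
    """rows: list of numeric feature tuples; ys: target values. Returns a nested-dict tree."""
    if len(ys) <= min_size or len(set(ys)) == 1:
        return {"predict": sum(ys) / len(ys)}              # leaf: node mean (step function)
    best = None
    for j in range(len(rows[0])):                          # every variable ...
        for t in sorted({r[j] for r in rows}):             # ... and every observed value
            left = [i for i, r in enumerate(rows) if r[j] < t]
            right = [i for i, r in enumerate(rows) if r[j] >= t]
            if not left or not right:
                continue
            score = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return {"predict": sum(ys) / len(ys)}
    _, j, t, left, right = best
    return {"var": j, "threshold": t,
            "left": grow([rows[i] for i in left], [ys[i] for i in left], min_size),
            "right": grow([rows[i] for i in right], [ys[i] for i in right], min_size)}
```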
Splitting Rules
 Select the variable value (X=t1) that produces the
greatest “separation” in the target variable.
 “Separation” defined in many ways.
 Regression Trees (continuous target): use sum of
squared errors.
 Classification Trees (categorical target): choice of
entropy, Gini measure, “twoing” splitting rule.
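For completeness, a minimal Gini impurity function to sit alongside the entropy and sum-of-squared-errors sketches above (the "twoing" rule is omitted here):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```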
Regression Trees
 Tree-based modeling for continuous target variable
 most intuitively appropriate method for loss ratio analysis
 Find split that produces greatest separation in
∑[y − E(y)]²
 i.e.: find nodes with minimal within variance
 and therefore greatest between variance
 like credibility theory
 Every record in a node is assigned the same ŷ (the node mean)
 the model is a step function
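To make the step-function point concrete, here is a prediction routine for the nested-dict tree grown in the earlier recursive-partitioning sketch: every record routed to a leaf receives that leaf's mean.

```python
def predict(tree, row):
    """Follow the splits of the earlier grow() sketch; leaves return the node mean."""
    while "predict" not in tree:
        tree = tree["left"] if row[tree["var"]] < tree["threshold"] else tree["right"]
    return tree["predict"]
```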
Classification Trees
 Tree-based modeling for discrete target variable
 In contrast with regression trees, various measures of purity
are used
 Common measures of purity:
 Gini, entropy, “twoing”
 Intuition: an ideal retention model would produce nodes that
contain either defectors only or non-defectors only
 completely pure nodes
Classification Trees vs. Regression Trees
 Classification trees
 Splitting criteria: Gini, entropy, twoing
 Goodness-of-fit measure: misclassification rates
 Prior probabilities and misclassification costs available as model “tuning parameters”
 Regression trees
 Splitting criterion: sum of squared errors
 Goodness of fit: the same measure, sum of squared errors
 No priors or misclassification costs… just let it run
Cost-Complexity Pruning
 Definition: Cost-Complexity Criterion
Rα = MC + αL
 MC = misclassification rate
 Relative to # misclassifications in root node.
 L = # leaves (terminal nodes)
 You get a credit for lower MC.
 But you also get a penalty for more leaves.
 Let T0 be the biggest tree.
 Find the sub-tree Tα of T0 that minimizes Rα.
 Optimal trade-off of accuracy and complexity.
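As a usage illustration (not from the original slides), scikit-learn implements the same criterion through cost_complexity_pruning_path and the ccp_alpha parameter; the sketch below picks alpha with a held-out split rather than CART's cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree and list the alpha values at which subtrees are pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit at each alpha and keep the tree with the best held-out accuracy.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(best.get_n_leaves(), round(best.score(X_test, y_test), 3))
```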
References
 Introduction to Machine Learning, Lecture 6, http://www.slideshare.net/aorriols/lecture6-c45
 CART from A to B, https://www.casact.org/education/specsem/f2005/handouts/cart.ppt
