Lecture6 - C4.5


Published on

Published in: Education
No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Lecture6 - C4.5

  1. 1. Introduction to Machine Learning Lecture 6 Albert Orriols i Puig aorriols@salle.url.edu i l @ ll ld Artificial Intelligence – Machine Learning Enginyeria i Arquitectura La Salle gy q Universitat Ramon Llull
  2. 2. Recap of Lecture 4 ID3 is a strong system that gy Uses hill-climbing search based on the information gain measure to sea c through the space o dec s o trees easu e o search oug e of decision ees Outputs a single hypothesis. Never b kt k It converges to locally optimal solutions. N backtracks. tl ll ti l l ti Uses all training examples at each step, contrary to methods that th t make decisions i k d ii incrementally. t ll Uses statistical properties of all examples: the search is less sensitive t errors i i di id l t i i examples. iti to in individual training l Can handle noisy data by modifying its termination criterion to accept hypotheses that imperfectly fit the data. th th th t i f tl th d t Slide 2 Artificial Intelligence Machine Learning
  3. 3. Recap of Lecture 4 However, ID3 has some drawbacks It I can only deal with nominal d l d l ih i l data It is not able to deal with noisy data sets It may be not robust in presence of noise Slide 3 Artificial Intelligence Machine Learning
  4. 4. Today’s Agenda Going from ID3 to C4.5 How C4.5 enhances C4.5 to Be robust in the presence of noise. Avoid overfitting Deal with continuous attributes Deal with missing data Convert trees to rules Slide 4 Artificial Intelligence Machine Learning
  5. 5. What’s Overfitting? Overfitting = Given a hypothesis space H, a hypothesis hєH is said to overfit the training data if there exists some alternative hypothesis h’єH, such that h has smaller error than h’ over the training examples, but h examples 1. 1 h’ has a smaller error than h over the entire distribution of instances. 2. Slide 5 Artificial Intelligence Machine Learning
  6. 6. Why May my System Overfit? In domains with noise or uncertainty y the system may try to decrease the training error by completely fitting a the training e a p es g all e a g examples The learner overfits to correctly classify the noisy instances Noisy instances Occam’s razor: Prefer the simplest hypothesis that fits the data with high accuracy Slide 6 Artificial Intelligence Machine Learning
  7. 7. How to Avoid Overfitting? Ok, my system may overfit… Can I avoid it? , yy y Sure! Do not include branches that fit data too specifically How? H? Pre-prune: Stop growing a branch when information becomes 1. unreliable li bl Post-prune: Take a fully-grown decision tree and discard 2. unreliable parts li bl Slide 7 Artificial Intelligence Machine Learning
  8. 8. Pre-pruning Based on statistical significance test g Stop growing the tree when there is no statistically significant assoc a o between any attribute and e class at particular association be ee a y a bu e a d the c ass a a pa cu a node Use all available da a for training a d app y the s a s ca test a a a ab e data o a g and apply e statistical es to estimate whether expanding/pruning a node is to produce an improvement beyond the training set Most popular test: chi-squared test ID3 used chi-squared test in addition to information gain Only statistically significant attributes were allowed to be selected by information gain procedure Slide 8 Artificial Intelligence Machine Learning
  9. 9. Pre-pruning Early stopping: Pre-pruning may stop the growth process prematurely Classic example: XOR/Parity-problem No individual attribute exhibits any significant association to the class Structure is only visible in fully expanded tree Pre-pruning won t Pre pruning won’t expand the root node But: XOR-type problems rare in practice And: pre-pruning faster than post-pruning x1 x2 Class 1 0 0 0 01 10 2 0 1 1 3 1 0 1 4 1 1 0 00 10 Slide 9 Artificial Intelligence Machine Learning
  10. 10. Post-pruning First, build the full tree , Then, prune it Fully-grown Fully grown tree shows all attribute interactions Problem: some subtrees might be due to chance effects Two pruning operations: Subtree replacement 1. Subtree raising 2. Possible strategies: error estimation significance t ti i ifi testing MDL principle Slide 10 Artificial Intelligence Machine Learning
  11. 11. Subtree Replacement Bottom up approach p pp Consider replacing a tree after considering all its subtrees Ex: labor negotiations Slide 11 Artificial Intelligence Machine Learning
  12. 12. Subtree Replacement Algorithm: 1. Split the data into training and validation set 2. Do until further pruning is harmful: a. Evaluate impact on the validation set of pruning each possible node b. Select th b S l t the node whose removal most i d h l t increases the validation set accuracy Slide 12 Artificial Intelligence Machine Learning
  13. 13. Subtree Raising Delete node Redistribute instances Slower than subtree replacement (Worthwhile?) X Slide 13 Artificial Intelligence Machine Learning
  14. 14. Estimating Error Rates Ok we can prune. But when? p Prune only if it reduces the estimated error Error on the training data is NOT a useful estimator Q: Why it would result in very little pruning? Use hold-out set for pruning hold out Training T ii Separate a validation set Data set’ Training g Use this validation set to test the improvement Data set Validation C4.5 s C4 5’s method set Derive confidence interval from training data Use a heuristic limit derived from this for pruning limit, this, Standard Bernoulli-process-based method Shaky statistical assumptions (based on training data) y p ( g ) Slide 14 Artificial Intelligence Machine Learning
  15. 15. Deal with continuous attributes When dealing with nominal data g We evaluated the grain for each possible value In I continuous data, we have infinite values. ti dt h i fi it l What should we do? Continuous-valued attributes may take infinite values, but we have a limited number of values in our instances (at most N if we have N instances) Therefore, simulate that you have N nominal values Evaluate information gain for every possible split point of the attribute Choose the best split point The information gain of the attribute is the information gain of the best split Slide 15 Artificial Intelligence Machine Learning
  16. 16. Deal with continuous attributes Example Outlook Temperature Humidity Windy Play Sunny y 85 85 False No Sunny 80 90 True No Overcast 83 86 False Yes Rainy 75 80 False Yes … … … … … Continuous attributes Slide 16 Artificial Intelligence Machine Learning
  17. 17. Deal with continuous attributes Split on temperature attribute: 64 65 68 69 70 71 72 72 75 75 80 81 83 85 Yes Y N No Y Yes Y Yes Yes Y No N No N Yes Y Y Yes Yes Y No N Y Yes Yes N Y No E.g.: temperature < 71.5: yes/ , no/2 g te pe atu e 5 yes/4, o/ temperature ≥ 71.5: yes/5, no/3 Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits Place split points halfway between values Can evaluate all split points in one pass! Slide 17 Artificial Intelligence Machine Learning
  18. 18. Deal with continuous attributes To speed up p p Entropy only needs to be evaluated between points of different c asses classes value 64 65 68 69 70 71 72 72 75 75 80 81 83 85 class Yes X No Yes Yes Yes No No Yes Yes Yes No Yes Yes No Potential optimal breakpoints Breakpoints between values of the same class cannot be optimal Slide 18 Artificial Intelligence Machine Learning
  19. 19. Deal with Missing Data Treat missing values as a separate value g p Missing value denoted “?” in C4.X Simple idea: treat missing as a separate value Q: When this is not appropriate? A: Wh A When values are missing d to diff l i i due different reasons Example 1: gene expression could be missing when it is very high or very low Example 2: field IsPregnant=missing for a male patient should be treated differently (no) than for a female patient of age 25 (unknown) Slide 19 Artificial Intelligence Machine Learning
  20. 20. Deal with Missing Data Split instances with missing values into pieces A piece going down a branch receives a weight proportional to the popularity of the branch weights sum to 1 Info gain works with fractional instances Use sums of weights instead of counts During classification, split the instance into pieces in the same way Merge probability distribution using weights Slide 20 Artificial Intelligence Machine Learning
  21. 21. From Trees to Rules I finally g a tree from domains with y got Noisy instances Missing l Mi i values Continuous attributes But I prefer rules… No context dependent Procedure Generate a rule for each tree Get context-independent rules Slide 21 Artificial Intelligence Machine Learning
  22. 22. From Trees to Rules A procedure a little more sophisticated: C4.5Rules p p C4.5rules: greedily prune conditions from each rule if this reduces its es a ed e o educes s estimated error Can produce duplicate rules Check for this at the end Then look at each class in turn consider the rules for that class find a “good” subset (guided by MDL) good Then rank the subsets to avoid conflicts Finally, remove rules (greedily) if this decreases error on the training data Slide 22 Artificial Intelligence Machine Learning
  23. 23. Next Class Instance-based Classifiers Slide 23 Artificial Intelligence Machine Learning
  24. 24. Introduction to Machine Learning Lecture 6 Albert Orriols i Puig aorriols@salle.url.edu i l @ ll ld Artificial Intelligence – Machine Learning Enginyeria i Arquitectura La Salle gy q Universitat Ramon Llull