Chapter 4 Classification


Chapter 4, Classification, from the book Introduction to Data Mining.


  1. Khalid Elshafie
      Database / Bioinformatics Lab, Chungbuk National University, Korea
      Classification: Basic Concepts
      December 12, 2009
  2. Outline
  3. Introduction
  4. Introduction (1/4)
      Classification, definition:
      Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
      Find a model that expresses the class attribute as a function of the values of the other attributes.
      Goal: previously unseen records should be assigned a class as accurately as possible.
      A test set is used to determine the accuracy of the model.
      [Diagram: input attribute set → classification model → output class label]
  5. Introduction (2/4)
      Classification is a two-step process:
      1. Learning step: the training data are analyzed by a classification algorithm and a model (a classifier) is learned.
      2. Classification step: test data are used to estimate the accuracy of the learned classifier.
      Usually the given data set is divided into a training set and a test set.
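The two-step process above corresponds directly to a fit/predict workflow. The sketch below is a minimal, hypothetical illustration that assumes scikit-learn is available (the slides name no library), with the iris data standing in for a labeled record collection:

```python
# Minimal sketch of the two-step process (assumes scikit-learn; data set is a stand-in).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

classifier = DecisionTreeClassifier()       # any classification algorithm could be used here
classifier.fit(X_train, y_train)            # step 1: learn a model from the training data
predictions = classifier.predict(X_test)    # step 2: classify the held-out test records
print(accuracy_score(y_test, predictions))  # estimate of the model's accuracy
```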
  6. Introduction (3/4)
      Examples of classification:
      - Predicting tumor cells as benign or malignant
      - Classifying credit card transactions as legitimate or fraudulent
      - Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
      - Categorizing news stories as finance, weather, entertainment, sports, etc.
  7. Introduction (4/4)
      Classification techniques:
      - Decision tree based methods
      - Rule-based methods
      - Neural networks
      - Naïve Bayes and Bayesian belief networks
      - Support vector machines
  8. General Approach to Solving a Classification Problem
  9. General Approach to Solving a Classification Problem (1/2)
      [Figure: general approach for building a classification model.]
  10. General Approach to Solving a Classification Problem (2/2)
      Performance evaluation:
      Evaluating the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model.
      Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information in a single number makes it more convenient to compare the performance of different models.
      [Table: confusion matrix for a two-class problem.]
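As a small illustration of such a single-number summary, the sketch below computes accuracy and error rate from the four counts of a two-class confusion matrix; the counts and variable names are made up for illustration:

```python
# Accuracy and error rate from a two-class confusion matrix (illustrative counts).
true_pos, false_neg = 40, 10     # actual "Yes": predicted Yes / predicted No
false_pos, true_neg = 5, 45      # actual "No":  predicted Yes / predicted No

total = true_pos + false_neg + false_pos + true_neg
accuracy = (true_pos + true_neg) / total      # fraction of test records predicted correctly
error_rate = (false_pos + false_neg) / total  # fraction predicted incorrectly
print(accuracy, error_rate)                   # 0.85 0.15
```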
  11. Decision Tree Induction
  12. Decision Tree Induction (1/15)
      What is a decision tree?
      A decision tree is a flowchart-like tree structure.
      Each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
      [Figure: example decision tree on the Refund / Marital Status (MarSt) / Taxable Income (TaxInc) data, with the root node, internal nodes, branches, and leaf nodes (class labels YES / NO) annotated.]
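A fitted decision tree is just a nested sequence of attribute tests. As an illustration, the function below hand-codes the Refund / MarSt / TaxInc example tree that reappears in the "apply model to test data" walkthrough near the end of the deck; treating exactly 80K as the ">= 80K" branch is my assumption:

```python
# Hand-coded version of the example tree, for illustration only.
def classify(record):
    if record["Refund"] == "Yes":
        return "NO"                        # leaf node
    # Refund == "No": test marital status next
    if record["MarSt"] == "Married":
        return "NO"                        # leaf node
    # MarSt is "Single" or "Divorced": test taxable income
    if record["TaxInc"] < 80_000:          # boundary handling is an assumption
        return "NO"
    return "YES"

print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 95_000}))  # NO
```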
  13. Decision Tree Induction (2/15)
      How to build a decision tree?
      Let Dt be the set of training records that reach a node t.
      General procedure:
      - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
      - If Dt is an empty set, then t is a leaf node labeled with the default class yd.
      - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
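This general procedure translates almost directly into code. The sketch below is a compact, hypothetical implementation that represents each record as a dict with a "class" key; the attribute chosen for the split is a deliberate placeholder (the first attribute that still varies), since the real selection measures are introduced on the following slides:

```python
# Recursive sketch of the general tree-building procedure (placeholder split choice).
from collections import Counter

def build_tree(records, attributes, default_class):
    if not records:                                   # empty Dt: leaf with default class
        return {"leaf": default_class}
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1:                        # all records in one class: leaf
        return {"leaf": classes[0]}
    majority = Counter(classes).most_common(1)[0][0]
    # Placeholder attribute test: first attribute whose values still differ.
    candidates = [a for a in attributes if len({r[a] for r in records}) > 1]
    if not candidates:                                # no useful split left
        return {"leaf": majority}
    attr = candidates[0]
    node = {"attribute": attr, "children": {}}
    for value in {r[attr] for r in records}:          # one subset (child) per outcome
        subset = [r for r in records if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["children"][value] = build_tree(subset, remaining, majority)
    return node

data = [{"Refund": "Yes", "MarSt": "Single",  "class": "No"},
        {"Refund": "No",  "MarSt": "Married", "class": "No"},
        {"Refund": "No",  "MarSt": "Single",  "class": "Yes"}]
print(build_tree(data, ["Refund", "MarSt"], default_class="No"))
```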
  14. Decision Tree Induction (3/15)
      How to build a decision tree?
      Tree induction uses a greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
      Tree induction issues:
      - How to split the records: how to specify the attribute test condition, and how to determine the best split?
      - When to stop splitting?
  15. Decision Tree Induction (4/15)
      How to specify the test condition?
      - Depends on the attribute type: nominal, ordinal, or continuous.
      - Depends on the number of ways to split: 2-way split or multi-way split.
  16. Decision Tree Induction (5/15)
      Splitting based on nominal attributes:
      - Multi-way split: use as many partitions as there are distinct values.
      - Binary split: divide the values into two subsets.
      [Figure: CarType split multi-way into {Family}, {Sports}, {Luxury}, or split binary into {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.]
  17. Decision Tree Induction (6/15)
      Splitting based on ordinal attributes:
      - Multi-way split: use as many partitions as there are distinct values.
      - Binary split: divide the values into two subsets, as long as the split does not violate the order property of the attribute.
      [Figure: Size split multi-way into {Small}, {Medium}, {Large}, or split binary into {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}.]
  18. Decision Tree Induction (7/15)
      Splitting based on continuous attributes:
      - Multi-way split: must consider all possible ranges of the continuous values; one approach is discretization.
      - Binary split: the test condition can be expressed as a comparison test, (A < v) or (A >= v).
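One concrete way to pick the threshold v for a binary split is to evaluate the candidate thresholds that lie between consecutive sorted values. The sketch below does this with the Gini impurity (one of the measures mentioned later on the pre-pruning slide; the information-based measures on the next slides could be plugged in instead); the data in the example call is made up for illustration:

```python
# Choosing a binary split threshold for a continuous attribute (Gini impurity).
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                    # no threshold between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2.0       # candidate threshold (midpoint)
        left = [c for a, c in pairs if a < v]
        right = [c for a, c in pairs if a >= v]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best_impurity:
            best_v, best_impurity = v, weighted
    return best_v, best_impurity

# Illustrative data: a continuous attribute (e.g., income in thousands) and a yes/no class.
print(best_binary_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                        ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]))
```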
  19. Decision Tree Induction (8/15)
      How to determine the best split?
      Attribute selection measure: a heuristic for selecting the splitting criterion that best separates a given data set.
      Examples: information gain, gain ratio.
  20. Decision Tree Induction (9/15)
      Information gain:
      - Used by the ID3 algorithm as its attribute selection measure.
      - Select the attribute with the highest information gain.
      Expected information (entropy) needed to classify a tuple in D:
        Info(D) = -sum_{i=1..m} p_i * log2(p_i), where p_i is the fraction of tuples in D that belong to class C_i.
      Information needed (after using A to split D into v partitions) to classify D:
        Info_A(D) = sum_{j=1..v} (|D_j| / |D|) * Info(D_j)
      Information gained by branching on attribute A:
        Gain(A) = Info(D) - Info_A(D)
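The three quantities above are a few lines of plain Python; the sketch below uses my own helper names and works from class counts rather than raw records:

```python
# Entropy, split information requirement, and information gain from class counts.
from math import log2

def info(class_counts):
    """Info(D): expected information (entropy) for a node with these class counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def info_after_split(partitions):
    """Info_A(D): weighted entropy after the split; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(class_counts) - info_after_split(partitions)
```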
  21. Decision Tree Induction (10/15)
      Information gain, example:
      The data set D contains 14 records: 9 records of class "Yes" and 5 records of class "No", so
        Info(D) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.940 bits.
      Similarly, the information requirement after splitting on each candidate attribute is computed, and the attribute with the highest gain is selected.
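A quick check of that figure in plain Python:

```python
# Entropy of a 14-record set with 9 "Yes" and 5 "No" records.
from math import log2

info_D = -(9/14) * log2(9/14) - (5/14) * log2(5/14)
print(round(info_D, 3))   # 0.94
```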
  22. Decision Tree Induction (11/15)
      Information gain, example (continued):
      [Figure: the tree after splitting on age, with branches youth, middle aged, and senior; the middle-aged branch is a leaf labeled "Yes".]
  23. Decision Tree Induction (12/15)
      Gain ratio:
      - The information gain measure is biased towards attributes with a large number of values.
      - C4.5 (a successor of ID3) uses the gain ratio, a normalization of information gain, to overcome this problem:
        GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo_A(D) = -sum_{j=1..v} (|D_j| / |D|) * log2(|D_j| / |D|).
      Example, for attribute income:
      Gain(Income) = 0.029 and SplitInfo_income(D) = 0.926, therefore GainRatio(Income) = 0.029 / 0.926 = 0.031.
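A small sketch of the gain ratio computation, assuming the record counts of each branch of the candidate split are known (the helper names and the branch sizes in the example call are mine, purely for illustration):

```python
# Gain ratio from an attribute's information gain and its branch sizes.
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) for a split that places partition_sizes[j] records in branch j."""
    total = sum(partition_sizes)
    return -sum((n / total) * log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(information_gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return information_gain / split_info(partition_sizes)

print(round(gain_ratio(0.029, [2, 5, 7]), 3))   # hypothetical branch sizes
```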
  27. Decision Tree Induction (14/15)
      Comparing attribute selection measures:
      - Information gain: biased towards multi-valued attributes.
      - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
  28. Decision Tree Induction (15/15)
      Decision tree induction:
      Advantages:
      - Inexpensive to construct.
      - Easy to interpret for small-sized trees.
      - Extremely fast at classifying unknown records.
      Disadvantages:
      - The resulting decision tree can be suboptimal (i.e., it may overfit the training data).
  29. Model Overfitting
  30. Model Overfitting (1/5)
      Types of errors committed by a classification model:
      - Training error: the number of misclassification errors committed on the training records.
      - Generalization error: the expected error of the model on previously unseen records.
      A good model must have low training error as well as low generalization error.
      A model that fits the training data too well can have a poorer generalization error than a model with a higher training error; this is overfitting.
  31. Model Overfitting (2/5)
      Reasons for overfitting:
      The presence of noise in the data set.
  32. Model Overfitting (2/5)
      Reasons for overfitting:
      The presence of noise in the data set.
      [Figure: example training set containing noise; one record is annotated as misclassified.]
  33. Model Overfitting (3/5)
      Reasons for overfitting:
      Lack of representative samples.
      [Figure: example with too few representative training records; one record is annotated as misclassified.]
  34. Model Overfitting (4/5)
      Handling overfitting: pre-pruning (early stopping rule)
      Stop the algorithm before it grows a fully-grown tree.
      Typical stopping conditions for a node:
      - Stop if all instances belong to the same class.
      - Stop if all the attribute values are the same.
      More restrictive conditions:
      - Stop if the number of instances is less than some user-specified threshold.
      - Stop if the class distribution of the instances is independent of the available features (e.g., using a chi-square test).
      - Stop if expanding the current node does not improve the impurity measure (e.g., Gini index or information gain).
  35. Model Overfitting (5/5)
      Handling overfitting: post-pruning
      - Grow the decision tree to its entirety.
      - Trim the nodes of the decision tree in a bottom-up fashion.
      - If the generalization error improves after trimming, replace the sub-tree with a leaf node.
      - The class label of the new leaf node is determined from the majority class of the instances in the sub-tree.
      In practice, post-pruning is preferable, since pre-pruning can stop too early.
  36. Performance Evaluation
  37. Performance Evaluation (1/3)
      Holdout method:
      - Partition the data into two independent sets, e.g., a training set (2/3) and a test set (1/3).
      - Used for data sets with a large number of samples.
      [Figure: the available examples are divided randomly; the training set is used to develop one tree, and the held-out portion (e.g., 30%) is used to check accuracy.]
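A plain-Python sketch of the random partition described above (the function name and default fraction are mine):

```python
# Random holdout partition into a training set and a test set.
import random

def holdout_split(records, train_fraction=2/3, seed=0):
    shuffled = records[:]                     # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)     # divide randomly
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]     # (training set, test set)

train_set, test_set = holdout_split(list(range(30)))
print(len(train_set), len(test_set))          # 20 10
```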
  38. Performance Evaluation (2/3)
      Cross-validation:
      - Divide the data set into k subsamples.
      - Use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation).
      - Used for data sets of moderate size.
      - 10-fold cross-validation is the standard and most popular technique for estimating a classifier's accuracy.
      [Figure: the available examples are split into a 90% training set, used to develop 10 different trees, and a 10% test set, used to check accuracy.]
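A short sketch of 10-fold cross-validation, again assuming scikit-learn (the slides describe only the procedure, not a library):

```python
# 10-fold cross-validation of a decision tree classifier (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                  # stand-in labeled data set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())                               # average accuracy over the 10 folds
```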
  39. Performance Evaluation (3/3)
      Bootstrapping:
      - Based on sampling with replacement.
      - The initial data set of N samples is sampled N times, with replacement, to form another set of N samples for training.
      - Since some samples are repeated in this new set, some samples from the initial data set will not appear in the training set; these samples form the test set.
      - Used for small data sets.
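A sketch of forming one bootstrap training set and its corresponding test set in plain Python (the function name is mine; the slides describe the idea only):

```python
# One bootstrap sample: N draws with replacement for training, never-drawn records for testing.
import random

def bootstrap_split(records, seed=0):
    rng = random.Random(seed)
    n = len(records)
    train_indices = [rng.randrange(n) for _ in range(n)]          # N draws with replacement
    drawn = set(train_indices)
    training_set = [records[i] for i in train_indices]
    test_set = [records[i] for i in range(n) if i not in drawn]   # records never drawn
    return training_set, test_set

train_set, test_set = bootstrap_split(list(range(20)))
print(len(train_set), len(test_set))
```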
  40. Summary
  41. Summary: Applying the Model to Test Data
      Start from the root of the tree and follow the branch that matches the test record's attribute value at each node, until a leaf node is reached.
      [Figure, repeated over six build slides: the example decision tree (Refund = Yes: NO; Refund = No → MarSt = Married: NO; MarSt = Single or Divorced → TaxInc < 80K: NO; TaxInc > 80K: YES) is applied step by step to a test record.]
      For the test record shown, the leaf node reached assigns Cheat = "No".
  48. Summary
      Classification is one of the most important techniques in data mining and has many real-world applications.
      Decision trees are a powerful classification technique.
      - Strengths: easy to understand; fast at classifying records.
      - Weaknesses: suffer from overfitting; large trees can cause memory-handling issues.
      Overfitting is handled by pruning.
      Evaluation methods: holdout, cross-validation, bootstrapping.
  49. Thank you!
      Any comments & questions?