Classification Continued

Introduction to Classification: Part II

1. Data Mining Classification: Alternative Techniques

2. Rule-Based Classifier
- Classify records by using a collection of "if ... then ..." rules
- Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
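
A minimal sketch of how such a rule list can be applied to a record, assuming the rules are tried in order and a default class is returned when no rule fires (the rules and attribute names below are illustrative, not taken from the slides):

# Minimal rule-based classification sketch (illustrative rules and attributes).
# Each rule is (condition, class_label): the condition is the antecedent (LHS),
# the class label is the consequent (RHS).

def matches(condition, record):
    # A record satisfies a rule when every attribute test in the condition holds.
    return all(record.get(attr) == value for attr, value in condition.items())

def classify(rules, record, default="No"):
    for condition, label in rules:      # rules are tried in order
        if matches(condition, record):
            return label
    return default                      # no rule covers the record

rules = [
    ({"Refund": "Yes"}, "No"),
    ({"Refund": "No", "Marital Status": "Single"}, "Yes"),
]
print(classify(rules, {"Refund": "No", "Marital Status": "Single"}))  # -> Yes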

3. Characteristics of Rule-Based Classifier
- Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other; every record is covered by at most one rule
- Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values; every record is covered by at least one rule

4. Building Classification Rules
- Direct method: extract rules directly from the data, e.g. RIPPER, CN2, Holte's 1R
- Indirect method: extract rules from other classification models (e.g. decision trees, neural networks), e.g. C4.5rules

5. Direct Method: Sequential Covering
- (1) Start from an empty rule
- (2) Grow a rule using the Learn-One-Rule function
- (3) Remove the training records covered by the rule
- (4) Repeat steps (2) and (3) until the stopping criterion is met

6. Aspects of Sequential Covering
- Rule growing
- Instance elimination
- Rule evaluation
- Stopping criterion
- Rule pruning

7. Sequential Covering (contd.)
- Grow a single rule
- Remove the instances covered by the rule
- Prune the rule (if necessary)
- Add the rule to the current rule set
- Repeat
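
A compact sketch of this covering loop, assuming helper functions learn_one_rule and covers are supplied by the surrounding system (they are placeholders, not defined in the slides):

def sequential_covering(records, target_class, learn_one_rule, covers):
    # Direct rule extraction: repeatedly grow one rule, keep it if it covers
    # something, and remove the records it covers before growing the next rule.
    rule_set = []
    remaining = list(records)
    while remaining:
        rule = learn_one_rule(remaining, target_class)        # grow a single rule
        covered = [r for r in remaining if covers(rule, r)]
        if not covered:                                       # stopping criterion
            break
        # a pruning step could be applied to `rule` here if necessary
        rule_set.append(rule)                                 # add rule to rule set
        remaining = [r for r in remaining if not covers(rule, r)]
    return rule_set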

8. Indirect Method: C4.5rules
- Extract rules from an unpruned decision tree
- For each rule r: A → y, consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A
- Compare the pessimistic error rate of r against that of each r'
- Prune if one of the r' has a lower pessimistic error rate
- Repeat until the generalization error can no longer be improved

9. Indirect Method: C4.5rules
- Instead of ordering the rules, order subsets of rules (class ordering)
- Each subset is a collection of rules with the same rule consequent (class)
- Compute the description length of each subset: description length = L(error) + g L(model), where g is a parameter that accounts for the presence of redundant attributes in a rule set (default value 0.5)

10. Advantages of Rule-Based Classifiers
- As highly expressive as decision trees
- Easy to interpret
- Easy to generate
- Can classify new instances rapidly
- Performance comparable to decision trees

11. Nearest Neighbor Classifiers
- Requires three things:
  - the set of stored records
  - a distance metric to compute the distance between records
  - the value of k, the number of nearest neighbors to retrieve
- To classify an unknown record:
  - compute its distance to the training records
  - identify the k nearest neighbors
  - use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor
- The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
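
A small sketch of this procedure, assuming numeric attributes and Euclidean distance (any other distance metric could be plugged in):

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, unknown, k=3):
    # train is a list of (attribute_vector, class_label) pairs: the stored records.
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], unknown))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]          # majority vote of the k neighbors

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))    # -> A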

19. Nearest Neighbor Classification...
- Choosing the value of k:
  - if k is too small, the classifier is sensitive to noise points
  - if k is too large, the neighborhood may include points from other classes
- Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
  - Example: the height of a person may vary from 1.5 m to 1.8 m, while the weight may vary from 90 lb to 300 lb
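
One common remedy is to rescale each attribute to a comparable range before computing distances; a sketch of min-max scaling under that assumption (the height/weight numbers are the ones from the slide):

def min_max_scale(records):
    # Rescale every attribute to [0, 1] so no single attribute dominates the distance.
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(rec, lows, highs))
            for rec in records]

# height in metres, weight in pounds: unscaled, the weight differences dominate.
people = [(1.5, 90), (1.8, 300), (1.65, 180)]
print(min_max_scale(people))   # each attribute now lies in [0, 1]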

20. Nearest Neighbor Classification...
- k-NN classifiers are lazy learners: they do not build models explicitly
- Unlike eager learners such as decision tree induction and rule-based systems
- Classifying unknown records is therefore relatively expensive

21. Bayes Classifier
- A probabilistic framework for solving classification problems
- Conditional probability: P(C | A) = P(A, C) / P(A) and P(A | C) = P(A, C) / P(C)
- Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)

22. Example of Bayes Theorem
- Given:
  - a doctor knows that meningitis causes stiff neck 50% of the time
  - the prior probability of any patient having meningitis is 1/50,000
  - the prior probability of any patient having a stiff neck is 1/20
- If a patient has a stiff neck, what is the probability that he or she has meningitis?
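
Working this example through Bayes theorem (M = meningitis, S = stiff neck):

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

so even when a patient has a stiff neck, meningitis remains very unlikely, because its prior probability is so low.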

23. Naïve Bayes Classifier
- Assume independence among the attributes Ai when the class is given:
  P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)
- P(Ai | Cj) can then be estimated for all Ai and Cj
- A new point is classified as Cj if P(Cj) ∏ P(Ai | Cj) is maximal

24. Naïve Bayes Classifier
- If one of the conditional probabilities is zero, the entire expression becomes zero
- Probability estimation (n: number of training records in class Cj; nc: number of those records with the given attribute value; c: number of classes; p: prior probability; m: parameter):
  - original: P(Ai | Cj) = nc / n
  - Laplace: P(Ai | Cj) = (nc + 1) / (n + c)
  - m-estimate: P(Ai | Cj) = (nc + m p) / (n + m)
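
A minimal sketch of a naive Bayes classifier for categorical attributes that uses the Laplace estimate above, so a zero count cannot zero out the whole product (the data layout and helper names are illustrative):

from collections import Counter, defaultdict
import math

def train_naive_bayes(train):
    # train is a list of (attribute_dict, class_label) pairs.
    class_counts = Counter(label for _, label in train)
    value_counts = defaultdict(Counter)          # (class, attribute) -> value counts
    for attrs, label in train:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def classify(class_counts, value_counts, attrs):
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    best, best_score = None, -math.inf
    for cls, n in class_counts.items():
        score = math.log(n / n_total)            # log P(Cj)
        for attr, value in attrs.items():
            nc = value_counts[(cls, attr)][value]
            # Laplace estimate from the slide: (nc + 1) / (n + c)
            score += math.log((nc + 1) / (n + n_classes))
        if score > best_score:
            best, best_score = cls, score
    return best

data = [({"Outlook": "Sunny"}, "No"), ({"Outlook": "Overcast"}, "Yes"),
        ({"Outlook": "Sunny"}, "No"), ({"Outlook": "Rain"}, "Yes")]
print(classify(*train_naive_bayes(data), {"Outlook": "Sunny"}))  # -> No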

25. Naïve Bayes (Summary)
- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability estimation
- Robust to irrelevant attributes
- The independence assumption may not hold for some attributes; in that case use other techniques such as Bayesian Belief Networks (BBN)

26. Artificial Neural Networks (ANN)
- The model is an assembly of inter-connected nodes and weighted links
- The output node sums each of its input values according to the weights of its links
- The output is compared against some threshold t
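
A sketch of a single output node of this kind: a weighted sum of the inputs is compared against a threshold t (the weights, inputs, and threshold below are illustrative):

def node_output(weights, inputs, t):
    # Weighted sum of the inputs, compared against the threshold t.
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > t else 0

# Example: with equal weights of 0.3 and threshold 0.4, the node fires only
# when at least two of the three binary inputs are 1.
print(node_output([0.3, 0.3, 0.3], [1, 1, 0], t=0.4))  # -> 1
print(node_output([0.3, 0.3, 0.3], [1, 0, 0], t=0.4))  # -> 0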

27. General Structure of ANN
- Training an ANN means learning the weights of the neurons

28. Algorithm for Learning ANN
- Initialize the weights (w0, w1, ..., wk)
- Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples
- Objective function: E = Σi [yi − f(w, xi)]²
- Find the weights wi that minimize this objective function, e.g., with the backpropagation algorithm
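
A minimal sketch of adjusting the weights by gradient descent on this squared-error objective, using a single linear node for simplicity (backpropagation applies the same update rule layer by layer in a multi-layer network; the learning rate and data are illustrative):

def train_weights(X, y, n_weights, lr=0.01, epochs=200):
    w = [0.0] * n_weights                       # initialize the weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sum(wk * xk for wk, xk in zip(w, xi))   # f(w, xi) for a linear node
            error = yi - pred
            # dE/dwk = -2 * error * xk, so step each weight against the gradient
            w = [wk + lr * 2 * error * xk for wk, xk in zip(w, xi)]
    return w

X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [1.0, 0.0, 1.0]
print(train_weights(X, y, n_weights=2))         # converges towards roughly (1, 0)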

29. Ensemble Methods
- Construct a set of classifiers from the training data
- Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

30. General Idea
- (diagram slide illustrating the previous point)

31. Why Does It Work?
- Suppose there are 25 base classifiers
- Each classifier has an error rate ε = 0.35
- Assume the classifiers are independent
- The ensemble makes a wrong prediction only if a majority of the base classifiers (13 or more of the 25) are wrong:
  P(ensemble wrong) = Σ (i = 13 to 25) C(25, i) ε^i (1 − ε)^(25 − i) ≈ 0.06
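
The figure of roughly 0.06 can be checked directly, since the majority vote is wrong only when 13 or more of the 25 independent classifiers err:

from math import comb

eps, n = 0.35, 25
# Probability that at least 13 of the 25 independent base classifiers are wrong.
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # about 0.06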

32. Examples of Ensemble Methods
- How to generate an ensemble of classifiers?
  - Bagging
  - Boosting

33. Bagging
- Sampling with replacement
- Build a classifier on each bootstrap sample
- In a bootstrap sample of size n, each record has probability 1 − (1 − 1/n)^n ≈ 0.632 of being selected (and probability (1 − 1/n)^n ≈ 0.368 of being left out)
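
A sketch of the bagging procedure under these assumptions: bootstrap samples are drawn with replacement, one base classifier is built per sample, and predictions are combined by majority vote (train_base is a placeholder for any learning algorithm that returns a callable model):

import random
from collections import Counter

def bagging(records, labels, train_base, n_classifiers=10):
    models = []
    n = len(records)
    for _ in range(n_classifiers):
        idx = [random.randrange(n) for _ in range(n)]        # sample with replacement
        models.append(train_base([records[i] for i in idx],
                                 [labels[i] for i in idx]))  # one classifier per bootstrap sample
    return models

def bagged_predict(models, record):
    votes = Counter(model(record) for model in models)       # each model is a callable
    return votes.most_common(1)[0][0]                        # majority vote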

34. Boosting
- An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records
- Initially, all N records are assigned equal weights
- Unlike in bagging, the weights may change at the end of each boosting round
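
One concrete way the weights can be changed is the AdaBoost update (AdaBoost is a specific boosting algorithm; the slide describes boosting only in general terms). A sketch of a single round's weight update:

import math

def update_weights(weights, correct, error_rate):
    # Increase the weights of misclassified records and decrease the weights of
    # correctly classified ones, then renormalize so the weights sum to 1.
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)    # importance of this round's classifier
    new_w = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w]

N = 5
weights = [1.0 / N] * N                        # initially, all records weighted equally
correct = [True, True, False, True, False]     # example outcome of one boosting round
print(update_weights(weights, correct, error_rate=0.4))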