# DBM630 Lecture 06



1. 1. DBM630: Data Mining and Data Warehousing, MS.IT., Rangsit University, Semester 2/2011. Lecture 6: Classification and Prediction (Decision Tree and Classification Rules), by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. 2. Topics: What Is Classification, What Is Prediction?; Decision Tree; Classification Rule: Covering Algorithm. Data Warehousing and Data Mining by Kritsada Sriphaew
3. 3. What Is Classification? Cases: a bank loan officer needs an analysis of her data to learn which loan applicants are "safe" and which are "risky" for the bank; a marketing manager needs data analysis to help guess whether a customer with a given profile will buy a new computer; a medical researcher wants to analyze breast cancer data to predict which one of three specific treatments a patient should receive. In each case the data analysis task is classification: a model (a classifier) is constructed to predict categorical labels.
4. 4. What Is Prediction? Suppose the marketing manager would like to predict how much a given customer will spend during a sale at the shop. This data analysis task is numeric prediction: the model constructed predicts a continuous value or ordered values, as opposed to a categorical label. This model is a predictor. Regression analysis is the statistical methodology most often used for numeric prediction.
5. 5. How does classification work? Data classification is a two-step process. In the first step (the learning step or training phase): a model is built describing a predetermined set of data classes or concepts; the data tuples used to build the classification model are called the training data set; if the class label is provided, this step is known as supervised learning, otherwise it is unsupervised learning; the learned model may be represented in the form of classification rules, decision trees, Bayesian models, mathematical formulae, etc.
6. 6. How does classification work? In the second step, the learned model is used for classification. Estimate the predictive accuracy of the model using a hold-out data set (a test set of class-labeled samples which are randomly selected and independent of the training samples). If the accuracy of the model were estimated on the training data set, the model would tend to overfit the data. If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown. In experiments, there are three kinds of dataset: the training data set, the hold-out (or validation) data set, and the test data set.
7. 7. Issues Regarding Classification/Prediction. Criteria to compare and evaluate classification and prediction methods: Accuracy: the ability of a given classifier to correctly predict the class label of new or unseen data. Speed: the computation costs involved in generating and using the given classifier or predictor. Robustness: the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values. Scalability: the ability to construct the classifier or predictor efficiently given large amounts of data. Interpretability: the level of understanding and insight provided by the classifier or predictor; subjective and more difficult to assess.
8. 8. Decision Tree. A decision tree is a flow-chart-like tree structure: each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class. The top-most node in a tree is the root node. Instead of using the complete set of features jointly to make a decision, different subsets of features are used at different levels of the tree. The example tree represents the concept buys_computer: the root tests Age? (branches <=30, 31…40, >40); the <=30 branch tests student? (no -> no, yes -> yes); the 31…40 branch is the leaf yes; the >40 branch tests credit_rating? (excellent -> no, fair -> yes). Classification – Decision Tree
9. 9. Decision Tree Induction. Normal procedure: a greedy algorithm, top-down, in recursive divide-and-conquer fashion. First: an attribute is selected for the root node and a branch is created for each possible attribute value. Then: the instances are split into subsets (one for each branch extending from the node). Finally: the procedure is repeated recursively for each branch, using only the instances that reach that branch. The process stops if: all instances for a given node belong to the same class; there is no remaining attribute on which the samples may be further partitioned (majority vote is employed); or there is no sample left for the branch to test the attribute (majority vote is employed).
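The recursive procedure above can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's reference implementation; the function and field names (`id3`, `outlook`, `play`, etc.) are ours:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def id3(rows, attrs, target):
    """Greedy top-down induction: pick the attribute with the highest
    information gain, split the instances, and recurse on each branch."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all instances in one class
        return labels[0]
    if not attrs:                             # no attribute left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        branches = {}
        for r in rows:
            branches.setdefault(r[attr], []).append(r[target])
        remainder = sum(len(b) / len(rows) * entropy(b)
                        for b in branches.values())
        return entropy(labels) - remainder

    best = max(attrs, key=gain)
    node = {best: {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        node[best][value] = id3(subset, [a for a in attrs if a != best], target)
    return node

# Tiny illustrative subset of the weather data
rows = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
    {"outlook": "overcast", "humidity": "high", "play": "yes"},
]
tree = id3(rows, ["outlook", "humidity"], "play")
print(tree)
```

Pure subsets become leaves, and each recursive call drops the attribute just used, exactly as in the divide-and-conquer description above.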
10. 10. Decision Tree Representation (An Example). The decision tree (DT) of the weather example is built from these 14 instances:

| Outlook | Temp. | Humid. | Windy | Play |
|---|---|---|---|---|
| sunny | hot | high | false | N |
| sunny | hot | high | true | N |
| overcast | hot | high | false | Y |
| rainy | mild | high | false | Y |
| rainy | cool | normal | false | Y |
| rainy | cool | normal | true | N |
| overcast | cool | normal | true | Y |
| sunny | mild | high | false | N |
| sunny | cool | normal | false | Y |
| rainy | mild | normal | false | Y |
| sunny | mild | normal | true | Y |
| overcast | mild | high | true | Y |
| overcast | hot | normal | false | Y |
| rainy | mild | high | true | N |

The induced tree: the root tests outlook; sunny -> humidity (high -> no, normal -> yes); overcast -> yes; rainy -> windy (false -> yes, true -> no).
11. 11. An Example (Which attribute is the best?). There are four possibilities for each split, one per attribute: outlook, temperature, humidity, and windy.
12. 12. Criteria for Attribute Selection. Which is the best attribute? The one which will result in the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes. A popular impurity criterion is information gain: information gain increases with the average purity of the subsets that an attribute produces. Strategy: the attribute with the highest information gain is chosen as the test attribute for the current node.
13. 13. Computing "Information". Information is measured in bits. Given a probability distribution, the information required to predict an event is the distribution's entropy. Entropy gives the information required in bits (this can involve fractions of bits!). Information gain measures the goodness of a split. Formula for computing expected information: let S be a set of s data instances, where the class label attribute has n distinct classes Ci (for i = 1, …, n), and let si be the number of instances in class Ci. The expected information (entropy) is info([s1, s2, …, sn]) = entropy(s1/s, s2/s, …, sn/s) = -Σi pi log2(pi), where pi = si/s is the probability that an instance belongs to class Ci. Formula for computing the information gain of an attribute A: gain(A) = info. before splitting - info. after splitting.
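As a sanity check, the expected-information formula above is a few lines of Python (a sketch; the helper name `info` is ours):

```python
import math

def info(counts):
    """info([s1, ..., sn]) = -sum p_i log2 p_i, with p_i = s_i / s.
    Terms with zero count contribute nothing (0 * log2 0 is taken as 0)."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

print(round(info([2, 3]), 3))   # 0.971
print(round(info([9, 5]), 3))   # 0.94
```

These are exactly the 0.971-bit and 0.940-bit values used in the "Outlook" computations on the following slides.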
14. 14. Expected Information for "Outlook" (computed from the weather table). "Outlook" = "sunny": info([2,3]) = entropy(2/5, 3/5) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits. "Outlook" = "overcast": info([4,0]) = entropy(1, 0) = -(1)log2(1) - (0)log2(0) = 0 bits (taking 0·log2(0) = 0). "Outlook" = "rainy": info([3,2]) = entropy(3/5, 2/5) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits. Expected information for attribute "Outlook": info([2,3],[4,0],[3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits.
15. 15. Information Gain for "Outlook". Information gain = info. before splitting - info. after splitting: gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits. Information gain for the attributes of the weather data: gain("Outlook") = 0.247 bits, gain("Temperature") = 0.029 bits, gain("Humidity") = 0.152 bits, gain("Windy") = 0.048 bits.
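The gain numbers above can be reproduced directly from the class counts in each branch (a sketch; `info_gain` is our name for the helper):

```python
import math

def info(counts):
    """Entropy in bits of a class-count distribution."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def info_gain(before, branches):
    """gain = info before splitting - weighted average info after splitting.
    `branches` lists the class counts in each subset the attribute creates."""
    total = sum(before)
    after = sum(sum(b) / total * info(b) for b in branches)
    return info(before) - after

# Weather data: 9 yes / 5 no overall
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # Outlook: 0.247
print(round(info_gain([9, 5], [[6, 2], [3, 3]]), 3))          # Windy: 0.048
```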
16. 16. An Example of the Gain Criterion (Which attribute is the best?). gain(outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.247 (the best); gain(temperature) = info([9,5]) - info([2,2],[4,2],[3,1]) = 0.029; gain(humidity) = info([9,5]) - info([3,4],[6,1]) = 0.152; gain(windy) = info([9,5]) - info([6,2],[3,3]) = 0.048.
17. 17. Continuing to Split. If "Outlook" = "sunny": gain("Temperature") = 0.571 bits, gain("Humidity") = 0.971 bits, gain("Windy") = 0.020 bits.
18. 18. The Final Decision Tree. Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when the data can't be split any further.
19. 19. Properties for a Purity Measure. Properties we require from a purity measure: when a node is pure, the measure should be zero; when impurity is maximal (i.e., all classes equally likely), the measure should be maximal; the measure should obey the multistage property (i.e., decisions can be made in several stages): measure([2,3,4]) = measure([2,7]) + (7/9)×measure([3,4]). Entropy is the only function that satisfies all three properties!
20. 20. Some Properties of the Entropy. The multistage property: entropy(p, q, r) = entropy(p, q+r) + [(q+r)/(p+q+r)] × entropy(q, r). Ex.: info([2,3,4]) can be calculated as
= {-(2/9)log2(2/9) - (7/9)log2(7/9)} + (7/9)×{-(3/7)log2(3/7) - (4/7)log2(4/7)}
= -(2/9)log2(2/9) - (7/9)[ log2(7/9) + (3/7)log2(3/7) + (4/7)log2(4/7) ]
= -(2/9)log2(2/9) - (7/9)[ (3/7)log2(7/9) + (4/7)log2(7/9) + (3/7)log2(3/7) + (4/7)log2(4/7) ]
= -(2/9)log2(2/9) - (7/9)[ (3/7)log2(7/9 × 3/7) + (4/7)log2(7/9 × 4/7) ]
= -(2/9)log2(2/9) - (7/9)[ (3/7)log2(3/9) + (4/7)log2(4/9) ]
= -(2/9)log2(2/9) - (3/9)log2(3/9) - (4/9)log2(4/9)
21. 21. A Problem: Highly-Branching Attributes. Problematic: attributes with a large number of values (extreme case: an ID code). Subsets are more likely to be pure if there is a large number of values, so information gain is biased towards choosing attributes with a large number of values. This may result in overfitting (selection of an attribute that is non-optimal for prediction) and fragmentation.
22. 22. Example: Highly-Branching Attributes. Add an ID attribute giving each of the 14 weather instances a unique code A–N. Splitting on ID produces 14 singleton branches, each containing one instance and therefore pure: split info(ID) = info([0,1],[0,1],[1,0],…,[0,1]) = 0 bits, so gain(ID) = 0.940 bits (the maximum). The resulting "tree" is just a lookup table (A -> N, B -> N, C -> Y, …) and is useless for predicting new instances.
23. 23. Modification: The Gain Ratio. Gain ratio: a modification of the information gain that reduces its bias. The gain ratio takes the number and size of branches into account when choosing an attribute: it corrects the information gain by taking the intrinsic information of a split into account. Intrinsic information (split info): the entropy of the distribution of instances into branches (i.e., how much information we need to tell which branch an instance belongs to).
24. 24. Computing the Gain Ratio. Example: intrinsic information (split info) for the ID code: info([1,1,…,1]) = 14 × ((-1/14)log2(1/14)) = 3.807 bits. The value of an attribute decreases as its intrinsic information gets larger. Definition of gain ratio: gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute"). Example: gain_ratio("ID") = gain("ID") / intrinsic_info("ID") = 0.940 bits / 3.807 bits = 0.247.
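The gain-ratio correction is a one-line change once entropy is available (a sketch; the function names are ours):

```python
import math

def info(counts):
    """Entropy in bits of a count distribution."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def gain_ratio(gain, branch_sizes):
    """gain ratio = information gain / intrinsic information, where the
    intrinsic information is the entropy of the branch-size distribution."""
    return gain / info(branch_sizes)

# ID code: 14 singleton branches -> intrinsic info = log2(14) bits
print(round(info([1] * 14), 3))               # 3.807
print(round(gain_ratio(0.940, [1] * 14), 3))  # 0.247
```

Note how the huge split info (3.807 bits) shrinks the ID code's maximal gain of 0.940 down to an unremarkable ratio.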
25. 25. Gain Ratio for Weather Data.
26. 26. Gain Ratio for Weather Data (Discussion). "Outlook" still comes out top. However, "ID" has an even greater gain ratio; standard fix: an ad hoc test to prevent splitting on that type of attribute. Problem with the gain ratio: it may overcompensate and choose an attribute just because its intrinsic information is very low; standard fix: only consider attributes with greater than average information gain.
27. 27. Avoiding Overfitting the Data. The naïve DT algorithm grows each branch of the tree just deeply enough to perfectly classify the training examples. This may produce trees that overfit the training examples but do not work well for general cases. Reason: the training set may contain noise, or it may be too small to be a representative sample of the true target tree (function).
28. 28. Avoid Overfitting: Pruning. Pruning simplifies a decision tree to prevent overfitting to noise in the data. Two main pruning strategies: 1. Prepruning: stop growing a tree when there is no statistically significant association between any attribute and the class at a particular node; most popular test: the chi-squared test (only statistically significant attributes were allowed to be selected by the information gain procedure). 2. Postpruning: take a fully-grown decision tree and discard unreliable parts using two main pruning operations, subtree replacement and subtree raising, with possible strategies such as error estimation, significance testing, and the MDL principle. Prepruning is preferred in practice because of early stopping.
29. 29. Subtree Replacement. Bottom-up: a tree is considered for replacement once all its subtrees have been considered.
30. 30. Subtree Raising. Deletes a node and redistributes its instances. Slower than subtree replacement (worthwhile?).
31. 31. Tree to Rule vs. Rule to Tree. Tree to rules: from the weather tree (outlook: sunny -> humidity (high -> no, normal -> yes); overcast -> yes; rainy -> windy (false -> yes, true -> no)) we read off: If outlook=sunny & humidity=high then class=no. If outlook=sunny & humidity=normal then class=yes. If outlook=overcast then class=yes. If outlook=rainy & windy=false then class=yes. If outlook=rainy & windy=true then class=no. Rules to tree (?): given the rule set "If outlook=sunny & humidity=high then class=no. If humidity=normal then class=yes. If outlook=overcast then class=yes. If outlook=rainy & windy=true then class=no.", what tree results? Question: outlook=rainy & windy=true & humidity=normal -> ?; outlook=rainy & windy=false & humidity=high -> ? Classification Rules
32. 32. Classification Rule: Algorithms. Two main algorithms: Inferring rudimentary rules (1R: a 1-level decision tree). Covering algorithms: an algorithm to construct the rules; pruning rules and computing significance (hypergeometric distribution vs. binomial distribution); incremental reduced-error pruning.
33. 33. Inferring Rudimentary Rules (the 1R rule) (Holte, 93). 1R learns a 1-level decision tree: it generates a set of rules that all test one particular attribute, considering each attribute in turn. Pseudo-code: For each attribute, and for each value of the attribute, make a rule as follows: count how often each class appears, find the most frequent class, and make the rule assign that class to this attribute-value. Calculate the error rate of the rules. Choose the rules with the smallest error rate. Note: "missing" can be treated as a separate attribute value. 1R's simple rules performed not much worse than much more complex decision trees.
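The 1R pseudo-code above translates almost line for line into Python. This is an illustrative sketch on a made-up four-row subset of the weather data, not the lecture's code:

```python
from collections import Counter

def one_r(rows, attributes, target):
    """For each attribute, build one rule per value predicting the majority
    class; return the (attribute, rules, errors) with the fewest errors."""
    best = None
    for attr in attributes:
        labels_by_value = {}
        for row in rows:
            labels_by_value.setdefault(row[attr], []).append(row[target])
        rules = {value: Counter(labels).most_common(1)[0][0]
                 for value, labels in labels_by_value.items()}
        errors = sum(rules[row[attr]] != row[target] for row in rows)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
]
attr, rules, errors = one_r(rows, ["outlook", "windy"], "play")
print(attr, errors)  # outlook 0
```

On this toy subset "outlook" classifies every row correctly, while the best "windy" rule set makes one error, so 1R picks "outlook".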
34. 34. An Example: Evaluating the Weather Attributes (Nominal, Ordinal). Using the 14 weather instances:

| Attribute | Rule | Error | Total Error |
|---|---|---|---|
| Outlook (O) | O = sunny -> no | 2/5 | 4/14 |
| | O = overcast -> yes | 0/4 | |
| | O = rainy -> yes | 2/5 | |
| Temp. (T) | T = hot -> no | 2/4 | 5/14 |
| | T = mild -> yes | 2/6 | |
| | T = cool -> yes | 1/4 | |
| Humidity (H) | H = high -> no | 3/7 | 4/14 |
| | H = normal -> yes | 1/7 | |
| Windy (W) | W = false -> yes | 2/8 | 5/14 |
| | W = true -> no | 3/6 | |

1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Outlook" or "Humidity".
35. 35. An Example: Evaluating the Weather Attributes (Numeric). With numeric Temp. and Humidity values:

| Attribute | Rule | Error | Total Error |
|---|---|---|---|
| Outlook (O) | O = sunny -> no | 2/5 | 4/14 |
| | O = overcast -> yes | 0/4 | |
| | O = rainy -> yes | 2/5 | |
| Temp. (T) | T <= 77.5 -> yes | 3/10 | 5/14 |
| | T > 77.5 -> no | 2/4 | |
| Humidity (H) | H <= 82.5 -> yes | 1/7 | 3/14 |
| | 82.5 < H <= 95.5 -> no | 2/6 | |
| | H > 95.5 -> yes | 0/1 | |
| Windy (W) | W = false -> yes | 2/8 | 5/14 |
| | W = true -> no | 3/6 | |

1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Humidity".
36. 36. Dealing with Numeric Attributes. Numeric attributes are discretized: the range of the attribute is divided into a set of intervals. Instances are sorted according to the attribute's values, and breakpoints are placed where the (majority) class changes (so that the total error is minimized). Example: Temperature from the weather data.
Left-to-right, a breakpoint at every class change:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y | N | Y Y Y | N N Y | Y Y | N | Y Y | N
With a minimum of 3 instances per interval (min = 3):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y N Y Y Y | N N Y Y Y | N Y Y N
Merging adjacent intervals with the same majority class:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y N Y Y Y N N Y Y Y | N Y Y N
37. 37. Covering Algorithm. Separate-and-conquer: selects the test that maximizes the number of covered positive examples and minimizes the number of negative examples that pass the test; it usually pays no attention to the examples that do not pass the test. The separate-and-conquer algorithm focuses on each class in turn, seeks a way of covering all instances in the class, and can add more rules for a perfect rule set. Comparison to the decision tree (DT), a divide-and-conquer method that optimizes for all outcomes of the test: a DT focuses on all classes at each step and seeks an attribute to split on that best separates the classes. A DT can be converted into a rule set, but the straightforward conversion yields an overly complex rule set, and more effective conversions are not trivial. In multiclass situations, a covering algorithm concentrates on one class at a time whereas a DT learner takes all classes into account.
38. 38. Constructing a Classification Rule (An Example). [Figure: an instance space of a's and b's in the x–y plane, split at x = 1.2 and y = 2.6.] Rules grown for class b: If x <= 1.2 then class = b. Rule so far: If x > 1.2 then class = b. Rule after adding a new test: If x > 1.2 & y <= 2.6 then class = b. Decision-tree view: test x > 1.2 (n -> b; y -> test y > 2.6 (n -> b, y -> ?)). More rules could be added for a "perfect" rule set.
39. 39. A Simple Covering Algorithm. Generates a rule by adding tests that maximize the rule's accuracy, even though each new test reduces the rule's coverage. Similar to the situation in decision trees: the problem of selecting an attribute to split on. A decision tree inducer maximizes overall purity; a covering algorithm maximizes rule accuracy. Goal: maximizing accuracy. t: total number of instances covered by the rule; p: positive examples of the class covered by the rule; t - p: number of errors made by the rule. One option: select the test that maximizes the ratio p/t. We are finished when p/t = 1 or the set of instances cannot be split any further.
40. 40. An Example: Contact Lenses Data.

| Age | Spectacle prescription | Astigmatism | Tear prod. rate | Recom. lenses |
|---|---|---|---|---|
| young | myope | no | reduced | none |
| young | myope | no | normal | soft |
| young | myope | yes | reduced | none |
| young | myope | yes | normal | hard |
| young | hypermetrope | no | reduced | none |
| young | hypermetrope | no | normal | soft |
| young | hypermetrope | yes | reduced | none |
| young | hypermetrope | yes | normal | hard |
| pre-presbyopic | myope | no | reduced | none |
| pre-presbyopic | myope | no | normal | soft |
| pre-presbyopic | myope | yes | reduced | none |
| pre-presbyopic | myope | yes | normal | hard |
| pre-presbyopic | hypermetrope | no | reduced | none |
| pre-presbyopic | hypermetrope | no | normal | soft |
| pre-presbyopic | hypermetrope | yes | reduced | none |
| pre-presbyopic | hypermetrope | yes | normal | none |
| presbyopic | myope | no | reduced | none |
| presbyopic | myope | no | normal | none |
| presbyopic | myope | yes | reduced | none |
| presbyopic | myope | yes | normal | hard |
| presbyopic | hypermetrope | no | reduced | none |
| presbyopic | hypermetrope | no | normal | soft |
| presbyopic | hypermetrope | yes | reduced | none |
| presbyopic | hypermetrope | yes | normal | none |

First, try to find a rule for "hard".
41. 41. An Example: Contact Lenses Data (Finding a good choice). Rule we seek: If ? then recommendation = hard. Possible tests (correct/covered): Age = young: 2/8. Age = pre-presbyopic: 1/8. Age = presbyopic: 1/8. Spectacle prescription = myope: 3/12. Spectacle prescription = hypermetrope: 1/12. Astigmatism = no: 0/12. Astigmatism = yes: 4/12. Tear production rate = reduced: 0/12. Tear production rate = normal: 4/12. The best ratio is a tie between astigmatism = yes and tear production rate = normal.
42. 42. Modified Rule and Resulting Data. Rule with the best test added: If astigmatism = yes then recommendation = hard. This covers the twelve rows with astigmatism = yes in the contact lenses table. We still need to refine the rule, since not all covered rows are correct according to the rule.
43. 43. Further Refinement. Current state: If astigmatism = yes and ? then recommendation = hard. Possible tests: Age = young: 2/4. Age = pre-presbyopic: 1/4. Age = presbyopic: 1/4. Spectacle prescription = myope: 3/6. Spectacle prescription = hypermetrope: 1/6. Tear production rate = reduced: 0/6. Tear production rate = normal: 4/6.
44. 44. Modified Rule and Resulting Data. Rule with the best test added: If astigmatism = yes and tear prod. rate = normal then recommendation = hard. This covers six rows, of which four are "hard". We still need to refine the rule, since not all covered rows are correct according to the rule.
45. 45. Further Refinement. Current state: If astigmatism = yes and tear prod. rate = normal and ? then recommendation = hard. Possible tests: Age = young: 2/2. Age = pre-presbyopic: 1/2. Age = presbyopic: 1/2. Spectacle prescription = myope: 3/3. Spectacle prescription = hypermetrope: 1/3. There is a tie between the first and the fourth test; we choose the one with greater coverage.
46. 46. Modified Rule and Resulting Data. Final rule with the best test added: If astigmatism = yes and tear prod. rate = normal and spectacle prescription = myope then recommendation = hard. The rule covers three rows, and all three are "hard". No need to refine the rule further, since the rule is now perfect.
47. 47. Finding More Rules. Second rule for recommending "hard lenses" (built from the instances not covered by the first rule): If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard. These two rules cover all "hard lenses": (1) If astigmatism = yes and tear prod. rate = normal and spectacle prescription = myope then recommendation = hard. (2) If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard. The process is then repeated with the other two classes, "soft lenses" and "none".
48. 48. Pseudo-code for the PRISM Algorithm.
For each class C:
  Initialize E to the instance set.
  While E contains instances in class C:
    Create a rule R with an empty left-hand side that predicts class C.
    Until R is perfect (or there are no more attributes to use) do:
      For each attribute A not mentioned in R, and each value v, consider adding the condition A = v to the left-hand side of R.
      Select A and v to maximize the accuracy p/t (break ties by choosing the condition with the largest p).
      Add A = v to R.
    Remove the instances covered by R from E.
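A compact Python rendering of the pseudo-code for a single class, exercised on a hypothetical four-row cut of the contact-lens data (a sketch; all names are ours):

```python
def accuracy(rows, test, target, cls):
    """Rank a candidate test by (p/t, p): accuracy first, coverage as tiebreak."""
    attr, value = test
    covered = [r for r in rows if r[attr] == value]
    p = sum(1 for r in covered if r[target] == cls)
    return (p / len(covered) if covered else 0.0, p)

def prism_for_class(rows, attributes, target, cls):
    """PRISM for one class: grow a rule by repeatedly adding the
    attribute=value test with the best p/t, then remove covered instances."""
    rules, remaining = [], list(rows)
    while any(r[target] == cls for r in remaining):
        rule, covered = {}, remaining
        while (covered and len(rule) < len(attributes)
               and any(r[target] != cls for r in covered)):
            candidates = [(a, v) for a in attributes if a not in rule
                          for v in sorted({r[a] for r in covered})]
            attr, value = max(candidates,
                              key=lambda t: accuracy(covered, t, target, cls))
            rule[attr] = value
            covered = [r for r in covered if r[attr] == value]
        rules.append(rule)
        remaining = [r for r in remaining
                     if not all(r[a] == v for a, v in rule.items())]
    return rules

rows = [
    {"astigmatism": "yes", "tear_rate": "normal", "lenses": "hard"},
    {"astigmatism": "yes", "tear_rate": "reduced", "lenses": "none"},
    {"astigmatism": "no", "tear_rate": "normal", "lenses": "soft"},
    {"astigmatism": "no", "tear_rate": "reduced", "lenses": "none"},
]
print(prism_for_class(rows, ["astigmatism", "tear_rate"], "lenses", "hard"))
```

On this toy data the inner loop first adds astigmatism = yes (p/t = 1/2), then tear_rate = normal, at which point the rule is perfect and the covered instance is removed.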
49. 49. Order Dependency among Rules. PRISM without the outer loop generates a decision list for one class: subsequent rules are designed for instances not covered by previous rules, but here order does not matter because all rules predict the same class. The outer loop considers all classes separately, so no order dependence is implied. Two problems remain: overlapping rules, and the need for a default rule.
50. 50. Separate-and-Conquer. Methods like PRISM (dealing with one class at a time) are separate-and-conquer algorithms: first, a rule is identified; then, all instances covered by the rule are separated out; finally, the remaining instances are "conquered". Difference to divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further. There is variety in the separate-and-conquer approach: search method (e.g., greedy, beam search, ...); test selection criteria (e.g., accuracy, ...); pruning method (e.g., MDL, hold-out set, ...); stopping criterion (e.g., minimum accuracy); post-processing step. Also: decision list vs. one rule set for each class.
51. 51. Good Rules and Bad Rules (overview). Sometimes it is better not to generate perfect rules that guarantee correct classification on all training instances, in order to avoid overfitting. How do we decide which rules are worthwhile? How do we tell when it becomes counterproductive to continue adding terms to a rule just to exclude a few pesky instances of the wrong type? Two main strategies for pruning rules: global pruning (post-pruning): create all perfect rules, then prune; incremental pruning (pre-pruning): prune a rule while it is being generated (as in INDUCT). Three pruning criteria: the MDL principle (minimum description length: rule size + exceptions); statistical significance; error on a hold-out set (reduced-error pruning).
52. 52. Hypergeometric Distribution. The dataset contains T examples; the rule selects t examples; the class contains P examples; p of the t examples selected by the rule are correctly covered. The number of correctly covered examples follows a hypergeometric distribution (drawing t of the T examples without replacement, P of them being positive).
53. 53. Computing Significance. We want the probability that a random rule does at least as well (the statistical significance of the rule):
m(R) = Σ_{i=p}^{min(t,P)} [ C(P,i) × C(T-P, t-i) ] / C(T,t)
where C(p,q) = p! / (q! (p-q)!) is the binomial coefficient.
54. 54. Good/Bad Rules by Statistical Significance (An Example).
(1) If astigmatism = yes then recommendation = hard: success fraction = 4/12; no-information success fraction = 4/24. With P = p = 4, T = 24, t = 12: probability of doing at least as well as 4/12 by chance = C(4,4)×C(20,8)/C(24,12) = 0.047.
(2) If astigmatism = yes and tear production rate = normal then recommendation = hard: success fraction = 4/6; no-information success fraction = 4/24; probability = 0.0014. Going from 0.047 to 0.0014 is a "reduced probability", meaning a better rule: this is the best rule.
(3) If astigmatism = yes and tear prod. rate = normal and age = young then recommendation = hard: success fraction = 2/2; no-information success fraction = 4/24; probability = 0.022. Going from 0.0014 to 0.022 is an "increased probability", meaning a worse rule.
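The tail probabilities above can be checked with `math.comb` from the Python standard library (a sketch; the function name `m` is ours, mirroring the slide's m(R)):

```python
from math import comb

def m(T, P, t, p):
    """Probability that a random rule covering t of T instances gets at least
    p of the P class members right: the hypergeometric upper tail m(R)."""
    return sum(comb(P, i) * comb(T - P, t - i)
               for i in range(p, min(t, P) + 1)) / comb(T, t)

print(round(m(24, 4, 12, 4), 3))   # rule (1): 0.047
print(round(m(24, 4, 6, 4), 4))    # rule (2): 0.0014
print(round(m(24, 4, 2, 2), 3))    # rule (3): 0.022
```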
55. 55. Good/Bad Rules by Statistical Significance (Another Example).
(4) If astigmatism = yes and tear production rate = normal then recommendation = none: success fraction = 2/6; no-information success fraction = 15/24; probability = 0.985: a high probability, a bad rule.
(5) If astigmatism = no and tear production rate = normal then recommendation = soft: success fraction = 5/6; no-information success fraction = 5/24; probability = 0.0001: a low probability, a good rule.
(6) If tear production rate = reduced then recommendation = none: success fraction = 12/12; no-information success fraction = 15/24; probability = 0.0017.
56. 56. The Binomial Distribution. Approximation: we can use sampling with replacement instead of sampling without replacement. The dataset contains T examples; the rule selects t examples; the class contains P examples; p examples are correctly covered:
m(R) = Σ_{i=p}^{min(t,P)} C(t,i) × (P/T)^i × (1 - P/T)^(t-i)
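The binomial approximation, coded exactly as the slide writes it (including the min(t, P) upper limit; the function name is ours). Note that for rule (1) it differs noticeably from the exact hypergeometric value of 0.047, since the rule covers half the dataset and sampling with replacement is then a rough approximation:

```python
from math import comb

def m_binomial(T, P, t, p):
    """Binomial approximation to the rule-significance tail: treat the t
    covered instances as drawn with replacement, success probability P/T."""
    q = P / T
    return sum(comb(t, i) * q**i * (1 - q)**(t - i)
               for i in range(p, min(t, P) + 1))

print(round(m_binomial(24, 4, 12, 4), 3))  # rule (1), vs exact 0.047
```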
57. 57. Pruning Strategies. For a better estimate, a rule should be evaluated on data not used for training. This requires a growing set and a pruning set. Two options: reduced-error pruning for rules builds a full unpruned rule set and simplifies it subsequently; incremental reduced-error pruning simplifies a rule immediately after it has been built.
58. 58. INDUCT (Incremental Pruning Algorithm).
Initialize E to the instance set.
Until E is empty do:
  For each class C for which E contains an instance:
    Use the basic covering algorithm to create the best perfect rule for C.
    Calculate the significance m(R) for the rule and m(R-) for the rule with the final condition omitted.
    If m(R-) < m(R), prune the rule and repeat the previous step.
  From the rules for the different classes, select the most significant one (i.e., the one with the smallest m(R)).
  Print the rule and remove the instances covered by it from E.
Continue.
INDUCT's significance computation for a rule: the probability that a completely random rule with the same coverage performs at least as well. A random rule R selects t cases at random from the dataset; how likely is it that p of these belong to the correct class? This probability is given by the hypergeometric distribution.
59. 59. Example: the classification task is to predict whether a customer will buy a computer.

| RID | age | income | student | credit_rating | Class: buys_computer |
|---|---|---|---|---|---|
| 1 | youth | high | no | fair | no |
| 2 | youth | high | no | excellent | no |
| 3 | middle_age | high | no | fair | yes |
| 4 | senior | medium | no | fair | yes |
| 5 | senior | low | yes | fair | yes |
| 6 | senior | low | yes | excellent | no |
| 7 | middle_age | low | yes | excellent | yes |
| 8 | youth | medium | no | fair | no |
| 9 | youth | low | yes | fair | yes |
| 10 | senior | medium | yes | fair | yes |
| 11 | youth | medium | yes | excellent | yes |
| 12 | middle_age | medium | no | excellent | yes |
| 13 | middle_age | high | yes | fair | yes |
| 14 | senior | medium | no | excellent | no |

Data Warehousing and Data Mining by Kritsada Sriphaew