
- 1. Interestingness Measures in Data Mining
  Howard J. Hamilton and Fabrice Guillet
  Outline
  1. Knowledge Discovery and Data Mining
  2. Criteria for Interestingness
  3. Interestingness Measures for Association Rules (objective, subjective, semantic)
  4. Interestingness Measures for Summaries
  Interestingness Measures, Howard J. Hamilton & Fabrice Guillet
- 2. Outline (cont'd)
  5. Semantic Classification of Interestingness Measures
  6. What Could Be a Good Measure?
  7. Some Objective Measures
  8. New/Recent Measures
  9. Comparison by Simulation
  10. Tools
  11. Conclusions

  Knowledge Discovery
  Interesting to users: knowledge discovery is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in huge amounts of data [Fayyad et al., 1996].
- 3. Knowledge Discovery Process
  [Diagram: the KDD pipeline from data to the decision maker: Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining (guided by interestingness/quality measures) -> Patterns -> Interpretation/Evaluation -> Knowledge. Adapted from Fayyad et al., "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, 1996.]

  What Kind of Knowledge Can Be Mined?
  1. Association rules: milk, bread -> eggs (support: 0.12; confidence: 0.75)
  2. Classification rules: Annual Income > 80000 and HouseOwned = yes -> Credit = good
  3. Clusters
  4. Exceptions: in 2008, the profit increased by 20% compared to 2007, but in April 2008, the profit decreased by 30% compared to April 2007.
  5. Summaries (histograms): e.g., the distribution of a firm's income by region (Americas, Asia, Europe) and quarter.
- 4. What is Interesting?
  • Knowledge Discovery features semi-automated (user-guided) methods exploring large data sets for patterns.
  • Discovery Science features fully automated methods exploring environments and searching for regularities.
  • Of the many possible patterns or regularities, which are interesting?
  • What do we mean by "interesting"?

  Criteria for Interestingness
- 5. Criteria for Interestingness
  1. Conciseness: small size of classification trees, length of an association rule, number of rules in a rule set.
  2. Generality: how many records in a dataset / universe are covered by the pattern?
  3. Reliability: predictive accuracy for classification rules, confidence for association rules.
  Valid = general and reliable.

  Criteria for Interestingness (cont'd)
  4. Peculiarity: a data instance is far away from other data (an outlier), or a pattern conflicts with other patterns.
  5. Diversity: the probability distribution of a summary is far away from the uniform distribution.
  6. Novelty: not known to the user and cannot be inferred (easily) from other known patterns.
- 6. Criteria for Interestingness (cont'd)
  7. Surprisingness / unexpectedness: contradicts a theory's expectations or a person's knowledge.
  8. Utility: useful according to a specified utility function.
  9. Applicability: enables decision making about future actions in the domain.

  Categorization of Interestingness Measures
  • Objective interestingness measures: based on only the raw data.
  • Subjective interestingness measures: based on both the data and the user's knowledge of the data.
  • Semantic interestingness measures: emphasize the semantics and explanations of the patterns.
- 7. Interestingness Measures for Association Rules

  Association Rules
  Association rules [Agrawal et al. 1993]:
  - Market-basket analysis
  - Unsupervised learning from positive examples
  - Algorithm + two measures (support and confidence) + two thresholds (minsup, minconf)
  Problems:
  - Can produce an enormous number of rules
  - The semantics of the support and confidence measures are not clear
  => Need to help the user select the best rules
- 8. Association Rules
  Solutions:
  - Reduce redundancy
  - Structuring (classes, similar rules)
  - Interactive decision aid (rule mining)
  - Automated threshold adjustment
  - Improve interestingness measures

  Association Rules
  Input: transaction data
  - N transactions (rows)
  - p Boolean attributes V0, V1, ..., Vp (columns)
  Output: association rules
  - Implicative tendencies A -> B:
    • A and B are itemsets, e.g., {V0, V4, V8} -> {V1}
    • Positive examples: (V0 = 1) ^ (V4 = 1) ^ (V8 = 1) => V1 = 1
  - Two measures:
    • Support: supp(A -> B) = N_AB
    • Confidence: conf(A -> B) = P(B|A) = N_AB / N_A = freq(A ∪ B) / freq(A)
- 9. Categorization of Interestingness Measures for Association Rules
  • Objective measures (probability based, rule form based): generality, reliability, peculiarity, conciseness, surprisingness
  • User subjective measures: surprisingness, novelty
  • Semantic measures: utility, actionability
  Geng and Hamilton, "Interestingness Measures for Data Mining: A Survey," ACM Computing Surveys, 38(3), September 2006.

  Contingency Table for Rule A -> B

         |   B    |   ¬B    |
    A    | n(AB)  | n(A¬B)  | n(A)
    ¬A   | n(¬AB) | n(¬A¬B) | n(¬A)
         | n(B)   | n(¬B)   | N

  • AB = the union of the two itemsets A and B
  • n(AB): the number of records satisfying A and B
  • N: the total number of records
  • Support: supp(A -> B) = P(AB) = n(AB) / N
  • Confidence: conf(A -> B) = P(B|A) = n(AB) / n(A)
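The support and confidence definitions above translate directly into code. This is a minimal sketch (the function names and the counts are ours, for illustration only):

```python
def support(n_ab, n_total):
    """supp(A -> B) = P(AB) = n(AB) / N."""
    return n_ab / n_total

def confidence(n_ab, n_a):
    """conf(A -> B) = P(B|A) = n(AB) / n(A)."""
    return n_ab / n_a

# hypothetical counts: 3 of 5 records satisfy both A and B, 4 satisfy A
print(support(3, 5), confidence(3, 4))  # 0.6 0.75
```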
- 10. Measures Involving Generality and Reliability
  • Generality: e.g., support P(AB), coverage P(A)
  • Reliability: e.g., confidence P(B|A), added value P(B|A) − P(B), lift P(B|A) / P(B)
  • Combination: e.g., Piatetsky-Shapiro's measure P(A)(P(B|A) − P(B)), Yao and Liu's two-way support P(AB) · log2( P(AB) / (P(A)P(B)) )

  Examples of Measures

    Milk | Bread | Eggs
      1  |   0   |  1
      1  |   1   |  0
      1  |   1   |  1
      1  |   1   |  1
      0  |   0   |  1

  Association rule: Milk -> Bread
  Confidence: P(B|A) = 0.75
  Added value: P(B|A) − P(B) = 0.75 − 0.6 = 0.15
  P-S: P(A)(P(B|A) − P(B)) = 0.8(0.75 − 0.6) = 0.12
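The worked example above can be reproduced from the five transactions. A small sketch (variable names are ours):

```python
# Transactions from the slide (columns: Milk, Bread, Eggs)
rows = [(1, 0, 1), (1, 1, 0), (1, 1, 1), (1, 1, 1), (0, 0, 1)]
N = len(rows)

# Rule: Milk -> Bread (A = Milk, B = Bread)
n_a = sum(r[0] for r in rows)           # milk appears in 4 transactions
n_b = sum(r[1] for r in rows)           # bread appears in 3 transactions
n_ab = sum(r[0] & r[1] for r in rows)   # both appear in 3 transactions

conf = n_ab / n_a                       # P(B|A) = 0.75
added_value = conf - n_b / N            # 0.75 - 0.6 = 0.15
ps = (n_a / N) * added_value            # 0.8 * 0.15 = 0.12
print(conf, round(added_value, 2), round(ps, 2))  # 0.75 0.15 0.12
```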
- 11. Limits of Support
  Support: supp(A -> B) = n_AB
  • Measures the generality of the rule
  • Minimum support threshold (e.g., 10%):
    - Reduces the number of rules
    - Loses nuggets (pruning everything below minsup)
  • Nugget:
    - Specific rule (low support)
    - General rule (high confidence)
    => High potential of novelty/surprise

  Limits of Confidence [Guillaume et al. 1998], [Lallich et al. 2004]
  Confidence: conf(A -> B) = P(B|A) = N_AB / N_A
  • Measures the validity/logical aspect of the rule (inclusion)
  • Minimum confidence threshold (e.g., 90%) reduces the number of extracted rules
  • Confidence does not capture validity: there is no detection of independence
  • Independence:
    - A and B are independent: P(B|A) = P(B)
    - If P(B) is high => a nonsense rule with high support
    E.g., couch -> beer (supp = 20%, conf = 90%) if supp(beer) = 90%
- 12. AR Quality: Limits of the Support-Confidence Pair
  In practice:
  - High support threshold (10%)
  - High confidence threshold (90%)
  - General and reliable rules
  => Cannot capture surprisingness or novelty
  => Efficient measures, but insufficient to capture quality

  Objective Interestingness Measures (1)
- 13. Objective Interestingness Measures (2)

  Objective Interestingness Measures (3)
- 14. Piatetsky-Shapiro's Properties
  Properties for an interestingness measure F:
  • P1. F = 0 if A and B are statistically independent, i.e., P(AB) = P(A)P(B).
  • P2. F monotonically increases with P(AB) when P(A) and P(B) remain the same.
  • P3. F monotonically decreases with P(A) (or P(B)) when P(AB) and P(B) (or P(A)) remain the same.
  Piatetsky-Shapiro, "Discovery, Analysis, and Presentation of Strong Rules," Knowledge Discovery in Databases, 1991.

  Properties of the Measures

    Name                | Formula                            | P1 | P2 | P3
    Support             | P(AB)                              | N  | Y  | N
    Confidence          | P(B|A)                             | N  | Y  | N
    Added Value         | P(B|A) − P(B)                      | Y  | Y  | Y
    Lift                | P(B|A) / P(B)                      | N  | Y  | Y
    Piatetsky-Shapiro's | P(A)(P(B|A) − P(B))                | Y  | Y  | Y
    Two-way support     | P(AB) · log2( P(AB) / (P(A)P(B)) ) | Y  | Y  | Y
- 15. Tan et al.'s Properties
  • O1. F should be symmetric under variable permutation.
  • O2. F should be the same when we scale any row or column by a positive factor.
  • O3. F should become −F if either the rows or the columns are permuted, i.e., swapping either the rows or the columns in the contingency table makes interestingness values change sign.
  • O4. F should remain the same if both the rows and the columns are permuted.
  • O5. F should have no relationship with the count of the records that do not contain A and B.
  Tan et al., "Selecting the Right Interestingness Measure for Association Patterns," KDD 2002.

  Example for Property O2
  • Odds ratio: F = P(AB) P(¬A¬B) / ( P(A¬B) P(¬AB) )

    Original table:               First row × 2:
         |  B | ¬B |                   |  B | ¬B |
    A    | 15 |  8 | 23           A    | 30 | 16 | 46
    ¬A   |  9 | 25 | 34           ¬A   |  9 | 25 | 34
         | 24 | 33 | 57                | 39 | 41 | 80

    F = (15 × 25) / (8 × 9) ≈ 5.2      F = (30 × 25) / (16 × 9) ≈ 5.2
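Property O2 for the odds ratio can be checked numerically; this sketch (helper name is ours) re-runs the slide's example and confirms that doubling the first row leaves the value unchanged:

```python
def odds_ratio(n_ab, n_anb, n_nab, n_nanb):
    """Odds ratio from contingency counts: n(AB) n(~A~B) / (n(A~B) n(~AB))."""
    return (n_ab * n_nanb) / (n_anb * n_nab)

original = odds_ratio(15, 8, 9, 25)   # ~5.21
scaled = odds_ratio(30, 16, 9, 25)    # first row multiplied by 2
print(original, scaled)
```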
- 16. Properties of the Measures

    Name                | O1 | O2 | O3 | O4 | O5
    Support             | Y  | N  | N  | N  | N
    Confidence          | N  | N  | N  | N  | N
    Added Value         | N  | N  | N  | N  | N
    Lift                | Y  | N  | N  | N  | N
    Piatetsky-Shapiro's | Y  | N  | Y  | Y  | N
    Two-way support     | Y  | N  | N  | N  | Y

  Lenca et al.'s Properties
  • Q1. F is constant if there is no counterexample to the rule.
  • Q2. F decreases with P(A¬B) in a linear, concave, or convex fashion around 0+.
  • Q3. F increases as the total number of records increases.
  • Q4. The threshold is easy to fix.
  • Q5. The semantics of the measure are easy to express.
  Lenca et al., "A multicriteria decision aid for interestingness measure selection," LUSSI-TR-2004-01-EN, GET/ENST, Bretagne, France, 2004.
- 17. Example for Property Q2
  • Lift: P(B|A) / P(B)
    F = Lift = n(AB) / ( n(A) × n(B) / N ) = 0.5 × (1 + 200 / (100 + x))

         |  B  |   ¬B    |
    A    | 100 |    x    | 100 + x
    ¬A   | 100 |   100   | 200
         | 200 | 100 + x | 300 + x

    Here x = n(A¬B); F is convex and decreasing in the vicinity of 0.

  Geng and Hamilton's Properties
  • S1: F is an increasing function of support, if the margins in the contingency table are fixed.
  • S2: F is an increasing function of confidence, if the margins in the contingency table are fixed.
  Geng and Hamilton, "Interestingness Measures for Data Mining: A Survey," ACM Computing Surveys, 38(3), September 2006.
- 18. An Example for S1
  • P-S: F = P(A)(P(B|A) − P(B))

         |   B   |      ¬B       |
    A    |   x   |     a − x     | a
    ¬A   | b − x | 1 − a − b + x | 1 − a
         |   b   |     1 − b     | 1

  • Assume x denotes the support P(AB).
  • F = a(x/a − b) = x − ab is an increasing function of x (a > 0, 0 ≤ x ≤ min(a, b)).

  An Example for S2
  • P-S: F = P(A)(P(B|A) − P(B)), with the same contingency table.
  • Assume c = x/a denotes the confidence P(B|A).
  • F = a(x/a − b) = ac − ab is an increasing function of c (a > 0).
- 19. Properties of the Measures

    Name                | Formula                            | S1 | S2
    Support             | P(AB)                              | Y  | Y
    Confidence          | P(B|A)                             | Y  | Y
    Added Value         | P(B|A) − P(B)                      | Y  | Y
    Lift                | P(B|A) / P(B)                      | Y  | N
    Piatetsky-Shapiro's | P(A)(P(B|A) − P(B))                | Y  | Y
    Two-way support     | P(AB) · log2( P(AB) / (P(A)P(B)) ) | Y  | N

  Subjective Interestingness Measures
  • Based on Bayes' rule, an interestingness measure can be defined as the relative difference of the prior and posterior probabilities, given the pattern [Silberschatz and Tuzhilin, 1995].
  • Interestingness measures can be defined as the distance between the rules and the user's specifications [Liu et al., 1997].
- 20. Subjective Interestingness Measures (cont'd)
  • Interestingness can be measured via interaction with the user. The user decides if a rule is interesting; the system then chooses the next rule to be presented according to the user's feedback [Sahar, 1999].
  • A user's beliefs can be represented in the same format as mined rules. No measure is defined; only surprising rules, i.e., rules that contradict existing beliefs, are mined [Padmanabhan and Tuzhilin, 1998].

  AR Quality: Subjective Measures - Silberschatz and Tuzhilin's Approach
  • Given evidence E (patterns), the degree of belief in α is updated with Bayes' rule as follows:

    P(α | E, ξ) = P(E | α, ξ) P(α | ξ) / [ P(E | α, ξ) P(α | ξ) + P(E | ¬α, ξ) P(¬α | ξ) ]

    where ξ is the context representing the previous evidence supporting α.
  • Then, the interestingness measure for a pattern p relative to a soft belief system B is defined as the relative difference of the prior and posterior probabilities:

    I(p, B) = Σ_{α ∈ B} |P(α | p, ξ) − P(α | ξ)| / P(α | ξ)
- 21. AR Quality: Subjective Measures - An Example

    Dataset D:  Milk | Bread | Eggs
                  1  |   0   |  1
                  1  |   1   |  0
                  1  |   1   |  1
                  1  |   1   |  1
                  0  |   0   |  1

  • ξ: the context is the data set D.
  • α: people buy milk, bread, and eggs together. P(α | ξ) = 2/5 = 0.4, P(¬α | ξ) = 0.6.
  • Suppose we get rule r: bread -> eggs, with support = 0.4 and confidence ≈ 0.67.
  • P(r | α, ξ) = 1 represents the confidence of rule r given belief α, i.e., the confidence of the rule bread -> eggs evaluated on transactions 3 and 4, where milk, bread, and eggs appear together.
  • Similarly, P(r | ¬α, ξ) = 0.
  • Finally, P(α | r, ξ) ≈ 0.67.

  AR Quality: Subjective Measures - Example (cont'd)
  • The degree of belief is:

    P(α | r, ξ) = P(r | α, ξ) P(α | ξ) / [ P(r | α, ξ) P(α | ξ) + P(r | ¬α, ξ) P(¬α | ξ) ]
                = 1(0.4) / (1(0.4) + 0(0.6)) = 0.4 / 0.4 = 1

  • The relative difference of the prior and posterior probabilities is:

    I(r, B) = Σ_{α ∈ B} |P(α | r, ξ) − P(α | ξ)| / P(α | ξ)
            = |0.67 − 0.4| / 0.4 = |0.27| / 0.4 ≈ 0.68
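The belief update and the relative-difference term can be written as two small helpers (a sketch; the function names are ours). The first call re-runs the worked Bayes update, which drives the posterior to 1; the second reproduces the slide's final I(r, B), which uses P(α | r, ξ) ≈ 0.67:

```python
def posterior(p_e_given_a, p_e_given_not_a, prior):
    """Bayes update of the degree of belief in alpha given evidence E."""
    num = p_e_given_a * prior
    return num / (num + p_e_given_not_a * (1 - prior))

def relative_difference(post, prior):
    """One term of Silberschatz and Tuzhilin's I(p, B)."""
    return abs(post - prior) / prior

print(posterior(1.0, 0.0, 0.4))        # 1.0, as in the worked update
print(relative_difference(0.67, 0.4))  # ~0.675, the slide's I(r, B) ~ 0.68
```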
- 22. AR Quality: Subjective Measures - AR-Schema
  [Diagram: association rules mined from the DB are post-processed against user knowledge (an ontology plus rule schemas and operators) to produce filtered rules.]

  AR Quality: Subjective Measures - Experiments
  Data:
  • Questionnaire database about client satisfaction concerning accommodation (Nantes Habitat)
  • 288 items (67 questions with 4 answers), 1500 transactions => 358,072 association rules
  • Example:
    - Item q1=1 => question q1 = "Is your district transport practical?" with the answer 1 = "poor".
    - AR: q2=1 ^ q3=1 ^ q47=1 ==> q70=1, with support = 15.2% and confidence = 85.9%.
  Ontology:
  • 130 concepts (113 primitive concepts, 17 restriction concepts), 7 levels, 2 data properties
  Rule schemas and operators:
  • RS1: <EntryHall ? CloseSurrounding>
  • RS2: <Stairwell ? EntryHall>
  • RS3: <CloseSurrounding ? EntryHall>  (operator P)
  • RS4: <EntryHall ? Stairwell>
  • RS5: <CommonAreas ? GarbageRoom>
  • RS6: <TechnicalMaintenance ? TechnicalMaintenance>
  • RS7: <DissatisfactionCalmDistrict>  (operator C)
  • RS8: <DissatisfactionPrice ? DissatisfactionCommonAreas>  (operator Up)
- 23. AR Quality: Subjective Measures - Results

    Nb | MICF | IRF | PRS | Rules          | C(R7) | Up(R8)
    1  |      |     |     | 358,072 (100%) | 1,008 | 1,399
    2  |  X   |     |     | 27,602 (7.7%)  |   361 |   462
    3  |      |  X  |     | 103,891 (29%)  |   162 |   401
    4  |      |     |  X  | 207,196 (57%)  |   472 |   238
    5  |  X   |  X  |     | 16,473 (4.6%)  |    77 |   187
    6  |  X   |     |  X  | 21,822 (7.7%)  |   231 |   154
    7  |      |  X  |  X  | 73,091 (20%)   |     3 |    24
    8  |  X   |  X  |  X  | 13,382 (3.7%)  |     3 |    11

  AR Quality: Subjective Measures - Other Subjective Measures
  • Projected Savings (the KEFIR system's interestingness) [Matheus & Piatetsky-Shapiro 1994]
  • Fuzzy Matching Interestingness Measure [Liu et al. 1996]
  • Misclassification Costs [Freitas 1999]
  • Vague Feelings (Fuzzy General Impressions) [Liu et al. 2000]
  • Anticipation [Roddick and Rice 2001]
  • Interestingness [Shekar & Natarajan 2001]
- 24. Classification of Subjective Interestingness Measures
  (each entry: measure; year; application; foundation; scope; subjective aspects; user's knowledge representation)
  1. Matheus and Piatetsky-Shapiro's Projected Savings; 1994; summaries; utilitarian; single rule; unexpectedness; pattern deviation
  2. Klemettinen et al.'s Rule Templates; 1994; association rules; syntactic; single rule; unexpectedness & actionability; rule templates
  3. Silberschatz and Tuzhilin's Interestingness; 1995; format independent; probabilistic; rule set; unexpectedness; hard & soft beliefs
  4. Liu et al.'s Fuzzy Matching Interestingness Measure; 1996; classification rules; syntactic distance; single rule; unexpectedness; fuzzy rules
  5. Imielinski et al.; 1996; association rules; syntactic; single rule; actionability; rule queries (M-SQL, an extension of SQL)
  6. Baralis and Psaila; 1997; association rules; syntactic; single rule; actionability; rule templates
  7. Liu et al.'s General Impressions; 1997; classification rules; syntactic; single rule; unexpectedness; GI, RPK
  8. Padmanabhan and Tuzhilin; 1997; association rules; logical, statistical; single rule; unexpectedness; beliefs X -> Y, logical contradiction
  9. Ng et al.; 1998; association rules; syntactic; single rule; actionability; constrained association queries
  10. Freitas' Attribute Costs; 1999; association rules; utilitarian; single rule; actionability; cost values
  11. Freitas' Misclassification Costs; 1999; association rules; utilitarian; single rule; actionability; cost values
  12. Liu et al.'s Vague Feelings (Fuzzy General Impressions); 2000; generalized association rules; syntactic; single rule; unexpectedness; GI, RPK, PK
  13. Adomavicius and Tuzhilin; 2001; association rules in user profiling; similarity-based rule grouping, template-based rule filtering, iterative process; single rule; actionability; rule validation operators, taxonomies, templates
  14. Roddick and Rice's Anticipation; 2001; format independent; probabilistic; single rule; temporal dimension; probability graph
  15. Shekar and Natarajan's Interestingness; 2002; association rules; distance; single rule; unexpectedness; fuzzy-graph-based taxonomy
  16. An et al.; 2003; groups/summaries; semantic information; single rule; actionability; semantic networks / taxonomies
  17. Wang et al.; 2003; association rules; unexpectedness support/confidence; single rule; unexpectedness; covering knowledge
  18. Domingues and Rezende; 2005; generalized association rules; generalizing items; single rule; actionability; taxonomies
  19. Antunes; 2007; association rules; logical, syntactic; single rule; actionability; ontologies and constraints
  20. Zhou and Geller's Raising; 2007; generalized association rules; raising by replacing items with generalized ones with higher support; single rule; actionability; ontologies
  21. Garcia and Vivacqua; 2008; association rules; semantic distance & relevance assessment; single rule; actionability; ontologies, item weight
- 25. Utility-Based Measures
  • Interestingness of a pattern = probability + utility [Shen et al., 2002].
  • Data models:
    - Weights on columns: e.g., the price of each item.
    - Weights on rows: e.g., the price of each transaction.
    - Weights on both columns and rows.
    - Weights on cells and columns: e.g., the number of occurrences of the item in a transaction and the price of each item.

  List of Utility-Based Measures

    Measure                     | Data model                                                          | Extension of
    Weighted support            | Weights for items                                                   | Support
    Normalized weighted support | Weights for items                                                   | Support
    Vertical weighted support   | Weights for transactions                                            | Support
    Mixed weighted support      | Weights for both items and transactions                             | Support
    OOA                         | Target and non-target attributes; weights for target attributes     | Support
    Marketshare                 | Weight for each transaction, stored in attribute P in the data set  | Confidence
    Count support               | Weights for items and cells in the data set                         | Support
    Amount support              | Weights for items and cells in the data set                         | Support
    Count confidence            | Weights for items and cells in the data set                         | Confidence
    Amount confidence           | Weights for items and cells in the data set                         | Confidence
    Yao et al.'s                | Weights for items and cells in the data set                         | Support
- 26. Utility-Based Measures
  • All known utility-based measures are extensions of the support and confidence measures, i.e., if the weights are set appropriately, the measures become either confidence or support.
  • No single utility measure is suitable for every application, because applications have different objectives and data models.

  Utility-Based Itemset Mining
  • Utility constraint: a constraint of the form u(S) ≥ minutil, where u(S) is the utility value of itemset S and minutil is a threshold defined by the user.
  • Utility-based itemset mining problem: find H = { S | S ⊆ I, u(S) ≥ minutil }.
- 27. Utility Function
  • The semantics of profit = quantity × unit profit is captured by f(x, y) = x × y.
  • To achieve a user's goal, two types of utilities for items may need to be identified:
    - The transaction utility of an item is directly obtained from the information stored in the transaction dataset.
    - The external utility of an item is given by the user.

  Utility Value of an Itemset

    u(S) = Σ_{ip ∈ S} Σ_{tq ∈ T_S} f(x_pq, y_p)     (1)

  where:
  - u(S): utility value of an itemset S
  - ip: an item in the itemset S
  - tq: a transaction (in T_S) including S
  - f(x_pq, y_p): utility function
  - x_pq: transaction utility of transaction tq on item ip
  - y_p: external utility of item ip
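Equation (1) can be evaluated directly over a toy dataset. In this sketch (the data, items, and prices are hypothetical), each transaction maps an item to the quantity purchased (the transaction utility x_pq), and the unit profits play the role of external utilities y_p:

```python
# Hypothetical transactions: item -> quantity purchased (transaction utility x_pq)
transactions = [
    {"milk": 2, "bread": 1},
    {"milk": 1, "eggs": 6},
    {"milk": 3, "bread": 2, "eggs": 12},
]
unit_profit = {"milk": 0.5, "bread": 0.8, "eggs": 0.1}  # external utility y_p

def utility(itemset, transactions, unit_profit, f=lambda x, y: x * y):
    """u(S) from equation (1): sum f(x_pq, y_p) over items of S
    and transactions containing all of S."""
    total = 0.0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(f(t[item], unit_profit[item]) for item in itemset)
    return total

print(round(utility({"milk", "bread"}, transactions, unit_profit), 2))  # 4.9
```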
- 28. Utility Upper Bound of an Itemset
  Theorem (Utility Upper Bound Property). Let u(S^k) be the utility value of a k-itemset S^k and let u(S^(k−1)) be the utility value of a (k−1)-itemset S^(k−1). If f(x_pq, y_p) ≥ 0, then the following property holds:

    u(S^k) ≤ (1 / (k − 1)) Σ_{S^(k−1) ⊂ S^k, S^(k−1) ∈ L^(k−1)} u(S^(k−1))

  Yao and Hamilton, "Mining Itemset Utilities from Databases," Data and Knowledge Engineering, 59(3), 2006.

  Pruning Strategy
  • Pruning Strategy 1: let b(S^k) be the utility upper bound of S^k. If b(S^k) < minutil, then itemset S^k is a low-utility itemset.
- 29. The UMining Algorithm
  Algorithm UMining(T, f, minutil, K)
  Input: transaction database T, utility function f, utility value threshold minutil, maximum size of itemset K.
  Output: a set of high utility itemsets H.
  1.  I = Scan(T);
  2.  C1 = I;
  3.  k = 1;
  4.  Ck = CalculateAndStore(Ck, T, f);
  5.  H = Discover(Ck, minutil);
  6.  while (|Ck| > 0 and k ≤ K)
  7.  {
  8.    k = k + 1;
  9.    Ck = Generate(Ck-1, I);
  10.   Ck = Prune(Ck, Ck-1, minutil);  // Strategy 1 is incorporated
  11.   Ck = CalculateAndStore(Ck, T, f);
  12.   H = H ∪ Discover(Ck, minutil);
  13. }
  14. return H;

  Experimental Results on Synthetic Data
  IBM synthetic data, 9 million records:

    Minutil α | UMining # of HUI | # of Trans. | Time     | UMining_H # of HUI | # of Trans. | Time
    10.00%    | 21               | 6,604,369   | 43 min.  | 21                 | 6,604,369   | 43 min.
    5.00%     | 21               | 6,604,369   | 43 min.  | 21                 | 6,604,369   | 43 min.
    2.00%     | 21               | 6,604,369   | 43 min.  | 21                 | 6,604,369   | 43 min.
    1.00%     | 39               | 11,160,188  | 70 min.  | 39                 | 9,486,180   | 62 min.
    0.50%     | 394              | 18,845,636  | 125 min. | 381                | 13,568,451  | 91 min.
    0.25%     | 3,371            | 26,136,835  | 152 min. | 3,168              | 23,523,151  | 134 min.
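A simplified, Apriori-style sketch of this level-wise search (ours, not the published UMining code) can make the structure concrete. It assumes f(x, y) = x · y, represents transactions as item-to-quantity dictionaries, and applies Pruning Strategy 1 using the upper bound from the previous slide:

```python
from itertools import combinations

def high_utility_itemsets(transactions, unit_profit, minutil, max_k=4):
    """Level-wise high-utility itemset search in the spirit of UMining (a sketch).
    A k-candidate is pruned when the bound (1/(k-1)) * sum of the utilities of
    its (k-1)-subsets falls below minutil (Pruning Strategy 1)."""
    items = sorted({i for t in transactions for i in t})

    def u(itemset):
        # u(S): sum of quantity * unit profit over transactions containing S
        return sum(t[i] * unit_profit[i]
                   for t in transactions if all(j in t for j in itemset)
                   for i in itemset)

    utilities = {frozenset([i]): u([i]) for i in items}
    high = {s for s, val in utilities.items() if val >= minutil}
    for k in range(2, max_k + 1):
        # generate k-candidates whose (k-1)-subsets all survived the previous level
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(s) in utilities
                             for s in combinations(c, k - 1))]
        # Pruning Strategy 1: drop candidates whose utility upper bound is too low
        survivors = [c for c in candidates
                     if sum(utilities[frozenset(s)]
                            for s in combinations(sorted(c), k - 1)) / (k - 1)
                     >= minutil]
        level = {c: u(c) for c in survivors}  # CalculateAndStore
        utilities.update(level)
        high |= {c for c, val in level.items() if val >= minutil}  # Discover
        if not level:
            break
    return high
```

Unlike support, utility is not anti-monotone, which is exactly why the bound (rather than the utility itself) is used for pruning.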
- 30. Experimental Results on a Customer Database
  • Customer database: 8 million transactions, representing the purchase of 2,238 unique items by about 428,665 customers.

    Threshold α   | 10% | 5%  | 2%  | 1%  | 0.5%  | 0.25%
    # of HUI (H)  | 3   | 22  | 88  | 337 | 1,802 | 5,694
    # of HUI (H') | 3   | 22  | 87  | 331 | 1,757 | 5,563
    Time (H)      | 28m | 42m | 65m | 87m | 2.7h  | 10.5h
    Time (H')     | 28m | 39m | 58m | 76m | 2.4h  | 8.4h

  Categorization of Interestingness Measures for Summaries
  • Objective measures: diversity, conciseness, peculiarity
  • Subjective measures: surprisingness
- 31. OLAP and Summaries
  [Diagram: data sources are integrated into a data warehouse; the user selects data and explores the space of the data cube by rolling up (e.g., from (Day, City) toward (Decade, Country)) and drilling down.]

    Decade | Country | Sales
    1980s  | Canada  | 1.2M
    1980s  | USA     | 11.3M
    1990s  | Canada  | 2.1M
    1990s  | USA     | 18.7M

  Generalization Space Presented to the User
  [Figure: a generalization space of login summaries at different granularities: ANY (1 category, 9,274 logins); weekday or weekend; minutes (mm); day of week; hour of day (hh); also day# of year (1-366) and YYYYMMDD; YYYYMMDDhh (166 categories, average of 55.86 logins per category); YYYYMMDDhhmm (4,853 categories, average of 1.91 logins per category).]
  "Efficient Spatio-Temporal Data Mining with GenSpace Graphs," Hamilton et al., Journal of Applied Logic, 4(2):192-214, 2006.
- 32. HMI Set
  • HMI: a set of heuristic measures of interestingness
  • 16 measures: IVariance, ISimpson, IShannon, ITotal, IMax, IMcIntosh, ILorenz, IGini, IBerger, ISchutz, IBray, IWhittaker, IKullback, IMacArthur, ITheil, IAtkinson
  Heuristic Measures of Interestingness, Hilderman and Hamilton, PKDD'99.

  Diversity Measures for Summaries
  • Variance: Σ_{i=1}^{m} (p_i − q)² / (m − 1)
  • Simpson: Σ_{i=1}^{m} p_i²
  • Shannon: −Σ_{i=1}^{m} p_i log2 p_i
  • Schutz: Σ_{i=1}^{m} |p_i − q| / (2mq)
  • Bray: Σ_{i=1}^{m} min(p_i, q)
  • p_i denotes the probability for class i; q denotes the average probability for all classes.
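These five diversity measures are easy to implement from their definitions; a minimal sketch (function names are ours) over a class-probability vector:

```python
import math

def variance(p):
    """Variance diversity: sum (p_i - q)^2 / (m - 1), with q = 1/m."""
    m, q = len(p), 1 / len(p)
    return sum((pi - q) ** 2 for pi in p) / (m - 1)

def simpson(p):
    """Simpson: sum p_i^2 (higher = more concentrated)."""
    return sum(pi ** 2 for pi in p)

def shannon(p):
    """Shannon: -sum p_i log2 p_i (0 log 0 taken as 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def schutz(p):
    """Schutz: sum |p_i - q| / (2 m q)."""
    m, q = len(p), 1 / len(p)
    return sum(abs(pi - q) for pi in p) / (2 * m * q)

def bray(p):
    """Bray: sum min(p_i, q)."""
    q = 1 / len(p)
    return sum(min(pi, q) for pi in p)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(simpson(uniform), simpson(skewed))  # concentration rises (~0.25 -> ~0.52) as diversity falls
```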
- 33. Example of Variance

    Program       | Nationality | # of Students | Uniform Distribution
    Graduate      | Canadian    | 15            | 75
    Graduate      | Foreign     | 25            | 75
    Undergraduate | Canadian    | 200           | 75
    Undergraduate | Foreign     | 60            | 75

    Σ_{i=1}^{m} (p_i − q)² / (m − 1)
      = [ (15/300 − 75/300)² + (25/300 − 75/300)² + (200/300 − 75/300)² + (60/300 − 75/300)² ] / (4 − 1)
      ≈ 0.24 / 3 ≈ 0.08

  Diversity in Summaries
  • Within the context of ranking the interestingness of a summary, the number of classes is simply the number of tuples in the summary; the proportional distribution is simply the observed probability distribution of the classes based on the values in the derived Count attribute.
    E.g., (16, 14, 14, 10, 9, 8, 7, 6, 6, 5, 4, 3, 2, 2, 2, 2, 1) has distribution (.14, .13, .13, .09, .08, .07, .06, .05, .05, .05, .04, .03, .02, .02, .02, .02, .01).
  • In a typical diversity measure, the two components are combined to characterize the variability of a population by a single value.
- 34. Concentration (ISimpson): Simpson's Measure
  ISimpson = Σ_{i=1}^{m} p_i²
  [Pie charts of the two label distributions]
  • (600, 200, 100, 100): (6/10)² + (1/5)² + 2(1/10)² = .36 + .04 + .02 = .42
  • (400, 200, 200, 200): (4/10)² + 3(1/5)² = .16 + 3(.04) = .28

  Variance (IVariance)
  IVariance = Σ_{i=1}^{m} (p_i − q)²  (sum of squared deviations, without the 1/(m−1) factor)
  • (600, 200, 100, 100): (6/10 − 1/4)² + (1/5 − 1/4)² + 2(1/10 − 1/4)² = (7/20)² + (1/20)² + 2(3/20)² = 0.17
  • (400, 200, 200, 200): (4/10 − 1/4)² + 3(1/5 − 1/4)² = (3/20)² + 3(1/20)² = 12/400 = 0.03
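The pie-chart examples above can be checked numerically from the raw counts. A small sketch (helper names are ours; the variance-style helper sums squared deviations without the 1/(m−1) factor, as these examples do):

```python
def simpson_counts(counts):
    """Simpson's measure from raw tuple counts."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

def squared_dev_sum(counts):
    """Sum of squared deviations from the uniform proportion 1/m."""
    n, m = sum(counts), len(counts)
    return sum((c / n - 1 / m) ** 2 for c in counts)

print(round(simpson_counts([600, 200, 100, 100]), 2))   # 0.42
print(round(simpson_counts([400, 200, 200, 200]), 2))   # 0.28
print(round(squared_dev_sum([600, 200, 100, 100]), 2))  # 0.17
print(round(squared_dev_sum([400, 200, 200, 200]), 2))  # 0.03
```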
- 35. Maximum Concentration
  • (997, 1, 1, 1):
    Variance = (.997 − 1/4)² + 3(.001 − 1/4)² = (.747)² + 3(.249)² ≈ .744
    Simpson's = (.997)² + 3(.001)² ≈ .994

  Principles for Summary Diversity Measures
  • Five principles that a measure of interestingness based on diversity should follow.
  • If these principles are accepted, our results follow mathematically.
  • If a measure does not satisfy these principles, one can still use it, but one should not expect it to satisfy these principles.
  • P1. Minimum value
  • P2. Maximum value
  • P3. Skewness
  • P4. Permutation invariance
  • P5. Transfer
  Hilderman and Hamilton, "Knowledge Discovery and Measures of Interest," Kluwer Academic, 2001.
- 36. Hilderman and Hamilton's Principles
  • P1. Minimum Value Principle. Given a vector (n1, …, nm), where ni = nj for all i, j, measure f(n1, …, nm) attains its minimum value.
  • P2. Maximum Value Principle. Given a vector (n1, …, nm), where n1 = N − m + 1, ni = 1, i = 2, …, m, and N > m, f(n1, …, nm) attains its maximum value.
  • P3. Skewness Principle. Given a vector (n1, …, nm), where n1 = N − m + 1, ni = 1, i = 2, …, m, and N > m, and a vector (n1 − c, n2, …, nm, nm+1, …, nm+c), where n1 − c > 1, ni = 1, i = 2, …, m + c, then f(n1, …, nm) > f(n1 − c, n2, …, nm+c).
  • P4. Permutation Invariance Principle. Given a vector (n1, …, nm) and any permutation (i1, …, im) of (1, …, m), f(n1, …, nm) = f(n_{i1}, …, n_{im}).
  • P5. Transfer Principle. Given a vector (n1, …, nm) and 0 < c < nj < ni, f(n1, …, ni + c, …, nj − c, …, nm) > f(n1, …, ni, …, nj, …, nm).
  Hilderman and Hamilton, "Knowledge Discovery and Measures of Interest," Kluwer Academic, 2001.

  Minimum Value Principle (P1)
  Given a vector (n1, …, nm), where ni = nj, i ≠ j, for all i, j, f(n1, …, nm) attains its minimum value.
  • Minimum interestingness should be attained when the tuple counts are equal.
  • Given the vectors (2, 2), (50, 50, 50), and (1000, 1000, 1000, 1000), the index value generated by f should be the minimum possible for the respective values of N and m.
  • Example: N = 150 and m = 3: (50, 50, 50).
- 37. Maximum Value Principle (P2)
  Given a vector (n1, …, nm), where n1 = N − m + 1, ni = 1, i = 2, …, m, and N > m, f(n1, …, nm) attains its maximum value.
  • Maximum interestingness should be attained when the tuple counts are distributed as unevenly as possible.
  • Sample values of N and m and the vectors for which f gives the maximum index value:
    - N = 4, m = 2: (3, 1)
    - N = 150, m = 3: (148, 1, 1)
    - N = 4000, m = 4: (3997, 1, 1, 1)

  Skewness Principle (P3)
  Given a vector (n1, …, nm), where n1 = N − m + 1, ni = 1, i = 2, …, m, and N > m, and a vector (n1 − c, n2, …, nm, nm+1, …, nm+c), where n1 − c > 1 and ni = 1, i = 2, …, m + c, then f(n1, …, nm) > f(n1 − c, n2, …, nm, nm+1, …, nm+c).
  • A summary containing m tuples whose counts are distributed as unevenly as possible is more interesting than a summary containing m + c tuples whose counts are also distributed as unevenly as possible.
  • For example, with N = 1000, m = 2, c = 2, the vectors are (999, 1) and (997, 1, 1, 1), and we require that f(999, 1) > f(997, 1, 1, 1).
- 38. [Figures: histograms of the number of index values per index value for IVariance ("Distribution for IVariance") and ISchutz ("Distribution for ISchutz"). One distribution shows low skewness (nearly symmetrical) with negative kurtosis (less peaked); the other shows positive skewness (asymmetric, left) with positive kurtosis (more sharply peaked).]
- 39. Permutation Invariance Principle (P4)
• Given a vector (n1, …, nm) and any permutation (i1, …, im) of (1, …, m), f(n1, …, nm) = f(ni1, …, nim).
• Every permutation of a given distribution of tuple counts should be equally interesting.
• Interestingness is not a labeled property.
• Given the vector (6, 4, 2), we require that f(2, 4, 6) = f(2, 6, 4) = f(4, 2, 6) = f(4, 6, 2) = f(6, 2, 4) = f(6, 4, 2).

Transfer Principle (P5)
• Given a vector (n1, …, nm) and 0 < c < nj < ni, f(n1, …, ni + c, …, nj − c, …, nm) > f(n1, …, ni, …, nj, …, nm).
• When a strictly positive transfer is made from the count of one tuple to the count of another tuple with a greater count, interestingness increases.
• For example, given the vectors (10, 7, 5, 4) and (10, 9, 5, 2), we require that f(10, 9, 5, 2) > f(10, 7, 5, 4).
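Principles P2 through P5 can be checked numerically for a concrete measure. A sketch using a variance-of-proportions measure (an illustration in the spirit of IVariance, which the next slide reports as satisfying all five principles) with the example vectors from the slides:

```python
from itertools import permutations

def i_variance(counts):
    """Variance of observed proportions around the uniform expectation 1/m."""
    m, n = len(counts), sum(counts)
    q = 1.0 / m
    return sum((c / n - q) ** 2 for c in counts) / (m - 1)

def compositions(total, parts):
    """All vectors of `parts` positive integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

# P2: for N = 10, m = 3, the maximum is attained at (N - m + 1, 1, 1) = (8, 1, 1)
best = max(compositions(10, 3), key=i_variance)
assert sorted(best, reverse=True) == [8, 1, 1]

# P3: f(999, 1) > f(997, 1, 1, 1)   (N = 1000, m = 2, c = 2)
assert i_variance((999, 1)) > i_variance((997, 1, 1, 1))

# P4: every permutation of (6, 4, 2) scores the same
assert len({round(i_variance(p), 12) for p in permutations((6, 4, 2))}) == 1

# P5: transferring c = 2 from the count 4 to the larger count 7 increases f
assert i_variance((10, 9, 5, 2)) > i_variance((10, 7, 5, 4))
```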
- 40. Measures Satisfying the Five Principles
• Satisfy all five principles (P1–P5), supported by mathematical proof: IVariance, ISimpson, IShannon, IMcIntosh.
• Satisfy four of the five: IGini, IBerger, IMacArthur, IAtkinson.
• Satisfy three of the five: ILorenz, ISchutz, IBray, IWhittaker.
• Satisfy two of the five: ITheil.
(Principles were checked on a sample database; proofs for some measures were obtained with MAPLE.)

Subjective Measures for Summaries
• Variance: Σ_{i=1..m} (pi − ei)² / (m − 1)
• Schutz: Σ_{i=1..m} |pi − ei| / (2m q̄), where q̄ is the mean expected probability
• Bray: Σ_{i=1..m} min(pi, ei)
pi denotes the observed probability for class i; ei denotes the expected probability for class i.
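The three summary measures above can be sketched directly from their definitions (variable names are mine; the Schutz denominator 2m·q̄, with q̄ the mean expected probability, is reconstructed from the slide's flattened layout):

```python
def variance(p, e):
    """Squared deviations of observed from expected probabilities, over m - 1."""
    m = len(p)
    return sum((pi - ei) ** 2 for pi, ei in zip(p, e)) / (m - 1)

def schutz(p, e):
    """Absolute deviations, scaled by 2 * m * mean expected probability."""
    m = len(p)
    q_bar = sum(e) / m
    return sum(abs(pi - ei) for pi, ei in zip(p, e)) / (2 * m * q_bar)

def bray(p, e):
    """Overlap between the observed and expected distributions."""
    return sum(min(pi, ei) for pi, ei in zip(p, e))

# Example: a skewed observed distribution vs. a uniform expectation, m = 4
p = [0.7, 0.1, 0.1, 0.1]
e = [0.25] * 4
print(variance(p, e), schutz(p, e), bray(p, e))  # ≈ 0.09, 0.45, 0.55
```

A perfectly uniform observation (p = e) gives variance 0, Schutz 0, and Bray 1, the "least interesting" extreme under these measures.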
- 41. Outline
5. Semantic Classification of Interestingness Measures
6. What Could Be a Good Measure?
7. Some Objective Measures
8. New/Recent Measures
9. Comparison by Simulation
10. Tools
11. Conclusions

Semantic Classification of Interestingness Measures
- 42. Objective Interestingness Measures
An objective measure of a rule x → y is a function of the contingency counts: I(x → y) = f(n, nx, ny, nxy).
[Figure: Venn diagram of the extensions E(x) and E(y), showing nx, ny, the examples nxy, and the counter-examples nx¬y; the rule corresponds to the inclusion of E(x) in E(y).]
• Equilibrium: nxy = nx¬y = nx / 2
• Independence: nxy = nx ny / n

Semantic Classification [Huynh et al. 2007], [Blanchard et al. 2009]
A measure i(x → y) = f(n, nx, ny, nxy) is classified along three criteria:
• Object: dependence (variation from independence) or inclusion (variation from equilibrium)
• Range: simple rule, quasi-implication, quasi-conjunction, or quasi-equivalence
• Nature: statistical or descriptive
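The two reference situations above can be computed directly from a contingency table. A minimal sketch (the counts are hypothetical illustrations, not taken from the slides):

```python
# Hypothetical contingency counts for a rule x -> y over n transactions
n = 1000             # total number of transactions
nx, ny = 400, 600    # |E(x)| and |E(y)|
nxy = 300            # examples: transactions containing both x and y
nx_not_y = nx - nxy  # counter-examples n_x¬y, here 100

# Equilibrium: examples and counter-examples of the rule are equiprobable
equilibrium_nxy = nx / 2          # here 200.0

# Independence: x and y co-occur exactly as often as chance predicts
independence_nxy = nx * ny / n    # here 240.0

print(nxy, equilibrium_nxy, independence_nxy)  # 300 200.0 240.0
```

With nxy = 300 above both reference values, the rule deviates toward inclusion, which is what inclusion-oriented and dependence-oriented measures quantify, each from its own reference point.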
- 43. Semantic Classification
The Object of i(x → y):
• Dependence: some measures take a fixed value at independence, P(x and y) = P(x)P(y), i.e. nxy = nx ny / n. They evaluate a variation from independence.
• Inclusion: some measures take a fixed value at equilibrium, P(x and y) = P(x)/2, i.e. nxy = nx / 2 (equiprobability of examples and counter-examples, maximum uncertainty). They evaluate a variation from equilibrium.
• Other measures take a fixed value neither at independence nor at equilibrium.

The Range of i(x → y):
• Simple rule: a directed measure of the rule x → y, so i(x → y) ≠ i(y → x).
Some measures evaluate more than a simple rule:
• Quasi-implication: they relate simultaneously to the rule x → y and its contrapositive ¬y → ¬x: I(x → y) = I(¬y → ¬x).
• Quasi-conjunction: they relate simultaneously to the rule x → y and its reciprocal y → x: I(x → y) = I(y → x).
• Quasi-equivalence: they relate simultaneously to all three: I(x → y) = I(y → x) = I(¬y → ¬x).
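These symmetry classes can be checked numerically. A sketch with hypothetical counts, using three standard measures: confidence (a directed simple-rule measure), lift (symmetric in x and y, i.e. quasi-conjunction) and conviction (invariant under contraposition, i.e. quasi-implication):

```python
def confidence(n, nx, ny, nxy):
    return nxy / nx

def lift(n, nx, ny, nxy):
    return n * nxy / (nx * ny)

def conviction(n, nx, ny, nxy):
    # P(x) P(¬y) / P(x and ¬y), written with counts
    return nx * (n - ny) / (n * (nx - nxy))

# Hypothetical contingency counts for x -> y
n, nx, ny, nxy = 1000, 400, 600, 300

# Counts seen from the reciprocal rule y -> x (n_yx = n_xy) ...
recip = (n, ny, nx, nxy)
# ... and from the contrapositive rule ¬y -> ¬x
contra = (n, n - ny, n - nx, n - nx - ny + nxy)

assert confidence(n, nx, ny, nxy) != confidence(*recip)               # directed
assert lift(n, nx, ny, nxy) == lift(*recip)                           # quasi-conjunction
assert abs(conviction(n, nx, ny, nxy) - conviction(*contra)) < 1e-12  # quasi-implication
```

Here confidence(x → y) = 0.75 while confidence(y → x) = 0.5, whereas lift gives 1.25 and conviction gives 1.6 for both members of their respective pairs.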
- 44. Semantic Classification
The Nature of i(x → y):
• Statistical: the measure varies with the cardinality n of the data, i(x → y) = f(n, nx, ny, nxy).
• Descriptive: the measure is invariant with n and depends only on proportions, i(x → y) = f(p(x), p(y), p(x and y)).

[Table: classification of measures. Rows give the Object; columns give the Range: Rule (x → y, not y → x), Quasi-Implication (x → y and ¬y → ¬x), Quasi-Conjunction (x → y and y → x), Quasi-Equivalence (both).]
• Inclusion (equilibrium): Confidence (precision), Sebag & Schoenauer, Examples & counter-examples ratio, Ganascia index, Descriptive confirm-confidence, Least contradiction, Laplace, Bayes factor (odds multiplier), Loevinger (certainty factor).
• Dependence (independence): Lift (interest factor), Correlation coefficient, Conviction, Information gain (log lift), Leverage (novelty), Zhang, Implication index, Pavillon (added value, rate of connection), Intensity of implication, Lerman index, Directed contribution to χ², Collective strength, Likelihood linkage, Klosgen, J-measure, Kappa, p-value of χ², Yule's Q and Y, Odds ratio, Mutual information, Gini, Rule interest.
• Other: Causal confidence, Support (Russel-Rao), Causal support, Sokal-Michener, Jaccard, Dice, Rogers-Tanimoto, Cosine (Ochiai), Kulczynski.
Legend: each measure is marked as descriptive or statistical; * marks p-value test based (probabilistic) measures.
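The statistical/descriptive split can be illustrated by scaling the data: doubling every count leaves a descriptive measure such as lift unchanged, while a statistical measure such as the implication index changes with n. A sketch (counts are hypothetical; the implication index is written here, as one common formulation, as the standardized deviation of the counter-example count from its value under independence):

```python
import math

def lift(n, nx, ny, nxy):
    """Descriptive: depends only on p(x), p(y), p(x and y)."""
    return n * nxy / (nx * ny)

def implication_index(n, nx, ny, nxy):
    """Statistical: standardized deviation of the counter-example count
    n_x¬y from its expected value nx * n_¬y / n under independence."""
    expected = nx * (n - ny) / n
    return (nx - nxy - expected) / math.sqrt(expected)

base = (1000, 400, 600, 300)
doubled = tuple(2 * v for v in base)  # same proportions, twice the data

assert lift(*base) == lift(*doubled)                            # invariant in n
assert implication_index(*base) != implication_index(*doubled)  # varies with n
```

With twice the data the implication index moves further from zero, reflecting the statistical intuition that the same proportions are more significant in a larger data set.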
