
Data Mining: Concepts and Techniques

Slides for Textbook — Chapter 6
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab, School of Computing Science,
Simon Fraser University, Canada (http://www.cs.sfu.ca)

Chapter 6: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouse
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples. Rule form: "Body → Head [support, confidence]"
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]

Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * ⇒ Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  - Home Electronics ⇒ * (what other products the store should stock up on)
  - Attached mailing in direct marketing
  - Detecting "ping-pong"ing of patients, faulty "collisions"

Rule Measures: Support and Confidence
- Find all rules X & Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains {X ∪ Y ∪ Z}
  - confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z

  Transaction ID | Items Bought
  2000           | A, B, C
  1000           | A, C
  4000           | A, D
  5000           | B, E, F

- With minimum support 50% and minimum confidence 50%, we have
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)
  (see the support/confidence sketch below)

Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-dimension vs. multiple-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions
  - Correlation, causality analysis (association does not necessarily imply correlation or causality)
  - Maxpatterns and closed itemsets
  - Constraints enforced: e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
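The support and confidence numbers above are easy to check mechanically. The following is a minimal Python sketch (not from the slides; the transaction list mirrors the four-transaction example) that reproduces the A ⇒ C and C ⇒ A figures:

    transactions = [
        {"A", "B", "C"},   # TID 2000
        {"A", "C"},        # TID 1000
        {"A", "D"},        # TID 4000
        {"B", "E", "F"},   # TID 5000
    ]

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    def confidence(lhs, rhs):
        """support(lhs ∪ rhs) / support(lhs)."""
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "C"}))       # 0.5      -> A => C has 50% support
    print(confidence({"A"}, {"C"}))  # 0.666... -> 66.6% confidence
    print(confidence({"C"}, {"A"}))  # 1.0      -> C => A has 100% confidence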
Chapter 6 outline (repeated): next, mining single-dimensional Boolean association rules from transactional databases.

Mining Association Rules — An Example

  Transaction ID | Items Bought       Min. support 50%
  2000           | A, B, C            Min. confidence 50%
  1000           | A, C
  4000           | A, D
  5000           | B, E, F

  Frequent Itemset | Support
  {A}              | 75%
  {B}              | 50%
  {C}              | 50%
  {A, C}           | 50%

- For rule A ⇒ C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules

The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

The Apriori Algorithm — Example (minimum support count 2)

  Database D:
  TID | Items
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

  Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
  L1: {1}:2, {2}:3, {3}:3, {5}:3
  Candidates C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
  Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
  L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
  Candidates C3: {2 3 5}
  Scan D → L3: {2 3 5}:2

How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning (see the Python sketch below)

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
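The SQL-style join and the pruning loop translate directly into code. Below is a short Python sketch of apriori_gen (an assumed helper name, not from the slides), treating itemsets as sorted tuples; run on the slide's L3 it yields C4 = {abcd} and prunes acde:

    from itertools import combinations

    def apriori_gen(L_prev):
        """Generate C_k from L_{k-1}: self-join, then prune."""
        L_prev = set(L_prev)
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                # Join: first k-2 items equal, p's last item < q's last item
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune: every (k-1)-subset of c must be in L_{k-1}
                    if all(s in L_prev for s in combinations(c, len(c) - 1)):
                        candidates.add(c)
        return candidates

    # Example from the slide: L3 = {abc, abd, acd, ace, bcd}
    L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
          ("a", "c", "e"), ("b", "c", "d")}
    print(apriori_gen(L3))  # {('a','b','c','d')} — acde pruned since ade not in L3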
How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}

Is Apriori Fast Enough? — Performance Bottlenecks
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets:
    - 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    - To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern

Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mining on a subset of the given data with a lower support threshold, plus a method to determine the completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - highly condensed, but complete for frequent pattern mining
  - avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones
  - Avoid candidate generation: sub-database test only!

Construct FP-tree from a Transaction DB (min_support = 0.5; see the construction sketch below)

  TID | Items bought               | (ordered) frequent items
  100 | {f, a, c, d, g, i, m, p}   | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o}      | {f, c, a, b, m}
  300 | {b, f, h, j, o}            | {f, b}
  400 | {b, c, k, s, p}            | {c, b, p}
  500 | {a, f, c, e, l, p, m, n}   | {f, c, a, m, p}

Steps:
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Order frequent items in frequency-descending order
3. Scan DB again, construct the FP-tree

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3.
[Figure: the resulting FP-tree, rooted at {}, with paths f:4–c:3–a:3–m:2–p:2, f:4–c:3–a:3–b:1–m:1, f:4–b:1, and c:1–b:1–p:1, plus node-links from the header table.]
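As an illustration of steps 1–3, here is a compact Python sketch of FP-tree construction under the stated min_support of 0.5 (count ≥ 3 over the five transactions). The Node class and header-table layout are assumptions for this sketch, not the book's implementation; ties in frequency are broken alphabetically, so c precedes f here, whereas the slide lists f first (either frequency-descending order is valid):

    from collections import Counter, defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions, min_count):
        # Scan 1: find frequent single items
        freq = Counter(i for t in transactions for i in t)
        freq = {i: c for i, c in freq.items() if c >= min_count}
        root, header = Node(None, None), defaultdict(list)
        # Scan 2: insert each transaction's frequent items, most frequent first
        for t in transactions:
            items = sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i))
            node = root
            for i in items:
                node = node.children.setdefault(i, Node(i, node))
                node.count += 1
                if node.count == 1:
                    header[i].append(node)  # node-link, recorded on creation
        return root, header

    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
          list("bcksp"), list("afcelpmn")]
    root, header = build_fp_tree(db, min_count=3)
    print(sorted(header))  # frequent items: ['a', 'b', 'c', 'f', 'm', 'p']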
Benefits of the FP-tree Structure
- Completeness:
  - never breaks a long pattern of any transaction
  - preserves complete information for frequent pattern mining
- Compactness:
  - reduces irrelevant information — infrequent items are gone
  - frequency-descending ordering: more frequent items are more likely to be shared
  - never larger than the original database (if not counting node-links and counts)
  - Example: for the Connect-4 DB, the compression ratio could be over 100

Mining Frequent Patterns Using the FP-tree
- General idea (divide-and-conquer): recursively grow frequent patterns using the FP-tree
- Method:
  - For each item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine an FP-tree
1. Construct the conditional pattern base for each node in the FP-tree
2. Construct the conditional FP-tree from each conditional pattern base
3. Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
   - If the conditional FP-tree contains a single path, simply enumerate all the patterns

Step 1: From FP-tree to Conditional Pattern Base
- Starting at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-link of each frequent item
- Accumulate all of the transformed prefix paths of that item to form its conditional pattern base

  Item | Conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1

Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property: for any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree
- Prefix-path property: to calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai

Step 2: Construct the Conditional FP-tree
- For each pattern base: accumulate the count for each item in the base, then construct the FP-tree for the frequent items of the pattern base (see the sketch below)
- Example: the m-conditional pattern base is fca:2, fcab:1, giving the m-conditional FP-tree {}–f:3–c:3–a:3; all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam
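Step 2's count accumulation is simple to make concrete. A minimal sketch, assuming the m-conditional pattern base has already been read off the FP-tree as (prefix path, count) pairs: items whose accumulated count meets min_count (3 here) survive into the m-conditional FP-tree, so b drops out:

    from collections import Counter

    # m's conditional pattern base, read off the FP-tree prefix paths:
    m_base = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]

    def frequent_in_base(base, min_count):
        counts = Counter()
        for prefix, n in base:
            for item in prefix:
                counts[item] += n
        return {i: c for i, c in counts.items() if c >= min_count}

    print(frequent_in_base(m_base, 3))
    # {'f': 3, 'c': 3, 'a': 3} — b (count 1) is filtered out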
Mining Frequent Patterns by Creating Conditional Pattern Bases

  Item | Conditional pattern base | Conditional FP-tree
  p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
  m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
  b    | {(fca:1), (f:1), (c:1)}  | Empty
  a    | {(fc:3)}                 | {(f:3, c:3)}|a
  c    | {(f:3)}                  | {(f:3)}|c
  f    | Empty                    | Empty

Step 3: Recursively Mine the Conditional FP-tree
- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree {}–f:3–c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree {}–f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree {}–f:3

Single FP-tree Path Generation
- Suppose an FP-tree T has a single path P
- The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P (see the sketch below)
- Example: the m-conditional FP-tree {}–f:3–c:3–a:3 yields all frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Principles of Frequent Pattern Growth
- Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
- "abcdef" is a frequent pattern if and only if
  - "abcde" is a frequent pattern, and
  - "f" is frequent in the set of transactions containing "abcde"

Why Is Frequent Pattern Growth Fast?
- Our performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
- Reasoning:
  - No candidate generation, no candidate test
  - Uses a compact data structure
  - Eliminates repeated database scans
  - The basic operations are counting and FP-tree building

FP-growth vs. Apriori: Scalability with the Support Threshold
[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K; the D1 FP-growth curve stays far below the D1 Apriori curve as the threshold drops from 3% toward 0.]
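For the single-path case, the enumeration is literally itertools.combinations. A sketch over the m-conditional FP-tree path (the function name and suffix argument are illustrative choices, not from the slides); on a single path the support of each generated pattern is the count of the deepest node on its sub-path, i.e., the minimum:

    from itertools import combinations

    path = [("f", 3), ("c", 3), ("a", 3)]  # the m-conditional FP-tree

    def single_path_patterns(path, suffix=("m",)):
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                items = tuple(i for i, _ in combo) + suffix
                support = min(c for _, c in combo)  # deepest node's count
                yield items, support

    for pattern, sup in single_path_patterns(path):
        print(pattern, sup)
    # ('f','m') 3, ('c','m') 3, ('a','m') 3, ('f','c','m') 3,
    # ('f','a','m') 3, ('c','a','m') 3, ('f','c','a','m') 3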
FP-growth vs. Tree-Projection: Scalability with the Support Threshold
[Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K; D2 FP-growth scales better than D2 TreeProjection as the threshold decreases.]

Presentation of Association Rules (Table Form)
[Figure: screenshot of discovered rules listed in a table.]

Visualization of Association Rules Using a Plane Graph
[Figure: screenshot.]

Visualization of Association Rules Using a Rule Graph
[Figure: screenshot.]

Chapter 6 outline (repeated).

Iceberg Queries
- Iceberg query: compute aggregates over one or a set of attributes only for those whose aggregate values are above a certain threshold
- Example (see the plain-Python sketch below):

    select P.custID, P.itemID, sum(P.qty)
    from purchase P
    group by P.custID, P.itemID
    having sum(P.qty) >= 10

- Compute iceberg queries efficiently by Apriori:
  - First compute the lower dimensions
  - Then compute the higher dimensions only when all the lower ones are above the threshold
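The iceberg query above is a plain GROUP BY ... HAVING; a tiny Python equivalent, over a hypothetical purchase table, shows that only above-threshold groups survive:

    from collections import Counter

    purchases = [  # hypothetical (custID, itemID, qty) rows
        ("c1", "milk", 6), ("c1", "milk", 7),
        ("c1", "bread", 2), ("c2", "milk", 4),
    ]

    totals = Counter()
    for cust, item, qty in purchases:
        totals[(cust, item)] += qty  # GROUP BY custID, itemID: SUM(qty)

    iceberg = {k: v for k, v in totals.items() if v >= 10}  # HAVING >= 10
    print(iceberg)  # {('c1', 'milk'): 13} — the only group above the threshold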
Multiple-Level Association Rules
- Items often form a hierarchy (e.g., food → milk, bread; milk → skim, 2%; bread → wheat, white; 2% milk → Fraser, Sunset)
- Items at the lower level are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels
- We can explore shared multi-level mining (see the encoding sketch below)

  TID | Items
  T1  | {111, 121, 211, 221}
  T2  | {111, 211, 222, 323}
  T3  | {112, 122, 221, 411}
  T4  | {111, 121}
  T5  | {111, 122, 211, 221, 413}

Mining Multi-Level Associations
- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%]
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
- Variations of mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

Multi-Level Association: Uniform Support vs. Reduced Support
- Uniform support: the same minimum support for all levels
  - + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  - – Lower-level items do not occur as frequently. If the support threshold is too high, we miss low-level associations; if too low, we generate too many high-level associations
- Reduced support: reduced minimum support at lower levels; there are 4 search strategies:
  - Level-by-level independent
  - Level-cross filtering by k-itemset
  - Level-cross filtering by single item
  - Controlled level-cross filtering by single item

Uniform Support (multi-level mining with uniform support)
- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced Support (multi-level mining with reduced support)
- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Multi-Level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
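To make the encoded-transaction idea concrete, here is a small sketch over the slide's TID table. It assumes the usual reading of the 3-digit codes (first digit = level-1 category, second digit = level-2 type, third digit = brand) and illustrates reduced support: level-2 prefixes are examined only when their level-1 ancestor met that level's threshold. The specific thresholds (80% and 60%) are chosen only for illustration:

    from collections import Counter

    db = [["111", "121", "211", "221"], ["111", "211", "222", "323"],
          ["112", "122", "221", "411"], ["111", "121"],
          ["111", "122", "211", "221", "413"]]

    def level_counts(db, digits):
        """Count items truncated to the first `digits` digits of their code."""
        c = Counter()
        for t in db:
            for code in {item[:digits] for item in t}:  # dedupe per transaction
                c[code] += 1
        return c

    l1 = {k: v for k, v in level_counts(db, 1).items() if v >= 4}  # min_sup 80%
    l2 = {k: v for k, v in level_counts(db, 2).items()
          if v >= 3 and k[0] in l1}  # reduced min_sup 60%, ancestor must be in l1
    print(l1)  # {'1': 5, '2': 4}
    print(l2)  # {'11': 5, '12': 4, '21': 3, '22': 4}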
Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
  - First mine high-level frequent items: milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across multi-levels lead to different algorithms:
  - If adopting the same min_support across multi-levels, then toss t if any of t's ancestors is infrequent
  - If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor's support is frequent/non-negligible

Progressive Refinement of Data Mining Quality
- Why progressive refinement?
  - A mining operator can be expensive or cheap, fine or rough
  - Trade speed for quality: step-by-step refinement
- Superset coverage property: preserve all the positive answers — allow a false positive test but not a false negative test
- Two- or multi-step mining:
  - First apply a rough/cheap operator (superset coverage)
  - Then apply an expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD'95)

Progressive Refinement Mining of Spatial Association Rules
- Hierarchy of spatial relationships: "g_close_to": near_by, touch, intersect, contain, etc.
- First search for a rough relationship and then refine it
- Two-step mining of spatial associations:
  - Step 1: rough spatial computation (as a filter), using MBRs or an R-tree for rough estimation
  - Step 2: detailed spatial algorithm (as refinement), applied only to those objects that have passed the rough spatial association test (no less than min_support)

Chapter 6 outline (repeated): next, mining multidimensional association rules.

Multi-Dimensional Association: Concepts
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: ≥ 2 dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values

Techniques for Mining MD Associations
- Search for frequent k-predicate sets
  - Example: {age, occupation, buys} is a 3-predicate set
- Techniques can be categorized by how quantitative attributes such as age are treated:
  1. Static discretization of quantitative attributes: statically discretized by using predefined concept hierarchies
  2. Quantitative association rules: dynamically discretized into "bins" based on the distribution of the data
  3. Distance-based association rules: a dynamic discretization process that considers the distance between data points
Static Discretization of Quantitative Attributes
- Discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets
  [Figure: the cuboid lattice over (age), (income), (buys), their pairwise combinations, and (age, income, buys).]
- Mining from data cubes can be much faster

Quantitative Association Rules
- Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
- Cluster "adjacent" association rules to form general rules using a 2-D grid
- Example: age(X, "30-34") ∧ income(X, "24K-48K") ⇒ buys(X, "high resolution TV")

ARCS (Association Rule Clustering System)
- How does ARCS work?
  1. Binning
  2. Find frequent predicate sets
  3. Clustering
  4. Optimize

Limitations of ARCS
- Only quantitative attributes on the LHS of rules
- Only 2 attributes on the LHS (2-D limitation)
- An alternative to ARCS: non-grid-based, equi-depth binning, clustering based on a measure of partial completeness ("Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal)

Mining Distance-Based Association Rules
- Binning methods do not capture the semantics of interval data:

  Price($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based
  7        | [0,10]                 | [7,20]               | [7,7]
  20       | [11,20]                | [22,50]              | [20,22]
  22       | [21,30]                | [51,53]              | [50,53]
  50       | [31,40]                |                      |
  51       | [41,50]                |                      |
  53       | [51,60]                |                      |

- Distance-based partitioning gives a more meaningful discretization, considering:
  - the density/number of points in an interval
  - the "closeness" of points in an interval

Clusters and Distance Measurements
- S[X] is a set of N tuples t1, t2, …, tN, projected on the attribute set X
- The diameter of S[X] (see the sketch below):

    d(S[X]) = ( Σ_{i=1..N} Σ_{j=1..N} distX(ti[X], tj[X]) ) / ( N(N − 1) )

- distX: a distance metric, e.g., Euclidean distance or Manhattan distance
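The diameter formula can be checked on the slide's distance-based price clusters; a short sketch using Manhattan distance on the single price attribute:

    def diameter(values):
        """d(S[X]) = sum of pairwise distances / (N * (N - 1))."""
        n = len(values)
        if n < 2:
            return 0.0
        total = sum(abs(a - b) for a in values for b in values)
        return total / (n * (n - 1))

    print(diameter([7]))           # 0.0 — the [7,7] cluster
    print(diameter([20, 22]))      # 2.0 — the [20,22] cluster
    print(diameter([50, 51, 53]))  # 2.0 — the [50,53] cluster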
Clusters and Distance Measurements (Cont.)
- The diameter d assesses the density of a cluster CX, where
  - d(CX) ≤ d0
  - |CX| ≥ s0
- Finding clusters and distance-based rules:
  - the density threshold, d0, replaces the notion of support
  - uses a modified version of the BIRCH clustering algorithm

Chapter 6 outline (repeated): next, from association mining to correlation analysis.

Interestingness Measurements
- Objective measures: two popular measurements are (1) support and (2) confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if
  (1) it is unexpected (surprising to the user), and/or
  (2) it is actionable (the user can do something with it)

Criticism of Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98):
  - Among 5000 students: 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal
  - play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
  - play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence

             | basketball | not basketball | sum (row)
  cereal     | 2000       | 1750           | 3750
  not cereal | 1000       | 250            | 1250
  sum (col.) | 3000       | 2000           | 5000

Criticism of Support and Confidence (Cont.)
- Example 2:

  X | 1 1 1 1 0 0 0 0
  Y | 1 1 0 0 0 0 0 0
  Z | 0 1 1 1 1 1 1 1

  - X and Y are positively correlated; X and Z are negatively related
  - Yet the support and confidence of X ⇒ Z dominate:

  Rule  | Support | Confidence      Itemset | Support | Interest
  X ⇒ Y | 25%     | 50%             X, Y    | 25%     | 2
  X ⇒ Z | 37.50%  | 75%             X, Z    | 37.50%  | 0.9
                                    Y, Z    | 12.50%  | 0.57

- We need a measure of dependent or correlated events

Other Interestingness Measures: Interest
- Interest (correlation, lift):

    corr(A, B) = P(A ∧ B) / ( P(A) · P(B) )

  - takes both P(A) and P(B) into consideration
  - P(A ∧ B) = P(A) · P(B) if A and B are independent events
  - A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
- P(B|A)/P(B) is also called the lift of rule A ⇒ B (see the sketch below)
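The basketball/cereal example works out as follows under the interest (lift) measure; this sketch just plugs the contingency-table counts into P(A ∧ B) / (P(A) · P(B)):

    def lift(p_a_and_b, p_a, p_b):
        """Interest: values below 1 signal negative correlation."""
        return p_a_and_b / (p_a * p_b)

    # From the table: 5000 students, 3000 basketball, 3750 cereal, 2000 both.
    p_b, p_c, p_bc = 3000 / 5000, 3750 / 5000, 2000 / 5000

    print(lift(p_bc, p_b, p_c))
    # 0.888... < 1: basketball and cereal are negatively correlated
    print(lift(1000 / 5000, p_b, 1250 / 5000))
    # 1.333... > 1: basketball and not-cereal are positively correlated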
Chapter 6 outline (repeated): next, constraint-based association mining.

Constraint-Based Mining
- Interactive, exploratory mining of giga-bytes of data — could it be real? Yes, by making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries, e.g., find product pairs sold together in Vancouver in Dec. '98
  - Dimension/level constraints: in relevance to region, price, brand, customer category
  - Rule constraints: e.g., small sales (price < $10) trigger big sales (sum > $200)
  - Interestingness constraints: e.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%)

Rule Constraints in Association Mining
- Two kinds of rule constraints:
  - Rule form constraints: meta-rule guided mining, e.g., P(x, y) ^ Q(x, w) → takes(x, "database systems")
  - Rule (content) constraints: constraint-based query optimization (Ng et al., SIGMOD'98), e.g., sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan et al., SIGMOD'99):
  - 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above
  - 2-var: a constraint confining both sides (L and R), e.g., sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)

Constraint-Based Association Query
- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is of the form {(S1, S2) | C}, where C is a set of constraints on S1, S2, including a frequency constraint
- A classification of (single-variable) constraints:
  - Class constraint: S ⊂ A, e.g., S ⊂ Item
  - Domain constraints:
    - S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
    - v θ S, θ is ∈ or ∉, e.g., snacks ∉ S.Type
    - V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
  - Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., count(S1.Type) = 1, avg(S2.Price) < 100

Constrained Association Query Optimization Problem
- Given a CAQ = {(S1, S2) | C}, the algorithm should be:
  - sound: it only finds frequent sets that satisfy the given constraints C
  - complete: all frequent sets satisfying the given constraints C are found
- A naïve solution: apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one
- Our approach: comprehensively analyze the properties of constraints and try to push them as deeply as possible inside the frequent set computation (see the pruning sketch below)

Anti-Monotone and Monotone Constraints
- A constraint Ca is anti-monotone iff for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
- A constraint Cm is monotone iff for any pattern S satisfying Cm, every super-pattern of S also satisfies it
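"Pushing a constraint deeply into the computation" means pruning with it during candidate generation rather than filtering afterward. Here is a sketch for the anti-monotone constraint sum(S.Price) ≤ 1000, with a hypothetical price table; the pruning is safe precisely because no superset of a violating set can satisfy the constraint:

    price = {"a": 40, "b": 70, "c": 900, "d": 30}  # hypothetical price table
    MAX_SUM = 1000

    def satisfies_budget(itemset):
        return sum(price[i] for i in itemset) <= MAX_SUM

    def prune(candidates):
        # Safe for anti-monotone constraints only: a violating set can
        # never be "repaired" by adding more items.
        return [c for c in candidates if satisfies_budget(c)]

    print(prune([("a", "b"), ("b", "c"), ("a", "b", "c")]))
    # [('a','b'), ('b','c')] — ('a','b','c') sums to 1010 and is dropped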
Succinct Constraint
- A subset of items Is is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator
- SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
- A constraint Cs is succinct provided SATCs(I) is a succinct power set

Convertible Constraint
- Suppose all items in patterns are listed in a total order R
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C

Relationships Among Categories of Constraints
[Figure: Venn diagram relating succinctness, anti-monotonicity, monotonicity, convertible constraints, and inconvertible constraints.]

Property of Constraints: Anti-Monotone
- Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint
- Examples:
  - sum(S.Price) ≤ v is anti-monotone
  - sum(S.Price) ≥ v is not anti-monotone
  - sum(S.Price) = v is partly anti-monotone
- Application: push "sum(S.Price) ≤ 1000" deeply into the iterative frequent set computation

Characterization of Anti-Monotonicity Constraints

  Constraint                | Anti-monotone?
  S θ v, θ ∈ {=, ≤, ≥}      | yes
  v ∈ S                     | no
  S ⊇ V                     | no
  S ⊆ V                     | yes
  S = V                     | partly
  min(S) ≤ v                | no
  min(S) ≥ v                | yes
  min(S) = v                | partly
  max(S) ≤ v                | yes
  max(S) ≥ v                | no
  max(S) = v                | partly
  count(S) ≤ v              | yes
  count(S) ≥ v              | no
  count(S) = v              | partly
  sum(S) ≤ v                | yes
  sum(S) ≥ v                | no
  sum(S) = v                | partly
  avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible
  (frequency constraint)    | (yes)

Example of Convertible Constraints: avg(S) θ v
- Let R be the value-descending order over the set of items, e.g., I = {9, 8, 6, 4, 3, 1}
- avg(S) ≥ v is convertible monotone w.r.t. R: if S is a suffix of S1, then avg(S1) ≥ avg(S)
  - {8, 4, 3} is a suffix of {9, 8, 4, 3}; avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
  - If S satisfies avg(S) ≥ v, so does S1: {8, 4, 3} satisfies avg(S) ≥ 4, and so does {9, 8, 4, 3} (see the sketch below)
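The avg(S) ≥ v example is easy to verify numerically; a minimal sketch using the slide's item set and threshold:

    # With items in value-descending order R, extending a suffix toward the
    # front adds larger values, so the average can only rise: avg(S) >= v
    # is convertible monotone w.r.t. R.
    R = [9, 8, 6, 4, 3, 1]  # I, in the slide's descending order

    def avg(s):
        return sum(s) / len(s)

    suffix = [8, 4, 3]       # a suffix of the pattern below w.r.t. R
    pattern = [9, 8, 4, 3]
    print(avg(suffix), avg(pattern))           # 5.0 6.0 — avg(S1) >= avg(S)
    print(avg(suffix) >= 4, avg(pattern) >= 4) # True True — both satisfy v = 4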
  13. 13. Characterization of Constraints by Property of Constraints: Succinctness Succinctness n Succinctness: S θ v, θ ∈ { = , ≤, ≥ } For any set S 1 and S 2 satisfying C, S 1 ∪ S 2 satisfies C Yes n v∈ S yes S ⊇V Given A 1 is the sets of size 1 satisfying C, then any set S n yes S⊆ V yes satisfying C are based on A 1 , i.e., it contains a subset S= V yes min(S) ≤ v belongs to A 1 , min(S) ≥ v yes yes n Example : min(S) = v yes max(S) ≤ v yes n sum(S.Price ) ≥ v is not succinct max(S) ≥ v yes max(S) = v yes n min(S.Price ) ≤ v is succinct count(S) ≤ v weakly count(S) ≥ v weakly n Optimization: count(S) = v weakly sum(S) ≤ v n If C is succinct, then C is pre -counting prunable. The no sum(S) ≥ v no satisfaction of the constraint alone is not affected by the sum(S) = v no iterative support counting. avg(S) θ v, θ ∈ { =, ≤ , ≥ } no (frequent constraint) (no) January 17, 2001 Data Mining: Concepts and Techniques 73 January 17, 2001 Data Mining: Concepts and Techniques 74 Chapter 6: Mining Association Why Is the Big Pie Still There? Rules in Large Databases n Association rule mining n More on constraint -based mining of associations n Boolean vs. quantitative associations n Mining single-dimensional Boolean association rules n Association on discrete vs. continuous data from transactional databases n From association to correlation and causal structure n Mining multilevel association rules from transactional analysis. databases n Association does not necessarily imply correlation or causal relationships n Mining multidimensional association rules from n From intra -trasanction association to inter-transaction transactional databases and data warehouse associations n From association mining to correlation analysis n E.g., break the barriers of transactions (Lu, et al. TOIS’99). n From association analysis to classification and clustering n Constraint-based association mining analysis n Summary n E.g, clustering association rules January 17, 2001 Data Mining: Concepts and Techniques 75 January 17, 2001 Data Mining: Concepts and Techniques 76 Chapter 6: Mining Association Summary Rules in Large Databases n Association rule mining n Association rule mining n Mining single-dimensional Boolean association rules n probably the most significant contribution from the from transactional databases database community in KDD n Mining multilevel association rules from transactional n A large number of papers have been published databases n Many interesting issues have been explored n Mining multidimensional association rules from n An interesting research direction transactional databases and data warehouse n Association analysis in other types of data: spatial n From association mining to correlation analysis data, multimedia data, time series data, etc. n Constraint-based association mining n Summary January 17, 2001 Data Mining: Concepts and Techniques 77 January 17, 2001 Data Mining: Concepts and Techniques 78 13