

- 1. DBM630: Data Mining and Data Warehousing, MS.IT., Rangsit University, Semester 2/2011. Lecture 5: Association Rule Mining, by Kritsada Sriphaew (sriphaew.k AT gmail.com)
- 2. Topics
  - Association rule mining
  - Mining single-dimensional association rules
  - Mining multilevel association rules
  - Other measurements: interest and conviction
  - Association rule mining to correlation analysis

  (Slide footer: Data Warehousing and Data Mining by Kritsada Sriphaew)
- 3. What is Association Mining?
  - Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
  - Rule form: "Body → Head [support, confidence]", i.e. "Antecedent → Consequent [support, confidence]"
  - Examples:
    - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
    - major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
- 4. A typical example of association rule mining is market basket analysis.
- 5. Rule Measures: Support/Confidence
  - Find all the rules "Antecedent(s) → Consequent(s)" with minimum support and confidence
    - support, s: probability that a transaction contains {A ∪ C}, i.e. all items of both A and C
    - confidence, c: conditional probability that a transaction having A also contains C
  - Let min. sup. be 50% and min. conf. be 50%:
    - A → C (s=50%, c=66.7%)
    - C → A (s=50%, c=100%)
  - Support = 50% means that 50% of all transactions under analysis show that A and C are purchased together
  - Confidence = 66.7% means that 66.7% of the customers who purchased A also bought C
  - Typically association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.
  - Transactional database:

  | Transaction ID | Items Bought |
  |---|---|
  | 2000 | A, B, C |
  | 1000 | A, C |
  | 4000 | A, D |
  | 5000 | B, E, F |
- 6. Rule Measures: Support/Confidence (as probabilities)

  | TransID | Items Bought |
  |---|---|
  | T001 | A, B, C |
  | T002 | A, C |
  | T003 | A, D |
  | T004 | B, E, F |

  - Rule A → C:
    - support(A → C) = P({A ∪ C}) = P(A ∧ C)
    - confidence(A → C) = P(C | A) = P({A ∪ C}) / P({A})
  - Frequency: A = 3, B = 2, C = 2, AB = 1, AC = 2, BC = 1, ABC = 1
  - Rules (support, confidence):
    - A → B (1/4 = 25%, 1/3 = 33.3%)
    - B → A (1/4 = 25%, 1/2 = 50%)
    - A → C (2/4 = 50%, 2/3 = 66.7%)
    - C → A (2/4 = 50%, 2/2 = 100%)
    - A, B → C (1/4 = 25%, 1/1 = 100%)
    - A, C → B (1/4 = 25%, 1/2 = 50%)
    - B, C → A (1/4 = 25%, 1/1 = 100%)
  - Figure: Venn diagram with regions "Customer buys beer (A)", "Customer buys diaper (C)", and the overlap "Customer buys both (A&C)"
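The support and confidence figures on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the names `transactions`, `support`, and `confidence` are illustrative and do not come from the lecture.

```python
# Transactions from the slide (TransID -> items bought).
transactions = {
    "T001": {"A", "B", "C"},
    "T002": {"A", "C"},
    "T003": {"A", "D"},
    "T004": {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A u C) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Each string is read as a set of single-letter items, e.g. "AB" -> {A, B}.
for lhs, rhs in [("A", "C"), ("C", "A"), ("AB", "C"), ("AC", "B")]:
    s = support(set(lhs) | set(rhs))
    c = confidence(lhs, rhs)
    print(f"{','.join(lhs)} -> {rhs}: support={s:.1%}, confidence={c:.1%}")
```

Writing confidence as a ratio of two supports mirrors the definition P({A ∪ C}) / P({A}) used on the slide.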
- 7. Association Rule: Support/Confidence for Relational Tables
  - Here each transaction is a row in a relational table
  - Find: all rules that correlate the presence of one set of attributes with that of another set of attributes

  | outlook | temp. | humidity | windy | Sponsor | play-time | play |
  |---|---|---|---|---|---|---|
  | sunny | hot | high | True | Sony | 85 | Y |
  | sunny | hot | high | False | HP | 90 | Y |
  | overcast | hot | normal | True | Ford | 63 | Y |
  | rainy | mild | high | True | Ford | 5 | N |
  | rainy | cool | low | False | HP | 56 | Y |
  | sunny | hot | low | True | Sony | 25 | N |
  | rainy | cool | normal | True | Nokia | 5 | N |
  | overcast | mild | high | True | Honda | 86 | Y |
  | rainy | mild | low | False | Ford | 78 | Y |
  | overcast | hot | high | True | Sony | 74 | Y |

  - If temperature = hot then humidity = high (s=3/10, c=3/5)
  - If windy = true and play = Y then humidity = high and outlook = overcast (s=2/10, c=2/4)
  - If windy = true and play = Y and humidity = high then outlook = overcast (s=2/10, c=2/3)
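For a relational table, a rule is a pair of attribute-value conditions rather than a pair of itemsets. The sketch below recomputes the three (s, c) pairs from the table above; the column identifiers and the helper names `matches` and `rule_stats` are assumptions made for this illustration.

```python
columns = ["outlook", "temp", "humidity", "windy", "sponsor", "play_time", "play"]
raw = [
    ("sunny", "hot", "high", True, "Sony", 85, "Y"),
    ("sunny", "hot", "high", False, "HP", 90, "Y"),
    ("overcast", "hot", "normal", True, "Ford", 63, "Y"),
    ("rainy", "mild", "high", True, "Ford", 5, "N"),
    ("rainy", "cool", "low", False, "HP", 56, "Y"),
    ("sunny", "hot", "low", True, "Sony", 25, "N"),
    ("rainy", "cool", "normal", True, "Nokia", 5, "N"),
    ("overcast", "mild", "high", True, "Honda", 86, "Y"),
    ("rainy", "mild", "low", False, "Ford", 78, "Y"),
    ("overcast", "hot", "high", True, "Sony", 74, "Y"),
]
rows = [dict(zip(columns, r)) for r in raw]

def matches(row, conditions):
    """True if the row satisfies every attribute = value condition."""
    return all(row[attr] == value for attr, value in conditions.items())

def rule_stats(antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    both = {**antecedent, **consequent}
    n_ante = sum(matches(r, antecedent) for r in rows)
    n_both = sum(matches(r, both) for r in rows)
    return n_both / len(rows), n_both / n_ante

print(rule_stats({"temp": "hot"}, {"humidity": "high"}))
print(rule_stats({"windy": True, "play": "Y"},
                 {"humidity": "high", "outlook": "overcast"}))
print(rule_stats({"windy": True, "play": "Y", "humidity": "high"},
                 {"outlook": "overcast"}))
```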
- 8. Association Rule Mining: Types
  - Boolean vs. quantitative associations (based on the types of values handled); single vs. multiple dimensions
    - SQLServer ∧ DMBooks → DBMiner [0.2%, 60%]
    - buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
    - age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
  - Single level vs. multilevel analysis
    - What brands of beers are associated with what brands of diapers?
  - Various extensions
    - Maxpatterns and closed itemsets
- 9. An Example (single-dimensional Boolean association rule mining)
  - Min. support 50%, min. confidence 50%

  | Transaction ID | Items Bought |
  |---|---|
  | 2000 | A, B, C |
  | 1000 | A, C |
  | 4000 | A, D |
  | 5000 | B, E, F |

  | Frequent Itemset | Support |
  |---|---|
  | {A} | 75% |
  | {B} | 50% |
  | {C} | 50% |
  | {A, C} | 50% |

  - For rule A → C:
    - support = support({A, C}) = 50%
    - confidence = support({A, C}) / support({A}) = 66.7%
  - The Apriori principle: any subset of a frequent itemset must be frequent
- 10. Two Steps in Mining Association Rules
  - Step 1: Find the frequent itemsets: the sets of items that have minimum support
    - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
    - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
  - Step 2: Use the frequent itemsets to generate association rules
- 11. Find the frequent itemsets: The Apriori Algorithm
  - Join step: Ck is generated by joining Lk-1 with itself
  - Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
  - Pseudo-code:

        Ck: candidate itemsets of size k
        Lk: frequent itemsets of size k

        L1 = {frequent 1-itemsets};
        for (k = 1; Lk != ∅; k++) do begin
            Ck+1 = candidates generated from Lk;
            for each transaction t in database do
                increment the count of all candidates in Ck+1 that are contained in t
            Lk+1 = candidates in Ck+1 with min_support
        end
        return ∪k Lk;
- 12. The Apriori Algorithm: Example
  - Database D:

  | TID | Items |
  |---|---|
  | 100 | 1 3 4 |
  | 200 | 2 3 5 |
  | 300 | 1 2 3 5 |
  | 400 | 2 5 |

  - Scan D to count C1 (itemset: sup.): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
  - L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
  - C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
  - Scan D to count C2: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
  - L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
  - C3 (generated from L2): {2 3 5}
  - Scan D to count C3, giving L3: {2 3 5}: 2
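The level-wise procedure traced above can be sketched in Python as follows. Two assumptions are made here: the minimum support count of 2 is inferred from the tables (itemsets with count 1 are dropped), and the join is a simple pairwise union followed by the prune step, not the prefix-based join shown on the next slide. The function name `apriori` and its parameters are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)
    k = 1
    while current:
        # Join step: union pairs of frequent k-itemsets that differ in one item.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # Count step: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(database, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The loop stops as soon as a level produces no frequent itemsets, which corresponds to the `Lk != ∅` test in the pseudo-code.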
- 13. How to Generate Candidates?
  - Suppose the items in Lk-1 are listed in an order
  - Step 1: self-joining Lk-1

        INSERT INTO Ck
        SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
        FROM Lk-1 p, Lk-1 q
        WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  - Step 2: pruning

        ForAll itemsets c IN Ck DO
            ForAll (k-1)-subsets s OF c DO
                IF (s is not in Lk-1) THEN DELETE c FROM Ck
- 14. Example of Generating Candidates
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 × L3
    - abc + abd → abcd
    - acd + ace → acde
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}
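The join and prune steps can be checked on this example with a short Python sketch. The function name `generate_candidates` and the representation of itemsets as sorted tuples are choices made for this illustration; the function assumes a non-empty Lk-1.

```python
from itertools import combinations

def generate_candidates(prev_frequent):
    """Build C_k from L_{k-1}; returns (joined, pruned) lists of sorted tuples."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    k_minus_1 = len(prev[0])
    joined = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join: identical first k-2 items, and p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    # Prune: every (k-1)-subset of a candidate must itself be in L_{k-1}.
    pruned = [
        c for c in joined
        if all(s in prev_set for s in combinations(c, k_minus_1))
    ]
    return joined, pruned

L3 = ["abc", "abd", "acd", "ace", "bcd"]
joined, C4 = generate_candidates(L3)
print(["".join(c) for c in joined])  # candidates after the self-join
print(["".join(c) for c in C4])      # candidates surviving the prune
```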
- 15. How to Count Supports of Candidates?
  - Why is counting supports of candidates a problem?
    - The total number of candidates can be huge
    - One transaction may contain many candidates
  - Method:
    - Candidate itemsets are stored in a hash-tree
    - A leaf node of the hash-tree contains a list of itemsets and counts
    - An interior node contains a hash table
    - Subset function: finds all the candidates contained in a transaction
- 16. Subset Function
  - Subset function: finds all the candidates contained in a transaction.
    - (1) Generate the hash tree
    - (2) Hash each item in the transactions
  - Figure: hash tree over the candidate 2-itemsets C2 = {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}, with counts accumulated while hashing the database (TID 100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5): {1 2}: 1, {1 3}: 1+1, {1 5}: 1, {2 3}: 1+1, {2 5}: 1+1+1, {3 5}: 1+1
- 17. Is Apriori Fast Enough? Performance Bottlenecks
  - The core of the Apriori algorithm:
    - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
    - Use database scan and pattern matching to collect counts for the candidate itemsets
  - The bottleneck of Apriori: candidate generation
    - Huge candidate sets:
      - 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
      - To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
    - Multiple scans of database:
      - Needs (n + 1) scans, where n is the length of the longest pattern
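As a rough check of the orders of magnitude quoted on this slide (a back-of-the-envelope calculation, not taken from the lecture), the number of 2-item combinations of 10^4 items and the number of non-empty subsets of a 100-item pattern are:

$$
\binom{10^4}{2} = \frac{10^4\,(10^4 - 1)}{2} = 49\,995\,000 \approx 5 \times 10^{7},
\qquad
2^{100} - 1 \approx 1.27 \times 10^{30}.
$$

Both agree with the slide's figures up to a small constant factor.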
- 18. Mining Frequent Patterns Without Candidate Generation
  - Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
    - highly condensed, but complete for frequent pattern mining
    - avoids costly database scans
  - Develop an efficient, FP-tree-based frequent pattern mining method
    - A divide-and-conquer methodology: decompose mining tasks into smaller ones
    - Avoid candidate generation: sub-database test only!
- 19. Construct FP-tree from Transaction DB (min_support = 0.5)

  | TID | Items bought | (Ordered) frequent items |
  |---|---|---|
  | 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
  | 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
  | 300 | {b, f, h, j, o} | {f, b} |
  | 400 | {b, c, k, s, p} | {c, b, p} |
  | 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |

  - Steps:
    1. Scan DB once, find frequent 1-itemsets (single item patterns)
    2. Order frequent items in frequency descending order
    3. Scan DB again, construct FP-tree
  - Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
  - FP-tree (paths from the root {}):
    - {} → f:4 → c:3 → a:3 → m:2 → p:2
    - {} → f:4 → c:3 → a:3 → b:1 → m:1
    - {} → f:4 → b:1
    - {} → c:1 → b:1 → p:1
- 20. Mining Frequent Patterns using FP-tree
  - General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree
  - Method:
    - For each item, construct its conditional pattern-base, and then its conditional FP-tree
    - Repeat the process on each newly created conditional FP-tree
    - Stop when the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
  - Benefit: completeness and compactness
    - Completeness: never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining
    - Compactness: reduces irrelevant information (infrequent items are gone), orders items in frequency descending order (more frequent items are more likely to be shared), and is smaller than the original database.
- 21. Step 1: From FP-tree to Conditional Pattern Base
  - Start at the frequent header table in the FP-tree
  - Traverse the FP-tree by following the link of each frequent item
  - Accumulate all of the transformed prefix paths of that item to form a conditional pattern base
  - Figure: the FP-tree and header table (f: 4, c: 4, a: 3, b: 3, m: 3, p: 3) from slide 19
  - Conditional pattern bases:

  | Item | Conditional pattern base |
  |---|---|
  | c | f:3 |
  | a | fc:3 |
  | b | fca:1, f:1, c:1 |
  | m | fca:2, fcab:1 |
  | p | fcam:2, cb:1 |

  (Slide footer: Knowledge Management and Discovery © Kritsada Sriphaew)
- 22. Step 2: Construct Conditional FP-tree
  - For each pattern-base:
    - Accumulate the count for each item in the base
    - Construct the FP-tree for the frequent items of the pattern base
  - Example for item m (the global FP-tree and header table are those of slide 19):
    - m-conditional pattern base: fca:2, fcab:1
    - m-conditional FP-tree: {} → f:3 → c:3 → a:3
    - All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
- 23. Mining Frequent Patterns by Creating Conditional Pattern-Bases

  | Item | Conditional pattern-base | Conditional FP-tree |
  |---|---|---|
  | p | {(fcam:2), (cb:1)} | {(c:3)}\|p |
  | m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}\|m |
  | b | {(fca:1), (f:1), (c:1)} | Empty |
  | a | {(fc:3)} | {(f:3, c:3)}\|a |
  | c | {(f:3)} | {(f:3)}\|c |
  | f | Empty | Empty |
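The table above can be reproduced without materialising the tree, because the prefix path of an item in the FP-tree is simply the part of each ordered transaction that precedes it. The sketch below takes that shortcut, so it illustrates conditional pattern bases and is not an FP-tree implementation. The header order f, c, a, b, m, p is copied from the slides (ties between equal counts are broken as shown there), and all identifiers are illustrative.

```python
from collections import Counter

# Each string is one transaction; every character is an item.
transactions = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
min_count = 3  # min_support = 0.5 of 5 transactions

# Pass 1: count items and keep the frequent ones.
item_counts = Counter(item for t in transactions for item in t)
frequent = {i: n for i, n in item_counts.items() if n >= min_count}

# Assumption: use the header-table order shown on the slides.
header_order = ["f", "c", "a", "b", "m", "p"]
rank = {item: pos for pos, item in enumerate(header_order)}

# Pass 2: drop infrequent items and sort each transaction by header order.
ordered = [
    sorted((i for i in t if i in frequent), key=rank.__getitem__)
    for t in transactions
]

# Conditional pattern base of an item: the prefixes that precede it, with counts.
pattern_base = {item: Counter() for item in header_order}
for path in ordered:
    for pos, item in enumerate(path):
        if pos > 0:
            pattern_base[item]["".join(path[:pos])] += 1

def conditional_frequent_items(item):
    """Items that stay frequent inside the conditional pattern base of `item`."""
    counts = Counter()
    for prefix, n in pattern_base[item].items():
        for i in prefix:
            counts[i] += n
    return {i: n for i, n in counts.items() if n >= min_count}

for item in reversed(header_order):
    print(item, dict(pattern_base[item]), conditional_frequent_items(item))
```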
- 24. Step 3: Recursively mine the conditional FP-tree
  - m-conditional FP-tree: {} → f:3 → c:3 → a:3
  - am-conditional FP-tree: {} → f:3 → c:3
  - cm-conditional FP-tree: {} → f:3
  - cam-conditional FP-tree: {} → f:3
- 25. Single FP-tree Path Generation
  - Suppose an FP-tree T has a single path P
  - The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P
  - Example (the global FP-tree and header table are those of slide 19):
    - m-conditional pattern base: fca:2, fcab:1
    - m-conditional FP-tree: {} → f:3 → c:3 → a:3
    - All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
- 26. FP-growth vs. Apriori: Scalability With the Support Threshold
  - Figure: run time (sec., 0 to 100) against support threshold (%, 0 to 3) on data set T25I20D10K, with two curves: "D1 FP-growth runtime" and "D1 Apriori runtime"
- 27. CHARM: Mining Closed Association Rules
  - Instead of the horizontal DB format, the vertical format is used.
  - Instead of traditional frequent itemsets, closed frequent itemsets are mined.

  Horizontal DB:

  | Transaction | Items |
  |---|---|
  | 1 | ABDE |
  | 2 | BCE |
  | 3 | ABDE |
  | 4 | ABCE |
  | 5 | ABCDE |
  | 6 | BCD |

  Vertical DB:

  | Items | Transaction |
  |---|---|
  | A | 1345 |
  | B | 123456 |
  | C | 2456 |
  | D | 1356 |
  | E | 12345 |
- 28. CHARM: Frequent Itemsets and Their Supports
  - An example database and its frequent itemsets (min. support = 0.5)

  Vertical DB:

  | Items | Trans. |
  |---|---|
  | A | 1345 |
  | B | 123456 |
  | C | 2456 |
  | D | 1356 |
  | E | 12345 |

  | Support | Itemsets |
  |---|---|
  | 1.00 | B |
  | 0.83 | BE, E |
  | 0.67 | A, C, D, AB, AE, BC, BD, ABE |
  | 0.50 | AD, CE, DE, ABD, ADE, BCE, BDE, ABDE |
- 29. CHARM: Closed Itemsets
  - Closed frequent itemsets and their corresponding frequent itemsets

  | Closed Itemsets | Tidsets | Sup. | Freq. Itemsets |
  |---|---|---|---|
  | B | 123456 | 1.00 | B |
  | BE | 12345 | 0.83 | BE, E |
  | ABE | 1345 | 0.67 | ABE, AB, AE, A |
  | BD | 1356 | 0.67 | BD, D |
  | BC | 2456 | 0.67 | BC, C |
  | ABDE | 135 | 0.50 | ABDE, ABD, ADE, BDE, AD, DE |
  | BCE | 245 | 0.50 | CE, BCE |
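The grouping in this table can be verified by brute force: convert the database to vertical form, intersect tidsets to get the support of each itemset, and map each frequent itemset to its closure (the largest itemset shared by all of its transactions). The sketch below does exactly that and is deliberately not CHARM itself, which avoids enumerating every subset. All names are illustrative.

```python
from itertools import combinations

horizontal = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
min_count = 3  # min. support 0.5 of 6 transactions

# Vertical format: item -> set of transaction ids (tidset).
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

def tidset(itemset):
    """Transactions containing every item: intersect the item tidsets."""
    return set.intersection(*(vertical[i] for i in itemset))

def closure(tids):
    """Largest itemset shared by all transactions in `tids`."""
    shared = set.intersection(*(set(horizontal[t]) for t in tids))
    return "".join(sorted(shared))

# Group every frequent itemset under its closed itemset.
items = sorted(vertical)
closed = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        tids = tidset(combo)
        if len(tids) >= min_count:
            closed.setdefault(closure(tids), []).append("".join(combo))

for closed_itemset, members in sorted(closed.items(), key=lambda kv: (len(kv[0]), kv[0])):
    support = len(tidset(closed_itemset)) / len(horizontal)
    print(closed_itemset, f"{support:.2f}", members)
```

On this database 19 frequent itemsets collapse into 7 closed ones, which is the compression that motivates mining closed itemsets.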
- 30. The CHARM Algorithm

        CHARM (δ ⊆ I × T, minsup):
        1. Nodes = { Ij × t(Ij) : Ij ∈ I ∧ |t(Ij)| ≥ minsup }
        2. CHARM-EXTEND (Nodes, C)

        CHARM-EXTEND (Nodes, C):
        3. for each Xi × t(Xi) in Nodes
        4.     NewN = ∅ and X = Xi
        5.     for each Xj × t(Xj) in Nodes, with f(j) > f(i)
        6.         X' = X ∪ Xj and Y = t(Xi) ∩ t(Xj)
        7.         CHARM-PROPERTY(Nodes, NewN)
        8.     if NewN ≠ ∅ then CHARM-EXTEND(NewN)
        9.     C = C ∪ {X}   // if X is not subsumed

        CHARM-PROPERTY (Nodes, NewN):
        1. if (|Y| ≥ minsup) then
        2.     if t(Xi) = t(Xj) then        // Property 1
        3.         Remove Xj from Nodes
        4.         Replace all Xi with X'
        5.     else if t(Xi) ⊂ t(Xj) then   // Property 2
        6.         Replace all Xi with X'
        7.     else if t(Xi) ⊃ t(Xj) then   // Property 3
        8.         Remove Xj from Nodes
        9.         Add X × Y to NewN
        10.    else if t(Xi) ≠ t(Xj) then   // Property 4
        11.        Add X × Y to NewN

  - Figure: search tree of itemset × tidset nodes rooted at ∅: A×1345, B×123456, C×2456, D×1356, E×12345; AB×1345, ABE×1345, ABC×45, ABD×135, BC×2456, BD×1356, BE×12345; ABDE×135, BCD×56, BCE×245, BDE×135
- 31. Presentation of Association Rules (Table Form)
- 32. Visualization of Association Rule Using Plane Graph
- 33. Visualization of Association Rule Using Rule Graph
- 34. Mining multilevel association rules from transactional databases: Multiple-Level Association Rules
  - Items often form a hierarchy.
  - Items at the lower level are expected to have lower support.
  - Rules regarding itemsets at the appropriate levels could be quite useful.
  - The transaction database can be encoded based on dimensions and levels.
  - We can explore shared multi-level mining.

  | TID | ITEMS |
  |---|---|
  | T1 | {1121, 1122, 1212} |
  | T2 | {1222, 1121, 1122, 1213} |
  | T3 | {1124, 1213} |
  | T4 | {1111, 1211, 1232, 1221, 1223} |

  - Concept hierarchy (codes in parentheses):
    - Food (1)
      - Milk (11)
        - Skim (111)
        - 2% (112): Fraser (1121), Sunset (1124)
      - Bread (12)
        - Wheat (121): Wonder (1213)
        - White (122): Wonder (1222)
- 35. Mining Multi-Level Associations
  - A top-down, progressive deepening approach:
    - First find high-level strong rules: milk → bread [20%, 60%]
    - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
  - Variations in mining multiple-level association rules:
    - Level-crossed association rules: 2% milk → Wonder wheat bread [3%, 60%]
    - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread [8%, 72%]
- 36. Multi-level Association: Redundancy Filtering
  - Some rules may be redundant due to "ancestor" relationships between items.
  - Example:
    - milk → wheat bread [s=8%, c=70%]
    - 2% milk → wheat bread [s=2%, c=72%]
  - We say the first rule is an ancestor of the second rule.
  - A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
- 37. Multi-Level Mining: Progressive Deepening
  - A top-down, progressive deepening approach:
    - First mine high-level frequent items: milk (15%), bread (10%)
    - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
  - Different min_support thresholds across the levels lead to different algorithms:
    - If the same min_support is adopted across all levels, then toss itemset t if any of t's ancestors is infrequent.
    - If a reduced min_support is adopted at lower levels, then examine only those descendants whose ancestor's support is frequent or non-negligible.
- 38. Problem of Confidence
  - Example (Aggarwal & Yu, PODS98): among 5000 students,
    - 3000 play basketball
    - 3750 eat cereal
    - 2000 both play basketball and eat cereal
  - play basketball → eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  - play basketball → not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.

  | | basketball | not basketball | sum(row) |
  |---|---|---|---|
  | cereal | 2000 | 1750 | 3750 |
  | not cereal | 1000 | 250 | 1250 |
  | sum(col.) | 3000 | 2000 | 5000 |
- 39. Interest/Lift/Correlation
  - Interest (or lift, correlation) takes both P(A) and P(B) into consideration:

    $$\mathrm{lift}(A, B) = \frac{P(A \wedge B)}{P(A)\,P(B)}$$

  - P(A ∧ B) = P(A) P(B) if A and B are independent events
  - A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

  | | basketball | not basketball | sum(row) |
  |---|---|---|---|
  | cereal | 2000 | 1750 | 3750 |
  | not cereal | 1000 | 250 | 1250 |
  | sum(col.) | 3000 | 2000 | 5000 |

  - Lift(play basketball → eat cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.889 ≈ 0.89
  - Lift(play basketball → not eat cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
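The two lift values follow directly from the contingency table. The short Python sketch below recomputes them from the raw counts; the function name `lift` and the variable names are illustrative.

```python
# Counts from the contingency table.
n_total = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000  # play basketball and eat cereal

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

lift_cereal = lift(n_both, n_basketball, n_cereal, n_total)
lift_not_cereal = lift(
    n_basketball - n_both,   # basketball players who do not eat cereal
    n_basketball,
    n_total - n_cereal,      # all students who do not eat cereal
    n_total,
)
print(f"lift(basketball -> cereal)     = {lift_cereal:.3f}")
print(f"lift(basketball -> not cereal) = {lift_not_cereal:.3f}")
```

A value below 1 for the first rule and above 1 for the second is what marks the high-confidence rule as the misleading one.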
- 40. Conviction
  - Conviction (Brin, 1997):

    $$\mathrm{conv}(A \rightarrow B) = \frac{1 - \mathrm{Support}(B)}{1 - \mathrm{Confidence}(A \rightarrow B)}, \qquad 0 \le \mathrm{conv}(A \rightarrow B) \le \infty$$

  - A and B are statistically independent if and only if conv(A → B) = 1
  - 0 < conv(A → B) < 1 if and only if p(B|A) < p(B): B is negatively correlated with A.
  - 1 < conv(A → B) < ∞ if and only if p(B|A) > p(B): B is positively correlated with A.

  | | basketball | not basketball | sum(row) |
  |---|---|---|---|
  | cereal | 2000 | 1750 | 3750 |
  | not cereal | 1000 | 250 | 1250 |
  | sum(col.) | 3000 | 2000 | 5000 |

  - conviction(play basketball → eat cereal) = (1 − 3750/5000) / (1 − 0.667) = 0.75
  - conviction(play basketball → not eat cereal) = (1 − 1250/5000) / (1 − 0.333) = 1.125
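The conviction formula, (1 − Support(B)) / (1 − Confidence(A → B)), can be evaluated on the same contingency table with a small Python sketch. The function name `conviction` is illustrative, and returning infinity when the confidence is 1 is a convention adopted here for the degenerate case.

```python
import math

n_total = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000  # play basketball and eat cereal

def conviction(support_b, confidence_ab):
    """(1 - support(B)) / (1 - confidence(A -> B)); infinite if confidence is 1."""
    if confidence_ab == 1:
        return math.inf
    return (1 - support_b) / (1 - confidence_ab)

# Rule: play basketball -> eat cereal
conv_cereal = conviction(n_cereal / n_total, n_both / n_basketball)

# Rule: play basketball -> not eat cereal
conv_not_cereal = conviction(
    (n_total - n_cereal) / n_total,
    (n_basketball - n_both) / n_basketball,
)
print(conv_cereal, conv_not_cereal)
```

Values below 1 indicate that the consequent is negatively correlated with the antecedent, and values above 1 indicate positive correlation, matching the conditions listed on the slide.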
- 41. From Association Mining to Correlation Analysis
  - Ex. Strong rules are not necessarily interesting
  - Of 10000 transactions:
    - 6000 customer transactions include computer games
    - 7500 customer transactions include videos
    - 4000 customer transactions include both computer games and videos
  - Suppose that a data mining program for discovering association rules is run on the data, using a min_sup of 30% and a min_conf of 60%
  - The following association rule is discovered:
    - buys(X, "computer games") → buys(X, "videos") [s=40%, c=66%], where s = 4000/10000 and c = 4000/6000
  - Figure: Venn diagram of "games" and "videos" with 4,000 transactions in the overlap
- 42. A misleading "strong" association rule
  - buys(X, "computer games") → buys(X, "videos") [support=40%, confidence=66%]
  - This rule is misleading because the probability of purchasing videos is 75% (>66%)
  - In fact, computer games and videos are negatively associated because the purchase of one of these items actually decreases the likelihood of purchasing the other. Therefore, we could easily make unwise business decisions based on this rule.
- 43. From Association Analysis to Correlation Analysis
  - To help filter out misleading "strong" associations, use correlation rules: A → B [support, confidence, correlation]
  - Lift is a simple correlation measure, given as follows:
    - The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∧ B) = P(A) P(B); otherwise, itemsets A and B are dependent and correlated

    $$\mathrm{lift}(A, B) = \frac{P(A \wedge B)}{P(A)\,P(B)} = \frac{P(B \mid A)}{P(B)} = \frac{\mathrm{conf}(A \rightarrow B)}{\mathrm{sup}(B)}$$

    - If lift(A, B) < 1, then the occurrence of A is negatively correlated with the occurrence of B
    - If lift(A, B) > 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
- 44. From Association Analysis to Correlation Analysis (Cont.)
  - Ex. Correlation analysis using lift
    - buys(X, "computer games") → buys(X, "videos") [support=40%, confidence=66%]
    - The lift of this rule is P{game, video} / (P{game} × P{video}) = 0.40 / (0.6 × 0.75) = 0.89
    - There is a negative correlation between the occurrence of {game} and {video}
  - Ex. Is the following rule misleading?
    - "Buy walnuts → Buy milk [1%, 80%]" if 85% of customers buy milk
- 45. Homework
  - Given the transactional database below, which is a log file recording the web pages visited by each user over a period of time, find reliable association rules. Assume that you are the data analyst and may set the minimum support and minimum confidence yourself; explain the reasoning behind the values you choose. Also check whether the resulting rules are misleading, and if any are, describe how to correct them.

  | TID | List of items |
  |---|---|
  | T001 | P1, P2, P3, P4 |
  | T002 | P3, P6 |
  | T003 | P2, P5, P1 |
  | T004 | P5, P4, P3, P6 |
  | T005 | P1, P3, P4, P2 |

  - Figure: graph of the web pages P1, P2, P3, P4, P5, P6
- 46. Quiz I: Feb 26, 2011 (14:00)
  - Star-net Query (Multidimensional Table)
  - Data Cube Computation (Memory Calculation)
  - Data Preprocessing (Normalization, Smoothing by binning)
  - Association Rule Mining
