Association Rule Basics
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi
  1. Association Rule Basics (Data Mining and Text Mining, UIC 583 @ Politecnico di Milano)
  2. Lecture outline
     • What is association rule mining?
     • Frequent itemsets, support, and confidence
     • Mining association rules
     • The Apriori algorithm
     • Rule generation
  3. Example of market-basket transactions
     TID  Items
     1    Bread, Peanuts, Milk, Fruit, Jam
     2    Bread, Jam, Soda, Chips, Milk, Fruit
     3    Steak, Jam, Soda, Chips, Bread
     4    Jam, Soda, Peanuts, Milk, Fruit
     5    Jam, Soda, Chips, Milk, Bread
     6    Fruit, Soda, Chips, Milk
     7    Fruit, Soda, Peanuts, Milk
     8    Fruit, Peanuts, Cheese, Yogurt
  4. What is association mining?
     • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
     • Applications: basket data analysis, cross-marketing, catalog design, ...
  5. What is association rule mining?
     • Given a set of transactions (the table from slide 3), find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
     • Example rules: {bread} → {milk}, {soda} → {chips}, {bread} → {jam}
  6. Definition: Frequent Itemset
     • Itemset: a collection of one or more items, e.g., {Milk, Bread, Jam}; a k-itemset is an itemset that contains k items
     • Support count σ: frequency of occurrence of an itemset, e.g., over the transactions of slide 3, σ({Milk, Bread}) = 3 and σ({Soda, Chips}) = 4
     • Support s: fraction of transactions that contain an itemset, e.g., s({Milk, Bread}) = 3/8 and s({Soda, Chips}) = 4/8
     • Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold
  7. What is an association rule?
     • An implication of the form X → Y, where X and Y are itemsets, e.g., {bread} → {milk}
     • Rule evaluation metrics: support and confidence
     • Support (s): fraction of transactions that contain both X and Y, i.e., s(X → Y) = σ(X ∪ Y) / |T|
     • Confidence (c): measures how often items in Y appear in transactions that contain X, i.e., c(X → Y) = σ(X ∪ Y) / σ(X)
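A minimal Python sketch of these definitions, run against the transaction table above; the helper names (`support_count`, `support`, `confidence`) are illustrative, not from the slides:

```python
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y) / support_count(X)

print(support_count({"Milk", "Bread"}))   # 3
print(support({"Soda", "Chips"}))         # 4/8 = 0.5
print(confidence({"Bread"}, {"Milk"}))    # 3/4 = 0.75
```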
  8. Support and Confidence
     [Figure: Venn diagram of customers who buy Bread, customers who buy Milk, and customers who buy both]
  9. What is the goal?
     • Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup and confidence ≥ minconf
     • Brute-force approach: list all possible association rules, compute the support and confidence of each rule, and prune the rules that fail either threshold
     • The brute-force approach is computationally prohibitive!
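A sketch of this brute-force approach, continuing from the previous snippet (it reuses `transactions`, `support`, and `confidence`); enumerating every itemset and every binary partition is exactly what makes it prohibitive, since with d items there are on the order of 3^d rules:

```python
from itertools import combinations

items = sorted(set().union(*transactions))

def brute_force(minsup, minconf):
    rules = []
    for k in range(2, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            # all rules from one itemset share its support,
            # so a low-support itemset can be skipped outright
            if support(itemset) < minsup:
                continue
            # every binary partition of the itemset is a candidate rule
            for i in range(1, k):
                for antecedent in combinations(candidate, i):
                    X = frozenset(antecedent)
                    Y = itemset - X
                    if confidence(X, Y) >= minconf:
                        rules.append((set(X), set(Y)))
    return rules

print(len(brute_force(minsup=0.5, minconf=0.8)))
```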
  10. Mining Association Rules
      {Bread, Jam} → {Milk}   s=0.4  c=0.75
      {Milk, Jam} → {Bread}   s=0.4  c=0.75
      {Bread} → {Milk, Jam}   s=0.4  c=0.75
      {Jam} → {Bread, Milk}   s=0.4  c=0.6
      {Milk} → {Bread, Jam}   s=0.4  c=0.5
      • All the above rules are binary partitions of the same itemset: {Milk, Bread, Jam}
      • Rules originating from the same itemset have identical support but can have different confidence
      • We can therefore decouple the support and confidence requirements!
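Continuing the same sketch, these numbers can be checked directly; all binary partitions of {Bread, Milk, Jam} share one support value (3/8 = 0.375, which the slide rounds to 0.4) but differ in confidence. The slide lists five of the six possible partitions:

```python
from itertools import combinations

itemset = frozenset({"Bread", "Milk", "Jam"})
print("s =", support(itemset))  # 0.375

# every non-empty proper subset X yields a rule X -> itemset - X
for i in range(1, len(itemset)):
    for antecedent in combinations(sorted(itemset), i):
        X = frozenset(antecedent)
        Y = itemset - X
        print(set(X), "->", set(Y), "c =", confidence(X, Y))
```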
  11. Mining Association Rules: Two-Step Approach
      • Frequent itemset generation: generate all itemsets whose support ≥ minsup
      • Rule generation: generate high-confidence rules from each frequent itemset; each rule is a binary partitioning of a frequent itemset
      • Frequent itemset generation is computationally expensive
  12. Frequent Itemset Generation
      [Figure: itemset lattice over five items A-E, from the null set through all 1-, 2-, 3-, and 4-itemsets up to ABCDE]
      Given d items, there are 2^d possible candidate itemsets
  13. Frequent Itemset Generation
      • Brute-force approach: each itemset in the lattice is a candidate frequent itemset; count the support of each candidate by scanning the database (N transactions, M candidates, maximum transaction width w)
      • Match each transaction against every candidate
      • Complexity ~ O(NMw), expensive since M = 2^d
  14. Computational Complexity
      • Given d unique items, the total number of itemsets is 2^d
      • Total number of possible association rules: R = 3^d - 2^(d+1) + 1
      • For d = 6, there are 602 rules
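A quick check of the closed form against direct enumeration (each of the C(d, k) itemsets of size k yields 2^k - 2 rules):

```python
from math import comb

def rule_count(d):
    """Closed form: R = 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

def rule_count_brute(d):
    """Sum over itemset sizes k: C(d, k) itemsets, each giving 2^k - 2 rules."""
    return sum(comb(d, k) * (2**k - 2) for k in range(2, d + 1))

print(rule_count(6), rule_count_brute(6))  # 602 602
```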
  15. Frequent Itemset Generation Strategies
      • Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M
      • Reduce the number of transactions (N): shrink N as the size of the itemsets increases
      • Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction
  16. Reducing the Number of Candidates
      • Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
      • The principle holds due to the following property of the support measure: for all itemsets X, Y, X ⊆ Y implies s(X) ≥ s(Y)
      • The support of an itemset never exceeds the support of its subsets; this is known as the anti-monotone property of support
  17. Illustrating the Apriori Principle
      [Figure: itemset lattice over A-E; one itemset is found to be infrequent, and all of its supersets are pruned from the search]
  18. How does the Apriori principle work? (minimum support count = 4)

      Items (1-itemsets):
      Item     Count
      Bread    4
      Peanuts  4
      Milk     6
      Fruit    6
      Jam      5
      Soda     6
      Chips    4
      Steak    1
      Cheese   1
      Yogurt   1

      Frequent 2-itemsets:
      2-Itemset         Count
      {Bread, Jam}      4
      {Peanuts, Fruit}  4
      {Milk, Fruit}     5
      {Milk, Jam}       4
      {Milk, Soda}      5
      {Fruit, Soda}     4
      {Jam, Soda}       4
      {Soda, Chips}     4

      Frequent 3-itemsets:
      3-Itemset            Count
      {Milk, Fruit, Soda}  4
  19. Apriori Algorithm
      • Let k = 1
      • Generate frequent itemsets of length 1
      • Repeat until no new frequent itemsets are identified:
        - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
        - Prune candidate itemsets containing subsets of length k that are infrequent
        - Count the support of each candidate by scanning the DB
        - Eliminate candidates that are infrequent, leaving only those that are frequent
  20. The Apriori Algorithm
      Ck: candidate itemsets of size k
      Lk: frequent itemsets of size k

      L1 = {frequent items};
      for (k = 1; Lk != ∅; k++) do begin
          Ck+1 = candidates generated from Lk;
          for each transaction t in database do
              increment the count of all candidates in Ck+1 that are contained in t
          Lk+1 = candidates in Ck+1 with min_support
      end
      return ∪k Lk;
  21. The Apriori Algorithm
      • Join step: Ck is generated by joining Lk-1 with itself
      • Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
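A runnable Python sketch of the algorithm described on the last three slides, assuming minsup is given as an absolute support count; on the example transactions with minsup = 4 it reproduces the itemsets of slide 18. The structure follows the pseudocode, but the names are mine:

```python
from itertools import combinations

transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def apriori(transactions, minsup):
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            fs = frozenset([item])
            counts[fs] = counts.get(fs, 0) + 1
    L = {fs: c for fs, c in counts.items() if c >= minsup}
    frequent = dict(L)
    k = 1
    while L:
        # Join step: merge frequent k-itemsets that share k-1 items
        candidates = set()
        for a in L:
            for b in L:
                union = a | b
                if len(union) == k + 1:
                    candidates.add(union)
        # Prune step: drop candidates with an infrequent k-subset
        candidates = {
            c for c in candidates
            if all(frozenset(s) in L for s in combinations(c, k))
        }
        # Count support of the surviving candidates with one DB scan
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(L)
        k += 1
    return frequent

result = apriori(transactions, minsup=4)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
```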
  22. Efficient Implementation of Apriori in SQL
      • Hard to get good performance out of pure SQL (SQL-92) based approaches alone
      • Making use of object-relational extensions such as UDFs, BLOBs, and table functions yields orders of magnitude of improvement
      • S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD '98
  23. How to Improve Apriori's Efficiency
      • Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent
      • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch after this list)
      • Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
      • Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine the completeness of the result
      • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
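As one illustration, a minimal sketch of transaction reduction (my own formulation of the idea, not code from the slides): a transaction containing no frequent k-itemset cannot contain any frequent (k+1)-itemset, since every k-subset of a frequent (k+1)-itemset must itself be frequent, so such a transaction can be dropped from later scans:

```python
def reduce_transactions(transactions, frequent_k):
    """Keep only transactions containing at least one frequent k-itemset;
    the others cannot support any frequent (k+1)-itemset.
    `frequent_k` is a collection of frozensets, e.g., one level of the
    `frequent` dict produced by the apriori sketch above."""
    return [t for t in transactions if any(fs <= t for fs in frequent_k)]
```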
  24. Rule Generation
      • Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement
      • If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC, AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
      • If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L)
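A small generator for the 2^k - 2 candidate rules of a frequent itemset (illustrative code, not from the slides):

```python
from itertools import combinations

def candidate_rules(L):
    """Yield all 2^|L| - 2 candidate rules f -> L - f from itemset L."""
    L = frozenset(L)
    for i in range(1, len(L)):
        for f in combinations(sorted(L), i):
            yield frozenset(f), L - frozenset(f)

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))  # 14 = 2^4 - 2
```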
  25. How to efficiently generate rules from frequent itemsets?
      • Confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
      • But the confidence of rules generated from the same itemset does have an anti-monotone property; e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
      • Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule
  26. Rule Generation for the Apriori Algorithm
      [Figure: lattice of the rules generated from the frequent itemset {A,B,C,D}, from ABCD → ∅ at the top, through BCD→A, ACD→B, ABD→C, ABC→D and the two-item consequents CD→AB, BD→AC, BC→AD, AD→BC, AC→BD, AB→CD, down to D→ABC, C→ABD, B→ACD, A→BCD. If CD → AB is found to be a low-confidence rule, the rules below it with larger consequents (e.g., D → ABC, C → ABD) are pruned]
  27. Rule Generation for the Apriori Algorithm
      • A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
      • Example: join(CD → AB, BD → AC) produces the candidate rule D → ABC
      • Prune rule D → ABC if its subset rule AD → BC does not have high confidence
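A sketch of this level-wise rule generation, assuming the `apriori` function and `transactions` from the earlier sketch; rules that fail minconf are not extended, which is exactly the consequent-side pruning the slide describes:

```python
def generate_rules(frequent, minconf):
    """`frequent` maps frozensets to support counts, as returned by apriori.
    Confidence anti-monotonicity on the consequent side justifies pruning:
    if L - H -> H fails minconf, no rule from L whose consequent is a
    superset of H can pass."""
    rules = []
    for L, count in frequent.items():
        if len(L) < 2:
            continue
        consequents = [frozenset([i]) for i in L]  # start with 1-item consequents
        while consequents:
            confident = []
            for H in consequents:
                X = L - H
                conf = count / frequent[X]
                if conf >= minconf:
                    rules.append((set(X), set(H), conf))
                    confident.append(H)  # only confident rules are extended
            # merge same-size confident consequents that differ in one item
            merged = set()
            for a in confident:
                for b in confident:
                    u = a | b
                    if len(u) == len(a) + 1 and u < L:
                        merged.add(u)
            consequents = merged
    return rules

for X, H, conf in generate_rules(apriori(transactions, 4), minconf=0.75):
    print(X, "->", H, f"c={conf:.2f}")
```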
  28. Effect of Support Distribution
      • Many real data sets have a skewed support distribution
      [Figure: support distribution of a retail data set]
  29. How to set the appropriate minsup?
      • If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
      • If minsup is set too low, mining becomes computationally expensive and the number of itemsets becomes very large
      • A single minimum support threshold may not be effective