SlideShare a Scribd company logo
1 of 68
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
1
Wu-Jun Li
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Lecture 3: Frequent Itemsets and Association Rules
Mining Massive Datasets
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
2
Outline
 Association rules
 A-Priori algorithm
 Large-scale algorithms
2
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
3
The Market-Basket Model
 A large set of items, e.g., things sold in a
supermarket.
 A large set of baskets, each of which is a small set of
the items, e.g., the things one customer buys on one
day.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
4
Market-Baskets – (2)
 Really a general many-many mapping (association)
between two kinds of things.
 But we ask about connections among “items,” not
“baskets.”
 The technology focuses on common events, not rare
events (“long tail”).
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
5
Association Rule Discovery
 Goal: To identify items that are bought together by
sufficiently many customers, and find dependencies
among items
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
6
Support
 Simplest question: find sets of items that appear
“frequently” in the baskets.
 Support for itemset I = the number of baskets
containing all items in I.
 Sometimes given as a percentage.
 Given a support threshold s, sets of items that
appear in at least s baskets are called frequent
itemsets.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
7
Example: Frequent Itemsets
 Items={milk, coke, pepsi, beer, juice}.
 Support threshold = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 Frequent itemsets: {m}, {c}, {b}, {j},
, {b,c} , {c,j}.{m,b}
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
8
Applications – (1)
 Items = products; baskets = sets of products
someone bought in one trip to the store.
 Example application: given that many people buy
beer and diapers together:
 Run a sale on diapers; raise price of beer.
 Only useful if many buy diapers & beer.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
9
Applications – (2)
 Baskets = sentences; items = documents containing
those sentences.
 Items that appear together too often could
represent plagiarism.
 Notice items do not have to be “in” baskets.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
10
Applications – (3)
 Baskets = Web pages; items = words.
 Unusual words appearing together in a large number
of documents, e.g., “Brad” and “Angelina,” may
indicate an interesting relationship.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
11
Scale of the Problem
 WalMart sells 100,000 items and can store billions of
baskets.
 The Web has billions of words and many billions of
pages.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
12
Association Rules
 If-then rules about the contents of baskets.
 {i1, i2,…,ik} → j means: “if a basket contains all of i1,
…,ik then it is likely to contain j.”
 Confidence of this association rule is the probability
of j given i1,…,ik.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
13
Example: Confidence
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 An association rule: {m, b} → c.
 Confidence = 2/4 = 50%.
+
_
_
+
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
14
Finding Association Rules
 Question: “find all association rules with support ≥
s and confidence ≥ c .”
 Note: “support” of an association rule is the support of
the set of items on the left.
 Hard part: finding the frequent itemsets.
 Note: if {i1, i2,…,ik} → j has high support and confidence,
then both {i1, i2,…,ik} and {i1, i2,…,ik,j } will be
“frequent.”
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
15
Computation Model
 Typically, data is kept in flat files rather than in a
database system.
 Stored on disk.
 Stored basket-by-basket.
 Expand baskets into pairs, triples, etc. as you read
baskets.
 Use k nested loops to generate all sets of size k.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
16
File Organization
Item
Item
Item
Item
Item
Item
Item
Item
Item
Item
Item
Item
Basket 1
Basket 2
Basket 3
Etc.
Example: items are
positive integers,
and boundaries
between baskets
are –1.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
17
Computation Model – (2)
 The true cost of mining disk-resident data is usually
the number of disk I/O’s.
 In practice, association-rule algorithms read the data
in passes – all baskets read in turn.
 Thus, we measure the cost by the number of passes
an algorithm takes.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
18
Main-Memory Bottleneck
 For many frequent-itemset algorithms, main memory
is the critical resource.
 As we read baskets, we need to count something, e.g.,
occurrences of pairs.
 The number of different things we can count is limited by
main memory.
 Swapping counts in/out is a disaster (why?).
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
19
Finding Frequent Pairs
 The hardest problem often turns out to be finding
the frequent pairs.
 Why? Often frequent pairs are common, frequent
triples are rare.
 Why? Probability of being frequent drops exponentially with
size; number of sets grows more slowly with size.
 We’ll concentrate on pairs, then extend to larger
sets.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
20
Naïve Algorithm
 Read file once, counting in main memory the
occurrences of each pair.
 From each basket of n items, generate its n (n -1)/2
pairs by two nested loops.
 Fails if (#items)2
exceeds main memory.
 Remember: #items can be 100K (Wal-Mart) or 10B (Web
pages).
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
21
Example: Counting Pairs
 Suppose 105
items.
 Suppose counts are 4-byte integers.
 Number of pairs of items: 105
(105
-1)/2 = 5*109
(approximately).
 Therefore, 2*1010
(20 gigabytes) of main memory
needed.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
22
Details of Main-Memory Counting
 Two approaches:
(1) Count all pairs, using a triangular matrix.
(2) Keep a table of triples [i, j, c] = “the count of the pair of
items {i, j } is c.”
 (1) requires only 4 bytes/pair.
 Note: always assume integers are 4 bytes.
 (2) requires 12 bytes, but only for those pairs
with count > 0.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
23
4 bytes per pair
Method (1) Method (2)
12 bytes per
occurring pair
Association Rules
Details of Main-Memory Counting
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
24
Triangular-Matrix Approach – (1)
 Number items 1, 2,…
 Requires table of size O(n) to convert item names to
consecutive integers.
 Count {i, j } only if i < j.
 Keep pairs in the order {1,2}, {1,3},…, {1,n }, {2,3},
{2,4},…,{2,n }, {3,4},…, {3,n },…{n -1,n }.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
25
Triangular-Matrix Approach – (2)
 Find pair {i, j } at the position (i –1)(n –i /2) + j – i.
 Total number of pairs n (n –1)/2; total bytes
about 2n 2
.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
26
Details of Approach #2
 Total bytes used is about 12p, where p is the number
of pairs that actually occur.
 Beats triangular matrix if at most 1/3 of possible pairs
actually occur.
 May require extra space for retrieval structure, e.g., a
hash table.
Association Rules
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
27
Outline
 Association rules
 A-Priori algorithm
 Large-scale algorithms
27
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
28
A-Priori Algorithm – (1)
 A two-pass approach called a-priori limits the need
for main memory.
 Key idea: monotonicity : if a set of items appears at
least s times, so does every subset.
 Contrapositive for pairs: if item i does not appear in s
baskets, then no pair including i can appear in s baskets.
A-Priori Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
29
A-Priori Algorithm – (2)
 Pass 1: Read baskets and count in main memory
the occurrences of each item.
 Requires only memory proportional to #items.
 Items that appear at least s times are the
frequent items.
A-Priori Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
30
A-Priori Algorithm – (3)
 Pass 2: Read baskets again and count in main
memory only those pairs both of which were
found in Pass 1 to be frequent.
 Requires memory proportional to square of frequent
items only (for counts), plus a list of the frequent
items (so you know what must be counted).
A-Priori Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
31
Picture of A-Priori
Item counts
Pass 1 Pass 2
Frequent items
Counts of
pairs of
frequent
items
A-Priori Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
32
Detail for A-Priori
 You can use the triangular matrix method with n
= number of frequent items.
 May save space compared with storing triples.
 Trick: number frequent items 1,2,… and keep a
table relating new numbers to original item
numbers.
A-Priori Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
33
A-Priori Using Triangular Matrix for
Counts
Item counts
Pass 1 Pass 2
1. Freq- Old
2. quent item
… items #’s
Counts of
pairs of
frequent
items
A-Priori Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
34
 For each k, we construct two sets of k -sets (sets of
size k ):
 Ck = candidate k -sets = those that might be frequent sets
(support > s ) based on information from the pass for k –1.
 Lk = the set of truly frequent k -sets.
A-Priori Algorithm
A-Priori for All Frequent Itemsets
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
35
C1 L1 C2 L2 C3
Filter Filter ConstructConstruct
First
pass
Second
pass
All
items
All pairs
of items
from L1
Count
the pairs
To be
explained
Count
the items
Frequent
items
Frequent
pairs
A-Priori Algorithm
A-Priori for All Frequent Itemsets
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
36
A-Priori for All Frequent Itemsets
 One pass for each k.
 Needs room in main memory to count each
candidate k -set.
 For typical market-basket data and reasonable
support (e.g., 1%), k = 2 requires the most memory.
A-Priori Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
37
 C1 = all items
 In general, Lk = members of Ck with support ≥ s.
 Ck+1 = (k +1) -sets, each k of which is in Lk.
A-Priori Algorithm
A-Priori for All Frequent Itemsets
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
38
Outline
 Association rules
 A-Priori algorithm
 Large-scale algorithms
38
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
39
 During Pass 1 of A-priori, most memory is idle.
 Use that memory to keep counts of buckets into which
pairs of items are hashed.
 Just the count, not the pairs themselves.
Large-Scale Algorithms
PCY Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
40
1. Pairs of items need to be generated from the input
file; they are not present in the file.
2. We are not just interested in the presence of a
pair, but we need to see whether it is present at
least s (support) times.
Large-Scale Algorithms
PCY Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
41
PCY Algorithm
 A bucket is frequent if its count is at least the
support threshold.
 If a bucket is not frequent, no pair that hashes to
that bucket could possibly be a frequent pair.
 On Pass 2, we only count pairs that hash to frequent
buckets.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
42
Picture of PCY
Hash
table
Item counts
Bitmap
Pass 1 Pass 2
Frequent items
Counts of
candidate
pairs
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
43
 Space to count each item.
 One (typically) 4-byte integer per item.
 Use the rest of the space for as many integers,
representing buckets, as we can.
Large-Scale Algorithms
PCY Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
44
PCY Algorithm – Pass 1
FOR (each basket) {
FOR (each item in the basket)
add 1 to item’s count;
FOR (each pair of items) {
hash the pair to a bucket;
add 1 to the count for that
bucket
}
}
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
45
1. A bucket that a frequent pair hashes to is surely frequent.
 We cannot use the hash table to eliminate any member of this
bucket.
1. Even without any frequent pair, a bucket can be frequent.
 Again, nothing in the bucket can be eliminated.
3. But in the best case, the count for a bucket is less than the
support s.
 Now, all pairs that hash to this bucket can be eliminated as
candidates, even if the pair consists of two frequent items.
 Thought question: under what conditions can we be sure
most buckets will be in case 3?
Large-Scale Algorithms
PCY Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
46
PCY Algorithm – Between Passes
 Replace the buckets by a bit-vector:
 1 means the bucket is frequent; 0 means it is not.
 4-byte integers are replaced by bits, so the bit-vector
requires 1/32 of memory.
 Also, decide which items are frequent and list them
for the second pass.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
47
PCY Algorithm – Pass 2
 Count all pairs {i, j } that meet the conditions for
being a candidate pair:
1. Both i and j are frequent items.
2. The pair {i, j }, hashes to a bucket number whose bit in
the bit vector is 1.
 Notice all these conditions are necessary for the
pair to have a chance of being frequent.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
48
 Buckets require a few bytes each.
 Note: we don’t have to count past s.
 # buckets is O(main-memory size).
 On second pass, a table of (item, item, count) triples
is essential (why?).
 Thus, hash table must eliminate 2/3 of the candidate pairs
for PCY to beat a-priori.
Large-Scale Algorithms
PCY Algorithm
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
49
Multistage Algorithm
 Key idea: After Pass 1 of PCY, rehash only those pairs
that qualify for Pass 2 of PCY.
 On middle pass, fewer pairs contribute to buckets, so
fewer false positives –frequent buckets with no
frequent pair.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
50
Multistage Picture
First
hash table
Second
hash table
Item counts
Bitmap 1 Bitmap 1
Bitmap 2
Freq. items Freq. items
Counts of
candidate
pairs
Pass 1 Pass 2 Pass 3
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
51
Multistage – Pass 3
 Count only those pairs {i, j } that satisfy these
candidate pair conditions:
1. Both i and j are frequent items.
2. Using the first hash function, the pair hashes to a
bucket whose bit in the first bit-vector is 1.
3. Using the second hash function, the pair hashes to a
bucket whose bit in the second bit-vector is 1.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
52
Important Points
1. The two hash functions have to be independent.
2. We need to check both hashes on the third pass.
 If not, we would wind up counting pairs of frequent
items that hashed first to a frequent bucket but
happened to hash second to an infrequent bucket.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
53
Multihash
 Key idea: use several independent hash tables on
the first pass.
 Risk: halving the number of buckets doubles the
average count. We have to be sure most buckets
will still not reach count s.
 If so, we can get a benefit like multistage, but in
only 2 passes.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
54
Multihash Picture
First hash
table
Second
hash table
Item counts
Bitmap 1
Bitmap 2
Freq. items
Counts of
candidate
pairs
Pass 1 Pass 2
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
55
Extensions
 Either multistage or multihash can use more than two
hash functions.
 In multistage, there is a point of diminishing returns,
since the bit-vectors eventually consume all of main
memory.
 For multihash, the bit-vectors occupy exactly what one
PCY bitmap does, but too many hash functions makes
all counts > s.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
56
All (Or Most) Frequent Itemsets In < 2
Passes
 A-Priori, PCY, etc., take k passes to find frequent
itemsets of size k.
 Other techniques use 2 or fewer passes for all
sizes:
 Simple algorithm.
 SON (Savasere, Omiecinski, and Navathe).
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
57
Simple Algorithm – (1)
 Take a random sample of the market baskets.
 Run a-priori or one of its improvements (for sets of
all sizes, not just pairs) in main memory, so you
don’t pay for disk I/O each time you increase the
size of itemsets.
 Be sure you leave enough space for counts.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
58
Main-Memory Picture
Copy of
sample
baskets
Space
for
counts
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
59
Simple Algorithm – (2)
 Use as your support threshold a suitable, scaled-
back number.
 E.g., if your sample is 1/100 of the baskets, use s /100
as your support threshold instead of s .
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
60
Simple Algorithm – Option
 Optionally, verify that your guesses are truly
frequent in the entire data set by a second pass.
 But you don’t catch sets frequent in the whole
but not in the sample.
 Smaller threshold, e.g., s /125, helps catch more truly
frequent itemsets.
 But requires more space.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
61
SON Algorithm – (1)
 Repeatedly read small subsets of the baskets into
main memory and perform the first pass of the
simple algorithm on each subset.
 An itemset becomes a candidate if it is found to be
frequent in any one or more subsets of the baskets.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
62
SON Algorithm – (2)
 On a second pass, count all the candidate itemsets
and determine which are frequent in the entire set.
 Key “monotonicity” idea: an itemset cannot be
frequent in the entire set of baskets unless it is
frequent in at least one subset.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
63
SON Algorithm – Distributed Version
 This idea lends itself to distributed data mining.
 If baskets are distributed among many nodes,
compute frequent itemsets at each node, then
distribute the candidates from each node.
 Finally, accumulate the counts of all candidates.
Large-Scale Algorithms
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
64
 First MapReduce
 Map:
 Reduce:
 Second MapReduce
 Map:
 Reduce:
SON Algorithm – Distributed Version
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
65
Compacting the Output
1. Maximal Frequent itemsets : no immediate
superset is frequent.
2. Closed itemsets : no immediate superset has the
same count (> 0).
 Stores not only frequent information, but exact counts.
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
66
Example: Maximal/Closed
Count Maximal (s=3) Closed
A 4 No No
B 5 No Yes
C 3 No No
AB 4 Yes Yes
AC 2 No No
BC 3 Yes Yes
ABC 2 No Yes
Frequent, but
superset BC
also frequent.
Frequent, and
its only superset,
ABC, not freq.
Superset BC
has same count.
Its only super-
set, ABC, has
smaller count.
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
67
Questions?
Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules
68
Acknowledgement
 Slides are from:
 Prof. Jeffrey D. Ullman
 Dr. Jure Leskovec

More Related Content

Similar to Lecture3 assoc rules

20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.ppt20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.pptSamPrem3
 
Cs501 mining frequentpatterns
Cs501 mining frequentpatternsCs501 mining frequentpatterns
Cs501 mining frequentpatternsKamal Singh Lodhi
 
Data mining techniques unit III
Data mining techniques unit IIIData mining techniques unit III
Data mining techniques unit IIImalathieswaran29
 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesRashmi Bhat
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
Apriori Algorithm.pptx
Apriori Algorithm.pptxApriori Algorithm.pptx
Apriori Algorithm.pptxRashi Agarwal
 
CS 402 DATAMINING AND WAREHOUSING -MODULE 5
CS 402 DATAMINING AND WAREHOUSING -MODULE 5CS 402 DATAMINING AND WAREHOUSING -MODULE 5
CS 402 DATAMINING AND WAREHOUSING -MODULE 5NIMMYRAJU
 
Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15Shani729
 
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...ijsrd.com
 
1. Introduction to Association Rule 2. Frequent Item Set Mining 3. Market Bas...
1. Introduction to Association Rule2. Frequent Item Set Mining3. Market Bas...1. Introduction to Association Rule2. Frequent Item Set Mining3. Market Bas...
1. Introduction to Association Rule 2. Frequent Item Set Mining 3. Market Bas...Surabhi Gosavi
 

Similar to Lecture3 assoc rules (20)

20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.ppt20IT501_DWDM_U3.ppt
20IT501_DWDM_U3.ppt
 
Cs501 mining frequentpatterns
Cs501 mining frequentpatternsCs501 mining frequentpatterns
Cs501 mining frequentpatterns
 
Data mining techniques unit III
Data mining techniques unit IIIData mining techniques unit III
Data mining techniques unit III
 
B0950814
B0950814B0950814
B0950814
 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association Rules
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Apriori Algorithm.pptx
Apriori Algorithm.pptxApriori Algorithm.pptx
Apriori Algorithm.pptx
 
Lecture20
Lecture20Lecture20
Lecture20
 
CS 402 DATAMINING AND WAREHOUSING -MODULE 5
CS 402 DATAMINING AND WAREHOUSING -MODULE 5CS 402 DATAMINING AND WAREHOUSING -MODULE 5
CS 402 DATAMINING AND WAREHOUSING -MODULE 5
 
Dma unit 2
Dma unit  2Dma unit  2
Dma unit 2
 
Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15
 
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
 
DMML4_apriori (1).ppt
DMML4_apriori (1).pptDMML4_apriori (1).ppt
DMML4_apriori (1).ppt
 
1. Introduction to Association Rule 2. Frequent Item Set Mining 3. Market Bas...
1. Introduction to Association Rule2. Frequent Item Set Mining3. Market Bas...1. Introduction to Association Rule2. Frequent Item Set Mining3. Market Bas...
1. Introduction to Association Rule 2. Frequent Item Set Mining 3. Market Bas...
 
Assosiate rule mining
Assosiate rule miningAssosiate rule mining
Assosiate rule mining
 
Assocrules
AssocrulesAssocrules
Assocrules
 
Associative Learning
Associative LearningAssociative Learning
Associative Learning
 
Data Mining Lecture_3.pptx
Data Mining Lecture_3.pptxData Mining Lecture_3.pptx
Data Mining Lecture_3.pptx
 
Rmining
RminingRmining
Rmining
 
06FPBasic.ppt
06FPBasic.ppt06FPBasic.ppt
06FPBasic.ppt
 

Recently uploaded

Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 

Recently uploaded (20)

Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 

Lecture3 assoc rules

  • 1. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 3: Frequent Itemsets and Association Rules Mining Massive Datasets
  • 2. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 2 Outline  Association rules  A-Priori algorithm  Large-scale algorithms 2
  • 3. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 3 The Market-Basket Model  A large set of items, e.g., things sold in a supermarket.  A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day. Association Rules
  • 4. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 4 Market-Baskets – (2)  Really a general many-many mapping (association) between two kinds of things.  But we ask about connections among “items,” not “baskets.”  The technology focuses on common events, not rare events (“long tail”). Association Rules
  • 5. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 5 Association Rule Discovery  Goal: To identify items that are bought together by sufficiently many customers, and find dependencies among items Association Rules
  • 6. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 6 Support  Simplest question: find sets of items that appear “frequently” in the baskets.  Support for itemset I = the number of baskets containing all items in I.  Sometimes given as a percentage.  Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets. Association Rules
  • 7. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 7 Example: Frequent Itemsets  Items={milk, coke, pepsi, beer, juice}.  Support threshold = 3 baskets. B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c}  Frequent itemsets: {m}, {c}, {b}, {j}, , {b,c} , {c,j}.{m,b} Association Rules
  • 8. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 8 Applications – (1)  Items = products; baskets = sets of products someone bought in one trip to the store.  Example application: given that many people buy beer and diapers together:  Run a sale on diapers; raise price of beer.  Only useful if many buy diapers & beer. Association Rules
  • 9. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 9 Applications – (2)  Baskets = sentences; items = documents containing those sentences.  Items that appear together too often could represent plagiarism.  Notice items do not have to be “in” baskets. Association Rules
  • 10. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 10 Applications – (3)  Baskets = Web pages; items = words.  Unusual words appearing together in a large number of documents, e.g., “Brad” and “Angelina,” may indicate an interesting relationship. Association Rules
  • 11. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 11 Scale of the Problem  WalMart sells 100,000 items and can store billions of baskets.  The Web has billions of words and many billions of pages. Association Rules
  • 12. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 12 Association Rules  If-then rules about the contents of baskets.  {i1, i2,…,ik} → j means: “if a basket contains all of i1, …,ik then it is likely to contain j.”  Confidence of this association rule is the probability of j given i1,…,ik. Association Rules
  • 13. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 13 Example: Confidence B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c}  An association rule: {m, b} → c.  Confidence = 2/4 = 50%. + _ _ + Association Rules
  • 14. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 14 Finding Association Rules  Question: “find all association rules with support ≥ s and confidence ≥ c .”  Note: “support” of an association rule is the support of the set of items on the left.  Hard part: finding the frequent itemsets.  Note: if {i1, i2,…,ik} → j has high support and confidence, then both {i1, i2,…,ik} and {i1, i2,…,ik,j } will be “frequent.” Association Rules
  • 15. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 15 Computation Model  Typically, data is kept in flat files rather than in a database system.  Stored on disk.  Stored basket-by-basket.  Expand baskets into pairs, triples, etc. as you read baskets.  Use k nested loops to generate all sets of size k. Association Rules
  • 16. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 16 File Organization Item Item Item Item Item Item Item Item Item Item Item Item Basket 1 Basket 2 Basket 3 Etc. Example: items are positive integers, and boundaries between baskets are –1. Association Rules
  • 17. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 17 Computation Model – (2)  The true cost of mining disk-resident data is usually the number of disk I/O’s.  In practice, association-rule algorithms read the data in passes – all baskets read in turn.  Thus, we measure the cost by the number of passes an algorithm takes. Association Rules
  • 18. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 18 Main-Memory Bottleneck  For many frequent-itemset algorithms, main memory is the critical resource.  As we read baskets, we need to count something, e.g., occurrences of pairs.  The number of different things we can count is limited by main memory.  Swapping counts in/out is a disaster (why?). Association Rules
  • 19. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 19 Finding Frequent Pairs  The hardest problem often turns out to be finding the frequent pairs.  Why? Often frequent pairs are common, frequent triples are rare.  Why? Probability of being frequent drops exponentially with size; number of sets grows more slowly with size.  We’ll concentrate on pairs, then extend to larger sets. Association Rules
  • 20. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 20 Naïve Algorithm  Read file once, counting in main memory the occurrences of each pair.  From each basket of n items, generate its n (n -1)/2 pairs by two nested loops.  Fails if (#items)2 exceeds main memory.  Remember: #items can be 100K (Wal-Mart) or 10B (Web pages). Association Rules
  • 21. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 21 Example: Counting Pairs  Suppose 105 items.  Suppose counts are 4-byte integers.  Number of pairs of items: 105 (105 -1)/2 = 5*109 (approximately).  Therefore, 2*1010 (20 gigabytes) of main memory needed. Association Rules
  • 22. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 22 Details of Main-Memory Counting  Two approaches: (1) Count all pairs, using a triangular matrix. (2) Keep a table of triples [i, j, c] = “the count of the pair of items {i, j } is c.”  (1) requires only 4 bytes/pair.  Note: always assume integers are 4 bytes.  (2) requires 12 bytes, but only for those pairs with count > 0. Association Rules
  • 23. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 23 4 bytes per pair Method (1) Method (2) 12 bytes per occurring pair Association Rules Details of Main-Memory Counting
  • 24. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 24 Triangular-Matrix Approach – (1)  Number items 1, 2,…  Requires table of size O(n) to convert item names to consecutive integers.  Count {i, j } only if i < j.  Keep pairs in the order {1,2}, {1,3},…, {1,n }, {2,3}, {2,4},…,{2,n }, {3,4},…, {3,n },…{n -1,n }. Association Rules
  • 25. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 25 Triangular-Matrix Approach – (2)  Find pair {i, j } at the position (i –1)(n –i /2) + j – i.  Total number of pairs n (n –1)/2; total bytes about 2n 2 . Association Rules
  • 26. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 26 Details of Approach #2  Total bytes used is about 12p, where p is the number of pairs that actually occur.  Beats triangular matrix if at most 1/3 of possible pairs actually occur.  May require extra space for retrieval structure, e.g., a hash table. Association Rules
  • 27. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 27 Outline  Association rules  A-Priori algorithm  Large-scale algorithms 27
  • 28. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 28 A-Priori Algorithm – (1)  A two-pass approach called a-priori limits the need for main memory.  Key idea: monotonicity : if a set of items appears at least s times, so does every subset.  Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets. A-Priori Algorithm
  • 29. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 29 A-Priori Algorithm – (2)  Pass 1: Read baskets and count in main memory the occurrences of each item.  Requires only memory proportional to #items.  Items that appear at least s times are the frequent items. A-Priori Algorithm
  • 30. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 30 A-Priori Algorithm – (3)  Pass 2: Read baskets again and count in main memory only those pairs both of which were found in Pass 1 to be frequent.  Requires memory proportional to square of frequent items only (for counts), plus a list of the frequent items (so you know what must be counted). A-Priori Algorithm
  • 31. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 31 Picture of A-Priori Item counts Pass 1 Pass 2 Frequent items Counts of pairs of frequent items A-Priori Algorithm
  • 32. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 32 Detail for A-Priori  You can use the triangular matrix method with n = number of frequent items.  May save space compared with storing triples.  Trick: number frequent items 1,2,… and keep a table relating new numbers to original item numbers. A-Priori Algorithm
  • 33. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 33 A-Priori Using Triangular Matrix for Counts Item counts Pass 1 Pass 2 1. Freq- Old 2. quent item … items #’s Counts of pairs of frequent items A-Priori Algorithm
  • 34. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 34  For each k, we construct two sets of k -sets (sets of size k ):  Ck = candidate k -sets = those that might be frequent sets (support > s ) based on information from the pass for k –1.  Lk = the set of truly frequent k -sets. A-Priori Algorithm A-Priori for All Frequent Itemsets
  • 35. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 35 C1 L1 C2 L2 C3 Filter Filter ConstructConstruct First pass Second pass All items All pairs of items from L1 Count the pairs To be explained Count the items Frequent items Frequent pairs A-Priori Algorithm A-Priori for All Frequent Itemsets
  • 36. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 36 A-Priori for All Frequent Itemsets  One pass for each k.  Needs room in main memory to count each candidate k -set.  For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory. A-Priori Algorithm
  • 37. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 37  C1 = all items  In general, Lk = members of Ck with support ≥ s.  Ck+1 = (k +1) -sets, each k of which is in Lk. A-Priori Algorithm A-Priori for All Frequent Itemsets
  • 38. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 38 Outline  Association rules  A-Priori algorithm  Large-scale algorithms 38
  • 39. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 39  During Pass 1 of A-priori, most memory is idle.  Use that memory to keep counts of buckets into which pairs of items are hashed.  Just the count, not the pairs themselves. Large-Scale Algorithms PCY Algorithm
  • 40. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 40 1. Pairs of items need to be generated from the input file; they are not present in the file. 2. We are not just interested in the presence of a pair, but we need to see whether it is present at least s (support) times. Large-Scale Algorithms PCY Algorithm
  • 41. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 41 PCY Algorithm  A bucket is frequent if its count is at least the support threshold.  If a bucket is not frequent, no pair that hashes to that bucket could possibly be a frequent pair.  On Pass 2, we only count pairs that hash to frequent buckets. Large-Scale Algorithms
  • 42. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 42 Picture of PCY Hash table Item counts Bitmap Pass 1 Pass 2 Frequent items Counts of candidate pairs Large-Scale Algorithms
  • 43. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 43  Space to count each item.  One (typically) 4-byte integer per item.  Use the rest of the space for as many integers, representing buckets, as we can. Large-Scale Algorithms PCY Algorithm
  • 44. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 44 PCY Algorithm – Pass 1 FOR (each basket) { FOR (each item in the basket) add 1 to item’s count; FOR (each pair of items) { hash the pair to a bucket; add 1 to the count for that bucket } } Large-Scale Algorithms
  • 45. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 45 1. A bucket that a frequent pair hashes to is surely frequent.  We cannot use the hash table to eliminate any member of this bucket. 1. Even without any frequent pair, a bucket can be frequent.  Again, nothing in the bucket can be eliminated. 3. But in the best case, the count for a bucket is less than the support s.  Now, all pairs that hash to this bucket can be eliminated as candidates, even if the pair consists of two frequent items.  Thought question: under what conditions can we be sure most buckets will be in case 3? Large-Scale Algorithms PCY Algorithm
  • 46. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 46 PCY Algorithm – Between Passes  Replace the buckets by a bit-vector:  1 means the bucket is frequent; 0 means it is not.  4-byte integers are replaced by bits, so the bit-vector requires 1/32 of memory.  Also, decide which items are frequent and list them for the second pass. Large-Scale Algorithms
  • 47. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 47 PCY Algorithm – Pass 2  Count all pairs {i, j } that meet the conditions for being a candidate pair: 1. Both i and j are frequent items. 2. The pair {i, j }, hashes to a bucket number whose bit in the bit vector is 1.  Notice all these conditions are necessary for the pair to have a chance of being frequent. Large-Scale Algorithms
  • 48. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 48  Buckets require a few bytes each.  Note: we don’t have to count past s.  # buckets is O(main-memory size).  On second pass, a table of (item, item, count) triples is essential (why?).  Thus, hash table must eliminate 2/3 of the candidate pairs for PCY to beat a-priori. Large-Scale Algorithms PCY Algorithm
  • 49. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 49 Multistage Algorithm  Key idea: After Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.  On middle pass, fewer pairs contribute to buckets, so fewer false positives –frequent buckets with no frequent pair. Large-Scale Algorithms
  • 50. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 50 Multistage Picture First hash table Second hash table Item counts Bitmap 1 Bitmap 1 Bitmap 2 Freq. items Freq. items Counts of candidate pairs Pass 1 Pass 2 Pass 3 Large-Scale Algorithms
  • 51. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 51 Multistage – Pass 3  Count only those pairs {i, j } that satisfy these candidate pair conditions: 1. Both i and j are frequent items. 2. Using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1. 3. Using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1. Large-Scale Algorithms
  • 52. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 52 Important Points 1. The two hash functions have to be independent. 2. We need to check both hashes on the third pass.  If not, we would wind up counting pairs of frequent items that hashed first to a frequent bucket but happened to hash second to an infrequent bucket. Large-Scale Algorithms
  • 53. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 53 Multihash  Key idea: use several independent hash tables on the first pass.  Risk: halving the number of buckets doubles the average count. We have to be sure most buckets will still not reach count s.  If so, we can get a benefit like multistage, but in only 2 passes. Large-Scale Algorithms
  • 54. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 54 Multihash Picture First hash table Second hash table Item counts Bitmap 1 Bitmap 2 Freq. items Counts of candidate pairs Pass 1 Pass 2 Large-Scale Algorithms
  • 55. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 55 Extensions  Either multistage or multihash can use more than two hash functions.  In multistage, there is a point of diminishing returns, since the bit-vectors eventually consume all of main memory.  For multihash, the bit-vectors occupy exactly what one PCY bitmap does, but too many hash functions makes all counts > s. Large-Scale Algorithms
  • 56. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 56 All (Or Most) Frequent Itemsets In < 2 Passes  A-Priori, PCY, etc., take k passes to find frequent itemsets of size k.  Other techniques use 2 or fewer passes for all sizes:  Simple algorithm.  SON (Savasere, Omiecinski, and Navathe). Large-Scale Algorithms
  • 57. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 57 Simple Algorithm – (1)  Take a random sample of the market baskets.  Run a-priori or one of its improvements (for sets of all sizes, not just pairs) in main memory, so you don’t pay for disk I/O each time you increase the size of itemsets.  Be sure you leave enough space for counts. Large-Scale Algorithms
  • 58. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 58 Main-Memory Picture Copy of sample baskets Space for counts Large-Scale Algorithms
  • 59. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 59 Simple Algorithm – (2)  Use as your support threshold a suitable, scaled- back number.  E.g., if your sample is 1/100 of the baskets, use s /100 as your support threshold instead of s . Large-Scale Algorithms
  • 60. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 60 Simple Algorithm – Option  Optionally, verify that your guesses are truly frequent in the entire data set by a second pass.  But you don’t catch sets frequent in the whole but not in the sample.  Smaller threshold, e.g., s /125, helps catch more truly frequent itemsets.  But requires more space. Large-Scale Algorithms
  • 61. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 61 SON Algorithm – (1)  Repeatedly read small subsets of the baskets into main memory and perform the first pass of the simple algorithm on each subset.  An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets. Large-Scale Algorithms
  • 62. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 62 SON Algorithm – (2)  On a second pass, count all the candidate itemsets and determine which are frequent in the entire set.  Key “monotonicity” idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset. Large-Scale Algorithms
  • 63. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 63 SON Algorithm – Distributed Version  This idea lends itself to distributed data mining.  If baskets are distributed among many nodes, compute frequent itemsets at each node, then distribute the candidates from each node.  Finally, accumulate the counts of all candidates. Large-Scale Algorithms
  • 64. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 64  First MapReduce  Map:  Reduce:  Second MapReduce  Map:  Reduce: SON Algorithm – Distributed Version
  • 65. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 65 Compacting the Output 1. Maximal Frequent itemsets : no immediate superset is frequent. 2. Closed itemsets : no immediate superset has the same count (> 0).  Stores not only frequent information, but exact counts.
  • 66. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 66 Example: Maximal/Closed Count Maximal (s=3) Closed A 4 No No B 5 No Yes C 3 No No AB 4 Yes Yes AC 2 No No BC 3 Yes Yes ABC 2 No Yes Frequent, but superset BC also frequent. Frequent, and its only superset, ABC, not freq. Superset BC has same count. Its only super- set, ABC, has smaller count.
  • 67. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 67 Questions?
  • 68. Frequent Itemsets and Association RulesFrequent Itemsets and Association Rules 68 Acknowledgement  Slides are from:  Prof. Jeffrey D. Ullman  Dr. Jure Leskovec