Association rules
What is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database.
Motivation: finding regularities in data:
• What products were often purchased together? Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
Basic concepts:
Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum confidence and support:
• support, s: the probability that a transaction contains X ∪ Y
• confidence, c: the conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Let min_support = 50%, min_conf = 50%. We have:
1) A → C
   support = support({A} ∪ {C}) = 50%
   confidence = support({A} ∪ {C}) / support({A}) = 66.6%
2) C → A
   support = support({A} ∪ {C}) = 50%
   confidence = support({A} ∪ {C}) / support({C}) = 100%
Apriori: A Candidate Generation-and-Test Approach
Any subset of a frequent itemset must be frequent:
• if {beer, diaper, nuts} is frequent, so is {beer, diaper}
• every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested!
Method:
• generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
• test the candidates against the DB
The Apriori Algorithm: An Example (min_support = 2)

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1 (prune {D}, which is below min_support):
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2, generated from L1:
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2:
Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3: {B, C, E}
3rd scan → L3:
Itemset | sup
{B, C, E} | 2

Rules with frequency ≥ 50% and confidence 100%:
A → C
B → E
BC → E
CE → B
BE → C (note: this rule's confidence is actually 2/3 ≈ 66.7%, so it fails the 100% threshold)
The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Important Details of Apriori
• How to generate candidates?
  Step 1: self-joining Lk
  Step 2: pruning
• How to count supports of candidates?
Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
• abcd from abc and abd
• acde from acd and ace
Pruning:
• acde is removed because ade is not in L3
C4 = {abcd}
Important Details of Apriori
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
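The self-join and prune steps can be sketched directly, using the L3 example from the previous slide. Itemsets are represented as sorted tuples so the "last item of p < last item of q" join condition applies; the function name is illustrative:

```python
from itertools import combinations

def gen_candidates(Lk):
    """Candidate generation: self-join Lk, then prune by the Apriori property."""
    k = len(next(iter(Lk)))  # size of the itemsets in Lk
    Ck1 = set()
    for p in Lk:
        for q in Lk:
            # Self-join: first k-1 items equal, last item of p < last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune: every k-subset of the candidate must itself be frequent
                if all(s in Lk for s in combinations(cand, k)):
                    Ck1.add(cand)
    return Ck1

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
C4 = gen_candidates(L3)
# abcd survives; acde is pruned because its subset ade is not in L3
```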
Bottleneck of frequent-itemset mining
Multiple database scans are costly.
Mining long patterns needs many passes of scanning and generates lots of candidates.
To find the frequent itemset i1 i2 … i100:
Number of scans: 100
Number of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30
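The candidate count above can be checked exactly, since Python integers are arbitrary precision:

```python
from math import comb

# Sum of C(100, k) for k = 1..100 equals 2^100 - 1 (binomial theorem)
n_candidates = sum(comb(100, k) for k in range(1, 101))
assert n_candidates == 2**100 - 1
print(f"{n_candidates:.3e}")  # about 1.27e+30
```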
Mining frequent patterns without candidate generation
Grow long patterns from short ones using local frequent items:
• "abc" is a frequent pattern
• get all transactions having "abc": DB|abc
• if "d" is a local frequent item in DB|abc, then "abcd" is a frequent pattern
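One pattern-growth step can be illustrated directly. The toy database and the helper name `grow` are invented for illustration; the projection DB|abc is just the set of transactions containing the pattern:

```python
def grow(pattern, db, min_support):
    """One pattern-growth step: project db on `pattern`, extend by local frequent items."""
    # DB|pattern: transactions containing every item of the pattern
    projected = [t for t in db if pattern <= t]
    local_counts = {}
    for t in projected:
        for item in t - pattern:
            local_counts[item] = local_counts.get(item, 0) + 1
    # Each locally frequent item d extends `pattern` to a longer frequent pattern
    return {frozenset(pattern | {d}) for d, c in local_counts.items() if c >= min_support}

db = [{'a','b','c','d'}, {'a','b','c','d','e'}, {'a','b','c'}, {'b','e'}]
extensions = grow({'a', 'b', 'c'}, db, min_support=2)
# 'd' is locally frequent in DB|abc (count 2), so abcd is frequent; 'e' (count 1) is not
```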
FP-growth: Frequent Pattern-Growth
• Compress the database, representing its frequent items in a frequent-pattern tree (FP-tree)
Example: FP-growth
• The first scan of the data is the same as in Apriori
• Derive the set of frequent 1-itemsets
• Let min_sup = 2
• Generate a set of ordered items

Transactional Database:
TID | List of item IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Ordered frequent items:
Item ID | Support count
I2 | 7
I1 | 6
I3 | 6
I4 | 2
I5 | 2
Construct the FP-Tree
- Create a branch for each transaction
- Items in each transaction are processed in order
1- Order the items of T100: {I2, I1, I5}
2- Construct the first branch: <I2:1>, <I1:1>, <I5:1>
Tree so far:
null
  I2:1
    I1:1
      I5:1
Construct the FP-Tree
1- Order the items of T200: {I2, I4}
2- Construct the second branch: <I2:1>, <I4:1> (the shared prefix I2 is incremented to 2)
Tree so far:
null
  I2:2
    I1:1
      I5:1
    I4:1
Construct the FP-Tree
1- Order the items of T300: {I2, I3}
2- Construct the third branch: <I2:2>, <I3:1> (I2 is incremented to 3)
Tree so far:
null
  I2:3
    I1:1
      I5:1
    I4:1
    I3:1
Construct the FP-Tree
1- Order the items of T400: {I2, I1, I4}
2- Construct the fourth branch: <I2:3>, <I1:1>, <I4:1> (the shared prefix I2, I1 is incremented)
Tree so far:
null
  I2:4
    I1:2
      I5:1
      I4:1
    I4:1
    I3:1
Construct the FP-Tree
1- Order the items of T500: {I1, I3}
2- Construct the fifth branch: <I1:1>, <I3:1> (a new branch from the root, since I1 does not share the I2 prefix)
Tree so far:
null
  I2:4
    I1:2
      I5:1
      I4:1
    I4:1
    I3:1
  I1:1
    I3:1
Construct the FP-Tree
When a branch of a transaction is added, the count for each node along a common prefix is incremented by 1.
Final tree after all nine transactions:
null
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2
Construct the FP-Tree
[Figure: the final FP-tree and the item support counts from the previous slide]
The problem of mining frequent patterns in databases is transformed to that of mining the FP-tree.
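The construction steps above can be sketched in Python. This is a minimal illustration of FP-tree building only (no header-table links, no mining); the `Node` class and function name are invented for the sketch, and ties in support (I1 vs. I3) are broken alphabetically, matching the slide's order I2, I1, I3, I4, I5:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

def build_fp_tree(transactions, min_sup):
    """Build an FP-tree: order items by global frequency, insert each transaction."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    order = sorted(freq, key=lambda i: (-freq[i], i))  # I2, I1, I3, I4, I5
    root = Node(None, None)
    for t in transactions:
        node = root
        # Keep only frequent items, sorted by descending global support
        for item in [i for i in order if i in t]:
            if item in node.children:
                node.children[item].count += 1  # shared prefix: bump the count
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root

db = [['I1','I2','I5'], ['I2','I4'], ['I2','I3'], ['I1','I2','I4'],
      ['I1','I3'], ['I2','I3'], ['I1','I3'], ['I1','I2','I3','I5'],
      ['I1','I2','I3']]
tree = build_fp_tree(db, min_sup=2)
```

On the nine-transaction database this reproduces the tree from the slides: the root has children I2:7 and I1:2, with I1:4 under I2, and so on.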
Mining the FP-Tree
[Figure: the final FP-tree from the construction slides]
- Occurrences of I5: <I2,I1,I5> and <I2,I1,I3,I5>
- Two prefix paths: <I2,I1: 1> and <I2,I1,I3: 1>
- The conditional FP-tree contains only <I2: 2, I1: 2>; I3 is not considered because its support count of 1 is less than the minimum support count
- Frequent patterns: {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
Mining the FP-Tree
[Figure: the final FP-tree from the construction slides]
Item | Conditional Pattern Base | Conditional FP-tree
I5 | {{I2,I1:1}, {I2,I1,I3:1}} | <I2:2, I1:2>
I4 | {{I2,I1:1}, {I2:1}} | <I2:2>
I3 | {{I2,I1:2}, {I2:2}, {I1:2}} | <I2:4, I1:2>, <I1:2>
I1 | {{I2:4}} | <I2:4>
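The first row of the table can be reproduced without an explicit tree, since I5's conditional pattern base is just the frequency-ordered prefix preceding I5 in each transaction that contains it. This is a simplified sketch of that one FP-growth step, not the full recursive algorithm:

```python
from collections import Counter

db = [['I1','I2','I5'], ['I2','I4'], ['I2','I3'], ['I1','I2','I4'],
      ['I1','I3'], ['I2','I3'], ['I1','I3'], ['I1','I2','I3','I5'],
      ['I1','I2','I3']]
min_sup = 2

freq = Counter(i for t in db for i in t)
order = sorted(freq, key=lambda i: (-freq[i], i))  # I2, I1, I3, I4, I5

# Conditional pattern base of I5: the ordered prefix before I5 in each transaction
base = [[i for i in order if i in t and i != 'I5'] for t in db if 'I5' in t]
# base == [['I2', 'I1'], ['I2', 'I1', 'I3']]

# Locally frequent items in the base give the conditional FP-tree <I2:2, I1:2>;
# I3 (local count 1) falls below min_sup and is dropped
local = Counter(i for prefix in base for i in prefix)
cond = {i: c for i, c in local.items() if c >= min_sup}
```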
Mining the FP-Tree
[Figure: the final FP-tree from the construction slides]
Item | Conditional FP-tree | Frequent Patterns Generated
I5 | <I2:2, I1:2> | {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
I4 | <I2:2> | {I2,I4:2}
I3 | <I2:4, I1:2>, <I1:2> | {I2,I3:4}, {I1,I3:4}, {I2,I1,I3:2}
I1 | <I2:4> | {I2,I1:4}
FP-growth properties
• FP-growth transforms the problem of finding long frequent patterns into searching for shorter ones recursively and concatenating the suffix
• It reduces the search cost
• If the tree does not fit into main memory, partition the database
• Efficient and scalable for mining both long and short frequent patterns