Frequent pattern mining aims to discover patterns that occur frequently in a dataset, revealing inherent regularities without preconceptions. The frequent patterns can then be used in applications such as association rule mining, classification, and clustering. Two major approaches for mining frequent patterns are the Apriori algorithm and FP-Growth: Apriori is a generate-and-test method, while FP-Growth avoids candidate generation by building a compact data structure called an FP-tree and extracting patterns from it directly.
2. WHAT IS FREQUENT PATTERN ANALYSIS?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
3. WHY IS FREQ. PATTERN MINING IMPORTANT?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: associative classification
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
4. BASIC CONCEPTS: FREQUENT PATTERNS AND ASSOCIATION RULES
Itemset X = {x1, …, xk}
Find all the rules X ⇒ Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Let min_sup = 50%, min_conf = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
[Venn diagram: customers buying beer, customers buying diapers, and the overlap buying both]
Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
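As a sanity check, here is a minimal sketch (mine, not the slides') that recomputes these support and confidence figures from the transaction table:

```python
# Minimal sketch: recompute support/confidence for A => D and D => A
# over the slide's example transactions (min_sup = min_conf = 50%).
db = [
    {"A", "B", "D"},            # TID 10
    {"A", "C", "D"},            # TID 20
    {"A", "D", "E"},            # TID 30
    {"B", "E", "F"},            # TID 40
    {"B", "C", "D", "E", "F"},  # TID 50
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs):
    """P(rhs in transaction | lhs in transaction)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))       # 0.6  -> both rules have support 60%
print(confidence({"A"}, {"D"}))  # 1.0  -> A => D has confidence 100%
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A has confidence 75%
```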
5. CLOSED PATTERNS AND MAX-PATTERNS
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
Closed patterns are a lossless compression of frequent patterns
Reducing the # of patterns and rules
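As an illustration, a small sketch of both definitions, assuming the frequent-pattern table from the earlier example (min_sup = 50%):

```python
# Sketch: closed/max checks over the earlier example's frequent patterns.
# Supports taken from Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}.
freq = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset({"A", "D"}): 3,
}

def is_closed(x):
    # Closed: no frequent proper superset has the same support.
    return all(not (x < y and freq[y] == freq[x]) for y in freq)

def is_max(x):
    # Max: no frequent proper superset exists at all.
    return all(not x < y for y in freq)

for x in freq:
    print(sorted(x), "closed:", is_closed(x), "max:", is_max(x))
# A is neither (AD has the same support 3); D is closed but not max;
# B, E, and AD are both closed and max.
```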
6. SCALABLE METHODS FOR MINING FREQUENT PATTERNS
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods: Three major approaches
Apriori
Freq. pattern growth
Vertical data format approach
7. APRIORI: A CANDIDATE GENERATION-AND-TEST APPROACH
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
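A minimal level-wise sketch of this loop, assuming a naive join step (unions of frequent k-itemsets) and plain subset counting; the hash-tree machinery discussed later is omitted:

```python
# Sketch of level-wise Apriori: scan for frequent 1-itemsets, then
# generate (k+1)-candidates, prune, and test until nothing new is frequent.
from itertools import combinations

def apriori(db, min_count):
    counts = {}
    for t in db:                          # pass 1: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {x for x, c in counts.items() if c >= min_count}
    frequent, k = set(L), 1
    while L:
        # Naive join: unions of frequent k-itemsets of size k+1.
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Apriori pruning: every k-subset must itself be frequent.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        L = {c for c in C if sum(c <= t for t in db) >= min_count}
        frequent |= L
        k += 1
    return frequent

db = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"}, {"B","E","F"},
      {"B","C","D","E","F"}]
print(sorted("".join(sorted(x)) for x in apriori(db, 3)))
# ['A', 'AD', 'B', 'D', 'E']
```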
9. IMPORTANT DETAILS OF APRIORI
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
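The same example in code. A naive union-based join also produces abce, which the pruning step removes along with acde; the textbook self-join on a shared 2-item prefix would avoid generating abce at all:

```python
# Sketch of candidate generation from L3 (self-join, then prune).
from itertools import combinations

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}

# Join: 4-item unions of pairs of frequent 3-itemsets.
C4 = {a | b for a in L3 for b in L3 if len(a | b) == 4}

# Prune: a candidate survives only if all its 3-subsets are in L3.
C4 = {c for c in C4 if all(frozenset(s) in L3 for s in combinations(c, 3))}

print(sorted("".join(sorted(c)) for c in C4))
# ['abcd']  (acde is pruned because ade is not in L3)
```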
10. HOW TO COUNT SUPPORTS OF CANDIDATES?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets
and counts
Interior node contains a hash table
Subset function: finds all the candidates contained
in a transaction
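Not the hash-tree itself, but a sketch of the job it does: for each transaction, find the stored candidates it contains and bump their counters. A real hash-tree narrows this lookup rather than enumerating every k-subset of long transactions:

```python
# Simplified support counting: enumerate each transaction's k-subsets
# and look them up in the candidate table (the hash-tree's role).
from itertools import combinations

candidates = {frozenset("abcd"): 0}   # C4 from the previous example
k = 4

for t in [set("abcde"), set("abd"), set("abcdf")]:
    for subset in combinations(sorted(t), k):
        s = frozenset(subset)
        if s in candidates:
            candidates[s] += 1

print(candidates)   # abcd occurs in 2 of the 3 transactions
```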
11. BOTTLENECK OF FREQUENT-PATTERN MINING
Multiple database scans are costly
Mining long patterns needs many passes of scanning
and generates lots of candidates
To find frequent itemset i1i2…i100:
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
12. FREQUENT ITEMSET GENERATION
Apriori: uses a generate-and-test approach –
generates candidate itemsets and tests if they are
frequent
Generation of candidate itemsets is expensive (in both space and time)
Support counting is expensive
Subset checking (computationally expensive)
Multiple database scans (I/O)
FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two-step approach:
Step 1: Build a compact data structure called the FP-tree
Built using 2 passes over the data-set.
Step 2: Extract frequent itemsets directly from the FP-tree
13. STEP 1: FP-TREE CONSTRUCTION
FP-Tree is constructed using 2 passes over the
data-set:
Pass 1:
Scan data and find support for each item.
Discard infrequent items.
Sort frequent items in decreasing order based on
their support.
Use this order when building the FP-Tree, so
common prefixes can be shared.
14. STEP 1: FP-TREE CONSTRUCTION
Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads one transaction at a time and maps it to a path
2. A fixed order is used, so paths can overlap when transactions share items (i.e., when they have the same prefix). In this case, counters are incremented
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines)
The more paths that overlap, the higher the compression. The FP-tree may fit in memory.
4. Frequent itemsets are extracted from the FP-Tree.
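A compact sketch of both passes, assuming a simple node class and a header table of per-item node lists (the "linked lists" above); the dataset reuses the earlier transaction table and is illustrative only:

```python
# Sketch of FP-tree construction: pass 1 counts items, pass 2 inserts
# each transaction as a path in decreasing-support order.
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(db, min_count):
    support = defaultdict(int)
    for t in db:                      # pass 1: item supports
        for item in t:
            support[item] += 1
    keep = {i: s for i, s in support.items() if s >= min_count}

    root, header = Node(None, None), defaultdict(list)
    for t in db:                      # pass 2: insert sorted transactions
        items = sorted((i for i in t if i in keep),
                       key=lambda i: (-keep[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1           # shared prefixes just bump counters
    return root, header

db = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"}, {"B","E","F"},
      {"B","C","D","E","F"}]
root, header = build_fptree(db, min_count=2)
# Summing counts along each item's node list recovers its support.
print({i: sum(n.count for n in header[i]) for i in sorted(header)})
```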
16. FP-TREE SIZE
The FP-Tree usually has a smaller size than the
uncompressed data - typically many transactions
share items (and hence prefixes).
Best case scenario: all transactions contain the same set of
items.
1 path in the FP-tree
Worst case scenario: every transaction has a unique set of
items (no items in common)
Size of the FP-tree is at least as large as the original data.
Storage requirements for the FP-tree are higher - need to store
the pointers between the nodes and the counters.
The size of the FP-tree depends on how the items are
ordered
Ordering by decreasing support is typically used but
it does not always lead to the smallest tree (it's a
heuristic).
17. STEP 2: FREQUENT ITEMSET GENERATION
FP-Growth extracts frequent itemsets from the
FP-tree.
Bottom-up algorithm - from the leaves towards
the root
Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
First, extract prefix path sub-trees ending in an
item(set). (hint: use the linked lists)
19. STEP 2: FREQUENT ITEMSET GENERATION
Each prefix path sub-tree is processed
recursively to extract the frequent
itemsets. Solutions are then merged.
E.g. the prefix path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
Divide and conquer approach
20. CONDITIONAL FP-TREE
The FP-Tree that would be built if we only consider
transactions containing a particular itemset (and then
removing that itemset from all transactions).
Example: FP-Tree conditional on e.
21. EXAMPLE
Let minSup = 2 and extract all frequent itemsets
containing e.
1. Obtain the prefix path sub-tree for e:
22. EXAMPLE
2. Check if e is a frequent item by adding the
counts along the linked list (dotted line). If so,
extract it.
Yes, count = 3, so {e} is extracted as a frequent itemset.
3. As e is frequent, find frequent itemsets ending
in e. i.e. de, ce, be and ae.
23. EXAMPLE
4. Use the conditional FP-tree for e to find frequent itemsets ending in de, ce and ae
Note that be is not considered as b is not in the
conditional FP-tree for e.
For each of them (e.g. de), find the prefix paths
from the conditional tree for e, extract frequent
itemsets, generate conditional FP-tree, etc...
(recursive)
24. EXAMPLE
Example: e -> de -> ade ({d,e} and {a,d,e} are found to be frequent)
Example: e -> ce ({c,e} is found to be frequent)
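A condensed sketch of this recursion. It represents conditional pattern bases as flat (itemset, count) lists rather than explicit trees, and the dataset is an assumed illustration chosen so that, with minSup = 2, it reproduces the behavior above: e, de, ade, ce, and ae come out frequent, while be does not:

```python
# Sketch of recursive FP-Growth over flat conditional pattern bases.
from collections import defaultdict

def fpgrowth(patterns, suffix, rank, min_count, out):
    support = defaultdict(int)
    for items, count in patterns:
        for i in items:
            support[i] += count
    for item, s in support.items():
        if s < min_count:
            continue                  # e.g. b is dropped for suffix e
        new = frozenset(suffix | {item})
        out[new] = s
        # Prefix paths: keep only items above `item` in the global
        # frequency order, mirroring the bottom-up tree walk.
        cond = [(frozenset(j for j in items if rank[j] < rank[item]), c)
                for items, c in patterns if item in items]
        fpgrowth(cond, new, rank, min_count, out)

db = [set("ab"), set("bcd"), set("acde"), set("ade"), set("abc"),
      set("abcd"), set("a"), set("abc"), set("abd"), set("bce")]
counts = defaultdict(int)
for t in db:
    for i in t:
        counts[i] += 1
rank = {i: r for r, (i, _) in
        enumerate(sorted(counts.items(), key=lambda kv: -kv[1]))}
out = {}
fpgrowth([(frozenset(t), 1) for t in db], frozenset(), rank, 2, out)
print(sorted("".join(sorted(x)) for x in out if "e" in x))
# ['ade', 'ae', 'ce', 'de', 'e']
```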
26. DISCUSSION
Advantages of FP-Growth
only 2 passes over the data-set
“compresses” data-set
no candidate generation
much faster than Apriori
Disadvantages of FP-Growth
FP-Tree may not fit in memory!!
FP-Tree is expensive to build
27. ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
ECLAT: for each item, store a list of transaction ids
(tids); vertical data layout
[Figure: TID-lists, one per item, in the vertical data layout]
28. ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1)-subsets.
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too
large for memory
Example: tid-list(A) = {1, 4, 5, 6, 7, 8, 9}, tid-list(B) = {1, 2, 5, 7, 8, 10}
A ∧ B → AB: tid-list(AB) = {1, 5, 7, 8}, so support(AB) = 4
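In code, the intersection is a one-line set operation (tids as in the example):

```python
# Sketch: ECLAT support counting via tid-list intersection.
tid = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}
tid_ab = tid["A"] & tid["B"]          # intersect the 1-subset tid-lists
print(sorted(tid_ab), "support =", len(tid_ab))
# [1, 5, 7, 8] support = 4
```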
29. INTERESTINGNESS MEASURE: CORRELATIONS (LIFT)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
30. LIFT
Measure of dependent/correlated events: lift
lift = 1: A and B are independent
lift > 1: A and B are positively correlated
lift < 1: A and B are negatively correlated
$\mathrm{lift} = \dfrac{P(A \cup B)}{P(A)\,P(B)}$

$\mathrm{lift}(B, C) = \dfrac{2000/5000}{(3000/5000) \times (3750/5000)} = 0.89$

$\mathrm{lift}(B, \neg C) = \dfrac{1000/5000}{(3000/5000) \times (1250/5000)} = 1.33$
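The same arithmetic as a short check against the contingency table:

```python
# Sketch: lift(B, C) and lift(B, not C) from the contingency table.
N = 5000
p_b, p_c, p_not_c = 3000 / N, 3750 / N, 1250 / N
p_bc, p_b_not_c = 2000 / N, 1000 / N

print(round(p_bc / (p_b * p_c), 2))           # 0.89 -> negatively correlated
print(round(p_b_not_c / (p_b * p_not_c), 2))  # 1.33 -> positively correlated
```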