Frequent pattern mining aims to discover patterns that occur frequently in a dataset, revealing inherent regularities without preconceptions. The frequent patterns can then be used in applications such as association rule mining, classification, and clustering. Two major approaches for mining frequent patterns are the Apriori algorithm and FP-Growth: Apriori is a generate-and-test method, while FP-Growth avoids candidate generation by building a compact data structure called an FP-tree and extracting patterns from it directly.
2. WHAT IS FREQUENT PATTERN ANALYSIS?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
3. WHY IS FREQ. PATTERN MINING IMPORTANT?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: associative classification
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
4. BASIC CONCEPTS: FREQUENT PATTERNS AND ASSOCIATION RULES
Itemset X = {x1, …, xk}
Find all the rules X ⇒ Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Let min_sup = 50%, min_conf = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
[Venn diagram: customers buying beer, customers buying diapers, and the overlap buying both]
Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
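As a sanity check, here is a minimal sketch (mine, not the slides') that recomputes these support and confidence figures from the transaction table:

```python
# Minimal sketch: recompute support/confidence for A => D and D => A
# over the slide's example transactions (min_sup = min_conf = 50%).
db = [
    {"A", "B", "D"},            # TID 10
    {"A", "C", "D"},            # TID 20
    {"A", "D", "E"},            # TID 30
    {"B", "E", "F"},            # TID 40
    {"B", "C", "D", "E", "F"},  # TID 50
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs):
    """P(rhs in transaction | lhs in transaction)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))       # 0.6  -> both rules have support 60%
print(confidence({"A"}, {"D"}))  # 1.0  -> A => D has confidence 100%
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A has confidence 75%
```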
5. CLOSED PATTERNS AND MAX-PATTERNS
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
Closed patterns are a lossless compression of frequent patterns
Reducing the # of patterns and rules
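As an illustration, a small sketch of both definitions, assuming the frequent-pattern table from the earlier example (min_sup = 50%):

```python
# Sketch: closed/max checks over the earlier example's frequent patterns.
# Supports taken from Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}.
freq = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset({"A", "D"}): 3,
}

def is_closed(x):
    # Closed: no frequent proper superset has the same support.
    return all(not (x < y and freq[y] == freq[x]) for y in freq)

def is_max(x):
    # Max: no frequent proper superset exists at all.
    return all(not x < y for y in freq)

for x in freq:
    print(sorted(x), "closed:", is_closed(x), "max:", is_max(x))
# A is neither (AD has the same support 3); D is closed but not max;
# B, E, and AD are both closed and max.
```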
6. SCALABLE METHODS FOR MINING FREQUENT PATTERNS
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods: Three major approaches
Apriori
Freq. pattern growth
Vertical data format approach
7. APRIORI: A CANDIDATE GENERATION-AND-TEST APPROACH
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
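A minimal level-wise sketch of this loop, assuming a naive join step (unions of frequent k-itemsets) and plain subset counting; the hash-tree machinery discussed later is omitted:

```python
# Sketch of level-wise Apriori: scan for frequent 1-itemsets, then
# generate (k+1)-candidates, prune, and test until nothing new is frequent.
from itertools import combinations

def apriori(db, min_count):
    counts = {}
    for t in db:                          # pass 1: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {x for x, c in counts.items() if c >= min_count}
    frequent, k = set(L), 1
    while L:
        # Naive join: unions of frequent k-itemsets of size k+1.
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Apriori pruning: every k-subset must itself be frequent.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        L = {c for c in C if sum(c <= t for t in db) >= min_count}
        frequent |= L
        k += 1
    return frequent

db = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"}, {"B","E","F"},
      {"B","C","D","E","F"}]
print(sorted("".join(sorted(x)) for x in apriori(db, 3)))
# ['A', 'AD', 'B', 'D', 'E']
```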
9. IMPORTANT DETAILS OF APRIORI
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
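The same example in code. A naive union-based join also produces abce, which the pruning step removes along with acde; the textbook self-join on a shared 2-item prefix would avoid generating abce at all:

```python
# Sketch of candidate generation from L3 (self-join, then prune).
from itertools import combinations

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}

# Join: 4-item unions of pairs of frequent 3-itemsets.
C4 = {a | b for a in L3 for b in L3 if len(a | b) == 4}

# Prune: a candidate survives only if all its 3-subsets are in L3.
C4 = {c for c in C4 if all(frozenset(s) in L3 for s in combinations(c, 3))}

print(sorted("".join(sorted(c)) for c in C4))
# ['abcd']  (acde is pruned because ade is not in L3)
```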
10. HOW TO COUNT SUPPORTS OF CANDIDATES?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets
and counts
Interior node contains a hash table
Subset function: finds all the candidates contained
in a transaction
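Not the hash-tree itself, but a sketch of the job it does: for each transaction, find the stored candidates it contains and bump their counters. A real hash-tree narrows this lookup rather than enumerating every k-subset of long transactions:

```python
# Simplified support counting: enumerate each transaction's k-subsets
# and look them up in the candidate table (the hash-tree's role).
from itertools import combinations

candidates = {frozenset("abcd"): 0}   # C4 from the previous example
k = 4

for t in [set("abcde"), set("abd"), set("abcdf")]:
    for subset in combinations(sorted(t), k):
        s = frozenset(subset)
        if s in candidates:
            candidates[s] += 1

print(candidates)   # abcd occurs in 2 of the 3 transactions
```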
11. BOTTLENECK OF FREQUENT-PATTERN MINING
Multiple database scans are costly
Mining long patterns needs many passes of scanning
and generates lots of candidates
To find frequent itemset i1i2…i100:
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
12. FREQUENT ITEMSET GENERATION
Apriori: uses a generate-and-test approach –
generates candidate itemsets and tests if they are
frequent
Generation of candidate itemsets is expensive (in both space and time)
Support counting is expensive
Subset checking (computationally expensive)
Multiple database scans (I/O)
FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two-step approach:
Step 1: Build a compact data structure called the FP-tree
Built using 2 passes over the data-set.
Step 2: Extract frequent itemsets directly from the FP-tree
13. STEP 1: FP-TREE CONSTRUCTION
FP-Tree is constructed using 2 passes over the
data-set:
Pass 1:
Scan data and find support for each item.
Discard infrequent items.
Sort frequent items in decreasing order based on
their support.
Use this order when building the FP-Tree, so
common prefixes can be shared.
14. STEP 1: FP-TREE CONSTRUCTION
Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads one transaction at a time and maps it to a path
2. A fixed order is used, so paths can overlap when transactions share items (i.e., when they have the same prefix). In this case, counters are incremented
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines)
The more paths that overlap, the higher the compression. The FP-tree may fit in memory.
4. Frequent itemsets are extracted from the FP-Tree.
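A compact sketch of both passes, assuming a simple node class and a header table of per-item node lists (the "linked lists" above); the dataset reuses the earlier transaction table and is illustrative only:

```python
# Sketch of FP-tree construction: pass 1 counts items, pass 2 inserts
# each transaction as a path in decreasing-support order.
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(db, min_count):
    support = defaultdict(int)
    for t in db:                      # pass 1: item supports
        for item in t:
            support[item] += 1
    keep = {i: s for i, s in support.items() if s >= min_count}

    root, header = Node(None, None), defaultdict(list)
    for t in db:                      # pass 2: insert sorted transactions
        items = sorted((i for i in t if i in keep),
                       key=lambda i: (-keep[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1           # shared prefixes just bump counters
    return root, header

db = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"}, {"B","E","F"},
      {"B","C","D","E","F"}]
root, header = build_fptree(db, min_count=2)
# Summing counts along each item's node list recovers its support.
print({i: sum(n.count for n in header[i]) for i in sorted(header)})
```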
16. FP-TREE SIZE
The FP-Tree usually has a smaller size than the
uncompressed data - typically many transactions
share items (and hence prefixes).
Best case scenario: all transactions contain the same set of
items.
1 path in the FP-tree
Worst case scenario: every transaction has a unique set of
items (no items in common)
Size of the FP-tree is at least as large as the original data.
Storage requirements for the FP-tree are higher - need to store
the pointers between the nodes and the counters.
The size of the FP-tree depends on how the items are
ordered
Ordering by decreasing support is typically used but
it does not always lead to the smallest tree (it's a
heuristic).
17. STEP 2: FREQUENT ITEMSET GENERATION
FP-Growth extracts frequent itemsets from the
FP-tree.
Bottom-up algorithm - from the leaves towards
the root
Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
First, extract prefix path sub-trees ending in an
item(set). (hint: use the linked lists)
19. STEP 2: FREQUENT ITEMSET GENERATION
Each prefix path sub-tree is processed
recursively to extract the frequent
itemsets. Solutions are then merged.
E.g. the prefix path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
Divide and conquer approach
20. CONDITIONAL FP-TREE
The FP-Tree that would be built if we only consider
transactions containing a particular itemset (and then
removing that itemset from all transactions).
Example: FP-Tree conditional on e.
21. EXAMPLE
Let minSup = 2 and extract all frequent itemsets
containing e.
1. Obtain the prefix path sub-tree for e:
22. EXAMPLE
2. Check if e is a frequent item by adding the
counts along the linked list (dotted line). If so,
extract it.
Yes, count = 3, so {e} is extracted as a frequent itemset.
3. As e is frequent, find frequent itemsets ending
in e. i.e. de, ce, be and ae.
23. EXAMPLE
4. Use the conditional FP-tree for e to find frequent itemsets ending in de, ce and ae
Note that be is not considered as b is not in the
conditional FP-tree for e.
For each of them (e.g. de), find the prefix paths
from the conditional tree for e, extract frequent
itemsets, generate conditional FP-tree, etc...
(recursive)
24. EXAMPLE
Example: e -> de -> ade ({d,e} and {a,d,e} are found to be frequent)
Example: e -> ce ({c,e} is found to be frequent)
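A condensed sketch of this recursion. It represents conditional pattern bases as flat (itemset, count) lists rather than explicit trees, and the dataset is an assumed illustration chosen so that, with minSup = 2, it reproduces the behavior above: e, de, ade, ce, and ae come out frequent, while be does not:

```python
# Sketch of recursive FP-Growth over flat conditional pattern bases.
from collections import defaultdict

def fpgrowth(patterns, suffix, rank, min_count, out):
    support = defaultdict(int)
    for items, count in patterns:
        for i in items:
            support[i] += count
    for item, s in support.items():
        if s < min_count:
            continue                  # e.g. b is dropped for suffix e
        new = frozenset(suffix | {item})
        out[new] = s
        # Prefix paths: keep only items above `item` in the global
        # frequency order, mirroring the bottom-up tree walk.
        cond = [(frozenset(j for j in items if rank[j] < rank[item]), c)
                for items, c in patterns if item in items]
        fpgrowth(cond, new, rank, min_count, out)

db = [set("ab"), set("bcd"), set("acde"), set("ade"), set("abc"),
      set("abcd"), set("a"), set("abc"), set("abd"), set("bce")]
counts = defaultdict(int)
for t in db:
    for i in t:
        counts[i] += 1
rank = {i: r for r, (i, _) in
        enumerate(sorted(counts.items(), key=lambda kv: -kv[1]))}
out = {}
fpgrowth([(frozenset(t), 1) for t in db], frozenset(), rank, 2, out)
print(sorted("".join(sorted(x)) for x in out if "e" in x))
# ['ade', 'ae', 'ce', 'de', 'e']
```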
26. DISCUSSION
Advantages of FP-Growth
only 2 passes over the data-set
“compresses” data-set
no candidate generation
much faster than Apriori
Disadvantages of FP-Growth
FP-Tree may not fit in memory!!
FP-Tree is expensive to build
27. ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
ECLAT: for each item, store a list of transaction ids
(tids); vertical data layout
[Figure: TID-lists, one per item, in the vertical data layout]
28. ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1)-subsets.
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too
large for memory
Example: tid-list(A) = {1, 4, 5, 6, 7, 8, 9}, tid-list(B) = {1, 2, 5, 7, 8, 10}
A ∧ B → AB: tid-list(AB) = {1, 5, 7, 8}, so support(AB) = 4
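In code, the intersection is a one-line set operation (tids as in the example):

```python
# Sketch: ECLAT support counting via tid-list intersection.
tid = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}
tid_ab = tid["A"] & tid["B"]          # intersect the 1-subset tid-lists
print(sorted(tid_ab), "support =", len(tid_ab))
# [1, 5, 7, 8] support = 4
```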
29. INTERESTINGNESS MEASURE: CORRELATIONS (LIFT)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
30. LIFT
Measure of dependent/correlated events: lift
lift = 1: A and B are independent
lift > 1: A and B are positively correlated
lift < 1: A and B are negatively correlated
$\mathrm{lift} = \dfrac{P(A \cup B)}{P(A)\,P(B)}$

$\mathrm{lift}(B, C) = \dfrac{2000/5000}{(3000/5000) \times (3750/5000)} = 0.89$

$\mathrm{lift}(B, \neg C) = \dfrac{1000/5000}{(3000/5000) \times (1250/5000)} = 1.33$
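The same arithmetic as a short check against the contingency table:

```python
# Sketch: lift(B, C) and lift(B, not C) from the contingency table.
N = 5000
p_b, p_c, p_not_c = 3000 / N, 3750 / N, 1250 / N
p_bc, p_b_not_c = 2000 / N, 1000 / N

print(round(p_bc / (p_b * p_c), 2))           # 0.89 -> negatively correlated
print(round(p_b_not_c / (p_b * p_not_c), 2))  # 1.33 -> positively correlated
```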