The document provides an overview of sequential pattern mining. It discusses the challenges of mining sequential patterns from large databases due to the huge number of possible patterns. It then describes the Apriori algorithm as an example approach, showing the pseudocode. It works in multiple passes over the database, generating candidate itemsets in each pass and pruning those that don't meet the minimum support threshold. The document also summarizes the FP-Growth algorithm, which avoids candidate generation by building a compact FP-tree structure and mining it recursively to extract patterns. Applications mentioned include customer shopping sequences, medical treatments, and DNA sequences.
6. A sequence database
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
<a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
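The subsequence and support definitions on this slide can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the names `seq`, `is_subsequence`, `support`, and `DATABASE` are illustrative assumptions, and each element is written as a string of single-letter items.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*elements: str) -> Sequence:
    """Build a sequence from strings, one string of single-letter items per element."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if every pattern element is contained in a distinct sequence element, in order."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining element that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern: Sequence, database: Dict[int, Sequence]) -> int:
    """Number of database sequences that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The four-sequence database from the slide, keyed by SID.
DATABASE = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

print(is_subsequence(seq("a", "bc", "d", "f"), DATABASE[10]))
print(support(seq("ab", "c"), DATABASE))
```

With min_sup = 2, a pattern is sequential when `support(pattern, DATABASE) >= 2`; for <(ab)c> only sequences 10 and 30 contain it.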
7. CHALLENGES ON SEQUENTIAL PATTERN MINING
A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should
• find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
• be highly efficient and scalable, involving only a small number of database scans
• be able to incorporate various kinds of user-specific constraints
11. The Apriori Algorithm [Pseudo-Code]
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
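As a rough illustration of the pseudo-code above, here is a minimal Python sketch. It is not from the slides: the function name `apriori`, the join-and-prune way of generating candidates, and the sample transactions are all assumptions chosen for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: Iterable[Iterable[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent itemset with its support count."""
    db: List[FrozenSet[str]] = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts: Dict[FrozenSet[str], int] = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:  # loop while Lk is not empty
        # Ck+1: join pairs from Lk, then prune candidates with an infrequent k-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # Lk+1: candidates that meet the minimum support.
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions, for illustration only.
SAMPLE = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(SAMPLE, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each pass of the `while` loop corresponds to one scan of the database, which is why the number of scans grows with the size of the largest frequent itemset.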
12. APRIORI ADV/DISADV
Advantages:
Uses large itemset property.
Easily parallelized.
Easy to implement.
Disadvantages:
Assumes transaction database is memory resident.
Requires up to m database scans.
13. FP-Growth (J. Han, J. Pei, and Y. Yin, 2000)
Depth-first search
Avoids explicit candidate generation
Adopts a divide-and-conquer strategy
Two-step approach:
Step 1: Build a compact data structure called the FP-tree.
Step 2: Extract frequent itemsets from the FP-tree.
14. Step 1: FP-Tree Construction
The FP-tree is constructed using 2 passes over the data set:
Pass 1:
Scan the data and find the support for each item.
Discard infrequent items.
Sort frequent items in decreasing order of their support.
15. Pass 2:
Nodes correspond to items and have a counter.
1. FP-Growth reads 1 transaction at a time and maps it to a path.
2. A fixed order is used, so paths can overlap when transactions share items (when they have the same prefix).
– In this case, counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines).
– The more paths that overlap, the higher the compression. The FP-tree may fit in memory.
4. Frequent itemsets are extracted from the FP-tree.
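The two passes described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the `Node` class, `build_fp_tree`, `item_support`, the alphabetical tie-break in the item order, and the sample transactions are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


class Node:
    """One FP-tree node: an item, a counter, child nodes, and a same-item link."""

    def __init__(self, item: Optional[str], parent: Optional["Node"] = None) -> None:
        self.item = item
        self.parent = parent
        self.count = 0
        self.children: Dict[str, "Node"] = {}
        self.link: Optional["Node"] = None  # next node holding the same item


def build_fp_tree(
    transactions: Iterable[Iterable[str]], min_support: int
) -> Tuple[Node, Dict[str, Node]]:
    db: List[set] = [set(t) for t in transactions]

    # Pass 1: find the support of each item and discard infrequent items.
    support = Counter(item for t in db for item in t)
    frequent = {i: c for i, c in support.items() if c >= min_support}
    # Fixed order: decreasing support (ties broken alphabetically, an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Pass 2: map each transaction to a path, sharing common prefixes.
    root = Node(item=None)
    header: Dict[str, Node] = {}  # head of the linked list for each item
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = Node(item, parent=node)
                node.children[item] = child
                child.link = header.get(item)  # prepend to the item's linked list
                header[item] = child
            child.count += 1  # overlapping paths only increment counters
            node = child
    return root, header


def item_support(header: Dict[str, Node], item: str) -> int:
    """Sum the counters along the linked list of nodes holding `item`."""
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.link
    return total


# Hypothetical transactions, for illustration only.
SAMPLE = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"], ["a", "d", "e"], ["a", "b", "c"]]
root, header = build_fp_tree(SAMPLE, min_support=2)
print(sorted(root.children), item_support(header, "b"))
```

Following an item's linked list and summing the counters recovers that item's support without rescanning the data, which is what the extraction step relies on.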
16. Start from each frequent length-1 pattern (as an initial suffix pattern).
Construct its conditional pattern base (a "subdatabase", which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).
Construct its (conditional) FP-tree, and perform mining recursively on such a tree.
The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
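The recursion described above can be sketched compactly in Python. To keep the example short, this sketch is a simplification: it stores each conditional pattern base as a plain list of (prefix path, count) pairs instead of building a linked conditional FP-tree, so it shows the pattern-growth logic but not the tree compression. The function name `fp_growth` and the sample transactions are assumptions.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

PatternBase = List[Tuple[List[str], int]]


def fp_growth(
    base: PatternBase, min_support: int, suffix: FrozenSet[str] = frozenset()
) -> Dict[FrozenSet[str], int]:
    """Mine frequent itemsets from a (conditional) pattern base by pattern growth."""
    # Support of each item within this base, weighted by the path counts.
    support: Counter = Counter()
    for path, count in base:
        for item in set(path):
            support[item] += count
    frequent = {i: c for i, c in support.items() if c >= min_support}

    # Fixed order: decreasing support (ties broken alphabetically, an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    paths = [
        (sorted((i for i in set(path) if i in rank), key=rank.__getitem__), count)
        for path, count in base
    ]

    patterns: Dict[FrozenSet[str], int] = {}
    for item in reversed(order):  # start from the least frequent item
        # Grow the pattern: concatenate the item with the current suffix.
        pattern = suffix | {item}
        patterns[pattern] = frequent[item]
        # Conditional pattern base: the prefix paths that co-occur with `item`.
        conditional = [(p[: p.index(item)], c) for p, c in paths if item in p]
        patterns.update(fp_growth(conditional, min_support, pattern))
    return patterns


# Hypothetical transactions, for illustration only; each starts with count 1.
SAMPLE = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
mined = fp_growth([(t, 1) for t in SAMPLE], min_support=3)
for itemset, count in sorted(mined.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Because each conditional base holds only the items ranked ahead of the current suffix item, every frequent itemset is produced exactly once and no candidate sets are generated.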
17. Table: Table after first scan of database
Table: Transactional data
21. FP-GROWTH ADV/DISADV
Advantages of FP-Growth
• only 2 passes over data-set
• "compresses" data-set
• no candidate generation
• typically much faster than Apriori
Disadvantages of FP-Growth
• FP-Tree may not fit in memory
• FP-Tree is expensive to build
22. APPLICATIONS
Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
Telephone calling patterns, weblog click streams
DNA sequences and gene structures