Sequential pattern mining

A sequence : < (ef) (ab) (df) c b >
A sequence database
SID sequence An element may contain a set of items.
Items within an element are unordered
10 <a(abc)(ac)d(cf)>
and we list them alphabetically.
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb> <a(bc)df> is a subsequence of
40 <eg(af)cbc> <a(abc)(ac)d(cf)>

Given support threshold min_sup =2, <(ab)c> is a sequential
pattern 6

CHALLENGES ON SEQUENTIAL
PATTERN MINING
A huge number of possible sequential patterns are hidden in
databases

A mining algorithm should
 find the complete set of patterns, when possible, satisfying the
minimum support (frequency) threshold
 be highly efficient, scalable, involving only a small number of
database scans
 be able to incorporate various kinds of user-specific
constraints

7

The Apriori Algorithm—An Example
Supmin = 2 Itemset sup
Itemset sup
Database TDB {A} 2
Tid Items
L1 {A} 2
C1 {B} 3
{B} 3
10 A, C, D {C} 3
1st scan {C} 3
20 B, C, E {D} 1
{E} 3
30 A, B, C, E {E} 3
40 B, E
C2 Itemset sup C2 Itemset
{A, B} 1
L2 Itemset sup 2nd scan {A, B}
{A, C} 2
{A, C} 2 {A, C}
{A, E} 1
{B, C} 2
{B, C} 2 {A, E}
{B, E} 3
{B, E} 3 {B, C}
{C, E} 2
{C, E} 2 {B, E}
{C, E}

Itemset
3rd scan L3 Itemset sup
C3
{B, C, E} {B, C, E} 2
10

The Apriori Algorithm [Pseudo-Code]

Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are
contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
11

APRIORI ADV/DISADV

 Advantages:
 Uses large itemset property.
 Easily parallelized
 Easy to implement.

 Disadvantages:
 Assumes transaction database is memory resident.
 Requires up to m database scans.

 J. Han, J. Pei, and Y. Yin 2000
 Depth-first search
 Avoid explicit candidate generation
 Adopt divide-and-conquer strategy
 Two step approach
Step1:Build a compact data
structure called FP tree
Step2:Extract frequent itemsets
from FP tree.

Step 1: FP-Tree Construction
 FP-Tree is constructed using 2 passes over the data-set:

Pass 1:
 Scan data and find support for each item.
 Discard infrequent items.
 Sort frequent items in decreasing order based on
their support.

Pass 2:

Nodes correspond to items and have a counter

1. FP-Growth reads 1 transaction at a time and maps it to a path

2. Fixed order is used, so paths can overlap when transactions share items (when
they have the same prfix ).
– In this case, counters are incremented

3. Pointers are maintained between nodes containing the same item, creating singly
linked lists (dotted lines)
– The more paths that overlap, the higher the compression. FP-tree may fit in
memory.

4. Frequent itemsets extracted from the FP-Tree.

 Start from each frequent length-1 pattern (as an initial suffix
pattern)
 construct its conditional pattern base (a ―subdatabase,‖which
consists of the set of prefix paths in the FP-tree co-occurring
with the suffix pattern)
 Construct its (conditional) FP-tree, and perform mining
recursively on such a tree.
 The pattern growth is achieved by the concatenation of the
suffix pattern with the frequent patterns generated from a
conditional FP-tree.

Table : Table after
first scan of database
Table : Transactional data

Fig . FP – Tree Construction

EXAMPLE CONT

Table:Mining FP Tree by creating conditional (sub)-pattern bases

EXAMPLE CONT

Fig.The conditional FP-tree associated with the conditiona node I3

FP-FROWTH ADV/DISADV

Advantages of FP-Growth
• only 2 passes over data-set
• ―compresses‖ data-set
• no candidate generation
• much faster than Apriori

Disadvantages of FP-Growth
• FP-Tree may not fit in memory!!
• FP-Tree is expensive to build

APPLICATIONS

Customer shopping sequences:
 First buy computer, then CD-ROM, and then digital camera, within 3
months.

Medical treatments, natural disasters (e.g., earthquakes), science
& eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures

22

Sequential pattern mining

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Sequential pattern mining

Similar to Sequential pattern mining (20)

Recently uploaded

Recently uploaded (20)

Sequential pattern mining