SPADE - Sequence mining algorithm

Monica Dăgădiţă
ISI

 Introduction to sequence mining
 Why sequence mining?
 Sequence mining algorithms
 SPADE
 Motivation
 Definitions and examples
 Algorithm
 Implementation

Data Mining 11/8/2011

 Aim - finding statistically relevant patterns
between data examples where the values are
delivered in a sequence

 Originally introduced for market basket
analysis - customer behaviour predictions

2 types of sequence mining:
 string mining – biology (gene/protein sequences)
 itemset mining - marketing and CRM applications


 Discovering patterns:
 Bookstore: 70% of the people who buy Jane
Austen’s “Pride and Prejudice” also buy “Emma”
within a month
 Website: finding sequences of most frequently
accessed pages

 Usage:
 Promotions
 Shelf placement
 Restructure the website
 Recommender systems


 Apriori
 GSP (Generalized Sequential Pattern)
 FreeSpan (Frequent pattern-projected
Sequential pattern mining)
 PrefixSpan (Prefix-projected Sequential
pattern mining)
 SPADE (Sequential PAttern Discovery using
Equivalence classes)


 Problems of existing solutions
 Repeated database scans
 Complex internal data structures

 Key features of SPADE:
 Fixed number of database scans
 Vertical id-list database format
 Decomposition of search space into smaller
pieces – processed independently


 Itemset: set of m distinct items
I = {i1, i2, …, im }
 Event: non-empty collection of items
(i1,i2 … ik)
 Sequence : ordered list of events
< e1 -> e2 -> … -> en >
 K-sequence : sequence with k items
(B->AC) – 3-sequence
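The definitions above can be sketched in Python. This is only an illustration: modelling a sequence as a tuple of frozensets, and the helper names `seq_length` and `show`, are assumptions made for the example and do not come from the slides.

```python
# Assumed representation: a sequence is a tuple of events,
# and each event is a frozenset of single-character items.

def seq_length(sequence):
    """Return k, the total number of items over all events of the sequence."""
    return sum(len(event) for event in sequence)

def show(sequence):
    """Render a sequence in the slide notation, e.g. B->AC."""
    return "->".join("".join(sorted(event)) for event in sequence)

# The example from the slide: event {B} followed by event {A, C}.
b_then_ac = (frozenset("B"), frozenset("AC"))
print(show(b_then_ac), "is a", f"{seq_length(b_then_ac)}-sequence")
```

Counting items rather than events is what makes (B->AC) a 3-sequence even though it has only two events.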


 Subsequence: given two sequences α = <a1 a2 … an>
and β = <b1 b2 … bm>, α is called a subsequence of β,
denoted as α ⊆ β, if there exist integers
1 ≤ j1 < j2 < … < jn ≤ m such that
a1 ⊆ b(j1), a2 ⊆ b(j2), …, an ⊆ b(jn)

 Examples:
1. (B->AC) is a subsequence of (AB->E->ACD)
2. (AB->E) is not a subsequence of (ABE)
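The subsequence test can be checked with a short Python sketch. The function names `parse` and `is_subsequence` and the greedy matching strategy are assumptions made for illustration; events are written as strings of single-character items, as on the slide.

```python
def parse(text):
    """Turn slide notation such as 'AB->E->ACD' into a list of events (frozensets)."""
    return [frozenset(event) for event in text.split("->")]

def is_subsequence(alpha, beta):
    """Return True if every event of alpha is contained in a distinct event
    of beta, with the order of events preserved (greedy earliest match)."""
    j = 0
    for a in alpha:
        # Skip events of beta that do not contain the current event of alpha.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha must match a strictly later event
    return True

print(is_subsequence(parse("B->AC"), parse("AB->E->ACD")))  # example 1
print(is_subsequence(parse("AB->E"), parse("ABE")))         # example 2
```

Example 2 fails because AB and E would both have to map to the single event ABE, while the definition requires strictly increasing event positions.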


Id-lists of the most frequent items (1-sequences)
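As a rough sketch, id-lists can be built by one pass over a horizontal database. The sample database `DB`, the helper names and the threshold of 2 below are assumptions for illustration only; the slides do not list the input data.

```python
from collections import defaultdict

# Assumed sample database: sid -> {eid -> items in that event}.
DB = {
    1: {10: "CD", 15: "ABC", 20: "ABF", 25: "ACDF"},
    2: {15: "ABF", 20: "E"},
    3: {10: "ABF"},
    4: {10: "DGH", 20: "BF", 25: "AGH"},
}

def build_idlists(db):
    """Vertical format: map each item to its sorted list of (sid, eid) pairs."""
    idlists = defaultdict(list)
    for sid in sorted(db):
        for eid in sorted(db[sid]):
            for item in db[sid][eid]:
                idlists[item].append((sid, eid))
    return dict(idlists)

def support(idlist):
    """Support = number of distinct sequences (sids) containing the item."""
    return len({sid for sid, _ in idlist})

idlists = build_idlists(DB)
frequent = {item: l for item, l in idlists.items() if support(l) >= 2}
for item in sorted(frequent):
    print(item, frequent[item])
```

With this assumed data and a minimum support of 2, only A, B, D and F remain as frequent 1-sequences.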


 Computing the id-list of D->BF->A by successive joins
 Step 1: D->B
 Step 2: D->BF
 Step 3: D->BF->A

 Keeping one eid column per item of the sequence is not space-efficient
 Solution: only 2 columns, (sid, eid), for each sequence
 eid: event id of the sequence's last item


 D->BF->A (space-efficient id-list joins)

D->B
SID  EID
1    15
1    20
4    20

D->BF
SID  EID
1    20
4    20

D->BF->A
SID  EID
1    25
4    25
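The two-column joins can be sketched as follows. The id-list of D->B is taken from the slide; the id-lists of the single items F and A, and the function names `equality_join` and `temporal_join`, are assumptions made for this example.

```python
def support(idlist):
    """Number of distinct sids in an id-list."""
    return len({sid for sid, _ in idlist})

def equality_join(left, right):
    """Same-event extension: keep (sid, eid) pairs present in both id-lists."""
    return sorted(set(left) & set(right))

def temporal_join(left, right):
    """Later-event extension: keep pairs of `right` that occur strictly after
    some occurrence of `left` within the same sid."""
    earliest = {}
    for sid, eid in left:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted((sid, eid) for sid, eid in set(right)
                  if sid in earliest and earliest[sid] < eid)

d_then_b = [(1, 15), (1, 20), (4, 20)]                              # from the slide
f_items = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]             # assumed
a_items = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]    # assumed

d_then_bf = equality_join(d_then_b, f_items)           # F joins B's event
d_then_bf_then_a = temporal_join(d_then_bf, a_items)   # A comes after BF
print(d_then_bf, d_then_bf_then_a, support(d_then_bf_then_a))
```

Because each id-list stores only the eid of the pattern's last event, a join never needs the columns of earlier items.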


 Complete lattice representation


 Decomposing the lattice => smaller pieces
that can be solved independently

 Equivalence classes
Two sequences are in the same class of Θk if they
share a common prefix of length k
Example
k=1 : Θ1 -> {[A],[B],[D],[F]}
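Grouping by prefix can be sketched in a few lines of Python. The list of 2-sequences and the helper names `parse`, `prefix` and `equivalence_classes` are assumptions chosen for illustration, not data from the slides.

```python
from collections import defaultdict

def parse(text):
    """'D->BF' -> (('D',), ('B', 'F')): a tuple of events with sorted items."""
    return tuple(tuple(sorted(event)) for event in text.split("->"))

def prefix(sequence, k):
    """Return the first k items of the sequence, keeping event boundaries."""
    out, remaining = [], k
    for event in sequence:
        if remaining == 0:
            break
        out.append(event[:remaining])
        remaining -= len(out[-1])
    return tuple(out)

def equivalence_classes(sequences, k):
    """Group sequences that share the same prefix of length k."""
    classes = defaultdict(list)
    for seq in sequences:
        classes[prefix(seq, k)].append(seq)
    return dict(classes)

# Assumed 2-sequences, used only to show the grouping.
two_sequences = [parse(s) for s in
                 ["AB", "AF", "B->A", "BF", "D->A", "D->B", "D->F", "F->A"]]
classes = equivalence_classes(two_sequences, 1)
for key in sorted(classes):
    print(key, classes[key])
```

Each class can then be mined without looking at the others, which is what makes the decomposition useful.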


 SPADE(min_sup, D)
// min_sup: minimum support
// D: initial dataset
F1 <- {frequent items, i.e. 1-sequences}
F2 <- {frequent 2-sequences}
E <- {equivalence classes [X] of Θ1}
for all [X] in E
    Enumerate_frequent_seq([X], min_sup)


 Enumerate_frequent_seq(S, min_sup)
for all Ai in S
    Ti <- {}
    for all Aj in S, with j ≥ i
        R <- Ai v Aj   // join
        if R satisfies min_sup
            Ti <- Ti U {R}
    end
    if depth-first search
        Enumerate_frequent_seq(Ti, min_sup)   // DFS
end
if breadth-first search
    for all non-empty Ti
        Enumerate_frequent_seq(Ti, min_sup)   // BFS
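The pseudocode can be turned into a compact depth-first Python sketch. It is a simplification under stated assumptions, not a reference implementation: the sample database `DB`, all function names and the pattern representation (a tuple of events, each a tuple of sorted items) are hypothetical, the separate F2 step is skipped by starting from the single-item class, and no candidate pruning is done.

```python
from collections import defaultdict

def support(idlist):
    return len({sid for sid, _ in idlist})

def equality_join(left, right):
    return sorted(set(left) & set(right))

def temporal_join(left, right):
    earliest = {}
    for sid, eid in left:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted((s, e) for s, e in set(right) if s in earliest and earliest[s] < e)

def enumerate_frequent_seq(atoms, min_sup, found):
    """atoms: (pattern, idlist) pairs that share the same prefix (one class)."""
    for pat_i, list_i in atoms:
        item_i, i_event = pat_i[-1][-1], len(pat_i[-1]) > 1
        children = []  # the class Ti with prefix pat_i
        for pat_j, list_j in atoms:
            item_j, j_event = pat_j[-1][-1], len(pat_j[-1]) > 1
            candidates = []
            if i_event == j_event and item_j > item_i:
                # item_j joins the last event of pat_i
                grown = pat_i[:-1] + (pat_i[-1] + (item_j,),)
                candidates.append((grown, equality_join(list_i, list_j)))
            if not j_event:
                # item_j forms a new event after pat_i
                candidates.append((pat_i + ((item_j,),), temporal_join(list_i, list_j)))
            for pattern, idlist in candidates:
                if support(idlist) >= min_sup:
                    found[pattern] = support(idlist)
                    children.append((pattern, idlist))
        if children:
            enumerate_frequent_seq(children, min_sup, found)  # DFS

def spade(db, min_sup):
    idlists = defaultdict(list)
    for sid, events in db.items():
        for eid, items in events.items():
            for item in items:
                idlists[item].append((sid, eid))
    atoms = [(((item,),), sorted(l)) for item, l in sorted(idlists.items())
             if support(l) >= min_sup]
    found = {pattern: support(l) for pattern, l in atoms}
    enumerate_frequent_seq(atoms, min_sup, found)
    return found

# Assumed sample database: sid -> {eid -> items}.
DB = {1: {10: "CD", 15: "ABC", 20: "ABF", 25: "ACDF"},
      2: {15: "ABF", 20: "E"},
      3: {10: "ABF"},
      4: {10: "DGH", 20: "BF", 25: "AGH"}}

found = spade(DB, min_sup=2)
for pattern in sorted(found, key=lambda p: (sum(map(len, p)), p)):
    print("->".join("".join(event) for event in pattern), found[pattern])
```

Only the id-lists of the current class are touched in each recursive call, so the database itself is read once, when the item id-lists are built.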


 The R Project for Statistical Computing
 A different implementation of the S language,
which was developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John
Chambers and colleagues

 arulesSequences package

