SPADE -

Published in: Technology

  1. 1. Sequence mining algorithm. Monica Dăgădiţă, ISI
  2. 2. - Introduction to sequence mining
        - Why sequence mining?
        - Sequence mining algorithms
        - SPADE
            - Motivation
            - Definitions and examples
            - Algorithm
            - Implementation
        Data Mining, 11/8/2011
  3. 3. - Aim: finding statistically relevant patterns between data examples where the values are delivered in a sequence
        - Originally introduced for market basket analysis (customer behaviour predictions)
        - Two types of sequence mining:
            - string mining: biology (gene/protein sequences)
            - itemset mining: marketing and CRM applications
  4. 4. - Discovering patterns:
            - Bookstore: 70% of the people who buy Jane Austen's "Pride and Prejudice" also buy "Emma" within a month
            - Website: finding the most frequently accessed sequences of pages
        - Usage:
            - Promotions
            - Shelf placement
            - Restructuring the website
            - Recommender systems
  5. 5. - Apriori
        - GSP (Generalized Sequential Pattern)
        - FreeSpan (Frequent pattern-projected Sequential pattern mining)
        - PrefixSpan (Prefix-projected Sequential pattern mining)
        - SPADE (Sequential PAttern Discovery using Equivalence classes)
  6. 6. - Problems of existing solutions:
            - Repeated database scans
            - Complex internal data structures
        - Key features of SPADE:
            - Fixed number of database scans
            - Vertical id-list database format
            - Decomposition of the search space into smaller pieces that are processed independently
  7. 7. - Itemset: set of m distinct items, I = {i1, i2, …, im}
        - Event: non-empty collection of items (i1, i2, …, ik)
        - Sequence: ordered list of events <e1 -> e2 -> … -> en>
        - k-sequence: sequence with k items; for example, (B->AC) is a 3-sequence
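The definitions above can be made concrete with a small sketch. The representation is an assumption made for illustration (an event as a `frozenset` of single-character items, a sequence as a tuple of events); `seq_size` and `fmt` are hypothetical helper names, not part of the slides.

```python
# Assumed representation: event = frozenset of items, sequence = tuple of events.

def seq_size(sequence):
    """Return k for a k-sequence: the total number of items over all events."""
    return sum(len(event) for event in sequence)

def fmt(sequence):
    """Render a sequence in the slides' notation, e.g. B->AC."""
    return "->".join("".join(sorted(event)) for event in sequence)

b_ac = (frozenset("B"), frozenset("AC"))
print(fmt(b_ac), seq_size(b_ac))  # B->AC 3
```

Note that k counts items, not events: (B->AC) has two events but is a 3-sequence.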
  8. 8. - Subsequence: given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>, α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ b_j1, a2 ⊆ b_j2, …, an ⊆ b_jn
        - Examples:
            1. (B->AC) is a subsequence of (AB->E->ACD)
            2. (AB->E) is not a subsequence of (ABE): (ABE) is a single event, while AB and E must be matched to two different events, in order
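A minimal sketch of the subsequence test, assuming the same illustrative representation (tuple of `frozenset` events, single-character items). The helper names `parse` and `is_subsequence` are hypothetical. Matching each event of α to the earliest possible event of β is sufficient, because an earlier match never rules out a later one.

```python
def parse(text):
    """Turn 'AB->E->ACD' into a tuple of frozenset events (single-character items assumed)."""
    return tuple(frozenset(event) for event in text.split("->"))

def is_subsequence(alpha, beta):
    """True if the events of alpha are contained, in order, in distinct events of beta."""
    j = 0
    for a in alpha:
        # Advance to the earliest remaining event of beta that contains event a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha must match a strictly later event of beta
    return True

print(is_subsequence(parse("B->AC"), parse("AB->E->ACD")))  # True
print(is_subsequence(parse("AB->E"), parse("ABE")))         # False
```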
  10. 10. Id-lists of the most frequent items (1-sequences)
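As an illustration of the vertical format, the sketch below turns a horizontal database of (sid, eid, items) rows into one id-list per item and keeps the frequent items. The database `DB` is an assumed toy dataset (the slide's own example is a figure that did not survive in the transcript); it was chosen so that the frequent items come out as A, B, D and F. `vertical_idlists` and `support` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed toy database: (sequence id, event id, items in the event).
DB = [
    (1, 10, "CD"), (1, 15, "ABC"), (1, 20, "ABF"), (1, 25, "ACDF"),
    (2, 15, "ABF"), (2, 20, "E"),
    (3, 10, "ABF"),
    (4, 10, "DGH"), (4, 20, "BF"), (4, 25, "AGH"),
]

def vertical_idlists(db):
    """Map each item to its id-list: the (sid, eid) pairs where it occurs."""
    idlists = defaultdict(list)
    for sid, eid, items in db:
        for item in items:
            idlists[item].append((sid, eid))
    return idlists

def support(idlist):
    """Support = number of distinct sequences (sids) in the id-list."""
    return len({sid for sid, _ in idlist})

min_sup = 2  # assumed threshold for the example
idlists = vertical_idlists(DB)
frequent = {item: ids for item, ids in idlists.items() if support(ids) >= min_sup}

print(sorted(frequent))  # ['A', 'B', 'D', 'F']
print(frequent["D"])     # [(1, 10), (1, 25), (4, 10)]
```

Support is counted per sequence, so D occurring twice in sequence 1 still contributes only once.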
  11. 11. - D->BF->A
              - Step 1: D->B
              - Step 2: D->BF
  12. 12. - D->BF->A
              - Step 3: D->BF->A
          - Not space-efficient
              - Solution: keep only 2 columns, (sid, eid), for each sequence
              - eid: event id of the sequence's last item
  13. 13. - D->BF->A (space-efficient id-list joins)

          D->B          D->BF         D->BF->A
          SID  EID      SID  EID      SID  EID
          1    15       1    20       1    25
          1    20       4    20       4    25
          4    20
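The three id-lists above can be reproduced with two kinds of joins on (sid, eid) pairs. This is a simplified sketch: the single-item id-lists `A`, `B`, `D`, `F` are assumed example data consistent with the table, and `temporal_join` and `equality_join` are hypothetical helper names. The last step joins against the single-item id-list of A for brevity, whereas SPADE itself joins two sequences that share a prefix.

```python
# Assumed id-lists of the frequent items: (sid, eid) pairs.
A = [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)]
B = [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)]
D = [(1, 10), (1, 25), (4, 10)]
F = [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)]

def temporal_join(prefix_ids, item_ids):
    """Id-list of 'prefix -> item': item occurrences strictly after a prefix occurrence."""
    first_eid = {}
    for sid, eid in prefix_ids:
        first_eid[sid] = min(eid, first_eid.get(sid, eid))
    return [(s, e) for s, e in item_ids if s in first_eid and e > first_eid[s]]

def equality_join(left_ids, right_ids):
    """Id-list of a same-event extension: both patterns end in the same (sid, eid)."""
    right = set(right_ids)
    return [pair for pair in left_ids if pair in right]

d_b = temporal_join(D, B)         # D->B
d_f = temporal_join(D, F)         # D->F
d_bf = equality_join(d_b, d_f)    # D->BF
d_bf_a = temporal_join(d_bf, A)   # D->BF->A

print(d_b)     # [(1, 15), (1, 20), (4, 20)]
print(d_bf)    # [(1, 20), (4, 20)]
print(d_bf_a)  # [(1, 25), (4, 25)]
```

Only the eid of the last event is stored for each occurrence, which is what keeps the id-lists at two columns regardless of the sequence length.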
  14. 14. - Complete lattice representation
  16. 16. - Decomposing the lattice => smaller pieces that can be solved independently
          - Equivalence classes
              - 2 sequences are in the same class (Θk) if they share a common k-length prefix
              - Example, k=1: Θ1 -> {[A], [B], [D], [F]}
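A small sketch of prefix-based equivalence classes. Sequences are written here as tuples of sorted item strings, for example `("D", "BF", "A")` for D->BF->A; this representation, the sample list `seqs`, and the helper names `prefix` and `equivalence_classes` are assumptions made for illustration.

```python
from collections import defaultdict

def prefix(sequence, k):
    """First k items of a sequence, e.g. prefix(("D", "BF", "A"), 2) == ("D", "B")."""
    out, remaining = [], k
    for event in sequence:
        if remaining <= 0:
            break
        out.append(event[:remaining])
        remaining -= len(event)
    return tuple(out)

def equivalence_classes(sequences, k):
    """Group sequences by their common k-length prefix."""
    classes = defaultdict(list)
    for seq in sequences:
        classes[prefix(seq, k)].append(seq)
    return dict(classes)

# Hypothetical sample of sequences.
seqs = [("AB",), ("A", "B"), ("B", "A"), ("D", "B"), ("D", "BF"), ("D", "BF", "A"), ("F", "A")]

theta1 = equivalence_classes(seqs, 1)
print(sorted(theta1))       # [('A',), ('B',), ('D',), ('F',)]
print(theta1[("D",)])       # [('D', 'B'), ('D', 'BF'), ('D', 'BF', 'A')]
```

Each class can then be processed on its own, since every sequence in it extends the same prefix.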
  19. 19. SPADE(min_sup, D)
              // min_sup: minimum support
              // D: initial dataset
              F1 <- {frequent items or 1-sequences}
              F2 <- {frequent 2-sequences}
              E <- {equivalence classes [X] of Θ1}
              for all [X] in E
                  Enumerate_frequent_seq([X], min_sup)
  20. 20. Enumerate_frequent_seq(S, min_sup)
              for all Ai in S
                  Ti <- {}
                  for all Aj in S, with j ≥ i
                      R <- Ai v Aj   // join
                      if R satisfies min_sup
                          Ti <- Ti U {R}
                  end
                  Enumerate_frequent_seq(Ti, min_sup)   // DFS variant: recurse immediately
              end
              for all non-empty Ti
                  Enumerate_frequent_seq(Ti, min_sup)   // BFS variant: recurse after the whole level is built
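The pseudocode can be sketched in Python as follows. This is a simplified depth-first variant under stated assumptions, not the reference implementation: the toy database `DB` is assumed example data, items are single characters, the separate F2 step and candidate pruning are skipped, and instead of iterating j ≥ i and emitting up to three join results, the loop visits every Aj and keeps only the candidates that extend Ai. All function names are hypothetical.

```python
from collections import defaultdict

# Assumed toy database: (sequence id, event id, items in the event).
DB = [
    (1, 10, "CD"), (1, 15, "ABC"), (1, 20, "ABF"), (1, 25, "ACDF"),
    (2, 15, "ABF"), (2, 20, "E"),
    (3, 10, "ABF"),
    (4, 10, "DGH"), (4, 20, "BF"), (4, 25, "AGH"),
]

def support(idlist):
    return len({sid for sid, _ in idlist})

def temporal_join(left, right):
    """Occurrences in right that come strictly after some occurrence in left (same sid)."""
    first = {}
    for sid, eid in left:
        first[sid] = min(eid, first.get(sid, eid))
    return {(s, e) for s, e in right if s in first and e > first[s]}

def enumerate_frequent_seq(atoms, min_sup, found):
    """atoms: (pattern, idlist, is_event_atom) triples sharing a common prefix."""
    for pat_i, ids_i, event_i in atoms:
        x = pat_i[-1][-1]  # last item of Ai
        children = []      # the class Ti of sequences with prefix Ai
        for pat_j, ids_j, event_j in atoms:
            y = pat_j[-1][-1]  # last item of Aj
            candidates = []
            if event_i == event_j and y > x:
                # Add y to the last event of Ai: equality join.
                candidates.append((pat_i[:-1] + (pat_i[-1] + y,), ids_i & ids_j, True))
            if not event_j:
                # Append y as a new, later event: temporal join.
                candidates.append((pat_i + (y,), temporal_join(ids_i, ids_j), False))
            for pattern, ids, is_event in candidates:
                if support(ids) >= min_sup:
                    children.append((pattern, ids, is_event))
                    found[pattern] = support(ids)
        enumerate_frequent_seq(children, min_sup, found)  # depth-first recursion

def spade(db, min_sup):
    idlists = defaultdict(set)
    for sid, eid, items in db:
        for item in items:
            idlists[item].add((sid, eid))
    atoms = [((item,), ids, False) for item, ids in sorted(idlists.items())
             if support(ids) >= min_sup]
    found = {pattern: support(ids) for pattern, ids, _ in atoms}
    enumerate_frequent_seq(atoms, min_sup, found)
    return found

result = spade(DB, min_sup=2)
for pattern, sup in sorted(result.items(), key=lambda kv: (sum(map(len, kv[0])), kv[0])):
    print("->".join(pattern), sup)
```

On this toy database with a minimum support of 2, the sketch finds 18 frequent sequences, the longest being D->BF->A. Because each class only needs the id-lists of its own atoms, the recursive calls are independent of one another.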
  21. 21. - The R Project for Statistical Computing
              - R is similar to the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues
              - R can be considered a different implementation of the S language
          - arulesSequences package
