Sequential pattern mining


Published in: Education

  2. A sequence database
     SID   sequence
     10    <a(abc)(ac)d(cf)>
     20    <(ad)c(bc)(ae)>
     30    <(ef)(ab)(df)cb>
     40    <eg(af)cbc>
     • A sequence: <(ef)(ab)(df)cb>
     • An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
     • <a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>.
     • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
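The subsequence and support definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a mining algorithm: the helper names `parse`, `is_subsequence`, and `support` are hypothetical, and each sequence is assumed to be stored as a list of item sets, one set per element.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def parse(elements: List[str]) -> Sequence:
    """Turn ["a", "abc", "ac"] into a list of item sets, one per element."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if each element of `pattern` is a subset of a distinct element
    of `sequence`, with the matched elements appearing in the same order."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining element that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern: Sequence, database: Dict[int, Sequence]) -> int:
    """Number of sequences in the database that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The sequence database from the slide, keyed by SID.
DATABASE = {
    10: parse(["a", "abc", "ac", "d", "cf"]),
    20: parse(["ad", "c", "bc", "ae"]),
    30: parse(["ef", "ab", "df", "c", "b"]),
    40: parse(["e", "g", "af", "c", "b", "c"]),
}

print(is_subsequence(parse(["a", "bc", "d", "f"]), DATABASE[10]))  # True
print(support(parse(["ab", "c"]), DATABASE))                       # 2
```

Only sequences 10 and 30 contain <(ab)c>, so its support is 2, which meets min_sup = 2.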
  3. CHALLENGES ON SEQUENTIAL PATTERN MINING
     A huge number of possible sequential patterns are hidden in databases.
     A mining algorithm should
     • find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
     • be highly efficient and scalable, involving only a small number of database scans
     • be able to incorporate various kinds of user-specific constraints
  4. The Apriori Algorithm: An Example (Supmin = 2)
     Database TDB
     Tid   Items
     10    A, C, D
     20    B, C, E
     30    A, B, C, E
     40    B, E
     1st scan, C1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
     L1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {E}: 3
     C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
     2nd scan, C2 (itemset: sup): {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
     L2 (itemset: sup): {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
     C3 (candidates): {B, C, E}
     3rd scan, L3 (itemset: sup): {B, C, E}: 2
  5. The Apriori Algorithm [Pseudo-Code]
     Ck: candidate itemsets of size k
     Lk: frequent itemsets of size k
     L1 = {frequent items};
     for (k = 1; Lk != ∅; k++) do begin
         Ck+1 = candidates generated from Lk;
         for each transaction t in database do
             increment the count of all candidates in Ck+1 that are contained in t
         Lk+1 = candidates in Ck+1 with min_support
     end
     return ∪k Lk;
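The pseudo-code can be turned into a short runnable sketch. The version below is one possible reading, not a reference implementation: the function name `apriori`, the join-by-union candidate generation, and the in-memory list of transactions are all assumptions made for illustration. It is run on the TDB example with min_sup = 2.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan: count the candidates contained in each transaction,
        # then keep those that reach the minimum support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)  # L1
    frequent = dict(level)
    k = 1
    while level:
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = count(candidates)  # L(k+1)
        frequent.update(level)
        k += 1
    return frequent


TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(TDB, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

On this database the prune step discards {A, B, C} and {A, C, E} before the third scan, leaving {B, C, E} as the only size-3 candidate, as in the worked example.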
  6. APRIORI ADV/DISADV
     Advantages:
     • Uses the large itemset property (every subset of a large, i.e. frequent, itemset is itself large).
     • Easily parallelized.
     • Easy to implement.
     Disadvantages:
     • Assumes the transaction database is memory resident.
     • Requires up to m database scans.
  7. FP-Growth (J. Han, J. Pei, and Y. Yin, 2000)
     • Depth-first search
     • Avoids explicit candidate generation
     • Adopts a divide-and-conquer strategy
     • Two-step approach:
       Step 1: Build a compact data structure called the FP-tree.
       Step 2: Extract frequent itemsets from the FP-tree.
  8. Step 1: FP-Tree Construction
     The FP-tree is constructed using 2 passes over the data-set.
     Pass 1:
     • Scan the data and find the support for each item.
     • Discard infrequent items.
     • Sort frequent items in decreasing order based on their support.
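Pass 1 can be sketched as follows. The function names `first_pass` and `reorder` are hypothetical, the data is the TDB from the Apriori example with min_sup = 2, and breaking ties alphabetically is an assumption (the slides only require some fixed order).

```python
from collections import Counter


def first_pass(transactions, min_sup):
    """Count item supports, drop infrequent items, and return the frequent
    items in decreasing order of support (ties broken alphabetically)."""
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: n for i, n in support.items() if n >= min_sup}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    return frequent, order


def reorder(transaction, order):
    """Keep only the frequent items, listed in the fixed global order."""
    present = set(transaction)
    return [i for i in order if i in present]


TDB = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
frequent, order = first_pass(TDB, min_sup=2)
print(order)                               # ['B', 'C', 'E', 'A']
print([reorder(t, order) for t in TDB])
```

Item D has support 1 and is discarded, so the first transaction shrinks to the path C, A.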
  9. Pass 2: Nodes correspond to items and have a counter.
     1. FP-Growth reads one transaction at a time and maps it to a path.
     2. A fixed order is used, so paths can overlap when transactions share items (when they have the same prefix).
        – In this case, counters are incremented.
     3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines).
        – The more paths that overlap, the higher the compression. The FP-tree may fit in memory.
     4. Frequent itemsets are extracted from the FP-tree.
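A minimal sketch of pass 2 is given below. The class `Node` and the helpers `build_tree` and `nodes_for` are hypothetical names, and the input is assumed to be already filtered and sorted as in pass 1 (here, the TDB transactions from the Apriori example in the order B, C, E, A).

```python
class Node:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> child Node
        self.link = None    # next node holding the same item


def build_tree(ordered_transactions):
    """Insert each transaction as a path from the root. Shared prefixes
    reuse existing nodes and only increment their counters."""
    root = Node(None, None)
    header = {}  # item -> first node in that item's linked list
    tail = {}    # item -> last node, so appending to the list is O(1)
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                if item in tail:
                    tail[item].link = child
                else:
                    header[item] = child
                tail[item] = child
            child.count += 1
            node = child
    return root, header


def nodes_for(item, header):
    """Walk the singly linked list of nodes that hold `item`."""
    node = header.get(item)
    while node is not None:
        yield node
        node = node.link


ORDERED = [["C", "A"], ["B", "C", "E"], ["B", "C", "E", "A"], ["B", "E"]]
root, header = build_tree(ORDERED)
print(root.children["B"].count)                      # 3
print(sum(n.count for n in nodes_for("A", header)))  # 2
```

Three of the four transactions start with B, so they share a single B node with counter 3. Summing the counters along an item's linked list gives that item's support.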
  10. Step 2: Mining the FP-tree
      • Start from each frequent length-1 pattern (as an initial suffix pattern).
      • Construct its conditional pattern base (a "sub-database", which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).
      • Construct its (conditional) FP-tree, and perform mining recursively on such a tree.
      • The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
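The recursion can be illustrated with a simplified sketch. To stay short, it keeps each conditional pattern base as a plain list of (prefix path, count) pairs instead of compressing it into a conditional FP-tree. It should find the same patterns, but without the compression a real FP-Growth implementation relies on. The function name `mine` and the input (the ordered TDB transactions from the Apriori example, min_sup = 2) are assumptions.

```python
from collections import Counter


def mine(pattern_base, suffix, min_sup, out):
    """pattern_base: list of (path, count) pairs, each path a tuple of items
    in the fixed global order. Records every frequent pattern that ends
    with `suffix` into `out` as {itemset: support}."""
    # Support of each item within this conditional pattern base.
    support = Counter()
    for path, count in pattern_base:
        for item in path:
            support[item] += count
    for item, sup in support.items():
        if sup < min_sup:
            continue
        # Pattern growth: concatenate the item with the current suffix.
        pattern = (item,) + suffix
        out[frozenset(pattern)] = sup
        # Conditional pattern base of `item`: the prefix paths preceding it.
        conditional = [
            (path[:path.index(item)], count)
            for path, count in pattern_base
            if item in path
        ]
        mine([(p, c) for p, c in conditional if p], pattern, min_sup, out)
    return out


ORDERED = [["C", "A"], ["B", "C", "E"], ["B", "C", "E", "A"], ["B", "E"]]
patterns = mine([(tuple(t), 1) for t in ORDERED], (), min_sup=2, out={})
for itemset, sup in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

For the suffix E, the conditional pattern base is {(B, C): 2, (B): 1}, which yields {B, E}, {C, E}, and {B, C, E}. No candidate itemsets are generated or tested against the full database.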
  11. Table: Table after first scan of database. Table: Transactional data.
  12. Fig. FP-Tree Construction
  13. EXAMPLE CONT. Table: Mining FP-tree by creating conditional (sub)-pattern bases
  14. EXAMPLE CONT. Fig. The conditional FP-tree associated with the conditional node I3
  15. FP-GROWTH ADV/DISADV
      Advantages of FP-Growth
      • only 2 passes over data-set
      • "compresses" data-set
      • no candidate generation
      • much faster than Apriori
      Disadvantages of FP-Growth
      • FP-Tree may not fit in memory!!
      • FP-Tree is expensive to build
  16. APPLICATIONS
      • Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
      • Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
      • Telephone calling patterns, Weblog click streams
      • DNA sequences and gene structures
  17. THANK YOU