Sequential pattern mining

Published in: Education

  1. Guide: Ms. Anagha Chaudhari
  2. A sequence: <(ef)(ab)(df)cb>
     An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
     A sequence database:
       SID   Sequence
       10    <a(abc)(ac)d(cf)>
       20    <(ad)c(bc)(ae)>
       30    <(ef)(ab)(df)cb>
       40    <eg(af)cbc>
     <a(bc)df> is a subsequence of <a(abc)(ac)d(cf)>.
     Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
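The subsequence and support definitions on this slide can be checked mechanically. The following is a minimal sketch, assuming each sequence is stored as a list of Python sets (one set per element); the helper names `parse`, `is_subsequence`, and `support` are illustrative and do not come from the slides.

```python
def parse(text):
    """Parse a string such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            end = text.index(")", i)
            elements.append(set(text[i + 1:end]))
            i = end + 1
        else:
            elements.append({text[i]})
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if every pattern element is a subset of a distinct sequence
    element, with the elements matched in the same order."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining element that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The sequence database from the slide, keyed by SID.
DATABASE = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)df"), DATABASE[10]))  # True
print(support(parse("(ab)c"), DATABASE))               # 2
```

With min_sup = 2, <(ab)c> qualifies because sequences 10 and 30 both contain it, while 20 and 40 have no element that includes both a and b.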
  3. Challenges in sequential pattern mining
     A huge number of possible sequential patterns are hidden in databases.
     A mining algorithm should:
       • find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
       • be highly efficient and scalable, involving only a small number of database scans
       • be able to incorporate various kinds of user-specific constraints
  4. The Apriori Algorithm: An Example (Supmin = 2)
     Database TDB:
       Tid   Items
       10    A, C, D
       20    B, C, E
       30    A, B, C, E
       40    B, E
     1st scan, C1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
     L1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {E}: 3
     C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
     2nd scan, C2 (itemset: sup): {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
     L2 (itemset: sup): {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
     C3 (candidates): {B, C, E}
     3rd scan, L3 (itemset: sup): {B, C, E}: 2
  5. The Apriori Algorithm [Pseudo-Code]
     Ck: candidate itemsets of size k
     Lk: frequent itemsets of size k
     L1 = {frequent items};
     for (k = 1; Lk != ∅; k++) do begin
         Ck+1 = candidates generated from Lk;
         for each transaction t in database do
             increment the count of all candidates in Ck+1 that are contained in t
         Lk+1 = candidates in Ck+1 with min_support
     end
     return ∪k Lk;
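The pseudo-code can be turned into a short runnable version. This is a sketch under stated assumptions rather than a tuned implementation: the function name `apriori`, the use of `frozenset` keys, and the explicit join-and-prune candidate generation are choices made here for illustration. The data is the TDB from the example slide.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # One database scan to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent


# Database TDB from the worked example, with min_sup = 2.
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(TDB, 2)
print(result[frozenset("BCE")])  # 2
```

On this data the prune step discards {A, B, C} and {A, C, E} before the third scan, because {A, B} and {A, E} are not frequent, which leaves {B, C, E} as the only size-3 candidate, as in the example.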
  6. Apriori: advantages and disadvantages
     Advantages:
       • Uses the large itemset property.
       • Easily parallelized.
       • Easy to implement.
     Disadvantages:
       • Assumes the transaction database is memory resident.
       • Requires up to m database scans.
  7. FP-Growth (J. Han, J. Pei, and Y. Yin, 2000)
       • Depth-first search
       • Avoids explicit candidate generation
       • Adopts a divide-and-conquer strategy
     Two-step approach:
       Step 1: Build a compact data structure called the FP-tree.
       Step 2: Extract frequent itemsets from the FP-tree.
  8. Step 1: FP-Tree Construction
     The FP-tree is constructed using 2 passes over the data set.
     Pass 1:
       • Scan the data and find the support for each item.
       • Discard infrequent items.
       • Sort frequent items in decreasing order of support.
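Pass 1 can be sketched in a few lines. The function names `first_pass` and `reorder` are illustrative, and the alphabetical tie-break among items with equal support is an assumption made here so the order is deterministic; the slides do not specify one. The data reuses the TDB from the Apriori example.

```python
from collections import Counter


def first_pass(transactions, min_sup):
    """Count item supports, drop infrequent items, and fix a global order."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_sup}
    # Decreasing support; ties broken alphabetically (an assumption).
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    return frequent, order


def reorder(transaction, order):
    """Keep only frequent items and arrange them in the global order."""
    rank = {item: r for r, item in enumerate(order)}
    return sorted((i for i in set(transaction) if i in rank), key=rank.get)


TDB = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
frequent, order = first_pass(TDB, 2)
print(order)                   # ['B', 'C', 'E', 'A']
print(reorder(TDB[0], order))  # ['C', 'A']
```

Item D is dropped because its support of 1 is below the threshold, and each transaction is rewritten in the shared order before it is inserted into the tree in pass 2.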
  9. Pass 2: Nodes correspond to items and have a counter.
       1. FP-Growth reads 1 transaction at a time and maps it to a path.
       2. A fixed order is used, so paths can overlap when transactions share items (when they have the same prefix). In this case, counters are incremented.
       3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines). The more paths that overlap, the higher the compression. The FP-tree may fit in memory.
       4. Frequent itemsets are extracted from the FP-tree.
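The tree-building pass described above might look like the sketch below. The class `FPNode`, the `header` dictionary that stores the first node of each item's linked list, and the helper `item_support` are names assumed for illustration. The input is taken to be already filtered and sorted by pass 1; here it is the TDB from the Apriori example in the order B, C, E, A.

```python
class FPNode:
    """One FP-tree node: an item, a counter, and links to related nodes."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> child FPNode
        self.next = None    # next node holding the same item (node-link)


def build_fp_tree(ordered_transactions):
    """Insert each transaction as a path, sharing common prefixes."""
    root = FPNode(None, None)
    header = {}  # item -> first node in that item's linked list
    tails = {}   # item -> last node in that list, for O(1) appends
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].next = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1  # shared prefix: only the counter grows
            node = child
    return root, header


def item_support(header, item):
    """Sum the counters along one item's linked list."""
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.next
    return total


ORDERED = [["C", "A"], ["B", "C", "E"], ["B", "C", "E", "A"], ["B", "E"]]
root, header = build_fp_tree(ORDERED)
print(root.children["B"].count)   # 3
print(item_support(header, "A"))  # 2
```

Three of the four transactions start with B, so they share a single B node with count 3 instead of three separate nodes, which is the compression the slide refers to.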
  10. Step 2: Mining the FP-tree
       • Start from each frequent length-1 pattern (as an initial suffix pattern) and construct its conditional pattern base (a "subdatabase," which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).
       • Construct its (conditional) FP-tree, and perform mining recursively on that tree.
       • The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from a conditional FP-tree.
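The recursion over conditional pattern bases can be illustrated with a simplified sketch. Note the simplification: a full FP-growth implementation compresses each conditional pattern base into a conditional FP-tree, whereas this version keeps the base as a plain list of (prefix path, count) pairs to stay short. The function name `pattern_growth` is an assumption, and the input is the ordered TDB from the Apriori example.

```python
from collections import Counter


def pattern_growth(pattern_base, min_sup, suffix=()):
    """Mine frequent itemsets from a list of (path, count) pairs.

    Each path is a tuple of distinct items in the fixed global order.
    """
    # Support of each item within this (conditional) pattern base.
    support = Counter()
    for path, count in pattern_base:
        for item in path:
            support[item] += count

    results = {}
    for item, sup in support.items():
        if sup < min_sup:
            continue
        # Grow the pattern: prepend the item to the current suffix.
        new_suffix = (item,) + suffix
        results[frozenset(new_suffix)] = sup
        # Conditional pattern base of the item: the prefix that precedes
        # it in every path where it occurs (empty prefixes are skipped).
        conditional = [(path[:path.index(item)], count)
                       for path, count in pattern_base
                       if item in path and path.index(item) > 0]
        results.update(pattern_growth(conditional, min_sup, new_suffix))
    return results


# Ordered transactions (order B, C, E, A), each with count 1.
BASE = [(("C", "A"), 1), (("B", "C", "E"), 1),
        (("B", "C", "E", "A"), 1), (("B", "E"), 1)]
result = pattern_growth(BASE, 2)
print(result[frozenset("BCE")])  # 2
```

On this data the sketch returns the same nine frequent itemsets that the Apriori example finds, without generating or counting any candidate that is absent from the data.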
  11. Table: Transactional data
      Table: Result after the first scan of the database
  12. Fig.: FP-tree construction
  13. Example (cont.)
      Table: Mining the FP-tree by creating conditional (sub-)pattern bases
  14. Example (cont.)
      Fig.: The conditional FP-tree associated with the conditional node I3
  15. FP-Growth: advantages and disadvantages
      Advantages of FP-Growth:
        • only 2 passes over the data set
        • "compresses" the data set
        • no candidate generation
        • much faster than Apriori
      Disadvantages of FP-Growth:
        • The FP-tree may not fit in memory.
        • The FP-tree is expensive to build.
  16. Applications
        • Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
        • Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
        • Telephone calling patterns, Weblog click streams
        • DNA sequences and gene structures
  17. Thank you
