Sma

Published in: Technology

  1. 1. Sequence Mining Automata: a New Technique for Mining Frequent Sequences Under Regular Expressions
     Roberto Trasarti, Francesco Bonchi, Bart Goethals
  2. 2. Problem Definition (1):
     Given a database of sequences D, the support of a sequence S ∈ Σ∗ is the number of sequences in D that are supersequences of S: sup(S) = |{T ∈ D | S ⊑ T}|.
     Given a regular expression R, a sequence S is valid if it can be generated by R.
     [Figure: example database of three sequences (numbered 1 to 3) over the items A, B, C and D, with minimum support 3 and RE A*BC*; two example subsequences are shown, one with support 3 and one with support 2.]
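The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the toy database `D` are assumptions made here (the slide's own example database is not reproduced), and sequences are modeled as strings of single-character items.

```python
import re

def is_subsequence(s, t):
    """True if s can be obtained from t by deleting items, keeping the order."""
    it = iter(t)
    # `item in it` advances the iterator, so matches must occur left to right.
    return all(item in it for item in s)

def support(s, database):
    """sup(S) = |{T in D | S is a subsequence of T}|."""
    return sum(1 for t in database if is_subsequence(s, t))

def is_valid(s, regex):
    """A sequence is valid if the whole sequence is generated by the RE."""
    return re.fullmatch(regex, s) is not None

# Hypothetical toy database (not the one drawn on the slide).
D = ["ABCA", "AABC", "BDC"]
RE = "A*BC*"

for pattern in ["BC", "AB", "CA"]:
    print(pattern, support(pattern, D), is_valid(pattern, RE))
```

With a minimum support of 3, only patterns that are both valid and contained in all three toy sequences would be reported.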
  3. 3. Previous approaches and our contribution:
     Previous approaches [1,2,3] solve the problem by focusing on its search space, exploiting in different ways the power of the regular expression R to prune unpromising patterns.
     The idea behind our solution is to focus instead on the input dataset and the given regular expression: while reading the input database we produce, for each sequence, all and only the valid patterns it contains.
     [1] H. Albert-Lorincz and J.-F. Boulicaut. Mining frequent sequential patterns under regular expressions: a highly adaptive strategy for pushing constraints. In Proc. of SDM’03.
     [2] M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. In Proc. of VLDB’99.
     [3] J. Pei, J. Han, and W. Wang. Mining sequential patterns with constraints in large databases. In Proc. of CIKM’02.
  4. 4. Sequence Mining Automata (1):
     Our Sequence Mining Automaton (SMA) is a specialized kind of Petri net. It can be constructed from a DFA by turning each edge of the DFA into a transition with two arcs: one from its input place and one to its output place.
     It also has the following peculiarities:
     • Transitions do not consume tokens
     • Parallel execution
     • External signal
     The initial marking consists only of the token representing the empty sequence ε in the starting place.
     Example RE: A*B(B|C)D*E
     [Figure: SMA for the example RE, driven by the external signal.]
  5. 5. Sequence Mining Automata (2):
     Each transition applies a process that is activated only if the external signal is equal to the label of the edge. This process produces a new set of tokens in the destination place.
     Example RE: A*B(B|C)D*E
     [Figure: SMA for the example RE, driven by the external signal.]
  6. 6. Sequence Mining Automata (3, Example):
     Given R ≡ A∗B(B|C)D∗E and S ≡ ACDBFAEBCFDE
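The mechanism described on the previous slides can be sketched as follows for the example R and S. This is a minimal reading of the slides, not the authors' implementation: the place names `q0`..`q3`, the hand-written DFA edge list, and the function `run_sma` are assumptions made for illustration, and tokens are modeled as strings.

```python
from collections import defaultdict

# Hand-built DFA for A*B(B|C)D*E as (source place, label, destination place).
# Place names are hypothetical; q0 is the starting place, q3 the accepting one.
EDGES = [
    ("q0", "A", "q0"),
    ("q0", "B", "q1"),
    ("q1", "B", "q2"),
    ("q1", "C", "q2"),
    ("q2", "D", "q2"),
    ("q2", "E", "q3"),
]
START, ACCEPT = "q0", "q3"

def run_sma(sequence, edges=EDGES, start=START, accept=ACCEPT):
    """Return the set of valid subsequences of `sequence` collected in `accept`."""
    places = defaultdict(set)
    places[start].add("")  # initial marking: the empty sequence in the start place
    for signal in sequence:  # each item of the input acts as the external signal
        produced = defaultdict(set)
        for src, label, dst in edges:
            if label == signal:  # the transition fires only on its own label
                # Tokens are copied and extended, never consumed.
                produced[dst] |= {tok + signal for tok in places[src]}
        # Parallel execution: new tokens are added only after every transition
        # has read the marking as it was before this signal.
        for dst, toks in produced.items():
            places[dst] |= toks
    return places[accept]

valid = run_sma("ACDBFAEBCFDE")
print(len(valid), sorted(valid, key=lambda p: (len(p), p))[:3])
```

Items such as F, which label no edge, fire no transition and leave the marking unchanged.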
  7. 7. One-Pass Solution (SMA-1P) and Full-Cut (SMA-FC)
     SMA-1P simply runs the SMA on each transaction and, at the end, computes the support of each extracted sequence and filters by the support threshold.
     The support threshold is not used during generation: we compute all the sequences in the dataset that are valid w.r.t. the RE.
     Given an SMA, a valid set of cuts is a partition p1, ..., pn of the places of the SMA such that there is no path from a place in pj to a place in pi if j > i.
     For each cut we apply SMA-1P to the whole DB. At the end of the i-th scan we obtain intermediate information about frequent patterns, which can be used in subsequent scans to remove the infrequent tokens.
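A rough sketch of the one-pass counting step and of the valid-cut condition is given below. Everything named here is an assumption for illustration: `sma_1p` takes any function that returns the valid patterns of one sequence, `brute_force_extract` is a brute-force stand-in for the SMA (using the RE A*BC* from the problem definition), and the edge list and place names for A*B(B|C)D*E are hand-written.

```python
import re
from collections import Counter
from itertools import combinations

def brute_force_extract(seq, regex="A*BC*"):
    """Stand-in for the SMA: all distinct subsequences of seq generated by regex."""
    found = set()
    for r in range(1, len(seq) + 1):
        for idx in combinations(range(len(seq)), r):
            cand = "".join(seq[i] for i in idx)
            if re.fullmatch(regex, cand):
                found.add(cand)
    return found

def sma_1p(database, extract_valid, min_support):
    """Count each valid pattern once per sequence, then filter at the end."""
    counts = Counter()
    for seq in database:
        counts.update(extract_valid(seq))  # a set, so one count per sequence
    return {p: c for p, c in counts.items() if c >= min_support}

def is_valid_cut_set(partition, edges):
    """Check that no path leads from a place in p_j to a place in p_i with j > i."""
    index = {}
    for i, part in enumerate(partition):
        for place in part:
            if place in index:
                return False  # not a partition: a place appears twice
            index[place] = i
    places = {p for src, _, dst in edges for p in (src, dst)}
    if places != set(index):
        return False  # not a partition: some place is missing or unknown
    # Any path that goes back to an earlier part must contain an edge that
    # does so, because every place lies in some part; checking edges suffices.
    return all(index[src] <= index[dst] for src, _, dst in edges)

# Hypothetical toy data.
D = ["ABCA", "AABC", "BDC"]
print(sma_1p(D, brute_force_extract, min_support=3))

EDGES = [("q0", "A", "q0"), ("q0", "B", "q1"), ("q1", "B", "q2"),
         ("q1", "C", "q2"), ("q2", "D", "q2"), ("q2", "E", "q3")]
print(is_valid_cut_set([{"q0", "q1"}, {"q2", "q3"}], EDGES))
```

The sketch stops at checking a candidate set of cuts; how the intermediate results prune tokens between scans is not spelled out on the slide and is not modeled here.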
  8. 8. Experiments (Synthetic Data):
     (D = dataset size, N = number of items, C = average length)
  9. 9. Experiments (Mobility data):
     From San Jose to San Francisco and back, either via CA-101 (west-bound of the bay), i.e., passing through San Mateo (cell H9 of our map), or via I-880 (east-bound of the bay), i.e., passing through Hayward (cell J8 of our map).
  10. 10. Conclusions:
     • We have introduced “Sequence Mining Automata”, a new mechanism for mining frequent sequences under regular expressions.
     • Around this basic mechanism we built a family of algorithms embedding different techniques.
     • The efficiency of our proposal has been thoroughly demonstrated empirically.
     • The SMA is a very simple and fundamental mechanism, opening the door to many possible extensions.
