Mining Non-Redundant Recurrent Rules from a Sequence Database
1. Mining Non-Redundant Recurrent Rules from a Sequence Database
Yoon SeungYong
Ministry of Science and ICT, Republic of Korea
forcom@forcom.kr
- Efficient Mining of Recurrent Rules from a Sequence Database (Lo et al., DASFAA 2008)
- Parallel Mining of Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, ISIS 2017)
- A Parallel Algorithm for Mining Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, JACIII 2019)
- Towards Efficient Mining of Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, IWCIA 2017)
- Mining Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, IJCISTUDIES 2018)
- Efficient Mining of Recurrent Rules from a Sequence Database Using Multi-Core Processors (Yoon and Seki, SCIS&ISIS 2018)
- Bidirectional Mining of Non-Redundant Recurrent Rules from a Sequence Database (Lo et al., IEEE ICDE 2011)
- A New Algorithm for Mining Recurrent Rules from a Sequence Database (Seki and Yoon, IEEE SMC 2019)
2. Table of Contents
1. Motivation
2. Mining Non-Redundant Recurrent Rules (NR3) by Lo et al.
3. Parallel Mining of Non-Redundant Recurrent Rules (pNR3)
4. Loop-Fused Mining of NR3 (LF-NR3)
5. Parallel Loop-Fused Mining of NR3 (pLF-NR3)
6. Bidirectional Mining of NR3 (BOB) by Lo et al.
7. Interleaved Bidirectional Mining of NR3 (iBiRM)
8. Conclusion
2019.11.18. 2
4. Sequence Database & Sequential Rule
- Transaction Histories
- Program Traces
Customer Movie Rental History
Alice Star Wars 4, Star Wars 5, Star Wars 6, Star Wars 1
Bob Shrek, Spirited Away, Your Name
Clara Spirited Away, Howl's Moving Castle, Princess Mononoke
David Star Wars 1, Star Wars 2, Star Wars 3, Star Wars 4, Star Wars 5
Eve Your Name
Trace ID Command
1 check, lock, use, use, unlock, exit
2 check, lock, use, check, lock, use, unlock, exit
3 check, use, unlock, exit
4 check, lock, use
5 check, lock, use, unlock, check, lock, use, unlock, exit
⟨Star Wars 4⟩ ⇒ ⟨Star Wars 5⟩
⟨lock⟩ ⇒ ⟨unlock⟩
5. What is a recurrent rule?
- Recurrent Rule R = pre ⇒ post
  - "Whenever a series of precedent events occurs, eventually another series of consequent events occurs"
  - e.g., R = ⟨check, lock⟩ ⇒ ⟨use, unlock⟩: "Whenever ⟨check, lock⟩ occurs, eventually ⟨use, unlock⟩ occurs"
  - Captures temporal constraints that repeat a meaningful number of times, both within a sequence and across multiple sequences
- A sequential rule R = pre ⇒ post means "whenever a sequence is a super-sequence of pre, it will be a super-sequence of pre ++ post"
- Linear Temporal Logic (LTL)
  - One of the most widely used formalisms for program verification
  - Clarke, Edmund M., Orna Grumberg, and Doron Peled. Model Checking. MIT Press, 1999.
  - A recurrent rule can be expressed in the form of LTL
- proposed by David Lo
6. Mining Non-Redundant Recurrent Rules (NR3)
based on David Lo, Siau-Cheng Khoo (NUS), and Chao Liu, DASFAA 2008
7. Preliminaries & Examples (1)
- a sequence database SeqDB: a set of sequences {S1, S2, S3, S4, S5}
- the set of events I in SeqDB: {check, exit, lock, unlock, use}
- the size of SeqDB = |SeqDB|: |SeqDB| = 5
- a sequence S = ⟨e1, e2, …, en⟩: S1 = ⟨check, lock, use, use, unlock, exit⟩
- a temporal point j of ej in S: the event at temporal point 5 in S1 is unlock
- the length of S = |S| = n: |S1| = 6
- the last event of S = last(S) = S[n]: last(S1) = exit
- the j-prefix of S = S^j = ⟨e1, e2, …, ej⟩: S1^2 = ⟨check, lock⟩
SID Sequence
S1 ⟨check, lock, use, use, unlock, exit⟩
S2 ⟨check, lock, use, check, lock, use, unlock, exit⟩
S3 ⟨check, use, unlock, exit⟩
S4 ⟨check, lock, use⟩
S5 ⟨check, lock, use, unlock, check, lock, use, unlock, exit⟩
an example sequence database SeqDB
8. Preliminaries & Examples (2)
- Given sequences S = ⟨e1, …, en⟩ and S' = ⟨e1', …, em'⟩
- the concatenation of S and S': S ++ S' = ⟨e1, …, en, e1', …, em'⟩
- S is a super-sequence of S' (S ⊒ S') if e_{i1} = e1', …, e_{im} = em' for some 1 ≤ i1 < ⋯ < im ≤ n
  - e.g., S1 ⊒ ⟨check, lock, unlock⟩
- S^j is an instance of S' in S if S^j ⊒ S' and last(S') = S[j]
- S^j is the minimum instance of S' in S, if S^j is an instance of S' and there is no i < j such that S^i is an instance of S'
  - e.g., S1^3 and S1^4 are instances of ⟨check, lock, use⟩ in S1, and S1^3 is the minimum
  - S5^9 is an instance of S1 in S5, and it is the minimum
SID Sequence
S1 ⟨check, lock, use, use, unlock, exit⟩
S2 ⟨check, lock, use, check, lock, use, unlock, exit⟩
S3 ⟨check, use, unlock, exit⟩
S4 ⟨check, lock, use⟩
S5 ⟨check, lock, use, unlock, check, lock, use, unlock, exit⟩
S1 = ⟨check, lock, use, use, unlock, exit⟩
an example sequence database SeqDB
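The operations above can be illustrated with a minimal Java sketch (class and method names are ours, not from the paper): the leftmost greedy embedding of a pattern ends exactly at its minimum instance point.

```java
import java.util.Arrays;
import java.util.List;

public class SeqOps {
    // S is a super-sequence of T (S ⊒ T): T embeds into S left-to-right.
    public static boolean isSuperSequence(List<String> s, List<String> t) {
        int i = 0;
        for (String e : s) {
            if (i < t.size() && e.equals(t.get(i))) i++;
        }
        return i == t.size();
    }

    // Temporal point (1-based) of the minimum instance of pattern p in s:
    // the smallest j such that s^j ⊒ p and s[j] = last(p); -1 if none.
    public static int minimumInstancePoint(List<String> s, List<String> p) {
        int i = 0;
        for (int j = 0; j < s.size(); j++) {
            if (i < p.size() && s.get(j).equals(p.get(i))) {
                i++;
                if (i == p.size()) return j + 1;  // 1-based temporal point
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("check", "lock", "use", "use", "unlock", "exit");
        System.out.println(isSuperSequence(s1, Arrays.asList("check", "lock", "unlock")));
        System.out.println(minimumInstancePoint(s1, Arrays.asList("check", "lock", "use")));
    }
}
```

On S1, ⟨check, lock, unlock⟩ embeds, and the minimum instance of ⟨check, lock, use⟩ ends at temporal point 3, matching the examples above.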
11. Rule Redundancy
- Consider R = ⟨check⟩ ⇒ ⟨lock, use, unlock⟩ and R' = ⟨check⟩ ⇒ ⟨unlock⟩ with the same sequence/instance support and confidence
  - Do we really need both of these rules?
- Rule Redundancy
  - A rule R' = pre' ⇒ post' is redundant if there is another rule R = pre ⇒ post such that
    1. R and R' have the same sequence/instance support and confidence
    2. pre ++ post ⊒ pre' ++ post' (R is longer than R')
- Mining Non-Redundant Recurrent Rules
  - Mine pruned pre-/post-conditions using a modified BIDE (the LS-Set miner)
  - BIDE: a frequent closed sequence mining algorithm based on the pattern-growth strategy
    - Wang, Jianyong, and Jiawei Han. "BIDE: Efficient Mining of Frequent Closed Sequences." Proceedings of the 20th International Conference on Data Engineering (ICDE), IEEE, 2004.
P = ⟨check, lock, use, unlock⟩
12. FS-Set, CS-Set, LS-Set
- The set of frequent sequential patterns (FS-Set)
  - FS = {P | support(P) ≥ min_sup}
- The set of closed frequent sequential patterns (CS-Set)
  - CS = {P | P ∈ FS and ∄P' ∈ FS such that P ⊏ P' and support(P) = support(P')}
- The projected-database closed set (LS-Set)
  - LS = {P | support(P) ≥ min_sup and ∄P' such that P ⊏ P' and SeqDB_P = SeqDB_P'}
  - cf. SeqDB_P = SeqDB_P' ⇒ |SeqDB_P| = |SeqDB_P'|
- Xifeng Yan, Jiawei Han, and Ramin Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets", SIAM SDM 2003
13. Pruning Redundant Pre-Conds
- In a sequence database SeqDB, consider a pre-condition candidate pre.
- If there is a pre-condition candidate pre' ⊐ pre such that
  - (i) pre' = P1 ++ ⟨e⟩ ++ P2 while pre = P1 ++ P2, for some event e and nonempty P1, P2
  - (ii) SeqDB_pre = SeqDB_pre'
- then, for any post-condition candidate post and any forward extension pre ++ P of pre,
  the rule pre ++ P ⇒ post is redundant
14. LS-Set BIDE
Backward-extension event checking is omitted from the original BIDE algorithm
- David Lo, Siau-Cheng Khoo, and Chao Liu, "Mining Recurrent Rules from a Sequence Database", Technical Report TR12/07, NUS
15. Non-Redundant Recurrent Rules Miner (NR3)
- Input: a sequence database SeqDB; thresholds min_sup, min_sup_all, min_conf
- Output: significant and non-redundant recurrent rules Rules
- Procedure
  1. preCond ← a pruned set of pre-conditions from SeqDB satisfying min_sup
  2. foreach pre ∈ preCond do
     2.1. SeqDB^all_pre ← SeqDB all-projected on pre
     2.2. pthd ← min_conf × |SeqDB^all_pre|
     2.3. postCond ← a pruned set of post-conditions from SeqDB^all_pre satisfying pthd
     2.4. foreach post ∈ postCond do
          if sup_all(pre ++ post, SeqDB) ≥ min_sup_all then Rules ← Rules ∪ {pre ⇒ post}
  3. Remove remaining redundancy in Rules
- Aliases for the tasks
  - Procedure line 1: the GenPre task
  - Procedure lines 2.1-2.4: the GenRule task
  - Procedure line 3: the RemRedun task
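The overall flow of the procedure can be sketched in Java in a drastically simplified form: pre- and post-conditions are restricted to single events, the instance-support threshold is applied to pre alone, and the BIDE-style pruning and the RemRedun step are omitted. All identifiers are illustrative, not from the original implementation.

```java
import java.util.*;

public class Nr3Sketch {
    // SeqDB all-projected on a single-event pre: the suffix after every occurrence.
    public static List<List<String>> allProject(List<List<String>> db, String pre) {
        List<List<String>> proj = new ArrayList<>();
        for (List<String> s : db)
            for (int j = 0; j < s.size(); j++)
                if (s.get(j).equals(pre))
                    proj.add(s.subList(j + 1, s.size()));
        return proj;
    }

    // conf(pre => post) = sup(post, SeqDB^all_pre) / |SeqDB^all_pre|.
    public static double confidence(List<List<String>> db, String pre, String post) {
        List<List<String>> proj = allProject(db, pre);
        if (proj.isEmpty()) return 0.0;
        long hit = proj.stream().filter(s -> s.contains(post)).count();
        return (double) hit / proj.size();
    }

    // GenPre / GenRule, restricted to single-event candidates.
    public static Set<String> mine(List<List<String>> db, int minSupAll, double minConf) {
        Set<String> events = new TreeSet<>();
        db.forEach(events::addAll);
        Set<String> rules = new TreeSet<>();
        for (String pre : events) {
            if (allProject(db, pre).size() < minSupAll) continue;   // GenPre threshold
            for (String post : events)                              // GenRule step
                if (confidence(db, pre, post) >= minConf)
                    rules.add(pre + " => " + post);
        }
        return rules;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
            Arrays.asList("check", "lock", "use", "use", "unlock", "exit"),
            Arrays.asList("check", "lock", "use", "check", "lock", "use", "unlock", "exit"),
            Arrays.asList("check", "use", "unlock", "exit"),
            Arrays.asList("check", "lock", "use"),
            Arrays.asList("check", "lock", "use", "unlock", "check", "lock", "use", "unlock", "exit"));
        System.out.println(mine(db, 5, 0.8));
    }
}
```

On the example trace database, ⟨lock⟩ ⇒ ⟨unlock⟩ is found with confidence 5/6: lock occurs six times across the sequences, and five of the six suffixes after it contain unlock.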
[Figure residue: an example prefix tree over events a, b, c, and the RemRedun hash table bucketing rules such as ⟨a⟩ ⇒ ⟨c,a,d⟩, ⟨a⟩ ⇒ ⟨c,b,b⟩, ⟨a,b⟩ ⇒ ⟨c,d⟩, ⟨a,b⟩ ⇒ ⟨c,a⟩, and ⟨a⟩ ⇒ ⟨b⟩ by their supports and confidence]
28. Data Structure Level Optimization for Projections
- For each sequence Si in SeqDB and the set I of events,
  - a hash map pos_i : I → 2^{1,…,|Si|}
  - such that each key e ∈ I is mapped to the set of temporal points at which event e occurs in Si
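A minimal sketch of this index in Java, assuming events are strings and temporal points are 1-based (names are ours): with a sorted set of points per event, the first occurrence of an event after a given temporal point becomes a logarithmic-time lookup instead of a linear scan, which is what the projection operations need.

```java
import java.util.*;

public class PosIndex {
    // Build pos_i: each event maps to the sorted set of its temporal points in seq.
    public static Map<String, TreeSet<Integer>> build(List<String> seq) {
        Map<String, TreeSet<Integer>> pos = new HashMap<>();
        for (int j = 0; j < seq.size(); j++)            // temporal points are 1-based
            pos.computeIfAbsent(seq.get(j), k -> new TreeSet<>()).add(j + 1);
        return pos;
    }

    // First temporal point of event e strictly after point p, or -1 if none.
    public static int firstAfter(Map<String, TreeSet<Integer>> pos, String e, int p) {
        TreeSet<Integer> pts = pos.get(e);
        if (pts == null) return -1;
        Integer q = pts.higher(p);
        return q == null ? -1 : q;
    }

    public static void main(String[] args) {
        List<String> s2 = Arrays.asList("check", "lock", "use", "check", "lock", "use", "unlock", "exit");
        Map<String, TreeSet<Integer>> pos = build(s2);
        System.out.println(pos.get("lock"));            // temporal points of lock in S2
        System.out.println(firstAfter(pos, "lock", 2)); // first lock after point 2
    }
}
```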
29. Experiment Environment
- Dataset
  - D10C10N10R0.5 (IBM synthetic data generator): 9,678 sequences, average length 31.22
  - BMSWebView1 (a clickstream dataset (Gazelle) from KDD Cup 2000): 59,601 sequences, average length 2.42
- Experiment machine
  - Intel Core i7-3610QM 2.30 GHz (4 physical cores, 8 logical cores)
  - 8 GB RAM
  - Microsoft Windows 7 Professional x64
- Implementation
  - Java SE 8
  - Default JVM settings
32. Discussion
- Computational complexity of the algorithms
  - O(|I|^k × |I|^k) (I: the set of events, k: the length of the longest frequent pattern)
- The effects of fusing loops in NR3
  - The foreach loop in the GenRule step is eliminated
  - The use of the intermediate data SeqDB_pre simplifies the computation of
    - SeqDB^all_pre = SeqDB_pre ∪ (SeqDB_pre)^all_pre
    - sup_all(pre ⇒ post, SeqDB) = sup_all(post, SeqDB^all_pre)
- The effect of the hash-based data structure
  - Efficient computation of (all-)projected databases
  - Using the hash-based data structure is not always efficient when the sequences are short
34. Loop-Fused NR3 (LF-NR3)
It is possible to exploit the task parallelism underlying the LF-NR3 algorithm, which can be handled within the single-producer-multiple-consumer framework.
40. Additional Definitions
- a sequence database SeqDB: a set of sequences
- a sequence S = ⟨e1, e2, …, en⟩
- the j-suffix of S = ⟨e_{n−j+1}, e_{n−j+2}, …, en⟩
- sx is the j-th minimum suffix of S with respect to a pattern P, if sx is a suffix of S starting with first(P), and no suffix starting with first(P) is shorter than sx yet longer than the (j−1)-th minimum suffix
- the j-th suf-projection of SeqDB with regard to a pattern P
  - SeqDB^{suf-j}_P = {(i, sx) | Si = px ++ sx ∈ SeqDB, sx is the j-th minimum suffix of Si with respect to P}
- SeqDB pre-projected on P
  - SeqDB^pre_P = {(i, px) | Si = px ++ sx ∈ SeqDB, sx is the 1st minimum suffix of Si with respect to P}
41. Anti-Monotonicity Property of Confidence
- Proposition 1
  - Consider a rule R of the form pre ⇒ post and a sequence database SeqDB
  - conf(R, SeqDB) = sup(post, SeqDB^all_pre) / sup_all(pre, SeqDB) = sup_all(pre, SeqDB^pre_post) / sup_all(pre, SeqDB)
- Proposition 2
  - Consider two rules R and R' in a sequence database SeqDB with pre' = pre and post' = ⟨e⟩ ++ post for some event e ∈ I
  - conf(R) ≥ conf(R')
- Theorem (Anti-Monotonicity Property of Confidence)
  - Consider two rules R and R' in a sequence database SeqDB with pre' = pre and post' = evs ++ post, where evs is an arbitrary series of events
  - conf(R) ≥ conf(R')
  - If R is not confident enough (conf(R) < min_conf), R' is not either
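The theorem can be checked numerically on the example trace database: prepending events to the post-condition can only keep or lower the confidence. The sketch below (our names; pre restricted to a single event for brevity) computes conf(pre ⇒ post) as sup(post, SeqDB^all_pre) / |SeqDB^all_pre|.

```java
import java.util.*;

public class ConfDemo {
    // t embeds into s left-to-right (s is a super-sequence of t).
    public static boolean superSeq(List<String> s, List<String> t) {
        int i = 0;
        for (String e : s) if (i < t.size() && e.equals(t.get(i))) i++;
        return i == t.size();
    }

    // conf(pre => post) over the all-projected database on a single-event pre.
    public static double conf(List<List<String>> db, String pre, List<String> post) {
        List<List<String>> proj = new ArrayList<>();
        for (List<String> s : db)
            for (int j = 0; j < s.size(); j++)
                if (s.get(j).equals(pre)) proj.add(s.subList(j + 1, s.size()));
        if (proj.isEmpty()) return 0.0;
        long hit = proj.stream().filter(s -> superSeq(s, post)).count();
        return (double) hit / proj.size();
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
            Arrays.asList("check", "lock", "use", "use", "unlock", "exit"),
            Arrays.asList("check", "lock", "use", "check", "lock", "use", "unlock", "exit"),
            Arrays.asList("check", "use", "unlock", "exit"),
            Arrays.asList("check", "lock", "use"),
            Arrays.asList("check", "lock", "use", "unlock", "check", "lock", "use", "unlock", "exit"));
        System.out.println(conf(db, "lock", Arrays.asList("unlock")));
        System.out.println(conf(db, "lock", Arrays.asList("use", "unlock")));
    }
}
```

Here conf(⟨lock⟩ ⇒ ⟨unlock⟩) = 5/6 and conf(⟨lock⟩ ⇒ ⟨use, unlock⟩) = 5/6 as well, consistent with conf(R) ≥ conf(R').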
42. Pruning Redundant Post-Conds
- In a sequence database SeqDB, consider a post-condition candidate post.
- Lemma 1
  - If there is a post-condition candidate post' ⊐ post such that
    - (i) post' = P1 ++ ⟨e⟩ ++ P2 while post = P1 ++ P2, for some event e and subsequences P1, (nonempty) P2
    - (ii) SeqDB^pre_post = SeqDB^pre_post'
  - then for any pre-condition candidate pre and any backward extension P ++ post of post, the rule R = pre ⇒ P ++ post is not confidence-closed
    - i.e., there exists another rule R' ⊐ R such that conf(R) = conf(R')
- Lemma 2
  - If there is a post-condition candidate post' ⊐ post such that
    - (i) post' = P1 ++ ⟨e⟩ ++ P2 while post = P1 ++ P2, for some event e and subsequences (nonempty) P1, P2
    - (iii) ∀j: SeqDB^{suf-j}_post = SeqDB^{suf-j}_post', and
    - (iv) ∀j: (SeqDB^{suf-j}_post)^pre_post = (SeqDB^{suf-j}_post')^pre_post'
  - then for any pre-condition candidate pre and any backward extension P ++ post of post, the rule R = pre ⇒ P ++ post is not support-closed
    - i.e., there exists another rule R' ⊐ R such that sup(R) = sup(R') and sup_all(R) = sup_all(R')
- Theorem (Pruning Redundant Post-Conds)
  - If properties (i)-(iv) of Lemmas 1 and 2 are satisfied, then for any pre-condition candidate pre and any backward extension P ++ post of post, the rule R = pre ⇒ P ++ post is redundant.
45. Optimizing Operations
- Given a sequence database SeqDB and a rule R = pre ⇒ post:
  - sup(R, SeqDB) = sup(post, SeqDB^all_pre)
  - sup_all(R, SeqDB) = sup_all(post, SeqDB^all_pre)
- Pruning the search space of PRE early
  - for R = pre ⇒ post and R' = pre ++ P ⇒ post,
    - if sup(R, SeqDB) ≤ min_sup, then sup(R', SeqDB) ≤ min_sup
    - if sup_all(R, SeqDB) ≤ min_sup_all, then sup_all(R', SeqDB) ≤ min_sup_all
- Decreasing the number of database scans using a prefix tree
  - for each pre-condition pre ∈ PRE, suppose that a node n0 of the prefix tree has children n1, …, nk
  - we can compute the instance supports of the children n1, …, nk by scanning SeqDB once
  - when n0 corresponds to a post-condition post ∈ POST, each child node ni corresponds to a post-condition post_i = ⟨e_i⟩ ++ post for some event e_i, so the post-condition of each child node has the suffix post in common
  - when scanning a sequence S ∈ SeqDB, we record the positions of each e_i and of the events appearing in post, from which we can compute the number of instances of pre ++ post_i in S
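The position-recording idea can be sketched as follows (illustrative names; for clarity each child pattern is counted from the sequence directly rather than from one shared scan). An instance of a pattern P at temporal point j requires S[j] = last(P) with the earlier events of P embedded before j, so the instance count equals the number of occurrences of last(P) after the leftmost embedding of P minus its last event.

```java
import java.util.*;

public class ChildScan {
    // Number of temporal points j with seq^j ⊒ p and seq[j] = last(p).
    public static int countInstances(List<String> seq, List<String> p) {
        int i = 0, q = 0;                      // embed p[0..m-2] greedily; q = end point (1-based)
        for (int j = 0; j < seq.size() && i < p.size() - 1; j++)
            if (seq.get(j).equals(p.get(i))) { i++; q = j + 1; }
        if (i < p.size() - 1) return 0;        // the prefix of p does not embed
        String last = p.get(p.size() - 1);
        int count = 0;
        for (int j = q; j < seq.size(); j++)   // occurrences of last(p) after point q
            if (seq.get(j).equals(last)) count++;
        return count;
    }

    // Instance counts in seq of every child pattern ⟨e⟩ ++ post sharing the suffix post.
    public static Map<String, Integer> childCounts(List<String> seq, List<String> post,
                                                   List<String> events) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (String e : events) {
            List<String> child = new ArrayList<>();
            child.add(e);
            child.addAll(post);
            out.put(e, countInstances(seq, child));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("check", "lock", "use", "use", "unlock", "exit");
        System.out.println(countInstances(s1, Arrays.asList("check", "lock", "use")));
        System.out.println(childCounts(s1, Arrays.asList("unlock"), Arrays.asList("lock", "use")));
    }
}
```

On S1 this reproduces the instance count of ⟨check, lock, use⟩ (two instances, S1^3 and S1^4) from slide 8.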
52. Conclusion & Future Work
- Conclusion
  - We have proposed the Parallel Non-Redundant Recurrent Rules Miner (pNR3)
  - We have proposed the Loop-Fused Non-Redundant Recurrent Rules Miner (LF-NR3)
  - We have proposed the Parallel Loop-Fused Non-Redundant Recurrent Rules Miner (pLF-NR3)
  - We have proposed the Interleaved Bidirectional Non-Redundant Recurrent Rules Miner (iBiRM)
- Future work
  - Improving the sequential recurrent rule mining algorithm
  - Improving the parallel algorithms
- Source code is available at https://bitbucket.org/sekilab/nr3
Editor's Notes
Good morning, everyone.
I am Yoon SeungYong, a student at Nagoya Institute of Technology.
Seki Hirohisa is my advisor and participated in this research.
From now on, I would like to introduce my research, "Parallel Mining of Non-Redundant Recurrent Rules from a Sequence Database".
I will first speak about the motivation of this research, and introduce recurrent rules and the algorithm NR3, the basis of this research.
I will then present our algorithm for parallel mining of recurrent rules, pNR3, and show its effectiveness based on experimental results.
Our motivation for this research.
I will first talk about sequence databases and sequential rules.
One example of a sequence database is transaction histories.
For instance, Alice rented Star Wars 4, 5, and 6, and then Star Wars 1, following the release dates.
Another example is program traces.
From these databases, we can infer rules such as ⟨Star Wars 4⟩ then ⟨Star Wars 5⟩, and ⟨lock⟩ then ⟨unlock⟩.
But why recurrent rules?
Because a recurrent rule captures temporal constraints both within a sequence and across multiple sequences.
Recall the previous examples.
In the transaction histories, we rarely care how many times a customer rents the same videos.
But in the program traces, we have to consider how many times a series of commands has been executed.
This is the reason recurrent rules were proposed.
Mined recurrent rules can be directly converted into Linear Temporal Logic, the most widely used formalism for program verification.
For more details, refer to the well-known textbook Model Checking.
From now on, I will introduce mining recurrent rules and the algorithm NR3.
We first define some terminology.
A sequence database is a set of sequences.
A sequence is a series of events.
In a sequence, we call the position of each event a temporal point.
And we refer to the first j events as the j-prefix of the sequence.
We will define some operations on sequences.
This is the concatenation of S and S'.
We say S is a super-sequence of S' if S contains S'.
A matched prefix is called an instance, and the shortest one is the minimum instance.
We will also define operations on a database.
A database is projected on a sequence P: for each sequence containing P, the longest remaining part goes into the projected database, as in the well-known operation.
A database is all-projected on a sequence P: for each sequence containing P, all of the remaining parts go into the all-projected database.
The number of such sequences is the support: the sequence support is for projection, and the instance support is for all-projection.
We define a recurrent rule R = pre then post.
The supports are almost the same as previously defined.
The confidence has a special form: intuitively, it measures how many sequences in the all-projected database on pre contain post.
We say a rule is significant if its supports and confidence are above the thresholds.
We now define the notion of rule redundancy.
Consider these two rules: R contains R', and they have the same supports and confidence.
This means that if a sequence contains R, it also contains R'.
We do not need to mine both of these rules, so we prune some of them.
We define a rule as redundant if there is another, longer rule with the same supports and confidence.
This is handled using the algorithm BIDE, a well-known frequent closed sequence miner.
Now I will introduce the Non-Redundant Recurrent Rules Miner, NR3, the work of David Lo and others.
NR3 receives a sequence database and three thresholds, and emits significant and non-redundant recurrent rules.
It first generates the candidate pre-conditions using BIDE, which consists of recursions; we call this step GenPre.
Next, looping over the candidate pre-conditions, it generates the candidate post-conditions and generates rules.
We call this step GenRule, and in this step we obtain the significant rules.
Finally, we remove the remaining redundant rules using hash tables, with the supports and confidence as keys.
We call this step RemRedun.
From now on, I will present our algorithm for parallel mining of recurrent rules, pNR3.
Let us review the previous work.
First, once the GenPre task finds a pre-condition candidate, we can handle the corresponding GenRule task immediately.
We call this strategy the single-producer-multiple-consumer framework, because GenRule tasks can be consumed as the GenPre task produces pre-conditions.
Second, we can handle the GenRule tasks concurrently.
We call this strategy loop-level parallelization.
This is our algorithm, the Parallel Non-Redundant Recurrent Rules Miner, pNR3.
The pNR3 instance starts mining pre-conditions.
GenPre then emits a GenRule task for each found pre-condition and pushes it into the thread pool.
The thread pool handles these GenRule tasks, and the tasks collect significant rules.
Finally, the RemRedun instance removes redundant rules.
This is our Java implementation.
It works as I explained.
The source code is available in our Bitbucket repository.
I will discuss the effect of parallelization.
We utilized two strategies: GenPre concurrency, the single-producer-multiple-consumer framework, and GenRule parallelization, the loop-level parallelization.
GenPre concurrency behaves as a maximum function of GenPre and GenRule, because the longer task determines the total runtime.
GenRule parallelization behaves as a divider function, because the available threads handle the GenRule tasks in parallel.
As a result, the runtime of our pNR3 is max(GenPre, GenRule / N) plus RemRedun.
We will see this discussion reflected in the experiment results.
I will explain the experiment environment.
We used two well-known datasets: one synthetic and one real.
We implemented NR3 and pNR3 in Java 8 and executed them on a common Core i7 machine with 4 physical cores.
This is the experiment result on the synthetic dataset.
The upper charts vary the minimum support, and the lower ones vary the confidence.
The first chart shows the runtime of the algorithms (NR3, and pNR3 on 2, 4, and 8 threads), the second the ratio of each task in NR3, and the third the numbers of pre-condition candidates and rules.
As discussed before, the runtime of our parallel algorithm is max(GenPre, GenRule / N) plus RemRedun.
In NR3, GenPre takes about 20% of the runtime, and RemRedun is negligible on this dataset.
So if the runtime of our parallel algorithm drops to about 20% of NR3's on this dataset, we can say our algorithm is effective.
As the results show, the runtime of 8-thread pNR3 is about 20% of NR3's, so our algorithm is very effective.
This is the experiment result on the real-world dataset.
The upper charts vary the minimum support, and the lower ones vary the confidence.
The first chart shows the runtime of the algorithms (NR3, and pNR3 on 2, 4, and 8 threads), the second the ratio of each task in NR3, and the third the numbers of pre-condition candidates and rules.
As discussed before, the runtime of our parallel algorithm is max(GenPre, GenRule / N) plus RemRedun.
In NR3, GenRule takes almost 100% of the runtime, and GenPre and RemRedun are negligible on this dataset.
So if the runtime of our parallel algorithm decreases as we increase the number of threads, we can say our algorithm is effective.
As the results show, the runtime of 4-thread pNR3 is about 30% of NR3's, and that of 8-thread pNR3 about 20%, so our algorithm is effective even taking into account some parallelization overhead.
Now I conclude.
We have proposed the algorithm Parallel Non-Redundant Recurrent Rules Miner, pNR3.
It utilizes two strategies: the single-producer-multiple-consumer framework and loop-level parallelism.
We showed the effectiveness of our algorithm based on experiments on synthetic and real datasets.
As future work, we will run experiments on program traces, the intended application of the rules.
We will also experiment on a many-core processor to see the effects more accurately.
Also, using a large memory, we will compare our algorithm to BOB, the successor of NR3.
We are now working on improving the sequential recurrent rule mining algorithms.
You can find our implementation in this repository.
This is all of my presentation.
Thank you for listening.
Do you have any questions?