The document discusses efficient algorithms for mining non-redundant recurrent rules from sequence databases. It begins by introducing recurrent rules and the temporal constraints they represent, repeated within and across sequences. Several algorithms for mining non-redundant recurrent rules are then presented, including the original NR3 algorithm, parallel and optimized variants of NR3, and a bidirectional approach. The document concludes with a discussion of interleaved bidirectional mining for further improvements.
Mining non-redundant recurrent rules from a sequence database
1. Mining Non-Redundant Recurrent Rules from a Sequence Database
Yoon SeungYong
Ministry of Science and ICT, Republic of Korea
forcom@forcom.kr
- Efficient Mining of Recurrent Rules from a Sequence Database (Lo et al., DASFAA 2008)
- Parallel Mining of Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, ISIS 2017)
· A Parallel Algorithm for Mining Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, JACIII 2019)
- Towards Efficient Mining of Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, IWCIA 2017)
· Mining Non-Redundant Recurrent Rules from a Sequence Database (Yoon and Seki, IJCISTUDIES 2018)
- Efficient Mining of Recurrent Rules from a Sequence Database Using Multi-Core Processors (Yoon and Seki, SCIS&ISIS 2018)
- Bidirectional Mining of Non-Redundant Recurrent Rules from a Sequence Database (Lo et al., IEEE ICDE 2011)
- A New Algorithm for Mining Recurrent Rules from a Sequence Database (Seki and Yoon, IEEE SMC 2019)
2. Table of Contents
1. Motivation
2. Mining Non-Redundant Recurrent Rules (NR3) – Lo et al.
3. Parallel Mining of Non-Redundant Recurrent Rules (pNR3)
4. Loop-Fused Mining of NR3 (LF-NR3)
5. Parallel Loop-Fused Mining of NR3 (pLF-NR3)
6. Bidirectional Mining of NR3 (BOB) – Lo et al.
7. Interleaved Bidirectional Mining of NR3 (iBiRM)
8. Conclusion
2019.11.18. 2
4. Sequence Database & Sequential Rule
Transaction Histories
Program Traces
Customer Movie Rental History
Alice Star Wars 4, Star Wars 5, Star Wars 6, Star Wars 1
Bob Shrek, Spirited Away, Your Name
Clara Spirited Away, Howl’s Moving Castle, Princess Mononoke
David Star Wars 1, Star Wars 2, Star Wars 3, Star Wars 4, Star Wars 5
Eve Your Name
Trace ID Command
1 check, lock, use, use, unlock, exit
2 check, lock, use, check, lock, use, unlock, exit
3 check, use, unlock, exit
4 check, lock, use
5 check, lock, use, unlock, check, lock, use, unlock, exit
〈Star Wars 4〉→ 〈Star Wars 5〉
〈lock〉→ 〈unlock〉
5. What is a recurrent rule?
Recurrent Rule R = R_pre → R_post
“Whenever a series of precedent events occurs,
eventually another series of consequent events occurs”
e.g., R = ⟨check, lock⟩ → ⟨use, unlock⟩
“Whenever ⟨check, lock⟩ occurs, eventually ⟨use, unlock⟩ occurs”
Captures temporal constraints that repeat a meaningful number of times,
both within a sequence and across multiple sequences
A sequential rule R = R_pre → R_post means “whenever a sequence is a super-sequence of R_pre, it will be a super-sequence of R_pre ++ R_post”
Linear Temporal Logic (LTL)
One of the most widely-used formalisms for program verification
Clarke, Edmund M., Orna Grumberg, and Doron Peled. Model checking. MIT press, 1999.
Recurrent rule can be expressed in the form of LTL
- proposed by David LO
6. Mining Non-Redundant Recurrent Rules (NR3)
based on David LO, Siau-Cheng KHOO, NUS and Chao LIU, DASFAA, 2008
7. Preliminaries & Examples (1)
a sequence database SeqDB – a set of sequences : S1, S2, S3, S4, S5
a set of events I in SeqDB : {check, exit, lock, unlock, use}
the size of SeqDB = |SeqDB| : |SeqDB| = 5
a sequence S = ⟨e1, e2, …, en⟩ : S1 = ⟨check, lock, use, use, unlock, exit⟩
a temporal point j of ej in S : the event at temporal point 5 in S1 is unlock
the length of S = |S| = n : |S1| = 6
the last event of S = last(S) = S[n] : last(S1) = exit
the j-prefix of S = S^j = ⟨e1, e2, …, ej⟩ : S1^2 = ⟨check, lock⟩
SID Sequence
𝑆1 ⟨check, lock, use, use, unlock, exit⟩
𝑆2 ⟨check, lock, use, check, lock, use, unlock, exit⟩
𝑆3 ⟨check, use, unlock, exit⟩
𝑆4 ⟨check, lock, use⟩
𝑆5 ⟨check, lock, use, unlock, check, lock, use, unlock, exit⟩
an example sequence database 𝑆𝑒𝑞𝐷𝐵
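The sequence support of a pattern counts the sequences in SeqDB that are super-sequences of it. A minimal sketch in Java (the deck's implementation language; the class and method names here are mine, not taken from the NR3 sources):

```java
import java.util.Arrays;
import java.util.List;

public class SeqSupport {
    // True if s is a super-sequence of p: p embeds into s left to right.
    static boolean superSeq(List<String> s, List<String> p) {
        int i = 0;
        for (String e : s) {
            if (i < p.size() && e.equals(p.get(i))) i++;
        }
        return i == p.size();
    }

    // Sequence support of p: the number of sequences in seqDB containing p.
    static int support(List<List<String>> seqDB, List<String> p) {
        int n = 0;
        for (List<String> s : seqDB) if (superSeq(s, p)) n++;
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> seqDB = Arrays.asList(
            Arrays.asList("check", "lock", "use", "use", "unlock", "exit"),
            Arrays.asList("check", "lock", "use", "check", "lock", "use", "unlock", "exit"),
            Arrays.asList("check", "use", "unlock", "exit"),
            Arrays.asList("check", "lock", "use"),
            Arrays.asList("check", "lock", "use", "unlock", "check", "lock", "use", "unlock", "exit"));
        // ⟨check, lock⟩ is contained in S1, S2, S4, and S5, but not in S3.
        System.out.println(support(seqDB, Arrays.asList("check", "lock"))); // 4
    }
}
```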
8. Preliminaries & Examples (2)
Given sequences S = ⟨e1, …, en⟩ and S′ = ⟨e′1, …, e′m⟩
the concatenation of S and S′ ≔ S ++ S′ = ⟨e1, …, en, e′1, …, e′m⟩
S is a super-sequence of S′ ≔ S ⊒ S′, if e_i1 = e′1, …, e_im = e′m for some indices 1 ≤ i1 < ⋯ < im ≤ n
e.g., S1 ⊒ ⟨check, lock, unlock⟩
S^j is an instance of S′ in S, if S^j ⊒ S′ and last(S′) = S[j]
S^j is the minimum instance of S′ in S, if S^j is an instance of S′ and ∄k < j such that S^k is an instance of S′
e.g., S1^3 and S1^4 are instances of ⟨check, lock, use⟩ in S1, and S1^3 is the minimum
S5^9 is an instance of S1 in S5, and it is the minimum
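The minimum instance can be found with a greedy left-to-right embedding: the position where the last event of the pattern is first matched is the smallest j such that S^j is an instance. A small sketch (names mine, illustrative only):

```java
import java.util.Arrays;
import java.util.List;

public class MinInstance {
    // Returns the smallest j such that the j-prefix of s is an instance of p
    // (s^j ⊒ p and s[j] = last(p)), or -1 if no instance exists.
    // The greedy embedding matches each event of p as early as possible,
    // which minimizes the position of the final match.
    static int minimumInstance(List<String> s, List<String> p) {
        int i = 0; // next event of p to match
        for (int j = 0; j < s.size(); j++) {
            if (i < p.size() && s.get(j).equals(p.get(i))) {
                i++;
                if (i == p.size()) return j + 1; // temporal points are 1-based
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("check", "lock", "use", "use", "unlock", "exit");
        // S1^3 is the minimum instance of ⟨check, lock, use⟩ in S1.
        System.out.println(minimumInstance(s1, Arrays.asList("check", "lock", "use"))); // 3
    }
}
```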
11. Rule Redundancy
Consider R = ⟨check⟩ → ⟨lock, use, unlock⟩ and R′ = ⟨check⟩ → ⟨unlock⟩
with the same sequence/instance support and confidence
Do we really need both of these rules?
Rule Redundancy
A rule R′ = R′_pre → R′_post is redundant if there is another rule R = R_pre → R_post with:
1. the same sequence/instance support and confidence
2. R_pre ++ R_post ⊒ R′_pre ++ R′_post (R is longer than R′)
Mining Non-Redundant Recurrent Rules
Mine pruned pre/post-conditions using modified BIDE (LS-Set miner)
BIDE : frequent closed sequence mining algorithm based on pattern-growth strategy
Wang, Jianyong, and Jiawei Han. “BIDE: Efficient Mining of Frequent Closed Sequences.” Proceedings of the 20th International Conference on Data Engineering (ICDE), IEEE, 2004.
𝑆 = ⟨check, lock, use, unlock⟩
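The redundancy check itself reduces to a super-sequence test once the supports and confidence are known to match. A minimal sketch (helper names are mine):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Redundancy {
    static boolean superSeq(List<String> s, List<String> p) {
        int i = 0;
        for (String e : s) if (i < p.size() && e.equals(p.get(i))) i++;
        return i == p.size();
    }

    // Assuming both rules already have the same sequence/instance support and
    // confidence (checked elsewhere), R' = preB → postB is redundant w.r.t.
    // R = preA → postA when preA ++ postA ⊒ preB ++ postB (R is the longer rule).
    static boolean makesRedundant(List<String> preA, List<String> postA,
                                  List<String> preB, List<String> postB) {
        List<String> a = new ArrayList<>(preA); a.addAll(postA);
        List<String> b = new ArrayList<>(preB); b.addAll(postB);
        return superSeq(a, b);
    }

    public static void main(String[] args) {
        // ⟨check⟩ → ⟨lock, use, unlock⟩ subsumes ⟨check⟩ → ⟨unlock⟩.
        System.out.println(makesRedundant(
            Arrays.asList("check"), Arrays.asList("lock", "use", "unlock"),
            Arrays.asList("check"), Arrays.asList("unlock"))); // true
    }
}
```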
12. FS-Set, CS-Set, LS-Set
The set of frequent sequential patterns (FS-Set)
FS = {s | support(s) ≥ min_sup}
The set of closed frequent sequential patterns (CS-Set)
CS = {s | s ∈ FS and ∄s′ ∈ FS such that s ⊑ s′ and support(s) = support(s′)}
The projected-database closed set (LS-Set)
LS = {s | support(s) ≥ min_sup and ∄s′ such that s ⊑ s′ and SeqDB_s = SeqDB_s′}
cf. for s ⊑ s′, |SeqDB_s| = |SeqDB_s′| ⇔ SeqDB_s = SeqDB_s′
Xifeng Yan, Jiawei Han, Ramin Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets“, SIAM 2003
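The projected database SeqDB_s underlying these definitions keeps, for each sequence containing s, the suffix after the minimum instance of s. A small sketch under that reading (names mine, not from the NR3 sources):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Projection {
    // 1-based end position of the minimum instance of p in s, or -1 if absent.
    static int minInstanceEnd(List<String> s, List<String> p) {
        int i = 0;
        for (int j = 0; j < s.size(); j++) {
            if (i < p.size() && s.get(j).equals(p.get(i))) {
                i++;
                if (i == p.size()) return j + 1;
            }
        }
        return -1;
    }

    // SeqDB projected on p: for each sequence containing p, keep the suffix
    // that follows the minimum instance of p.
    static List<List<String>> project(List<List<String>> seqDB, List<String> p) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> s : seqDB) {
            int end = minInstanceEnd(s, p);
            if (end >= 0) out.add(new ArrayList<>(s.subList(end, s.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> seqDB = Arrays.asList(
            Arrays.asList("check", "lock", "use", "use", "unlock", "exit"),
            Arrays.asList("check", "use", "unlock", "exit"));
        // Only the first sequence contains ⟨check, lock⟩; its minimum instance ends at point 2.
        System.out.println(project(seqDB, Arrays.asList("check", "lock"))); // [[use, use, unlock, exit]]
    }
}
```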
13. Pruning Redundant Pre-Conds
In a sequence database SeqDB, consider a pre-condition candidate R_pre.
If there is a pre-condition candidate R′_pre ⊐ R_pre such that
(i) R′_pre = P1 ++ e ++ P2 while R_pre = P1 ++ P2, for some event e and nonempty P1, P2
(ii) SeqDB_R_pre = SeqDB_R′_pre
then, for any post-condition candidate post and any forward extension R_pre ++ P,
the rule R_pre ++ P → post is redundant
14. LS-Set BIDE
Backward-extension event checking is omitted from the original BIDE algorithm
• David Lo, Siau-Cheng KHOO, Chao LIU, “Mining Recurrent Rules from Sequence Database”, TR12/07 NUS
15. Non-Redundant Recurrent Rules Miner (NR3)
Input: a sequence database SeqDB; thresholds min_sup, min_sup_all, min_conf
Output: significant and non-redundant recurrent rules, Rules
Procedure
1. PreCond := a pruned set of pre-conditions from SeqDB satisfying min_sup
2. foreach pre ∈ PreCond do
   2.1. SeqDB_pre^all := SeqDB all-projected on pre
   2.2. bthd := min_conf × |SeqDB_pre^all|
   2.3. PostCond := a pruned set of post-conditions from SeqDB_pre^all satisfying bthd
   2.4. foreach post ∈ PostCond do
        2.4.1. if sup_all(pre ++ post, SeqDB) ≥ min_sup_all then Rules := Rules ∪ {pre → post}
3. Remove remaining redundancy in Rules
Alias for Tasks
Procedure line 1 : GenPre task
Procedure line 2.1 – 2.4 : GenRule task
Procedure line 3 : RemRedun task
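The control flow of the procedure can be sketched as follows. This is a structural sketch only: the pre-/post-condition miners are stubbed with fixed candidate sets, whereas NR3 uses the LS-Set BIDE miner, and the all-projection and support computation are likewise placeholders.

```java
import java.util.*;

public class NR3Sketch {
    // Stub for the GenPre task (LS-Set miner in the real algorithm).
    static List<List<String>> genPre(List<List<String>> seqDB, int minSup) {
        return Arrays.asList(Arrays.asList("check", "lock"));
    }
    // Stub for mining post-conditions from the all-projected database.
    static List<List<String>> genPost(List<List<String>> projDB, double bthd) {
        return Arrays.asList(Arrays.asList("use", "unlock"));
    }
    // Stub for the instance support of pre ++ post in SeqDB.
    static int supAll(List<String> pattern, List<List<String>> seqDB) {
        return 5;
    }
    static List<String> concat(List<String> a, List<String> b) {
        List<String> r = new ArrayList<>(a); r.addAll(b); return r;
    }

    static Map<List<String>, List<String>> mine(List<List<String>> seqDB,
            int minSup, int minSupAll, double minConf) {
        Map<List<String>, List<String>> rules = new LinkedHashMap<>();
        for (List<String> pre : genPre(seqDB, minSup)) {            // lines 1-2
            List<List<String>> projAll = seqDB;                     // 2.1 (all-projection stubbed)
            double bthd = minConf * projAll.size();                 // 2.2
            for (List<String> post : genPost(projAll, bthd)) {      // 2.3-2.4
                if (supAll(concat(pre, post), seqDB) >= minSupAll)  // 2.4.1
                    rules.put(pre, post);
            }
        }
        // line 3 (the RemRedun task) would remove remaining redundancy here
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(mine(new ArrayList<>(), 2, 2, 0.5));
    }
}
```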
[Figure: a prefix tree of pre-condition candidates rooted at ε, the rules generated from each pre-condition (e.g., ⟨a⟩ → ⟨c,a,d⟩, ⟨a⟩ → ⟨c,b,b⟩, ⟨a,b⟩ → ⟨c,d⟩, ⟨a,b⟩ → ⟨c,a⟩, ⟨a⟩ → ⟨b⟩), and a hash table used to remove the remaining redundancy]
17. Revisiting Non-Redundant Recurrent Rules Miner (NR3)
(the NR3 procedure of Slide 15 is shown again here, annotated with the parallelization points below)
Parallelization Strategy
1. the single-producer-multiple-consumer framework
2. the loop-level parallelization
[Figure: the NR3 prefix tree, rule set, and hash table, annotated with the two parallelization points – point 1 for the single-producer-multiple-consumer framework and point 2 for the loop-level parallelization]
18. Parallel Non-Redundant Recurrent Rules Miner (pNR3)
[Figure: the pNR3 framework – the GenPre task produces pre-conditions into a task queue, worker threads in a thread pool execute GenRule[pre] tasks (e.g., GenRule[a], GenRule[a,b], GenRule[c,b], GenRule[c,b,c]), and the RemRedun task merges the resulting rules (e.g., ⟨a⟩ → ⟨c,a,d⟩, ⟨a,b⟩ → ⟨c,d⟩) through a hash table]
19. Parallel Non-Redundant Recurrent Rules Miner (pNR3)
- pNR3 framework
- GenPre task
- GenRule task
The source code is available at https://bitbucket.org/sekilab/nr3
20. Parallelization Effects of pNR3
Let t_T be the runtime of a task T, and N the number of available threads
NR3 : t_GenPre + t_GenRule + t_RemRedun
pNR3 : max(t_GenPre, t_GenRule/N) + t_RemRedun
GenPre concurrency : max(t_GenPre, t_GenRule) + t_RemRedun
GenRule parallelization : t_GenPre + t_GenRule/N + t_RemRedun
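These cost models are easy to sanity-check numerically. Only the formulas come from the slide; the task runtimes below are made-up numbers for illustration:

```java
public class SpeedupModel {
    // Sequential NR3: the three tasks run one after another.
    static double nr3(double tPre, double tRule, double tRem) {
        return tPre + tRule + tRem;
    }
    // pNR3: GenPre runs concurrently with the N-way parallel GenRule loop,
    // so the first phase takes the maximum of the two; RemRedun remains serial.
    static double pnr3(double tPre, double tRule, double tRem, int n) {
        return Math.max(tPre, tRule / n) + tRem;
    }

    public static void main(String[] args) {
        double tPre = 10, tRule = 80, tRem = 5; // hypothetical task runtimes (seconds)
        System.out.println(nr3(tPre, tRule, tRem));     // 95.0
        System.out.println(pnr3(tPre, tRule, tRem, 8)); // max(10, 80/8) + 5 = 15.0
    }
}
```

With a GenRule-dominated workload, the model predicts near-linear speedup until t_GenRule/N drops below t_GenPre, after which GenPre becomes the bottleneck.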
[Figure: the NR3 prefix tree and rule set, annotated with where the GenPre concurrency (the max term), the GenRule parallelization (the 1/N factor), and the RemRedun hash table apply]
21. Experiment Environment
Dataset
D10C10N10R0.5 (IBM synthetic data generator)
9,678 sequences, average length 31.22
BMSWebView1 (a click stream dataset (Gazelle) from KDD Cup 2000)
59,601 sequences, average length 2.42
Experiment Machine
Intel Core i7-3610QM 2.30GHz (4 physical and 8 logical cores)
8GB RAM
Microsoft Windows 7 Professional x64
Implementation
Java SE 8
Default JVM settings
28. Data Structure Level Optimization for Projections
For each sequence S_i in SeqDB and the set I of events,
keep a hash map Map_i : I → 2^{1,…,|S_i|}
that maps each key e ∈ I to the set of temporal points at which event e occurs in S_i
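Such a map is built in one pass over each sequence; projections can then jump directly to the occurrence positions of an event instead of rescanning. A minimal sketch (names mine):

```java
import java.util.*;

public class OccurrenceIndex {
    // Build Map_i : event -> list of temporal points (1-based) where it occurs in S_i.
    static Map<String, List<Integer>> build(List<String> s) {
        Map<String, List<Integer>> map = new HashMap<>();
        for (int j = 0; j < s.size(); j++) {
            map.computeIfAbsent(s.get(j), k -> new ArrayList<>()).add(j + 1);
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("check", "lock", "use", "use", "unlock", "exit");
        Map<String, List<Integer>> map1 = build(s1);
        // "use" occurs at temporal points 3 and 4 of S1, so a projection can
        // locate its occurrences without scanning the whole sequence again.
        System.out.println(map1.get("use")); // [3, 4]
    }
}
```

As the slide notes later, this index pays off mainly on long sequences; for very short sequences the map-building overhead can outweigh the saved scans.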
29. Experiment Environment
(the same datasets, machine, and implementation as in Slide 21)
32. Discussion
Computational Complexity of the Algorithms
O(|I|^k × |I|^k) (I : the set of events, k : the length of the longest frequent pattern)
The effects of fusing loops in NR3
The foreach loop in the GenRule step is eliminated
The use of the intermediate data SeqDB_pre simplifies the computations:
SeqDB_pre^all = SeqDB_pre ∪ (SeqDB_pre)_⟨last(pre)⟩^all
sup_all(pre → post, SeqDB) = sup_all(post, SeqDB_pre)
The effect of the hash-based data structure
Efficient computation of (all-)projected databases
Using the hash-based data structure is not always efficient if the sequences are short
34. Loop-Fused NR3 (LF-NR3)
The task parallelism underlying the LF-NR3 algorithm can still be exploited,
and it can be handled within the single-producer-multiple-consumer framework
40. Additional Definitions
a sequence database SeqDB – a set of sequences
a sequence S = ⟨e1, e2, …, en⟩
the j-suffix of S = ⟨e_{n−j+1}, e_{n−j+2}, …, en⟩
S′ is the j-th minimum suffix of S with respect to a pattern P,
if S′ is a suffix of S such that no suffix starting with first(P) is shorter than S′
yet longer than the (j−1)-th minimum suffix
The j-th suf-projection of SeqDB with regard to a pattern P
SeqDB_P^{suf-j} = {(i, sx) | S_i = px ++ sx ∈ SeqDB, sx is the j-th minimum suffix of S_i w.r.t. P}
SeqDB pre-projected on P
SeqDB_P^{pre} = {(i, px) | S_i = px ++ sx ∈ SeqDB, sx is the minimum suffix of S_i w.r.t. P}
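The j-suffix itself is just the last j events of a sequence; the minimum-suffix machinery is built on top of it. A trivial illustration (names mine, suf-projection omitted):

```java
import java.util.Arrays;
import java.util.List;

public class Suffixes {
    // The j-suffix of S = ⟨e_{n-j+1}, ..., e_n⟩: the last j events of S.
    static List<String> jSuffix(List<String> s, int j) {
        return s.subList(s.size() - j, s.size());
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("check", "lock", "use", "use", "unlock", "exit");
        System.out.println(jSuffix(s1, 2)); // [unlock, exit]
    }
}
```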
41. Anti-Monotonicity Property of Confidence
Proposition 1
Consider a rule R, in the form of R_pre → R_post, and a sequence database SeqDB:
conf(R, SeqDB) = sup(R_post, SeqDB_R_pre^all) / sup_all(R_pre, SeqDB)
              = sup_all(R_pre, SeqDB_R_post^pre) / sup_all(R_pre, SeqDB)
Proposition 2
Consider two rules R and R′ in a sequence database SeqDB with R′_pre = R_pre and
R′_post = e ++ R_post for some event e ∈ I:
conf(R) ≥ conf(R′)
Theorem. Anti-Monotonicity Property of Confidence
Consider two rules R and R′ in a sequence database SeqDB with R′_pre = R_pre and
R′_post = evs ++ R_post, where evs is an arbitrary series of events:
conf(R) ≥ conf(R′)
If R is not confident enough (conf(R) < min_conf), then R′ is not either
42. Pruning Redundant Post-Conds
In a sequence database SeqDB, consider a post-condition candidate R_post.
Lemma 1
If there is a post-condition candidate R′_post ⊐ R_post such that
(i) R′_post = P1 ++ e ++ P2 while R_post = P1 ++ P2, for some event e, subsequences P1, (nonempty) P2
(ii) SeqDB_R_post^pre = SeqDB_R′_post^pre
then for any pre-condition candidate pre and any backward extension P ++ R_post of R_post,
the rule R = pre → P ++ R_post is not confidence-closed,
i.e., there exists another rule R′ ⊐ R such that conf(R) = conf(R′)
Lemma 2
If there is a post-condition candidate R′_post ⊐ R_post such that
(i) R′_post = P1 ++ e ++ P2 while R_post = P1 ++ P2, for some event e, subsequences (nonempty) P1, P2
(iii) ∀j : SeqDB_R_post^{suf-j} = SeqDB_R′_post^{suf-j}, and
(iv) ∀j : (SeqDB_R_post^{suf-j})_R_post^all = (SeqDB_R′_post^{suf-j})_R′_post^all
then for any pre-condition candidate pre and any backward extension P ++ R_post of R_post,
the rule R = pre → P ++ R_post is not support-closed,
i.e., there exists another rule R′ ⊐ R such that sup(R) = sup(R′) and sup_all(R) = sup_all(R′)
Theorem. Pruning Redundant Post-Conds
If the properties (i)–(iv) in Lemmas 1 and 2 are satisfied,
then for any pre-condition candidate pre and any backward extension P ++ R_post of R_post,
the rule R = pre → P ++ R_post is redundant.
45. Optimizing Operations
Given the sequence database SeqDB and a rule R = pre → post:
sup(R, SeqDB) = sup(post, SeqDB_pre)
sup_all(R, SeqDB) = sup_all(post, SeqDB_pre)
Pruning the search space of PRE early
for R = pre → post and R′ = pre ++ e → post,
if sup(R, SeqDB) ≤ min_sup, then sup(R′, SeqDB) ≤ min_sup
if sup_all(R, SeqDB) ≤ min_sup_all, then sup_all(R′, SeqDB) ≤ min_sup_all
Decreasing the number of database scans using a prefix tree
for each pre-condition pre ∈ PRE, suppose that a node N0 ∈ T_POST has children N1, …, Nk;
we can then compute the instance supports of the children N1, …, Nk by scanning SeqDB once
When N0 corresponds to a post-condition post ∈ POST, each child node Ni corresponds to
a post-condition post_i = e_i ++ post for some event e_i, so the post-conditions of the
children share the suffix post in common.
When scanning a sequence s ∈ SeqDB, we record the positions of each e_i and
of the events appearing in post, from which we can compute the number of
instances of pre ++ post_i in s
52. Conclusion & Future Works
Conclusion
We have proposed the Parallel Non-Redundant Recurrent Rules Miner (pNR3)
We have proposed the Loop-Fused Non-Redundant Recurrent Rules Miner (LF-NR3)
We have proposed the Parallel Loop-Fused Non-Redundant Recurrent Rules Miner (pLF-NR3)
We have proposed the Interleaved Bidirectional Non-Redundant Recurrent Rules Miner (iBiRM)
Future work
Improvement of the sequential recurrent rule mining algorithm
Improvement of the parallel algorithms
The source code is available at https://bitbucket.org/sekilab/nr3
Editor's Notes
Good morning everyone.
I am Yoon SeungYong, a student at Nagoya Institute of Technology.
Seki Hirohisa is my advisor and participated in this research.
From now, I’d like to introduce my research, ‘Parallel Mining of Non-Redundant Recurrent Rules from a Sequence Database’.
I will, first, speak of the motivation of this research, and introduce the recurrent rules and the algorithm NR3, base of this research.
I, then, present our algorithm, parallel mining of recurrent rules, pNR3, and show the effectiveness of our algorithm based on experiment results.
Our motivation on the research
I first talk about the sequence database and sequential rules.
An example of a sequence database is transaction histories.
For instance, Alice rented Star Wars 4, 5, and 6, and then Star Wars 1, following the release order.
Another example is program traces.
From these databases, we can infer a rule <Star Wars 4> then <Star Wars 5>, and <lock> then <unlock>.
But why recurrent rules?
Because a recurrent rule captures temporal constraints within a sequence and across multiple sequences.
Recall the previous examples.
In the transaction histories, we rarely care how many times a customer rents the same videos.
But in the program traces, we have to consider how many times a series of commands has been executed.
This is why recurrent rules were proposed.
And mined recurrent rules can be directly converted into Linear Temporal Logic, the most widely used formalism for program verification.
For more details, refer to a standard textbook, such as Model Checking.
From now, I will introduce mining recurrent rules, and the algorithm NR3.
We first define some terminologies.
A sequence database is a set of sequences.
A sequence is a series of events.
In a sequence, we call the position of each event a temporal point.
And we refer to the first j events as the j-prefix of the sequence.
We will define some operations on the sequence.
This is a concatenation of S and S’.
We say S is a super-sequence of S’, if S contains S’.
A prefix of a sequence that contains S’ is called an instance, and the shortest one is the minimum instance.
We will define the operation on a database.
We say a database is projected on a sequence P: for each sequence that contains P, the remaining part after the minimum instance of P goes into the projected database; this is a well-known operation.
We say a database is all-projected on a sequence P: for each sequence that contains P, all of the remaining parts, one per instance of P, go into the all-projected database.
We call the number of such sequences the support: the sequence support corresponds to projection, and the instance support to all-projection.
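To make the projection operation concrete, here is a minimal Java sketch (our own illustrative code, not the NR3 implementation), assuming "contains" means subsequence containment and the projected remainder is the suffix after the minimum instance:

```java
import java.util.*;

public class Projection {
    // Index just past the end of the minimum instance of pattern
    // as a subsequence of seq, or -1 if seq does not contain pattern.
    static int firstInstanceEnd(List<String> seq, List<String> pattern) {
        int j = 0;
        for (int i = 0; i < seq.size() && j < pattern.size(); i++) {
            if (seq.get(i).equals(pattern.get(j))) j++;
            if (j == pattern.size()) return i + 1;
        }
        return -1;
    }

    // Projection: one suffix per containing sequence, taken after the
    // minimum instance of p. (All-projection would instead collect one
    // suffix per instance of p in each sequence.)
    static List<List<String>> project(List<List<String>> db, List<String> p) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> s : db) {
            int end = firstInstanceEnd(s, p);
            if (end >= 0) out.add(s.subList(end, s.size()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("b", "a"),
                Arrays.asList("a", "c"));
        System.out.println(project(db, Arrays.asList("a", "b"))); // [[c]]
    }
}
```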
We will define a recurrent rule R = pre then post.
The supports are almost the same as previously defined.
The confidence has a special form: intuitively, it is the fraction of sequences in the all-projected database on pre that contain post.
We say a rule is significant if its supports and confidence are above the thresholds.
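Written as formulas, these definitions can be read as follows (our notation, reconstructed from this slide; the paper's exact formulation may differ slightly):

```latex
% sequence support of R = pre -> post
\mathrm{sup}(R) \;=\; \bigl|\{\, s \in SeqDB \mid s \sqsupseteq pre \mathbin{+\!\!+} post \,\}\bigr|
% confidence: fraction of sequences in the all-projected database on pre
% that contain post
\mathrm{conf}(R) \;=\;
  \frac{\bigl|\{\, s' \in SeqDB^{all}_{pre} \mid s' \sqsupseteq post \,\}\bigr|}
       {\bigl|SeqDB^{all}_{pre}\bigr|}
```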
We will define the notion of Rule Redundancy.
Consider these two rules.
R contains R’, and they have the same support and confidence.
This means that if a sequence contains R, then it also contains R’.
We do not need to mine both of these rules, so we will prune some of them.
We define a rule as redundant if there is another, longer rule that has the same support and confidence.
And this is handled using the algorithm BIDE, a well-known frequent closed-sequence miner.
Now I will introduce the algorithm of Non-Redundant Recurrent Rules Miner, NR3, the work of David Lo, and others.
NR3 receives a sequence database and three thresholds, and emits significant, non-redundant recurrent rules.
It first generates the candidate pre-conditions using BIDE, which consists of recursive calls.
So we call this step GenPre.
Next, looping over the candidate pre-conditions, it generates candidate post-conditions and forms rules.
We call this step GenRule, and in this step we obtain the significant rules.
Finally, we remove the remaining redundant rules using hash tables keyed by the supports and confidence.
We call this step RemRedun.
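As an illustration of the RemRedun idea (a simplified Java sketch, not the actual implementation; the class and method names are ours), rules are bucketed by a (support, confidence) key, and within a bucket a rule subsumed by a strictly longer rule is dropped:

```java
import java.util.*;

public class RemRedunSketch {
    static class Rule {
        final List<String> pre, post; final int sup; final double conf;
        Rule(List<String> pre, List<String> post, int sup, double conf) {
            this.pre = pre; this.post = post; this.sup = sup; this.conf = conf;
        }
    }

    // true iff a is a subsequence of b
    static boolean subseq(List<String> a, List<String> b) {
        int j = 0;
        for (String e : b) if (j < a.size() && e.equals(a.get(j))) j++;
        return j == a.size();
    }

    // Bucket rules by (support, confidence); inside a bucket, drop a rule if a
    // strictly longer rule subsumes both its pre- and post-condition.
    static List<Rule> removeRedundant(List<Rule> rules) {
        Map<String, List<Rule>> buckets = new HashMap<>();
        for (Rule r : rules)
            buckets.computeIfAbsent(r.sup + "/" + r.conf, k -> new ArrayList<>()).add(r);
        List<Rule> kept = new ArrayList<>();
        for (List<Rule> bucket : buckets.values())
            for (Rule r : bucket) {
                boolean redundant = false;
                for (Rule o : bucket)
                    if (o != r
                            && o.pre.size() + o.post.size() > r.pre.size() + r.post.size()
                            && subseq(r.pre, o.pre) && subseq(r.post, o.post)) {
                        redundant = true; break;
                    }
                if (!redundant) kept.add(r);
            }
        return kept;
    }
}
```

For example, with the rules <a> → <b> and <a, c> → <b> sharing the same support and confidence, the shorter rule is pruned.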
From now I will show our algorithm, parallel mining of recurrent rules, pNR3.
Let’s review the previous work.
First, as soon as the GenPre task finds one pre-condition candidate, we can handle the corresponding GenRule task immediately.
We call this strategy the single-producer multiple-consumer framework,
because the GenRule tasks can be consumed as the GenPre task produces pre-conditions.
Second, we can concurrently handle the GenRule tasks.
We call this strategy, namely, the loop-level parallelization.
This is our algorithm Parallel Non-Redundant Recurrent Rules Miner, pNR3.
The pNR3 instance starts to mine pre-conditions.
Then GenPre emits a GenRule task for each found pre-condition and pushes it into the thread pool.
The thread pool handles these GenRule tasks, and the tasks collect significant rules.
Finally the RemRedun instance removes redundant rules.
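A minimal sketch of this pipeline in Java (our own illustrative code; in the real pNR3 the producer runs the BIDE recursion, for which a fixed candidate list stands in here):

```java
import java.util.*;
import java.util.concurrent.*;

public class PNR3Sketch {
    // Single producer: `pres` stands in for pre-conditions emitted by GenPre.
    // Multiple consumers: each GenRule task runs on the thread pool and
    // collects its rules into a concurrent queue.
    static int run(int nThreads, List<String> pres) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        Queue<String> rules = new ConcurrentLinkedQueue<>();
        List<Future<?>> tasks = new ArrayList<>();
        for (String pre : pres) {               // GenPre produces a pre-condition...
            tasks.add(pool.submit(() -> {       // ...and spawns a GenRule task for it
                rules.add(pre + " -> post");    // placeholder for real rule generation
            }));
        }
        for (Future<?> f : tasks) f.get();      // wait for all GenRule tasks
        pool.shutdown();
        return rules.size();                    // RemRedun would now run over `rules`
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(4, Arrays.asList("a", "ab", "abc"))); // 3
    }
}
```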
This is our Java implementation.
It works as I explained.
The source codes are available at our Bitbucket repository.
I will discuss the effect of parallelization.
We utilized two strategies: GenPre concurrency, the single-producer multiple-consumer framework, and GenRule parallelization, the loop-level parallelization.
GenPre concurrency turns the runtime into the maximum of GenPre and GenRule, because the longer task determines the total runtime.
GenRule parallelization divides the GenRule time by the number of threads, because the available threads handle the GenRule tasks in parallel.
As a result, the runtime of our pNR3 is max(GenPre, GenRule/N) plus RemRedun.
We will see this reflected in the experiment results.
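Written out, the cost model just described is roughly (our notation; N is the number of threads):

```latex
T_{\mathrm{pNR3}} \;\approx\; \max\!\Bigl(T_{\mathrm{GenPre}},\; \frac{T_{\mathrm{GenRule}}}{N}\Bigr) \;+\; T_{\mathrm{RemRedun}}
```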
I'll explain the experimental environment.
We used two well-known datasets: one synthetic and one real.
We implemented NR3 and pNR3 in Java 8, and executed them on a commodity Core i7 machine with 4 physical cores.
This is an experiment result on synthetic dataset.
The upper row shows results when varying the minimum support, and the lower row when varying the confidence.
The first chart shows the runtime of NR3 and of pNR3 on 2, 4, and 8 threads; the second, the runtime ratio of each task in NR3; and the third, the number of pre-condition candidates and rules.
As we discussed before, the runtime of our parallel algorithm is max(GenPre, GenRule/N) plus RemRedun.
In NR3, GenPre takes about 20% of the runtime, and RemRedun is negligible on this dataset.
So if the runtime of our parallel algorithm approaches 20% of NR3's on this dataset, we can say our algorithm is effective.
As the results show, the runtime of 8-thread pNR3 is about 20% of NR3's, so we can say our algorithm is very effective.
This is an experiment result on real world dataset.
The upper row shows results when varying the minimum support, and the lower row when varying the confidence.
The first chart shows the runtime of NR3 and of pNR3 on 2, 4, and 8 threads; the second, the runtime ratio of each task in NR3; and the third, the number of pre-condition candidates and rules.
As we discussed before, the runtime of our parallel algorithm is max(GenPre, GenRule/N) plus RemRedun.
In NR3, GenRule takes almost 100% of the runtime, and GenPre and RemRedun are negligible on this dataset.
So if the runtime of our parallel algorithm decreases as we increase the number of threads, we can say our algorithm is effective.
As the results show, the runtime of 4-thread pNR3 is about 30% of NR3's, and of 8-thread pNR3 about 20%, so our algorithm is effective even taking some parallelization overhead into account.
Now I will conclude.
We have proposed the algorithm Parallel Non-Redundant Recurrent Rules Miner, pNR3.
It utilizes two strategies: the single-producer multiple-consumer framework and loop-level parallelism.
We showed the effectiveness of our algorithm through experiments on synthetic and real datasets.
As future work, we will run experiments on program traces, the intended application of the rules.
We will also experiment on a many-core processor to measure the effects more accurately.
Also, with larger memory, we will compare our algorithm to BOB, the successor of NR3.
We are now working on improvement of the sequential recurrent rule mining algorithms.
Our implementation is available in this repository.
This is all of my presentation.
Thank you for listening.
Do you have any questions?