Distributed stream consistency checking

Shen Gao, Daniele Dell’Aglio, Jeff Z. Pan and Abraham Bernstein
Cáceres, Spain, 08.06.2018
Carlo Bernaschina (presenter)

Problem setting
ICWE, 08.06.2018
 Real-time processing of huge volumes of dynamic data
 Smart cities
 News
 Knowledge graph

The problem of noise
 Streaming data are often noisy
 Broken sensors
 Malicious data injection
 Measurement errors
 How to cope with noise?
 Machine learning and numerical analyses to cope with noise in time series
 When streams are complex (as Web streams are), we want to ensure that they comply with a (non-trivial) conceptual model

Research question
How to assess the consistency of streams w.r.t. a conceptual model that is fixed and known a priori?

Towards a solution
 How to model the stream consistency check problem?

How to model the conceptual model?
 DL-Lite_core
 The set of PIs and NIs composes a TBox T
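[Figure: class hierarchy with the nodes Person, Student, Employee, Faculty, Admin and PhD student. Subclass edges are labelled Positive Inclusion (PI); the disjointness (DJ) between Person and Organization is labelled Negative Inclusion (NI).]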
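
A minimal sketch of how such a TBox could be held in memory, assuming only atomic classes (DL-Lite_core also allows role restrictions, which are left out here). The class name `TBox`, the helper `superclasses` and the exact subclass edges are illustrative assumptions; the slide only shows the class names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TBox:
    """Simplified TBox over atomic classes only (an assumption of this sketch)."""
    pis: frozenset  # positive inclusions: (sub, sup) reads "sub is a kind of sup"
    nis: frozenset  # negative inclusions: (a, b) reads "a and b are disjoint"


# Hypothetical axioms built from the class names on the slide.
tbox = TBox(
    pis=frozenset({
        ("Student", "Person"),
        ("Employee", "Person"),
        ("PhD student", "Student"),
    }),
    nis=frozenset({("Person", "Organization")}),
)


def superclasses(tbox, cls):
    """Return cls together with every class reachable from it through PIs."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for sub, sup in tbox.pis:
            if sub == current and sup not in seen:
                seen.add(sup)
                todo.append(sup)
    return seen


print(sorted(superclasses(tbox, "PhD student")))
```

Following the PIs upwards is what later allows a disjointness stated on Person to be applied to a PhD student.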

How to model the data?
 ABox axioms associate:
 Individuals to classes
 Shen is a PhD student
 University of Zurich is a University
 Individuals to other individuals
 Shen attends the University of Zurich
 Inconsistencies arise when the ontology (TBox + ABox) contains contradictions
 Daniele is a PhD student
 Daniele is a University
 PhD student and University are disjoint
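
The contradiction on this slide can be sketched as a lookup of disjoint class pairs per individual. The function name `find_clashes` and the tuple encoding of class assertions are assumptions made for illustration; role assertions such as "attends" are ignored.

```python
from collections import defaultdict


def find_clashes(abox, disjoint_pairs):
    """Return (individual, class_a, class_b) for every violated disjointness.

    abox is an iterable of (individual, class) assertions.
    """
    classes_of = defaultdict(set)
    for individual, cls in abox:
        classes_of[individual].add(cls)
    return [
        (individual, a, b)
        for individual in sorted(classes_of)
        for a, b in sorted(disjoint_pairs)
        if a in classes_of[individual] and b in classes_of[individual]
    ]


abox = [
    ("Shen", "PhD student"),
    ("University of Zurich", "University"),
    ("Daniele", "PhD student"),
    ("Daniele", "University"),
]
disjoint_pairs = {("PhD student", "University")}

print(find_clashes(abox, disjoint_pairs))
```

Only Daniele is reported, because he is the only individual asserted to belong to both disjoint classes.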

How to model the data stream?
 Ontology stream
 One static TBox
 A sequence of time-annotated ABoxes with the updates
 Sliding window over the ontology stream
 Captures a recent set of events
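[Figure: ontology stream on a timeline t, with A1 = { Shen is a PhD student } at t = 1, A3 = { Jeff is an Employee, Daniele is a Student } at t = 3 and A5 = { Avi is a PhD student } at t = 5. The static TBox shows Person (with Student, Employee, Faculty, Admin, PhD student) disjoint (DJ) from Organization (with University, High school).]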
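
A sliding window over time-annotated ABoxes can be sketched as follows. This is an illustration under assumptions: the window is time-based and covers the interval (t - width, t], the function name `sliding_windows` is hypothetical, and the pairing of individuals with classes is one reading of the figure.

```python
from collections import deque


def sliding_windows(stream, width):
    """Yield (t, merged ABox) each time a time-annotated ABox arrives.

    The window keeps the ABoxes whose timestamp lies in (t - width, t].
    """
    window = deque()
    for t, abox in stream:
        window.append((t, abox))
        # Evict ABoxes that have fallen out of the window.
        while window[0][0] <= t - width:
            window.popleft()
        yield t, set().union(*(a for _, a in window))


stream = [
    (1, {("Shen", "PhD student")}),
    (3, {("Jeff", "Employee"), ("Daniele", "Student")}),
    (5, {("Avi", "PhD student")}),
]

for t, content in sliding_windows(stream, width=3):
    print(t, sorted(content))
```

With a width of 3, the assertion about Shen is still visible at t = 3 and is dropped when the ABox at t = 5 arrives.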

The stream consistency check problem
 Given an ontology stream, we want to check if it is consistent w.r.t. a sliding window of a fixed size
 At each time instant, we want to check if the events captured by the sliding window are consistent
 The TBox and the current window content compose an ontology
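[Figure: ontology stream on a timeline t, with A1 = { Shen is a PhD student } at t = 1, A3 = { Jeff is a University, Daniele is a Student } at t = 3 and A5 = { Jeff is a PhD student } at t = 5. The TBox shows Person (with Student, Employee) disjoint (DJ) from Organization (with University, High school).]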
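
The check "at each time instant" can be sketched as one verdict per arriving ABox. The sketch assumes a time-based window covering (t - width, t] and that the pair (PhD student, University) is among the disjointness axioms entailed by the TBox; `check_stream` is a hypothetical name, and the data is one reading of the figure.

```python
def check_stream(stream, width, disjoint_pairs):
    """Return a list of (t, is_consistent) verdicts, one per arriving ABox."""
    recent = []    # (timestamp, ABox) pairs currently inside the window
    verdicts = []
    for t, abox in stream:
        recent = [(s, a) for s, a in recent if s > t - width] + [(t, abox)]
        classes_of = {}
        for _, a in recent:
            for individual, cls in a:
                classes_of.setdefault(individual, set()).add(cls)
        consistent = not any(
            a in classes and b in classes
            for classes in classes_of.values()
            for a, b in disjoint_pairs
        )
        verdicts.append((t, consistent))
    return verdicts


stream = [
    (1, {("Shen", "PhD student")}),
    (3, {("Jeff", "University"), ("Daniele", "Student")}),
    (5, {("Jeff", "PhD student")}),
]
disjoint_pairs = {("PhD student", "University")}

print(check_stream(stream, width=3, disjoint_pairs=disjoint_pairs))
print(check_stream(stream, width=2, disjoint_pairs=disjoint_pairs))
```

The two calls differ only in the window width: with width 3 the two assertions about Jeff meet in the same window at t = 5, while with width 2 the earlier one has already expired.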

Towards a solution
 Description logics, ontology streams
 How to cope with a huge amount of streaming data?

Scalability
How to cope with the problem when the data volume is large?
 Sliding windows
 The content of the window may still be too large to be processed online
 Distribution of the stream consistency checking process
 We build our solution on top of a Distributed Stream Processing Engine (DSPE)
 We adopt the Storm terminology to introduce the main concepts, but they are common to other DSPEs

DSPE concepts
[Figure: a logical topology made of a spout (S) and bolts (B1, B2) that exchange tuples, and a physical topology in which instances of S, B1 and B2 are deployed across Node 1, Node 2 and Node 3.]
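
The logical topology S, B1, B2 can be imitated in a single process with generators. This is only a sketch of the concepts; it is not the Storm or Heron API, and `spout`, `bolt` and `run_topology` are hypothetical names. Distribution over nodes (the physical topology) is not modelled.

```python
def spout(events):
    """Source of the topology: emits one tuple per input event."""
    for event in events:
        yield event


def bolt(process, upstream):
    """Processing step: applies process to each tuple and emits its outputs."""
    for tup in upstream:
        yield from process(tup)


def run_topology(events, *stages):
    """Chain a spout and a sequence of bolts, then drain the result."""
    current = spout(events)
    for stage in stages:
        current = bolt(stage, current)
    return list(current)


def b1(tup):
    # Illustrative bolt: adds a derived tuple for students.
    individual, cls = tup
    return [tup, (individual, "Person")] if cls == "Student" else [tup]


def b2(tup):
    # Illustrative bolt: keeps only Person tuples.
    return [tup] if tup[1] == "Person" else []


events = [("Daniele", "Student"), ("University of Zurich", "University")]
print(run_topology(events, b1, b2))
```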

Towards a solution
 Distributed stream processing engines
 How to perform stream consistency checking over DSPE?

The NI closure
 Given the TBox T, it is possible to compute all the Negative Inclusion axioms that follow from it
 The set of all these NI axioms is named NI closure
[Figure: class hierarchy with Person (subclasses Student, Employee) disjoint (DJ) from Organization (subclasses University, Company).]
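
For atomic classes, the NI closure can be sketched by pushing each stated disjointness down to all pairs of subclasses. The restriction to atomic classes and the names `subclasses` and `ni_closure` are assumptions of this sketch; the hierarchy is the one in the figure.

```python
def subclasses(cls, pis):
    """Return cls together with every class that reaches it through PIs."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for sub, sup in pis:
            if sup == current and sub not in seen:
                seen.add(sub)
                todo.append(sub)
    return seen


def ni_closure(pis, nis):
    """Derive DJ(A', B') for every stated DJ(A, B), A' under A and B' under B."""
    closure = set()
    for a, b in nis:
        for sub_a in subclasses(a, pis):
            for sub_b in subclasses(b, pis):
                closure.add(frozenset((sub_a, sub_b)))
    return closure


pis = {
    ("Student", "Person"),
    ("Employee", "Person"),
    ("University", "Organization"),
    ("Company", "Organization"),
}
nis = {("Person", "Organization")}

closure = ni_closure(pis, nis)
print(len(closure))
```

For this hierarchy, one stated NI expands to 3 x 3 = 9 NIs in the closure.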

The NIs Topology Method (NTM)
 The resulting topology is the following
[Figure: topology with a spout S feeding a single bolt B1.]
 A bolt checks the incoming data against the disjointness axioms in the NI closure
 Each axiom is encoded as a conjunction operation
[Figure: two conjunction operations. o1: "Daniele is a Person" and "Daniele is a University" yield Inconsistency. o2: "Shen is a Company" and "Shen is a Student" yield Inconsistency.]
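
A single checking bolt of this kind can be sketched as a stateful object that tests each new assertion against the classes already seen for the same individual. The class name `NTMBolt` is hypothetical, the closure is passed in as plain pairs, and window eviction is left out to keep the sketch short.

```python
class NTMBolt:
    """Checks class assertions against a precomputed NI closure."""

    def __init__(self, closure):
        # Map each class to the classes it is disjoint with.
        self.partners = {}
        for a, b in closure:
            self.partners.setdefault(a, set()).add(b)
            self.partners.setdefault(b, set()).add(a)
        self.seen = {}  # individual -> classes asserted so far

    def process(self, individual, cls):
        """Return one (individual, cls, other) triple per violated NI."""
        known = self.seen.setdefault(individual, set())
        clashes = sorted(known & self.partners.get(cls, set()))
        known.add(cls)
        return [(individual, cls, other) for other in clashes]


bolt = NTMBolt({("Person", "University"), ("Student", "Company")})
stream = [
    ("Daniele", "Person"),
    ("Daniele", "University"),
    ("Shen", "Company"),
    ("Shen", "Student"),
]
reports = [bolt.process(individual, cls) for individual, cls in stream]
print(reports)
```

Each conjunction fires on the second assertion of the pair, which mirrors the operations o1 and o2 in the figure.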

Improving NTM
 Drawback of NTM
 The NI closure size can be exponential in the size of the TBox
 The bolt B1 becomes the bottleneck of the topology
 Introduction of inference operations to reduce the number of conjunction operations
[Figure: an inference operation o derives "Daniele is a Person" from "Daniele is a Student"; topology S, B1.]

Improving NTM - intuition
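[Figure: class hierarchy with Person (subclasses Student, Employee) disjoint (DJ) from Organization (subclasses University, Company).]
Topology S, B1: 9 NIs
Topology S, B1, B2: 1 NI, with the inference rules
Student -> Person
Employee -> Person
Company -> Organization
University -> Organization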
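
The intuition can be sketched as two steps: an inference step that applies the four rules, followed by a check of the single remaining NI. The split into `b1_infer` and `b2_check` and the choice of DJ(Person, Organization) as that NI are assumptions based on the figure.

```python
RULES = {
    "Student": "Person",
    "Employee": "Person",
    "Company": "Organization",
    "University": "Organization",
}


def b1_infer(assertion):
    """Inference step: emit the assertion plus everything the rules derive."""
    individual, cls = assertion
    derived = [(individual, cls)]
    while cls in RULES:
        cls = RULES[cls]
        derived.append((individual, cls))
    return derived


def b2_check(assertions, ni=("Person", "Organization")):
    """Checking step: report individuals that violate the single NI."""
    seen = {}
    clashing = []
    for individual, cls in assertions:
        classes = seen.setdefault(individual, set())
        classes.add(cls)
        if ni[0] in classes and ni[1] in classes and individual not in clashing:
            clashing.append(individual)
    return clashing


stream = [
    ("Daniele", "Student"),
    ("Daniele", "University"),
    ("Shen", "Company"),
    ("Shen", "Student"),
]
derived = [fact for assertion in stream for fact in b1_infer(assertion)]
print(b2_check(derived))
```

Both individuals are still detected although only one NI is checked, because the inference step lifts every assertion to Person or Organization first.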

The Pipeline Topology Method (LN)
 Computes the NI closure
DJ(Person,Publication)
DJ(Student,Publication)
DJ(Student,Employee)
DJ(Article,Student)
DJ(Person,Organization)
...
 Identifies the essential NIs
DJ(Person,Publication)
DJ(Student,Employee)
DJ(Person,Organization)
...
 Groups and orders the essential NIs
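
One plausible reading of "essential" is that an NI is kept only if no more general NI in the closure already implies it through the PIs. The sketch below implements that reading; it is an assumption, not necessarily the authors' definition. The PIs are hypothetical and chosen so that the example on the slide is reproduced, and the hierarchy is assumed to be acyclic.

```python
def ancestors(cls, pis):
    """Return cls together with every class reachable from it through PIs."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for sub, sup in pis:
            if sub == current and sup not in seen:
                seen.add(sup)
                todo.append(sup)
    return seen


def essential_nis(closure, pis):
    """Keep the NIs that are not implied by a more general NI in the closure."""
    normalized = {tuple(sorted(ni)) for ni in closure}

    def covers(general, specific):
        up_a = ancestors(specific[0], pis)
        up_b = ancestors(specific[1], pis)
        return (general[0] in up_a and general[1] in up_b) or (
            general[0] in up_b and general[1] in up_a
        )

    return {
        ni
        for ni in normalized
        if not any(other != ni and covers(other, ni) for other in normalized)
    }


pis = {("Student", "Person"), ("Employee", "Person"), ("Article", "Publication")}
closure = {
    ("Person", "Publication"),
    ("Student", "Publication"),
    ("Student", "Employee"),
    ("Article", "Student"),
    ("Person", "Organization"),
}
print(sorted(essential_nis(closure, pis)))
```

Under these assumptions DJ(Student,Publication) and DJ(Article,Student) are dropped because DJ(Person,Publication) already implies them, which leaves the three NIs listed on the slide (with the class names inside each pair sorted alphabetically).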

The Pipeline Topology Method (LN) cont’d
Groups are assigned to bolts
This step has a major impact on performance!
Fewer NIs than NTM

Towards a solution
 Distributed stream processing engines
 How to perform stream consistency checking over DSPE?
 NTM, LN
 How do they perform?

Setup
 Ontologies
 LUBM
 56 PIs, 70 NIs
 NPD
 332 PIs, 51 NIs
 Six machines
 128 GB RAM
 2 E5-2680 v2 processors (10 cores per processor)
 Twitter Heron 0.14.3

Comparing NTM and LN
[Figure: topology S, B1, B2 and results chart, with the callouts below.]
LN-x: x NI groups
Half of the nodes assigned to check consistency
Similar results
LN-2 outperforms NTM by up to 139%
The load on the first node increases

Investigating the results
[Figure: results chart with data points labelled LN and NTM; values not recoverable.]

Conclusions
 It is possible to perform consistency checking over high volumes of data streams
 We developed two methods (NTM and LN) and studied their performance
 More than 14 million tuples/minute
 LN can outperform NTM by up to 300%
 What’s next
 Towards more expressive ontological languages
 Repairing inconsistencies
 Implementation and testing over other DSPEs

Thank you! Questions?