On web stream processing

Department of Informatics
Daniele Dell’Aglio
dellaglio@ifi.uzh.ch http://dellaglio.org @dandellaglio
Linköping, 22.11.2017

RDF Stream Processing
• Stream Processing: real-time processing of highly dynamic data
• RDF & SPARQL: Semantic Web technologies for data exchange through the Web
• RDF Stream Processing (RSP): Stream Processing combined with RDF & SPARQL

Finding agreements
Many topics
– RDF streams
– Stream reasoning
– Complex event processing
– Stream query processing
– Internet/web of things
Many studies
– Data models
– Query models
– Prototypes
– Benchmarks
– Datasets
W3C RSP community group (2013 – 2016)
– Effort to (discuss | formalise | standardise | combine | evangelise) the
existing studies on RSP
– Outcomes
– Abstract model for RDF streams
– Requirements document for query languages of RDF streams
– More at: https://www.w3.org/community/rsp/

But...
W3C RSP sets some foundations and requirements, but:
– Standard protocols and exchange mechanisms for RDF streams are still missing
– We need generic and flexible solutions for making RDF streams available and exchangeable on the Web

The goal: a decentralized web of RSPs
[Figure: a network of RSP engines: Morph Streams, CSPARQL, TrOWL, Stream Rule, CQELS, Instans]
Q1: How can we let RSP engines interact and exchange streams on the web?

The goal: a decentralized web of RSPs in the web
[Figure: a network of RSP engines (Morph Streams, CSPARQL, Stream Rule, Instans, CQELS, TrOWL) connected to SPARQL endpoints]
Q2: How to integrate stream processing with background knowledge exposed remotely on the web?

EXCHANGING STREAMS ON THE WEB

How far are we?
Documents from RSP
– Abstract model of RDF Stream
– Requirements for query languages for RDF Stream
Protocols to exchange data streams on the web and
internet
– WebSocket, MQTT
Description of the stream
– SSN
Interfaces to control RSP engines

Requirements
A framework for RDF stream exchange should
1. prioritize active paradigms for data stream exchange
2. enable the combination of streaming and stored data
3. enable the possibility to build reliable, distributed and
scalable streaming applications
4. guarantee a wide range of operations over the streams
5. support the publication of information about the
stream
6. support the exchange of a wide variety of streams
7. exploit as much as possible existing protocols and
standards

WeSP
A framework to publish and exchange RDF streams on the
web
• A model to serialise RDF streams
• A model to describe RDF streams
• A communication protocol

A model to serialise RDF streams
An RDF stream can be represented as an (infinite) ordered sequence of time-annotated data items (RDF graphs), serialized in JSON-LD:
[{ "@id": "http://.../G1",
   "prov:generatedAt": "2016-12-16T00:01:00",
   "@graph": [
     { "@id": "http://.../a",
       "http://.../isIn": {"@id": "http://.../rRoom"} }
   ]
 },{ "@id": "http://.../G2",
   "prov:generatedAt": "2016-12-16T00:03:00",
   "@graph": [
     { "@id": "http://.../b",
       "http://.../isIn": {"@id": "http://.../bRoom"} }
   ]
 },…
Compliant with RDF, as well as with the W3C RSP abstract data model
[Figure: stream S on the time axis t, with three items]
• G1 = {:a :isIn :rRoom} at t = 1
• G2 = {:b :isIn :bRoom} at t = 3
• G3 = {:c :talksIn :rRoom, :d :talksIn :bRoom} at t = 5
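The serialisation model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `stream_item`, the `http://example.org/` namespace and the timestamps are hypothetical, and the property name `prov:generatedAt` is simply copied from the slide.

```python
import json

EX = "http://example.org/"  # hypothetical namespace, not from the slides


def stream_item(graph_id, timestamp, triples):
    """Build one stream item: a named graph plus a timestamp on the graph name."""
    nodes = {}
    for subject, predicate, obj in triples:
        # Group the triples by subject, as JSON-LD node objects do.
        node = nodes.setdefault(subject, {"@id": subject})
        node.setdefault(predicate, []).append({"@id": obj})
    return {
        "@id": graph_id,
        "prov:generatedAt": timestamp,  # property name as written on the slide
        "@graph": list(nodes.values()),
    }


# Two time-annotated graphs, ordered by timestamp (values are made up).
items = [
    stream_item(EX + "G1", "2017-01-01T00:01:00",
                [(EX + "a", EX + "isIn", EX + "rRoom")]),
    stream_item(EX + "G2", "2017-01-01T00:03:00",
                [(EX + "b", EX + "isIn", EX + "bRoom")]),
]
serialized = json.dumps(items, indent=2)
```

Each element of `items` is one data item of the stream; a real stream would emit them one at a time instead of collecting them in a list.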

A model to describe RDF streams
A description of the RDF stream should be provided
• The identifier of the stream
• A description of the schema of the stream items
• Data item samples
• The location of the stream endpoint (e.g. WebSocket
URL)
This description is provided through the RDF Stream
Descriptor
• Serialised in RDF
• An extension of DCAT and SPARQL Service Description
• Published according to the linked data principles
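As a rough illustration, the four elements listed above could be held in a small JSON-LD-like record and checked for completeness. The keys `ex:schema` and `ex:sample` are hypothetical placeholders, `dcat:accessURL` is borrowed from DCAT (whether the RDF Stream Descriptor uses it for the endpoint is an assumption), and all URLs are invented.

```python
# Hypothetical keys for the four elements a descriptor should carry.
REQUIRED_KEYS = ("@id", "ex:schema", "ex:sample", "dcat:accessURL")


def missing_elements(descriptor):
    """Return the required descriptor elements that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not descriptor.get(key)]


descriptor = {
    "@id": "http://example.org/streams/rooms",           # stream identifier
    "ex:schema": "Each item is a graph of isIn statements",  # schema description
    "ex:sample": [{"@id": "http://example.org/G1"}],      # data item samples
    "dcat:accessURL": "ws://example.org/streams/rooms",   # stream endpoint
}
```

A consumer could call `missing_elements(descriptor)` before subscribing and refuse descriptors that lack, for example, the endpoint location.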

A communication protocol
Two interfaces
• Producer
• Consumer
We distinguish three types of actors (depending on the
implemented interfaces)
Actor               Producer  Consumer
Stream source       yes       no
Stream transformer  yes       yes
Stream sink         no        yes
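The two interfaces and the three actor types can be sketched as plain Python classes. The class and method names (`subscribe`, `emit`, `on_item`) are assumptions made for illustration and are not taken from WeSP.

```python
class Producer:
    """Producer interface: lets others subscribe to the items it emits."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def emit(self, item):
        for callback in self._subscribers:
            callback(item)


class Consumer:
    """Consumer interface: receives stream items."""

    def on_item(self, item):
        raise NotImplementedError


class StreamSource(Producer):
    """Implements only the Producer interface."""


class StreamSink(Consumer):
    """Implements only the Consumer interface."""

    def __init__(self):
        self.received = []

    def on_item(self, item):
        self.received.append(item)


class StreamTransformer(Producer, Consumer):
    """Implements both interfaces: consumes items and emits transformed ones."""

    def __init__(self, transform):
        super().__init__()
        self._transform = transform

    def on_item(self, item):
        self.emit(self._transform(item))


# Wire a source to a sink through a transformer that tags each item.
source = StreamSource()
transformer = StreamTransformer(lambda item: {**item, "ex:seen": True})
sink = StreamSink()
source.subscribe(transformer.on_item)
transformer.subscribe(sink.on_item)
source.emit({"@id": "http://example.org/G1"})
```

An RSP engine such as a continuous query processor would play the transformer role here, since it consumes one stream and produces another.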

A communication protocol: push-based streams
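Actors: Consumer; Producer (Stream Descriptor endpoint, RDF stream endpoint)
1. Consumer → Stream Descriptor endpoint: get stream descriptor (SD)
2. Stream Descriptor endpoint → Consumer: SD
3. Consumer: process SD
4. Consumer → RDF stream endpoint: subscribe to stream
5. RDF stream endpoint → Consumer: stream item, stream item, stream item, …
6. Consumer: process stream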
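The push-based exchange can be simulated in memory. This is a sketch and not the WeSP implementation: `PushProducer`, its methods and the URLs are hypothetical, and the `dcat:accessURL` key for the stream endpoint is an assumption.

```python
class PushProducer:
    """In-memory stand-in for a producer with a descriptor and a stream endpoint."""

    def __init__(self, stream_id, stream_endpoint):
        self._descriptor = {"@id": stream_id, "dcat:accessURL": stream_endpoint}
        self._subscribers = []

    def get_descriptor(self):
        # Answers the "get stream descriptor" request.
        return dict(self._descriptor)

    def subscribe(self, endpoint, callback):
        # The consumer subscribes once; items are then pushed to it.
        if endpoint != self._descriptor["dcat:accessURL"]:
            raise ValueError(f"unknown stream endpoint: {endpoint}")
        self._subscribers.append(callback)

    def publish(self, item):
        for callback in self._subscribers:
            callback(item)


producer = PushProducer("http://example.org/streams/rooms",
                        "ws://example.org/streams/rooms")

# Consumer side: fetch and process the descriptor, then subscribe.
received = []
sd = producer.get_descriptor()
producer.subscribe(sd["dcat:accessURL"], received.append)

# Producer side: every new item is pushed to all subscribers.
for n in (1, 2, 3):
    producer.publish({"@id": f"http://example.org/G{n}"})
```

In a real deployment the callback would be replaced by a WebSocket or MQTT subscription, but the order of the steps stays the same.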

A communication protocol: pull-based streams
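Actors: Consumer; Producer (Stream Descriptor endpoint, RDF stream endpoint)
1. Consumer → Stream Descriptor endpoint: get stream descriptor (SD)
2. Stream Descriptor endpoint → Consumer: SD
3. Consumer: process SD
4. Consumer → RDF stream endpoint: GET items
5. RDF stream endpoint → Consumer: stream items
6. Consumer: process stream
7. The consumer repeats the GET items request to receive further stream items, …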
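The pull-based exchange differs only in who initiates each transfer: the consumer asks repeatedly for new items. A minimal in-memory sketch follows; the class `PullProducer`, the cursor-based `get_items` call and the URLs are hypothetical and only illustrate the request pattern.

```python
class PullProducer:
    """In-memory stand-in for a producer whose stream is read by polling."""

    def __init__(self, stream_id, stream_endpoint):
        self._descriptor = {"@id": stream_id, "dcat:accessURL": stream_endpoint}
        self._items = []

    def get_descriptor(self):
        return dict(self._descriptor)

    def append(self, item):
        self._items.append(item)

    def get_items(self, endpoint, cursor=0):
        """Return the items published since `cursor` and the next cursor."""
        if endpoint != self._descriptor["dcat:accessURL"]:
            raise ValueError(f"unknown stream endpoint: {endpoint}")
        batch = self._items[cursor:]
        return batch, cursor + len(batch)


producer = PullProducer("http://example.org/streams/rooms",
                        "http://example.org/streams/rooms/items")
sd = producer.get_descriptor()
endpoint = sd["dcat:accessURL"]

producer.append({"@id": "http://example.org/G1"})
first_batch, cursor = producer.get_items(endpoint)           # first GET items

producer.append({"@id": "http://example.org/G2"})
producer.append({"@id": "http://example.org/G3"})
second_batch, cursor = producer.get_items(endpoint, cursor)  # second GET items
third_batch, cursor = producer.get_items(endpoint, cursor)   # nothing new yet
```

The cursor is one possible way to avoid receiving the same items twice; the slides do not say how repeated requests are told apart.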

Protocols
The RDF Stream Descriptor is accessible through HTTP
The transmission of the stream can happen through
different protocols
• HTTP chunked encoding
• WebSocket
• Message Queuing Telemetry Transport (MQTT)
• Server-Sent Events (SSE)
• HTTP
• ...

WeSP: Proofs of concept
C-SPARQL
• Stream transformer
• WeSP implemented as a wrapper
• https://github.com/dellaglio/csparql-wesp
CQELS
• Stream transformer
• Native implementation of WeSP
• https://github.com/cqels/CQELS-1.x/
TripleWave
• Stream source
• Native implementation of WeSP
• http://streamreasoning.github.io/TripleWave

TripleWave
TripleWave is open source
• Learn more at: https://streamreasoning.github.io/TripleWave/
[Figure: TripleWave takes an input (left open on this slide) and outputs RDF streams (WebSocket | HTTP chunked | etc.) together with a Stream Descriptor]

Feeding TripleWave
TripleWave supports a variety of data sources:
• RDF dumps with temporal information
• RDF with temporal information exposed through SPARQL endpoints
• Streams available on the Web
[Figure: TripleWave architecture. Inputs: Web API, Web Service, SPARQL Endpoint, File. Components: Connector stream, Transform Stream (R2RML Mapping), Datagen stream, Scheduler stream, Graph stream. Modes: Conversion, Replay, Replay loop]

Summary
WeSP: framework to exchange RDF streams on the web
– RDF to serialise the stream items
– RDF to describe the stream
– Application and communication protocols: HTTP,
WebSocket, MQTT, etc.
– Interfaces to produce and consume RDF streams
What’s next?
– Relation with other technologies: LDN, Activity Streams,
etc.
– Adoption
– Federated stream processing over the Web

COMBINING STREAMS AND BACKGROUND DATA

The goal: a decentralized web of RSPs in the web
[Figure: a network of RSP engines (Morph Streams, CSPARQL, Stream Rule, Instans, CQELS, TrOWL) connected to SPARQL endpoints]
Q2: How to integrate stream processing with background knowledge exposed remotely on the web?

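Time-based sliding window
W(ω,β): a window of width ω that slides by β over the stream S
[Figure: stream S with items S1 to S12 on the time axis t; a window of width ω moves forward by slide β, and the query is evaluated on the content of each window]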
The setting
[Figure: an RDF stream generator feeds a Window, whose content is joined with background data (SPARQL endpoint)]
Background data is stored on the web and changes over time
Accessing background data is costly
Is it possible to avoid continuous access to the background data?

Local view
[Figure: an RDF stream generator feeds a Window, whose content is joined with a local view of the background data (SPARQL endpoint)]
How to cope with changes in the background data?

Maintenance process
Maintenance introduces a trade-off between response quality and time.
We propose to manage this trade-off by fixing the time dimension based on the query constraints and maximizing the freshness of the response.
[Figure: an RDF stream generator feeds a Window, whose content is joined with the Local View; a maintenance process refreshes the Local View from the background data (SPARQL endpoint)]

How to track background data changes?
Update streams
• stream with changes available to the query processor
• rarely available on the Web, e.g. Wikipedia,
SPARQLPush
Data changes regularly
• data generated by automatic processes that refresh it
periodically
• data warehouses, sensors
Data changes “randomly”
• Twitter user profiles, taxi status, financial updates

Requirements
The maintenance process:
1. should take into account the change rates of the data
elements in the background data;
2. should consider the dynamicity of the change rate
values;
3. should satisfy the Quality of Service constraints on
responsiveness and freshness of the answer;
4. may consider the query and its definition.

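A query-driven maintenance process
Query pattern: WINDOW(S, ω, β) P_W JOIN SERVICE(BKG) P_S
[Figure: the WINDOW clause and the SERVICE clause are combined by a JOIN that produces Ω_join; the maintenance process has four numbered steps involving the Proposer, the Ranker, the Maintainer and the Local View; policies shown: WSJ, RND, LRU, WBM, SBM, IBM; sets labelled C and E]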

Terminology
[Figure: timeline t with windows W1 to W4 and the current time τ; legend: mappings from the WINDOW clause, mappings in the LOCAL VIEW, compatible mappings]
Best before time: the time at which an element becomes stale. [Defining formula shown as an image on the slide; not preserved]

WSJ
[Figure: timeline t with windows W1 to W4 and the current time τ]
WSJ identifies the candidate set: the possibly stale local view mappings involved in the current evaluation.
WSJ analyzes the content of the current window evaluation and identifies the compatible mappings in the local view.
The possibly stale mappings are identified by analyzing the associated best before time.
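The candidate set selection can be sketched as a filter over the local view. The data layout below (dictionaries with a `best_before` field), the join variable `user` and the staleness test `best_before <= now` are assumptions made for illustration; the slides do not give the exact definitions.

```python
def wsj_candidate_set(window_mappings, local_view, now, join_var="user"):
    """Return the local view entries that are compatible and possibly stale.

    An entry is compatible if it shares the join variable value with a
    mapping in the current window, and possibly stale if its best before
    time has passed (assumed test: best_before <= now).
    """
    join_values = {mapping[join_var] for mapping in window_mappings}
    return [
        entry for entry in local_view
        if entry["mapping"][join_var] in join_values
        and entry["best_before"] <= now
    ]


# Hypothetical content of the current window and of the local view.
window_mappings = [{"user": "alice"}, {"user": "bob"}]
local_view = [
    {"mapping": {"user": "alice", "followers": 10}, "best_before": 7},
    {"mapping": {"user": "bob", "followers": 25}, "best_before": 12},
    {"mapping": {"user": "carol", "followers": 3}, "best_before": 5},
]
candidates = wsj_candidate_set(window_mappings, local_view, now=8)
```

Here only the entry for `alice` is a candidate: `bob` is compatible but still fresh, and `carol` is stale but not involved in the current window.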

WBM
[Figure: timeline t with windows W1 to W4 and the current time τ; table with columns V, L and Score]
WBM ranks the candidate set to determine which mappings to update.
The ranking is computed from two values: the renewed best before time and the remaining life time.
The top k elements are selected to be refreshed.
The value of k is chosen according to the responsiveness constraint.

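WBM: renewed best before time
[Figure: timeline t with windows W1 to W4 and the current time τ; table with columns V, L and Score, where V = 3, 4 and 1 for the three candidate mappings]
When would the mappings become stale if refreshed now?
The renewed best before time V is computed as: [formula shown as an image on the slide; not preserved]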

WBM: remaining life time and score
[Figure: timeline t with windows W1 to W4 and the current time τ]
In how many future evaluations is the mapping involved?
The remaining life time L is computed as: [formula shown as an image on the slide; not preserved]
WBM ranks the mappings by using a score:
Score = min(L, V)
Candidate  V  L  Score
first      3  3  3
second     4  1  1
third      1  3  1
The top-ranked mapping is selected for maintenance.
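The scoring step can be sketched directly from Score = min(L, V). The V and L values below are the ones shown on the slide; the identifiers `m1` to `m3`, the descending order of the ranking and the tie-breaking by input order are assumptions.

```python
def wbm_select(candidates, k):
    """Rank candidates by min(L, V), highest first, and return the top k.

    Each candidate is a dict with V (renewed best before time) and
    L (remaining life time), both given as inputs here because the
    slides do not preserve their formulas.
    """
    ranked = sorted(candidates, key=lambda c: min(c["L"], c["V"]), reverse=True)
    return ranked[:k]


# V and L as in the slide's table; the ids are hypothetical labels.
candidates = [
    {"id": "m1", "V": 3, "L": 3},
    {"id": "m2", "V": 4, "L": 1},
    {"id": "m3", "V": 1, "L": 3},
]
scores = {c["id"]: min(c["L"], c["V"]) for c in candidates}
selected = wbm_select(candidates, k=1)
```

Taking the minimum means a refresh is only valued for as long as the mapping both stays fresh (V) and keeps being used (L), so `m2` gains little from its long freshness because it leaves the window after one more evaluation.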

Experiments

Extensions: SBM
[Figure: timeline t with windows W1 to W4 and the current time τ]
It exploits the fact that mappings may have n-n relations
• Each pair generates a join (e.g. BGP)
If one candidate is refreshed, there will be four fresh mappings; if the other is refreshed, there will be five fresh mappings.
The candidate that yields more fresh mappings is selected for maintenance.

Extensions: SBM
[Figure: timeline t with windows W1 to W4 and the current time τ]
It exploits the fact that mappings may have n-n relations
• A result is fresh if all the pairs are fresh (e.g. aggregations)
If one candidate is refreshed, there will be one fresh mapping; if the other is refreshed, there will be two fresh mappings.
The candidate that yields more fresh mappings is selected for maintenance.

Other extensions
We developed other rankers:
IBM: combines WBM and SBM, taking into account both
the number of produced join mappings in the present
and in future windows
FBA: dynamic allocations of the refresh operations among
different evaluations
F rankers: extensions of the presented rankers to cope
with queries with FILTER clauses on the subquery over
the background data

Summary
We proposed using materialization to optimize the processing of continuous queries.
We proposed a policy that maximizes freshness under the time constraints of a continuous query.
We tested our policy against baseline policies (LRU and Random).
Future Work:
– Measuring the time overhead of maintenance
– Investigating more queries involving both remote
SPARQL endpoints and streams.
– Dynamically estimating the change rate of users.

Acknowledgments

Conclusions
RDF (or semantic) streams are gaining momentum
• Several active research groups, working on querying and
reasoning
• Prototypes, methods and applications
• Query languages, ontologies
• Use cases
However, the web dimension has received only limited attention

What’s next?
We still need
• Infrastructures and standards to exchange (RDF)
streams on the Web
• Agreements on languages to specify tasks over such
streams
• Query languages richer than SPARQL not only to manage
streams, but also to express higher-level operations
• Methods to manage reasoning tasks over streams
The Web dimension needs to be studied and understood
• Combination of remote streams and background data
requires new solutions
• Not only queries, but also constraints over them (QoS)

Thank you! Questions?
On web stream processing
Daniele Dell’Aglio
dellaglio@ifi.uzh.ch
http://dellaglio.org
@dandellaglio

Find more: Q1
• A. Mauri, J.-P. Calbimonte, D. Dell’Aglio, M. Balduini, E. Della Valle,
K. Aberer: Where Are the RDF Streams?: On Deploying RDF
Streams on the Web of Data with TripleWave. Poster at
International Semantic Web Conference 2015.
• A. Mauri, J.-P. Calbimonte, D. Dell’Aglio, M. Balduini, M. Brambilla,
E. Della Valle, K. Aberer: TripleWave: Spreading RDF Streams on
the Web. Resource Paper at International Semantic Web
Conference 2016.
• D. Dell'Aglio, D. Le Phuoc, A. Lê Tuán, M. Intizar Ali, J.-P.
Calbimonte: On a Web of Data Streams. DeSemWeb@ISWC 2017

Find more: Q2
• S. Dehghanzadeh, A. Mileo, D. Dell'Aglio, E. Della Valle, S. Gao,
A. Bernstein: Online View Maintenance for Continuous Query
Evaluation. WWW (Companion Volume) 2015: 25-26
• S. Dehghanzadeh, D. Dell'Aglio, S. Gao, E. Della Valle, A. Mileo, A.
Bernstein: Approximate Continuous Query Answering over
Streams and Dynamic Linked Data Sets. ICWE 2015: 307-325
• S. Zahmatkesh, E. Della Valle, D. Dell'Aglio: When a FILTER Makes
the Difference in Continuously Answering SPARQL Queries on
Streaming and Quasi-Static Linked Data. ICWE 2016: 299-316
• S. Gao, D. Dell'Aglio, S. Dehghanzadeh, A. Bernstein, E. Della Valle,
A. Mileo: Planning Ahead: Stream-Driven Linked-Data Access
Under Update-Budget Constraints. International Semantic Web
Conference (1) 2016: 252-270
• S. Zahmatkesh, E. Della Valle, D. Dell'Aglio: Using Rank Aggregation
in Continuously Answering SPARQL Queries on Streaming and
Quasi-static Linked Data. DEBS 2017: 170-179
