Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets

Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao,
Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein
ICWE - 25 June 2015

Outline
● Introduction to Continuous Queries
● Motivating Example
● Problem Description
● Solution
● Experimental Results
● Conclusions

Introduction
RDF Stream Processing engines usually register queries
and execute them in a continuous fashion.
[Figure: an RDF stream generator feeds a registered query. The query is evaluated over a time-based sliding window W(ω,β), with width ω and slide β, moving over stream items S1 to S12 along the time axis t.]
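
A minimal sketch of a time-based sliding window W(ω,β) may help make the figure concrete. It is not taken from any RDF Stream Processing engine: the function name `sliding_windows`, the item layout, and the convention that the window evaluated at time t covers the interval (t - ω, t] are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

Item = Tuple[int, str]  # (timestamp, payload); a hypothetical stream item


def sliding_windows(
    stream: List[Item], omega: int, beta: int, end: int, start: int = 0
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (evaluation_time, window_content) for a window W(omega, beta).

    Assumption: the window evaluated at time t covers the interval (t - omega, t].
    """
    t = start + omega  # first evaluation once one full window width has elapsed
    while t <= end:
        content = [payload for ts, payload in stream if t - omega < ts <= t]
        yield t, content
        t += beta  # the window slides by beta time units


# Six stream items S1..S6 arriving at times 1..6.
stream = [(i, f"S{i}") for i in range(1, 7)]
windows = list(sliding_windows(stream, omega=4, beta=2, end=6))
for t, content in windows:
    print(t, content)
```

With ω = 4 and β = 2, consecutive windows overlap by two time units, so items S3 and S4 take part in both evaluations.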

Introduction
Complex continuous queries combine data streams with
remote background data.
[Figure: the output of the RDF stream generator is joined with background data from a SPARQL endpoint.]

Motivating Example
Finding Influential Users
Influential users: users who have more than a given number of
followers and are mentioned more than a given number of times in a
given period (200 seconds).
Follower count: stored in a remote endpoint.
Mention count: computed by processing the stream of messages.
Inspired by Chris Testa's SemTech 2011 talk: http://goo.gl/kLSqGo
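
The definition above can be sketched as a join between per-window mention counts and follower counts. This is only an illustration: the function `influential_users`, the thresholds, and the sample users are hypothetical, and the follower counts are passed in as a plain dictionary rather than fetched from a remote endpoint.

```python
from collections import Counter
from typing import Dict, Iterable, List


def influential_users(
    window_mentions: Iterable[str],
    followers: Dict[str, int],
    min_mentions: int,
    min_followers: int,
) -> List[str]:
    """Return users exceeding both thresholds in the current window."""
    # Stream side: count mentions per user inside the current window.
    mention_count = Counter(window_mentions)
    # Background side: join each mentioned user with their follower count.
    return sorted(
        user
        for user, mentions in mention_count.items()
        if mentions > min_mentions and followers.get(user, 0) > min_followers
    )


# Hypothetical content of one 200-second window and of the background data.
window_mentions = ["alice", "bob", "alice", "carol", "alice", "bob"]
followers = {"alice": 1200, "bob": 5000, "carol": 90}

result = influential_users(window_mentions, followers, min_mentions=1, min_followers=1000)
print(result)
```

Here carol is excluded on both criteria, since she is mentioned only once and has fewer than 1000 followers.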

Investigating the Scenario
Symmetrical hash join
Drawbacks:
• Data access constraints.
• Background data is huge and has to be fetched at every
evaluation - slow, and wasteful of computational and financial
resources.

Nested Loop Join
Drawbacks:
• One invocation for each mapping from the WINDOW
clause evaluation – high number of requests to the server.
• API restrictions (e.g., a limited number of requests over
time).

Local Views
Challenges:
• Data goes out of date
[Figure: the join reads from a local view of the background data (SPARQL endpoint) instead of querying the endpoint directly.]

Maintenance processes
Maintenance introduces a trade-off between response quality and
time.
We propose to manage this trade-off by fixing the time dimension
based on query constraints and maximizing the freshness of the response.
[Figure: a maintenance process refreshes the local view from the background data (SPARQL endpoint). Freshness decreases between refreshes, which gives a cost/quality trade-off.]

Problem Description
The maintenance process should identify elements of the local
view that maximize response freshness.

Requirements of The Maintenance Process
1. should satisfy the Quality of Service constraints
on responsiveness and freshness of the answer;
2. should take into account the change rates of the
data elements in the REST API;
3. should consider the dynamicity of the change
rate values;
4. may consider the sliding window operator.

Hypotheses
We formulated the following hypotheses to build the maintenance
process:
HP1: the freshness of the answer can increase by maintaining the part
of the local view involved in the current query evaluation.
HP2: the freshness of the answer increases by refreshing the
(possibly) stale local view entries that would remain fresh in a
higher number of evaluations.

Solution: WSJ+WBM
[Figure: architecture with the Window, the JOIN, WSJ (HP1), WBM (HP2), the Refresher, the Local View, and the background data (BKG).]

Terminology
Best before time: the time at which an element will
become stale. It is defined by:
[Formula: definition of the best before time]
[Figure: timeline t with windows W1 to W4, showing the mappings from the WINDOW clause, the mappings in the LOCAL VIEW, and the compatible mappings.]

WSJ
WSJ identifies the candidate set: the possibly stale local
view mappings involved in the current evaluation.
WSJ analyzes the content of the current window evaluation
and identifies the compatible mappings in the local view.
The possibly stale mappings are identified by analyzing
the associated best before times.
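
The candidate selection described above can be sketched as follows. The names `ViewEntry` and `wsj_candidates` are hypothetical, mappings are reduced to a single join key, and the test for "possibly stale" (best before time not later than the current evaluation time) is an assumption, since the exact definition is not given in the text.

```python
from typing import Dict, Iterable, List, NamedTuple


class ViewEntry(NamedTuple):
    value: int        # cached background value, e.g. a follower count
    best_before: int  # time at which the entry is expected to become stale


def wsj_candidates(
    window_keys: Iterable[str], local_view: Dict[str, ViewEntry], now: int
) -> List[str]:
    """Return keys of local view entries that are compatible with the
    current window and possibly stale at time `now`."""
    keys = set(window_keys)
    return sorted(
        key
        for key in keys
        # Compatible: the key appears in both the window and the local view.
        if key in local_view
        # Possibly stale (assumption): the best before time has passed.
        and local_view[key].best_before <= now
    )


local_view = {
    "a": ViewEntry(value=10, best_before=5),
    "b": ViewEntry(value=20, best_before=9),
    "c": ViewEntry(value=30, best_before=4),
    "d": ViewEntry(value=40, best_before=3),
}
candidates = wsj_candidates(["a", "b", "d"], local_view, now=6)
print(candidates)
```

Entry c is stale but not in the window, and entry b is in the window but still fresh, so neither enters the candidate set.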

WBM
WBM ranks the candidate set to determine which
mappings to update.
The ranking is computed from two values: the
renewed best before time (V) and the remaining life time (L).
The top k elements are selected to be refreshed. The
value k is selected according to the responsiveness
constraint.

WBM: renewed best before time
When would the mappings become stale if refreshed
now?
The renewed best before time V is computed as:
[Formula: renewed best before time V]
V  L  Score
3
4
1

WBM: remaining life time and score
In how many future evaluations is the mapping
involved?
The remaining life time L is computed as:
[Formula: remaining life time L]
WBM ranks the mappings by using a score:
Score = min(L, V)
V  L  Score
3  3
4  1
1  3
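
The scoring rule Score = min(L, V) and the top-k selection can be sketched as below, using the (V, L) pairs from the table. The names `wbm_scores` and `wbm_select` and the mapping labels m1 to m3 are hypothetical. Ranking by descending score follows HP2 (prefer entries that stay fresh in more evaluations), and breaking ties by label is an assumption. V and L are taken as given, since their formulas are not reproduced here.

```python
from typing import Dict, List, Tuple


def wbm_scores(candidates: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Map each candidate to Score = min(L, V); input values are (V, L) pairs."""
    return {key: min(L, V) for key, (V, L) in candidates.items()}


def wbm_select(candidates: Dict[str, Tuple[int, int]], k: int) -> List[str]:
    """Return the k candidates with the highest score.

    Assumption: ties are broken by label to keep the result deterministic.
    """
    scores = wbm_scores(candidates)
    ranked = sorted(scores, key=lambda key: (-scores[key], key))
    return ranked[:k]


# (V, L) pairs from the table; the labels are hypothetical.
candidates = {"m1": (3, 3), "m2": (4, 1), "m3": (1, 3)}

scores = wbm_scores(candidates)
selected = wbm_select(candidates, k=1)
print(scores, selected)
```

The second mapping has the largest V but is involved in only one more evaluation, while the third stays in the window longer but would become stale after one evaluation, so both score 1 and the first mapping is refreshed first.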

Experiment: Data Collection
1. Streaming API
   a. Twitter stream data for the mention count
2. Twitter APIs to get the number of followers
   a. Create snapshots every minute
   b. Simulate the changes based on each user’s predefined change rate.
[Figure: the streaming dataset and the snapshots/synthetic data.]

Experimental setup
We study our hypotheses using a comparative evaluation with two baselines:
• LRU: use the least recently updated elements for maintenance
• RND: use a random subset of elements for maintenance
Error measure:
• At consecutive evaluations, compare the result of the motivating query
computed against the cache with the result computed against the real/synthetic dataset.
HP1: We compared the cumulative staleness with and without WSJ (i.e.,
GNR) for both baselines.
• GNR: the candidate set is the whole local view.
HP2: We compared the cumulative staleness of WBM with that of the
improved baselines.
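
The two baseline policies can be sketched as simple selection functions over a candidate set. The names `lru_select` and `rnd_select`, the `last_updated` bookkeeping, and the fixed random seed are assumptions for illustration, not the setup used in the experiments.

```python
import random
from typing import Dict, List


def lru_select(last_updated: Dict[str, int], k: int) -> List[str]:
    """LRU baseline: pick the k entries whose last update is the oldest."""
    # Ties on the update time are broken by key (assumption).
    return sorted(last_updated, key=lambda key: (last_updated[key], key))[:k]


def rnd_select(keys: List[str], k: int, seed: int = 0) -> List[str]:
    """RND baseline: pick a random subset of at most k entries."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return rng.sample(sorted(keys), min(k, len(keys)))


# Hypothetical candidate set: key -> time of the last refresh.
last_updated = {"a": 5, "b": 2, "c": 9}

lru_choice = lru_select(last_updated, k=2)
rnd_choice = rnd_select(list(last_updated), k=2)
print(lru_choice, rnd_choice)
```

Passing the whole local view as the candidate set corresponds to GNR, while passing only the entries selected by WSJ corresponds to the improved baselines.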

HP1: Maintaining involved entries of local view maximizes response
accuracy.
[Figure: two panels, Synthetic and Real.]
As the update budget increases, WSJ improves more than GNR.

HP2: Maintaining possibly stale entries from local view that will stay
fresh for a longer time maximizes response accuracy.
[Figure: two panels, Synthetic and Real.]
WBM does not improve as much as WBM*, which indicates that the gap
is caused by errors in estimating the best before time (BBT). A more
accurate BBT prediction is needed.

Conclusions and Future Work
Conclusions:
• We proposed using materialization to optimize the processing of
continuous queries.
• We proposed a policy that maximizes freshness under the time
constraint of a continuous query.
• We tested our policy against baseline policies (LRU and Random).
Future Work:
• Extending real continuous query processors with the proposed
approach.
• Measuring the time overhead of maintenance.
• Investigating more complex queries that have complicated join patterns
between the SERVICE and STREAM clauses.
• Dynamically estimating the change rate of users.

Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao,
Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein
soheila.dehghanzadeh@insight-centre.org
http://www.slideshare.net/sallyde
ICWE - 25 June 2015
