To perform complex tasks, RDF Stream Processing Web applications evaluate continuous queries over streams and quasi-static (background) data. While the former are pushed into the application, the latter are continuously retrieved from the sources. As the background data grow in volume and become distributed over the Web, the cost of retrieving them increases and applications become unresponsive.
In this paper, we address the problem of optimizing the evaluation of these queries by leveraging local views on background data. Local views enhance performance, but require maintenance processes, because changes in the background data sources are not automatically reflected in the application.
We propose a two-step query-driven maintenance process to maintain the local view: it exploits information from the query (e.g., the sliding window definition and the current window content) to maintain the local view based on user-defined Quality of Service constraints.
Experimental evaluation shows the effectiveness of the approach.
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets
1. Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao,
Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein
ICWE - 25 June 2015
2. Outline
● Introduction to Continuous Queries
● Motivating Example
● Problem Description
● Solution
● Experimental Results
● Conclusions
3. Introduction
RDF Stream Processing engines usually register queries
and execute them in a continuous fashion.
[Diagram: RDF Stream Generator, Query]
5. Introduction
Complex continuous queries combine data streams with
remote background data.
[Diagram: Join of the RDF Stream Generator with the Background data (SPARQL endpoint)]
6. Motivating Example
Finding Influential Users
Influential User: a user who has more than a specific number of
followers and is mentioned more than a specific number of times in a
specific period (200 seconds).
Follower number: stored in a remote endpoint.
Mention number: computed by processing the stream of messages.
Inspired by Chris Testa's SemTech 2011 talk: http://goo.gl/kLSqGo
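The influential-user definition can be illustrated with a minimal Python sketch of a single window evaluation. The function name, the thresholds, and the sample users are hypothetical; in the scenario of the slides the follower numbers would come from the remote endpoint rather than from an in-memory dictionary.

```python
from collections import Counter

def influential_users(window_mentions, follower_counts, min_followers, min_mentions):
    """Return the users exceeding both thresholds in one window evaluation.

    window_mentions: user names mentioned in the current window (stream side).
    follower_counts: user -> follower number (background data side).
    """
    mention_counts = Counter(window_mentions)
    return sorted(
        user
        for user, mentions in mention_counts.items()
        if mentions > min_mentions and follower_counts.get(user, 0) > min_followers
    )

# Hypothetical content of one 200-second window and of the background data.
window = ["alice", "bob", "alice", "carol", "alice", "bob"]
followers = {"alice": 1500, "bob": 80, "carol": 9000}
print(influential_users(window, followers, min_followers=1000, min_mentions=2))
```

Only users that pass both the stream-side and the background-side condition are returned, which is why the query needs a join between the two sources.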
7. Investigating the Scenario
Symmetrical hash join
Drawbacks:
• Data access constraints.
• Background data is huge and has to be fetched at every
evaluation, which is slow and wastes computational and financial
resources.
[Diagram: Join of the RDF Stream Generator with the Background data (SPARQL endpoint)]
8. Investigating the Scenario
Nested Loop Join
Drawbacks:
• One invocation for each mapping from the WINDOW
clause evaluation, i.e., a high number of requests to the server.
• API restrictions (e.g., a limited number of requests over
time).
[Diagram: Join of the RDF Stream Generator with the Background data (SPARQL endpoint)]
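The request cost of the nested loop join can be made concrete with a small Python sketch. `CountingEndpoint` is a hypothetical in-memory stand-in for the remote source, not an actual SPARQL client; it only counts how many lookups are issued.

```python
class CountingEndpoint:
    """Hypothetical stand-in for the remote source; counts the lookups it serves."""

    def __init__(self, followers):
        self._followers = followers
        self.requests = 0

    def lookup(self, user):
        self.requests += 1
        return self._followers.get(user)

def nested_loop_join(window_mappings, endpoint):
    """Join each window mapping with the background data, one request per mapping."""
    return [(user, endpoint.lookup(user)) for user in window_mappings]

endpoint = CountingEndpoint({"alice": 1500, "bob": 80})
window = ["alice", "bob", "alice", "alice"]
joined = nested_loop_join(window, endpoint)
print(endpoint.requests)  # one request per window mapping
```

The number of requests grows with the window content at every evaluation, which is what runs into API rate limits.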
9. Investigating the Scenario
Local Views
Challenges:
• Data goes out of date
[Diagram: Join of the RDF Stream Generator with a Local View of the Background data (SPARQL endpoint)]
10. Investigating the Scenario
Maintenance processes
Maintenance introduces a trade-off between response quality and
time.
We propose to manage this trade-off by fixing the time dimension
based on query constraints and maximizing the freshness of the response.
[Diagram: Join of the RDF Stream Generator with a Local View of the Background data (SPARQL endpoint). Labels: Maintenance Process, Refresh, Freshness decreases, Cost/Quality trade-off]
11. Problem Description
The maintenance process should identify elements of the local
view that maximize response freshness.
12. Requirements of The Maintenance Process
1. should satisfy the Quality of Service constraints
on responsiveness and freshness of the answer;
2. should take into account the change rates of the
data elements in the REST API;
3. should consider the dynamicity of the change
rate values;
4. may consider the sliding window operator.
13. Hypotheses
We formulated the following hypotheses to build the maintenance
process:
HP1: the freshness of the answer can increase by maintaining the part
of the local view involved in the current query evaluation.
HP2: the freshness of the answer increases by refreshing the
(possibly) stale local view entries that would remain fresh for a
higher number of evaluations.
15. Terminology
[Figure: time axis t with sliding windows W1 to W4 and marker τ. Legend: mappings from the WINDOW clause, mappings in the LOCAL VIEW, compatible mappings]
Best Before Time: the time at which an element will
become stale. It is defined by: [formula not in transcript]
16. WSJ
[Figure: time axis t with sliding windows W1 to W4 and marker τ]
WSJ identifies the candidate set: the possibly stale local
view mappings involved in the current evaluation.
WSJ analyzes the content of the current window evaluation
and identifies the compatible mappings in the local view.
The possibly stale mappings are identified by analyzing
the associated best before times.
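The WSJ step can be sketched in Python as an intersection of two sets: the entries involved in the current window and the entries that are possibly stale. This is a minimal sketch under two assumptions that are not stated on the slide: an entry counts as possibly stale once its best before time is not later than the current time, and the local view is a dictionary from user to a (value, best before time) pair. The colour names follow the editor's notes.

```python
def wsj_candidates(window_users, local_view, now):
    """Return the candidate set for one evaluation.

    local_view maps a user to (follower_count, best_before_time).
    Assumption: an entry is possibly stale once best_before_time <= now.
    """
    involved = set(window_users) & set(local_view)
    return sorted(u for u in involved if local_view[u][1] <= now)

# Hypothetical local view: pink is stale but not in the window,
# yellow is in the window but still fresh.
local_view = {
    "red": (120, 8),
    "green": (45, 9),
    "blue": (300, 10),
    "pink": (10, 7),
    "yellow": (77, 15),
}
window_users = ["red", "yellow", "blue", "green", "red"]
print(wsj_candidates(window_users, local_view, now=10))
```

Restricting maintenance to this intersection avoids spending the update budget on entries that cannot affect the current answer.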
17. WBM
[Figure: time axis t with sliding windows W1 to W4 and marker τ; table with columns V, L, Score]
WBM ranks the candidate set to determine which
mappings to update.
The ranking is computed through two values: the
renewed best before time and the remaining life time.
The top k elements are selected to be refreshed. The
value k is selected according to the responsiveness
constraint.
18. WBM: renewed best before time
[Figure: time axis t with sliding windows W1 to W4 and marker τ; table with columns V, L, Score, where V = 3, 4, 1]
When would the mappings become stale if refreshed
now?
The renewed best before time V is computed as: [formula not in transcript]
19. WBM: remaining life time and score
[Figure: time axis t with sliding windows W1 to W4 and marker τ; table with columns V, L, Score, where (V, L) = (3, 3), (4, 1), (1, 3)]
For how many future evaluations is the mapping
involved?
The remaining life time L is computed as: [formula not in transcript]
WBM ranks the mappings by using a score:
Score = min(L, V)
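The scoring and top-k selection can be sketched in Python using the (V, L) pairs from the slide's table. The mapping identifiers `m1` to `m3` are hypothetical, V and L are taken as given inputs, and two choices are assumptions of this sketch rather than statements from the slides: higher scores are refreshed first, and ties are broken by identifier.

```python
def wbm_top_k(candidates, k):
    """Rank candidates by Score = min(L, V) and return the top k.

    candidates maps a mapping id to (V, L):
      V = renewed best before time, L = remaining life time.
    Assumptions: higher score first, ties broken by id.
    """
    scores = {m: min(l, v) for m, (v, l) in candidates.items()}
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:k], scores

# (V, L) pairs as they appear in the slide's table.
candidates = {"m1": (3, 3), "m2": (4, 1), "m3": (1, 3)}
selected, scores = wbm_top_k(candidates, k=1)
print(scores)
print(selected)
```

Taking the minimum means that a mapping with a long renewed best before time but a short life in the window (here `m2`) scores as low as one that goes stale again quickly (`m3`).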
20. Experiment - Data Collection
1. Streaming API
a. Twitter stream data for mention count
2. Twitter APIs to get number of followers
a. Create snapshots every minute
b. Simulate the change based on users’ predefined change rates.
[Diagram: Streaming Dataset; Snapshots / synthetic data]
21. Experimental setup
We study our hypotheses using a comparative evaluation with
• LRU: use the least recently updated elements for maintenance
• RND: use a random subset of elements for maintenance
Error measure
• Comparing the differences between consecutive evaluations of the
motivating query against the cache and against the real/synthetic dataset.
HP1: We compared the cumulative staleness with and without WSJ (i.e.,
GNR) for both baselines.
• GNR: the candidate set is all entries of the view.
HP2: We compared the cumulative staleness of WBM and the
improved baselines.
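The two baseline policies can be sketched in Python as selection functions over a candidate set with an update budget k. The function names, the dictionary layout, and the seeded random generator are assumptions made for illustration.

```python
import random

def lru_select(last_updated, k):
    """LRU baseline: pick the k entries with the oldest update time.

    last_updated maps an entry to the time of its last refresh.
    """
    return sorted(last_updated, key=lambda e: (last_updated[e], e))[:k]

def rnd_select(entries, k, rng):
    """RND baseline: pick k entries uniformly at random."""
    pool = sorted(entries)
    return rng.sample(pool, min(k, len(pool)))

last_updated = {"a": 5, "b": 1, "c": 3, "d": 4}
print(lru_select(last_updated, k=2))
print(rnd_select(last_updated, k=2, rng=random.Random(0)))
```

Passing all view entries corresponds to GNR, while passing only the WSJ candidate set gives the improved baselines.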
22. HP1: Maintaining involved entries of local view maximizes response
accuracy.
[Plots: Synthetic, Real]
As the update budget increases, WSJ improves more than GNR.
23. HP2: Maintaining possibly stale entries from local view that will stay
fresh for a longer time maximizes response accuracy.
[Plots: Synthetic, Real]
WBM does not improve as much as WBM*, which shows that the error is
caused by inaccurate estimation of the BBT (best before time). A more
accurate prediction of the BBT is needed.
24. Conclusions and Future Work
Conclusions:
• We proposed using materialization to optimize the processing of
continuous queries.
• We proposed a policy to maximize freshness under the time
constraint of the continuous query.
• We tested our policy against baseline policies (LRU and Random).
Future Work:
• Extending real continuous query processors with the proposed
approach
• Measuring the time overhead of maintenance
• Investigating more complex queries that have complicated join patterns
between the SERVICE and STREAM clauses.
• Dynamically estimating the change rate of users.
25. Slide
25
Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao,
Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein
soheila.dehghanzadeh@insight-centre.org
http://www.slideshare.net/sallyde
Editor's Notes
We motivate this work with a SemTech talk
Problem is very specific, you should generalize it to other cases
How many time units we consider for one window
How many time units we slide the window to create the next window
Here we introduce some notions that we will use over the window
In order to produce the stream of influential users over time, we need to access the mention stream and the followers’ data from the REST API.
A sketch of the query in a continuous query language
The less we maintain the faster we can process queries, but how much less? How to minimize the maintenance?
Extension: to consider all users from the stream, if a user doesn’t exist in the local view, we fetch it and replace it with one of the existing entries from the local view
Our goal is to minimize the maintenance based on constraints on QoS as the cost function
If an entry stays fresh for a long time but its life in the window is short, we prefer entries that stay longer in the window
(B+D)/(A+B+C+D)
A=false positive
B= true positive
C=false negative
D=true negative
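The ratio above is the usual accuracy measure: correctly classified items over all items. A minimal Python sketch, keeping the letters A to D from the notes; the function name and the sample counts are hypothetical.

```python
def accuracy(a_fp, b_tp, c_fn, d_tn):
    """Accuracy = (B + D) / (A + B + C + D).

    A = false positives, B = true positives,
    C = false negatives, D = true negatives.
    """
    total = a_fp + b_tp + c_fn + d_tn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty result")
    return (b_tp + d_tn) / total

# Hypothetical counts for one evaluation.
print(accuracy(a_fp=1, b_tp=6, c_fn=1, d_tn=2))
```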
An efficient maintenance process should take into account the change rates of cached data, the dynamics of the change rates, the constraints on quality of service, and the definition of the sliding window to optimally maintain the data.
WBM picks the top-k based on the time constraints of the query and sends them to the refresher, which maintains the local view only for that particular subset.
The maintenance policy is applied online at every evaluation of the sliding window to maintain the local view
It uses the content of the current window as well as the statistics of change rates to pick a subset of the local view, which is passed to the maintainer to query the REST API and rewrite the content of only those elements.
Our proposed solution uses the change rates (R1) to identify stale mappings (red, green, blue and pink)
Our proposed solution uses the window definition (R4) to identify the involved elements (red, yellow, blue and green)
So WSJ only considers the intersection, which is red, green and blue
To investigate the first hypothesis, we investigate the effect of including (WSJ) or excluding (GNR) the proposer in the maintenance process; for the ranker we used the two baselines.
WST = no maintenance
BST = the proposer selects only stale, involved elements from the local view, based on the update budget