WHEN A FILTER MAKES THE DIFFERENCE IN
CONTINUOUSLYANSWERING SPARQL
QUERIES ON STREAMING AND QUASI-STATIC
LINKED DATA
Shima Zahmatkesh, Emanuele Della Valle, and Daniele Dell'Aglio
DEIB - Politecnico of Milano
ICWE 2016 – Lugano
7th June 2016
Outline
• Introduction
• Motivation
• Problem Statement
• State-of-the-art
• Research Question and Hypotheses
• Proposed Solution
• Experimental Results
• Conclusion and future works
!2
Introduction to Stream Processing
Stream Processing
Engine
ResultsWindows
Stream data
Register query once
and execute it
continuously
!3
Time-based sliding window

Sequence of
timestamped events
W(3,1 ,1)
β
ω
widthslide
S3
S4 S5 S6
S7
S8 S9
S10
S11
S
S1
S2
t
1 2 3 4 5 6 7 8 9 10 11 12
W i
W i+1
!4
)
Introduction to RSP EnginesRSPengine
Web
Results
Join
Windows
RDF Streams SPARQL endpoint
- Fetching all background data in each evaluation
- High number of invocation to the endpoint
- Loosing reactiveness of the engine
!5
Introduction to RSP EnginesRSPengine
Web
Results
Join
WindowsRDF Streams SPARQL endpoint
Local
Replica
Data become stale
during time
!6
Introduction to Continuous QueryRSPengine
Web
Results
Join
Local
Replica
Maintenance
Policy
WindowsRDF Streams SPARQL endpoint
!7
Motivation
The cloth brand ACME wants to persuade influential Social
Network users to post commercial endorsements.
Every minute give me the ID of the users that are mentioned on
Social Network in the last 10 minutes whose number of followers is
greater than 100,000.
!8
Motivation
The cloth brand ACME wants to persuade influential Social
Network users to post commercial endorsements.
Every minute give me the ID of the users that are mentioned on
Social Network in the last 10 minutes whose number of followers is
greater than 100,000.
ΩW
?user :hasMentions ?metionsNum
ΩS
?user :hasFollowers ?followerCount
WINDOW
SERVICE
JOIN
FILTER
?followerCount > 100000
!8
Problem Statement
• RSP Engines:
• Fetch all the background data in each evaluation
• Have high number of invocations
• The application may loose its reactiveness.
• Class of query with WINDOW and SERVICE and FILTER
clauses
• Solution:
• Using a local replica of the background data
• Need maintenance policies to keep the data freshness
!9
State-of-the-art
• Acqua: guarantee reactiveness of RSP engines while
maximizing the freshness of the replica
• Used a local replica of the background data
• Proposed maintenance policies
• Used refresh budget γ to limit the number of the access to the
background data and guarantee reactiveness
• Class of query with WINDOW and SERVICE clauses without
FILTER
!10
Acqua framework
!11
WINDOW clause
JOIN Proposer Ranker
MaintainerLocal Replica
2
3
1
SERVICE clause
E
C
RND
LRU
WBM
WSJ
Candidate set
Elected set: top
γ mappings of
Candidate set
Research Question
• Given a class of query with WINDOW and SERVICE and
FILTER clauses
• how can we refresh the local replica in order to
• guarantee reactiveness while
• maximizing the freshness the replica?
!12
Intuition
!13
time
NumberofFollowers
t
Filtering Threshold
User A
User B
User C
User D
Hypotheses
• H.1 the replica can be maintained fresher than when using
Acqua policies, if we first refresh the mappings whose
values are closer to the Filtering Threshold.
• H.2 the replica can be maintained fresher than when using
Acqua policies by first selecting the mappings as in
Hypothesis H.1 and, then, applying the Acqua policies.
!14
Proposed Solution
!15
WINDOW clause
JOIN Proposer Ranker
MaintainerLocal Replica
2
3
1
SERVICE clause
contains
FILTER clause
E
C
WSJ
Candidate set
Elected set
Filter Update Policy
RND.F
LRU.F
WBM.F
Proposed Solution
• Filter Update Policy:
For each mapping in the replica:
• Computes how close is the value associate to the variable of the
mapping to the Filtering Threshold.
• Combination with Acqua:
• RND.F, LRU.F, WBM.F
• Select a subset of Candidate set whose mappings are closer to
Filtering Threshold.
• Apply the Acqua policies over this subset.
!16
Experimental Evaluation (1 of 3)
• Data Sets
• Streaming data: collected from 400 verified users of Twitter for
three hours of tweets
• Background data: collected the number of followers per user, every
minute during the three hours
• Transformation of the background data:
• To control the selectivity of filter
• The time-series of specified percentage of the users crosses the Filtering
Threshold at least once during the experiment.
• To avoid biases, generate 10 different dataset for each specified
percentage
!17
Experimental Evaluation (2 of 3)
• Query
• Contains WINDOW, SERVICE, and FILTER clauses
• Generate correct answer of the query by a Oracle
• Ans(Oracle)
• KPIs
• Use Jaccard distance to measure diversity of the set generated by
the query Ans(Qi) and correct answers Ans(Oraclei)
• Define cumulative Jaccard distance
!18
Experimental Evaluation (3 of 3)
• Two sets of experiments:
• Comparing the Filter Update Policy (.F) with Acqua ones
• Comparing the Filter Update Policy (.F) with the combine policies
(RND.F, LRU.F, and WBM.F)
• Sensitivity of Refresh Budget
• Sensitivity of Selectivity of Filter
!19
How to read result graphs
!20
CumulativeJaccardDistance
Iterations of query execution
Policy A
Policy C
Policy B
Policy A outperform Policy B and C
Result of Experiment 1
!21
Iterations of query execution
Result of Experiment 1
!22
90 80 75 70 60 50
Selectivity
Result of Experiment 2
!23
Iterations of query execution
Filter
Filter
Result of Experiment 2
!24
90 80 75 70 60 50
Selectivity
Conclusion
• Continuous evaluation of a class of queries that joins data
streams and background data with Filtering.
• Get the idea of local replica, maintenance policy, and refresh
budget from the state-of-the-art
• Extend the class of continuous queries
• Propose new maintenance policies
• Filter Update Policy keeps the replica fresher than Acqua
policies in higher selectivity (>60%).
• Combination of policies keep the replica even fresher than the
Filter Update Policy.
!25
Future works
• Broaden the class of queries
• Multiple filtering
• Filtering condition formulated as a ranking clause
• Pushing the FILTER clause into the SERVICE clause and
considering caching instead of local replica
• Study the effect of different trends in the data
!26
Thank you!

Any Question?
When a FILTER makes the difference in
continuously answering SPARQL queries
on streaming and quasi-static Linked Data
Shima Zahmatkesh
shima.zahmatkesh@polimi.it
DEIB - Politecnico of Milano
!27

When a FILTER makes the di fference in continuously answering SPARQL queries on streaming and quasi-static Linked Data

  • 1.
    WHEN A FILTERMAKES THE DIFFERENCE IN CONTINUOUSLYANSWERING SPARQL QUERIES ON STREAMING AND QUASI-STATIC LINKED DATA Shima Zahmatkesh, Emanuele Della Valle, and Daniele Dell'Aglio DEIB - Politecnico of Milano ICWE 2016 – Lugano 7th June 2016
  • 2.
    Outline • Introduction • Motivation •Problem Statement • State-of-the-art • Research Question and Hypotheses • Proposed Solution • Experimental Results • Conclusion and future works !2
  • 3.
    Introduction to StreamProcessing Stream Processing Engine ResultsWindows Stream data Register query once and execute it continuously !3
  • 4.
    Time-based sliding window
 Sequenceof timestamped events W(3,1 ,1) β ω widthslide S3 S4 S5 S6 S7 S8 S9 S10 S11 S S1 S2 t 1 2 3 4 5 6 7 8 9 10 11 12 W i W i+1 !4 )
  • 5.
    Introduction to RSPEnginesRSPengine Web Results Join Windows RDF Streams SPARQL endpoint - Fetching all background data in each evaluation - High number of invocation to the endpoint - Loosing reactiveness of the engine !5
  • 6.
    Introduction to RSPEnginesRSPengine Web Results Join WindowsRDF Streams SPARQL endpoint Local Replica Data become stale during time !6
  • 7.
    Introduction to ContinuousQueryRSPengine Web Results Join Local Replica Maintenance Policy WindowsRDF Streams SPARQL endpoint !7
  • 8.
    Motivation The cloth brandACME wants to persuade influential Social Network users to post commercial endorsements. Every minute give me the ID of the users that are mentioned on Social Network in the last 10 minutes whose number of followers is greater than 100,000. !8
  • 9.
    Motivation The cloth brandACME wants to persuade influential Social Network users to post commercial endorsements. Every minute give me the ID of the users that are mentioned on Social Network in the last 10 minutes whose number of followers is greater than 100,000. ΩW ?user :hasMentions ?metionsNum ΩS ?user :hasFollowers ?followerCount WINDOW SERVICE JOIN FILTER ?followerCount > 100000 !8
  • 10.
    Problem Statement • RSPEngines: • Fetch all the background data in each evaluation • Have high number of invocations • The application may loose its reactiveness. • Class of query with WINDOW and SERVICE and FILTER clauses • Solution: • Using a local replica of the background data • Need maintenance policies to keep the data freshness !9
  • 11.
    State-of-the-art • Acqua: guaranteereactiveness of RSP engines while maximizing the freshness of the replica • Used a local replica of the background data • Proposed maintenance policies • Used refresh budget γ to limit the number of the access to the background data and guarantee reactiveness • Class of query with WINDOW and SERVICE clauses without FILTER !10
  • 12.
    Acqua framework !11 WINDOW clause JOINProposer Ranker MaintainerLocal Replica 2 3 1 SERVICE clause E C RND LRU WBM WSJ Candidate set Elected set: top γ mappings of Candidate set
  • 13.
    Research Question • Givena class of query with WINDOW and SERVICE and FILTER clauses • how can we refresh the local replica in order to • guarantee reactiveness while • maximizing the freshness the replica? !12
  • 14.
  • 15.
    Hypotheses • H.1 thereplica can be maintained fresher than when using Acqua policies, if we first refresh the mappings whose values are closer to the Filtering Threshold. • H.2 the replica can be maintained fresher than when using Acqua policies by first selecting the mappings as in Hypothesis H.1 and, then, applying the Acqua policies. !14
  • 16.
    Proposed Solution !15 WINDOW clause JOINProposer Ranker MaintainerLocal Replica 2 3 1 SERVICE clause contains FILTER clause E C WSJ Candidate set Elected set Filter Update Policy RND.F LRU.F WBM.F
  • 17.
    Proposed Solution • FilterUpdate Policy: For each mapping in the replica: • Computes how close is the value associate to the variable of the mapping to the Filtering Threshold. • Combination with Acqua: • RND.F, LRU.F, WBM.F • Select a subset of Candidate set whose mappings are closer to Filtering Threshold. • Apply the Acqua policies over this subset. !16
  • 18.
    Experimental Evaluation (1of 3) • Data Sets • Streaming data: collected from 400 verified users of Twitter for three hours of tweets • Background data: collected the number of followers per user, every minute during the three hours • Transformation of the background data: • To control the selectivity of filter • The time-series of specified percentage of the users crosses the Filtering Threshold at least once during the experiment. • To avoid biases, generate 10 different dataset for each specified percentage !17
  • 19.
    Experimental Evaluation (2of 3) • Query • Contains WINDOW, SERVICE, and FILTER clauses • Generate correct answer of the query by a Oracle • Ans(Oracle) • KPIs • Use Jaccard distance to measure diversity of the set generated by the query Ans(Qi) and correct answers Ans(Oraclei) • Define cumulative Jaccard distance !18
  • 20.
    Experimental Evaluation (3of 3) • Two sets of experiments: • Comparing the Filter Update Policy (.F) with Acqua ones • Comparing the Filter Update Policy (.F) with the combine policies (RND.F, LRU.F, and WBM.F) • Sensitivity of Refresh Budget • Sensitivity of Selectivity of Filter !19
  • 21.
    How to readresult graphs !20 CumulativeJaccardDistance Iterations of query execution Policy A Policy C Policy B Policy A outperform Policy B and C
  • 22.
    Result of Experiment1 !21 Iterations of query execution
  • 23.
    Result of Experiment1 !22 90 80 75 70 60 50 Selectivity
  • 24.
    Result of Experiment2 !23 Iterations of query execution Filter Filter
  • 25.
    Result of Experiment2 !24 90 80 75 70 60 50 Selectivity
  • 26.
    Conclusion • Continuous evaluationof a class of queries that joins data streams and background data with Filtering. • Get the idea of local replica, maintenance policy, and refresh budget from the state-of-the-art • Extend the class of continuous queries • Propose new maintenance policies • Filter Update Policy keeps the replica fresher than Acqua policies in higher selectivity (>60%). • Combination of policies keep the replica even fresher than the Filter Update Policy. !25
  • 27.
    Future works • Broadenthe class of queries • Multiple filtering • Filtering condition formulated as a ranking clause • Pushing the FILTER clause into the SERVICE clause and considering caching instead of local replica • Study the effect of different trends in the data !26
  • 28.
    Thank you!
 Any Question? Whena FILTER makes the difference in continuously answering SPARQL queries on streaming and quasi-static Linked Data Shima Zahmatkesh shima.zahmatkesh@polimi.it DEIB - Politecnico of Milano !27