On Relevant Query Answering over Streaming and Distributed Data
1. On Relevant Query Answering over Streaming and Distributed Data
Shima Zahmatkesh
Politecnico di Milano – DEIB
Data Science Group – Stream Reasoning Team
Supervisor: Prof. Emanuele Della Valle
3. Motivation
▪ An application that shows drivers the nearby places where there is a high
probability of finding free parking.
Query: return the best streets (around the car that calls the service) where,
in the last 10 minutes, there are many free parking lots and few cars looking
for parking.
▪ Web applications need to combine data streams with distributed data on the
Web to continuously find the best answers to users' queries.
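4. Solving the example with web stream processing
[Figure: a Stream Processing Engine receives a Request, applies Windows over the car request streams, Joins them with the free parking lots data available on the Web, and returns the Relevant Answers (best streets to look for parking) in the Response.]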
6. Research Question
Is it possible to optimize query evaluation in order to
continuously obtain the most relevant combinations of
streaming and evolving distributed data, while
guaranteeing the reactiveness of the engine?
9. Scope of the state of the art
Features                ACQUA                                  MinTopK
Type of data            Streaming and distributed              Streaming
Relevancy               ✗                                      ✓
Reactiveness            Refresh budget                         Incremental evaluation
Handling evolving data  Local replica, maintenance policies    ✗
10. Scope of the research
▪ Queries that contain a FILTER clause whose condition applies to data coming
from the distributed dataset.
▪ Top-k queries whose scoring function involves data from both the streaming
and the distributed datasets.
12. Query
▪ Every minute, give me the best influencers, i.e. users who were mentioned on
the social network in the last 10 minutes and whose number of followers is
greater than 100,000.
REGISTER STREAM <:Influencers> AS
CONSTRUCT {?user a :influentialUser}
WHERE {
WINDOW :W(10m,1m) ON :S
{?user :hasMentions ?mentionsNumber}
SERVICE :BKG
{?user :hasFollowers ?followersCount}
FILTER (?followersCount > 100000)
}
[Slide annotations on the query: Filtering Threshold (the FILTER condition), ACQUA.]
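To make the semantics of the query concrete, here is a minimal Python sketch of one evaluation step. It assumes the WINDOW content is available as a dictionary of mention counts and that the SERVICE data is read from a local replica of follower counts. The function name `influencers` and the sample data are hypothetical and do not come from the slides.

```python
def influencers(window_mentions, replica_followers, threshold=100_000):
    """Join the users mentioned in the current window with the local replica
    of follower counts and keep those above the filtering threshold."""
    return sorted(
        user
        for user in window_mentions
        if replica_followers.get(user, 0) > threshold
    )


# Hypothetical window content (user -> mentions) and replica (user -> followers).
window = {"alice": 12, "bob": 3, "carol": 7}
replica = {"alice": 250_000, "bob": 99_000, "carol": 100_001}

result = influencers(window, replica)
print(result)  # ['alice', 'carol']
```

If the replica is stale, a user such as "bob" whose real follower count has just passed 100,000 is missed, which is why the choice of which replica entries to refresh matters.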
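13. State of the art - ACQUA
[Figure: ACQUA architecture. The results of the WINDOW clause are JOINed with the Local Replica of the data behind the SERVICE clause. Three components keep the replica fresh: (1) the Proposer selects the Candidate set, (2) the Ranker orders it according to a policy (RND, LRU, or WBM) and produces the Elected set, and (3) the Maintainer refreshes the elected entries in the Local Replica.]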
15. Filter Update Policy (intuition)
[Figure: NumberofFollowers of Users A, B, C, and D plotted over time up to the current time t, with the Filtering Threshold marked on the plot.]
▪ Computes how close the value associated with the variable of each data item
is to the Filtering Threshold.
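The intuition can be sketched in a few lines of Python. The sketch assumes that the items whose replicated value is closest to the threshold are refreshed first, since they are the most likely to cross it, and that at most `budget` items can be refreshed. The name `filter_update_policy` and the follower counts are illustrative only.

```python
def filter_update_policy(replica, threshold, budget):
    """Return up to `budget` keys whose replicated value is closest
    to the filtering threshold (ties broken by key)."""
    by_distance = sorted(replica, key=lambda k: (abs(replica[k] - threshold), k))
    return by_distance[:budget]


# Hypothetical replicated follower counts for the four users in the figure.
followers = {"A": 180_000, "B": 104_000, "C": 97_500, "D": 20_000}

elected = filter_update_policy(followers, threshold=100_000, budget=2)
print(elected)  # ['C', 'B']
```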
17. Combined Policies – ACQUA.F
[Figure: NumberofFollowers of Users A, B, C, and D plotted over time up to the current time t, with a Band highlighted on the plot.]
▪ Combine the Filter Update Policy with the ACQUA ones
▪ RND.F, LRU.F, and WBM.F
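One possible reading of the combination is sketched below for LRU.F. It assumes that the Band is an interval around the Filtering Threshold, that only items inside the band are candidates, and that the ACQUA policy (here LRU, least recently refreshed first) then orders those candidates. These are assumptions made for illustration: the slide does not spell out the mechanics, and `lru_f`, `band`, and the data are hypothetical.

```python
def lru_f(replica, last_refresh, threshold, band, budget):
    """Hypothetical LRU.F: keep the items whose value lies within `band`
    of the threshold, then elect the least recently refreshed ones."""
    in_band = [k for k, v in replica.items() if abs(v - threshold) <= band]
    in_band.sort(key=lambda k: (last_refresh[k], k))
    return in_band[:budget]


# Hypothetical follower counts and last refresh times (smaller = older).
followers = {"A": 180_000, "B": 104_000, "C": 97_500, "D": 20_000, "E": 101_000}
last_refresh = {"A": 5, "B": 2, "C": 9, "D": 1, "E": 4}

elected = lru_f(followers, last_refresh, threshold=100_000, band=5_000, budget=2)
print(elected)  # ['B', 'E']
```

User D has the oldest refresh time but lies far outside the band, so it is never elected under this reading.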
19. State of the art - Rank Aggregation
▪ Fairly take into account the opinions of different algorithms.
▪ Combine the ranking lists by computing an aggregated score.

WBM             Filter Update    WBM.F+ (α = 0.5)
User   Score    User   Score     User   Score_agg
Alice  0.8      Bob    0.9       Bob    0.8
Bob    0.7      David  0.8       Alice  0.75
David  0.4      Alice  0.7       David  0.6
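The numbers in the example are consistent with a weighted sum of the two scores. The sketch below reproduces them under that assumption, with α applied to the first list and 1 − α to the second. The function name `aggregate` is hypothetical, and scores are rounded to two decimals to avoid floating-point noise.

```python
def aggregate(scores_a, scores_b, alpha=0.5):
    """Weighted-sum rank aggregation over the users present in both lists."""
    users = scores_a.keys() & scores_b.keys()
    agg = {
        u: round(alpha * scores_a[u] + (1 - alpha) * scores_b[u], 2)
        for u in users
    }
    # Highest aggregated score first, ties broken by name.
    return sorted(agg.items(), key=lambda kv: (-kv[1], kv[0]))


wbm = {"Alice": 0.8, "Bob": 0.7, "David": 0.4}
filter_update = {"Bob": 0.9, "David": 0.8, "Alice": 0.7}

ranked = aggregate(wbm, filter_update, alpha=0.5)
print(ranked)  # [('Bob', 0.8), ('Alice', 0.75), ('David', 0.6)]
```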
23. State of the art – MinTopK
[Figure: objects A to H plotted by Score (0 to 8) against Time (0 to 13); W1 is the current window, ending at "now". Window Length = 9.]
MTK Lists (Top-2 results): W1 = E, C
24. State of the art – MinTopK
[Figure: objects A to H plotted by Score (0 to 8) against Time (0 to 13); the window has slid forward to W2. Slide = 3.]
MTK Lists (Top-2 results): W1 = E, C; W2 = E, C
25. State of the art – MinTopK
[Figure: objects A to H plotted by Score (0 to 8) against Time (0 to 13); the window has slid forward to W3.]
MTK Lists (Top-2 results): W1 = E, C; W2 = E, C; W3 = E, F
26. State of the art – MinTopK
[Figure: objects A to H plotted by Score (0 to 8) against Time (0 to 13); the window has slid forward to W3.]
MTK Lists (Top-2 results): W1 = E, C; W2 = E, C; W3 = E, F
Super-MTK List:
Object  Ws  We
E       1   3
C       1   2
F       3   3
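The Super-MTK List stores, for each object, the first window (Ws) and the last window (We) whose top-k list contains it. The sketch below derives that encoding from the three per-window lists of the example. It only illustrates the encoding, not the incremental MinTopK algorithm, and it assumes each object appears in a contiguous run of windows. The function name `super_mtk` is hypothetical.

```python
def super_mtk(mtk_lists):
    """mtk_lists[i] is the top-k list of window i + 1.
    Returns {object: (Ws, We)} in order of first appearance."""
    marks = {}
    for w, objects in enumerate(mtk_lists, start=1):
        for obj in objects:
            ws, _ = marks.get(obj, (w, w))  # keep the first window seen
            marks[obj] = (ws, w)            # extend the last window
    return marks


encoded = super_mtk([["E", "C"], ["E", "C"], ["E", "F"]])
print(encoded)  # {'E': (1, 3), 'C': (1, 2), 'F': (3, 3)}
```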
27. Top-k Query
▪ Every 3 minutes, return the top-2 most popular users, i.e. those most
mentioned on social networks in the last 9 minutes.
REGISTER STREAM :TopkUsersToContact AS
SELECT ?user F(?mentionCount,?followerCount) AS ?score
FROM NAMED WINDOW :W ON :S [RANGE 9m STEP 3m]
WHERE {
WINDOW :W {?user :hasMentions ?mentionCount}
SERVICE :BKG {?user :hasFollowers ?followerCount}
}
ORDER BY DESC (?score)
LIMIT 2
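As a point of reference, the query can be evaluated naively by recomputing the ranking from scratch for every window, which is the cost that incremental approaches aim to avoid. The sketch below assumes the first window closes after 9 minutes and that the scoring function F is supplied by the caller. The helper names, the sample events, and the particular `score` function are hypothetical.

```python
def sliding_windows(events, width, step, horizon):
    """Yield (end, events with end - width <= t < end) for each window
    closing at width, width + step, ... up to horizon."""
    end = width
    while end <= horizon:
        yield end, [(t, user) for t, user in events if end - width <= t < end]
        end += step


def top_k(window_events, followers, score, k):
    """Naive evaluation: count mentions per user, score, sort, cut at k."""
    mentions = {}
    for _, user in window_events:
        mentions[user] = mentions.get(user, 0) + 1
    ranked = sorted(
        mentions,
        key=lambda u: (-score(mentions[u], followers.get(u, 0)), u),
    )
    return ranked[:k]


# Hypothetical data: (minute, mentioned user) and replicated follower counts.
events = [(0, "A"), (1, "B"), (2, "B"), (4, "C"), (5, "A"),
          (7, "C"), (8, "C"), (10, "D"), (11, "D")]
followers = {"A": 5000, "B": 1000, "C": 2000, "D": 9000}
score = lambda mentions, follower_count: mentions + follower_count / 1000

results = [(end, top_k(evts, followers, score, k=2))
           for end, evts in sliding_windows(events, width=9, step=3, horizon=12)]
print(results)  # [(9, ['A', 'C']), (12, ['D', 'A'])]
```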
29. Score
[Figure: three plots over Time (0 to 13) for objects A to G, showing ScoreS, ScoreR, and the resulting FinalScore.]
Final Score = F(ScoreS, ScoreR)
30. Contributions (1 of 2)
▪ Data structure: Super-MTK+N List
  ▪ Handles changes in the distributed dataset: N changes per window
  ▪ N additional slots
  ▪ MTK+N list: keeps K+N elements
  ▪ Complexity: O(K+N)
[Figure: MTK+N lists for W1, W2, and W3, each divided into a K area and an N area, with the LBP marked.]
MTK+N lists: W1 = E, G, C; W2 = E, G, F; W3 = G, A
Super-MTK+N List:
Object  Ws  We
E       1   2
G       1   3
C       1   1
F       2   2
A       3   3
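A single MTK+N list can be pictured as a bounded, score-ordered list whose first K entries form the K area and whose remaining N entries form the N area. The class below is a minimal sketch of that idea for one window only. It is not the Super-MTK+N List itself, and the class name, method names, and scores are hypothetical.

```python
import bisect


class MTKNList:
    """Keeps the K + N best (score, object) pairs, best first."""

    def __init__(self, k, n):
        self.k, self.n = k, n
        self._items = []  # sorted ascending by (-score, object)

    def insert(self, obj, score):
        bisect.insort(self._items, (-score, obj))
        del self._items[self.k + self.n:]  # drop what falls outside K + N

    def k_area(self):
        return [obj for _, obj in self._items[:self.k]]

    def n_area(self):
        return [obj for _, obj in self._items[self.k:]]


# Hypothetical scores chosen so that the result matches W1 = E, G, C.
w1 = MTKNList(k=2, n=1)
for obj, score in [("C", 5), ("E", 7), ("D", 2), ("G", 6)]:
    w1.insert(obj, score)

print(w1.k_area(), w1.n_area())  # ['E', 'G'] ['C']
```

Each insertion into the Python list takes time linear in its length, which is bounded by K+N here.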
31. Contributions (2 of 2)
▪ Algorithms:
  ▪ Top-k+N
    ▪ Window expiration
    ▪ New arrival of distinct data items
    ▪ Handling changes in distributed data
  ▪ AcquaTop
    ▪ Handling updates of the local replica
  ▪ Complexity: O(K+N)
▪ Framework: AcquaTop Framework
  ▪ Applies maintenance policies
32. Top-K+N – New object arrival
[Figure: objects A to G plotted by Score (0 to 8) against Time (0 to 13); W2 is the current window. K = 2, N = 1.]
Super-MTK+N List:
Object  Ws  We
E       2   3
C       2   2
F       2   3
A       3   4
33. Top-K+N – New object arrival
[Figure: objects A to G plotted by Score (0 to 8) against Time (0 to 13); W2 is the current window. K = 2, N = 1.]
Super-MTK+N List before the arrival:
Object  Ws  We
E       2   3
C       2   2
F       2   3
A       3   4
Super-MTK+N List after applying Top-K+N:
Object  Ws  We
G       2   4
E       2   3
C       2   2
F       3   3
A       4   4
34. Top-K+N - Handling Changes
[Figure: objects A to G plotted by Score (0 to 8) against Time (0 to 13); W2 ends at "now".]
Super-MTK+N List:
Object  Ws  We
G       2   4
E       2   3
C       2   2
F       3   3
A       4   4
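35. Top-K+N - Handling Changes
[Figure: objects A to G plotted by Score (0 to 8) against Time (0 to 13); W2 ends at "now". Object F is highlighted as the changed object.]
Super-MTK+N List before the change:
Object  Ws  We
G       2   4
E       2   3
C       2   2
F       3   3
A       4   4
Super-MTK+N List after applying Top-K+N:
Object  Ws  We
G       2   4
F       2   3
E       2   3
C       2   2
A       4   4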
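The reordering in this example can be sketched as "remove the changed object, then reinsert it at the position of its new score". The slides do not give the scores, so the values below are hypothetical and chosen only so that F moves from fourth to second place. The sketch also ignores the Ws and We marks, which the real data structure has to adjust as well.

```python
def apply_change(ranking, obj, new_score):
    """ranking: list of (object, score) in descending score order.
    Returns a new ranking with `obj` moved to the position of its new score."""
    rest = [(o, s) for o, s in ranking if o != obj]
    pos = 0
    while pos < len(rest) and rest[pos][1] >= new_score:
        pos += 1
    return rest[:pos] + [(obj, new_score)] + rest[pos:]


# Hypothetical scores consistent with the order G, E, C, F, A.
before = [("G", 8), ("E", 7), ("C", 6), ("F", 4), ("A", 3)]
after = apply_change(before, "F", 7.5)

print([o for o, _ in after])  # ['G', 'F', 'E', 'C', 'A']
```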
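36. AcquaTop Framework
[Figure: AcquaTop framework. The AcquaTop Algorithm consumes the RDF Stream and maintains the Super-MTK+N List through the Top-k+N Algorithm (Expiration, New Arrival, Remote Changes). The Ranker chooses the Elected set from the Candidate set using the MTKN-T, MTKN-F, or MTKN-A policy, and the Maintainer refreshes the Local Replica from the SPARQL endpoint.]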
37. New Maintenance Policies
▪ MTKN-T: selects objects for updating from the top of the MTKN list
▪ MTKN-F: selects objects for updating from the border of the K and N areas
of the MTKN list (half from the top of the N area and half from the bottom
of the K area)
[Figure: the same MTKN list shown once per policy, with 2 items highlighted for updating in each.]
Object  Ws  We
E       2   3
G       2   4
C       2   2
F       3   3
A       4   4
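Both policies reduce to a slice of the score-ordered list, as the following sketch shows. It assumes K = 2 for the list in the example (the value used on the earlier Top-K+N slides) and a budget of 2 items. When the budget is odd, the extra item is taken from the N area, which is an arbitrary choice made here. The function names are hypothetical.

```python
def mtkn_t(mtkn_list, budget):
    """MTKN-T: take the first `budget` objects from the top of the list."""
    return mtkn_list[:budget]


def mtkn_f(mtkn_list, k, budget):
    """MTKN-F: take objects around the K/N border, half from the bottom of
    the K area and the rest from the top of the N area."""
    from_k = budget // 2
    from_n = budget - from_k
    start = max(0, k - from_k)
    return mtkn_list[start:k] + mtkn_list[k:k + from_n]


objects = ["E", "G", "C", "F", "A"]  # order of the list in the example

top_choice = mtkn_t(objects, budget=2)
border_choice = mtkn_f(objects, k=2, budget=2)
print(top_choice, border_choice)  # ['E', 'G'] ['G', 'C']
```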
39. Experimental setting
▪ Datasets:
  ▪ Streaming data from Twitter: number of mentions of each user
  ▪ Real data from the Twitter REST API: follower count of each user
  ▪ Realistic and synthetic distributed data
▪ Queries:
  ▪ Query with FILTER clause
  ▪ Top-k query
    ▪ Scoring function: normalized weighted sum of the number of mentions
      in each window and the number of changes in follower count
▪ Generate the Oracle for each query
40. Experimental setting
▪ Baselines:
  ▪ WST: no changes are applied to the local replica
  ▪ RND: randomly selects items to update
  ▪ MTKN-A: updates all the elements in the MTKN list
▪ Metrics:
  ▪ CJD: shows the correctness of the results for 2 different sets
  ▪ nDCG@K: shows how relevant the results are compared to the Oracle's
  ▪ ACC@K: shows the accuracy of the results
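Two of these metrics can be sketched with their textbook definitions. The sketch assumes that CJD stands for cumulative Jaccard distance, i.e. the Jaccard distance between the result set and the Oracle set summed over the evaluations, and it uses the standard nDCG@K formula with graded relevance taken from the Oracle. The exact definitions used in the experiments may differ, and the function names and sample values are hypothetical.

```python
import math


def jaccard_distance(result, oracle):
    """1 - |intersection| / |union| between two result sets."""
    result, oracle = set(result), set(oracle)
    union = result | oracle
    return 0.0 if not union else 1 - len(result & oracle) / len(union)


def cumulative_jaccard_distance(results, oracles):
    """Sum of the per-evaluation Jaccard distances."""
    return sum(jaccard_distance(r, o) for r, o in zip(results, oracles))


def ndcg_at_k(ranked, relevance, k):
    """Standard nDCG@k; `relevance` maps each object to its Oracle relevance."""
    def dcg(items):
        return sum(relevance.get(x, 0) / math.log2(i + 2)
                   for i, x in enumerate(items[:k]))

    ideal = sorted(relevance, key=relevance.get, reverse=True)
    best = dcg(ideal)
    return 0.0 if best == 0 else dcg(ranked) / best


# Hypothetical evaluation: the Oracle top-2 is E, C; the engine answers E, F.
distance = jaccard_distance(["E", "F"], ["E", "C"])
perfect = ndcg_at_k(["E", "C"], {"E": 2, "C": 1}, k=2)
swapped = ndcg_at_k(["C", "E"], {"E": 2, "C": 1}, k=2)
```

Here `distance` is 2/3 because the two sets share one object out of three, `perfect` is 1.0, and `swapped` is lower than 1.0 because the more relevant object is ranked second.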
44. Limitations and Future work
Limitations                                          Future work
Two classes of queries                               Broaden the class of queries: N:M join relationships, multi-join operators, preference queries, …
Static refresh budget                                Flexible budget allocation
Full replica                                         Cache and replacement strategies
Single stream of data and one query for evaluation   Distributed streams and multiple queries
Correct and complete data                            Inaccurate or incomplete data
45. Conclusion
▪ In this work, we address the problem of relevant query answering over
streaming and distributed data.
▪ We propose maintenance policies for queries with a FILTER clause.
▪ We propose a framework for top-k query answering, together with
maintenance policies that generate more relevant and accurate results.
▪ We obtain more relevant and accurate results than the state-of-the-art
approaches.
46. Thank you!
Any questions?
Shima Zahmatkesh
shima.zahmatkesh@polimi.it
DEIB - Politecnico di Milano