[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams
1. Cloud based publish/subscribe model for
Top-k matching over continuous data
streams
Author:
Y.S. Horawalavithana
10002103
Supervisor:
Dr. D.N. Ranasinghe
U/Graduate Thesis Defense
January 23, 2015
UNIVERSITY OF COLOMBO SCHOOL OF COMPUTING
SCS 4001: INDIVIDUAL PROJECT
2. 2
Overview
• Motivation
• Target
• Design & Architecture
• Related work
• Dynamic Diversification
• Incremental Top-k
• Implementation
• Evaluation
• Conclusion
• Future work
4. 4
Boolean publish/subscribe
Drawbacks
A subscriber may be either overloaded with
publications or receive too few publications
Matching publications cannot be compared with one another
because no ranking function is
defined, and
partial matching between subscriptions and
publications is not supported.
1.Motivation 2.Target 3.Design & Architecture 4.Related work 5.Dynamic Diversification 6.Incremental Top-k 7.Implementation 8.Evaluation 9.Conclusion 10.Future Work
5. 5
Top-k publish/subscribe
Expressive stateful query processing systems
User defined parameter k restricts the
delivered publications
Pub/Sub Matching
Top-k pub/sub scoring or ranking
Pub/Sub Indexing
Indexing to support personalized subscriptions
Indexing to support continuous Top-k
publications retrieval
6. 6
Target
1. How to define an efficient scoring algorithm that integrates
query-independent & query-dependent score metrics?
- Relevance, Freshness & Diversity
2. How to adapt existing indexing data structures used in state-of-the-art
publish/subscribe systems under
a) large subscription volume,
b) high event rate and,
c) the variety of subscribable attributes,
to support Top-k matching queries?
7. 7
Scope
Optimize the Top-k heuristic for a specific domain
E-commerce with buyers & sellers
Subscriptions & publications follow a pre-defined
data structure
The number of incoming publications follows a
Poisson distribution
Retrieve Top-k publications against subscriptions,
not the reverse.
12. 12
Relevancy: Personalized Subscription space
2
Carrier = Verizon
Storage ≤ 32GB
2.5
Carrier = AT&T
Storage ≤ 16GB
1.75
Brand = HTC
1.3
2.3
Carrier = Verizon
Color = White
OS = Android
Storage = 16GB
Brand = HTC
Subscribe
13. 13
Subscription Indexing: Modified opIndex
Based on inverted-lists
Posting lists
Two-level partitioning
Attribute posting list
Operator posting list
Locate satisfying subscription tuples
Relevancy score
By satisfying relations
By satisfying subscription tuples
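The two-level partitioning above can be illustrated with a minimal Python sketch. It is not the modified opIndex itself: the class name `SubscriptionIndex`, the per-predicate weights, and the rule that a relevancy score is the sum of the weights of satisfied predicates are all assumptions made for illustration.

```python
from collections import defaultdict
import operator

# Supported relational operators (an assumed, deliberately small set).
OPS = {"=": operator.eq, "<=": operator.le, ">=": operator.ge}

class SubscriptionIndex:
    """Hypothetical two-level inverted index:
    attribute -> operator -> posting list of (value, subscription id, weight)."""

    def __init__(self):
        self.lists = defaultdict(lambda: defaultdict(list))

    def add(self, sub_id, attribute, op, value, weight):
        self.lists[attribute][op].append((value, sub_id, weight))

    def match(self, publication):
        """Return {subscription id: summed weight of satisfied predicates}.
        Summing weights is an assumption, not the thesis' scoring rule."""
        scores = defaultdict(float)
        for attribute, pub_value in publication.items():
            # First level: attribute posting list; second level: operator posting list.
            for op, postings in self.lists.get(attribute, {}).items():
                for value, sub_id, weight in postings:
                    if OPS[op](pub_value, value):
                        scores[sub_id] += weight
        return dict(scores)

index = SubscriptionIndex()
index.add("s1", "Carrier", "=", "Verizon", 2.0)
index.add("s1", "Storage", "<=", 32, 1.0)
index.add("s2", "Carrier", "=", "AT&T", 2.5)
index.add("s2", "Brand", "=", "HTC", 1.75)

publication = {"Carrier": "Verizon", "Color": "White", "OS": "Android",
               "Storage": 16, "Brand": "HTC"}
scores = index.match(publication)
```

Subscription `s2` is only partially satisfied yet still receives a score, which is the partial matching that Boolean publish/subscribe lacks.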
14. 14
Freshness
When the window becomes larger,
older publications may prevent newer publications
from entering the Top-k results
Lease relevancy scores?
But the scores would then have to be re-calculated
Forward decay!
Fresh-relevancy score = relevancy score × Freshness score
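A minimal sketch of why forward decay avoids re-calculation, assuming an exponential decay function measured forward from a fixed landmark time. The constants `LANDMARK` and `DECAY_RATE` and both function names are hypothetical; the slides do not state which decay function was used.

```python
import math

LANDMARK = 0.0     # assumed landmark: a time no later than any arrival
DECAY_RATE = 0.1   # assumed exponential decay rate

def static_fresh_relevancy(relevancy, arrival_time):
    """Computed once at arrival; never needs to be updated afterwards."""
    return relevancy * math.exp(DECAY_RATE * (arrival_time - LANDMARK))

def fresh_relevancy(relevancy, arrival_time, now):
    """Decayed score at time `now`. The divisor is identical for every
    publication, so ranking by the static score gives the same order."""
    return static_fresh_relevancy(relevancy, arrival_time) / math.exp(
        DECAY_RATE * (now - LANDMARK))

older = static_fresh_relevancy(0.9, arrival_time=0.0)
newer = static_fresh_relevancy(0.6, arrival_time=10.0)
```

Under these assumptions the newer publication outranks the older one despite its lower relevancy, and neither stored score changes as time advances.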
15. 15
Diversity: Top-k representative set
Representative Top-k
Drawback (without diversity)
What we want (with diversity)
Method to retrieve Top-k publications from matching publications
16. 16
MAX* k-diversity problem
Find:
$$S^{*} = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$$
where
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function
17. 17
Proposed: MAXDIVREL k-diversity problem
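$$S^{*} = \operatorname*{argmax}_{S \subseteq P} f(S, d, r) = \operatorname*{argmax}_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$$
g(S, d, r): maximize the dis-similarity & relevancy in S
h(S, d, r): minimize the dis-similarity & relevancy in P − S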
where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. f: a diversity function
18. 18
Formal Definition: MAXDIVREL k-diversity
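$$g(S, d, r) = \operatorname{argmax} \frac{1}{|S|} \sum_{p_i, p_j \in S} \frac{r(p_i)}{r(p_j)}\, d(p_i, p_j), \quad \text{independence holds}$$
$$h(S, d, r) = \operatorname{argmin} \frac{1}{|P - S|} \sum_{p_i \in S,\ p_j \in P - S} \frac{r(p_i)}{r(p_j)}\, d(p_i, p_j), \quad \text{dominance holds}$$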
where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. α > 0
Independence condition:
$$\forall p_i, p_j \in S:\ d(p_i, p_j) > \alpha$$
Dominance condition:
$$\forall p_i \in P,\ \exists p_j \in S \ \text{s.t.}\ d(p_i, p_j) \le \alpha;\ i \ne j$$
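The two conditions can be checked directly on a small example. The sketch below is an illustration under stated assumptions, not the thesis' algorithm: publications are modelled as feature sets, d is taken to be Jaccard distance, dominance is checked only for publications outside S, and `greedy_representatives` is a hypothetical relevance-ordered greedy pass.

```python
def jaccard_distance(a, b):
    """Assumed distance metric d: 1 - |a ∩ b| / |a ∪ b|."""
    return 1.0 - len(a & b) / len(a | b)

def is_independent(selected, items, alpha):
    """Every pair of selected publications is farther apart than alpha."""
    return all(jaccard_distance(items[p], items[q]) > alpha
               for i, p in enumerate(selected) for q in selected[i + 1:])

def is_dominating(selected, items, alpha):
    """Every unselected publication lies within alpha of some selected one."""
    return all(any(jaccard_distance(items[p], items[q]) <= alpha for q in selected)
               for p in items if p not in selected)

def greedy_representatives(items, relevance, alpha):
    """Visit publications by descending relevance; keep one only if it is
    independent of everything kept so far (hypothetical heuristic)."""
    selected = []
    for p in sorted(items, key=relevance.get, reverse=True):
        if all(jaccard_distance(items[p], items[q]) > alpha for q in selected):
            selected.append(p)
    return selected

items = {
    "p1": {"HTC", "Verizon", "16GB"},
    "p2": {"HTC", "Verizon", "32GB"},
    "p3": {"Apple", "AT&T", "16GB"},
    "p4": {"Apple", "AT&T", "64GB"},
}
relevance = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.4}
representatives = greedy_representatives(items, relevance, alpha=0.6)
```

A greedy pass of this kind always yields a maximal independent set, which is also dominating; it makes no claim to optimize the MAXDIVREL objective.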
21. 21
Handling streaming publications
[Figure: publications p1 to p5 shown as vertices v1 to v5 with neighborhood radius α, and the same graph after p6 arrives as vertex v6]
Continuity Requirements
1. Durability
An item selected as diverse in the i-th window may remain selected in the (i+1)-th window
if it has not expired and the other valid items in the (i+1)-th
window fail to compete with it.
2. Order
The publication stream follows chronological order
We avoid selecting an item j as diverse later when we have already selected an item i that is
not older than j.
23. 23
Locality Sensitive Hashing (LSH)
Simple Idea
if two points are close together, then after a “projection” operation these two
points will remain close together
24. 24
LSH Analysis
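For any given points $p, q \in \mathbb{R}^d$:
$$P_H[h(p) = h(q)] \ge P_1 \quad \text{for } \lVert p - q \rVert \le d_1$$
$$P_H[h(p) = h(q)] \le P_2 \quad \text{for } \lVert p - q \rVert \ge c\,d_1 = d_2$$
• Hash function h is $(d_1, d_2, P_1, P_2)$-sensitive
• Ideally we need
• $(P_1 - P_2)$ to be large
• $(d_2 - d_1)$ to be small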
25. 25
LSH in MAXDIVREL:
Publications as categorical data
26. 26
LSH in MAXDIVREL:
Characteristic Matrix
27. 27
LSH in MAXDIVREL:
Minhashing
Publications are no longer compared directly;
a signature represents each one
Technique
Randomly permute the rows of the
characteristic matrix m times
For each permutation, take the number of the first row,
in the permuted order,
in which the publication's column has a 1
[Figure: first permutation of the rows of the characteristic matrix]
Advantage:
Reduce the dimensions into a small
minhash signature
28. 28
LSH in MAXDIVREL:
Signature Matrix
Fast-minhashing
Select m random hash
functions
to model the effect of m
random permutations
This is mathematically guaranteed only when
the number of rows is a prime.
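Fast minhashing over categorical publications can be sketched as follows. The hash family (a·row + b) mod prime, the helper names, and the choice to round the row count up to the next prime are assumptions made for illustration; the slides only state that m hash functions stand in for m permutations.

```python
import random

def build_rows(publications):
    """Assign a row number to every distinct attribute=value feature
    (the rows of the characteristic matrix)."""
    features = sorted({f for feats in publications.values() for f in feats})
    return {feature: row for row, feature in enumerate(features)}

def next_prime(n):
    """Smallest prime >= n (trial division; adequate for small n)."""
    candidate = max(n, 2)
    while any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
        candidate += 1
    return candidate

def make_hash_functions(m, prime, seed=0):
    """m random (a, b) pairs with a != 0, so each function permutes 0..prime-1."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(m)]

def minhash_signature(features, rows, hash_functions, prime):
    """For each hash function, keep the smallest hashed row that holds a 1."""
    own_rows = [rows[f] for f in features]
    return [min((a * r + b) % prime for r in own_rows) for a, b in hash_functions]

def estimated_similarity(sig_x, sig_y):
    """Fraction of agreeing positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_x, sig_y)) / len(sig_x)

publications = {
    "A": {"Brand=HTC", "Carrier=Verizon", "OS=Android"},
    "B": {"Brand=HTC", "Carrier=Verizon", "OS=Android"},
    "C": {"Brand=Apple", "Carrier=AT&T", "OS=iOS"},
}
rows = build_rows(publications)
prime = next_prime(len(rows))
hash_functions = make_hash_functions(m=50, prime=prime)
signatures = {name: minhash_signature(feats, rows, hash_functions, prime)
              for name, feats in publications.items()}
```

Identical publications receive identical signatures, and publications with no shared feature never agree in any position, because each hash function is one-to-one on the rows.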
29. 29
LSH in MAXDIVREL:
LSH Buckets
Take r-sized signature vectors
from the m-sized minhash signature
Map them into L hash tables,
each with an arbitrary number b of buckets
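The bucketing step can be sketched in Python as below. Taking consecutive, non-overlapping slices of the signature and hashing each slice with Python's built-in `hash` modulo b are assumptions; the slides do not say how the r-sized vectors are chosen or hashed.

```python
from collections import defaultdict

def build_lsh_tables(signatures, r, L, b=1024):
    """Split each signature into L slices of r values and place the
    publication in one of b buckets per table (hypothetical layout)."""
    tables = [defaultdict(list) for _ in range(L)]
    for name, signature in signatures.items():
        if len(signature) < r * L:
            raise ValueError("signature must hold at least r * L values")
        for t in range(L):
            band = tuple(signature[t * r:(t + 1) * r])
            tables[t][hash(band) % b].append(name)
    return tables

def candidate_neighbours(tables, name):
    """Publications sharing a bucket with `name` in at least one table."""
    found = set()
    for table in tables:
        for members in table.values():
            if name in members:
                found.update(members)
    found.discard(name)
    return found

signatures = {
    "A": [1, 4, 2, 7, 3, 5],
    "B": [1, 4, 2, 9, 8, 6],   # agrees with A on the first slice only
    "C": [1, 4, 2, 7, 3, 5],   # identical to A
}
tables = build_lsh_tables(signatures, r=3, L=2)
```

Because buckets are taken modulo b, unrelated slices can occasionally collide; a larger b makes this rarer at the cost of memory.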
31. 31
LSH in MAXDIVREL:
Analysis
For two vectors x, y:
$$JD(x, y) = 1 - JSIM(x, y), \quad \text{where } JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
For publications x & y:
$$JSIM(x, y) \propto \mathrm{Prob}[H(x) = H(y)]$$
At a particular hash table
x & y map into the same bucket: $JSIM(x, y)^{b}$
x & y do not map into the same bucket: $1 - JSIM(x, y)^{b}$
At L hash tables
x & y do not map into the same bucket in any table: $(1 - JSIM(x, y)^{b})^{L}$
x & y map into the same bucket in at least one table: $1 - (1 - JSIM(x, y)^{b})^{L}$
True near neighbors are
unlikely to be unlucky
in all the projections
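The probabilities above are easy to evaluate numerically. This short Python sketch keeps the slide's symbols b and L; the function names and the sample values 5 and 20 are illustrative assumptions.

```python
def same_bucket_one_table(jsim, b):
    """Probability that x and y share a bucket in one hash table."""
    return jsim ** b

def same_bucket_any_table(jsim, b, L):
    """Probability that x and y share a bucket in at least one of L tables."""
    return 1.0 - (1.0 - jsim ** b) ** L

# Sample evaluation with assumed parameters b = 5, L = 20.
near = same_bucket_any_table(0.8, b=5, L=20)   # similar publications
far = same_bucket_any_table(0.2, b=5, L=20)    # dissimilar publications
```

With these sample parameters a pair at similarity 0.8 is almost certain to meet in some table while a pair at 0.2 rarely does, which is the behavior the last line of the slide describes.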
32. 32
LSH in MAXDIVREL:
Batch-wise Top-k computation
Bucket "winner": the publication with the
highest relevancy score in its bucket
The winner is dominant and represents its bucket
neighborhood
Top-k "winners" are those that have a majority of votes
The k winners are independent
[Figure: publication stream P_A, P_B, ..., P_H with the i-th window marked]
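One possible reading of the batch-wise voting is sketched below: every bucket in every hash table casts one vote for its most relevant member, and the k publications with the most votes are returned. The function name, the table layout, and the one-vote-per-bucket rule are assumptions, not the thesis' exact procedure.

```python
from collections import Counter

def batch_top_k(tables, relevancy, k):
    """tables: list of {bucket id: [publication names]} for one window."""
    votes = Counter()
    for table in tables:
        for members in table.values():
            winner = max(members, key=relevancy.get)  # bucket "winner"
            votes[winner] += 1
    return [name for name, _ in votes.most_common(k)]

relevancy = {"A": 0.9, "B": 0.5, "C": 0.7, "D": 0.6, "E": 0.3}
tables = [
    {"b1": ["A", "B"], "b2": ["C"], "b3": ["D", "E"]},
    {"b1": ["A", "B", "C"], "b2": ["D"], "b3": ["E"]},
]
top_two = batch_top_k(tables, relevancy, k=2)
```

Here A and D each win two buckets, so they form the Top-2, while C and E win one bucket each.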
33. 33
LSH in MAXDIVREL:
Incremental Top-k computation
1. New publication i arrives: update the i-th characteristic vector in the characteristic matrix
2. Generate the i-th minhash signature in the signature matrix
3. Map the i-th signature into the L hash tables
4. Update the "winner" at each bucket the i-th signature maps into
5. Vote for the Top-k candidates
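The incremental steps can be sketched as a small class that touches only the L buckets a new signature maps into. The class `IncrementalTopK`, its vote bookkeeping, and the use of raw signature slices as bucket keys are assumptions; window expiry and the continuity requirements are deliberately left out.

```python
from collections import Counter

class IncrementalTopK:
    """Hypothetical incremental index: one current winner per bucket."""

    def __init__(self, r, L):
        self.r, self.L = r, L
        # One dict per hash table: bucket key -> (relevancy, publication name)
        self.winners = [dict() for _ in range(L)]
        self.votes = Counter()

    def insert(self, name, signature, relevancy):
        for t in range(self.L):
            key = tuple(signature[t * self.r:(t + 1) * self.r])
            current = self.winners[t].get(key)
            # Strict comparison: an incumbent keeps its bucket on a tie.
            if current is None or relevancy > current[0]:
                if current is not None:
                    self.votes[current[1]] -= 1
                self.winners[t][key] = (relevancy, name)
                self.votes[name] += 1

    def top_k(self, k):
        return [name for name, count in self.votes.most_common(k) if count > 0]

index = IncrementalTopK(r=1, L=2)
index.insert("A", [1, 5], relevancy=0.5)
index.insert("B", [1, 7], relevancy=0.9)   # displaces A in table 0
index.insert("C", [2, 5], relevancy=0.4)   # loses to A in table 1
```

Each insertion costs L bucket look-ups regardless of how many publications are already indexed, which is where the saving over batch re-computation would come from.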
34. 34
LSH in MAXDIVREL:
When new publication F arrives…
Only buckets B13, B23, B32, B43
will vote
Follow continuity requirements
Durability
Order
[Figure: publication stream P_A, P_B, ..., P_H with the i-th and (i+1)-th windows marked]
39. 39
Subscriber Effectiveness:
Quality or natural behavior
Testing zipf or power law hypothesis on
distribution of ranked results (KS Test)
i. Fitting power law
ii. Goodness of fit tests
iii. Alternative distributions
Compute 19030 ranked distributions
over a stream of 100K publications
Under different subscriber views
Under different sized sliding window
instances
[Figure: sample distribution of ranked votes, log zipf_prob(rank) against log(rank)]
$$\mathrm{zipf}(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}$$
N - number of elements in distribution,
k - rank of element
s - value of exponent
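The Zipf probability can be evaluated directly from the definitions of N, k and s. In this Python sketch the function names are hypothetical; `log_log_slope` only illustrates why a Zipf distribution appears as a straight line of slope −s on log-log axes.

```python
import math

def zipf_probability(k, s, n):
    """Probability of rank k under Zipf's law with exponent s over n elements."""
    normaliser = sum(1.0 / i ** s for i in range(1, n + 1))
    return (1.0 / k ** s) / normaliser

def log_log_slope(s, n):
    """Slope of log(probability) against log(rank) between ranks 1 and n."""
    rise = math.log(zipf_probability(n, s, n)) - math.log(zipf_probability(1, s, n))
    return rise / (math.log(n) - math.log(1))

total = sum(zipf_probability(k, 1.2, 100) for k in range(1, 101))
```

The probabilities over all ranks sum to one, and the ratio between ranks 1 and 2 is 2 raised to the power s.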
40. 40
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values
41. 41
Subscriber Effectiveness:
i. Fitting power law
Illustration of Zipf exponent values convergence
42. 42
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values under different similarity threshold
43. 43
Subscriber Effectiveness:
ii. Goodness of fit tests
$$\gamma_1 = \max_{x \ge x_{min}} |f(x) - g(x)|$$
f(x): observed rank CDF
g(x): perfectly fitted CDF
$$p\text{-value} = \frac{\text{number of } \gamma_i \text{ where } \gamma_i > \gamma_1}{i}$$
i = 1000 synthetic zipf datasets
P-values of KS test under different subscriber views
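A simplified Python sketch of this goodness-of-fit test follows. It assumes the exponent s is already fitted, takes x_min = 1, and compares every synthetic dataset against the same model CDF without refitting; the function names and these simplifications are assumptions, not the evaluation code behind the reported p-values.

```python
import random
from itertools import accumulate

def zipf_pmf(s, n):
    weights = [1.0 / k ** s for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def empirical_cdf(ranks, n):
    """CDF of observed ranks (each rank is an integer in 1..n)."""
    counts = [0] * n
    for rank in ranks:
        counts[rank - 1] += 1
    return list(accumulate(c / len(ranks) for c in counts))

def ks_statistic(f, g):
    """Largest absolute gap between two CDFs on the same support."""
    return max(abs(a - b) for a, b in zip(f, g))

def ks_p_value(observed_ranks, s, n, trials=1000, seed=0):
    rng = random.Random(seed)
    model = zipf_pmf(s, n)
    model_cdf = list(accumulate(model))
    gamma_1 = ks_statistic(empirical_cdf(observed_ranks, n), model_cdf)
    exceed = 0
    for _ in range(trials):
        # Synthetic Zipf dataset of the same size as the observation.
        synthetic = rng.choices(range(1, n + 1), weights=model, k=len(observed_ranks))
        if ks_statistic(empirical_cdf(synthetic, n), model_cdf) > gamma_1:
            exceed += 1
    return exceed / trials

clearly_not_zipf = [5] * 200   # every observation sits at the last rank
p_bad = ks_p_value(clearly_not_zipf, s=1.0, n=5, trials=200)
```

A p-value near zero means almost no synthetic Zipf dataset deviates as much as the observation, so the Zipf hypothesis would be rejected for that sample.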
44. 44
Subscriber Effectiveness:
iii. Testing alternative distributions
45. 45
Subscriber Effectiveness:
Other diversity based methods
P-dispersion problem: MAXMIN, MAXSUM
Minimum independent-dominating set problem: MAXDIVREL, DisC
For a fair comparison,
relevancy is combined into every diversity method
to achieve a bi-criteria objective
Average zipf law exponent in a comparison with other methods
46. 46
Subscriber Effectiveness:
Other diversity based methods
P-dispersion problem: MAXMIN, MAXSUM
Minimum independent-dominating set problem: MAXDIVREL, DisC
A comparison of average zipf law exponent with other methods
47. 47
Subscriber Effectiveness:
Accuracy of Top-k results
LSH Index vs. NAÏVE
Rank probability
Diversity probability
Accuracy on similarity threshold = 0.55
Accuracy on similarity threshold = 0.85
49. 49
Performance
Subscription index update time
Index construction time on opIndex vs. modified opIndex
50. 50
Efficiency:
Initial matching time at modified opIndex
Initial matching time under different sizes of subscription space
Initial matching time under different numbers of publications
51. 51
Performance & Efficiency:
LSH Index
BLSH index construction + update time on different number of minhash functions
Number of minhash functions: $m = \frac{1}{(\text{estimated error})^{2}}$
How much accuracy
do we sacrifice by
comparing small
minhash signatures?
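The trade-off can be made concrete by evaluating the rule for a few error targets. In this Python sketch the function name is hypothetical, and rounding the result up to a whole number of hash functions is an assumption.

```python
import math

def minhash_functions_needed(estimated_error):
    """m = 1 / error^2, rounded up to a whole number of hash functions."""
    if not 0 < estimated_error <= 1:
        raise ValueError("estimated_error must lie in (0, 1]")
    # round() first to absorb floating-point noise such as 99.99999999999999
    return math.ceil(round(1.0 / estimated_error ** 2, 9))

sizes = {error: minhash_functions_needed(error) for error in (0.25, 0.1, 0.05)}
```

Halving the tolerated error quadruples the signature length, which is why index construction and update time grow quickly as accuracy demands tighten.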
52. 52
Performance & Efficiency
ILSH vs. BLSH vs. NAÏVE
[Figure: publication stream P1, P2, ..., P8; BLSH or NAIVE is applied at each window, ILSH across the stream]
53. 53
Performance & Efficiency:
BLSH vs. NAÏVE
log (Ranking time) on number of publications with D=250
log (Ranking time) on number of publications with D=500
54. 54
Performance & Efficiency:
ILSH vs. BLSH vs. NAÏVE
log (Ranking time) on number of publications with D=250
log (Ranking time) on number of publications with D=500
55. 55
Conclusions
Diversified results produced by MAXDIVREL, based on the independent-dominating
set problem,
exhibit strong natural behavior compared with
methods based on the p-dispersion problem
Relevancy is an important factor to employ
in distance-based diversity methods
It always tends to produce a diverse set of personalized
results
Absolute ranks are sensitive to the preference value,
while the deviation among relative ranks stays small
56. 56
Conclusions (Ctd.)
Locality Sensitive Hashing (LSH) indexing method
Produces the MAXDIVREL diverse set of results at 70% average accuracy
relative to the naïve method
Reduces the matching time significantly compared with the NAÏVE method
Further refined by its incremental version
for handling streaming publications
Avoids the curse of re-computing neighborhoods
No fixed k is needed to restrict the delivery of top publications
Given a window size & delivery method,
the model can produce the best diverse set of personalized results
to represent the set of all matching publications at a given instance
57. 57
Major Contributions
Dynamic diversification method based on independent-dominating set
problem
We introduced a novel diversity definition based on representative
neighborhoods, called MAXDIVREL k-diversity, that employs relevancy.
Index-based diversification approach to rank results incrementally
We proposed a novel hashing-based index approach to solve the
MAXDIVREL continuous k-diversity problem, based on the Locality Sensitive
Hashing (LSH) technique
Advanced evaluation method to measure the quality of diverse results
First significant attempt to model the natural behavior of diversity methods in
the pub/sub community
58. 58
Future work
Explore other suitable use-cases to apply proposed model & develop
prototype applications, E.g.
Personalized newspaper for every Facebook user
Diverse set of personalized Twitter trends
Social annotation of news-stories
Exploit overlap among the diversified results of users who have similar interests
Employ existing implicit methods to extract human preferences
E.g. click stream analytics
Develop the LSH-based index for a multi-threaded, distributed environment