[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

Cloud based publish/subscribe model for
Top-k matching over continuous data
streams
Author:
Y.S. Horawalavithana
10002103
Supervisor:
Dr. D.N. Ranasinghe
Undergraduate Thesis Defense
January 23, 2015
UNIVERSITY OF COLOMBO SCHOOL OF COMPUTING
SCS 4001: INDIVIDUAL PROJECT
1

2
Overview
• Motivation
• Target
• Design & Architecture
• Related work
• Dynamic Diversification
• Incremental Top-k
• Implementation
• Evaluation
• Conclusion
• Future work

3
Motivation – “Big Filter”

4
Boolean publish/subscribe
Drawbacks
• A subscriber may be either overloaded with publications or receive too few of them.
• Matching publications cannot be compared with one another, because no ranking function is defined.
• Partial matching between subscriptions and publications is not supported.

5
Top-k publish/subscribe
 Expressive stateful query processing systems
 User defined parameter k restricts the
delivered publications
 Pub/Sub Matching
 Top-k pub/sub scoring or ranking
 Pub/Sub Indexing
 Indexing to support personalized subscriptions
 Indexing to support continuous Top-k
publications retrieval

6
Target
1. How can we define an efficient scoring algorithm that integrates both query-independent and query-dependent score metrics?
- Relevance, freshness and diversity
2. How can the indexing data structures used in state-of-the-art publish/subscribe systems be adapted to support Top-k matching queries under
a) a large subscription volume,
b) a high event rate, and
c) a variety of subscribable attributes?

7
Scope
• Optimize the Top-k heuristic for a specific domain
  • E-commerce with buyers and sellers
• Subscriptions and publications follow a pre-defined data structure
• The number of incoming publications follows a Poisson distribution
• Top-k publications are retrieved against subscriptions, not the reverse

8
Design & Architecture
[Figure: system architecture. Components shown: Publication Store and Subscription Store (each with expiry), Subscription Indexing over personalized subscriptions, Publication Stream, Relevance Matching, Matching Publication Store (publications with relevance scores), Publication Indexing over a sliding window, continuous Top-k diversity (relevancy and dissimilarity), Event Delivery, and a Top-k Notification Store that emits notifications.]

9
Related work: General Top-k publish/subscribe
Pub/sub model | Subscription | Timing policy | Scoring metric (incl. diversity) | Indexing (subscription; incremental publication) | Architecture
PrefSIENA (Drosou, ACM DEBS 2009) | Preferential subscription | Sliding window | Relevancy + MAXMIN diversity | Subscription covering | Centralized message brokers
RRPS (Lu, ICCSA 2009) | Normal | Continuous | QoS | | Centralized
DaZaLaPs (Pripužić, IS 2012) | Normal | Sliding window | Relevancy | Grid based | P2P
Top-k pub/sub (Shraer [Google], VLDB 2014) | Normal | Continuous | Relevancy + Freshness | Tree based; TAAT & DAAT | Centralized
Our model | Personalized subscription space | Sliding window | MAXDIVREL diversity | Inverted-list based; hashing based | Cloud based

10
Sliding window Top-k computation
[Figure: three successive sliding windows over the matching publication stream P1 ... P10. The Top-2 results of the three windows are (P5, P1), (P5, P6) and (P5, P9). Publications are marked as expired, active or Top-k; jumping steps h = 1 and h = 3 are shown.]
• Jumping step (h)
• Top-k notification delivery
  • On-demand
  • Pro-active
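The windowing on this slide can be sketched in a few lines of Python. This is a minimal sketch that assumes a count-based window (the deck does not state whether windows are count-based or time-based); the names `jumping_windows` and `top_k` and all scores are hypothetical.

```python
from typing import Dict, Iterator, List, Sequence


def jumping_windows(stream: Sequence[str], size: int, h: int) -> Iterator[List[str]]:
    """Yield windows of `size` publications, advancing by the jumping step `h`."""
    for start in range(0, len(stream) - size + 1, h):
        yield list(stream[start:start + size])


def top_k(window: List[str], score: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring publications that are active in the window."""
    return sorted(window, key=lambda p: score[p], reverse=True)[:k]


if __name__ == "__main__":
    stream = [f"P{i}" for i in range(1, 11)]
    # Hypothetical relevancy scores, invented for this example only.
    values = [0.6, 0.2, 0.1, 0.3, 0.9, 0.7, 0.4, 0.5, 0.8, 0.2]
    score = dict(zip(stream, values))
    for window in jumping_windows(stream, size=5, h=3):
        print(window, "->", top_k(window, score, k=2))
```

A larger jumping step h means fewer windows to evaluate but more publications entering and expiring between two consecutive Top-k results.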

11
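Relevancy: Personalized Subscription space
[Figure: personalized subscription space, labelled "Subscribe". Predicates with preference weights: Carrier = AT&T (0.4), Brand = HTC (0.3), Storage ≤ 16GB (0.7), Carrier = Verizon (0.5), Storage ≤ 32GB (0.2), Storage ≤ 32GB (0.6), Brand = HTC (0.3). Scores shown in the figure: 1.75, 1.3, 2.3, 2.52.]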

12
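Relevancy: Personalized Subscription space
[Figure: subscription space nodes (Carrier = Verizon, Storage ≤ 32GB), (Carrier = AT&T, Storage ≤ 16GB) and (Brand = HTC), with the values 2, 2.5, 1.75, 1.3 and 2.3, alongside the attribute set Carrier = Verizon, Color = White, OS = Android, Storage = 16GB, Brand = HTC; labelled "Subscribe".]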

13
Subscription Indexing: Modified opIndex
• Based on inverted lists
  • Posting lists
• Two-level partitioning
  • Attribute posting list
  • Operator posting list
• Locate satisfying subscription tuples
• Relevancy score
  • By satisfying relations
  • By satisfying subscription tuples

14
Freshness
• As the window becomes larger,
  • older publications may prevent newer publications from entering the Top-k results
• Lease relevancy scores?
  • But the scores would then have to be re-calculated
• Forward decay!
  • Fresh-relevancy score = relevancy score × freshness score

15
Diversity: Top-k representative set
Method to retrieve Top-k publications from matching publications
[Figure: representative Top-k. Drawback (without diversity) vs. what we want (with diversity).]

16
MAX* k-diversity problem
Find:
$$S^* = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$$
where
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function

17
Proposed: MAXDIVREL k-diversity problem
$$S^* = \operatorname*{argmax}_{S \subseteq P} f(S, d, r) = \operatorname*{argmax}_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$$
g(S, d, r): maximize the dissimilarity and relevancy in S
h(S, d, r): minimize the dissimilarity and relevancy in P − S
where
1. P = {p1, …, pn}
3. r: a relevance metric
4. f: a diversity function

18
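Formal Definition: MAXDIVREL k-diversity
$$g(S, d, r) = \operatorname{argmax} \frac{1}{|S|} \sum_{p_i, p_j \in S} \frac{r(p_j)}{r(p_i)}\, d(p_i, p_j), \quad \text{independence holds}$$
$$h(S, d, r) = \operatorname{argmin} \frac{1}{|P - S|} \sum_{p_i \in S,\ p_j \in P - S} \frac{r(p_j)}{r(p_i)}\, d(p_i, p_j), \quad \text{dominance holds}$$
where
1. P = {p1, …, pn}
3. r: a relevance metric
4. $\alpha > 0$
Independence condition:
$\forall p_i, p_j \in S,\ i \ne j:\ d(p_i, p_j) > \alpha$
Dominance condition:
$\forall p_i \in P - S,\ \exists p_j \in S \text{ s.t. } d(p_i, p_j) \le \alpha$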

19
NP-Hardness:
Minimum independent-dominating set
$$\mathrm{Neighborhood}(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$$
[Figure: publications p1 ... p5 in the publication space, each with neighborhood radius α, and the corresponding graph model with vertices v1 ... v5. Four example vertex subsets are labelled: three are independent and dominating; one is dominating but not independent.]
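The two graph properties named on this slide can be checked mechanically. The sketch below is illustrative only: the function names, the one-dimensional positions and the value of α are assumptions made for the example, not values from the thesis.

```python
from itertools import combinations


def neighborhoods(points, dist, alpha):
    """Map each item to the set of other items within distance alpha of it."""
    return {p: {q for q in points if q != p and dist(p, q) <= alpha} for p in points}


def is_independent(subset, nbrs):
    """True if no two selected items are neighbors of each other."""
    return all(q not in nbrs[p] for p, q in combinations(sorted(subset), 2))


def is_dominating(subset, nbrs):
    """True if every item is selected or has at least one selected neighbor."""
    chosen = set(subset)
    return all(p in chosen or bool(nbrs[p] & chosen) for p in nbrs)


if __name__ == "__main__":
    # Hypothetical 1-D positions; with alpha = 1 the graph is the path v1-v2-v3-v4-v5.
    position = {"v1": 0.0, "v2": 1.0, "v3": 2.0, "v4": 3.0, "v5": 4.0}
    nbrs = neighborhoods(position, lambda a, b: abs(position[a] - position[b]), alpha=1.0)
    for subset in (["v1", "v3", "v5"], ["v2", "v4"], ["v2", "v3", "v4"]):
        print(subset, is_independent(subset, nbrs), is_dominating(subset, nbrs))
```

A subset that passes both checks covers every publication with a nearby representative while keeping the representatives themselves more than α apart.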

20
NAÏVE Greedy
$$\operatorname*{argmax}_{p_i} \frac{r(p_i)^2}{\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)}$$
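One way to turn the greedy score above into a selection loop is sketched below. The slide gives only the scoring expression, so the loop itself is an assumption: after each pick, the chosen item and its neighbors are removed (which keeps the result independent), neighborhoods are evaluated over the items not yet covered, and an item with no remaining neighbor gets an infinite score. All names and data are hypothetical.

```python
def greedy_maxdivrel(points, r, dist, alpha, k):
    """Pick up to k items, each time taking the best r(p)^2 / sum(r(q) * d(p, q))."""
    remaining = set(points)
    selected = []
    while remaining and len(selected) < k:

        def score(p):
            # Sum over the neighbors of p that are still uncovered.
            denom = sum(r[q] * dist(p, q) for q in remaining
                        if q != p and dist(p, q) <= alpha)
            return float("inf") if denom == 0 else r[p] ** 2 / denom

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        # Drop the winner and everything inside its neighborhood.
        remaining -= {q for q in remaining if dist(best, q) <= alpha}
    return selected


if __name__ == "__main__":
    # Hypothetical 1-D positions and relevancy scores.
    position = {"v1": 0.0, "v2": 1.0, "v3": 2.0, "v4": 3.0, "v5": 4.0}
    relevancy = {"v1": 0.2, "v2": 0.9, "v3": 0.3, "v4": 0.8, "v5": 0.1}
    d = lambda a, b: abs(position[a] - position[b])
    print(greedy_maxdivrel(position, relevancy, d, alpha=1.0, k=2))
```

Every round recomputes neighborhood sums from scratch, which is the cost the incremental index later in the deck is meant to avoid.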

21
Handling streaming publications
[Figure: publication space p1 ... p5 with neighborhood radius α and its graph model v1 ... v5; a newly arrived publication p6 adds vertex v6 to the graph.]
Continuity Requirements
1. Durability
An item selected as diverse in the i-th window may remain selected in the (i+1)-th window, provided that it has not expired and the other valid items in the (i+1)-th window fail to out-compete it.
2. Order
The publication stream follows chronological order. An item j is not selected as diverse later if an item i that is not older than j has already been selected.

22
MAXDIVREL continuous k-diversity
[Figure: the i-th and (i+1)-th windows over the matching publication stream P1, P2, ..., Pj, Pj+1, ... MAXDIVREL k-diversity is solved in each window, giving $S_i^*$ and $S_{i+1}^*$. Each solution satisfies independence and dominance; consecutive solutions satisfy durability and order.]
• Straightforward solution:
  • Apply the naïve greedy method at each instance
• Proposed: an incremental index mechanism
  • Avoids the cost of re-calculating neighborhoods

23
Locality Sensitive Hashing (LSH)
 Simple Idea
 if two points are close together, then after a “projection” operation these two
points will remain close together

24
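LSH Analysis
• For any given points $p, q \in \mathbb{R}^d$:
$$P_H[h(p) = h(q)] \ge P_1 \quad \text{for } \lVert p - q \rVert \le d_1$$
$$P_H[h(p) = h(q)] \le P_2 \quad \text{for } \lVert p - q \rVert \ge c\,d_1 = d_2$$
• The hash function h is then $(d_1, d_2, P_1, P_2)$-sensitive.
• Ideally we need
  • $(P_1 - P_2)$ to be large
  • $(d_2 - d_1)$ to be small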

25
LSH in MAXDIVREL:
Publications as categorical data

26
LSH in MAXDIVREL:
Characteristic Matrix

27
LSH in MAXDIVREL:
Minhashing
• Publications are no longer compared directly
  • A signature represents each publication
• Technique
  • Randomly permute the rows of the characteristic matrix m times
  • For each permutation and each publication column, take the number of the first row, in the permuted order, in which the column has a 1
• Advantage:
  • Reduces the dimensions to a small minhash signature
[Figure: first permutation of the rows of the characteristic matrix]
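The permutation-based technique can be sketched directly. This is an illustrative sketch, not the thesis implementation: publications are assumed to be sets of hypothetical `attribute=value` terms, and the function names are invented for the example.

```python
import random


def characteristic_matrix(publications):
    """Rows are the distinct terms, columns are publications (1 = term present)."""
    rows = sorted({term for terms in publications.values() for term in terms})
    columns = {name: [1 if t in terms else 0 for t in rows]
               for name, terms in publications.items()}
    return rows, columns


def minhash_signatures(columns, n_rows, m, seed=0):
    """Build an m-value signature per column from m random row permutations."""
    rng = random.Random(seed)
    signatures = {name: [] for name in columns}
    for _ in range(m):
        order = list(range(n_rows))
        rng.shuffle(order)  # one random permutation of the rows
        for name, col in columns.items():
            # Position, in permuted order, of the first row that holds a 1.
            # Assumes every publication has at least one term.
            signatures[name].append(
                next(pos for pos, row in enumerate(order) if col[row]))
    return signatures


if __name__ == "__main__":
    pubs = {
        "A": {"Brand=HTC", "OS=Android", "Carrier=Verizon"},
        "B": {"Brand=HTC", "OS=Android", "Carrier=AT&T"},
        "C": {"Color=White"},
    }
    rows, cols = characteristic_matrix(pubs)
    print(minhash_signatures(cols, len(rows), m=6))
```

Two signatures agree in a given position with probability equal to the Jaccard similarity of the underlying term sets, which is what makes the short signature a usable stand-in for the full column.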

28
LSH in MAXDIVREL:
Signature Matrix
Fast-minhashing
Select m random hash functions to model the effect of m random permutations.
This is mathematically proven only when the number of rows is prime.
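A common way to realise this is with linear hash functions of the form h(x) = (a·x + b) mod p, which map the row numbers 0 ... p−1 onto themselves without collisions when p is prime and a is non-zero. The sketch below assumes that form; the slide does not name the hash family, and the function names are hypothetical.

```python
import random


def make_hashes(m, p, seed=0):
    """Draw m (a, b) pairs for h(x) = (a * x + b) % p, with p prime and a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(m)]


def fast_minhash(column, hashes, p):
    """Signature of a 0/1 column: for each hash, the smallest hashed row index with a 1."""
    rows_with_one = [i for i, bit in enumerate(column) if bit]
    return [min((a * i + b) % p for i in rows_with_one) for a, b in hashes]


def estimate_jaccard(sig_x, sig_y):
    """Fraction of signature positions on which two signatures agree."""
    return sum(1 for u, v in zip(sig_x, sig_y) if u == v) / len(sig_x)


if __name__ == "__main__":
    p = 7  # number of rows, chosen prime for this example
    hashes = make_hashes(m=20, p=p)
    x = [1, 0, 0, 1, 0, 1, 0]
    y = [1, 0, 1, 1, 0, 0, 0]
    print(estimate_jaccard(fast_minhash(x, hashes, p), fast_minhash(y, hashes, p)))
```

No row permutation is ever materialised: each hash function is evaluated only on the rows where the column has a 1.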

29
LSH in MAXDIVREL:
LSH Buckets
• Take r-sized signature vectors
  • from the m-sized minhash signature
• Map them into
  • L hash tables
  • each with an arbitrary number b of buckets
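The bucket mapping can be sketched as follows, assuming the m-value signature is cut into L consecutive pieces of r values and piece i is hashed into table i. The use of Python's built-in `hash` modulo b is an assumption made for brevity, and `build_lsh_tables` is a hypothetical name.

```python
from collections import defaultdict


def build_lsh_tables(signatures, L, r, b):
    """Return L tables; table i maps a bucket id to the items whose i-th piece hashes there."""
    tables = [defaultdict(list) for _ in range(L)]
    for name, sig in signatures.items():
        if len(sig) != L * r:
            raise ValueError("signature length must equal L * r")
        for i in range(L):
            piece = tuple(sig[i * r:(i + 1) * r])  # r-sized signature vector
            tables[i][hash(piece) % b].append(name)
    return tables


if __name__ == "__main__":
    # Hypothetical 4-value signatures, split into L = 2 pieces of r = 2 values.
    signatures = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 3, 4]}
    for i, table in enumerate(build_lsh_tables(signatures, L=2, r=2, b=16)):
        print("table", i, dict(table))
```

Items that agree on all r values of a piece always land in the same bucket of that table; items that differ may still collide when b is small.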

30
LSH in MAXDIVREL:
How to select L, r?
For two vectors x, y:
$$JD(x, y) = 1 - JSIM(x, y), \quad \text{where } JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
1. $L \times r = m$
2. $\text{similarity threshold } (s) \approx \left(\frac{1}{L}\right)^{\frac{1}{r}}$
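The two rules on this slide are enough to pick L and r for a target similarity threshold. The sketch below enumerates the factorisations of m and keeps the one whose approximate threshold is closest to the target; m = 100 and the target value are hypothetical inputs, and the function names are invented for the example.

```python
def lsh_threshold(L, r):
    """Approximate similarity threshold s = (1 / L) ** (1 / r)."""
    return (1.0 / L) ** (1.0 / r)


def choose_tables(m, target):
    """Return the (L, r) with L * r == m whose threshold is closest to `target`."""
    options = [(L, m // L) for L in range(1, m + 1) if m % L == 0]
    return min(options, key=lambda pair: abs(lsh_threshold(*pair) - target))


if __name__ == "__main__":
    m = 100  # hypothetical signature length
    for L in (5, 10, 20, 25):
        print(L, m // L, round(lsh_threshold(L, m // L), 3))
    print(choose_tables(m, target=0.55))
```

More tables with shorter pieces lower the threshold, so less similar pairs start to share buckets; fewer tables with longer pieces raise it.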

31
LSH in MAXDIVREL:
Analysis
For two vectors x, y:
$$JD(x, y) = 1 - JSIM(x, y), \quad \text{where } JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
• For publications x and y: $JSIM(x, y) \propto \Pr[H(x) = H(y)]$
• At a particular hash table (keyed on r signature values)
  • x and y map into the same bucket: $JSIM(x, y)^r$
  • x and y do not map into the same bucket: $1 - JSIM(x, y)^r$
• Across L hash tables
  • x and y do not map into the same bucket in any table: $\left(1 - JSIM(x, y)^r\right)^L$
  • x and y map into the same bucket in at least one table: $1 - \left(1 - JSIM(x, y)^r\right)^L$
True near neighbors are unlikely to be unlucky in all the projections.
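The closing expression on this slide is easy to evaluate numerically. In the sketch below the exponent is written r, taken to be the number of minhash values that key each hash table, and L is the number of tables; the sample values of s, r and L are hypothetical.

```python
def candidate_probability(s, r, L):
    """Probability that two items with Jaccard similarity s share a bucket
    in at least one of L tables, each keyed on r minhash values."""
    return 1.0 - (1.0 - s ** r) ** L


if __name__ == "__main__":
    for s in (0.2, 0.4, 0.6, 0.8):
        print(s, round(candidate_probability(s, r=5, L=20), 4))
```

The probability rises steeply around the similarity threshold, which is why highly similar publications almost always meet in some bucket while dissimilar ones rarely do.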

32
LSH in MAXDIVREL:
Batch-wise Top-k computation
• Bucket "winner": the publication with the highest relevancy score in the bucket
• The winner is dominant and represents its bucket neighborhood
• Top-k "winners": those that have a majority of votes
• The k winners are independent
[Figure: publications PA ... PH in the i-th window]
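A minimal sketch of the batch-wise voting described on this slide. It assumes each bucket casts one vote for its most relevant publication and that ties are broken by relevancy and then by name; the table layout and all data are hypothetical.

```python
from collections import Counter


def batch_top_k(tables, relevancy, k):
    """tables: one dict per hash table, mapping bucket id -> list of publications."""
    votes = Counter()
    for table in tables:
        for bucket in table.values():
            # The bucket winner is the member with the highest relevancy score.
            winner = max(bucket, key=lambda p: (relevancy[p], p))
            votes[winner] += 1
    ranked = sorted(votes, key=lambda p: (-votes[p], -relevancy[p], p))
    return ranked[:k]


if __name__ == "__main__":
    relevancy = {"A": 0.9, "B": 0.5, "C": 0.7, "D": 0.2}
    tables = [
        {0: ["A", "B"], 1: ["C", "D"]},
        {0: ["A", "C"], 1: ["B", "D"]},
    ]
    print(batch_top_k(tables, relevancy, k=2))
```

Because the whole window is rescanned on every call, this version is best suited to computing the result once per window.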

33
LSH in MAXDIVREL:
Incremental Top-k computation
1. A new publication i arrives
2. Update the i-th characteristic vector (characteristic matrix)
3. Generate the i-th minhash signature (signature matrix)
4. Map the i-th signature into the L hash tables
5. Update the "winner" at each bucket the i-th signature maps into
6. Vote for the Top-k candidates
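The incremental flow can be sketched as a small class that touches only the buckets a new publication maps into. This is an assumption-laden sketch: bucket ids are taken as given (one per hash table), window expiry is left out, a tie in relevancy keeps the existing winner, and the class name `IncrementalTopK` is hypothetical.

```python
from collections import Counter


class IncrementalTopK:
    def __init__(self, num_tables):
        # Per hash table: bucket id -> current winner of that bucket.
        self.winners = [{} for _ in range(num_tables)]
        self.relevancy = {}
        self.votes = Counter()

    def add(self, pub, relevancy, bucket_ids):
        """bucket_ids[i] is the bucket that `pub` maps into in table i."""
        self.relevancy[pub] = relevancy
        for table, bucket in zip(self.winners, bucket_ids):
            current = table.get(bucket)
            if current is None or relevancy > self.relevancy[current]:
                if current is not None:
                    self.votes[current] -= 1  # the old winner loses this bucket's vote
                table[bucket] = pub
                self.votes[pub] += 1

    def top_k(self, k):
        live = [p for p, v in self.votes.items() if v > 0]
        return sorted(live, key=lambda p: (-self.votes[p], -self.relevancy[p], p))[:k]


if __name__ == "__main__":
    index = IncrementalTopK(num_tables=2)
    index.add("A", 0.9, [0, 0])
    index.add("B", 0.5, [0, 1])
    index.add("C", 0.7, [1, 0])
    print(index.top_k(2))
    index.add("F", 0.95, [1, 1])  # only the two buckets F maps into are updated
    print(index.top_k(2))
```

Each arrival costs one comparison per hash table, independent of how many publications are already in the window.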

34
LSH in MAXDIVREL:
When new publication F arrives…
• Only buckets $B_{13}$, $B_{23}$, $B_{32}$ and $B_{43}$ will vote
• Follow the continuity requirements
  • Durability
  • Order
[Figure: publications PA ... PH across the i-th and (i+1)-th windows]


35
Implementation

36
Cloud service modules
[Figures: service diagrams. Sources: Amazon Kinesis, Amazon ElastiCache]

37
Evaluation:
Dataset
• Publication stream: Amazon online marketplace data as available on 17–19 November 2014
• Zipfian subscriptions
• Normalized preferences
$$\mathrm{zipf}(k; s, N) = \frac{1 / k^s}{\sum_{n=1}^{N} 1 / n^s}$$
N - number of elements in the distribution,
k - rank of the element,
s - value of the exponent
$$\text{Total number of subscriber views} = \sum_{i=2}^{32} \left[ \binom{48}{i} + \binom{42}{i} + \binom{54}{i} + \binom{66}{i} + \binom{57}{i} + \binom{67}{i} \right]$$
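The Zipf probability used for the subscriptions can be computed and sampled with the standard library alone. This is a sketch under assumptions: the function names, the exponent and the number of ranks are hypothetical, and it does not reproduce how the thesis generated its workload.

```python
import random


def zipf_pmf(k, s, N):
    """zipf(k; s, N) = (1 / k^s) / sum_{n=1..N} (1 / n^s)."""
    return (1.0 / k ** s) / sum(1.0 / n ** s for n in range(1, N + 1))


def sample_ranks(s, N, size, seed=0):
    """Draw `size` ranks in 1..N with Zipf probabilities."""
    rng = random.Random(seed)
    ranks = range(1, N + 1)
    weights = [zipf_pmf(k, s, N) for k in ranks]
    return rng.choices(ranks, weights=weights, k=size)


if __name__ == "__main__":
    print([round(zipf_pmf(k, s=1.0, N=5), 3) for k in range(1, 6)])
    print(sample_ranks(s=1.0, N=5, size=10))
```

With a larger exponent s the probability mass concentrates on the first few ranks, so a handful of subscriptions dominate the workload.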

38
Evaluation:
Methodology
• Subscriber effectiveness
  • Quality
  • Accuracy
  • Resiliency
  • Freshness
• Performance & efficiency
  • Index construction time
  • Top-k matching time
• Platform: Amazon AWS
  • Linux-based micro-node instances
  • Each with 2.3 GHz, 8 GB memory
• Algorithms are implemented in Java

39
Subscriber Effectiveness:
Quality or natural behavior
• Testing the Zipf (power-law) hypothesis on the distribution of ranked results (KS test)
  i. Fitting a power law
  ii. Goodness-of-fit tests
  iii. Alternative distributions
• Compute 19030 ranked distributions over a 100K publication stream
  • Under different subscriber views
  • Under different sized sliding window instances
[Figure: sample distribution of ranked votes; log zipf_prob(rank) against log(rank)]
$$\mathrm{zipf}(k; s, N) = \frac{1 / k^s}{\sum_{n=1}^{N} 1 / n^s}$$
N - number of elements in the distribution,
k - rank of the element,
s - value of the exponent

40
Zipf exponent values

41
Illustration of Zipf exponent values convergence

42
Zipf exponent values under different similarity threshold

43
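ii. Goodness of fit tests
$$\gamma_1 = \max_{x \ge x_{min}} \lvert f(x) - g(x) \rvert$$
f(x): observed rank CDF
g(x): perfectly fitted CDF
$$p\text{-value} = \frac{\text{number of } \gamma_i \text{ where } \gamma_i > \gamma_1}{i}, \quad i = 1000 \text{ synthetic Zipf datasets}$$
[Figure: p-values of the KS test under different subscriber views]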
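The p-value procedure can be sketched with a small Monte Carlo loop. Several simplifications are assumed here and are not taken from the thesis: the Zipf exponent is supplied rather than fitted, x_min is taken as rank 1, and synthetic datasets are compared against the same model CDF instead of being refitted. Function names are hypothetical.

```python
import random


def zipf_cdf(s, N):
    """Cumulative Zipf probabilities for ranks 1..N."""
    weights = [1.0 / k ** s for k in range(1, N + 1)]
    total, acc, cdf = sum(weights), 0.0, []
    for w in weights:
        acc += w
        cdf.append(acc / total)
    return cdf


def ks_distance(ranks, cdf):
    """Largest gap between the empirical CDF of `ranks` and the model CDF."""
    n, N = len(ranks), len(cdf)
    counts = [0] * (N + 1)
    for k in ranks:
        counts[k] += 1
    gap, seen = 0.0, 0
    for k in range(1, N + 1):
        seen += counts[k]
        gap = max(gap, abs(seen / n - cdf[k - 1]))
    return gap


def ks_p_value(ranks, s, N, trials=1000, seed=0):
    """Share of synthetic Zipf datasets whose KS distance exceeds the observed one."""
    rng = random.Random(seed)
    cdf = zipf_cdf(s, N)
    gamma_1 = ks_distance(ranks, cdf)
    exceed = 0
    for _ in range(trials):
        synthetic = rng.choices(range(1, N + 1), cum_weights=cdf, k=len(ranks))
        if ks_distance(synthetic, cdf) > gamma_1:
            exceed += 1
    return exceed / trials


if __name__ == "__main__":
    observed = random.Random(1).choices(range(1, 21), cum_weights=zipf_cdf(1.1, 20), k=500)
    print(ks_p_value(observed, s=1.1, N=20, trials=200))
```

A p-value close to 0 means the observed ranks sit farther from the fitted Zipf curve than almost every synthetic Zipf sample, which counts as evidence against the power-law hypothesis.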

44
iii. Testing alternative distributions

45
Other diversity based methods
• P-dispersion problem
  • MAXMIN
  • MAXSUM
• Minimum independent-dominating set problem
  • MAXDIVREL
  • DisC
For an even comparison, relevancy is combined into every diversity method to achieve a bi-criteria objective.
[Figure: average Zipf-law exponent in comparison with other methods]

46
Other diversity based methods
• P-dispersion problem
  • MAXMIN
  • MAXSUM
• Minimum independent-dominating set problem
  • MAXDIVREL
  • DisC
[Figure: a comparison of the average Zipf-law exponent with other methods]

47
Accuracy of Top-k results
LSH Index vs. NAÏVE
 Rank probability
 Diversity probability
Accuracy on similarity threshold = 0.55 Accuracy on similarity threshold = 0.85

48
Resiliency of Top-k results
Getting Top-k publications (Unordered) Getting Top-k publications (Ordered)

49
Performance
Subscription index update time
Index construction time on opIndex vs. modified opIndex
opIndex vs. modified opIndex

50
Efficiency:
Initial matching time at modified opIndex
Initial matching time under different size of subscription spaces Initial matching time under different size of publications

51
Performance & Efficiency:
LSH Index
[Figure: BLSH index construction + update time for different numbers of minhash functions]
$$\text{Number of minhash functions } (m) = \frac{1}{(\text{estimated error})^2}$$
• How much accuracy do we sacrifice by comparing small minhash signatures?

52
Performance & Efficiency
ILSH vs. BLSH vs. NAÏVE
[Figure: publication stream P1 ... P8. BLSH (batch-wise LSH) or NAÏVE is applied to each of four windows, whereas ILSH (incremental LSH) is applied across the stream.]

53
BLSH vs. NAÏVE
log (Ranking time) on number of publications with D=250 log (Ranking time) on number of publications with D=500

54
ILSH vs. BLSH vs. NAÏVE
log (Ranking time) on number of publications with D=250 log (Ranking time) on number of publications with D=500

55
Conclusions
• Diversified results produced by MAXDIVREL, which is based on the independent-dominating set problem,
  • exhibit stronger natural behavior than
  • methods based on the p-dispersion problem
• Relevancy is an important factor to employ
  • in distance-based diversity methods
• MAXDIVREL always tends to produce a diverse set of personalized results
  • Absolute ranks are sensitive to the preference value,
  • while the deviation among relative ranks stays small

56
Conclusions (Ctd.)
• Locality Sensitive Hashing (LSH) indexing method
  • Produces the MAXDIVREL diverse set of results at 70% average accuracy relative to the naïve method
  • Reduces the matching time significantly compared with the NAÏVE method
• Further refined by its incremental version
  • For handling streaming publications
  • Avoids the cost of re-computing neighborhoods
• No fixed k is needed to restrict the delivery of top publications
  • Given a window size and a delivery method,
  • the model can produce the best diverse set of personalized results
  • to represent the set of all matching publications at a given instance

57
Major Contributions
• Dynamic diversification method based on the independent-dominating set problem
  • We introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, which employs relevancy.
• Index-based diversification approach to rank results incrementally
  • We proposed a novel hashing-based index approach, built on the Locality Sensitive Hashing (LSH) technique, to solve the MAXDIVREL continuous k-diversity problem.
• Advanced evaluation method to measure the quality of diverse results
  • First significant attempt to model the natural behavior of diversity methods in the pub/sub community

58
Future work
• Explore other suitable use cases for the proposed model and develop prototype applications, e.g.
  • Personalized newspaper for every Facebook user
  • Diverse set of personalized Twitter trends
  • Social annotation of news stories
• Exploit the overlap among the diversified results of users who have similar interests
• Employ existing implicit methods to extract human preferences
  • E.g. click-stream analytics
• Develop the LSH-based index for a multi-threaded distributed environment

60
Appendix
Freshness
Mean delay between publications = 5000ms
[Figure: a comparison of relevancy scores after the influence of freshness]
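The fresh-relevancy score (relevancy × freshness) can be sketched with a forward-decay weight. In forward decay, an item that arrived at time t_i has weight g(t_i − L) / g(t − L) at time t, measured from a fixed landmark time L, so the numerator is computed once on arrival and never needs re-calculation. The exponential choice of g, the decay rate and the sample items below are assumptions made for illustration.

```python
import math


def static_weight(relevancy, t_arrival, landmark, lam):
    """Numerator of the fresh-relevancy score; fixed once the item has arrived."""
    return relevancy * math.exp(lam * (t_arrival - landmark))


def fresh_relevancy(relevancy, t_arrival, now, landmark, lam):
    """relevancy x freshness, with freshness = g(t_arrival - L) / g(now - L)."""
    return static_weight(relevancy, t_arrival, landmark, lam) / math.exp(lam * (now - landmark))


if __name__ == "__main__":
    lam = 1e-4  # hypothetical decay rate per millisecond
    # (name, relevancy, arrival time in ms); arrivals spaced 5000 ms apart.
    items = [("P1", 0.9, 0.0), ("P2", 0.6, 5000.0), ("P3", 0.7, 10000.0)]
    for now in (10000.0, 20000.0):
        ranked = sorted(items, key=lambda it: fresh_relevancy(it[1], it[2], now, 0.0, lam),
                        reverse=True)
        print(now, [name for name, _, _ in ranked])
```

Because every score is divided by the same g(now − L), the ranking at any instant is decided by the static weights alone, so the index can keep items ordered without touching them as time advances.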

61
Appendix
NAÏVE Ranking time
Average naïve Top-k matching time in comparison with size D of publications

62
Appendix
BLSH Ranking time
Average BLSH Top-k matching time in comparison with size D of publications

63
Appendix
ILSH Ranking time
Average ILSH Top-k matching time in comparison with size D of publications
