Cloud based publish/subscribe model for
Top-k matching over continuous data
streams
Author:
Y.S. Horawalavithana
10002103
Supervisor:
Dr. D.N. Ranasinghe
U/Graduate Thesis Defense
January 23, 2015
UNIVERSITY OF COLOMBO SCHOOL OF COMPUTING
SCS 4001: INDIVIDUAL PROJECT
Overview
• Motivation
• Target
• Design & Architecture
• Related work
• Dynamic Diversification
• Incremental Top-k
• Implementation
• Evaluation
• Conclusion
• Future work
Motivation – “Big Filter”
1.Motivation 2.Target 3.Design & Architecture 4.Related work 5.Dynamic Diversification 6.Incremental Top-k 7.Implementation 8.Evaluation 9.Conclusion 10.Future Work
Boolean publish/subscribe
Drawbacks
• A subscriber may be either overloaded with publications or receive too few publications
• It is impossible to compare different matching publications, as ranking functions are not defined
• Partial matching between subscriptions and publications is not supported
Top-k publish/subscribe
• Expressive, stateful query processing systems
• A user-defined parameter k restricts the delivered publications
• Pub/Sub matching
  - Top-k pub/sub scoring or ranking
• Pub/Sub indexing
  - Indexing to support personalized subscriptions
  - Indexing to support continuous Top-k publication retrieval
Target
1. How can we define an efficient scoring algorithm that integrates query-independent and query-dependent score metrics?
- Relevance, Freshness & Diversity
2. How can we adapt the indexing data structures used in state-of-the-art publish/subscribe systems to support Top-k matching queries under
a) large subscription volume,
b) high event rate, and
c) a variety of subscribable attributes?
Scope
• Optimize the Top-k heuristic for a specific domain
  - E-commerce with buyers & sellers
• Subscriptions & publications follow a pre-defined data structure
• The number of incoming publications follows a Poisson distribution
• Retrieve Top-k publications against subscriptions, not the reverse
Design & Architecture
[Diagram: system architecture. Labelled components: publication stream; relevance matching; subscription store and subscription indexing over personalized subscriptions; matching publication store (publications with relevance scores, sliding window, expiry); publication indexing; continuous Top-k diversity (relevancy, dissimilarity); event delivery; Top-k notification store and notifications.]
Related work: General Top-k publish/subscribe
Pub/sub model | Subscription | Timing policy | Diversity / scoring metric | Subscription indexing method | Incremental publication indexing | Architecture
PrefSIENA (Drosou, ACM DEBS 2009) | Preferential subscription | Sliding window | Relevancy + MAXMIN diversity | Subscription covering | - | Centralized message-brokers
RRPS (Lu, ICCSA 2009) | Normal | Continuous | QoS | - | - | Centralized
DaZaLaPs (Pripuzi, IS 2012) | Normal | Sliding window | Relevancy | Grid based | - | P2P
Top-k pub/sub (Shraer [Google], VLDB 2014) | Normal | Continuous | Relevancy + Freshness | Tree based | TAAT & DAAT | Centralized
Our model | Personalized subscription space | Sliding window | MAXDIVREL diversity | Inverted-list based | Hashing based | Cloud based
Sliding window Top-k computation
[Figure: a sliding window moves over the matching publication stream P1, P2, ..., P10 with jumping step h (h = 1 and h = 3 are shown); the Top-2 sets shown for successive windows are (P5, P1), (P5, P6) and (P5, P9). Legend: expired, active, Top-k.]
• Top-k notification delivery
  - On-demand
  - Pro-active
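The window mechanics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name, the `(id, score)` tuple layout, and the example scores are all hypothetical, and the Top-k here is a plain score ranking without diversity.

```python
from collections import deque
import heapq

def jumping_window_topk(stream, window_size, h, k):
    """Return the Top-k of each reported window over (pub_id, score) pairs.

    The window holds the latest `window_size` publications and a result is
    reported every `h` arrivals (the jumping step) once the window is full.
    """
    window = deque()
    results = []
    for n, item in enumerate(stream, start=1):
        window.append(item)
        if len(window) > window_size:
            window.popleft()  # the oldest publication expires
        if n >= window_size and (n - window_size) % h == 0:
            results.append(heapq.nlargest(k, window, key=lambda p: p[1]))
    return results

# Hypothetical scores, chosen only so the first two windows mirror the slide.
stream = [("P1", 0.6), ("P2", 0.2), ("P3", 0.1), ("P4", 0.3), ("P5", 0.9), ("P6", 0.7)]
top2_per_window = jumping_window_topk(stream, window_size=5, h=1, k=2)
```

With h = 1 a result is produced after every arrival; a larger h trades result freshness for fewer recomputations.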
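Relevancy: Personalized Subscription space
[Figure: a subscriber registers weighted predicates in the personalized subscription space. Predicates shown, with preference weights in parentheses: Carrier = AT&T (0.4), Brand = HTC (0.3), Storage ≤ 16GB (0.7); Carrier = Verizon (0.5), Storage ≤ 32GB (0.2); Storage ≤ 32GB (0.6), Brand = HTC (0.3). Scores shown on the nodes: 1.75, 1.3, 2.3, 2.52.]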
Relevancy: Personalized Subscription space
[Figure: the publication (Carrier = Verizon, Color = White, OS = Android, Storage = 16GB, Brand = HTC) is matched against the subscription space (Carrier = Verizon, Storage ≤ 32GB; Carrier = AT&T, Storage ≤ 16GB; Brand = HTC). Scores shown on the nodes: 2, 2.5, 1.75, 1.3, 2.3.]
Subscription Indexing: Modified opIndex
• Based on inverted lists
  - Posting lists
• Two-level partitioning
  - Attribute posting list
  - Operator posting list
• Locate satisfying subscription tuples
• Relevancy score
  - By satisfying relations
  - By satisfying subscription tuples
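A two-level inverted index of this kind can be sketched as a dictionary keyed first by attribute and then by operator. The sketch below is an assumption-laden illustration rather than the modified opIndex itself: the class name, the predicate tuple layout, and the scoring rule (summing the preference weights of the satisfied predicates) are hypothetical and are not expected to reproduce the scores shown on the slides.

```python
import operator
from collections import defaultdict

# Operators supported by this sketch (an assumption; the slides show = and ≤).
OPS = {"=": operator.eq, "<=": operator.le, ">=": operator.ge}

class TwoLevelIndex:
    def __init__(self):
        # attribute -> operator -> posting list of (subscription id, value, weight)
        self.postings = defaultdict(lambda: defaultdict(list))

    def subscribe(self, sub_id, predicates):
        for attribute, op, value, weight in predicates:
            self.postings[attribute][op].append((sub_id, value, weight))

    def match(self, publication):
        """Return {subscription id: summed weight of satisfied predicates}."""
        scores = defaultdict(float)
        for attribute, pub_value in publication.items():
            for op, posting_list in self.postings.get(attribute, {}).items():
                test = OPS[op]
                for sub_id, value, weight in posting_list:
                    if test(pub_value, value):
                        scores[sub_id] += weight
        return dict(scores)

index = TwoLevelIndex()
index.subscribe("s1", [("Carrier", "=", "AT&T", 0.4), ("Brand", "=", "HTC", 0.3), ("Storage", "<=", 16, 0.7)])
index.subscribe("s2", [("Carrier", "=", "Verizon", 0.5), ("Storage", "<=", 32, 0.2)])
scores = index.match({"Carrier": "Verizon", "Color": "White", "OS": "Android", "Storage": 16, "Brand": "HTC"})
```

Only the posting lists of attributes that actually occur in the publication are visited, which is the point of partitioning by attribute first.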
Freshness
• When the window becomes larger,
  - Older publications may prevent newer publications from entering the Top-k results
• Lease relevancy scores?
  - But the scores would have to be re-calculated
• Forward decaying!
  - Fresh-relevancy score = relevancy score × freshness score
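Forward decay measures age from a fixed landmark time rather than from the current time, so the per-item factor never has to be recomputed. The sketch below assumes an exponential decay function and a hypothetical rate; the slides do not state which decay function or parameters were used.

```python
import math

def forward_decay_weight(arrival, landmark, rate=0.001):
    """Static part g(t_i - L) of an exponential forward-decay function."""
    return math.exp(rate * (arrival - landmark))

def fresh_relevancy(relevancy, arrival, now, landmark, rate=0.001):
    """Fresh-relevancy score = relevancy score x freshness score."""
    freshness = forward_decay_weight(arrival, landmark, rate) / forward_decay_weight(now, landmark, rate)
    return relevancy * freshness

# An older publication with higher relevancy can be overtaken by a fresher one.
old_score = fresh_relevancy(0.8, arrival=0, now=5000, landmark=0)
new_score = fresh_relevancy(0.6, arrival=5000, now=5000, landmark=0)
```

Because the denominator is the same for every publication at a given time, ranking by relevancy × g(t_i − L) alone gives the same order, which is why stored scores need no re-calculation as time advances.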
Diversity: Top-k representative set
Representative Top-k: a method to retrieve Top-k publications from the matching publications
[Figure panels: drawback (without diversity); what we want (with diversity)]
MAX* k-diversity problem
Find:
$$S^{*} = \underset{S \subseteq P,\ |S| = k}{\operatorname{argmax}}\ f(S, d)$$
where
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function
Proposed: MAXDIVREL k-diversity problem
$$S^{*} = \underset{S \subseteq P}{\operatorname{argmax}}\ f(S, d, r) = \underset{S \subseteq P}{\operatorname{argmax}}\ \frac{g(S, d, r)}{h(S, d, r)}$$
g(S, d, r): maximize the dis-similarity & relevancy in S
h(S, d, r): minimize the dis-similarity & relevancy in P − S
where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. f: a diversity function
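Formal Definition: MAXDIVREL k-diversity
g(S, d, r): argmax of the sum, over pairs p_i, p_j ∈ S and normalized by 1/|S|, of the pairwise term in r(p_i), r(p_j) and d(p_i, p_j), such that independence holds
h(S, d, r): argmin of the sum, over p_i ∈ S, p_j ∈ P − S and normalized by 1/|P − S|, of the pairwise term in r(p_i), r(p_j) and d(p_i, p_j), such that dominance holds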
where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. 𝛼 > 0
Independence condition:
$$\forall p_i, p_j \in S:\ d(p_i, p_j) > \alpha$$
Dominance condition:
$$\forall p_i \in P,\ \exists p_j \in S \ \text{s.t.}\ d(p_i, p_j) \le \alpha;\ i \ne j$$
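The two conditions can be checked directly for a candidate set. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a publication already in S counts as covered by itself, so dominance is effectively tested for the publications outside S.

```python
from itertools import combinations

def is_independent(S, d, alpha):
    """Every pair of selected publications is more than alpha apart."""
    return all(d(p, q) > alpha for p, q in combinations(S, 2))

def is_dominating(S, P, d, alpha):
    """Every publication is selected or within alpha of a selected one."""
    return all(p in S or any(d(p, q) <= alpha for q in S) for p in P)

# Toy publication space: points on a line with absolute difference as distance.
P = [0, 1, 2, 5]
distance = lambda a, b: abs(a - b)
ok = is_independent([1, 5], distance, 1) and is_dominating([1, 5], P, distance, 1)
```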
NP-Hardness:
Minimum independent-dominating set
[Figure: publications p1 to p5 in the publication space are mapped to vertices v1 to v5 of a graph model, using the distance threshold α.]
$$\mathrm{Neighborhood}(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$$
[Figure panels: four example vertex sets on the graph; three are labelled "independent, dominating" and one "dominating, not independent".]
NAÏVE Greedy
$$\operatorname{argmax}\ \frac{r(p_i)^{2}}{\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)}$$
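One way to turn this score into a selection loop is sketched below. Only the score itself comes from the slide; the loop structure is an assumption: pick the best-scoring publication, drop it together with its α-neighborhood, and repeat. Treating a publication with an empty neighborhood (or a zero denominator) as having infinite score is also an assumption, and the function name is hypothetical.

```python
def naive_greedy(P, r, d, alpha, k=None):
    """Greedy selection using the score r(p)^2 / sum(r(q) * d(p, q)) over neighbours q."""
    remaining = list(P)
    selected = []
    while remaining and (k is None or len(selected) < k):
        def score(p):
            neighbours = [q for q in remaining if q != p and d(p, q) <= alpha]
            denominator = sum(r(q) * d(p, q) for q in neighbours)
            return float("inf") if denominator == 0 else r(p) ** 2 / denominator
        best = max(remaining, key=score)
        selected.append(best)
        # Remove the winner and everything it covers, so later picks stay independent.
        remaining = [q for q in remaining if q != best and d(best, q) > alpha]
    return selected

# Toy data: points on a line, hypothetical relevancy scores.
relevancy = {0: 0.2, 1: 0.9, 2: 0.3, 5: 0.5, 6: 0.4}
chosen = naive_greedy(list(relevancy), relevancy.get, lambda a, b: abs(a - b), alpha=1)
```

Each iteration recomputes every neighborhood, which is the cost the incremental index later in the deck is meant to avoid.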
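Handling streaming publications
[Figure: the publication space with p1 to p5 and its graph model with v1 to v5 under threshold α; when a new publication p6 arrives, a vertex v6 is added to the graph.]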
Continuity Requirements
1. Durability
An item selected as diverse in the i-th window may still be selected in the (i+1)-th window if it has not expired and the other valid items in the (i+1)-th window fail to compete with it.
2. Order
The publication stream follows chronological order.
We avoid selecting an item j as diverse later when we have already selected an item i that is not older than j.
MAXDIVREL continuous k-diversity
[Figure: over the matching publication stream P1, P2, ..., Pj, Pj+1, ..., MAXDIVREL k-diversity is computed on the i-th window to give S*_i and on the (i+1)-th window to give S*_(i+1). Labels: independence, dominance, durability, order.]
• Straightforward solution:
  - Apply the naïve greedy method at each instance
• Proposed: an incremental index mechanism!
  - Avoids the curse of re-calculating neighborhoods
Locality Sensitive Hashing (LSH)
• Simple idea
  - If two points are close together, then after a "projection" operation they will remain close together
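LSH Analysis
• For any given points p, q ∈ R^d:
$$P_H[h(p) = h(q)] \ge P_1 \quad \text{for } \lVert p - q \rVert \le d_1$$
$$P_H[h(p) = h(q)] \le P_2 \quad \text{for } \lVert p - q \rVert \ge c\,d_1 = d_2$$
• The hash function h is (d_1, d_2, P_1, P_2)-sensitive
• Ideally we need
  - (P_1 − P_2) to be large
  - the gap between d_1 and d_2 to be small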
LSH in MAXDIVREL:
Publications as categorical data
LSH in MAXDIVREL:
Characteristic Matrix
LSH in MAXDIVREL:
Minhashing
• No publications any more!
  - Each publication is represented by a signature
• Technique
  - Randomly permute the rows of the characteristic matrix m times
  - For each permutation, take the number of the first row, in the permuted order, in which the publication's column has a 1
• Advantage:
  - Reduces the dimensions to a small minhash signature
[Figure: first permutation of the rows of the characteristic matrix]
LSH in MAXDIVREL:
Signature Matrix
Fast-minhashing
• Select m random hash functions
  - To model the effect of m random permutations
• Mathematically proved only when
  - The number of rows is a prime
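Fast minhashing replaces explicit row permutations with hash functions of the form (a·x + b) mod p. The sketch below is a generic illustration of that technique, not the thesis code: the function names, the fixed large prime, and the seed are assumptions, and a publication is represented simply as the set of row numbers where its column of the characteristic matrix has a 1.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, a prime larger than any row number used here

def make_hash_functions(m, seed=0):
    """Draw m random (a, b) pairs; each defines h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(m)]

def minhash_signature(rows, hash_functions):
    """rows: set of row numbers where the publication's column has a 1."""
    return [min((a * x + b) % PRIME for x in rows) for a, b in hash_functions]

def estimated_similarity(sig_x, sig_y):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    agree = sum(1 for u, v in zip(sig_x, sig_y) if u == v)
    return agree / len(sig_x)

functions = make_hash_functions(m=100)
sig_a = minhash_signature({1, 4, 7, 9}, functions)
sig_b = minhash_signature({1, 4, 7, 12}, functions)
similarity_ab = estimated_similarity(sig_a, sig_b)  # true Jaccard similarity is 3/5
```

Because the modulus is prime and a is non-zero, each hash function maps distinct row numbers to distinct values, which is what lets it stand in for a permutation.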
LSH in MAXDIVREL:
LSH Buckets
• Take r-sized signature vectors
  - From the m-sized minhash signature
• Map them into
  - L hash tables
  - Each with an arbitrary number b of buckets
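The mapping from one signature to L buckets can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the signature is cut into L consecutive slices of r values, and Python's built-in `hash` on a tuple of integers stands in for whatever bucket hash the implementation used.

```python
def band_buckets(signature, L, r, b):
    """Return the bucket number of the signature in each of the L hash tables."""
    if L * r != len(signature):
        raise ValueError("expected a signature of length L * r")
    buckets = []
    for table in range(L):
        band = tuple(signature[table * r:(table + 1) * r])  # r-sized signature vector
        buckets.append(hash(band) % b)
    return buckets

# Two hypothetical signatures that agree on the first r = 3 values only.
buckets_x = band_buckets([1, 2, 3, 4, 5, 6], L=2, r=3, b=16)
buckets_y = band_buckets([1, 2, 3, 9, 9, 9], L=2, r=3, b=16)
```

Publications whose signatures agree on all r values of some slice are guaranteed to share the bucket in that table, so they become neighbors without any pairwise distance computation.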
LSH in MAXDIVREL:
How to select L, r?
For two vectors x, y:
$$JD(x, y) = 1 - JSIM(x, y), \quad \text{where } JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
1. $L \times r = m$
2. $\text{similarity threshold } (s) \approx \left(\frac{1}{L}\right)^{\frac{1}{r}}$
LSH in MAXDIVREL:
Analysis
For two vectors x, y:
$$JD(x, y) = 1 - JSIM(x, y), \quad \text{where } JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
• For publications x & y
$$JSIM(x, y) \propto Prob[H(x) = H(y)]$$
• At a particular hash table (r signature values per table)
  - x & y map into the same bucket: $JSIM(x, y)^{r}$
  - x & y do not map into the same bucket: $1 - JSIM(x, y)^{r}$
• At L hash tables
  - x & y do not map into the same bucket in any table: $(1 - JSIM(x, y)^{r})^{L}$
  - x & y map into the same bucket in at least one table: $1 - (1 - JSIM(x, y)^{r})^{L}$
True near neighbors are unlikely to be unlucky in all the projections.
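The collision probabilities and the similarity threshold are easy to evaluate numerically. The sketch below takes the exponent to be the number of signature values hashed per table (r in the earlier slides, an assumption about the notation); the function names and the choice r = 5, L = 20 are hypothetical.

```python
def candidate_probability(s, r, L):
    """Probability that two publications with Jaccard similarity s share a bucket in at least one of L tables."""
    return 1 - (1 - s ** r) ** L

def similarity_threshold(r, L):
    """Approximate similarity at which the collision probability rises steeply: (1/L)^(1/r)."""
    return (1 / L) ** (1 / r)

# With m = 100 minhash values split as L = 20 tables of r = 5 values each:
threshold = similarity_threshold(r=5, L=20)
curve = [(s / 10, candidate_probability(s / 10, r=5, L=20)) for s in range(0, 11)]
```

Pairs well above the threshold almost always collide in some table, and pairs well below it almost never do, which is the S-shaped behavior that makes the L, r choice a tuning knob.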
LSH in MAXDIVREL:
Batch-wise Top-k computation
• Bucket "winner": the publication with the highest relevancy score in the bucket
• The winner is dominant and represents its bucket neighborhood
• The Top-k are the "winners" that have a majority of votes
• The k winners are independent
[Figure: publication stream PA, PB, ..., PH within the i-th window.]
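The winner-and-vote step can be sketched with a counter over bucket winners. This is an illustrative reading of the slide, not the thesis code: the function name and the data layout are hypothetical, "majority of votes" is interpreted as "most votes", and ties are left to the counter's ordering.

```python
from collections import Counter

def topk_by_votes(tables, relevancy, k):
    """tables: list of dicts mapping a bucket number to the publication ids in that bucket."""
    votes = Counter()
    for table in tables:
        for members in table.values():
            winner = max(members, key=lambda pub: relevancy[pub])
            votes[winner] += 1  # each bucket votes for its most relevant publication
    return [pub for pub, _ in votes.most_common(k)]

# Hypothetical window contents: two hash tables, three publications.
tables = [{0: ["A", "B"], 1: ["C"]}, {0: ["A", "C"], 1: ["B"]}]
relevancy = {"A": 0.9, "B": 0.5, "C": 0.7}
top1 = topk_by_votes(tables, relevancy, k=1)
```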
LSH in MAXDIVREL:
Incremental Top-k computation
1. A new publication i arrives
2. Update the i-th characteristic vector (characteristic matrix)
3. Generate the i-th minhash signature (signature matrix)
4. Map the i-th signature into the L hash tables
5. Update the "winner" at each bucket the i-th signature maps into
6. Vote for the Top-k candidates
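The incremental steps can be sketched as a small class that touches only the L buckets a new signature maps into. Everything here is a simplified assumption: the class and method names are hypothetical, Python's built-in `hash` stands in for the bucket hash, and window expiry and the continuity requirements are left out.

```python
from collections import Counter

class IncrementalWinners:
    def __init__(self, L, r, b):
        self.L, self.r, self.b = L, r, b
        self.winner = [dict() for _ in range(L)]  # per table: bucket -> (pub id, relevancy)
        self.votes = Counter()

    def add(self, pub_id, signature, relevancy):
        """Map one new signature into the L tables and update only the affected winners."""
        for table in range(self.L):
            band = tuple(signature[table * self.r:(table + 1) * self.r])
            bucket = hash(band) % self.b
            current = self.winner[table].get(bucket)
            if current is None or relevancy > current[1]:
                if current is not None:
                    self.votes[current[0]] -= 1  # the old winner loses this bucket's vote
                self.winner[table][bucket] = (pub_id, relevancy)
                self.votes[pub_id] += 1

    def top_k(self, k):
        return [pub for pub, count in self.votes.most_common(k) if count > 0]

index = IncrementalWinners(L=2, r=2, b=8)
index.add("A", [1, 2, 3, 4], relevancy=0.5)
first = index.top_k(1)
index.add("B", [1, 2, 9, 9], relevancy=0.9)
second = index.top_k(1)
```

The cost of an arrival is L bucket lookups, independent of how many publications are already in the window.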
LSH in MAXDIVREL:
When new publication F arrives…
• Only buckets B13, B23, B32 and B43 will vote
• Follow the continuity requirements
  - Durability
  - Order
[Figure: publication stream PA, PB, ..., PH with the i-th and (i+1)-th windows.]
Implementation
Cloud service modules
[Figures. Source: Amazon Kinesis; source: Amazon ElastiCache]
Evaluation:
Dataset
Amazon online marketplace data available on 17th – 19th November 2014
• Publication stream
• Zipfian subscriptions
• Normalized preferences
$$\mathrm{zipf}(k; s, N) = \frac{1 / k^{s}}{\sum_{n=1}^{N} \left( 1 / n^{s} \right)}$$
N - number of elements in the distribution
k - rank of the element
s - value of the exponent
$$\text{Total number of subscriber views} = \sum_{i=2}^{32} \left( {}^{48}C_{i} + {}^{42}C_{i} + {}^{54}C_{i} + {}^{66}C_{i} + {}^{57}C_{i} + {}^{67}C_{i} \right)$$
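The Zipf probability used for the subscriptions can be evaluated directly from its definition. The function name and the example parameters below are hypothetical; only the formula comes from the slide.

```python
def zipf_pmf(k, s, N):
    """Probability of the element of rank k under a Zipf law with exponent s over N elements."""
    normaliser = sum(1 / n ** s for n in range(1, N + 1))
    return (1 / k ** s) / normaliser

# Probabilities of the ten ranks for exponent s = 1.
probabilities = [zipf_pmf(k, s=1.0, N=10) for k in range(1, 11)]
```

The probabilities sum to one and fall off as a power of the rank, so a handful of popular subscriptions account for most of the mass.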
Evaluation:
Methodology
• Subscriber effectiveness: quality, accuracy, resiliency, freshness
• Performance & efficiency: index construction time, Top-k matching time
• Platform: Amazon AWS
  - Linux-based micro-node instances
  - Each with 2.3 GHz, 8GB memory
• Algorithms are implemented in Java
Subscriber Effectiveness:
Quality or natural behavior
• Test the Zipf (power-law) hypothesis on the distribution of ranked results (KS test)
  i. Fitting power law
  ii. Goodness of fit tests
  iii. Alternative distributions
• Compute 19030 ranked distributions over a 100K publication stream
  - Under different subscriber views
  - Under different-sized sliding window instances
[Figure: sample distribution of ranked votes; axes: log(rank) and log zipf_prob(rank)]
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values
Subscriber Effectiveness:
i. Fitting power law
Illustration of Zipf exponent values convergence
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values under different similarity threshold
Subscriber Effectiveness:
ii. Goodness of fit tests
$$\gamma_1 = \max_{x \ge x_{min}} \lvert f(x) - g(x) \rvert$$
f(x): observed rank CDF
g(x): perfect fitted CDF
$$p\text{-value} = \frac{\text{number of } \gamma_i \text{ where } \gamma_i > \gamma_1}{i}, \quad i = 1000 \text{ synthetic zipf datasets}$$
[Figure: p-values of the KS test under different subscriber views]
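The goodness-of-fit procedure can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: function names are hypothetical, x_min is taken to be rank 1, and the synthetic datasets are compared against the same fitted exponent rather than being refitted one by one.

```python
import random

def zipf_cdf(s, N):
    """Cumulative Zipf probabilities for ranks 1..N."""
    weights = [1 / n ** s for n in range(1, N + 1)]
    total = sum(weights)
    cdf, running = [], 0.0
    for weight in weights:
        running += weight / total
        cdf.append(running)
    return cdf

def ks_distance(ranks, s, N):
    """Largest gap between the observed rank CDF f and the fitted Zipf CDF g."""
    g = zipf_cdf(s, N)
    counts = [0] * N
    for rank in ranks:
        counts[rank - 1] += 1
    gap, seen = 0.0, 0
    for i in range(N):
        seen += counts[i]
        gap = max(gap, abs(seen / len(ranks) - g[i]))
    return gap

def ks_p_value(ranks, s, N, n_sim=1000, seed=0):
    """Fraction of synthetic Zipf datasets whose KS distance exceeds the observed one."""
    rng = random.Random(seed)
    weights = [1 / n ** s for n in range(1, N + 1)]
    gamma_1 = ks_distance(ranks, s, N)
    exceed = 0
    for _ in range(n_sim):
        synthetic = rng.choices(range(1, N + 1), weights=weights, k=len(ranks))
        if ks_distance(synthetic, s, N) > gamma_1:
            exceed += 1
    return exceed / n_sim

# A sample that is clearly not Zipfian: every observation has rank 1.
p_bad = ks_p_value([1] * 100, s=1.0, N=5, n_sim=200)
```

A large p-value means the observed distance is typical of data drawn from the fitted Zipf law, so the hypothesis is not rejected; a p-value near zero means the fit is implausible.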
Subscriber Effectiveness:
iii. Testing alternative distributions
Subscriber Effectiveness:
Other diversity based methods
• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC
For an even comparison, relevancy is combined with every diversity method to achieve a bi-criteria objective.
[Figure: average Zipf law exponent in comparison with other methods]
Subscriber Effectiveness:
Other diversity based methods
• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC
[Figure: a comparison of the average Zipf law exponent with other methods]
Subscriber Effectiveness:
Accuracy of Top-k results
LSH Index vs. NAÏVE
• Rank probability
• Diversity probability
[Figure panels: accuracy at similarity threshold = 0.55; accuracy at similarity threshold = 0.85]
Subscriber Effectiveness:
Resiliency of Top-k results
[Figure panels: getting Top-k publications (unordered); getting Top-k publications (ordered)]
Performance
Subscription index update time
[Figure: index construction time, opIndex vs. modified opIndex]
Efficiency:
Initial matching time at modified opIndex
[Figure panels: initial matching time for different sizes of subscription space; initial matching time for different sizes of publications]
Performance & Efficiency:
LSH Index
[Figure: BLSH index construction + update time for different numbers of minhash functions]
$$\text{Number of minhash functions } (m) = \frac{1}{(\text{estimated error})^{2}}$$
• How much accuracy do we sacrifice by comparing small minhash signatures?
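The relation between the number of minhash functions and the estimated error is a one-line calculation. The function name is hypothetical, and rounding up to a whole number of hash functions is an assumption.

```python
import math

def minhash_functions_needed(estimated_error):
    """m = 1 / (estimated error)^2, rounded up to a whole number of hash functions."""
    return math.ceil(1 / estimated_error ** 2)

m_for_ten_percent = minhash_functions_needed(0.1)
```

Halving the tolerated error therefore quadruples the signature length, which is the trade-off behind the question on the slide.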
Performance & Efficiency
ILSH vs. BLSH vs. NAÏVE
[Figure: over the publication stream P1, P2, ..., P8, BLSH or NAÏVE is applied to each window in turn, while ILSH runs incrementally across the windows.]
Performance & Efficiency:
BLSH vs. NAÏVE
[Figure panels: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]
Performance & Efficiency:
ILSH vs. BLSH vs. NAÏVE
[Figure panels: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]
Conclusions
• Diversified results produced by MAXDIVREL, based on the independent-dominating set problem
  - Exhibit strong natural behavior compared with methods based on the p-dispersion problem
• Relevancy is an important factor to employ
  - In distance-based diversity methods
• MAXDIVREL always has the tendency to produce a diverse set of personalized results
  - Absolute ranks are sensitive to the preference value
  - While keeping the deviation among relative ranks small
Conclusions (Ctd.)
• Locality Sensitive Hashing (LSH) indexing method
  - Produces the MAXDIVREL diverse set of results at 70% accuracy on average relative to the naïve method
  - Reduces the matching time significantly compared with the NAÏVE method
• Further refined by its incremental version
  - For handling streaming publications
  - Avoids the curse of re-computing neighborhoods
• No such k is needed to restrict the delivery of Top publications
  - Given a window size & delivery method
  - The model can produce the best diverse set of personalized results
  - To represent the set of all matching publications at a given instance
Major Contributions
• Dynamic diversification method based on the independent-dominating set problem
  - We introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, which employs relevancy.
• Index-based diversification approach to rank results incrementally
  - We proposed a novel hashing-based index approach to solve the MAXDIVREL continuous k-diversity problem, based on the Locality Sensitive Hashing (LSH) technique.
• Advanced evaluation method to measure the quality of diverse results
  - First significant attempt to model the natural behavior of diversity methods in the pub/sub community
Future work
• Explore other suitable use cases for the proposed model & develop prototype applications, e.g.
  - Personalized newspaper for every Facebook user
  - Diverse set of personalized Twitter trends
  - Social annotation of news stories
• Exploit the overlap among the diversified results of users who have similar interests
• Employ existing implicit methods to extract human preferences
  - E.g. click-stream analytics
• Develop the LSH-based index over a multi-threaded distributed environment
Q&A
THANK YOU!
Appendix
Freshness
Mean delay between publications = 5000ms
A comparison between relevancy scores after influenced by freshness
Appendix
NAÏVE Ranking time
Average naïve Top-k matching time in comparison with size D of publications
Appendix
BLSH Ranking time
Average BLSH Top-k matching time in comparison with size D of publications
Appendix
ILSH Ranking time
Average ILSH Top-k matching time in comparison with size D of publications

 
080613 Mega-Project Schedule Integration & Management RCF Method-1
080613 Mega-Project Schedule Integration & Management RCF Method-1 080613 Mega-Project Schedule Integration & Management RCF Method-1
080613 Mega-Project Schedule Integration & Management RCF Method-1
 
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.pptProto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt Proto Spiral.ppt
 
Evolutionary Architecture And Design
Evolutionary Architecture And DesignEvolutionary Architecture And Design
Evolutionary Architecture And Design
 
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation [ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
 
Novel Graph Modeling Framework for Feature Importance Determination in Unsupe...
Novel Graph Modeling Framework for Feature Importance Determination in Unsupe...Novel Graph Modeling Framework for Feature Importance Determination in Unsupe...
Novel Graph Modeling Framework for Feature Importance Determination in Unsupe...
 
Technical briefing on Software Release Planning
Technical briefing on Software Release PlanningTechnical briefing on Software Release Planning
Technical briefing on Software Release Planning
 

More from Sameera Horawalavithana

Data-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationData-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationSameera Horawalavithana
 
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
Drivers of Polarized Discussions on Twitter during Venezuela Political CrisisSameera Horawalavithana
 
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Twitter Is the Megaphone of Cross-platform Messaging on the White HelmetsSameera Horawalavithana
 
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Sameera Horawalavithana
 
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubMentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubSameera Horawalavithana
 
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...Sameera Horawalavithana
 
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...Sameera Horawalavithana
 
Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Sameera Horawalavithana
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingSameera Horawalavithana
 

More from Sameera Horawalavithana (15)

Data-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationData-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and Simulation
 
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
 
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubMentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
 
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
 
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
 
Dancing with Stream Processing
Dancing with Stream ProcessingDancing with Stream Processing
Dancing with Stream Processing
 
Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
 
Zipf distribution
Zipf distributionZipf distribution
Zipf distribution
 
Query personalization
Query personalizationQuery personalization
Query personalization
 
Dancing with publish/subscribe
Dancing with publish/subscribeDancing with publish/subscribe
Dancing with publish/subscribe
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
 

Recently uploaded

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

  • 1. Cloud-based publish/subscribe model for Top-k matching over continuous data streams. Author: Y.S. Horawalavithana (10002103). Supervisor: Dr. D.N. Ranasinghe. Undergraduate Thesis Defense, January 23, 2015. University of Colombo School of Computing, SCS 4001: Individual Project.
  • 2. Overview: Motivation; Target; Design & Architecture; Related work; Dynamic Diversification; Incremental Top-k; Implementation; Evaluation; Conclusion; Future work
  • 3. Motivation – “Big Filter” 1.Motivation 2.Target 3.Design & Architecture 4.Related work 5.Dynamic Diversification 6.Incremental Top-k 7.Implementation 8.Evaluation 9.Conclusion 10.Future Work
  • 4. Boolean publish/subscribe: drawbacks
      - A subscriber may be either overloaded with publications or receive too few.
      - Matching publications cannot be compared with each other, because no ranking function is defined.
      - Partial matching between subscriptions and publications is not supported.
  • 5. Top-k publish/subscribe
      - Expressive, stateful query processing systems
      - A user-defined parameter k restricts the number of delivered publications
      - Pub/sub matching: Top-k pub/sub scoring or ranking
      - Pub/sub indexing: indexing to support personalized subscriptions, and indexing to support continuous retrieval of Top-k publications
  • 6. Target
      1. How can an efficient scoring algorithm be defined that integrates both query-independent and query-dependent score metrics (relevance, freshness and diversity)?
      2. How can the indexing data structures used in state-of-the-art publish/subscribe systems be adapted to support Top-k matching queries under (a) a large subscription volume, (b) a high event rate, and (c) a variety of subscribable attributes?
  • 7. Scope
      - Optimize the Top-k heuristic for a specific domain: e-commerce with buyers and sellers
      - Subscriptions and publications follow a pre-defined data structure
      - The number of incoming publications is modelled as a Poisson random variable
      - Retrieve Top-k publications against subscriptions, not the reverse
  • 9. Related work: general Top-k publish/subscribe. Systems are compared by subscription type, timing policy, diversity, scoring metric, subscription indexing method, incremental publication indexing, and architecture.
      - PrefSIENA (Drosou, ACM DEBS 2009): preferential subscriptions; sliding window; relevancy + MAXMIN diversity; subscription covering; centralized message brokers
      - RRPS (Lu, ICCSA 2009): normal subscriptions; continuous; QoS; centralized
      - DaZaLaPs (Pripuzi, IS 2012): normal subscriptions; sliding window; relevancy; grid-based; P2P
      - Top-k pub/sub (Shraer [Google], VLDB 2014): normal subscriptions; continuous; relevancy + freshness; tree-based; TAAT & DAAT; centralized
      - Our model: personalized subscription space; sliding window; MAXDIVREL diversity; inverted-list based; hashing based; cloud based
  • 10. Sliding window Top-k computation
      - [Figure: a matching publication stream P1, P2, ..., P10, ... shown at three window positions, with jumping steps h=1 and h=3; publications are marked as expired, active or Top-k, and the Top-2 results shown are (P5, P1), (P5, P6) and (P5, P9)]
      - Jumping step (h)
      - Top-k notification delivery: on-demand or pro-active
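The window mechanics on this slide can be illustrated with a minimal sketch. It assumes a count-based window of `w` publications that advances by the jumping step `h`, and it recomputes the Top-k from scratch at each position. The class name `SlidingWindowTopK`, the parameters and the plain score array are hypothetical and are not taken from the deck.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowTopK {
    /** Indices of the k highest-scoring publications in scores[start, start + w). */
    static List<Integer> topK(double[] scores, int start, int w, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = start; i < Math.min(start + w, scores.length); i++) {
            idx.add(i);
        }
        // Sort by descending score; List.sort is stable, so ties keep arrival order.
        idx.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return new ArrayList<>(idx.subList(0, Math.min(k, idx.size())));
    }

    /** Top-k of every full window of size w, advancing by the jumping step h. */
    static List<List<Integer>> slide(double[] scores, int w, int h, int k) {
        List<List<Integer>> results = new ArrayList<>();
        for (int start = 0; start + w <= scores.length; start += h) {
            results.add(topK(scores, start, w, k));
        }
        return results;
    }
}
```

With h=1 consecutive windows overlap in all but one publication, which is why an incremental index can pay off compared with this recompute-everything baseline.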
  • 11. Relevancy: personalized subscription space [Figure: subscriptions built from weighted predicates, namely Carrier = AT&T (0.4), Brand = HTC (0.3), Storage ≤ 16GB (0.7), Carrier = Verizon (0.5), Storage ≤ 32GB (0.2) and Storage ≤ 32GB (0.6), with numeric labels 1.3, 1.75, 2, 2.3 and 2.5]
  • 12. Relevancy: personalized subscription space [Figure: the subscription space (Carrier = Verizon, Storage ≤ 32GB, Carrier = AT&T, Storage ≤ 16GB, Brand = HTC; numeric labels 1.3, 1.75, 2, 2.3 and 2.5) shown against an item with Carrier = Verizon, Color = White, OS = Android, Storage = 16GB, Brand = HTC]
  • 13. Subscription indexing: modified opIndex
      - Based on inverted lists (posting lists)
      - Two-level partitioning: attribute posting lists, then operator posting lists
      - Locates the satisfying subscription tuples
      - Relevancy score: computed from the satisfying relations and the satisfying subscription tuples
  • 14. Freshness
      - As the window becomes larger, older publications may prevent newer publications from entering the Top-k results
      - Lease relevancy scores? The scores would then have to be re-calculated
      - Forward decaying instead
      - Fresh-relevancy score = relevancy score × freshness score
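A short sketch may help show why forward decay avoids re-calculating scores. The slide only names the technique, so the exponential decay function, the rate `lambda`, the landmark time and every identifier below are assumptions chosen for illustration. The point is that the numerator depends only on the publication's own timestamp, so it can be stored once and still orders publications correctly at any later time.

```java
class ForwardDecay {
    /**
     * Static part of the score: relevancy * g(ts - landmark), with the assumed
     * decay function g(n) = exp(lambda * n). It never changes after arrival.
     */
    static double staticScore(double relevancy, double ts, double landmark, double lambda) {
        return relevancy * Math.exp(lambda * (ts - landmark));
    }

    /** Fresh-relevancy at time now: the static score normalised by g(now - landmark). */
    static double freshRelevancy(double relevancy, double ts, double now,
                                 double landmark, double lambda) {
        return staticScore(relevancy, ts, landmark, lambda)
                / Math.exp(lambda * (now - landmark));
    }
}
```

Because the divisor is the same for every publication at a given time, ranking by `staticScore` gives the same order as ranking by `freshRelevancy`.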
  • 15. Diversity: Top-k representative set. A method to retrieve the Top-k publications from the matching publications. [Figure: two panels of a representative Top-k, “Drawback (without diversity)” and “What we want (with diversity)”]
  • 16. MAX* k-diversity problem. Find $S^* = \arg\max_{S \subseteq P,\ |S| = k} f(S, d)$, where (1) $P = \{p_1, \dots, p_n\}$, (2) $k \le n$, (3) $d$ is a distance metric, and (4) $f$ is a diversity function.
  • 17. Proposed: MAXDIVREL k-diversity problem. $S^* = \arg\max_{S \subseteq P} f(S, d, r) = \arg\max_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$, where $g(S, d, r)$ maximizes the dissimilarity and relevancy in $S$, and $h(S, d, r)$ minimizes the dissimilarity and relevancy in $P - S$. Here (1) $P = \{p_1, \dots, p_n\}$, (2) $d$ is a distance metric, (3) $r$ is a relevance metric, and (4) $f$ is a diversity function.
  • 18. Formal definition: MAXDIVREL k-diversity
      - $\arg\max\ g(S, d, r)$ with $g(S, d, r) = \frac{1}{|S|} \sum_{p_i, p_j \in S} \frac{r(p_j)}{r(p_i)}\, d(p_i, p_j)$, such that independence holds
      - $\arg\min\ h(S, d, r)$ with $h(S, d, r) = \frac{1}{|P - S|} \sum_{p_i \in S,\ p_j \in P - S} \frac{r(p_j)}{r(p_i)}\, d(p_i, p_j)$, such that dominance holds
      - where (1) $P = \{p_1, \dots, p_n\}$, (2) $d$ is a distance metric, (3) $r$ is a relevance metric, and (4) $\alpha > 0$
      - Independence condition: $\forall p_i, p_j \in S,\ d(p_i, p_j) > \alpha$
      - Dominance condition: $\forall p_i \in P,\ \exists p_j \in S$ s.t. $d(p_i, p_j) \le \alpha$; $i \ne j$
  • 19. NP-hardness: minimum independent dominating set. $Neighborhood(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$. [Figure: a publication space with points p1 to p5 and radius α, its graph model with vertices v1 to v5, and four example vertex subsets, three labelled “independent, dominating” and one labelled “dominating, not independent”]
  • 20. Naïve greedy: $\arg\max_{p_i} \frac{r(p_i)^2}{\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)}$
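One way to read the greedy criterion is sketched below. It rests on several assumptions that the slide does not state: the score is the ratio of the squared relevancy to the relevancy-weighted distance sum over the α-neighborhood, a publication with an empty neighborhood scores its squared relevancy, and after each pick the chosen publication and its neighbors are removed from the candidates so that the result stays independent. All identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveGreedy {
    /** Assumed score: r(p_i)^2 / sum over neighbours p_j of r(p_j) * d(p_i, p_j). */
    static double score(int i, double[] r, double[][] d, double alpha) {
        double sum = 0.0;
        for (int j = 0; j < r.length; j++) {
            if (j != i && d[i][j] <= alpha) {
                sum += r[j] * d[i][j];
            }
        }
        // Assumption: an isolated publication (empty neighbourhood) scores r^2.
        return sum == 0.0 ? r[i] * r[i] : r[i] * r[i] / sum;
    }

    /** Picks up to k publications; each pick covers itself and its alpha-neighbours. */
    static List<Integer> select(double[] r, double[][] d, double alpha, int k) {
        boolean[] covered = new boolean[r.length];
        List<Integer> chosen = new ArrayList<>();
        while (chosen.size() < k) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < r.length; i++) {
                if (covered[i]) continue;
                double s = score(i, r, d, alpha);
                if (s > bestScore) {
                    bestScore = s;
                    best = i;
                }
            }
            if (best < 0) break; // every publication is already dominated
            chosen.add(best);
            for (int j = 0; j < r.length; j++) {
                if (j == best || d[best][j] <= alpha) covered[j] = true;
            }
        }
        return chosen;
    }
}
```

Each round scans all remaining publications and their neighborhoods, which is the recomputation cost that the incremental index later in the deck aims to avoid.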
  • 21. Handling streaming publications. [Figure: the publication space and graph model after a new publication p6 (vertex v6) arrives] Continuity requirements:
      1. Durability: an item selected as diverse in the i-th window may remain selected in the (i+1)-th window, provided it has not expired and the other valid items in the (i+1)-th window fail to compete with it.
      2. Order: the publication stream follows chronological order. An item j is not selected as diverse later if an item i that is not older than j has already been selected.
  • 22. MAXDIVREL continuous k-diversity
      - [Figure: a matching publication stream P1, ..., Pj, Pj+1, ... with the i-th and (i+1)-th windows; MAXDIVREL k-diversity yields $S_i^*$ and $S_{i+1}^*$, subject to independence, dominance, durability and order]
      - Straightforward solution: apply the naïve greedy method at each window instance
      - Proposed: an incremental index mechanism that avoids the cost of re-calculating neighborhoods
  • 23. Locality Sensitive Hashing (LSH). Simple idea: if two points are close together, then after a “projection” operation these two points will remain close together.
  • 24. LSH analysis
      - For any given points $p, q \in \mathbb{R}^d$: $\Pr_H[h(p) = h(q)] \ge P_1$ for $\|p - q\| \le d_1$, and $\Pr_H[h(p) = h(q)] \le P_2$ for $\|p - q\| \ge c\,d_1 = d_2$
      - A hash function h with this property is $(d_1, d_2, P_1, P_2)$-sensitive
      - Ideally, $(P_1 - P_2)$ should be large and the gap between $d_1$ and $d_2$ should be small
  • 25. LSH in MAXDIVREL: publications as categorical data
  • 26. LSH in MAXDIVREL: characteristic matrix
  • 27. LSH in MAXDIVREL: minhashing
      - Publications are no longer handled directly; each one is represented by a signature
      - Technique: randomly permute the rows of the characteristic matrix m times; for each permutation and each publication column, take the number of the first row, in the permuted order, in which that column has a 1
      - Advantage: reduces the dimensions to a small minhash signature
      - [Figure: first permutation of the rows of the characteristic matrix]
  • 28. LSH in MAXDIVREL: signature matrix (fast minhashing)
      - Select m random hash functions to model the effect of m random permutations
      - This is mathematically proved only when the number of rows is a prime
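The fast-minhashing idea can be sketched as follows. The sketch assumes hash functions of the form h(x) = (a·x + b) mod p over row numbers, with p a prime equal to the number of rows; the slide does not give the hash family, so that form, the coefficient arrays and the class name `MinHash` are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Set;

class MinHash {
    /**
     * Minhash signature of one publication. rows holds the row numbers in which
     * the publication's column of the characteristic matrix has a 1. The i-th
     * entry is the minimum of (a[i] * x + b[i]) mod p over those rows.
     */
    static int[] signature(Set<Integer> rows, int[] a, int[] b, int p) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int x : rows) {
            for (int i = 0; i < a.length; i++) {
                int h = (int) (((long) a[i] * x + b[i]) % p);
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    /** Fraction of agreeing positions, an estimate of the Jaccard similarity. */
    static double estimate(int[] s1, int[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) same++;
        }
        return (double) same / s1.length;
    }
}
```

For example, with p = 5 and the two functions (x + 1) mod 5 and (3x + 1) mod 5, a publication with 1s in rows 0 and 3 gets the signature [1, 0]. With only two functions the estimate is coarse; it tightens as m grows.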
  • 29. LSH in MAXDIVREL: LSH buckets
      - Take signature vectors of size r from the minhash signature of size m
      - Map them into L hash tables, each with an arbitrary number b of buckets
  • 30. LSH in MAXDIVREL: how to select L and r? For two vectors x, y: $JD(x, y) = 1 - JSIM(x, y)$, where $JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$. (1) $L \times r = m$. (2) Similarity threshold $s \approx (1/L)^{1/r}$.
  • 31. LSH in MAXDIVREL: analysis. For two vectors x, y: $JD(x, y) = 1 - JSIM(x, y)$, where $JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$.
      - For publications x and y, $JSIM(x, y) \propto \Pr[H(x) = H(y)]$
      - At a particular hash table, x and y map into the same bucket with probability $JSIM(x, y)^r$ (all r entries of the signature vector agree), and do not map into the same bucket with probability $1 - JSIM(x, y)^r$
      - Across L hash tables, x and y never map into the same bucket with probability $(1 - JSIM(x, y)^r)^L$, and share a bucket in at least one table with probability $1 - (1 - JSIM(x, y)^r)^L$
      - True near neighbors will be unlikely to be unlucky in all the projections
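The effect of L and r can be checked numerically with a small sketch. It takes the per-table exponent to be the signature-vector size r, which is an assumption consistent with the threshold $(1/L)^{1/r}$ on the previous slide; the class and method names are hypothetical.

```java
class LshCurve {
    /** Probability that two publications with Jaccard similarity s share a bucket in at least one of L tables. */
    static double candidateProbability(double s, int r, int L) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, r), L);
    }

    /** Approximate similarity threshold (1/L)^(1/r), where the curve rises most steeply. */
    static double threshold(int r, int L) {
        return Math.pow(1.0 / L, 1.0 / r);
    }
}
```

For m = 100 split as L = 20 and r = 5, the threshold is about 0.55. A pair with similarity 0.8 becomes a candidate with probability of roughly 0.9996, while a pair with similarity 0.2 does so with probability below 0.01.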
  • 32. LSH in MAXDIVREL: batch-wise Top-k computation
      - Bucket “winner”: the publication that has the highest relevancy score in the bucket
      - The winner is dominant and represents its bucket neighborhood
      - Top-k: the k “winners” that have a majority of votes
      - The k winners are independent
      - [Figure: publications PA to PH in the i-th window]
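The winner-and-vote step might look like the sketch below. The slide does not spell out how votes are counted, so the sketch assumes that a publication earns one vote for every bucket it wins across the L hash tables, and that ties in votes are broken by relevancy. The input layout (`bucketOf[t][p]` is the bucket of publication p in table t) and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BucketVoting {
    /** Returns up to k publications ranked by the number of buckets they win. */
    static List<Integer> topK(int[][] bucketOf, double[] r, int k) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (int[] table : bucketOf) {
            // bucket id -> publication with the highest relevancy in that bucket
            Map<Integer, Integer> winner = new HashMap<>();
            for (int p = 0; p < table.length; p++) {
                Integer current = winner.get(table[p]);
                if (current == null || r[p] > r[current]) {
                    winner.put(table[p], p);
                }
            }
            for (int w : winner.values()) {
                votes.merge(w, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(votes.keySet());
        ranked.sort((x, y) -> {
            int byVotes = Integer.compare(votes.get(y), votes.get(x));
            return byVotes != 0 ? byVotes : Double.compare(r[y], r[x]);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}
```

When a new publication arrives, only the buckets it maps into need their winner re-checked, which is what makes the incremental variant cheap.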
  • 33. LSH in MAXDIVREL: incremental Top-k computation. For a new publication i: (1) update the i-th characteristic vector in the characteristic matrix; (2) generate the i-th minhash signature in the signature matrix; (3) map the i-th signature into the L hash tables; (4) update the “winner” of each bucket that the i-th signature maps into; (5) vote for the Top-k candidates.
  • 34. LSH in MAXDIVREL: When new publication F arrives… Only buckets $B_{13}$, $B_{23}$, $B_{32}$, $B_{43}$ will vote. Follow continuity requirements: durability and order. [Figure: publications $P_A$ … $P_H$ with the i-th window and the (i+1)-th window]
  • 35. Implementation
  • 36. Cloud service modules. [Figures. Source: Amazon Kinesis; Source: Amazon ElastiCache]
  • 37. Evaluation: Dataset. Publication stream: Amazon online marketplace data available on 17th–19th November 2014. Zipfian subscriptions: $zipf(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} (1/n^{s})}$, where N is the number of elements in the distribution, k is the rank of the element, and s is the value of the exponent. Normalized preferences. Total number of subscriber views $= \sum_{i=2}^{32} \left( {}^{48}C_{i} + {}^{42}C_{i} + {}^{54}C_{i} + {}^{66}C_{i} + {}^{57}C_{i} + {}^{67}C_{i} \right)$.
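The Zipf probability used for the subscriptions can be computed directly from the formula. This is a minimal sketch; `sample_ranks` is a hypothetical helper showing one way Zipf-distributed ranks could be drawn, and is not taken from the thesis.

```python
import random


def zipf_pmf(k: int, s: float, N: int) -> float:
    """zipf(k; s, N) = (1 / k^s) / sum_{n=1..N} (1 / n^s)."""
    norm = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / norm


def sample_ranks(s: float, N: int, count: int, seed: int = 0):
    """Draw `count` ranks in 1..N with Zipf probabilities (hypothetical helper)."""
    rng = random.Random(seed)
    weights = [zipf_pmf(k, s, N) for k in range(1, N + 1)]
    return rng.choices(range(1, N + 1), weights=weights, k=count)


if __name__ == "__main__":
    print([round(zipf_pmf(k, 1.0, 5), 3) for k in range(1, 6)])
    print(sample_ranks(1.0, 5, 10))
```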
  • 38. Evaluation: Methodology. Subscriber Effectiveness: quality, accuracy, resiliency, freshness. Performance & Efficiency: index construction time, Top-k matching time. Platform: Amazon AWS; Linux-based micro-node instances, each with 2.3 GHz and 8 GB memory. Algorithms are implemented in Java.
  • 39. Subscriber Effectiveness: Quality or natural behavior. Testing the Zipf or power-law hypothesis on the distribution of ranked results (KS test): i. Fitting power law; ii. Goodness-of-fit tests; iii. Alternative distributions. Computed 19030 ranked distributions over a 100K publication stream, under different subscriber views and under different-sized sliding window instances. [Figure: sample distribution of ranked votes; axes: log zipf_prob(rank) vs. log(rank)] $zipf(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} (1/n^{s})}$, where N is the number of elements in the distribution, k is the rank of the element, and s is the value of the exponent.
  • 40. Subscriber Effectiveness: i. Fitting power law. [Figure: Zipf exponent values]
  • 41. Subscriber Effectiveness: i. Fitting power law. [Figure: illustration of the convergence of Zipf exponent values]
  • 42. Subscriber Effectiveness: i. Fitting power law. [Figure: Zipf exponent values under different similarity thresholds]
  • 43. Subscriber Effectiveness: ii. Goodness-of-fit tests. $\gamma_1 = \max_{x \ge x_{min}} |f(x) - g(x)|$, where $f(x)$ is the observed rank CDF and $g(x)$ is the perfectly fitted CDF. $p\text{-value} = \frac{\text{number of } \gamma_i \text{ where } \gamma_i > \gamma_1}{i}$, with $i = 1000$ synthetic Zipf datasets. [Figure: p-values of the KS test under different subscriber views]
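The goodness-of-fit procedure on this slide can be sketched as below. This is a simplified illustration, assuming $x_{min} = 1$ and a fixed exponent s for every synthetic dataset (a fuller procedure would refit the exponent on each synthetic dataset); the function names are hypothetical.

```python
import random


def zipf_cdf(s: float, N: int):
    """Cumulative Zipf probabilities for ranks 1..N."""
    weights = [1.0 / n ** s for n in range(1, N + 1)]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    return cdf


def ks_statistic(sample, cdf):
    """gamma = max over ranks of |observed CDF - fitted CDF|."""
    N, n = len(cdf), len(sample)
    counts = [0] * (N + 1)
    for rank in sample:
        counts[rank] += 1
    running, gamma = 0, 0.0
    for k in range(1, N + 1):
        running += counts[k]
        gamma = max(gamma, abs(running / n - cdf[k - 1]))
    return gamma


def ks_p_value(sample, s, N, trials=1000, seed=0):
    """Fraction of synthetic Zipf datasets whose gamma_i exceeds the observed gamma_1."""
    rng = random.Random(seed)
    cdf = zipf_cdf(s, N)
    gamma_1 = ks_statistic(sample, cdf)
    exceed = 0
    for _ in range(trials):
        synthetic = rng.choices(range(1, N + 1), cum_weights=cdf, k=len(sample))
        if ks_statistic(synthetic, cdf) > gamma_1:
            exceed += 1
    return exceed / trials
```

A large p-value means the observed ranks are no further from the fitted Zipf law than typical synthetic Zipf data, so the hypothesis is not rejected.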
  • 44. Subscriber Effectiveness: iii. Testing alternative distributions
  • 45. Subscriber Effectiveness: Other diversity-based methods. P-dispersion problem: MAXMIN, MAXSUM. Minimum independent-dominating set problem: MAXDIVREL, DisC. For an even comparison, relevancy is combined into every diversity method to achieve a bi-criteria objective. [Figure: average Zipf law exponent in comparison with other methods]
  • 46. Subscriber Effectiveness: Other diversity-based methods. P-dispersion problem: MAXMIN, MAXSUM. Minimum independent-dominating set problem: MAXDIVREL, DisC. [Figure: a comparison of the average Zipf law exponent with other methods]
  • 47. Subscriber Effectiveness: Accuracy of Top-k results. LSH Index vs. NAÏVE: rank probability and diversity probability. [Figures: accuracy at similarity threshold = 0.55; accuracy at similarity threshold = 0.85]
  • 48. Subscriber Effectiveness: Resiliency of Top-k results. [Figures: getting Top-k publications (unordered); getting Top-k publications (ordered)]
  • 49. Performance. [Figures: index construction time, opIndex vs. modified opIndex; subscription index update time, opIndex vs. modified opIndex]
  • 50. Efficiency: Initial matching time at modified opIndex. [Figures: initial matching time under different sizes of subscription space; initial matching time under different numbers of publications]
  • 51. Performance & Efficiency: LSH Index. [Figure: BLSH index construction + update time for different numbers of minhash functions] Number of minhash functions $m = \frac{1}{(\text{estimated error})^{2}}$. How much accuracy do we sacrifice by comparing small minhash signatures?
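The trade-off between signature size and estimation error can be read off the formula. This is a minimal sketch, assuming the relation $m = 1 / (\text{estimated error})^2$ exactly as stated on the slide; the function names are hypothetical.

```python
import math


def minhash_functions_for(error: float) -> int:
    """Number of minhash functions m = 1 / error^2, rounded up to a whole function."""
    return math.ceil(1.0 / error ** 2)


def estimated_error_for(m: int) -> float:
    """Inverse relation: estimated error = 1 / sqrt(m)."""
    return 1.0 / math.sqrt(m)


if __name__ == "__main__":
    # Halving the error quadruples the signature length.
    for error in (0.5, 0.25, 0.1):
        print(error, minhash_functions_for(error))
```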
  • 52. Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE. [Figure: publication stream $P_1$ … $P_8$ with four instances labelled "BLSH or NAIVE" and one labelled "ILSH"]
  • 53. Performance & Efficiency: BLSH vs. NAÏVE. [Figures: log(ranking time) against number of publications with D=250; log(ranking time) against number of publications with D=500]
  • 54. Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE. [Figures: log(ranking time) against number of publications with D=250; log(ranking time) against number of publications with D=500]
  • 55. Conclusions. Diversified results produced by MAXDIVREL, based on the independent-dominating set problem, exhibit strong natural behavior compared with methods based on the p-dispersion problem. Relevancy is an important factor to employ in distance-based diversity methods; it always tends to produce a diverse set of personalized results. Absolute ranks are sensitive to the preference value, while the deviation among relative ranks stays small.
  • 56. Conclusions (Ctd.). The Locality Sensitive Hashing (LSH) indexing method produces the MAXDIVREL diverse set of results at an average of 70% accuracy relative to the naïve method, and reduces the matching time significantly over the NAÏVE method. It is further refined by its incremental version for handling streaming publications, which avoids the curse of re-computing neighborhoods. There is no fixed k to restrict the delivery of Top publications: given a window size and a delivery method, the model can produce the best diverse set of personalized results to represent the set of all matching publications at a given instance.
  • 57. Major Contributions. Dynamic diversification method based on the independent-dominating set problem: we introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, which employs relevancy. Index-based diversification approach to rank results incrementally: we proposed a novel hashing-based index approach to solve the MAXDIVREL continuous k-diversity problem, based on the Locality Sensitive Hashing (LSH) technique. Advanced evaluation method to measure the quality of diverse results: the first significant attempt to model the natural behavior of diversity methods in the pub/sub community.
  • 58. Future work. Explore other suitable use cases for the proposed model and develop prototype applications, e.g. a personalized newspaper for every Facebook user, a diverse set of personalized Twitter trends, or social annotation of news stories. Exploit the overlap among the diversified results of users who have similar interests. Employ existing implicit methods to extract human preferences, e.g. click-stream analytics. Develop the LSH-based index for a multi-threaded distributed environment.
  • 60. Appendix: Freshness. Mean delay between publications = 5000 ms. [Figure: a comparison between relevancy scores after being influenced by freshness]
  • 61. Appendix: NAÏVE ranking time. [Figure: average naïve Top-k matching time against the size D of publications]
  • 62. Appendix: BLSH ranking time. [Figure: average BLSH Top-k matching time against the size D of publications]
  • 63. Appendix: ILSH ranking time. [Figure: average ILSH Top-k matching time against the size D of publications]

Editor's Notes

  1. to overcome the drawbacks identified in traditional pub/sub systems