[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

A Publish/Subscribe Model for
Top-k Matching Over
Continuous Data-streams
Author:
Y.S. Horawalavithana
10002103
Supervisor:
Dr. D.N. Ranasinghe

Outline
• Motivation
• Research Problem
• Re-cap proposal defense!
• Design & Architecture
• Related Work
• Contribution
• Scoring Algorithm
• Query Personalization
• Events Novelty
• Relevancy + Freshness
• MAXDIVREL Diversity
• Dual-Indexing mechanism
• To Do List

Motivation – “The Big Filter”

General Publish/Subscribe Model

Drawbacks in Boolean Matching
Traditional
Publish/SubscribePublish
Subscribe
Notify
Bob likes to update about
smartphones. He prefers to get
notify on products from Verizon &
AT&T.
But Ideally Bob prefers to get
notify on products from Verizon
only if there are not enough
notifications from AT&T.

Drawbacks in Boolean Matching (Contd.)
• Subscriptions & matching publications are
considered as equally important.
• Publications are delivered to Bob
whenever there is a satisfied
subscription.
• Bob may be either overloaded with
publications or receive too few
publications over time,
• Impossible to compare different matching
publications with respect to Bob’s
subscriptions as ranking functions are not
defined, and
• Partial matching between subscriptions and
publications is not supported.

Top-k Publish/Subscribe
• Expressive stateful query processing systems
• to overcome the drawbacks identified in traditional pub/sub systems
• User defined parameter k restricts the delivered publications
• Pub/Sub Matching?
• Top-k pub/sub scoring or ranking
• Pub/Sub Indexing?
• Indexing to support personalized subscriptions
• Indexing to support continuous Top-k publications retrieval

Research Goal
How to alleviate the Information Overload
problem based on publish/subscribe
communication paradigm which is augmented
by different scoring mechanisms over
continuous information-streams?

Research Problem
1. How to define an efficient scoring algorithm by integrating query
independent & dependent score metrics taken into account?
- Relevance, Freshness & Diversity
2. How to adapt existing indexing data structures used in state-of-the-
art publish/subscribe systems under
a) large subscription volume,
b) high event rate(velocity) and,
c) the variety of subscribable attributes,
to support top-k matching queries?

Centralized Top-k Publish/Subscribe

Why not client centered Top-k matching with
Traditional pub/sub layer on Top?
• In subscriber point of view,
• We support partial matching between subscriptions & publications
• Personalized subscriptions
• We address the overlapping interest of many subscribers
• Experiment with system resiliency: Retrieve Top-k results on domain knowledge
• We can have large volume of subscription space with variety of attributes through an
efficient in-memory indexing mechanism
• In publisher point of view,
• Depend on the order of incoming matched publications

Expire
Expire
Publication
Store
Subscription
Store
Subscription
Indexing
Relevance
Matching
Publication
Stream
Matching
Publication
Store
Publication
(Relevance
Score)
Publication
Indexing
Top-k
Continuous
Diversity
Personalized
Subscription
Personalized
Subscription
Personalized
Subscription
Dissimilarity
Relevancy
Event
Delivery
Top-k
Notification
Store
Notification
Notification
Notification
Sliding window

k-index(Whang2009)
BE*Tree-index(Sadhogi2012)
gridIndex(Pripuzi2012)
opIndex(Zhang2014)
MAXMIN Diversity/ Cover
Tree(Drosou2014)
Pref_pub/sub(Drosou2009)
Top-k/w pub/sub (Pripuzi2012)
Forward_Decay (Cormode2009)
Binary_Decsions (Campailla2001)
Publication_Aging (Shraer2013)
Pref_pub/sub with
diversity (Pitoura2009)
DIsC_diversity (Drosou2012)
Top-k representative Queries
(Ranu2014)

Comparison: Subscription (Contd.)
Typical Pub/Sub
• Just matching a publication
whenever there’s a satisfied
subscription
Top-k Pub/Sub
• A publication is scored against a
satisfied subscription space
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
Item = Smartphone
Carrier = AT&T
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
Item = Smartphone
Carrier = AT&T

Comparison: Subscription
Typical Pub/Sub
• All subscriptions are considered
equally
• No personalized subscriptions
Top-k Pub/Sub
• Subscribers can express some
events are more important than
others by ranking subscriptions
• can have a degree of user interest
over subscription space
• limit redundancy by avoiding
results with overlapping content
• “AT&T Smartphone" include in
“Smartphone“
• Make rare events visible

How to assign preference over subscription?
Quantitative approach
• Assign interest to each
subscription
Qualitative approach
• Specify the interest between two
subscriptions
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
0.7
0.5
0.9
Item = Smartphone
Item = Smartphone
Carrier = AT&T
Carrier = AT&T
>
<

Personalized subscriptions
Explicit Global Ordering Explicit Local Ordering Explicit Local + Implicit Global Ordering
Subscription Preferences Attribute Preferences Attribute-Subscription Preferences
Carrier = AT&T
OS = Android
0.9
Carrier = Verizon
OS = iOS
0.7
>
Carrier = AT&T
Carrier = Verizon
>
OS = iOS
OS = Android
<
Carrier = AT&T (0.6)
OS = Android (0.3)
Carrier = Verizon (0.2)
OS = iOS (0.5)
OS = iOS (0.7)
Brand = Apple (0.4)

We Propose: Relating Attributes
a) Subscription covering b) Subscription Merging c) Relating Attributes
attribute1
attribute2
attribute1
attribute2
attribute1
attribute2
S1
S2
S3
S1 S2
S3

Relating Attributes: Demonstration
• Let's assume that, Bob would like to get notify on products related
with following personalized queries:

Brand=HTC(0.3)
Storage ≤ 32GB (0.6)
2
Carrier = Verizon (0.5)
Storage ≤ 32GB (0.2)
2.5
Storage ≤ 16𝐺𝐵(0.7)
1.75
Brand = HTC (0.3)
1.3
2.3

2
Carrier = Verizon
Storage ≤ 32GB
2.5
Carrier = AT&T
Storage ≤ 16𝐺𝐵
1.75
Brand = HTC
1.3
2.3

• A seller pushes a product

Subscription Indexing
• Can have a performance bottleneck when,
• Matching between publication & user personalized subscription space.
• Extensively studied in pub/sub community
• Don’t re-invent the wheel
• We extend an existing indexing mechanism to,
• Apply our personalized subscription model

Decision Making
opIndex
• Dynamically adopt to the variety of
attributes
• Two-space partitioning
• Attribute & operator
• Can support a wide range of operators
• Ex: Regular Expression
• Perform better when subscription
space become larger
• index construction time,
• memory cost and,
• query processing time.
k-Index, BE* Index
• Can’t deal with the variety of attributes
• Three-space partitioning
• Subscription size, Attribute & Value
• Supports only a small set of operators
• Are outperformed by opIndex

Events Novelty
• Motivation:
• A popular news pub/sub system like Google news maintain publications
within last 30 days, but most of the time produce top-k results within last day
or two.
• Most important in Top-k computation,
• Demonstration using time policy to compute Top-k results

When to compute Top-k results?
• Our matching model deal with continuous data-stream
• Impossible to filter an unbounded stream
• We should have a time policy to compute Top-k results per subscription
I. Continuous
II. Periodic
III. Sliding Windows

Sliding Window Top-k computation
• Compute top-k results based on publications within moving windows
(time or events) e.g. w=2
P1 P2 P3 P4 P5 P6 P7 P8 P9 …
T 2T 3T 4T 5T
P1 P2 P4

Remark: Sliding Window
• Adaptive than continuous & periodic
• when w = 1; act as continuous
• when w = T; act as periodic
• But here w is Flexible
• We can dynamically change w based on event arrival rate
• Can address streams other than Poisson distribution
• Without losing generality, our model based on sliding event windows
• But when event window becomes larger?

Freshness: Time Decaying
Problem
• Older publications may prevent the newer publications to enter into
top-k results
Solution
• Lease or Expire using a time decay function
• We combine Freshness with relevancy score

Time Decaying Function
• We consider “Forward decay” to compute the publication age
• So we don’t have to compute the decay score each window

Event Diversity
• In Top-k publish/subscribe,
• getting a diverse results within Top-k publications play a major role
• As an example, Bob would like to get notify about smart-phones from
the carrier=AT&T and brand=HTC.
• Without the notion of diversity, delivered top-k publications may have much
similarity between them.
• Even though, the received publications are personalized, Bob may recognize
such a system as not effective.

Define Diversity: Taxonomy
Result
Diversification
Dissimilarity Coverage Novelty
Discrete or
continuous domain

Dissimilarity
• Choosing to deliver items that are dissimilar to each other
• P-dispersion problem
• Selecting k items out of n, such that, the average pairwise distance between
the selected items is maximized
• NP-Hard
• k-diversity problem
• Is based on p-dispersion problem
• Rely on heuristics to solve large instance of the problem

K-diversity problem
• Let P be the set of matching publications; |P| = n, and given a
distance metric d to express the dissimilarity between publication
points, finding the diverse set 𝑆∗of P such that
𝑆∗ = arg max 𝑓 𝑆, 𝑑 ;

MAXDIVREL Diversity
Address continuous k-diversity problem

Not to reinvent the wheel
• Most diversity definitions are aligned with,
• P-dispersion problem
• Here, we do consider to combine diversity & relevancy as,
• mono-objective formulation
• Not more based on p-dispersion

Beyond Diversity & Relevance
• We select a set of diverse set which,
• increase the "global" importance of a selected publication, and
• reduce the "global" importance of a non-selected publication.
• We define the problem in static version,
• MAXDIVREL k-diversity problem
• We define the problem in continuous version,
• MAXDIVREL continuous k-diversity problem

Definition: MAXDIVREL (static version)

MAXDIVREL k-diversity problem
• Can map into Top-k representative query problem in graph databases
which is NP-Hard
• Specialized version of set cover problem
• Can prove! 

MAXDIVREL k-diversity Algorithm: Greedy

MAXDIVREL Continuous k-diversity problem
• Continuity Requirements
• Durability
• an item is selected as diversified in 𝑖 𝑡ℎ window may still have the chance to be in
𝑖 + 1 𝑡ℎ window if it's not expired & other valid items in 𝑖 + 1 𝑡ℎwindow are failed to
compete with it.
• Order
• Publication stream follow the chronological order
• We avoid the selection of item j as diverse later, when we already selected an item i
which is not-older than j.

Definition: MAXDIVREL (continuous version)

MAXDIVREL continuous k-diversity problem
• Apply MAXDIVREL k-diversity Greedy algorithm in each window
• Time complexity
• When re-calculating neighborhood
• We propose an incremental MAXDIVREL algorithm
• Calculate neighborhood at window 𝑖 + 1 𝑡ℎ using already calculated neighborhood at
window 𝑖 𝑡ℎ
• Indexing publications at each window
• Combine with subscription indexing
• Dual-indexing mechanism!

To Do List: Implementation
• Indexing based on inverted-index
• Why inverted index?
• Centralized, will try Cloud Based
• Using message broker system E.g. RabbitMQ, ZeroMQ, ActiveMQ
• Why RabbitMQ?

To Do List: Evaluation
• Multiple Directions
• Zipf property
• Using synthetic & real data-set (e.g. zipf distribution tool, Ebay, AOL Query logs)
• Algorithm efficiency
• Experiment with,
• The volume of subscriptions
• The variety of publications
• The arrival rate of publications (e.g. dynamic sliding window model)
• Using POIKILO evaluation tool
• Dual-Indexing Performance & Scalability
• Experiment with,
• Index construction time at each window
• Memory cost
• Query processing time (e.g. Neighborhood calculation)

Thank You!
Your review will be Golden!
Welcome to read the design chapters!

[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (7)

Similar to [Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

Similar to [Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams (20)

More from Sameera Horawalavithana

More from Sameera Horawalavithana (16)

Recently uploaded

Recently uploaded (20)

[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

Editor's Notes