Efficient Filtering in Pub-Sub Systems using BDD

Efficient Filtering in Publish-Subscribe Systems using BDD
Alexis Campailla, Sagar Chaki, Edmund Clarke,
Somesh Jha, Helmut Veith

Prepared by Nabeel Mohamed
4/16/08


Outline
 Research problem at hand
 Content-based Publish-Subscribe
 Subscription Query Language
 BDD Semantics
 BDD Based matching
 Experimental Results
 Discussion (Pros and Cons)


Research Problem at Hand
 Loosely coupled interactions in
publish-subscribe systems make it possible
to build very large-scale systems
 However, the filtering techniques in use
are a major bottleneck
 The efficiency of the filtering technique
plays a major role in scalability
 Whatever technique is used should be
provably correct

Major Contributions
 A precise semantics for matching
messages (events) to subscriptions
(subscription queries)
 Modeling filtering as a satisfiability
check on BDDs

Roadmap
 BDD Semantics


Publish-Subscribe Systems
[Figure: publishers and subscribers connected through a network of distributed content routers that handle subscription management and routing. Subscribers call Subscribe() and Unsubscribe(), publishers publish events, and the routers deliver them to subscribers via Notify().]

Publish-Subscribe Systems
 Publishers and subscribers are
loosely coupled
◦ Space decoupled
◦ Time decoupled
◦ Synchronization decoupled
 Content routers (brokers) form a
structured P2P system
 Together, these properties yield scalable systems

Message (Event) Filtering
 Filtering
◦ Matching incoming messages (events)
generated by Publishers with subscription
criteria
◦ A main task of content routers (brokers) –
filtering engine
 Content-based pub-sub systems route
messages (events) based on the content
itself
 Example: in a financial ticker, deliver only
quotes with symbol = Google and
offer price < 400.
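The ticker example can be sketched as a predicate over event content. This is only an illustration: the field names `symbol` and `offer_price` and the sample quotes are assumptions, not taken from any real feed.

```python
def matches(event):
    """Hypothetical subscription: symbol = Google and offer price < 400."""
    # A missing price is treated as "no match" by defaulting to infinity.
    return (event.get("symbol") == "Google"
            and event.get("offer_price", float("inf")) < 400)

# Made-up quotes used only to exercise the filter.
quotes = [
    {"symbol": "Google", "offer_price": 395.0},
    {"symbol": "Google", "offer_price": 410.0},
    {"symbol": "IBM", "offer_price": 120.0},
]

# A content router would forward only the events that satisfy the predicate.
delivered = [q for q in quotes if matches(q)]
```

Routing here depends only on what the event contains, not on a topic or channel name.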

Example Pub-Sub Systems
 Stock market feeds
◦ For delivery of financial data such as
stock quotes, trade reports, news, etc. to
customers
◦ OPRA feed disseminates more than
100,000 quotes/sec
 Sensor networks
 Network traffic analysis
 Transaction log analysis


Desirable Functions of a Filtering
Engine
 Correctness:
◦ Correctly matching incoming messages with
subscription criteria
 Expressiveness:
◦ Rich subscription language
 Efficiency:
◦ Real time matching
 Scalability:
◦ Handling a large number of subscriptions
 Dynamic:
◦ Capability to add and remove subscriptions
online

Related Work
 Most existing systems support only
conjunctive subscriptions
◦ GRYPHON
◦ SIENA
◦ Le Subscribe
 Example: the subscription shown on this slide
(formula not recovered) requires 27 GRYPHON-like
conjunctive subscriptions, while a BDD handles it
naturally.

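To see how a count like 27 can arise, here is a sketch under an assumed subscription shape: a conjunction of three disjunctions with three alternatives each. The attribute names and values are hypothetical and are not the slide's lost formula.

```python
from itertools import product

# Hypothetical subscription: each attribute may take any one of three values.
alternatives = {
    "company": ["IBM", "Dell", "Siemens"],
    "cpu": ["800 MHz", "900 MHz", "1000 MHz"],
    "os": ["Linux", "Windows", "Solaris"],
}

def to_conjunctive(alternatives):
    """Expand (a1 or a2 or a3) and (b1 or ...) and ... into purely conjunctive subscriptions."""
    attrs = list(alternatives)
    return [dict(zip(attrs, combo))
            for combo in product(*(alternatives[a] for a in attrs))]

conjunctive = to_conjunctive(alternatives)  # 3 * 3 * 3 = 27 conjunctive subscriptions
```

A system limited to conjunctive subscriptions has to register every element of `conjunctive` separately, whereas a Boolean query language can state the subscription once.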

Related Work
 Some systems have higher expressive
power at the expense of less efficient
filtering.
◦ ELVIN
 Can we come up with an efficient
filtering technique while providing an
expressive subscription language?
 BDD based filtering may be employed
in existing systems to improve
matching efficiency


Subscription Query Language
 The language used to describe
subscription criteria or subscriptions
 Three Subscription Languages of
increasing complexity
◦ SiSL – Simple Subscription Language
◦ StSL – Strict Subscription Language
◦ DeSL – Default Subscription Language


Messages and Attributes
 V = <v1, ..., vn> = a finite sequence of
attributes
 Each attribute vi has a type type(vi),
e.g. STR or DBL
 Each attribute vi has a corresponding
domain dom(vi) of possible values
 Event schema = <type(v1), ..., type(vn)>

Messages and Attributes
 A message = an assignment of values
to some (not necessarily all) of the
attributes
 Formally, a message is a mapping m from
attributes to values such that for each
attribute v, either m(v) is a value from the
domain of v, or m(v) = * (m does not define v)
 A message is total if it defines all
attributes in V.

Messages and Attributes –
Example 1
 Let V = <company, product, price>
over the event schema <STR, STR,
DBL>
 Consider the following message:
<company> IBM </company>
<product>PC AT, 20 Mhz, 256 KB RAM</product>
<price>5000</price>
 This describes a
total message m1
where m1(company) = “IBM”,
m1(product) = “PC AT, 20 Mhz, 256 KB
RAM” and m1(price) = 5000.


Messages and Attributes –
Example 2
 Consider the following message:

<company> IBM </company>
<product>PC AT, 20 Mhz, 256 KB RAM</product>

 This describes a different message m2
which is not total (i.e. partial), since
m2(price) = *.
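The two example messages can be sketched as Python mappings. The sentinel `UNDEFINED` stands in for the slides' `*`; the helper names `make_message` and `is_total` are illustrative assumptions.

```python
UNDEFINED = object()  # plays the role of "*" (attribute not defined)
V = ("company", "product", "price")  # attribute sequence from Example 1

def make_message(**values):
    """Build a message over V; attributes not supplied are left undefined."""
    return {v: values.get(v, UNDEFINED) for v in V}

def is_total(m):
    """A message is total if it defines every attribute in V."""
    return all(m[v] is not UNDEFINED for v in V)

m1 = make_message(company="IBM", product="PC AT, 20 Mhz, 256 KB RAM", price=5000.0)
m2 = make_message(company="IBM", product="PC AT, 20 Mhz, 256 KB RAM")
```

Here `m1` is total, while `m2` is partial because its `price` is undefined.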


Three Subscription Languages
 SiSL – Simple Subscription Language
◦ All messages are total
 StSL – Strict Subscription Language
◦ Messages define all attributes that occur in
the query (subscription criteria)
◦ SiSL is a subset of StSL
 DeSL – Default Subscription Language
◦ All attributes are initialized to default values
(e.g. using NULL)
◦ Extends the functionality of SiSL to
heterogeneous message formats


Formalizing SiSL Queries
(Subscriptions)
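 Atomic formulas
 Let v be an attribute in V
◦ If v is a numeric attribute (e.g. of type DBL)
and c is a constant from the domain of v,
then the formulas v = c, v < c, c < v
are atomic formulas.
◦ For attributes of other comparable types,
atomic formulas are defined similarly.
◦ If v is a string attribute (type STR) and c is a
string constant, then string comparisons between
v and c, such as a substring test, are
atomic formulas.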

Formalizing SiSL Queries
(Subscriptions)
 Atoms = the set of atomic formulas
 A query Q is a Boolean combination
of atomic formulas
 Var(Q) = the set of attributes occurring
in Q
 Atoms(Q) = the set of atomic formulas
occurring in Q

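Formalizing SiSL Queries
(Subscriptions)
 Abbreviations: derived comparisons are shorthand
for Boolean combinations of atomic formulas,
e.g. v ≤ c for (v < c) ∨ (v = c)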

Example: SiSL Query
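 The following SiSL query matches all
messages for 1000 MHz PCs
manufactured by IBM, Dell or Siemens
that cost at most $1000.
 Written over the attributes of Example 1, one
plausible form is:
(company = IBM ∨ company = Dell ∨ company = Siemens)
∧ (product contains "1000 MHz") ∧ (price ≤ 1000)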

Formalizing SiSL Queries
(Subscriptions)
 Q[m] = the instantiation of a query Q by
a message m.
 Definition:
Q[m] is the query obtained
from Q by replacing every attribute v
for which m(v) ≠ * by m(v).
 Definition:
The SiSL query Q matches the total
message m if Q[m] evaluates to true.
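SiSL matching (instantiate the query with a total message, then evaluate it) can be sketched as follows. The nested-tuple encoding, the operator names and the sample messages are assumptions chosen for illustration; the query is a reading of the "1000 MHz PC" example, not the slide's exact formula.

```python
# An atom is (attribute, operator, constant); compound queries are
# ("and" | "or", q1, q2, ...) or ("not", q).
OPS = {
    "=": lambda value, c: value == c,
    "<": lambda value, c: value < c,
    "<=": lambda value, c: value <= c,        # abbreviation for (v < c) or (v = c)
    "contains": lambda value, c: c in value,  # substring test on strings
}

def evaluate(query, m):
    """Replace each attribute by its value in the total message m and evaluate."""
    head = query[0]
    if head == "and":
        return all(evaluate(q, m) for q in query[1:])
    if head == "or":
        return any(evaluate(q, m) for q in query[1:])
    if head == "not":
        return not evaluate(query[1], m)
    attribute, op, constant = query
    return OPS[op](m[attribute], constant)

query = ("and",
         ("or", ("company", "=", "IBM"), ("company", "=", "Dell"),
                ("company", "=", "Siemens")),
         ("product", "contains", "1000 Mhz"),
         ("price", "<=", 1000.0))

cheap = {"company": "Dell", "product": "PC, 1000 Mhz, 128 MB RAM", "price": 950.0}
m1 = {"company": "IBM", "product": "PC AT, 20 Mhz, 256 KB RAM", "price": 5000.0}
```

With these inputs, `evaluate(query, cheap)` is true and `evaluate(query, m1)` is false, since `m1` fails both the product and the price atoms.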

Formalizing StSL Queries
(Subscriptions)
 StSL (Strict Subscription Language) is
a generalization of SiSL.
 Definition: adequacy
A message m is adequate for a query
Q if, for every attribute v occurring in Q,
it holds that m(v) ≠ *.
 Definition:
The query Q matches m iff m is
adequate for Q and the instantiation of Q
by m evaluates to true.
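A minimal sketch of strict (StSL) matching, assuming queries are encoded as nested tuples with only `=` and `<` atoms. The helper names are hypothetical, and `None` stands in for `*`.

```python
UNDEFINED = None  # plays the role of "*"
CONNECTIVES = ("and", "or", "not")

def attributes_of(query):
    """Attributes occurring in a query."""
    if query[0] in CONNECTIVES:
        return set().union(*(attributes_of(q) for q in query[1:]))
    return {query[0]}

def evaluate(query, m):
    head = query[0]
    if head == "and":
        return all(evaluate(q, m) for q in query[1:])
    if head == "or":
        return any(evaluate(q, m) for q in query[1:])
    if head == "not":
        return not evaluate(query[1], m)
    attribute, op, c = query
    return m[attribute] == c if op == "=" else m[attribute] < c

def is_adequate(m, query):
    """m is adequate for the query if it defines every attribute the query mentions."""
    return all(m.get(v, UNDEFINED) is not UNDEFINED for v in attributes_of(query))

def matches_strict(query, m):
    # Adequacy is checked first, so evaluate() never sees an undefined attribute.
    return is_adequate(m, query) and evaluate(query, m)

query = ("and", ("company", "=", "IBM"), ("price", "<", 1000.0))
```

A message such as `{"company": "IBM", "price": 900.0}` matches even though it leaves `product` undefined, because the query never mentions `product`; `{"company": "IBM"}` does not match because it is not adequate.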

Formalizing DeSL Queries
(Subscriptions)
 DeSL (Default Subscription Language)
is the most general out of the three.
 For each attribute vi, there is a default
value di from its domain
 Definition:
The default extension m' of m is
defined as follows:
m'(vi) = m(vi) if m(vi) ≠ *, and m'(vi) = di otherwise.

Formalizing DeSL Queries
(Subscriptions)
 Definition:
The query Q matches the message m
under default semantics if Q matches the
default extension of m (i.e. the instantiation
of Q by the default extension evaluates to true)
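Default (DeSL) matching can be sketched as "extend, then match as a total message". The default values below and the use of a plain Python predicate as the query are assumptions made for brevity.

```python
UNDEFINED = None  # plays the role of "*"

# Hypothetical default value for each attribute.
DEFAULTS = {"company": "", "product": "", "price": 0.0}

def default_extension(m):
    """Return a total message: defined values are kept, undefined ones get the default."""
    return {v: (m[v] if m.get(v, UNDEFINED) is not UNDEFINED else d)
            for v, d in DEFAULTS.items()}

def matches_default(predicate, m):
    """The query (here a predicate on total messages) is evaluated on the default extension."""
    return predicate(default_extension(m))

def cheap_ibm(t):
    return t["company"] == "IBM" and t["price"] < 1000.0

m2 = {"company": "IBM", "product": "PC AT, 20 Mhz, 256 KB RAM"}  # price undefined
```

With these assumed defaults, `m2` matches `cheap_ibm` because its missing price is filled with 0.0, so the choice of defaults directly affects which partial messages match.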


BDDs (Binary Decision
Diagrams)
 Notations
A = a set of propositional variables
< = a linear ordering (variable
ordering) on A
B = an ordered BDD over A, whose
non-terminal nodes are labeled by
variables in A and whose terminals are labeled 0 or 1
fv = the Boolean function
represented by node v in B

Properties of BDDs
 Each non-terminal node v has two out-
edges: low edge and high edge
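 Let v be a non-terminal node with label ai
whose successors along the low and high
edges are u and w respectively. Then the
function represented by v is
(¬ai ∧ fu) ∨ (ai ∧ fw), where fu and fw are the
functions represented by u and w.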

 ≡
 Size = # nodes in the BDD


Example: BDD
 The following BDD represents the
Boolean function x AND ( y OR z).
 The variable ordering is

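The node semantics can be checked on the example function x AND (y OR z). The sketch below assumes the variable ordering x < y < z and a hand-built node table; neither is taken from the slide's figure.

```python
from itertools import product

# node id -> (label, low successor, high successor); ids 0 and 1 are the terminals.
NODES = {
    2: ("z", 0, 1),
    3: ("y", 2, 1),  # y = 0: go on to test z; y = 1: result is 1
    4: ("x", 0, 3),  # x = 0: result is 0; x = 1: continue with (y OR z)
}
ROOT = 4

def f(node, assignment):
    """Function of a node: (not a and f_low) or (a and f_high), with a the node's variable."""
    if node in (0, 1):
        return bool(node)
    label, low, high = NODES[node]
    a = assignment[label]
    return (not a and f(low, assignment)) or (a and f(high, assignment))

# Compare the BDD against the formula on all eight assignments.
truth_table_ok = all(
    f(ROOT, dict(zip("xyz", bits))) == (bits[0] and (bits[1] or bits[2]))
    for bits in product([False, True], repeat=3)
)
```

The BDD needs three non-terminal nodes for this function under the assumed ordering.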

Shared BDDs (SBDDs)
 While an OBDD represents one Boolean
function, an SBDD represents multiple
Boolean functions.
 An SBDD is a collection of component
OBDDs respecting the same variable
ordering.
 An SBDD has a set of output nodes Vo =
{o1, …, on}, where each oi represents the
Boolean function fi of <f1, …, fn>.

SBDDs
 Every root node of a component
OBDD is in Vo
 Notation:
B(o1, …, on) denotes the BDD together with its
output nodes {o1, …, on}
 The minimal shared BDD for <f1, …, fn> is
polynomial-time computable from any other shared
BDD over A for <f1, …, fn>

Example: Shared BDD
 Node 1 represents


BDD Data Structure
 A BDD with n nodes is represented as a
graph whose vertices are the natural
numbers 1,…, n.
 The adjacency relationship is described
by an array of size n.
 ith element = (low[i], high[i], label[i],
value[i])
◦ low[i] = low successor of i
◦ high[i] = high successor of i
◦ label[i] = label of i
◦ value[i] = used later to store the result of the
BDD evaluation corresponding to i.


BDD Evaluation

 The EvalBDD algorithm computes the
value of each node in the BDD under a
truth assignment to the variables in A, where
the value of variable ai is the ith component
of the assignment

BDD Evaluation

 Notice that we can compute the values
of the Boolean functions associated with
all output nodes in one pass.
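A one-pass, bottom-up evaluation over the array representation might look like the sketch below. It is not the slide's EvalBDD listing. It assumes that node 1 is the 0-terminal, node 2 is the 1-terminal, and that every node has a larger index than its successors; the small shared BDD is hand-built for illustration.

```python
def eval_bdd(low, high, label, assignment):
    """Fill value[i] for every node i in a single pass over the arrays."""
    n = len(label) - 1                   # nodes are numbered 1..n; index 0 is unused
    value = [None] * (n + 1)
    value[1], value[2] = 0, 1            # terminal nodes
    for i in range(3, n + 1):            # successors were already evaluated
        child = high[i] if assignment[label[i]] else low[i]
        value[i] = value[child]
    return value

# Shared BDD for f1 = x AND (y OR z) (output node 5) and f2 = y OR z (output node 4).
low = [None, None, None, 1, 3, 1]
high = [None, None, None, 2, 2, 4]
label = [None, None, None, "z", "y", "x"]
outputs = [5, 4]

values = eval_bdd(low, high, label, {"x": False, "y": False, "z": True})
matched = [values[o] for o in outputs]   # f1 is 0, f2 is 1 under this assignment
```

Because node 4 is shared by both functions, its value is computed once and reused for both outputs.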

BDD Restrictions
 The idea is to restrict the possible
truth assignments to those under which an
external constraint f (a Boolean function
over A) evaluates to true
 Definition: f-restriction


Query BDDs
 Key Idea
◦ Represent many subscription queries by a
single shared BDD whose nodes
correspond to atomic sub-formulas of the
queries.
◦ Messages are matched against queries
by simply running EvalBDD on the shared
BDD.


Query BDDs
 Q1, …, Qn, a sequence of queries
over the set of attributes V
 A = the union of the atomic formulas occurring
in Q1, …, Qn, i.e. the set of atomic
sub-formulas of the queries.
 P is the set of propositional variables
such that each atomic sub-formula a
in A is assigned a propositional
variable pa
 Q'i = the Boolean query obtained from Qi by
substituting each a with pa

Example: Query BDDs
 Let & two subscriptions received

 Then, =

 Three atomic sub-formulas => Three
propositional variables

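Since the slide's two subscriptions were not recovered, the sketch below uses two hypothetical queries that share one atom, which also yields three atomic sub-formulas and three propositional variables. The tuple encoding and the names `p1`, `p2`, `p3` are assumptions.

```python
CONNECTIVES = ("and", "or", "not")

def atoms_of(query):
    """Atomic sub-formulas of a query, in order of first occurrence."""
    if query[0] in CONNECTIVES:
        found = []
        for q in query[1:]:
            for atom in atoms_of(q):
                if atom not in found:
                    found.append(atom)
        return found
    return [query]

def assign_variables(queries):
    """Give each distinct atom across all queries its own propositional variable."""
    variables = {}
    for q in queries:
        for atom in atoms_of(q):
            variables.setdefault(atom, f"p{len(variables) + 1}")
    return variables

def abstract(query, variables):
    """Boolean query obtained by substituting each atom with its variable."""
    if query[0] in CONNECTIVES:
        return (query[0],) + tuple(abstract(q, variables) for q in query[1:])
    return variables[query]

Q1 = ("and", ("price", "<", 1000.0), ("company", "=", "IBM"))
Q2 = ("or", ("price", "<", 1000.0), ("product", "contains", "PC"))
variables = assign_variables([Q1, Q2])
```

Here `abstract(Q1, variables)` is `("and", "p1", "p2")` and `abstract(Q2, variables)` is `("or", "p1", "p3")`; the shared atom `price < 1000` maps to the single variable `p1`, which is what lets a shared BDD reuse nodes across queries.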

Example: Query BDDs
 Let the variable order be

[Figure: SBDD corresponding to the queries]

Query Matching: SiSL
 Use EvalBDD algorithm for query
matching
 A query Qi is considered matched if
the BDD node corresponding to Qi
evaluates to 1.
 Bottom-up evaluation makes sure sub-
queries are evaluated only once.


Query Matching: DeSL
 Same as handling total
messages
 When a message is received, it is
extended to a total message (its default
extension) before performing the matching.

Query Matching: StSL
 Recall that a message m matches a
subscription Q iff m is adequate for Q
and m satisfies Q.
 A modified EvalBDD can be used to
perform faster matching
 Key Ideas
◦ An undefined atom renders all
sub-formulas in which it occurs undefined.
◦ Treat * as a new truth value, "undefined"

Query Matching: StSL

 MVEvalBDD for StSL is significantly
faster than EvalBDD for SiSL

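One possible reading of the two key ideas is a three-valued variant of the bottom-up pass, in which a node becomes undefined as soon as its own atom or either successor is undefined. This is a sketch under that assumption and not the MVEvalBDD listing from the slide. As before, node 1 is assumed to be the 0-terminal, node 2 the 1-terminal, and successors have smaller indices than their parents.

```python
UNDEF = "*"  # third truth value: undefined

def mv_eval_bdd(low, high, label, atom_value):
    """Three-valued bottom-up evaluation; atom_value maps a label to True, False or UNDEF."""
    n = len(label) - 1
    value = [None] * (n + 1)
    value[1], value[2] = 0, 1
    for i in range(3, n + 1):
        a = atom_value[label[i]]
        if a == UNDEF or value[low[i]] == UNDEF or value[high[i]] == UNDEF:
            value[i] = UNDEF             # undefinedness propagates upwards
        else:
            value[i] = value[high[i]] if a else value[low[i]]
    return value

# Same hand-built shared BDD shape as for plain evaluation:
# node 5 represents x AND (y OR z), node 4 represents y OR z.
low = [None, None, None, 1, 3, 1]
high = [None, None, None, 2, 2, 4]
label = [None, None, None, "z", "y", "x"]

values = mv_eval_bdd(low, high, label, {"x": UNDEF, "y": True, "z": False})
```

In this run the atom `x` is undefined, so node 5 evaluates to `*` and its query is not matched, while node 4 does not involve `x` and still evaluates to 1.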


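Number of Nodes in SBDD vs. Number of
Subscriptions

[Figure: SBDD node count plotted against the number of subscriptions]
 The number of nodes scales almost linearly
with the number of subscriptions
◦ High scalability
 Restriction further reduces the node count,
minimizing memory requirements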

Matching time for SiSL and StSL

[Figure: matching time for StSL queries]

 Inputs: number of subscription queries
and message density (how close messages are to total)
 Partial messages can be matched
quickly.


Variable Ordering vs. BDD size
 Variable ordering has a tremendous
influence on BDD size.
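The effect can be reproduced with the classic textbook function (x1 AND y1) OR (x2 AND y2) OR (x3 AND y3). This is a standard illustration and not necessarily the figure from the slide. The sketch counts the non-terminal nodes of the reduced ordered BDD by brute-force expansion, which is only practical for a handful of variables.

```python
def robdd_size(f, order):
    """Number of non-terminal nodes in the reduced ordered BDD of f under the given order."""
    unique = {}  # (variable, low id, high id) -> node id; ids 0 and 1 are the terminals

    def build(level, partial):
        if level == len(order):
            return int(f(partial))       # all variables assigned: terminal 0 or 1
        var = order[level]
        lo = build(level + 1, {**partial, var: False})
        hi = build(level + 1, {**partial, var: True})
        if lo == hi:
            return lo                    # the test on var is redundant, skip the node
        # Identical (var, lo, hi) triples are merged into one shared node.
        return unique.setdefault((var, lo, hi), len(unique) + 2)

    build(0, {})
    return len(unique)

def f(a):
    return (a["x1"] and a["y1"]) or (a["x2"] and a["y2"]) or (a["x3"] and a["y3"])

interleaved = robdd_size(f, ["x1", "y1", "x2", "y2", "x3", "y3"])
separated = robdd_size(f, ["x1", "x2", "x3", "y1", "y2", "y3"])
```

With the interleaved order the BDD has 6 non-terminal nodes, while the separated order needs 14; the gap grows exponentially as more (xi AND yi) terms are added.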

Pros
 Introduces a well-defined semantics to
describe the matching process in
publish-subscribe systems
 Treating matching as a satisfiability check on
an SBDD allows multiple subscriptions to be
checked incrementally
 Scalable
 Matching under StSL is more efficient than under SiSL

Cons/Improvements
 Does not describe any heuristics for selecting
the variable ordering (finding an optimal one is NP-hard)
◦ Can we order based on the significance of the
attributes involved?
 Does not explore the possibility of eliminating
redundancies due to semantically related
atomic sub-formulas (e.g. price = 100 and
price > 80), which is again NP-hard
◦ Can we further reduce the node count by exploiting
the semantics without causing side effects?
 The efficiency of matching is not compared with
existing systems

Conclusion
 Two major contributions
◦ A precise semantics for matching messages
to subscriptions
◦ Modeling filtering as a satisfiability check
on BDDs
