Elastic Scaling of a High-Throughput
Content-Based Publish/Subscribe Engine
Raphaël Barazzutti, Thomas Heinze, André Martin,
Emanuel Onica, Pascal Felber, Christof Fetzer,
Zbigniew Jerzak, Marcelo Pasin and Etienne Rivière
University of Neuchâtel, Switzerland
SAP AG, Germany
TU Dresden, Germany
ICDCS 2014, Madrid, Spain
etienne.riviere@unine.ch
Context
• Publish/Subscribe as a service
• Running on private or public clouds
• Decoupled communication
• Composition of applications from multiple domains
• Content-based filtering
• Subscriptions filter on the content of publications
• Canonical example: stock quotes filtering
[Figure: Pub/Sub as a Service spanning administrative domains A, B, and C. Publishers call register(pub) and publish p; subscribers call register(sub) and receive notify(pub). Example subscription s: name = "IBM", price > 120$, volume > 10,000. Publication p1 (name = "IBM", price = 131$, volume = 12,312) matches; publication p2 (name = "IBM", price = 112$, volume = 9,892, open = 109$, close = 113$) does not.]
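To make the content-based filtering above concrete, here is a minimal sketch in Java (not the StreamHub API) of matching the stock-quote example; the Predicate interface, the matches helper, and the attribute maps are illustrative names.

```java
// Minimal sketch of content-based matching: a subscription is a conjunction of
// predicates over attribute values, and a publication matches if every predicate holds.
import java.util.Map;

public class MatchingExample {

    interface Predicate { boolean test(Object value); }

    static boolean matches(Map<String, Predicate> subscription, Map<String, Object> publication) {
        for (Map.Entry<String, Predicate> p : subscription.entrySet()) {
            Object value = publication.get(p.getKey());
            if (value == null || !p.getValue().test(value)) return false; // one failed predicate: no match
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Predicate> sub = Map.of(
            "name",   v -> "IBM".equals(v),
            "price",  v -> ((Number) v).doubleValue() > 120,
            "volume", v -> ((Number) v).doubleValue() > 10_000);

        Map<String, Object> pub1 = Map.of("name", "IBM", "price", 131, "volume", 12_312);
        Map<String, Object> pub2 = Map.of("name", "IBM", "price", 112, "volume", 9_892);

        System.out.println(matches(sub, pub1)); // true:  the subscriber is notified
        System.out.println(matches(sub, pub2)); // false: no notification
    }
}
```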
Pub/sub as a service: requirements
• Arbitrary representation of publications and subscriptions
• Not limited to attribute-based filtering
• Untrusted domains: encrypted filtering
• High-throughput and scalability
• Thousands of subscriptions per second
• Thousands of publications per second
• Thousands to millions of notifications per second
• Availability, dependability and low delays
• StreamHub [DEBS 2013]
• Supports arbitrary filtering schemes, in particular
encrypted filtering (ASPE)
• Built on top of a Stream Processing Engine
[Figure: publishers and subscribers reside in trusted domains; the Pub/Sub-as-a-Service engine runs in an untrusted domain, so publications and subscriptions are encrypted before entering it.]
Problem: resource provisioning
• Pub/sub service resource usage is unpredictable and varies over time
• Example: Frankfurt stock exchange, Nov 18, 2011
• Throughput/delay requirements vs budget require appropriate provisioning
• This talk: can we make a high-throughput pub/sub service elastic?
• Scale in and scale out based on actual requirements
• Maintain service continuity and minimize reconfiguration impact
[Figure: Frankfurt stock exchange workload, Nov 18, 2011: ticks per second (0 to 1,200) between 09:00 and 21:00.]
Outline
• Background
• StreamHub high-throughput content-based pub/sub engine
• StreamMine3G stream processing engine
• e-StreamHub (elastic StreamHub) principles
• Components
• Slice migration
• Enforcing elasticity policies
• Evaluation
• Micro-benchmarks
• Trace-based using Frankfurt stock exchange workload
StreamHub: principles
• Tiered approach to pub/sub for cloud deployments
• Split pub/sub into three fundamental, consecutive operations
• Exploit massive data parallelism of each operation
• Supports arbitrary filtering mechanisms
• Event flows do not depend on the filtering scheme's characteristics
• This work: use computationally intensive encrypted filtering
• StreamHub engine = Stream Processing application
• Each of the 3 pub/sub operations mapped to a stream operator
• Operators supported by a Stream Processing Engine
StreamMine3G stream processing engine
• Distributed stream computation as a DAG of operators
• Each operator is split into multiple slices
• Externalized state management, no state sharing between slices
• Support for unicast, anycast & broadcast primitives
[Figure: a DAG of four operators; each operator is split into slices 1..n, and each slice is supported by multiple cores. <key,value> events are routed by unicast (e.g., <k,v> with k%2=1 goes to one designated slice), anycast, or broadcast. Slice state is managed externally; there is no shared memory between slices.]
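A minimal sketch of the routing primitives, assuming illustrative types (an Event record and consumer-based slices, not the StreamMine3G API): unicast routes a keyed event to exactly one slice, broadcast delivers it to all slices of the next operator.

```java
// Hypothetical sketch of key-based routing between operator slices.
import java.util.List;
import java.util.function.Consumer;

public class SliceRouting {

    record Event(String key, byte[] payload) {}

    // Unicast: the key determines the slice, so state for that key stays local to it.
    static void unicast(Event e, List<Consumer<Event>> slices) {
        int i = Math.floorMod(e.key().hashCode(), slices.size());
        slices.get(i).accept(e);
    }

    // Broadcast: every slice of the next operator receives the event
    // (e.g., publications sent to all matching slices).
    static void broadcast(Event e, List<Consumer<Event>> slices) {
        slices.forEach(s -> s.accept(e));
    }
}
```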
StreamHub Engine [DEBS13]
Operators:
• Access Point (AP): subscription partitioning; stateless; decides where to store subscriptions and broadcasts publications to the next operator.
• Matching (M): publication filtering; stateful (persistent); stores subscriptions, filters incoming publications, and creates the list of matching subscriber ids.
• Exit Point (EP): publication dispatching; stateful (transient); aggregates the lists of matching subscriber ids, prepares and dispatches notifications.
[Figure: end-to-end flow through the e-StreamHub engine (public cloud deployment). A subscriber encrypts SUB (x > 3 & y == 5) and a publisher encrypts PUB (x = 7, y = 10) using ASPE; both enter through a DCCP/ConnectionPoint. Subscriptions are unicast to the M slices that store them, publications are broadcast to all M slices, and EP slices multicast notifications of matching encrypted publications back through the DCCP.]
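A rough sketch of the three operators as plain functions; the class and record names are invented for illustration, and plaintext equality matching stands in for the actual encrypted filtering library (ASPE) used by the engine.

```java
// Hypothetical sketch of the AP / M / EP pipeline as stream-operator callbacks.
import java.util.*;

public class ThreeTierPubSub {

    record Subscription(String id, Map<String, Object> predicate) {}
    record Publication(Map<String, Object> attributes) {}

    // AP (stateless): partitions subscriptions among M slices; publications are broadcast to all of them.
    static int accessPoint(Subscription s, int numMatcherSlices) {
        return Math.floorMod(s.id().hashCode(), numMatcherSlices); // unicast target for this subscription
    }

    // M (stateful, persistent): stores subscriptions and filters incoming publications.
    static List<String> match(Publication p, Collection<Subscription> stored) {
        List<String> matching = new ArrayList<>();
        for (Subscription s : stored) {
            // Simplified equality-based predicate check; the real engine delegates to a filtering library.
            if (p.attributes().entrySet().containsAll(s.predicate().entrySet())) {
                matching.add(s.id());
            }
        }
        return matching; // partial list of matching subscriber ids, sent to EP
    }

    // EP (stateful, transient): aggregates partial lists from all M slices, then dispatches notifications.
    static Set<String> exitPoint(List<List<String>> partialLists) {
        Set<String> all = new TreeSet<>();
        partialLists.forEach(all::addAll);
        return all; // one notification per subscriber id
    }
}
```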
e-StreamHub (elastic StreamHub) principles
• Fixed number of slices for each operator
• Elastic scaling decisions based on experienced load
• Scale out: allocate new host(s), migrate slices
• Scale in: migrate slices, deallocate host(s)
• Redistribute slices upon load imbalance between hosts
[Figure: example reconfiguration. Initial single-host placement: slices AP:1..6, M:1..6, and EP:1..6 all on Host 1. Scale out: the slices are spread over Hosts 1, 2, and 3. Scale in: the slices are consolidated back onto Hosts 1 and 2.]
e-StreamHub: components
• StreamMine3G: support for slice migration (including state)
• Goal: minimal interruption of flow
• Manager: orchestrates migrations and collects workload probes
• SM3G and manager state stored in ZooKeeper for dependability
• Elasticity enforcer: takes migration decisions based on observed load
• Based on elasticity policies
[Figure: three hosts running SM3G with AP, M, and EP slices. The Manager collects probes and orchestrates migrations; the Elasticity enforcer, configured by elasticity policies, receives aggregated probes and issues migration requests; SM3G and Manager state is stored in ZooKeeper.]
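One way to picture how the components interact is the control loop sketched below; the interfaces are illustrative names, and the ZooKeeper persistence and actual StreamMine3G calls are omitted.

```java
// Hypothetical sketch of the monitoring/decision/orchestration loop.
import java.util.List;
import java.util.Map;

public class ControlLoop {

    interface Manager {
        Map<String, Double> collectHostCpuProbes();          // host id -> CPU utilization
        void orchestrate(List<MigrationRequest> requests);   // performs the slice migrations
    }

    interface Enforcer {
        List<MigrationRequest> decide(Map<String, Double> probes); // applies the elasticity policy
    }

    record MigrationRequest(String sliceId, String fromHost, String toHost) {}

    static void step(Manager manager, Enforcer enforcer) {
        Map<String, Double> probes = manager.collectHostCpuProbes();
        List<MigrationRequest> requests = enforcer.decide(probes);
        if (!requests.isEmpty()) {
            manager.orchestrate(requests); // scale out/in or redistribute slices
        }
    }
}
```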
Slice migration
1. Input events duplicated to a buffer on dest host
2. Slice halted; state is copied from src to dest host
• Incoming events are still buffered
• State is associated with a vector of sequence numbers, one per slice of the previous operator
• Allows determining which events have already been accounted for
3. Missed events replayed on state, slice resumed
➡ “Downtime” of the slice is essentially equivalent to state copying time
[Figure: the five migration steps for slice OP:1 from Host 1 to Host 2: duplicate input events into a buffer on Host 2, halt the slice, copy its state, replay the n events buffered between t and t+n, and resume OP:1 on Host 2.]
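A sketch of the migration protocol above, with an invented Slice interface standing in for the StreamMine3G runtime; the per-source sequence numbers carried with the state decide which buffered events still need to be replayed.

```java
// Hypothetical sketch of slice migration: buffer, halt, copy state, replay, resume.
import java.util.Queue;

public class SliceMigration {

    record Event(int sourceSlice, long seq, byte[] payload) {}

    interface Slice {
        void halt();                       // stop processing, keep state
        byte[] copyState();                // serialized state plus per-source sequence numbers
        long lastSeqFrom(int sourceSlice); // highest sequence number folded into the state
        void restoreState(byte[] state);
        void process(Event e);
        void resume();
    }

    static void migrate(Slice src, Slice dst, Queue<Event> dstBuffer) {
        // 1. Input events are already being duplicated into dstBuffer on the destination host.
        // 2. Halt the source slice and copy its state; incoming events keep accumulating in dstBuffer.
        src.halt();
        dst.restoreState(src.copyState());
        // 3. Replay only the buffered events that the copied state has not accounted for, then resume.
        for (Event e : dstBuffer) {
            if (e.seq() > dst.lastSeqFrom(e.sourceSlice())) {
                dst.process(e);
            }
        }
        dst.resume();
    }
}
```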
Elasticity policy
• CPU utilization probes collected periodically
• Secondary metrics: network usage and memory usage
• Global rules: criteria on the average load
• Trigger scale-in and scale-out operations
• Define a target average CPU utilization
• Local rules: criteria on the load of a single host
• Grace period between migrations (30s)
• Minimize cost of migrations (sum of migrated states)
Rules (target average CPU = 50%):
• Global: average CPU > 70% → scale out
• Global: average CPU < 30% → scale in
• Local: host CPU < 20% or > 80% → redistribute slices
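The rule set above could be evaluated roughly as follows; this is a hedged sketch with illustrative names, using the thresholds from the table (global rules are checked before local ones).

```java
// Hypothetical sketch of evaluating the global and local elasticity rules on per-host CPU probes.
import java.util.List;

public class ElasticityPolicy {

    enum Action { SCALE_OUT, SCALE_IN, REDISTRIBUTE, NONE }

    static Action evaluate(List<Double> hostCpuLoads) {
        double avg = hostCpuLoads.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        if (avg > 70.0) return Action.SCALE_OUT;            // global rule
        if (avg < 30.0) return Action.SCALE_IN;             // global rule
        for (double load : hostCpuLoads) {                  // local rules
            if (load < 20.0 || load > 80.0) return Action.REDISTRIBUTE;
        }
        return Action.NONE;                                 // within bounds: keep current placement
    }
}
```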
Elasticity policy enforcement
• Three-step resolution when a global or local rule is violated
1. Decide on the set of slices to migrate
• For each host, identify a set of slices with:
sum(CPU utilization) ≥ abs(current average utilization - target utilization)
• Subset sum problem: dynamic programming, pseudo-polynomial complexity
• Returns multiple solutions: select the one involving the least state transfer
2. Scaling out: add enough hosts for avg(load) to be ≤ target (50%)
Scaling in: mark enough least-loaded hosts for avg(load) to be ≥ target (50%)
3. Decide on new placement
• First-fit bin packing algorithm
• Start from the current placement without the selected slices; greedily assign them by decreasing weight
[Figure: example scale-out. Before: Host 1 holds AP:1 (12%), AP:2 (11%), M:1 (25%), M:2 (25%); Host 2 holds M:3 (26%), M:4 (23%), EP:1 (13%), EP:2 (12%). After: the M slices stay on Hosts 1 and 2, while the AP and EP slices move to the new Host 3.]
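A sketch of steps 2 and 3 with invented names: the host count needed to bring the average load to the target, and a first-fit placement of the selected slices by decreasing weight. The subset-sum selection of step 1 is omitted here.

```java
// Hypothetical sketch of sizing and first-fit-decreasing placement.
import java.util.*;

public class PlacementPlanner {

    // Step 2 (scale out): smallest host count such that totalLoad / hosts <= targetPerHost.
    static int hostsNeeded(double totalLoad, double targetPerHost) {
        return (int) Math.ceil(totalLoad / targetPerHost);
    }

    // Step 3: assign the selected slices, heaviest first, to the first host whose load stays
    // below capacity; hostLoads already accounts for the slices that are not migrated.
    static Map<String, Integer> firstFitDecreasing(Map<String, Double> sliceLoads,
                                                   double[] hostLoads, double capacity) {
        Map<String, Integer> placement = new HashMap<>();
        List<Map.Entry<String, Double>> slices = new ArrayList<>(sliceLoads.entrySet());
        slices.sort((a, b) -> Double.compare(b.getValue(), a.getValue())); // decreasing weight
        for (Map.Entry<String, Double> slice : slices) {
            for (int h = 0; h < hostLoads.length; h++) {
                if (hostLoads[h] + slice.getValue() <= capacity) {
                    hostLoads[h] += slice.getValue();
                    placement.put(slice.getKey(), h);
                    break; // a slice that fits nowhere is simply left unplaced in this sketch
                }
            }
        }
        return placement;
    }
}
```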
Evaluation
Experimental setup
• Prototype in C, C++ and Java
• 22 hosts, each with 8 cores and 8 GB of RAM
• 1 Gbps switched network
• 4 hosts with G and S: generators and sinks
• 3 hosts for infrastructure
• 1 to 15 hosts used for e-StreamHub
• 8 to 120 cores
• Number of slices is fixed
• micro-benchmarks: 4 AP, 8 M, 4 EP
• trace-based evaluation: 8 AP, 16 M, 8 EP
• Encrypted pubs and subs (ASPE)
• 100,000 encrypted subs
• Each sub is ~1.2 KB
• 1% matching ratio: 1,000 notifications / pub
[Figure: deployment. Hosts 1 to 15 run the AP (AP:1..8), M (M:1..16), and EP (EP:1..8) slices; Hosts A to D run the generators and sinks (G:1..4, S:1..4); Hosts E, F, and G run ZooKeeper, the Manager, and the Elasticity enforcer.]
Static e-StreamHub: Performance
[Figure, left: distribution of delays (min, 25th, 50th, 75th percentile, max, in ms) under half of the maximum throughput, for 2 to 12 hosts and operator attributions (AP|M|EP) from 0.5|1|0.5 to 3|6|3.]
[Figure, right: maximum throughput (publications/s) for the same configurations. With 12 hosts (3 for AP slices, 6 for M slices, 3 for EP slices): 42.2 million matchings and 422,000 notifications per second.]
Slice migration: times
• Individual migration times, averaged over 25 migrations
• Under constant flow of 100 publications / second
• Worst-case scenario with “large” slice states
• 4 AP, 8 M and 4 EP with 100K and 400K subscriptions
• Migration time minimal for AP and EP, dominated by state size for M
Migration times (average ± standard deviation over 25 migrations):
• AP: 232 ms ± 31 ms
• M (12,500 subscriptions per slice, 100,000 total): 1.497 s ± 354 ms
• M (50,000 subscriptions per slice, 400,000 total): 2.533 s ± 1.557 s
• EP: 275 ms ± 52 ms
Impact of migrations on delay
[Figure: min, average, and max notification delay (0 to 2.5 s) between t = 100 s and t = 200 s, with the instants of five slice migrations marked: AP (1), AP (2), M (1), M (2), EP.]
Steady increase/decrease of load
• Increase/decrease of
publication rate
• 0 to 350 publications
per second
• 0 to 35 million matching operations per second
• CPU load
• average, max and min
over 30 seconds
[Figure: publication rate (0 to 400 publications/second), number of hosts (0 to 18), and host CPU load (min, avg, max, 0 to 100%) over time (0 to 1,800 s).]
• Notification delays: average over 30 seconds
[Figure: average notification delays (0 to 4 s) over time (0 to 1,800 s).]
Frankfurt stock exchange
• Replay of the Frankfurt stock exchange workload
• Sped up 10 times: about 40 minutes shown
• 100,000 encrypted subscriptions
• CPU load: average, max and min over 30 seconds
• Notification delays: average over 30 seconds
[Figure: publication rate (0 to 200 publications/second), number of hosts (0 to 10), and host CPU load (min, avg, max, 0 to 100%) over time (0 to 2,500 s).]
[Figure: average notification delays (0 to 2 s) over time (0 to 2,500 s).]
Conclusion
• Elastic scaling of a high-throughput content-based pub/sub engine
• Implemented at the level of the supporting stream processing engine
• Applicability to other long-lived stream processing applications
• Support of live operator slice migration
• Redistribution of slices according to monitored workload
• Allows deploying pub/sub as a service on a public cloud without
the need to provision for worst-case scenarios
• Evaluation using Frankfurt stock exchange traces
• Perspectives
• Monitor network flows, minimize inter-host communications in
migration plans
• Leverage active replication, used for dependability, to minimize impact
on delay while migrating (this requires deterministic execution)
Elastic Scaling of a High-Throughput
Content-Based Publish/Subscribe Engine
Questions?
This work was supported by the EU FP7 program (SRT-15 project)
http://www.srt-15.eu
ICDCS 2014, Madrid, Spain
etienne.riviere@unine.ch
Additional slides
Independence from Filtering Scheme
• Attribute-based filtering widely studied
• Represent content using discrete attributes
• Subscriptions = conjunctions of discrete predicates on attribute values
• Broker overlays typically rely on containment & aggregation capabilities of
attribute-based filtering
• Alternative filtering schemes
• Encrypted filtering
• ASPE (Choi et al., DEXA 2010)
• Prefiltering (DEBS 2012)
➡ No guaranteed support for
containment or aggregation
• The architecture of a pub/sub engine
should be independent from the
filtering scheme(s) it supports
[Figure: publishers and subscribers reside in trusted domains; the Pub/Sub-as-a-Service engine runs in an untrusted domain, so publications and subscriptions are encrypted before leaving the trusted domains.]
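This independence can be captured by an opaque filtering-scheme interface such as the sketch below (illustrative, not the StreamHub API): the engine only calls matches and never inspects the representation, so attribute-based, prefiltering, or encrypted (ASPE) schemes can be plugged in without changing the event flows.

```java
// Hypothetical sketch of a pluggable filtering-scheme interface.
public interface FilteringScheme<S, P> {

    // Performed in the trusted domain, before data leaves it.
    S encodeSubscription(byte[] plaintextSubscription);
    P encodePublication(byte[] plaintextPublication);

    // Performed by the engine in the (possibly untrusted) domain; for encrypted
    // schemes such as ASPE this operates directly on ciphertexts.
    boolean matches(S subscription, P publication);
}
```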
Event Paths and Support Libraries
[Figure, subscription path: a subscriber sends s over TCP to a DCCP, which forwards it (anycast) to an AP slice; libcluster determines the key of the target matching-operator slice using clustering, and the subscription is unicast to that M slice, which stores it in its slice state.]
[Figure, publication path: a publisher sends p over TCP to a DCCP, which forwards it (anycast) to an AP slice; the publication is broadcast to all M slices, where libfilter returns the list of matching subscribers; each (publication + subscribers) pair is sent to an EP slice, which multicasts notifications through DCCPs back to subscribers over TCP.]
StreamHub interface
• Data Conversion and Connection Point(s) (DCCP)
• Stateless component(s) running on server(s) with direct WAN connectivity
• Maintain persistent TCP connections to/from publishers and subscribers
• Convert between external/internal format
[Figure: publishers and subscribers keep persistent TCP connections to the DCCP, which converts between the external (GPB) format and the engine's internal format before handing publications and subscriptions to the StreamHub engine.]
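A hedged sketch of what a DCCP does, with invented types (no actual Protocol Buffers or StreamHub APIs are used): it bridges client connections and converts between the external wire format and the engine's internal representation.

```java
// Hypothetical sketch of a Data Conversion and Connection Point (DCCP).
import java.util.Map;

public class Dccp {

    interface ExternalCodec {                       // e.g., a protobuf-based codec in practice
        Map<String, Object> decode(byte[] wireMessage);
        byte[] encode(Map<String, Object> internalEvent);
    }

    interface Engine {                              // injection point into the AP operator
        void submit(Map<String, Object> internalEvent);
    }

    private final ExternalCodec codec;
    private final Engine engine;

    Dccp(ExternalCodec codec, Engine engine) { this.codec = codec; this.engine = engine; }

    // Called for each message read from a client's persistent TCP connection.
    void onClientMessage(byte[] wireMessage) {
        engine.submit(codec.decode(wireMessage));   // external -> internal, then into the engine
    }

    // Called for each notification leaving the engine towards a subscriber.
    byte[] onNotification(Map<String, Object> internalEvent) {
        return codec.encode(internalEvent);         // internal -> external
    }
}
```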
Use an Overlay of Brokers?
• Brokers organized in an overlay (mesh, tree);
each broker performs all pub/sub operations
• Storing subscriptions; matching, forwarding, notifying of publications
• Complex maintenance of routing tables between brokers
• Assumptions on the filtering scheme for inter-broker communication
and publication forwarding: containment & aggregation
• Scalability/throughput depend on workload, notification delays may vary
[Figure: publishers and subscribers attached to an overlay of brokers; subscriptions trigger routing-table (RT) updates between brokers, and publications are forwarded hop by hop along the overlay toward matching subscribers.]