Reactive Stream Processing
With Mantis
Neeraj Joshi - Senior Software Engineer, Edge Systems
Nick Mahilani - Senior Software Engineer, Edge Systems
October 4, Reactive Summit 2016
Monitoring a complex
distributed service @ scale
is hard
81+ Million Subscribers
Across the Globe
Streaming thousands of Titles
Over Millions of Devices
Powered by 100s of Micro-services
Combinatorial Explosion of Data !!!
Complexity and Comprehension
So, in order to manage complex environments,
we need to rethink insights and shift the curve
An Insight system that can...
Auto-detect anomalies in high
volume, high cardinality data
Identify titles that have an abnormal failure
rate and highlight their common
characteristics
(only on certain devices using certain
CDNs etc)
An Insight system that can...
Aggregate rich data On-demand
Calculate latency percentiles for PS4 in UK
using firmware v1.0 and ui 1.2.1
On Demand
An Insight system that can...
Find your needle-in-the-haystack
in real-time
Find me all requests for customer X with
latency > 1 second
And at the same time
be cost effective!
Edge servers alone
generate
10 Tb/s of operational
data!!
How can we contain
the cost?
Two Strategies
● Reduce Data
● Optimize Resource Usage
What if?
We only stream what is needed & when it is needed?
Do we really need all
the data all the time?
Anomaly Detection Use-case
● Look for abnormal trends in aggregate signal
● Deeper analysis on filtered events
● Aggregate data / filtered data ⇒ Subset of data
Dynamic Dashboards use-case
● Subset of data
● Only subset of fields required
● Aggregate data
● Only On-demand
Ad-hoc Realtime Search
use-case
Looking for a
tiny subset of
data
What If ?
We only stream what is needed & when it is needed?
Reuse the data already streamed?
Does every consumer
really need different
data?
Can we reuse Data?
[Diagram, built up over several slides: EdgeServers publish to a Device Events Q, which feeds three jobs]
● Realtime Data Aggregator Job (Device Health Dashboard): receives All Device Events
● Anomaly Detection Job (Alerts): receives All Device Events
● Search for device1 events Job (Ad-hoc Query): receives All Device Events, including Device != “device1”
● Result: 3x fan out
● Add a Queryable Events Job
○ (Select status Where true): Only get “projected” events
○ (select * where device == “device1”): Only get “filtered” events
● Result: 2x fan out
What If ?
Only stream what is needed & when it is needed?
Reuse the data already streamed?
Reuse the results?
Can we reuse Results?
[Diagram, built up over several slides, with the same components as "Can we reuse Data?"]
● Starting point: 2x fan out
○ Realtime Data Aggregator Job (Device Health Dashboard): receives All Device Events
○ Anomaly Detection Job (Alerts): receives All Device Events
○ Search for device1 events Job (Ad-hoc Query): served by the Queryable Events Job
● Reuse results: the Anomaly Detection Job no longer reads All Device Events
● Result: 1x fan out
Streaming Micro-services
[Diagram: the same pipeline as "Can we reuse Results?", 1x fan out, Reuse results]
Smells like Micro-Services!
What If ?
Only stream what is needed & when it is needed?
Reuse the data already streamed?
Auto-scale resources?
Reuse the results?
Do we really need
peak resources all the
time?
Number of active jobs is unpredictable
● Increased activity during incidents
[Chart: ActiveJobs]
Data volume varies by time of day
● We see 5 times more data at peak
Job Resources scale with data
[Chart: Data volume, Resources used]
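A back-of-the-envelope sketch of why scaling jobs with data volume saves resources. The event rates and the per-worker capacity below are made-up numbers chosen only so that the peak is 5 times the trough, as on the slide; they are not Netflix figures, and `workers_needed` is a hypothetical helper, not part of Mantis.

```python
import math

def workers_needed(events_per_sec, capacity_per_worker):
    """Smallest worker count that can absorb the given event rate."""
    return max(1, math.ceil(events_per_sec / capacity_per_worker))

# Hypothetical event rates over a day (events/sec); peak is 5x the trough.
rates = [200_000, 400_000, 700_000, 1_000_000, 600_000, 300_000]
CAPACITY = 50_000  # hypothetical events/sec that one worker can process

# Worker count if the job is resized for each rate.
autoscaled = [workers_needed(r, CAPACITY) for r in rates]

# Worker count if the job is sized once, for the peak.
peak_provisioned = max(autoscaled)

def savings_vs_peak(workers):
    """Fraction of peak capacity that is not needed at this worker count."""
    return 1 - workers / peak_provisioned
```

With these numbers a job sized for the peak runs 20 workers all day, while an autoscaled job needs only 4 at the trough.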
Mantis
● Small but extremely fast
shrimp
● A Reactive stream
processing system
Mantis
Only stream what is needed & when it is needed
Reuse the data & results?
Auto-scale resources?
Query based On-demand streaming of data
Built-in Job Discovery and Job Chaining
Job and Cluster Auto-scaling
+ Much More
● High throughput, low latency stream processing system focused on
Operational Insights
● Configurable data guarantees
● Long running & Transient jobs
● Flexible Functional programming with RxJava
Mantis deep-dive
● Query based On-demand Streaming of data
● Job Discovery and Job Chaining
● Auto-scaling Jobs and Clusters
● End-to-end Reactive Stream Semantics
Mantis Architecture
[Diagram components]
● Mantis UI, Mantis API
● Mantis Master: Mesos Framework, Fenzo Scheduler
● Mantis Agents: EC2 instances
● Mantis Job code runs in Containers
Mantis Job
● Source
○ Observable< Observable<T> >
● 1…N Stages
○ Observable<T> → Observable<R>
● Sink
○ Observable<R>
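The Source → Stages → Sink shape can be sketched without RxJava. The following is a plain-Python illustration using generators, not the Mantis or RxJava API: `source`, `merge`, `slow_only`, `to_device` and `run_job` are hypothetical names, and `merge` simply concatenates the inner streams, whereas a real job would merge them concurrently.

```python
def source():
    """A stream of streams, e.g. one inner stream per upstream connection."""
    yield iter([{"device": "PS4", "latency": 120},
                {"device": "XBox", "latency": 900}])
    yield iter([{"device": "PS4", "latency": 1500}])

def merge(streams):
    """Flatten the stream of streams (sequentially, for simplicity)."""
    for stream in streams:
        yield from stream

def slow_only(events):
    """Stage 1: stream of events -> stream of slow events."""
    return (e for e in events if e["latency"] > 500)

def to_device(events):
    """Stage 2: stream of events -> stream of device names."""
    return (e["device"] for e in events)

def run_job(src, stages, sink):
    """Wire source -> stage 1 -> ... -> stage N -> sink and drain it."""
    stream = merge(src)
    for stage in stages:
        stream = stage(stream)
    for item in stream:
        sink(item)

collected = []
run_job(source(), [slow_only, to_device], collected.append)
```

Each stage takes a stream and returns a stream, so stages compose in any number, which is the point of the 1…N Stages signature.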
Query Based On Demand Streaming
● Stream data only when needed and only what is needed
● Filter data at the source
● Cleanup after use
[Diagram: Mantis Job sends a Query to the Data Source and receives the Requested Data]
Mantis Query Language (MQL)
● Projection: SELECT xid, errorCode
● Filtering: WHERE device-type == SONY_PS3
● Sampling: SAMPLE {"strategy": "RANDOM", "threshold": 200}
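The three parts of the query can be read as three steps applied to each event. The sketch below is an illustration in plain Python, not an MQL implementation: `run_query` is a hypothetical function, and `sample_rate` is a simple keep-probability standing in for the slide's `SAMPLE` clause, whose exact `threshold` semantics are not described in the deck.

```python
import random

def run_query(events, fields, predicate, sample_rate, rng):
    """Apply filtering, then sampling, then projection to each event."""
    for event in events:
        if not predicate(event):            # WHERE ...
            continue
        if rng.random() >= sample_rate:     # SAMPLE ... (assumed: keep-probability)
            continue
        # SELECT ...: keep only the requested fields
        yield {name: event[name] for name in fields if name in event}

events = [
    {"xid": 1, "errorCode": "E1", "device-type": "SONY_PS3", "payload": "..."},
    {"xid": 2, "errorCode": "E2", "device-type": "XBOX", "payload": "..."},
]

result = list(run_query(
    events,
    fields=["xid", "errorCode"],
    predicate=lambda e: e["device-type"] == "SONY_PS3",
    sample_rate=1.0,                        # keep every matching event
    rng=random.Random(0),
))
```

Filtering runs first so that sampling and projection work is spent only on events the consumer asked for.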
Query processing on Data producing app
● The API app embeds MRE, the Mantis Real-time Events library
● QoE Analysis Mantis Job
○ Query: SELECT xid WHERE type = rebuffer
○ Receives: { "xid": 1234 }, { "xid": 4567 }
● Device Analysis Mantis Job
○ Query: SELECT * WHERE device = XBox
○ Receives: { "device": "XBox", "IP": "1.1.1.1", "xid": 1111 }
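One way to picture query processing inside the data-producing app is a small registry of active queries that every event is checked against. This is a sketch under assumptions, not the MRE library: `EventPublisher` and its methods are hypothetical names, and queries are represented as Python predicates plus an optional field list rather than parsed MQL.

```python
class EventPublisher:
    """Evaluates each event against the currently registered queries."""

    def __init__(self):
        self._subs = {}  # subscription id -> (predicate, fields or None, outbox)

    def subscribe(self, sub_id, predicate, fields=None):
        """Register a query; returns the list that matching events land in."""
        outbox = []
        self._subs[sub_id] = (predicate, fields, outbox)
        return outbox

    def unsubscribe(self, sub_id):
        """Cleanup after use: the query stops costing anything."""
        self._subs.pop(sub_id, None)

    def publish(self, event):
        """Deliver the event to matching queries; returns the delivery count."""
        delivered = 0
        for predicate, fields, outbox in self._subs.values():
            if predicate(event):
                projected = event if fields is None else {
                    name: event[name] for name in fields if name in event}
                outbox.append(projected)
                delivered += 1
        return delivered

pub = EventPublisher()
before = pub.publish({"xid": 1, "type": "rebuffer"})  # no queries: nothing leaves the app

qoe = pub.subscribe("qoe", lambda e: e.get("type") == "rebuffer", fields=["xid"])
device = pub.subscribe("device", lambda e: e.get("device") == "XBox")

pub.publish({"xid": 1234, "type": "rebuffer", "device": "PS4"})
pub.publish({"xid": 1111, "type": "play", "device": "XBox", "IP": "1.1.1.1"})

pub.unsubscribe("qoe")
pub.publish({"xid": 4567, "type": "rebuffer", "device": "PS4"})  # no longer delivered
```

The same event stream serves both jobs, and an event that matches no active query is never sent over the network.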
Job Discovery & Chaining
[Diagram, built up over several slides: Anomaly Job Workers 1 and 2, Aggregator Job Workers 1 and 2, Mantis Master]
● Anomaly Job workers subscribe to the Aggregator Job scheduling info stream on the Mantis Master
● Aggregator Job scheduling info: { worker1 : { host : 1.1.1.1, port : 8888 }, … }
● Anomaly Job workers connect to the Aggregator Job workers with a Mantis Query
● Aggregator Job workers send Filtered data to the Anomaly Job workers
Fault tolerance: Worker failure
[Diagram, built up over several slides]
● Aggregator Job Workers 1 and 2 send Filtered data to Anomaly Job Workers 1 and 2
● Aggregator Job Worker 2 fails
● Aggregator Job Worker 3 replaces it
● Mantis Master sends updated Aggregator Job scheduling info
● Anomaly Job workers receive Filtered data from Aggregator Job Workers 1 and 3
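The reaction of a downstream worker to a scheduling info update can be sketched as a diff between the workers it is connected to and the workers the Master now reports. This is an illustration only: `reconcile` is a hypothetical helper, and every address except host 1.1.1.1 and port 8888 (taken from the slide) is invented.

```python
def reconcile(connected, scheduling_info):
    """Compare current connections with the latest scheduling info.

    Both arguments map worker name -> (host, port).
    Returns (to_connect, to_disconnect). A worker whose address changed
    appears in both, so it is dropped and then reconnected.
    """
    to_connect = {worker: addr for worker, addr in scheduling_info.items()
                  if connected.get(worker) != addr}
    to_disconnect = [worker for worker, addr in connected.items()
                     if scheduling_info.get(worker) != addr]
    return to_connect, to_disconnect

# Connections held by an Anomaly Job worker before the failure.
connected = {
    "worker1": ("1.1.1.1", 8888),
    "worker2": ("1.1.1.2", 8888),   # hypothetical address
}

# Update from the Master after Worker 2 is replaced by Worker 3.
update = {
    "worker1": ("1.1.1.1", 8888),
    "worker3": ("1.1.1.3", 8888),   # hypothetical address
}

to_connect, to_disconnect = reconcile(connected, update)
```

The same diff covers in-service updates: when every address changes to a new job version, every old connection is dropped and every new one is opened.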
In Service Job updates
[Diagram, built up over several slides]
● Aggregator v1 Workers 1 and 2 send Filtered data to Anomaly Job Workers 1 and 2; Aggregator v2 Workers 1 and 2 are started
● Mantis Master sends Aggregator v2 scheduling info
● Anomaly Job workers connect to Aggregator v2 with the Mantis Query
● Filtered data now comes from Aggregator v2 Workers 1 and 2
Job Chaining Example
Kafka partition - multiple consumers
[Diagram: Kafka TopicPartition (0 to N) read by three separate consumers]
● Consumer 1 (device == XBox)
● Consumer 2 (type == rebuffer)
● Consumer 3 (xid = 0afcedbxe)
Reuse Kafka Data Streams
[Diagram: Kafka TopicPartition (0 to N) read by one Mantis Kafka Consumer Job, which serves three jobs]
● Device Analysis Job: SELECT * WHERE device = XBox
● QoE analysis Job: SELECT * WHERE type = rebuffer
● Adhoc Transaction Analysis Job: SELECT * WHERE xid = 0afcedbxe
Mantis Agent Cluster Autoscaling
[Diagram sequence]
● 2 EC2 Instances: Job 1 Workers
● 2 EC2 Instances: Job 2 Workers
● 3 EC2 Instances: Job 2 Workers
● 3 EC2 Instances: Job 2 scale up
● 4 EC2 Instances: Job 2 scale up
Bin Packing
● Simple round robin scheduling causes fragmentation
● Smarter bin-packing of jobs frees up Mantis agents for scale down
[Diagram: Hosts 1 to 4 under round robin vs. Hosts 1 to 4 under bin-packing]
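The difference between the two placements can be reproduced with a toy scheduler. The sketch below uses first-fit decreasing, a textbook bin-packing heuristic swapped in for illustration; it is not Fenzo's algorithm, and the worker sizes, host count and capacity are invented.

```python
def round_robin(workers, hosts, capacity):
    """Place each worker on the next host in turn that still has room."""
    loads = [0] * hosts
    start = 0
    for size in workers:
        for step in range(hosts):
            host = (start + step) % hosts
            if loads[host] + size <= capacity:
                loads[host] += size
                start = (host + 1) % hosts
                break
        else:
            raise ValueError("no host can fit this worker")
    return loads

def first_fit_decreasing(workers, hosts, capacity):
    """Place the largest workers first, each on the first host with room."""
    loads = [0] * hosts
    for size in sorted(workers, reverse=True):
        for host in range(hosts):
            if loads[host] + size <= capacity:
                loads[host] += size
                break
        else:
            raise ValueError("no host can fit this worker")
    return loads

def idle_hosts(loads):
    """Hosts with nothing on them are candidates for scale down."""
    return sum(1 for load in loads if load == 0)

workers = [2, 2, 2, 2]   # hypothetical CPU units per worker
spread = round_robin(workers, hosts=4, capacity=4)
packed = first_fit_decreasing(workers, hosts=4, capacity=4)
```

Round robin leaves every host half full, so none can be terminated; packing fills two hosts and leaves two idle.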
Back Pressure
End-to-end Reactive Streams
● RxJava operators compose backpressure within a single worker
● Reactive Socket for backpressure across network boundaries
Reactive Socket
● Application layer protocol for async non-blocking backpressure across network boundaries
● Rich interaction modes
● Pluggable transport protocol
[Diagram: Node A and Node B exchange Request N and Data]
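The Request N / Data exchange is demand-driven: the producer may send only as many items as the consumer has asked for. The sketch below shows that rule with a synchronous pull in plain Python; it is a simplification, not the Reactive Socket protocol or the RxJava API, and `Producer` and `consume` are hypothetical names.

```python
class Producer:
    """Sends data only in response to demand from the consumer."""

    def __init__(self, items):
        self._items = iter(items)
        self.sent = 0

    def request(self, n):
        """Return at most n items; fewer only if the stream has ended."""
        batch = []
        for _ in range(n):
            try:
                batch.append(next(self._items))
            except StopIteration:
                break
        self.sent += len(batch)
        return batch

def consume(producer, batch_size, handle):
    """Ask for batch_size items at a time, and ask again only when done."""
    while True:
        batch = producer.request(batch_size)
        if not batch:
            return
        for item in batch:
            handle(item)

producer = Producer(range(10))
first = producer.request(3)      # demand of 3 yields exactly 3 items
sent_after_first = producer.sent

seen = []
consume(producer, 4, seen.append)
```

A slow consumer simply issues its next request later, so the producer can never overrun it.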
End-to-end Reactive Streams
[Diagram, built up over several slides]
● Within a Mantis Job, data flows downstream and demand flows upstream through each operator
○ Stage 1: filter(), map(), groupBy()
○ Stage 2: window(), reduce(), flatMap()
● Reactive Socket carries data and demand between Stage 1 and Stage 2
● Reactive Socket also connects Mantis Job 1 to Mantis Job 2
Reactive Stream Processing
● Message Driven
● Elastic
● Resilient
● Responsive
Mantis Today
● ~650 Jobs in production
● ~8 Million events / sec processed
● 80 Gb/s processed (instead of 10 Tb/s, thanks to filtering), i.e. about 99% less data!
● The processed data gets reused by other jobs, further reducing costs
● Auto-scaling jobs use up to 75% fewer resources compared to peak
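The data-reduction figure can be checked from the two rates on the slide. The only assumption in this short calculation is decimal units (1 Tb/s = 1000 Gb/s).

```python
raw_gbps = 10 * 1000        # 10 Tb/s generated by the edge servers, in Gb/s
processed_gbps = 80         # what Mantis actually processes

fraction_kept = processed_gbps / raw_gbps        # 0.008, i.e. 0.8%
reduction_pct = (1 - fraction_kept) * 100        # about 99.2%
```

So filtering at the source leaves under 1% of the raw volume to be streamed.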
References
● Mantis Blogpost
http://techblog.netflix.com/2016/03/stream-processing-with-mantis.html
● Resource Scheduling on Mesos
https://www.youtube.com/watch?v=uyGEgWAG9EQ
● Fenzo https://github.com/Netflix/Fenzo
● RxJava https://github.com/ReactiveX/RxJava
● Reactive Socket http://reactivesocket.io/
● RxNetty https://github.com/ReactiveX/RxNetty
Questions?
Reactive Stream Processing
with Mantis
Neeraj Joshi Nick Mahilani
@neerajrj @nick_mahilani
