Reactive Stream Processing with Mantis

Talk at Reactive Summit 2016 covering the on-demand streaming, job discovery and chaining, and auto-scaling aspects of Mantis, the reactive stream processing platform at Netflix.

Published in: Software


  1. 1. Reactive Stream Processing With Mantis Neeraj Joshi - Senior Software Engineer, Edge Systems Nick Mahilani - Senior Software Engineer, Edge Systems October 4, Reactive Summit 2016
  2. 2. Monitoring a complex distributed service @ scale is hard
  3. 3. 81+ Million Subscribers Across the Globe
  4. 4. Streaming thousands of Titles
  5. 5. Over Millions of Devices
  6. 6. Powered by 100s of Micro-services
  7. 7. Combinatorial Explosion of Data !!!
  8. 8. Complexity and Comprehension
  9. 9. So, in order to manage complex environments, we need to rethink insights and shift the curve
  10. 10. An Insight system that can... Auto-detect anomalies in high volume, high cardinality data
  11. 11. An Insight system that can... Auto-detect anomalies in high volume, high cardinality data Identify titles that have an abnormal failure rate and highlight their common characteristics (only on certain devices using certain CDNs etc)
  12. 12. An Insight system that can... Aggregate rich data On-demand
  13. 13. An Insight system that can... Aggregate rich data On-demand Calculate latency percentiles for PS4 in UK using firmware v1.0 and ui 1.2.1 On Demand
  14. 14. An Insight system that can... Find your needle-in-the-haystack in real-time
  15. 15. An Insight system that can... Find your needle-in-the-haystack in real-time Find me all requests for customer X with latency > 1 second
  16. 16. And at the same time be cost effective!
  17. 17. Edge servers alone generate 10 Tb/s of operational data!!
  18. 18. How can we contain the cost?
  19. 19. Two Strategies: Reduce Data; Optimize Resource Usage
  20. 20. What if? We only stream what is needed & when it is needed?
  21. 21. Do we really need all the data all the time?
  22. 22. Anomaly Detection Use-case Look for abnormal trends in aggregate signal
  23. 23. Anomaly Detection Use-case Look for abnormal trends in aggregate signal Deeper analysis on filtered events
  24. 24. Anomaly Detection Use-case Look for abnormal trends in aggregate signal Deeper analysis on filtered events Aggregate data / filtered data ⇒ Subset of data
  25. 25. Dynamic Dashboards use-case Subset of data
  26. 26. Dynamic Dashboards use-case Only a subset of fields required
  27. 27. Dynamic Dashboards use-case Aggregate data Only On-demand
  28. 28. Ad-hoc Realtime Search use-case Looking for a tiny subset of data
  29. 29. What If ? We only stream what is needed & when it is needed? Reuse the data already streamed?
  30. 30. Does every consumer really need different data?
  31. 31. EdgeServers Can we reuse Data? Device Events Q
  32. 32. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Can we reuse Data? Device Events Q
  33. 33. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job All Device Events Device != “device1”
  34. 34. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job All Device Events Device != “device1”
  35. 35. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 3x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job All Device Events Device != “device1”
  36. 36. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 3x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job All Device Events Device != “device1” Queryable Events Job
  37. 37. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 3x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job All Device Events Device != “device1” Queryable Events Job (Select status Where true) Only get “projected” events
  38. 38. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 3x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job All Device Events Device != “device1” Queryable Events Job (Select status Where true) Only get “projected” events
  39. 39. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 3x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job Queryable Events Job
  40. 40. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 3x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job Queryable Events Job (select * where device == “device1”)
  41. 41. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 3x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job Queryable Events Job Only get “filtered” events
  42. 42. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events Anomaly Detection Job Alerts All Device Events 2x fan out Can we reuse Data? Device Events Q Ad-hoc Query Search for device1 events Job Queryable Events Job Only get “filtered” events
  43. 43. What If ? Only stream what is needed & when it is needed? Reuse the data already streamed? Reuse the results?
  44. 44. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events 2x fan out Can we reuse Results? Device Events Q Ad-hoc Query Search for device1 events Job Anomaly Detection Job Alerts Queryable Events Job All Device Events
  45. 45. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events 2x fan out Can we reuse Results? Device Events Q Ad-hoc Query Search for device1 events Job Anomaly Detection Job Alerts Queryable Events Job
  46. 46. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events 2x fan out Can we reuse Results? Device Events Q Ad-hoc Query Search for device1 events Job Anomaly Detection Job Alerts Queryable Events Job Reuse results
  47. 47. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events 1x fan out Can we reuse Results? Device Events Q Ad-hoc Query Search for device1 events Job Anomaly Detection Job Alerts Queryable Events Job Reuse results
  48. 48. EdgeServers Device Health Dashboard Realtime Data Aggregator Job All Device Events 1x fan out Streaming Micro-services Device Events Q Ad-hoc Query Search for device1 events Job Anomaly Detection Job Alerts Queryable Events Job Smells like Micro-Services!
  49. 49. What If ? Only stream what is needed & when it is needed? Reuse the data already streamed? Auto-scale resources? Reuse the results?
  50. 50. Do we really need peak resources all the time?
  51. 51. Number of active jobs is unpredictable Increased activity during incidents (chart: active jobs over time)
  52. 52. Data volume varies by time of day We see 5 times more data at peak
  53. 53. Job Resources scale with data Data volume Resources used
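The idea on slides 50 to 53, that resources can follow data volume instead of being held at peak, can be sketched with a little arithmetic. This is a minimal illustration and not the Mantis autoscaler: `workers_needed`, the per-worker capacity and the four sample rates are all hypothetical. Only the 5x ratio between trough and peak is taken from the slides.

```python
import math

def workers_needed(events_per_sec: float, per_worker_capacity: float, min_workers: int = 1) -> int:
    """Smallest worker count that can absorb the given event rate."""
    return max(min_workers, math.ceil(events_per_sec / per_worker_capacity))

# Made-up load profile; the peak is 5x the trough, as on slide 52.
sample_rates = [2_000, 4_000, 10_000, 6_000]
CAPACITY = 2_000  # assumed events/sec one worker can handle

scaled = [workers_needed(rate, CAPACITY) for rate in sample_rates]
peak = max(scaled)

static_cost = peak * len(sample_rates)   # always provisioned for peak
scaled_cost = sum(scaled)                # provisioned per interval
saving = 1 - scaled_cost / static_cost

print(scaled, f"saving={saving:.0%}")
```

With this made-up profile the scaled job uses a little under half the worker-intervals of a peak-provisioned one; the real saving depends entirely on the shape of the load.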
  54. 54. Mantis ● Small but extremely fast shrimp ● A Reactive stream processing system
  55. 55. Mantis Only stream what is needed & when it is needed Reuse the data & results? Auto-scale resources?
  56. 56. Mantis Only stream what is needed & when it is needed Reuse the data & results? Auto-scale resources? Query based On-demand streaming of data
  57. 57. Mantis Only stream what is needed & when it is needed Reuse the data & results? Auto-scale resources? Query based On-demand streaming of data Built-in Job Discovery and Job Chaining
  58. 58. Mantis Only stream what is needed & when it is needed Reuse the data & results? Auto-scale resources? Query based On-demand streaming of data Built-in Job Discovery and Job Chaining Job and Cluster Auto-scaling
  59. 59. + Much More ● High throughput, low latency stream processing system focused on Operational Insights ● Configurable data guarantees ● Long running & Transient jobs ● Flexible Functional programming with RxJava
  60. 60. Mantis deep-dive ● Query based On-demand Streaming of data ● Job Discovery and Job Chaining ● Auto-scaling Jobs and Clusters ● End-to-end Reactive Stream Semantics
  61. 61. Mantis Architecture Mesos Framework Fenzo Scheduler Mantis Master Mantis Agents EC2 instance EC2 instance EC2 instance Mantis Job code runs in Containers
  62. 62. Mantis Architecture Mesos Framework Fenzo Scheduler Mantis Master Mantis Agents EC2 instance EC2 instance EC2 instance Mantis API Mantis Job code runs in Containers Mantis UI
  63. 63. Mantis Job ● Source ○ Observable< Observable<T> > ● 1…N Stages ○ Observable<T> → Observable<R> ● Sink ○ Observable<R>
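The source, stages and sink structure on slide 63 can be sketched in plain Python, with ordinary iterables standing in for Observables. This is not the Mantis or RxJava API: `run_job`, the two stage functions and the sample events are hypothetical, and the inner streams are merged one after another rather than concurrently.

```python
from itertools import chain
from typing import Callable, Iterable, List

# Assumption: an "Observable<T>" is modelled as a plain iterable of T.
Stage = Callable[[Iterable], Iterable]

def run_job(source: Iterable[Iterable], stages: List[Stage],
            sink: Callable[[Iterable], list]) -> list:
    """Flatten the source's inner streams, apply each stage in order, pass the result to the sink."""
    stream: Iterable = chain.from_iterable(source)  # stream of streams -> flat stream of T
    for stage in stages:                             # each stage maps a stream of T to a stream of R
        stream = stage(stream)
    return sink(stream)

# Two inner streams, e.g. one per upstream connection (made-up events).
source = [
    [{"device": "XBox", "latency_ms": 120}, {"device": "PS4", "latency_ms": 900}],
    [{"device": "PS4", "latency_ms": 1500}],
]

def only_slow(events):
    """Stage 1: keep events slower than 500 ms."""
    return (e for e in events if e["latency_ms"] > 500)

def to_device(events):
    """Stage 2: project each event to its device name."""
    return (e["device"] for e in events)

result = run_job(source, [only_slow, to_device], sink=list)
print(result)
```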
  64. 64. Mantis ● Query based On-demand Streaming of data ● Job Discovery and Job Chaining ● Auto-scaling Jobs and Clusters ● End-to-end Reactive Stream Semantics
  65. 65. Query Based On Demand Streaming ● Stream data only when needed and only what is needed ● Filter data at the source ● Cleanup after use Data Source Query Requested Data Mantis Job
  66. 66. Mantis Query Language (MQL) SELECT xid, errorCode [Projection] WHERE device-type == SONY_PS3 [Filtering] SAMPLE {"strategy": "RANDOM", "threshold": 200} [Sampling]
  67. 67. Query processing on Data producing app API MRE Mantis Real-time Events library
  68. 68. Query processing on Data producing app API MRE QoE Analysis Mantis Job Mantis Real-time Events library SELECT xid WHERE type = rebuffer
  69. 69. Query processing on Data producing app API MRE QoE Analysis Mantis Job SELECT xid WHERE type = rebuffer Mantis Real-time Events library { “xid”: 1234}, { “xid”: 4567}
  70. 70. Query processing on Data producing app API MRE QoE Analysis Mantis Job SELECT xid WHERE type = rebuffer Mantis Real-time Events library { “xid”: 1234}, { “xid”: 4567} Device Analysis Mantis Job SELECT * WHERE device = XBox { “device”: “XBox”, “IP”: 1.1.1.1, “xid”:1111 }
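Slides 65 to 70 describe queries being evaluated inside the data-producing app, so that each subscribing job receives only the events and fields it asked for. A minimal sketch of that idea follows. It does not parse MQL and is not the Mantis Real-time Events library: `Query` and `publish` are hypothetical names, the predicates are plain Python lambdas, and sampling is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Event = Dict[str, object]

@dataclass
class Query:
    """Hypothetical in-memory form of a query such as SELECT xid WHERE type = rebuffer."""
    name: str
    fields: Optional[List[str]]          # None means SELECT *
    where: Callable[[Event], bool]

    def apply(self, event: Event) -> Optional[Event]:
        if not self.where(event):
            return None                  # filtered out at the source, never sent
        if self.fields is None:
            return dict(event)
        return {f: event[f] for f in self.fields if f in event}

def publish(event: Event, queries: List[Query]) -> Dict[str, Event]:
    """Evaluate every active query against one event; return what each subscriber would receive."""
    out: Dict[str, Event] = {}
    for q in queries:
        projected = q.apply(event)
        if projected is not None:
            out[q.name] = projected
    return out

queries = [
    Query("qoe", ["xid"], lambda e: e.get("type") == "rebuffer"),
    Query("device", None, lambda e: e.get("device") == "XBox"),
]

sent = publish({"xid": 1234, "type": "rebuffer", "device": "PS4"}, queries)
print(sent)
```

With no active queries `publish` returns an empty mapping, which is the "stream only when needed" case: the app does no per-subscriber work and sends nothing.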
  71. 71. Mantis ● Query based On-demand Streaming of data ● Job Discovery and Job Chaining ● Auto-scaling Jobs and Clusters ● End-to-end Reactive Stream Semantics
  72. 72. Job Discovery & Chaining Aggregator Job Worker 1 Aggregator Job Worker 2 Anomaly Job Worker 1 Anomaly Job Worker 2 Mantis Master
  73. 73. Job Discovery & Chaining Aggregator Job Worker 1 Aggregator Job Worker 2 Anomaly Job Worker 1 Anomaly Job Worker 2 Mantis Master Subscribe to Aggregator Job scheduling info stream
  74. 74. Job Discovery & Chaining Aggregator Job Worker 1 Aggregator Job Worker 2 Anomaly Job Worker 1 Anomaly Job Worker 2 Aggregator Job scheduling info { worker1 : { host : 1.1.1.1, port : 8888 }, … } Mantis Master
  75. 75. Job Discovery & Chaining Aggregator Job Worker 1 Aggregator Job Worker 2 Anomaly Job Worker 1 Anomaly Job Worker 2 Connect with Mantis Query Mantis Master
  76. 76. Job Discovery & Chaining Aggregator Job Worker 1 Aggregator Job Worker 2 Anomaly Job Worker 1 Anomaly Job Worker 2 Filtered data Mantis Master
  77. 77. Aggregator Job Worker 2 Mantis Master Fault tolerance: Worker failure Aggregator Job Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Filtered data
  78. 78. Aggregator Job Worker 2 Mantis Master Fault tolerance: Worker failure Aggregator Job Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Filtered data
  79. 79. Aggregator Job Worker 2 Fault tolerance: Worker failure Aggregator Job Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Aggregator Job Worker 3 Mantis Master
  80. 80. Aggregator Job Worker 2 Fault tolerance: Worker failure Aggregator Job Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Mantis Master Aggregator Job scheduling info Aggregator Job Worker 3 Mantis Master
  81. 81. Aggregator Job Worker 2 Fault tolerance: Worker failure Aggregator Job Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Aggregator Job Worker 3 Filtered data Mantis Master
  82. 82. Mantis Master Aggregator v1 Worker 2 In Service Job updates Aggregator v1 Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Filtered data Aggregator v2 Worker 2 Aggregator v2 Worker 1
  83. 83. Aggregator v1 Worker 2 In Service Job updates Aggregator v1 Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Mantis Master Filtered data Aggregator v2 Worker 2 Aggregator v2 Worker 1 Aggregator v2 scheduling info Mantis Master
  84. 84. Aggregator v1 Worker 2 In Service Job updates Aggregator v1 Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Connect with Mantis Query Aggregator v2 Worker 2 Aggregator v2 Worker 1 Mantis Master
  85. 85. Aggregator v1 Worker 2 In Service Job updates Aggregator v1 Worker 1 Anomaly Job Worker 1 Anomaly Job Worker 2 Filtered data Aggregator v2 Worker 2 Aggregator v2 Worker 1 Mantis Master
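The discovery flow on slides 72 to 85 (subscribe to a job's scheduling info, connect to its workers, reconnect when a worker is replaced or a new version is deployed) can be sketched as a small in-process registry. This is an illustration under assumptions, not the Mantis Master API: `Master`, `DownstreamWorker` and every address other than the `1.1.1.1:8888` shown on slide 74 are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Endpoint = Tuple[str, int]  # (host, port)

class Master:
    """Hypothetical registry: tracks worker endpoints per job and notifies subscribers."""
    def __init__(self) -> None:
        self._workers: Dict[str, Set[Endpoint]] = {}
        self._subs: Dict[str, List[Callable[[Set[Endpoint]], None]]] = {}

    def subscribe(self, job: str, on_update: Callable[[Set[Endpoint]], None]) -> None:
        self._subs.setdefault(job, []).append(on_update)
        on_update(set(self._workers.get(job, set())))  # replay current state to the new subscriber

    def set_workers(self, job: str, endpoints: Set[Endpoint]) -> None:
        self._workers[job] = set(endpoints)
        for callback in self._subs.get(job, []):
            callback(set(endpoints))

class DownstreamWorker:
    """Keeps its connections in line with the latest scheduling info."""
    def __init__(self) -> None:
        self.connected: Set[Endpoint] = set()
        self.log: List[str] = []

    def on_scheduling_info(self, endpoints: Set[Endpoint]) -> None:
        for host, port in sorted(endpoints - self.connected):
            self.log.append(f"connect {host}:{port}")
        for host, port in sorted(self.connected - endpoints):
            self.log.append(f"disconnect {host}:{port}")
        self.connected = set(endpoints)

master = Master()
master.set_workers("aggregator", {("1.1.1.1", 8888), ("1.1.1.2", 8888)})
anomaly = DownstreamWorker()
master.subscribe("aggregator", anomaly.on_scheduling_info)

# Worker 2 fails and is replaced by worker 3 on a different host.
master.set_workers("aggregator", {("1.1.1.1", 8888), ("1.1.1.3", 8888)})
print(anomaly.log)
```

The same update path covers the in-service job update on slides 82 to 85: publishing the v2 endpoints makes the downstream job connect to them and drop the v1 connections.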
  86. 86. Job Chaining Example
  87. 87. Kafka partition - multiple consumers 0 N Kafka TopicPartition Consumer 1 (device == XBox) Consumer 2 (type == rebuffer) Consumer 3 (xid = 0afcedbxe)
  88. 88. Reuse Kafka Data Streams 0 N Mantis Kafka Consumer Job Kafka TopicPartition
  89. 89. Reuse Kafka Data Streams 0 N Kafka TopicPartition Mantis Kafka Consumer Job Device Analysis Job (SELECT * WHERE device = XBox) QoE Analysis Job (SELECT * WHERE type = rebuffer) Adhoc Transaction Analysis Job (SELECT * WHERE xid = 0afcedbxe)
  90. 90. Mantis ● Query based On-demand Streaming of data ● Job Discovery and Job Chaining ● Auto-scaling Jobs and Clusters ● End-to-end Reactive Stream Semantics
  91. 91. Mantis Agent Cluster Autoscaling EC2 Instance EC2 Instance Job 1 Workers
  92. 92. Mantis Agent Cluster Autoscaling EC2 Instance EC2 Instance Job 2 Workers
  93. 93. Mantis Agent Cluster Autoscaling EC2 Instance EC2 Instance EC2 Instance Job 2 Workers
  94. 94. Mantis Agent Cluster Autoscaling EC2 Instance EC2 Instance EC2 Instance Job 2 scale up
  95. 95. Mantis Agent Cluster Autoscaling EC2 Instance EC2 Instance EC2 Instance Job 2 scale up EC2 Instance
  96. 96. Bin Packing ● Simple round robin scheduling causes fragmentation ● Smarter bin-packing of jobs frees up Mantis agents for scale down Host 1 Host 2 Host 3 Host 4 vs. Host 1 Host 2 Host 3 Host 4
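The contrast on slide 96 can be made concrete with a small placement comparison. The sketch below uses a generic first-fit heuristic as the bin-packing strategy. It is not Fenzo's algorithm, and the host count, host capacity and worker sizes are made-up values.

```python
from typing import List

def round_robin(demands: List[int], hosts: int, capacity: int) -> List[int]:
    """Place each worker on the next host in rotation that still has room."""
    used = [0] * hosts
    nxt = 0
    for d in demands:
        for step in range(hosts):
            h = (nxt + step) % hosts
            if used[h] + d <= capacity:
                used[h] += d
                nxt = (h + 1) % hosts
                break
        else:
            raise ValueError("no host can fit this worker")
    return used

def first_fit(demands: List[int], hosts: int, capacity: int) -> List[int]:
    """Place each worker on the lowest-numbered host that still has room."""
    used = [0] * hosts
    for d in demands:
        for h in range(hosts):
            if used[h] + d <= capacity:
                used[h] += d
                break
        else:
            raise ValueError("no host can fit this worker")
    return used

def idle_hosts(used: List[int]) -> int:
    return sum(1 for u in used if u == 0)

workers = [2, 1, 1, 2, 1, 1]   # CPU units per worker (hypothetical)
rr = round_robin(workers, hosts=4, capacity=4)
ff = first_fit(workers, hosts=4, capacity=4)
print("round robin:", rr, "idle hosts:", idle_hosts(rr))
print("first fit:  ", ff, "idle hosts:", idle_hosts(ff))
```

Round robin leaves every host partly used, so none can be terminated. First fit fills two hosts completely and leaves two empty, which is what makes cluster scale-down possible.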
  97. 97. Mantis ● Query based On-demand Streaming of data ● Job Discovery and Job Chaining ● Auto-scaling Jobs and Clusters ● End-to-end Reactive Stream Semantics
  98. 98. Back Pressure
  99. 99. End-to-end Reactive Streams ● RxJava operators compose backpressure within a single worker ● Reactive Socket for backpressure across network boundaries
  100. 100. Reactive Socket ● Application layer protocol for async non-blocking backpressure across network boundaries ● Rich interaction modes ● Pluggable transport protocol Node A Node B Request N Data
  101. 101. End-to-end Reactive Streams filter() map() groupBy() Stage 1 data data demand demand Mantis Job
  102. 102. End-to-end Reactive Streams filter() map() groupBy() Stage 1 data data demand demand Mantis Job window() reduce() flatMap() Stage 2 data data demand demand Reactive Socket
  103. 103. End-to-end Reactive Streams Stage 1 Reactive Socket Mantis Job 1 Mantis Job 2
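The "Request N" exchange on slides 100 to 103 is the core of demand-driven backpressure: the consumer signals how many items it can take, and the producer sends no more than that. Below is a minimal synchronous sketch of the idea. It is not the Reactive Socket or RxJava API, which are asynchronous and work across the network; `Publisher`, `Subscriber` and the buffer size are hypothetical.

```python
from collections import deque
from typing import Deque, Iterator, List

class Publisher:
    """Hypothetical producer that emits at most as many items as were requested."""
    def __init__(self, items: Iterator[int]) -> None:
        self._items = items
        self.sent = 0

    def request(self, n: int) -> List[int]:
        batch: List[int] = []
        for _ in range(n):
            try:
                batch.append(next(self._items))
            except StopIteration:
                break                      # source exhausted
        self.sent += len(batch)
        return batch

class Subscriber:
    """Consumer with a bounded buffer; it only requests what fits."""
    def __init__(self, publisher: Publisher, buffer_size: int) -> None:
        self.publisher = publisher
        self.buffer: Deque[int] = deque()
        self.buffer_size = buffer_size
        self.max_buffered = 0
        self.processed: List[int] = []

    def pull(self) -> None:
        free = self.buffer_size - len(self.buffer)
        self.buffer.extend(self.publisher.request(free))   # "Request N" with N = free slots
        self.max_buffered = max(self.max_buffered, len(self.buffer))

    def process_one(self) -> None:
        if self.buffer:
            self.processed.append(self.buffer.popleft())

pub = Publisher(iter(range(10)))
sub = Subscriber(pub, buffer_size=4)
while True:
    sub.pull()
    if not sub.buffer:      # nothing left to request or process
        break
    sub.process_one()

print(sub.processed, "max buffered:", sub.max_buffered)
```

However fast the producer could go, the consumer's buffer never holds more than its configured size, because data only moves in response to demand.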
  104. 104. Reactive Stream Processing
  105. 105. Reactive Stream Processing Message Driven
  106. 106. Reactive Stream Processing Message Driven Elastic
  107. 107. Reactive Stream Processing Message Driven Elastic Resilient
  108. 108. Reactive Stream Processing Message Driven Elastic Resilient Responsive
  109. 109. Mantis Today ● ~650 Jobs in production ● ~8 Million events / sec processed ● 80 Gb/s processed (instead of 10 Tb/s due to filtering) i.e. 99% less data !! ● The processed data gets reused by other jobs further reducing costs ● Auto-scaling jobs use up to 75% fewer resources compared to peak
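The "99% less data" figure on slide 109 follows directly from the two rates quoted there, assuming decimal units (1 Tb = 1000 Gb):

```latex
\[
1 - \frac{80\ \text{Gb/s}}{10\ \text{Tb/s}}
  = 1 - \frac{80}{10\,000}
  = 0.992 \approx 99\%
\]
```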
  110. 110. References ● Mantis Blogpost http://techblog.netflix.com/2016/03/stream-processing-with-mantis.html ● Resource Scheduling on Mesos https://www.youtube.com/watch?v=uyGEgWAG9EQ ● Fenzo https://github.com/Netflix/Fenzo ● RxJava https://github.com/ReactiveX/RxJava ● Reactive Socket http://reactivesocket.io/ ● RxNetty https://github.com/ReactiveX/RxNetty
  111. 111. Questions? Reactive Stream Processing with Mantis Neeraj Joshi Nick Mahilani @neerajrj @nick_mahilani
