Bay Area Spark Meetup
05/19/2015
@
Spark Streaming
Resiliency
Prasanna Padmanabhan & Bharat Venkat
Personalization Infrastructure

Spark Streaming Resiliency (Bay Area Spark Meetup)

Netflix is a data-driven organization that emphasizes data quality and availability, and the agility to capture and process that data. Some of our recommendation algorithms are computed as events happen in real time. Such streaming applications are long-running tasks that need to be resilient. This is especially true in a cloud deployment, given the ephemeral nature of its resources. In this talk, we will cover the What, the Why and the How of our resiliency exercise with Spark Streaming in an AWS cloud deployment. To simulate failures, we used an approach based on Netflix's Chaos Monkey that randomly terminated instances or processes. We hope that this exercise will help build confidence in the resiliency of Spark Streaming in similar contexts.
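
The failure-injection loop described above (randomly terminate a process, then check that it comes back) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use Chaos Monkey or Spark, and the names `Component`, `chaos_round`, `supervise` and `run_experiment` are hypothetical, introduced here for illustration.

```python
import random


class Component:
    """Hypothetical stand-in for a long-running process (driver, master, ...)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.restarts = 0


def chaos_round(components, rng):
    """Kill one randomly chosen live component and return it (None if all are dead)."""
    live = [c for c in components if c.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def supervise(components):
    """Relaunch every dead component, counting the restart (assumed supervisor behavior)."""
    for c in components:
        if not c.alive:
            c.alive = True
            c.restarts += 1


def run_experiment(rounds, seed=0):
    """Alternate failure injection and recovery for a fixed number of rounds."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    components = [Component(n) for n in ("driver", "master", "worker", "executor")]
    for _ in range(rounds):
        chaos_round(components, rng)
        supervise(components)
    return components


components = run_experiment(rounds=10)
```

After the run, every component should be alive again and the restart counts should add up to the number of rounds, which is the property a resiliency exercise of this kind checks for.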

Published in: Technology

  1. 1. Bay Area Spark Meetup 05/19/2015 @
  2. 2. Spark Streaming Resiliency Prasanna Padmanabhan & Bharat Venkat Personalization Infrastructure
  4. 4. Agenda ● Background ● Use cases for Real Time Stream Processing ● Motivations for Spark ● Creating Chaos ● Spark Streaming Primer ● Deployment Setup ● Injecting Chaos in Spark ● Future
  5. 5. Netflix is a logging company
  6. 6. that also happens to stream videos
  7. 7. Scale at Netflix ● 400 Billion events per day ● 8 Million events/sec during peak ● Numerous types of events (UI Events, Play Events, Impression Events, etc.)
  8. 8. What do we do with it? ● Event logs are captured into Hadoop (EMR) ● Run ETL jobs using Hive/Presto to ○ Provide input to pre-compute recommendations ○ User behavior analysis ○ Data analysis and Reporting
  10. 10. Use Cases for Stream Processing: Recommendations based on collective real time signals
  11. 11. Use Cases for Stream Processing: Faster identification of Data Anomalies and Regressions (figure annotation: Bad iPhone push)
  13. 13. Motivations for Spark ● Popular compute engine for batch processing ● Widely used for offline experimentation at Netflix ● Improves agility with interactive queries (figure: Interactive Experimenter’s Notebook)
  14. 14. Motivations for Spark: Single platform to build batch and real-time applications (diagram labels: S3, Micro Services, Spark, Spark Streaming, Recommender Systems, Batch Data, Streaming Data)
  16. 16. Challenges in Cloud ● Ephemeral Resources ● Cannot rely on local state ● No fixed IP
  17. 17. Chaos Monkey Approach ● Simulate failures by randomly killing components ● Failures inevitably happen when least desired ● Lather, Rinse, Repeat!
  18. 18. Can Spark Streaming survive Chaos Monkey?
  20. 20. Spark Components (diagram: Spark Driver; Cluster Manager (Mesos, YARN, Standalone); Worker Nodes, each with an Executor running Tasks)
  21. 21. Spark Driver: Main Program, DAG Scheduler (same diagram)
  22. 22. Cluster Manager: Resource Allocation (same diagram)
  23. 23. Spark Worker: Runs Worker Process & Monitors Executors (same diagram)
  24. 24. How does streaming work? ● Data Streams are processed in batches ● Each batch is processed in Spark ● Results are pushed out in batches
  26. 26. Application Details ● Process subset of UI Events from Kafka ● Compute aggregate metrics ● Publish metrics to Atlas ● Spark 1.2.0
  27. 27. Standalone Cluster Manager ● Provides resource management and resiliency ● All in one package ○ Built-in, easy to deploy ○ Troubleshoot issues with a single team (Databricks)
  28. 28. Deployment
  30. 30. Stream Resiliency ● Streaming application continues to run ● Partial data loss during failure is acceptable
  31. 31. Driver Resiliency (Client Mode): ./spark-submit --deploy-mode "client" (diagram: Master, Workers, Client, Driver)
  32. 32. Driver Resiliency (Client Mode) (same diagram)
  33. 33. Driver Resiliency (Client Mode): Entire Application is killed (same diagram)
  34. 34. Driver Resiliency (Cluster Mode, with supervise): ./spark-submit --deploy-mode "cluster" --supervise (diagram: Master, Workers, Client)
  35. 35. Driver Resiliency (Cluster Mode, with supervise): Driver runs in the worker (diagram: Master, Workers, Client, Driver)
  36. 36. Driver Resiliency (Cluster Mode, with supervise) (same diagram)
  37. 37. Driver Resiliency (Cluster Mode, with supervise): Driver is started in a new worker (same diagram)
  39. 39. Master Resiliency (Single Master) (diagram: Master, Workers, Client)
  40. 40. Master Resiliency (Single Master): Entire Application is killed (same diagram)
  41. 41. Master Resiliency (Multi Master) (diagram: Active Master, Standby Master, Workers, Client)
  42. 42. Master Resiliency (Multi Master): No impact (same diagram)
  43. 43. Master Resiliency (Multi Master) (same diagram)
  44. 44. Master Resiliency (Multi Master): Standby becomes Active (same diagram, with a second Active Master label)
  46. 46. Worker Resiliency: Executor runs as child process of Worker (diagram: Master, Workers, Client, Driver, Executor)
  47. 47. Worker Resiliency (same diagram)
  48. 48. Worker Resiliency: Driver and Executor are also killed (same diagram)
  49. 49. Worker Resiliency: Worker is relaunched (same diagram)
  50. 50. Worker Resiliency: Driver and Executor are also killed ● Worker is relaunched ● Driver and Executor are also relaunched (same diagram)
  52. 52. Executor Resiliency (diagram: Master, Workers, Client, Driver, two Executors)
  53. 53. Executor Resiliency (diagram: Master, Workers, Client, Driver, one Executor)
  54. 54. Executor Resiliency: Executor is relaunched (diagram: Master, Workers, Client, Driver, two Executors)
  55. 55. Executor Resiliency: Executor is relaunched ● Tasks in flight are rescheduled (same diagram)
  57. 57. Resiliency Results
  58. 58. Summary
  60. 60. Future ● Lambda Architecture ● Operational Enhancements ○ Dynamic scaling ○ Additional Spark instrumentation
  61. 61. We are hiring! ● http://bit.ly/persinfra (Senior Software Engineer - Personalization Infra)
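
The "How does streaming work?" and "Application Details" slides describe a micro-batch model: events are grouped into batches, each batch is aggregated, and one result is emitted per batch. The sketch below illustrates that idea in plain Python. It is not Spark Streaming code and not the application from the talk; the function names, the event tuples and the one-second interval are assumptions made for the example.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, key) events into fixed-length time windows.

    Yields (batch_index, keys) in time order. Windows that received
    no events are skipped in this sketch.
    """
    batches = {}
    for ts, key in events:
        batches.setdefault(int(ts // batch_interval), []).append(key)
    for idx in sorted(batches):
        yield idx, batches[idx]


def process_stream(events, batch_interval):
    """Aggregate each batch independently and emit one result per batch."""
    return [
        (idx, dict(Counter(keys)))
        for idx, keys in micro_batches(events, batch_interval)
    ]


# Hypothetical UI events: (seconds since start, event type).
events = [(0.2, "play"), (0.9, "click"), (1.1, "play"), (1.5, "play"), (3.0, "click")]
results = process_stream(events, batch_interval=1.0)
```

Because each batch is aggregated on its own, losing one batch during a failure affects only that batch's metrics, which fits the deck's stated tolerance for partial data loss.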
