Beaming Flink to the Cloud @ Netflix (Flink Forward 2016, Monal Daxini)

Netflix is a data-driven company, and we process over 700 billion streaming events per day with at-least-once processing semantics in the cloud. To make it easy to extract intelligence from this unbounded stream, we are building a Stream Processing as a Service (SPaaS) infrastructure so that users can focus on extracting value and not have to worry about boilerplate infrastructure and scale.

We will share our experience in building a scalable SPaaS using Flink, Apache Beam and Kafka as the foundation layer to process over 1.3 PB of event data without service disruption.
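
As a rough illustration of the at-least-once semantics mentioned above, the sketch below delivers each event before committing its offset, so a crash between delivery and commit causes a replay (a duplicate) rather than a loss. It is a minimal in-memory model, not Netflix's implementation: the `log` list stands in for a Kafka partition, and `consume_at_least_once` and `crash_before_commit_at` are hypothetical names introduced here.

```python
def consume_at_least_once(log, start_offset, sink, crash_before_commit_at=None):
    """Deliver log[start_offset:] to sink, committing each offset only after delivery.

    Returns the last committed offset, i.e. the position to resume from.
    `crash_before_commit_at` (hypothetical) simulates a failure after an event
    was delivered but before its offset was committed.
    """
    committed = start_offset
    for offset in range(start_offset, len(log)):
        sink.append(log[offset])           # 1. deliver the event
        if offset == crash_before_commit_at:
            return committed               # simulated crash: commit never happened
        committed = offset + 1             # 2. commit the offset after delivery
    return committed


log = ["e0", "e1", "e2", "e3"]             # stand-in for one Kafka partition
sink = []

# First run crashes after delivering e2 but before committing it.
resume_from = consume_at_least_once(log, 0, sink, crash_before_commit_at=2)

# Restart from the last committed offset: e2 is delivered a second time.
final_offset = consume_at_least_once(log, resume_from, sink)
```

After the restart, `sink` holds every event from `log`, with `e2` present twice. Downstream consumers therefore need to tolerate duplicates, which is the trade-off at-least-once delivery accepts in exchange for not losing events.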

Published in: Technology

  1. 1. Monal Daxini Engineering Manager Stream Processing, Real Time Data Infrastructure @monaldax @Netflix #keystone Beaming Flink to the Cloud @ Netflix Sep 12 2016
  2. 2. World’s Leading Internet Streaming Service (Global launch Jan 6, 2016)
  3. 3. ● 83+ Million Members, 190+ Countries ● 1000+ device types ● 35% of downstream Internet traffic Netflix Service Scale
  4. 4. Netflix Service Scale - Daily viewing hours 125,000,000,000+ Whoa!
  5. 5. Events Processed / day 1,000,000,000,000+ 1.4 PB That’s a huge number!
  6. 6. Event Scale Peak ● 1T unique events ingested/day ● 16M / sec ● 43GB / sec ● 10MB / message Daily Averages ● 1T+ events processed ● 600B unique events ingested ● 1.4 PB / day ● 4K / event
  7. 7. 99.99% + Availability / Four 9s Keystone Scale
  8. 8. Keystone Events Trend 1/2014 80B / day 1/2015 300B / day 1/2016 1T+ / day
  9. 9. Evolution SPaaS
  10. 10. Goal - Migrate 1.3 PB of event data to a new Pipeline in flight, while ensuring data diff < 0.1%
  11. 11. Few years ago (Season 1) EMR Event Producers
  12. 12. A year ago.. (Season 2) Event Producer Stream Consumers EMR Consumer Kafka Suro Router Event Producer Suro Kafka Suro Proxy
  13. 13. Season 3
  14. 14. Keystone Keystone Stream Processing (SPaaS) Keystone Management Keystone Messaging Schema Support 100% in AWS
  15. 15. Create DuploⓇ Blocks: Let reusability drive new value Our Philosophy
  16. 16. Q4 2015 Event Producers Sinks SPaaS
  17. 17. Keystone Management
  18. 18. Per Stream Auto Dashboard
  19. 19. Keystone SPaaS Stream Consumers Router EMR Fronting Kafka Consumer Kafka Event Producer KSProxy Control Plane Self Service UI 100% in AWS 24 x 7 Region failover
  20. 20. Event flow Keystone Pipeline As a Service
  21. 21. Keystone Stream Consumers Router EMR Fronting Kafka Event Producer Consumer Kafka KSProxy KSLib Control Plane Self Service UI
  22. 22. Keystone Stream Consumers Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane Self Service UI
  23. 23. Keystone Stream Consumers Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane Self Service UI
  24. 24. Keystone Stream Consumers Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane Self Service UI
  25. 25. Keystone Stream Consumers Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane Self Service UI
  26. 26. Routing
  27. 27. Keystone Stream Consumers Samza Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane Self Service UI Checkpoint Cluster
  28. 28. Details.. ○ Massively parallel use-case ■ Per element processing - declarative filtering & projection ○ Stateless except Kafka offset checkpointing state
  29. 29. Routing Infrastructure + Checkpointing Cluster + 0.9.1 Go C language
  30. 30. EC2 Instances Zookeeper (Instance Id assignment) Job Job Job NodeAgent Checkpointing Cluster ASG Immutable Job Config Self-Service UI / Control Plane
  31. 31. Custom Go Executor ./runJob Logs Snapshots Attach Volumes ./runJob ./runJob Reconcile Loop - 1 min Health Check What’s running on the host? Zookeeper (Instance Id assignment)
  32. 32. Yes! You inferred right! No Mesos & No Yarn
  33. 33. Keystone Salient Features
  34. 34. Keystone Scale ● 4000+ Kafka brokers ● 14000+ Samza Jobs ○ Running in docker containers ○ On 1600+ nodes
  35. 35. Keystone Salient Features ● At-least-once delivery semantics* ● Multi-Tenant ● Self Service ● Scalable ● Fault Tolerant 100% in AWS
  36. 36. Keystone Salient Features ● Event payload is immutable ● Inject event metadata ○ guid, timestamp, host, app ● Custom extensible wire protocol ● Kafka producer wrapper
  37. 37. Keystone Salient Features ● Kafka Cluster failover ○ Kafka Kong ● Routing regional / failover ● Scales based on historical traffic ○ Externally driven
  38. 38. Season 4 Plot & Pilot...
  39. 39. Stream Processing As a Service (SPaaS) (more capable)
  40. 40. SPaaS Vision (plot) ● Multi-tenant support for stateful stream processing apps ● Autoscaling managed infrastructure ● Support for schemas ● Self Service Tooling
  41. 41. SPaaS Architecture (plot) SPaaS Manager Titus Container Runtime Framework Specific API or Common API (Beam) [ Dockerized Job ] 1. Create 2. Submit 3. Launch Runner Flink / Spark / Mantis Running Job 1. Submit Job DSL (SQL) Tooling/ Dashboard
  42. 42. Why Apache Beam? ○ Portable API layer to build sophisticated data processing apps ■ Support multiple execution engines ○ Unified model API over bounded and unbounded data sources ○ MillWheel, FlumeJava, Dataflow model lineage SPaaS - “Beam Me Up, Scotty!”
  43. 43. Iterative build out: then ● First - Flink on Titus in VPC, AWS ○ Titus is a cloud runtime platform for container based jobs ● Next - Apache Beam and Flink runner SPaaS - Pilot
  44. 44. 2. Keystone SPaaS-Flink Pilot Use Cases Stream Consumers Router EMR Fronting Kafka Event Producer Consumer Kafka Demux Merge Control Plane Self Service UI
  45. 45. 2. 1. Keystone SPaaS-Flink Pilot Use Cases Stream Consumers Router EMR Fronting Kafka Event Producer Consumer Kafka 3. Demux Merge Control Plane Self Service UI
  46. 46. Broker Router - Massively parallel use case Router
  47. 47. Titus Job Task Manager IP Titus Host 4 Titus Host 5 Flink (1.2) Program Deployment (prod shadow) Zookeeper Job Manager (standby) Job Manager (master) Task Manager Titus Host 1 IP Titus Host 2 …. Task Manager Titus Host 3 IP Titus Job IP IP AWS VPC ENI
  48. 48. Titus High Level Architecture Titus UI Titus UI Docker Registry Docker Registry Rhea container container container docker Titus Agent metrics agent container container SPaaS-Flink Titus executor logging agent zfs mesos agent docker Rhea Titus API Cassandra Titus Master Job Management & Scheduler S3 Zookeeper Docker Registry EC2 Autoscaling API Mesos Master Titus UI (CI/CD) Fenzo
  49. 49. Titus UI
  50. 50. CD
  51. 51. Dashboard
  52. 52. Keystone SPaaS-Flink Pilot Use Case - 2
  53. 53. Keystone SPaaS-Flink Pilot Use Case - 2
  54. 54. Flink Router perf test (YMMV) ○ Note ■ The tests were performed on a specific use case, ■ running in a specific environment, and ■ with one specific event stream and setup.
  55. 55. 1. Keystone Stream Consumers Samza Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane Self Service UI
  56. 56. Details.. ○ Different runtimes for Flink & Samza routers ○ Massively parallel use-case ■ Per element processing ○ Focused on net outcomes
  57. 57. Titus Job Task Manager IP Titus Host 4 Titus Host 5 Flink (1.2) Router Zookeeper Job Manager (standby) Job Manager (master) Task Manager Titus Host 1 IP Titus Host 2 …. Task Manager Titus Host 3 IP Titus Job IP IP AWS VPC ENI Backed state
  58. 58. Flink Router outcome (YMMV) ○ Cost ≅ 17% savings ○ Memory utilization ≅ 16% better ○ CPU utilization ≅ 40% better ○ Network utilization ≅ 10% better ○ Msg. throughput ≅ 1% (avg) - 4% (peak) better
  59. 59. Awaiting Flink Features
  60. 60. Titus Job Task Manager IP Titus Host 4 Titus Host 5 Fine Grained Recovery FLIP-1 Zookeeper Job Manager (standby) Job Manager (master) Task Manager Titus Host 1 IP Titus Host 2 …. Task Manager Titus Host 3 IP Titus Job IP IP X Flink Job Restarts
  61. 61. Awaiting Flink Features (now) ● Checkpoints and savepoints ○ FLINK-4484 - Unify, Persistent checkpoints, periodic savepoints ○ Compatible across Flink version upgrades ○ Inspection tool for debugging
  62. 62. Awaiting Flink Features (now) ● Atomic savepoint and stop (pause) the job ● Dynamic Scaling ○ Resume from savepoint with different parallelism ○ Job elasticity ● FLINK-4545 Adjust TaskManager network buffer when scaling DONE
  63. 63. Awaiting Flink Features (now) ● Cluster Runtime and Elasticity (FLIP - 6) ● Extending Window Function Metadata (FLIP-2) ● Large State support ○ Incremental checkpointing ○ hot standby
  64. 64. Awaiting Flink Features ● Metric tags as key-value pairs - FLINK-4245 ● FLIP-9 Window Trigger DSL ● FLIP-11 Table API ● Side Inputs / Side Outputs (handle late data) DONE
  65. 65. Ponder over: Could a stream processing engine enable building non-analytics applications?
  66. 66. More brain food... ● Netflix Keystone Pipeline Evolution ● Netflix Kafka in Keystone Pipeline ● Samza Meetup Presentation ● Titus talk ● Netflix OSS
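
Slide 28 describes the router as a massively parallel, per-element job doing declarative filtering and projection. A minimal sketch of that idea follows, assuming a rule format invented here for illustration: the `route` function, the `RULES` layout, and the sink and field names are all hypothetical and are not taken from Keystone.

```python
def route(event, rules):
    """Apply every rule to a single event; return {sink_name: projected_event}.

    Each rule is declarative: `filter` lists field values that must match
    exactly, and `fields` lists the keys to keep (the projection).
    No state is carried between events.
    """
    routed = {}
    for rule in rules:
        matches = all(event.get(key) == value for key, value in rule["filter"].items())
        if matches:
            routed[rule["sink"]] = {f: event[f] for f in rule["fields"] if f in event}
    return routed


# Hypothetical routing rules, expressed as data rather than code.
RULES = [
    {"sink": "playback_sink", "filter": {"type": "playback"}, "fields": ["guid", "title_id"]},
    {"sink": "all_events_sink", "filter": {}, "fields": ["guid", "type"]},
]

playback = {"guid": "g-1", "type": "playback", "title_id": 42, "host": "h1"}
search = {"guid": "g-2", "type": "search", "query": "flink", "host": "h2"}

playback_routed = route(playback, RULES)
search_routed = route(search, RULES)
```

Because each event is handled independently, work can be spread across as many parallel tasks as there are partitions, and the only state left to manage is the source offset, matching the "stateless except Kafka offset checkpointing state" note on the same slide.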
