
AWS re:Invent 2017: Netflix Keystone SPaaS (Stream Processing as a Service) - Monal Daxini - ABD320

Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share the details about this platform, and our experience building it.


  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Netflix Keystone SPaaS - Stream Processing as a Service (ABD320) Monal Daxini @monaldax #reInvent #Netflix Stream Processing Infrastructure
  2. 2. @monaldax What Do I Get Out Of This Talk? Organized by role or perspective ● Data engineer: why stream processing, and what does the platform offer? ● Data leader: the product and vision of a stream processing platform ● Platform engineer: how do we build and operate a stream processing platform? @monaldax
  3. 3. @monaldax ● I will focus on the stream processing platform for business insights, which my team builds, mostly based on Flink ● I won’t address operational insights, for which we have different systems ● I won’t compare stream processing engines or cover stream processing concepts
  4. 4. @monaldax Why Stream Processing? @monaldax
  5. 5. @monaldax Why Real-Time Data? ● Low-latency business insights and analytics ● Processing data as it arrives helps spread the workload over time and reduces processing redundancy ● The need to process unbounded data sets is becoming increasingly common
  6. 6. @monaldax Why Build A Stream Processing Platform? ● Enable users to focus on data and business insights, and not worry about building stream processing infrastructure and tooling
  7. 7. @monaldax What Does A Stream Processing Platform Offer?
  8. 8. @monaldax The Platform Needs To Offer A Robust Way To Process Streams That Lets Users Trade Off Ease, Capability, & Flexibility SPaaS
  9. 9. @monaldax What The Stream Processing As A Service Platform Offers ● Point & click routing, filtering, projection ● Streaming jobs ● Streaming SQL support (future) ● Interactive exploration of streams for quick prototyping (future)
  10. 10. @monaldax Point & Click Routing, Filtering, Projection
  11. 11. @monaldax Ingest Pipelines Are The Backbone Of A Real-Time Data Infrastructure (Event Producers → Sinks): Serverless, Turnkey, 100% in AWS
  12. 12. @monaldax Keystone Pipeline – Provision A Managed Data Stream 📽
  13. 13. @monaldax Keystone Self-Serve – Message Formats (* we would eventually like to move away from XPath & our custom parser)
  14. 14. @monaldax Keystone Self-Serve – Optional Projection 📽
  15. 15. @monaldax Keystone Self-serve – Elasticsearch Sink Config
  16. 16. @monaldax Keystone Self-serve – Kafka Sink Partition Key Support
  17. 17. @monaldax Keystone - Configure 1 Data Stream, A Filter, & 3 Sinks
  18. 18. Event Producer Create Kafka Topic, And Three Separate Jobs SPaaS Router Fronting Kafka KSGateway Consumer Kafka KCW Elasticsearch 1 Topic, 3 Jobs Keystone Management @monaldax
  19. 19. Event Flow: Producer Uses Kafka Client Wrapper Or Proxy SPaaS Router Fronting Kafka Event Producer KSGateway Consumer Kafka Keystone Management KCW Elasticsearch @monaldax
  20. 20. Event Flow: Events Queued In Kafka SPaaS Router Fronting Kafka Event Producer KSGateway Consumer Kafka KCW Elasticsearch 3 instances Keystone Management @monaldax
  21. 21. Event Flow: Each Router Reads From Source, Optionally Applies Filter & Projection SPaaS Router Fronting Kafka Event Producer KSGateway Consumer Kafka KCW Elasticsearch 3 instances Keystone Management @monaldax
  22. 22. Event Flow: Each Router Writes To Its Respective Sink SPaaS Router Fronting Kafka Event Producer KSGateway Consumer Kafka KCW Elasticsearch 3 instances Non-Keyed, Keyed Supported Keystone Management @monaldax
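To make the router's role concrete, here is a minimal sketch of what one router job conceptually does: consume from a fronting Kafka topic, apply an optional filter and projection, and write to its sink. This is an illustrative Flink (1.3-era API) example, not the actual Keystone router code; the topic names, broker endpoints, filter predicate, and truncation "projection" are all hypothetical.

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

// Illustrative "router"-style job: source -> filter -> projection -> sink.
public class RouterSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties consumerProps = new Properties();
        consumerProps.setProperty("bootstrap.servers", "fronting-kafka.example:9092"); // hypothetical endpoint
        consumerProps.setProperty("group.id", "router-sketch");

        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer010<>("source-topic", new SimpleStringSchema(), consumerProps));

        DataStream<String> routed = events
                .filter(e -> e.contains("\"type\":\"play\""))   // optional filter
                .map(new MapFunction<String, String>() {        // stand-in "projection": truncate the payload
                    @Override
                    public String map(String event) {
                        return event.substring(0, Math.min(event.length(), 256));
                    }
                });

        routed.addSink(new FlinkKafkaProducer010<>(
                "consumer-kafka.example:9092", "sink-topic", new SimpleStringSchema()));

        env.execute("keystone-router-sketch");
    }
}
```

The real routers also write to Elasticsearch and Hive sinks; a Kafka sink is used here only to keep the sketch self-contained.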
  23. 23. Dashboard Generated For Provisioned Streams
  24. 24. Searchable Router Job Logs
  25. 25. Flink Job Web UI @monaldax
  26. 26. @monaldax Keystone Router Admin Links
  27. 27. Data Stream Operations Are Managed • Fully managed scaling • Managed capacity planning • 24x7 availability at scale • Garbage collection of unused streams @monaldax
  28. 28. Keystone Pipeline - The Road Ahead • Additional components – UDFs, Data Hygiene, Data Alerting, etc • Component chaining in the UI • Schema Support • Data Lineage • Cost attribution @monaldax
  29. 29. @monaldax Point & Click Routing, Filtering, Projection (prod) Streaming Jobs
  30. 30. Why A Streaming Job? • When we need more flexibility and power than the Point & Click pipeline offers, we use stream processing jobs. @monaldax
  31. 31. Generate Streaming Job From Template @monaldax
  32. 32. Generated Jenkins Build
  33. 33. Run And Debug Locally In The IDE @monaldax
  34. 34. Create A New Streaming Job Config For Deployment @monaldax
  35. 35. Deploying A Streaming Job In Test @monaldax
  36. 36. Deploying A Streaming Job In Other Environments @monaldax
  37. 37. Deployment Status Of A Sample Streaming Job
  38. 38. Streaming Job Actions & Links @monaldax
  39. 39. Streaming Job Dashboard – Platform Metrics Auto-updated @monaldax
  40. 40. Searchable Streaming-Job Logs @monaldax
  41. 41. @monaldax ● Use case specific consulting ● Recipes ● Examples and Documentation In Addition, Consulting & Documentation
  42. 42. @monaldax Types of Streaming Jobs
  43. 43. Broadly, Two Categories Of Streaming Jobs • Stateless • No state maintained across events • Stateful • State maintained across events @monaldax
  44. 44. Event Producer Streaming Job In Context Of Keystone Pipeline SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Elasticsearch Streaming Job @monaldax
  45. 45. Image adapted from: Stephan Ewen Stateless Stream Processor – No Internal State @monaldax
  46. 46. Stateless Stream Processor – External State (image adapted from Stephan Ewen) @monaldax
  47. 47. Stateless Example: Generating Plays Feed For Personalization, And Discovery Of Shows
  48. 48. @monaldax Stateless Streaming Job Use Case: High-Level Architecture – Enriching And Identifying Certain Plays (diagram: Play Logs → Streaming Job, with live-service lookups to the Playback History Service and Video Metadata)
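As a rough illustration of this kind of stateless job, the sketch below enriches each play event with video metadata looked up per event. A static in-memory map stands in for the live metadata service described on the slide, and the class and field names are hypothetical, not part of the Keystone API.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

// Stateless enrichment: no state is kept across events; each event is enriched
// independently. Usage in a job: playLogs.map(new EnrichPlayEvents())
public class EnrichPlayEvents extends RichMapFunction<Long, Tuple2<Long, String>> {

    private transient Map<Long, String> videoTitles; // stand-in for the live lookup service

    @Override
    public void open(Configuration parameters) {
        videoTitles = new HashMap<>();
        videoTitles.put(1L, "placeholder-title-1"); // illustrative metadata only
        videoTitles.put(2L, "placeholder-title-2");
    }

    @Override
    public Tuple2<Long, String> map(Long videoId) {
        return Tuple2.of(videoId, videoTitles.getOrDefault(videoId, "unknown"));
    }
}
```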
  49. 49. Stateful Stream Processing (image adapted from Stephan Ewen) @monaldax
  50. 50. Stateful Example: Creating Search Sessions
  51. 51. Search Personalization – Custom Windowing On Out-Of-Order Events (timeline diagram: session start/end events arrive out of order and are grouped into Session 1 and Session 2, separated by a gap of hours) @monaldax
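A simplified way to express this kind of sessionization in Flink is sketched below: built-in event-time session windows with a bounded out-of-orderness watermark stand in for the custom windowing the slide describes. The (userId, eventTimeMillis) events, the 30-minute session gap, and the 10-second lateness bound are all illustrative assumptions.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Sessionization sketch: group out-of-order search events into per-user sessions.
public class SearchSessions {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Tuple2<String, Long>> searches = env.fromElements(
                Tuple2.of("user-1", 1000L),
                Tuple2.of("user-1", 4000L),
                Tuple2.of("user-1", 2000L)); // out of order: earlier timestamp arrives later

        searches
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> event) {
                        return event.f1; // event time carried in the record
                    }
                })
            .keyBy(0)                                                  // key by user id
            .window(EventTimeSessionWindows.withGap(Time.minutes(30))) // a 30-minute gap closes a session
            .sum(1)                                                    // placeholder per-session aggregation
            .print();

        env.execute("search-sessionization-sketch");
    }
}
```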
  52. 52. Stateful Streaming Application With Local State, Checkpoints, And Savepoints (diagram: Sources → Streaming Application on the Flink Engine with Local State → Sinks; Checkpoints are automatic, Savepoints are explicitly triggered) @monaldax
  53. 53. Streaming Job (Flink) Savepoint Tooling Support • Amazon S3 based multi-tenant storage management • Auto savepoint and resume from savepoint on redeploy • Resume from an existing savepoint @monaldax
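The platform tooling manages the S3 storage layout and savepoint/resume flow automatically. For orientation only, here is a minimal sketch of the job-side checkpoint configuration that underlies it; the interval, mode, and bucket path are illustrative assumptions, not platform defaults.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: periodic checkpoints to a filesystem/S3 state backend. Keystone
// provides at-least-once semantics, so AT_LEAST_ONCE is used here.
public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(30_000); // checkpoint every 30 seconds (illustrative)
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        env.setStateBackend(new FsStateBackend("s3://example-bucket/flink/checkpoints")); // hypothetical path

        // define sources, operators, and sinks here, then call env.execute(...)
    }
}
```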
  54. 54. Streaming Job (Flink) High-Level Features • Stateless jobs • Event enrichment by accessing services through platform thick clients • Stateful jobs with 100s of GB of state, with larger state support in the works • Reusable blocks (in progress) • Job development, deployment, and monitoring tooling (alpha) @monaldax
  55. 55. Streaming Jobs - The Road Ahead • Easy resource provisioning estimates • Flink support for reading from and writing to the data warehouse, and backfill • Continue to evolve tooling and support for large state • Reusable components - sources, sinks, operators, schema support, data hygiene • Tooling support for Spark Streaming @monaldax
  56. 56. @monaldax Scale?
  57. 57. Prod – Trending Events & Scale, With Events Flowing To Hive, Elasticsearch, Kafka (≅ 80B to 1.3T per day) • 1.3T+ events processed per day • 600B to 1T unique events per day • 2+ PB in, 4.5+ PB out per day • Peak: 12M events in / sec & 36 GB / sec @monaldax
  58. 58. @monaldax Keystone Router & Stream Processing Jobs Scale (m4.4xl instances)
  59. 59. @monaldax How Do We Do It?
  60. 60. @monaldax RTDI Consists Of 4 Systems. The Keystone Pipeline Runs 24x7 & Does Not Impact Members’ Ability To Play Videos (diagram: Keystone Messaging, Keystone Stream Processing (SPaaS), Keystone Management; 24x7 across Dev, Test, Prod; granular shadowing)
  61. 61. Event Producer Components & Streaming Jobs SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Hive Elasticsearch Streaming Job @monaldax
  62. 62. Event Producer Event Producer Library SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Hive Elasticsearch Streaming Job @monaldax
  63. 63. • Inject event metadata - GUID, timestamp, host, app • Transparent and dynamic traffic routing for producers • Chaski - Custom binary data wrapper within Keystone pipeline • Multiple serialization support & Additional metadata • Netflix ecosystem integration – Eureka, Archaius, Atlas Producer Library - Kafka Client Wrapper @monaldax
  64. 64. Streaming Job Event Producer Boundary Of Custom Binary Data Wrapper SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Hive Elasticsearch @monaldax
  65. 65. Producer Library - Kafka Client Wrapper • Automated Kafka producer buffer (60s) tuning based on traffic • Best-effort delivery; prioritizes host application availability • acks=1; does not block to send events; unclean leader election • Non-keyed messages; retries sends to available partitions • 99.9%+ delivery @monaldax
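A rough sketch of producer settings in the spirit of these availability-first defaults follows. The endpoint and exact values are illustrative, not the wrapper's actual configuration, and unclean leader election is a broker/topic-level setting rather than a producer property.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Availability-first producer sketch: leader-only acks, no blocking on send,
// best-effort delivery with a callback that only records drops.
public class BestEffortProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "fronting-kafka.example:9092"); // hypothetical endpoint
        props.put("acks", "1");                  // leader ack only
        props.put("max.block.ms", "0");          // never block the host application
        props.put("linger.ms", "50");            // small batching delay (illustrative)
        props.put("buffer.memory", "67108864");  // the wrapper tunes this dynamically
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] event = "{\"type\":\"play\"}".getBytes();
            // Non-keyed record: the partitioner spreads it across available partitions.
            producer.send(new ProducerRecord<>("example-stream", event),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Best effort: count the drop, do not fail the host app.
                            System.err.println("event dropped: " + exception.getMessage());
                        }
                    });
        }
    }
}
```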
  66. 66. Event Producer KSGateway - Event Proxy For Non-Java Clients, REST & gRPC SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Hive Elasticsearch Streaming Job @monaldax
  67. 67. Event Producer Kafka Clusters (0.10) on Amazon EC2 SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Elasticsearch Streaming Job @monaldax
  68. 68. Why Kafka? • We have message sizes > 1MB and up to 10MB • Large-scale Keystone ingest pipelines result in large fan-out • Lower latency – used for ad-hoc messaging as well • Open source – we can enhance, patch, or extend it • Cons: it’s not managed @monaldax
  69. 69. Scale for Large Fan-out and Isolation - Cascading Topology Fronting Kafka Consumer Kafka Consumer @monaldax
  70. 70. Alternative: Logical Stream (Topic) Spread Across Multiple Topics Across Multiple Clusters (WIP) Multi-Cluster Producer Multi-Cluster Consumer @monaldax
  71. 71. Kafka Deployment Strategies – Version 0.10 (YMMV) • Dedicated Zookeeper cluster per Kafka cluster • Small clusters: < 200 brokers, <= 10K partitions • Partitions distributed evenly across brokers • Rack-aware replica assignment, brokers spread across 3 zones • 2 replicas & unclean leader election on • Non-transactional @monaldax
  72. 72. • 36+ Kafka & Zookeeper clusters • 4000+ brokers (EC2), 700+ topics • 3000+ d2.xl, 900+ i2.2xl • Highly available 99.99%+ • Retention 2hr, 4hr, 8hr, 24hr Kafka Clusters Scale @monaldax
  73. 73. Event Producer Stream Processing Platform Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Elasticsearch Stream Consumers @monaldax
  74. 74. High-level Stream Processing Platform Architecture - Routers Keystone Management Point & Click Router Streaming Job Container Runtime 1. Create Streaming Job 2. Launch Job with Config, Source, Sink, Filters, Projections, etc. 3. Launch Containers • Immutable Image • Automated, system driven config overrides @monaldax
  75. 75. Streaming Jobs (Flink 1.3.2) • The Keystone pipeline is built on Flink routers • Each Flink router is a stream processing job • Router provisioning is based on incoming traffic or estimates • Runs on containers atop EC2 • Island mode - single AWS region @monaldax
  76. 76. High-level Stream Processing Platform Architecture Streaming Jobs Keystone Management Point & Click or Streaming Job Container Runtime 1. Create Streaming Job 2. Launch Job with Config overrides 3. Launch Containers • Immutable Image • User driven config overrides @monaldax
  77. 77. Stream Processing Platform - Layered cake Amazon EC2 Titus Container Runtime Stream Processing Platform (Flink Streaming Engine, Config Management) Reusable Components Source & Sink Connectors, Filtering, Projection, etc. Routers (Streaming Job) Streaming Jobs @monaldax
  78. 78. @monaldax Flink Job Cluster In HA Mode Zookeeper Job Manager Leader (WebUI) Task Manager Task Manager Task Manager Job Manager (WebUI) One dedicated Zookeeper cluster for all streaming jobs
  79. 79. Flink Task Slots & Automatic Operator Chaining (image: Flink 1.2 documentation) @monaldax
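For reference, chaining happens automatically in Flink, and it can be tuned per operator with the standard DataStream API; the operators in the sketch below are placeholders rather than anything Keystone-specific.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Chaining is automatic; startNewChain()/disableChaining() adjust it per operator.
public class ChainingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
           .filter(s -> !s.isEmpty()).startNewChain()        // start a new chain at this operator
           .filter(s -> s.startsWith("a")).disableChaining() // isolate this operator in its own task
           .print();

        env.execute("chaining-sketch");
    }
}
```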
  80. 80. @monaldax Flink Job Cluster In HA Mode With Checkpoints Zookeeper Job Manager (Leader) Task Manager Task Manager Task Manager Job Manager State Checkpoints State Metadata Checkpoints
  81. 81. Flink Checkpoints Similar To 2-Phase Commit (image: Flink 1.2 documentation) @monaldax
  82. 82. @monaldax Checkpoints Are Taken Often (diagram: Job Managers (master and standby) and Task Managers run as Titus containers on EC2 hosts in an AWS VPC; state – checkpoints and Kafka offsets – is saved)
  83. 83. @monaldax Checkpoints Are Taken Often. A Container Could Fail… (diagram: one Task Manager container fails)
  84. 84. @monaldax Failed Container Automatically Replaced. State Restored To Last Checkpoint; Partial Recovery Supported (diagram: a replacement Task Manager container is launched and state – checkpoints and Kafka offsets – is restored)
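A sketch of the job-level side of this behavior: configuring a restart strategy so that when a task manager container is replaced, the job restarts and reloads operator state from the most recent checkpoint. The retry count and delay are illustrative, not the platform's defaults.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// On failure the job is restarted (here: up to 10 attempts, 30s apart) and
// each operator's state is restored from the last completed checkpoint.
public class RestartStrategySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(30_000); // illustrative checkpoint interval
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                10,                      // restart attempts (illustrative)
                Time.seconds(30)));      // delay between attempts (illustrative)
    }
}
```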
  85. 85. Event Producer and Streaming Jobs Management SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Hive Elasticsearch Streaming Job @monaldax
  86. 86. @monaldax Keystone Management Current Architecture - Imperative, Composable Joblets
  87. 87. @monaldax Keystone Management New Architecture (WIP) – Declarative
  88. 88. @monaldax Keystone Management New Architecture (WIP)
  89. 89. • The ability to pass data along the chain of Joblets within a Job • Locks and semaphores on resources spanning across jobs • Customization and integration into Netflix ecosystem – Eureka, etc. Keystone Management Unique Features @monaldax
  90. 90. @monaldax How Do We Operate It? Scale Operations Using Systems Not Humans
  91. 91. • No separate Ops team • No separate QA team • No separate Dev team • It’s all done by developers of the Real Time Data Infrastructure We Run What We Build! @monaldax
  92. 92. We Leverage Other Netflix Systems • We rely on metrics, monitoring, alerting & paging, and automation • Separate metrics system – Atlas • Separate alert configuration and alert-actions system • Options for a separate system to run cross-system automation tasks @monaldax
  93. 93. Easy Alert Configuration And Status @monaldax
  94. 94. Easy View Of Fired Alerts @monaldax
  95. 95. Streaming Job Event Producer Operating KSGateway - Event Proxy For Non-Java Clients SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management Hive Elasticsearch • Stateless service • Scaled using Elastic Load Balancing and an Auto Scaling group • Pre-scaled for planned increases in traffic @monaldax
  96. 96. Streaming Job Event Producer Event Producer-Related Monitoring And Alerts SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Elasticsearch @monaldax
  97. 97. @monaldax Monitoring Producer, Alert On Drop Rate
  98. 98. Event Producer Kafka Clusters SPaaS Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Hive Elasticsearch Streaming Job @monaldax
  99. 99. @monaldax Kafka Failover - Fronting Kafka Clusters
  100. 100. @monaldax Fully Automated Kafka Cluster Failover – As Fast As 5 Minutes
  101. 101. @monaldax Kafka Cluster & Routers In Healthy State Flink Router Fronting Kafka Event Producer
  102. 102. @monaldax Issue With Kafka Cluster Flink Router Fronting Kafka Event Producer X
  103. 103. @monaldax Launch Backup Kafka Cluster With Same Number Of Instances, But Smaller Instance Type Flink Router Fronting Kafka Event Producer Bring up failover Kafka cluster Copy metadata from Zookeeper X
  104. 104. @monaldax Change Producer Config To Produce To Failover Cluster, And Launch Routers For Failover Traffic Flink Router Fronting Kafka Event Producer Failover Flink Router X
  105. 105. @monaldax Change Producer Config To Original Cluster, And Finish Draining Events From Backup Flink Router Flink Router Fronting Kafka Event Producer Failover Flink Router
  106. 106. @monaldax Decommission Backup Cluster And Router Once Original Cluster Is Fixed, Or A Replacement Cluster Is Live Flink Router Fronting Kafka Event Producer Failover Flink Router X X
  107. 107. @monaldax Flink Router Fronting Kafka Event Producer Back To Steady State With Click Of A Button
  108. 108. Consumer Kafka Clusters • Failover is currently supported for fronting Kafka clusters only • We are working on a multi-consumer client with support for keyed messages to enable failover of consumer Kafka clusters @monaldax
  109. 109. Planned & Regular Kafka Kong This Automation Also Serves As Kafka Kong, A Tool That Follows Principles Of Chaos Engineering @monaldax
  110. 110. Kafka Operation Strategies (YMMV) • Over-provision for traffic variations and failover • Broker health & outlier detection and auto-termination • 99th percentile response time • Broker TCP timeouts, errors, retransmissions • Producer send latency @monaldax
  111. 111. Kafka Operation Strategies (YMMV) • Scale up by: • Adding partitions on new brokers – requires non-keyed messages (see the illustrative sketch below) • Partition reassignment – in small batches with a custom tool • Scale down by: • Creating new topics / new clusters • Creating new clusters – using the Kafka failover automation @monaldax
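For illustration only, the sketch below shows how adding partitions to a topic looks with the newer Kafka AdminClient API (Kafka 1.0+); the 0.10 clusters described in these slides predate this API and used custom tooling instead, and the endpoint, topic name, and partition count are hypothetical.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

// Illustrative only: increase a topic's partition count. Safe only for
// non-keyed messages, because existing key-to-partition assignments change
// when partitions are added.
public class AddPartitionsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "fronting-kafka.example:9092"); // hypothetical endpoint

        try (AdminClient admin = AdminClient.create(props)) {
            admin.createPartitions(
                    Collections.singletonMap("example-stream", NewPartitions.increaseTo(64)))
                 .all()
                 .get(); // block until the controller applies the change
        }
    }
}
```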
  112. 112. Event Producer Stream Processing Platform Router Fronting Kafka KSGateway Consumer Kafka Keystone Management KCW Elasticsearch Flink Streaming Job @monaldax
  113. 113. • Container replacement • Checkpoints and Savepoints • Keep retrying if event data format is valid • Isolation – issue with one sink does not impact another Routers & Streaming Job Fault Tolerance By Design @monaldax
  114. 114. Router Deployment Automation • Provision new or updated streams • Bulk updates, router termination, and re-deployment • Automatic partial recovery allows zero-touch migration of the underlying container infrastructure • Manual – KSRunbook @monaldax
  115. 115. For Manual Intervention, We Have A Runbook. The Goal Is To Automate And Keep The Runbook Small @monaldax
  116. 116. Router Capacity Planning And Provisioning • Per-stream provisioning based on the past week’s traffic or a bit-rate estimate • Provision buffer capacity • Run 1 additional container for latency-sensitive consumers • Manual, % increase – easy to compute and deploy (see the toy estimate below) • Plan capacity to handle service failover and holiday peaks @monaldax
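As a toy illustration of the "easy to compute" part, the sketch below estimates a router's container count from the previous week's peak throughput plus a buffer; the per-container throughput and buffer percentage are made-up numbers, not platform figures.

```java
// Toy capacity estimate: ceil(peak * (1 + buffer) / per-container throughput) + 1.
public class RouterCapacitySketch {
    public static void main(String[] args) {
        double peakMBps = 120.0;            // last week's peak for this stream (hypothetical)
        double bufferFraction = 0.3;        // headroom for variation and failover (hypothetical)
        double perContainerMBps = 20.0;     // sustained throughput per container (hypothetical)

        int containers = (int) Math.ceil(peakMBps * (1 + bufferFraction) / perContainerMBps);
        containers += 1;                    // extra container for latency-sensitive consumers

        System.out.println("containers needed: " + containers); // prints 9 for these numbers
    }
}
```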
  117. 117. Admin Tooling To Scale Up Manually, Or To Deploy A New Build @monaldax
  118. 118. Application Metrics – Router Message Flow @monaldax
  119. 119. Application Metrics – Router filtering @monaldax
  120. 120. Platform-level Metrics – Kafka Offset Metrics
  121. 121. System Metrics - Router JVM Metrics @monaldax
  122. 122. Alerts – Hive Sink Router @monaldax
  123. 123. @monaldax Flink Streaming Job – Split between application and infrastructure: ● Metrics and monitoring ● Alerts ● Paging and on-call rotations ● Platform customers follow the same “we build it, we run it” model
  124. 124. Example Streaming Job Application Level Simulated Metrics
  125. 125. Example Streaming Job System Level Simulated Metrics
  126. 126. @monaldax Operations – The Road Ahead ● True auto-scaling ● Bootstrap capacity planning for stateful streaming jobs ● Automated canary tooling & data parity ● Quick testing and performance profiling of Point & Click components, e.g., iterating over a filter definition
  127. 127. @monaldax I Want To Learn More ● http://bit.ly/mLOOP - Deep dive into Unbounded Data Processing Systems ● http://bit.ly/m17FF - Keynote – Stream Processing with Flink at Netflix ● http://bit.ly/2BoYAq0 - Multi-tenant Multi-cluster Kafka Messaging Service
  128. 128. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU! Monal Daxini @monaldax Questions?
