(BDT318) How Netflix Handles Up To 8 Million Events Per Second

In this session, Netflix provides an overview of Keystone, its new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of achieving zero loss while processing over 400 billion events daily. It also covers in detail how Netflix deploys, operates, and scales Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events and 17 GB per second during peak.

Published in: Technology

  1. 1. BDT318 Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second. Peter Bakas, Director of Engineering, Event and Data Pipelines, Netflix. October 2015. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. Who is this guy? Peter Bakas • @ Netflix: Cloud Platform Engineering - Event and Data Pipelines • @ Ooyala: Analytics, Discovery, Platform Engineering & Infrastructure • @ Yahoo: Display Advertising, Behavioral Targeting, Payments • @ PayPal: Site Engineering and Architecture • @ Play: Advisor to various Startups (Data, Security, Containers)
  3. 3. What to Expect from the Session • Architectural design and principles for Keystone • Current state of technologies that Keystone is leveraging • Best practices in operating Kafka and Samza
  4. 4. Why are we here?
  5. 5. Publish, Collect, Process, Aggregate & Move Data
  6. 6. @ Cloud Scale
  7. 7. By the numbers • 550 billion events per day • 8.5 million events & 21 GB per second during peak • 1+ PB per day • Hundreds of event types
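
As a rough sanity check on the figures above (plain arithmetic, not something stated in the talk): 550 billion events per day averages to about 6.4 million events per second, so the 8.5 million peak is roughly 1.3 times the daily average. A minimal Java sketch of that calculation, with hypothetical class and method names:

```java
/** Hypothetical helper for back-of-the-envelope rate checks; not part of Keystone. */
final class KeystoneRates {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average events per second implied by a daily event count. */
    static double averagePerSecond(double perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    /** Ratio of the peak rate to the daily average rate. */
    static double peakToAverage(double peakPerSecond, double perDay) {
        return peakPerSecond / averagePerSecond(perDay);
    }
}
```

For example, `KeystoneRates.peakToAverage(8.5e6, 550e9)` gives the peak-to-average ratio for the slide's numbers.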
  8. 8. How did we get here?
  9. 9. Chukwa
  10. 10. Chukwa/Suro + Real-Time
  11. 11. Chukwa/Suro + Real-Time
  12. 12. Now what?!!
  13. 13. Keystone
  14. 14. Keystone
  15. 15. Split Fronting Kafka Clusters Normal-priority (majority) • 2 copies, 12 hour retention High-priority (streaming activities etc.) • 3 copies, 24 hour retention
  16. 16. Split Fronting Kafka Clusters Instance type - D2XL • Large disk (6 TB) with 450-475 MB/s of measured sequential I/O throughput • Large memory (30 GB) • Medium network capability (~700 Mbps) • Replication lag starts to show when bytes-in exceeds 18 MB/second per broker with thousands of partitions
  17. 17. Kafka Zone Aware Replica Assignment • A PR is available for Apache Kafka • https://github.com/apache/kafka/pull/132 • https://issues.apache.org/jira/browse/KAFKA-1215 • Improved availability • Reduced cost of maintenance
  18. 18. Keystone
  19. 19. Control Plane + Data Plane • Control plane for router is job manager • Infrastructure is data plane • Declarative, reconciliation driven • Smart scheduling managing tradeoffs • Auto Scaling based on traffic • Fault tolerance: application (router) faults, AWS hardware faults
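
The "declarative, reconciliation driven" bullet can be illustrated with a small sketch. This is not Netflix's job manager; it assumes a simplified model in which the desired state is a declared container count per routing job and the control plane repeatedly computes the actions needed to close the gap with what is observed running. The class name, method name, and action strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical reconciler: compares declared container counts per job with observed counts. */
final class Reconciler {
    /** Returns the actions needed to move the observed state toward the desired state. */
    static List<String> reconcile(Map<String, Integer> desired, Map<String, Integer> observed) {
        List<String> actions = new ArrayList<>();
        // Sort job names so the resulting action list is deterministic.
        TreeMap<String, Integer> jobs = new TreeMap<>(desired);
        for (String job : observed.keySet()) {
            jobs.putIfAbsent(job, 0); // running but no longer declared: desired count is 0
        }
        for (Map.Entry<String, Integer> e : jobs.entrySet()) {
            int want = e.getValue();
            int have = observed.getOrDefault(e.getKey(), 0);
            if (have < want) {
                actions.add("start " + (want - have) + " " + e.getKey());
            } else if (have > want) {
                actions.add("stop " + (have - want) + " " + e.getKey());
            }
        }
        return actions;
    }
}
```

Running `reconcile` on a timer, or whenever a container fails, means a crashed router container simply shows up as a gap between desired and observed counts, which is one common way such a design gets fault tolerance without special-case recovery code.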
  20. 20. Keystone
  21. 21. Routing Service - Samza
  22. 22. Routing Service - Samza
  23. 23. Routing Service - Samza Amazon S3 Routing • ~5800 long running containers for Amazon S3 routing • ~500 C3-4XL AWS instances for Amazon S3 routing Elasticsearch Routing • ~850 long running containers for Elasticsearch routing • ~70 C3-4XL AWS instances for Elasticsearch routing Kafka Routing • ~3200 long running containers for Kafka routing • ~280 C3-4XL AWS instances for Kafka routing
  24. 24. Routing Service - Samza Container Footprint: • 2G - 5G memory • 160 Mbps max network bandwidth • 1 CPU Share • 20G disk for buffer & logs • Processes 1-12 partitions • Periodically reports health to infrastructure
  25. 25. Routing Service - Samza Observed Numbers: • Avg memory usage of ~1.8G per container • Avg memory usage per node ~20G (range: 7G - 25G) • Avg CPU utilization of 8% per node • Avg NetworkIn 256 Mbps per node • Avg NetworkOut 156 Mbps per node
  26. 26. Routing Service - Samza Publish to Amazon S3 sink: • Every 200 MB or 2 minutes • S3 average upload latency 200 ms Producer to Router latency: • 30th percentile of topics under 500 ms • 70th percentile of topics under 1 second • 90th percentile under 2 seconds • Overall average about 2.5 seconds Kafka to Router consumer lag (estimated time to catch up): • 65th percentile under 500 ms • 90th percentile under 5 seconds
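
The S3 sink rule above (publish every 200 MB or 2 minutes, whichever comes first) is a size-or-time batching policy. The sketch below shows one way such a policy could be written; it is not the router's actual code. The class and method names are hypothetical, and it assumes the 2-minute clock starts at the first message of a batch, which the slide does not specify.

```java
/** Hypothetical batching policy: flush when the batch is large enough or old enough. */
final class FlushPolicy {
    private final long maxBytes;
    private final long maxAgeMillis;
    private long bufferedBytes = 0;
    private long batchStartMillis = -1; // -1 means the batch is empty

    FlushPolicy(long maxBytes, long maxAgeMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Records an appended message; the first message starts the batch clock (an assumption). */
    void append(long messageBytes, long nowMillis) {
        if (batchStartMillis < 0) {
            batchStartMillis = nowMillis;
        }
        bufferedBytes += messageBytes;
    }

    /** True when either the size threshold or the age threshold has been reached. */
    boolean shouldFlush(long nowMillis) {
        if (batchStartMillis < 0) {
            return false; // nothing buffered, nothing to upload
        }
        return bufferedBytes >= maxBytes || nowMillis - batchStartMillis >= maxAgeMillis;
    }

    /** Clears the batch after a successful upload. */
    void reset() {
        bufferedBytes = 0;
        batchStartMillis = -1;
    }
}
```

With the slide's thresholds this would be constructed as `new FlushPolicy(200L * 1024 * 1024, 120_000)`, treating "200 MB" as 200 MiB for illustration. Passing the clock in as a parameter keeps the policy easy to test.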
  27. 27. + Alterations • Internal build of Samza version 0.9.1 • Fixed SAMZA-41 in 0.9.1: support static partition range assignment • Added SAMZA-775 in 0.9.1: prefetch buffer specified based on heap to use • Backported SAMZA-655 to 0.9.1: environment variable configuration rewriter • Backported SAMZA-540 to 0.9.1: expose latency related metrics in OffsetManager • Integration with Netflix Alert & Monitoring systems
  28. 28. Keystone
  29. 29. Real-time processing
  30. 30. Real-time processing
  31. 31. Real-time processing
  32. 32. Real-time processing
  33. 33. • Streaming jobs to analyze movie plays, A/B tests, etc. • Direct API for Kafka in 1.3 • Observed 2x performance improvement compared to 1.2 • Additional improvement possible with prefetching and connection pooling (not available yet) • Campaigned for backpressure support • Result - Spark 1.5 release has community developed back pressure support SPARK-7398
  34. 34. Great. How do I use it?
  35. 35. Annotation-based event definition (Java) @Resource(type = ConsumerStorageType.DB, name = "S3Diagnostics") public class S3Diagnostics implements Annotatable { .... S3Diagnostics s3Diagnostics = new S3Diagnostics(); .... LogManager.logEvent(s3Diagnostics); // log this diagnostic event
  36. 36. Non-Java: Keystone Proxy { "eventName" : "ksproxytest", "payload" : { "k1" : "v1", "k2" : "v2" } }
  37. 37. Wire format • Extensible • Currently supports JSON • Will support Avro • Encapsulated as a shareable jar • Immutable message through the pipeline
  38. 38. Producer Resilience • Outage should never disrupt existing instances from serving business purpose • Outage should never prevent new instances from starting up • After service is restored, event producing should resume automatically
  39. 39. Fail, but never block • block.on.buffer.full=false • Handle potential blocking of first metadata request
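
The "fail, but never block" principle can be illustrated without the Kafka client at all. The sketch below is an assumption-laden stand-in, not Keystone's producer: it uses a plain bounded queue from the JDK and drops (and counts) events when the buffer is full instead of making the calling application thread wait, which is the behavior the slide's `block.on.buffer.full=false` setting asks of the producer. The class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical event buffer that drops on overflow rather than blocking the caller. */
final class NonBlockingEventBuffer {
    private final ArrayBlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    NonBlockingEventBuffer(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Enqueues without waiting; returns false and counts a drop if the buffer is full. */
    boolean tryLog(String event) {
        boolean accepted = queue.offer(event); // offer() returns immediately, unlike put()
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Removes and returns the next buffered event, or null when empty (for a sender thread). */
    String poll() {
        return queue.poll();
    }

    /** Number of events dropped so far; useful as an alerting metric. */
    long droppedCount() {
        return dropped.get();
    }
}
```

During a pipeline outage the buffer fills and `tryLog` starts returning false, but the service keeps serving traffic; once a sender thread drains the queue again, event producing resumes without a restart.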
  40. 40. Trust me, it works!
  41. 41. Keystone Dashboard
  42. 42. Keystone Dashboard
  43. 43. Keystone Dashboard
  44. 44. Trust, but verify!
  45. 45. Auditor • Broker monitoring • Alert on offline broker from ZooKeeper • Consumer monitoring • Alert on consumer lag/stuck and unconsumed partitions • Heart-beating • Produce/consume messages and measure latency • Broker performance testing • Produce tens of thousands of messages per second on a single instance • Create multiple consumer groups to test consumer impact on broker
  46. 46. Auditor - Broker Monitoring
  47. 47. Auditor - Consumer Monitoring (panels: Consumer Offset, Stuck Consumer, Unconsumed Partitions, Consumer Lag)
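
The consumer checks named on the Auditor slides (lag and stuck consumers) can be sketched as follows. This is an illustrative guess at the logic, not the Auditor's implementation: it assumes lag is the distance between a partition's log end offset and the consumer's committed offset, and that a consumer is "stuck" when it has lag but its committed offset did not move between two samples. All names and the threshold parameter are hypothetical.

```java
/** Hypothetical per-partition consumer health check based on two offset samples. */
final class ConsumerCheck {
    enum Status { OK, LAGGING, STUCK }

    /** Lag is how far the committed offset trails the end of the partition log. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    /** Classifies a partition from the previous and current committed offsets. */
    static Status classify(long logEndOffset, long previousCommitted,
                           long currentCommitted, long lagThreshold) {
        long currentLag = lag(logEndOffset, currentCommitted);
        if (currentLag > 0 && currentCommitted == previousCommitted) {
            return Status.STUCK; // messages are waiting but the offset is not advancing
        }
        if (currentLag > lagThreshold) {
            return Status.LAGGING; // advancing, but too far behind
        }
        return Status.OK;
    }
}
```

Checking for a stuck offset before checking the lag threshold means a consumer that has stopped entirely is reported as stuck even when its lag is still small.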
  48. 48. Meet Winston
  49. 49. Winston New Internal Automation Engine: • Collect diagnostic information based on alerts & other operational events • Help services self heal • Reduce MTTR • Reduce pager fatigue • Improve productivity for developers
  50. 50. Winston
  51. 51. How do you like your Kaffee?
  52. 52. Kaffee
  53. 53. Kaffee
  54. 54. Kaffee
  55. 55. What’s next?
  56. 56. Near Term • Performance tuning + optimizations • Self service • Schemas + registry • Event discovery + visualization • Open Source Auditor/Kaffee
  57. 57. And then???
  58. 58. Global real-time data stream + stream processing network
  59. 59. Office Hours Wed 4:00PM – 5:30PM @ Booth pbakas@netflix.com @peter_bakas
  60. 60. Remember to complete your evaluations!
  61. 61. Thank you!
