
The Netflix Way to deal with Big Data Problems

Netflix is a data driven company with a unique culture. Come take a holistic tour of the Big Data ecosystem and see how Netflix culture catalyzes the development of systems. Then ogle at how we quickly evolved and scaled the event pipeline to 1 trillion events per day and over 1.4 PB of event data, without service disruption and with a small team.

Published in: Software

  1. 1. The way to deal with Big Data problems Monal Daxini March 2016
  2. 2. Monal Daxini Real Time Data Infrastructure Senior Software Engineer, Netflix https://www.linkedin.com/in/monaldaxini @monaldax #Netflix #Keystone
  3. 3. We help Produce, Store, Process, Move Events @ scale
  4. 4. Tell me more... ● Big Data Ecosystem @ Netflix ● How we built a scalable event pipeline - Keystone - in a year ○ Replaced legacy system without service disruption with a small team. ● Netflix Culture ○ Relevant tenets tagged on the slides
  5. 5. Global Launch - Jan 6, 2016
  6. 6. Over 75M Members 190 Countries 125M hours/day → 11B hours / quarter 14,269 years / day → 1,255,707 years / quarter 1000+ devices 37% of Internet traffic at peak
  7. 7. Netflix Is a Data Driven Company Content Product Marketing Finance Business Development Talent Infrastructure ← Culture of Analytics →
  8. 8. Data @ Netflix Data at Rest (batch) Data in Motion (streaming)
  9. 9. Big Data Systems - batch Ingestion / Kafka -> Ursula, Aegisthus Storage / S3, Teradata, Redshift, Druid Processing / Pig, Hive, Presto, Spark Reporting / Microstrategy, Tableau, Sting Scheduling / UC4 Interface / Big Data Portal, Kragle (API) Open source & Community Driven
  10. 10. Big Data Systems - batch
  11. 11. Scale - batch AWS S3 (instead of HDFS) 40 PB (S3) Compressed Of which 13 PB events data
  12. 12. Big Data Systems - streaming Data Pipeline - Keystone Playback & edge operational insight - Mantis Stream Processing* - Spark Streaming Metrics & monitoring - Atlas Loosely Coupled, Highly Aligned Open source & Community Driven
  13. 13. What does culture have to do with big data?
  14. 14. Netflix Culture Deck Netflix Culture Freedom & Responsibility
  15. 15. "It may well be the most important document ever to come out of the Valley." 1 Sheryl Sandberg COO, Facebook 1 Business Insider, 2013
  16. 16. A NETFLIX ORIGINAL SERVICE How we built an internal facing 1 trillion / day stream processing cloud platform in a year, and how culture played a pivotal role Freedom & Responsibility
  17. 17. Years ago...
  18. 18. In the Old Days ... EMR Event Producers
  19. 19. Chukwa/Suro + Real-Time Branch
  20. 20. About a year ago ...
  21. 21. Chukwa / Suro + Real-Time Branch Event Producer Druid Stream Consumers EMR Consumer Kafka Suro Router Event Producer Suro Kafka Suro Proxy
  22. 22. Support at-least-once processing Scale, Ease of Operations Replace dormant open source software - Chukwa Enable future value adds - Stream Processing As a Service Seamless transition to the new platform Context Not Control
  23. 23. Migrate Events to a new Pipeline In flight, while not losing more than 0.1% of them Context Not Control Highly Aligned Loosely Coupled
  24. 24. Jan 2016
  25. 25. Keystone Stream Consumers Samza Router EMR Fronting Kafka Consumer Kafka Control Plane Event Producer KSProxy
  26. 26. 1 trillion events ingested per day during holiday season 1+ trillion events processed every day 350 billion a year ago 600+ billion events ingested per day Keystone - Scale - Streaming
  27. 27. 11 million events per second (24 GB per second) at peak Up to 10 MB payload / 4 KB avg 1.3 PB / day Keystone - Scale - Streaming
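As a sanity check on the slide's numbers (an illustrative calculation, not from the deck): 1.3 PB/day averages out to roughly 15 GB/s, consistent with the quoted 24 GB/s peak, and ~1 trillion events/day averages to ~11.6M events/s.

```python
# Back-of-the-envelope check of the Keystone scale figures (decimal units).
PB = 1000 ** 5
avg_gb_per_s = 1.3 * PB / 86_400 / 1000 ** 3   # ~15 GB/s average from 1.3 PB/day
avg_events_per_s = 1e12 / 86_400               # ~11.6M events/s from 1T events/day
```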
  28. 28. Events & Producers
  29. 29. Keystone Stream Consumers Samza Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane
  30. 30. Event Payload is Immutable At-least-once semantics* * Once the event makes it to Kafka, there are disaster scenarios where this breaks.
  31. 31. Injected Event Metadata ● GUID ● Timestamp ● Host ● App
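The metadata injection above can be sketched as a producer-side wrapper. Field names here are illustrative, not the actual Keystone envelope; the GUID is what makes downstream de-duplication possible under at-least-once delivery.

```python
import socket
import time
import uuid

def inject_metadata(payload: dict, app_name: str) -> dict:
    """Wrap an event payload with the injected metadata fields from the
    slide: GUID, timestamp, host, and app. The payload is left untouched."""
    return {
        "guid": str(uuid.uuid4()),      # unique id, usable for de-duplication
        "ts": int(time.time() * 1000),  # ingestion timestamp, milliseconds
        "host": socket.gethostname(),   # producing host
        "app": app_name,                # producing application
        "payload": payload,             # original event body
    }

event = inject_metadata({"title_id": 123, "action": "play"}, app_name="playback-ui")
```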
  32. 32. Keystone Extensible Wire Protocol ● Backwards and forwards compatibility ● Supports JSON; Avro on the horizon ● Invisible to sources & sinks ● Efficient - 10 bytes overhead per message ○ negligible, since message sizes range from hundreds of bytes to 10 MB
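A minimal sketch of how a fixed ~10-byte envelope stays cheap and invisible to sources and sinks. The header layout here is hypothetical (the real Keystone wire format is not shown in the deck); it only illustrates that a small fixed header carries version and format info regardless of payload size.

```python
import json
import struct

# Hypothetical 10-byte header: magic (1B), version (1B), format id (1B),
# reserved flags (3B), payload length (4B).
HEADER = struct.Struct(">BBB3sI")
MAGIC, VERSION, FMT_JSON = 0x4B, 1, 0

def frame(payload: dict) -> bytes:
    """Prepend the fixed envelope; sources never see it."""
    body = json.dumps(payload).encode()
    return HEADER.pack(MAGIC, VERSION, FMT_JSON, b"\x00\x00\x00", len(body)) + body

def unframe(buf: bytes) -> dict:
    """Strip the envelope; sinks get the original payload back."""
    magic, version, fmt, _flags, length = HEADER.unpack(buf[:HEADER.size])
    assert magic == MAGIC and fmt == FMT_JSON
    return json.loads(buf[HEADER.size:HEADER.size + length])
```

The version and format bytes are what make the protocol extensible: a newer reader can still parse older frames, and a new serialization (e.g. Avro) only needs a new format id.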
  33. 33. Netflix Kafka Producer ● Best effort delivery - ack = 1 ● Prefer dropping an event to disrupting the producer app ● Resume event production after Kafka cluster restore ● Integration with Netflix Ecosystem ● Configurable topic-to-Kafka-cluster routing
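The drop-rather-than-block policy can be sketched as a thin wrapper around any client send call that may raise when the buffer is full or the cluster is down. Names here are hypothetical, not the actual Netflix producer API; the point is that failures are counted and swallowed, never propagated to the producing app.

```python
class BestEffortProducer:
    """Sketch of 'prefer dropping an event over disrupting the producer app'.

    `send` is any function(topic, event) that may raise on buffer-full or
    broker-unavailable (e.g. a Kafka client's non-blocking send)."""

    def __init__(self, send):
        self._send = send
        self.dropped = 0

    def publish(self, topic, event):
        try:
            self._send(topic, event)   # acks=1 best-effort delivery
            return True
        except Exception:
            self.dropped += 1          # count and drop; never block the caller
            return False
```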
  34. 34. Fronting Kafka Clusters
  35. 35. Keystone Stream Consumers Samza Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane
  36. 36. ● Pioneer Tax ● Started with 0.7 ● In prod with 0.8.2 ● Move to 0.9 & VPC in progress Kafka in the Cloud
  37. 37. Based on topics assigned ● Normal-priority (majority) ● High-priority (streaming activities etc.) Fronting Kafka Topic Classification
  38. 38. ● ≅3200 d2.xl brokers for regular, failover, & consumer kafka clusters ● 125 Zookeeper nodes (25 ensembles) ○ Independent zookeeper cluster per Kafka cluster ● 24 island clusters, 8 per region ○ 3 ASGs per cluster, 1 ASG per zone ○ 24 warm standby 3 node failover clusters Scale - Kafka (prod)
  39. 39. ● No dynamic topic creation ● Two copies ● Zone aware assignment of topic partitions and replicas Fronting Kafka Topics
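Zone-aware placement of the two copies can be sketched as follows. This is a simplification of what the slide describes, assuming hypothetical names: real assignment also balances leader counts and uses broker rack metadata.

```python
def zone_aware_assignment(partitions, brokers_by_zone, replicas=2):
    """Place each partition's replicas in distinct zones (round-robin),
    so losing one AWS availability zone never loses both copies."""
    zones = sorted(brokers_by_zone)
    assignment = {}
    for p in range(partitions):
        replica_set = []
        for r in range(replicas):
            zone = zones[(p + r) % len(zones)]       # rotate the starting zone
            brokers = brokers_by_zone[zone]
            replica_set.append(brokers[p % len(brokers)])
        assignment[p] = replica_set
    return assignment
```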
  40. 40. In a distributed system make sure you understand limitations and failures, even if you don’t know all the features. - Monal Daxini
  41. 41. Because an unknown feature does not cause failure, but an unknown failure mode or limitation can bring down your system. - Monal Daxini
  42. 42. In addition, we do Kafka Kong once a week
  43. 43. Fronting Kafka Failover Self Service Tool Blameless Culture
  44. 44. Fronting Kafka Failover
  45. 45. Fronting Kafka Failover
  46. 46. Kafka Management UI (Beta) Open sourcing on the roadmap Open source & Community Driven
  47. 47. Kafka Auditor Open sourcing on the roadmap Open source & Community Driven
  48. 48. Kafka Auditor - One per cluster ● Broker monitoring ● Consumer monitoring ● Heart-beat & continuous message latency ● On-demand broker performance testing ● Built as a service deployable on single or multiple instances
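The heart-beat latency part of the auditor can be sketched like this. The API is hypothetical: the real auditor publishes heartbeats through Kafka and consumes them back, which is elided here by injecting the clock and calling the consume hook directly.

```python
import time
from collections import deque

class HeartbeatAuditor:
    """Publish timestamped heartbeats per cluster; on consuming one back,
    record end-to-end latency for continuous latency monitoring."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.latencies_ms = deque(maxlen=1000)   # rolling window of samples

    def make_heartbeat(self, cluster):
        return {"cluster": cluster, "sent_at": self._clock()}

    def on_consumed(self, heartbeat):
        self.latencies_ms.append((self._clock() - heartbeat["sent_at"]) * 1000)

    def p99_ms(self):
        xs = sorted(self.latencies_ms)
        return xs[int(0.99 * (len(xs) - 1))] if xs else None
```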
  49. 49. Kafka Cluster Size - Tips ● Per cluster, stay under 10k partitions & 200 brokers ● Leave approx. 40% free disk space on each broker
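These rules of thumb can be encoded as a simple health check, with the thresholds taken straight from the slide:

```python
def cluster_within_limits(partitions, brokers, disk_used_frac):
    """Sizing rules of thumb: under 10k partitions and 200 brokers per
    cluster, and at least ~40% free disk on each broker."""
    return partitions < 10_000 and brokers < 200 and disk_used_frac <= 0.60
```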
  50. 50. ● Started with AWS zone aware partition assignments ● We have discovered and filed several bugs ○ Details - upcoming on the Netflix Tech Blog Kafka Contributions Open source & Community Driven
  51. 51. Routing Service
  52. 52. Keystone Stream Consumers Samza Router EMR Fronting Kafka Event Producer Consumer Kafka Control Plane
  53. 53. Routing Infrastructure + Checkpointing Cluster (Samza 0.9.1, Go)
  54. 54. Router Job Manager (Control Plane) EC2 Instances Zookeeper (Instance Id assignment) Job Job Job ksnode Checkpointing Cluster ASG
  55. 55. Custom Go Executor ./runJob Logs Snapshots Attach Volumes ./runJob ./runJob Reconcile Loop - 1 min Health Check What’s running in ksnode? Zookeeper (Instance Id assignment)
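The core of the 1-minute reconcile loop above, sketched in Python rather than the Go of the actual executor: compare the jobs that should be running against the containers actually running, then start the missing ones and stop the orphans.

```python
def reconcile(desired, running):
    """One pass of a reconcile loop: return (to_start, to_stop) given the
    set of desired jobs and the set of currently running jobs."""
    desired, running = set(desired), set(running)
    to_start = sorted(desired - running)   # desired but not running
    to_stop = sorted(running - desired)    # running but no longer desired
    return to_start, to_stop
```

Running this comparison on a timer and acting on the diff is what makes the system self-healing: a crashed job shows up in `to_start` on the next pass without any operator involvement.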
  56. 56. Logs ZFS Volume Snapshots Custom Go Executor . /runJo b . /runJo b . /runJo b Go Tools Server Client Tools Stream Logs Browse through rotated logs by date Ksnode Tooling
  57. 57. Yes! You inferred right! No Mesos & no YARN
  58. 58. Distributed Systems are Hard Keep it Simple Minimize Moving Parts
  59. 59. ● 13,000 docker containers (samza jobs) ○ 7,000 - S3 Sink ○ 4,500 - Consumer Kafka sink ○ 1,500 - Elasticsearch sink ● 1,300 AWS C3-4XL instances Scale - Routing Service
  60. 60. More Info - Samza Meetup (10/2015) Samza ver 0.9.1 Contributions Open source & Community Driven
  61. 61. Target & achieved: <= 0.1% diff between the Chukwa & Keystone pipelines, over 2.6 PB of data / day Chukwa & Keystone Pipeline Shadowing
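The shadowing comparison can be sketched as a per-topic count diff checked against the 0.1% target. Field names are illustrative; the real comparison ran while both pipelines ingested the same traffic in parallel.

```python
def shadow_diff_pct(legacy_counts, new_counts):
    """Per-topic absolute difference between legacy (Chukwa) and new
    (Keystone) event counts, as a percentage of the legacy count."""
    diffs = {}
    for topic, legacy in legacy_counts.items():
        new = new_counts.get(topic, 0)
        diffs[topic] = abs(legacy - new) / legacy * 100 if legacy else 0.0
    return diffs

def within_target(diffs, target_pct=0.1):
    """True when every topic is within the shadowing loss target."""
    return all(d <= target_pct for d in diffs.values())
```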
  62. 62. Metrics & Monitoring
  63. 63. Keystone Stream Consumers Samza Router EMR Fronting Kafka Consumer Kafka Control Plane Event Producer KSProxy
  64. 64. Customer Facing per topic end-to-end dashboard
  65. 65. Dev facing infrastructure end-to-end dashboard
  66. 66. Scaling Avenues
  67. 67. ● Exposed cost attribution per event producer & topic ○ E.g., one producer reduced its throughput six-fold ● Automation - frees up additional resources Scaling Up by Scaling Down Oxymoron?
  68. 68. ● No dedicated product or project managers ● No separate devops or operational team ● Our team works this way, and so do most other infrastructure teams ● Netflix does have product and project managers; each team decides independently We build and run what you saw today! You build it! You run it! High Performance
  69. 69. ● This does not mean we are constantly overworked ○ we make wise and simple choices and ○ lean towards automation & self-healing systems We build and run what you saw today! You build it! You run it! High Performance
  70. 70. Not DevOps, but move towards NoOps You build it! You run it!
  71. 71. ● High Performance culture ● Communication ● Honest & respectful feedback ● No culture of process adherence, but ○ Creativity & Self Discipline ○ Freedom and Responsibility
  72. 72. Open source and community participation is an integral part of our strategy and culture
  73. 73. ● Collaborate with other internet scale tech companies ● Prevent lock-in of closed source products ● Need the flexibility to achieve scalability / functionality Why use open source?
  74. 74. ● Help shape direction of projects ● Don’t want to fork and diverge ● Give back to the community ● Attract top talent Why contribute back ?
  75. 75. ● Share our goodness ● Set industry standard ● Community can help evolve the tool Why contribute our own tool ?
  76. 76. Looking into the future?
  77. 77. Stream Processing As a Service ● multi-tenant polyglot support of streaming engines like Spark Streaming, Mantis, Samza, and maybe Flink Future steps Open source & Community Driven
  78. 78. Messaging As a Service ● Kafka & others ● Spark Streaming, Mantis, Samza, and maybe Flink Future steps Open source & Community Driven
  79. 79. Data thruway ● Support for schemas - registry, discovery, validation ● Self Service Tooling Future steps Open source & Community Driven
  80. 80. More brain food... Netflix OSS Samza Meetup Presentation Netflix Tech Blog Spark Summit 2015 Talk
