
Kafka At Scale in the Cloud



Netflix recently changed its data pipeline architecture to use Kafka as the gateway for data collection across all applications, processing hundreds of billions of messages daily. This session will discuss the motivation for moving to Kafka, the architecture, and the improvements we have added to make Kafka work in AWS. We will also share lessons learned and future plans.



  1. #kafkasummit @allenxwang Kafka At Scale In The Cloud. Allen Wang @ Netflix
  2. The State Of Kafka in Netflix ● 700 billion unique events ingested / day ● 1 trillion unique events / day at peak of last holiday season ● 1+ trillion events processed every day ● 11 million events ingested / sec @ peak ● 24 GB / sec @ peak ● 1.3 Petabyte / day
  3. The State Of Kafka in Netflix ● Managing 4,000+ brokers ● Currently on 0.8.2.1, transitioning to 0.9
  4. Use Cases ● Keystone - Unified event publishing, collection, routing for batch and stream processing ○ 85% of the Kafka data volume ● Ad-hoc messaging ○ 15% of the Kafka data volume ● Characteristics ○ Non-transactional ○ Message delivery failure does not affect user experience
  5. Keystone Data Pipeline (architecture diagram): Event Producer → (HTTP Proxy) → Fronting Kafka → Samza Router → Consumer Kafka / EMR → Stream Consumers, managed by a Kafka Control Plane
  6. Design Principles ● Best effort delivery ● First priority is availability of client applications ● Allow minor message drop from producers ○ 99.99% delivery SLA ● Non-keyed messages ● Transparent and dynamic traffic routing for producers
  7. Key Configurations ● acks = 1 ○ Reduce the chance that the producer buffer gets full ● block.on.buffer.full = false ○ Do not block the client application when sending events ● unclean.leader.election.enable = true ○ Maximize availability for producers ○ Consumers may lose data or get duplicates or both
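The producer settings named on slide 7 translate roughly into the following Java producer sketch for the 0.8.2/0.9-era client the deck mentions. The bootstrap servers, topic name, and serializers are placeholders, and unclean.leader.election.enable is noted only as a comment because it is a broker/topic setting rather than a producer property.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeystoneStyleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "fronting-kafka:9092");   // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // acks=1: only the partition leader acknowledges, so sends complete quickly
        // and the producer buffer is less likely to fill up.
        props.put("acks", "1");

        // Do not block the client application when the buffer is full
        // (0.8.2/0.9 producer setting; later clients replace it with max.block.ms).
        props.put("block.on.buffer.full", "false");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            try {
                producer.send(new ProducerRecord<>("keystone-events", "hello".getBytes()),
                        (metadata, exception) -> {
                            if (exception != null) {
                                // Best-effort delivery: count the drop instead of retrying forever.
                            }
                        });
            } catch (RuntimeException bufferFull) {
                // With block.on.buffer.full=false a full buffer surfaces here
                // instead of stalling the caller; the event counts as dropped.
            }
        }
        // unclean.leader.election.enable=true is a broker/topic-level setting
        // (server.properties), not a producer property: it favors producer
        // availability at the cost of consumers seeing loss or duplicates.
    }
}
```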
  8. Challenges ● Being stateful in an environment that favors stateless services ○ Unpredictable instance life cycle ○ Transient network issues
  9. Challenges ● Serving traffic for massive stateless services that autoscale (traffic charts showing spikes during regional failover)
  10. Challenges ● Topics have unpredictable data volume (chart: unintentional traffic increase after an iOS app release)
  11. The Effect Of Outliers
  12. Outliers ● Origins of outliers ○ Bad hardware ○ Noisy neighbours ○ Uneven workload ● Symptoms of outliers ○ Significantly higher response time ○ Frequent TCP timeouts/retransmissions
  13. Direct Effect of Outliers ● Slow broker response leads to producer buffer exhaustion and message drop
  14. Cascading Effect of Outliers (diagram): a broker with a networking problem causes slow replication, catching up forces disk reads that slow broker responses, and the producer buffer is exhausted, dropping messages
  15. A True Story
  16. Keystone went live 10/30/2015. The very next day ...
  17. Multiple ZooKeeper servers became unhealthy → ZooKeeper quorum lost → producers dropped messages → ZooKeeper quorum recovered → producers recovered?
  18. Message drop resumed for the two largest Kafka clusters with 300+ brokers → controllers bounced and changed → began rolling restart
  19. Lessons Learned ● There are times when things go wrong ... and there is no turning back ● Reduce complexity ● Find a way to start over fresh
  20. Deployment Strategy ● Prefer multiple small clusters ○ Largest cluster has fewer than 200 brokers ● Limit the total number of partitions for a cluster to 10,000 ● Strive for even distribution of replicas ● Have a dedicated ZooKeeper cluster for each Kafka cluster
  21. Deployment Configuration ● Fronting Kafka clusters: 24 clusters, 3,000+ instances, d2.xl instance type, replication factor 2, 8 to 24 hour retention ● Consumer Kafka clusters: 12 clusters, 900+ instances, i2.2xl instance type, replication factor 2, 2 to 4 hour retention
  22. Tools of the Trade
  23. Broker ID Management ● Using a ZooKeeper persistent node ● Incrementing the broker ID using the Curator locking recipe ● Checking the AWS Auto Scaling Group for broker ID reuse
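A minimal sketch of that scheme, assuming Apache Curator and made-up ZooKeeper paths; Netflix's actual tool is not public, and the Auto Scaling Group reuse check is only indicated by a comment.

```java
import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BrokerIdAssigner {
    private static final String COUNTER_PATH = "/kafka-tools/next-broker-id"; // assumed path
    private static final String LOCK_PATH = "/kafka-tools/broker-id-lock";    // assumed path

    /** Hand out the next broker ID from a persistent ZooKeeper node, guarded by a Curator lock. */
    public static int nextBrokerId(String zkConnect) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory
                .newClient(zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        InterProcessMutex lock = new InterProcessMutex(client, LOCK_PATH);
        lock.acquire();
        try {
            if (client.checkExists().forPath(COUNTER_PATH) == null) {
                client.create().creatingParentsIfNeeded()
                        .forPath(COUNTER_PATH, "0".getBytes(StandardCharsets.UTF_8));
            }
            int next = Integer.parseInt(new String(
                    client.getData().forPath(COUNTER_PATH), StandardCharsets.UTF_8));
            // A real tool would first check the AWS Auto Scaling Group here and
            // reuse the ID of a terminated instance instead of always incrementing.
            client.setData().forPath(COUNTER_PATH,
                    String.valueOf(next + 1).getBytes(StandardCharsets.UTF_8));
            return next;
        } finally {
            lock.release();
            client.close();
        }
    }
}
```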
  24. Rack Aware Replica Assignment ● All of our clusters span three AWS availability zones (used as racks) ● Distribute replicas of the same partition to different AWS availability zones ● We contributed back ○ KIP-36: Rack aware replica assignment ○ Apache Kafka GitHub Pull Request #132 ○ Part of the 0.10 release
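As a rough illustration of the idea (not the algorithm that shipped with KIP-36 in 0.10), the sketch below spreads each partition's replicas across distinct availability zones; the zone names and broker IDs are invented, and it assumes the replication factor does not exceed the number of zones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RackAwareSketch {
    /** Assign replicas so that replicas of the same partition land in different zones. */
    public static Map<Integer, List<Integer>> assign(
            Map<String, List<Integer>> brokersByZone, int partitions, int replicationFactor) {
        List<String> zones = new ArrayList<>(brokersByZone.keySet());
        Map<Integer, List<Integer>> assignment = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Consecutive replicas of a partition walk the zones round-robin.
                List<Integer> zoneBrokers = brokersByZone.get(zones.get((p + r) % zones.size()));
                replicas.add(zoneBrokers.get((p / zones.size()) % zoneBrokers.size()));
            }
            assignment.put(p, replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> brokersByZone = new LinkedHashMap<>();
        brokersByZone.put("us-east-1a", Arrays.asList(0, 3));   // invented broker IDs
        brokersByZone.put("us-east-1c", Arrays.asList(1, 4));
        brokersByZone.put("us-east-1d", Arrays.asList(2, 5));
        // 6 partitions, replication factor 2: both replicas of every partition sit in different zones.
        System.out.println(assign(brokersByZone, 6, 2));
    }
}
```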
  25. Scaling Strategy ● We overprovision for daily and failover traffic ● Scale up for organic traffic growth ● Methodologies ○ Adding partitions ○ Partition reassignment
  26. Adding Partitions To New Brokers ● Fast way to expand capacity ● Prerequisites ○ No keyed messages ● Caveat ○ TopicCommand may add partitions to existing brokers ○ Created our own tool to guarantee adding partitions only to new brokers
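A sketch of the core of such a tool, under the assumption that it only needs to compute replica sets for the new partitions drawn exclusively from the new broker IDs (valid here because messages are non-keyed). The resulting map corresponds to the kind of explicit replica assignment that Kafka's partition-expansion tooling accepts; broker IDs and counts are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExpandOntoNewBrokers {
    /** Assign each newly added partition a replica set drawn only from newBrokerIds. */
    public static Map<Integer, List<Integer>> newPartitionAssignment(
            int existingPartitions, int partitionsToAdd,
            List<Integer> newBrokerIds, int replicationFactor) {
        Map<Integer, List<Integer>> assignment = new LinkedHashMap<>();
        for (int i = 0; i < partitionsToAdd; i++) {
            int partition = existingPartitions + i;          // new partition number
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                replicas.add(newBrokerIds.get((i + r) % newBrokerIds.size()));
            }
            assignment.put(partition, replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // e.g. a topic currently has 30 partitions; add 6 more on new brokers 100-103 only.
        System.out.println(newPartitionAssignment(30, 6, Arrays.asList(100, 101, 102, 103), 2));
    }
}
```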
  27. Partition Reassignment ● The good news ○ Generally applicable to all situations ● The bad news ○ Time consuming ○ Huge replication traffic that affects producers and consumers ● What we do ○ Create a tool to divide reassignments into small batches to limit replication traffic
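A minimal sketch of the batching idea: slice the full move list into small batches and submit them one at a time (for example via kafka-reassign-partitions.sh), waiting for each batch to finish before starting the next so replication traffic stays bounded. The batch size and the element type are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedReassignment {
    /** Split the full list of partition moves into batches of at most batchSize. */
    public static <T> List<List<T>> batches(List<T> moves, int batchSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < moves.size(); i += batchSize) {
            result.add(new ArrayList<>(moves.subList(i, Math.min(i + batchSize, moves.size()))));
        }
        return result;
    }
    // A driver would render each batch as a reassignment JSON, submit it, and poll
    // until that batch completes before moving on to the next one.
}
```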
  28. When Things Go Wrong
  29. Save The Penguins
  30. Solution - Failover ● Taking advantage of cloud elasticity ● Cold standby Kafka cluster with minimal initial capacity and ready to scale up ● Different ZooKeeper cluster with no state ● Replication factor = 1
  31. Failover (diagram): Event Producer → Fronting Kafka → Samza Router, with the failed path marked X and traffic redirected to the standby cluster
  32. Failover ● Time is of the essence - failover as fast as 5 minutes ● Fully automated
  33. After Failover ● Fix it! ● Or rebuild it ● Offline maintenance ● Fail back when ready
  34. Monitor, Monitor, Monitor
  35. Outlier Detection ● Metrics based ○ Broker's 99th percentile response time ○ Broker TCP timeouts, errors, retransmissions ○ Producer's send latency ● Action ○ Broker termination
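A toy version of such a check, assuming p99 response times per broker have already been collected from the metrics system: flag any broker whose p99 is several times the fleet median. The ratio, the metric source, and the follow-up termination step are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutlierDetector {
    /** Return brokers whose p99 response time exceeds ratio times the fleet median. */
    public static List<String> findOutliers(Map<String, Double> p99ByBroker, double ratio) {
        List<Double> values = new ArrayList<>(p99ByBroker.values());
        Collections.sort(values);
        double median = values.get(values.size() / 2);
        List<String> outliers = new ArrayList<>();
        for (Map.Entry<String, Double> e : p99ByBroker.entrySet()) {
            if (e.getValue() > ratio * median) {
                outliers.add(e.getKey());   // candidate for termination
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        Map<String, Double> p99 = new LinkedHashMap<>();
        p99.put("broker-1", 12.0);
        p99.put("broker-2", 14.0);
        p99.put("broker-3", 250.0);                 // the slow one
        System.out.println(findOutliers(p99, 5.0)); // prints [broker-3]
    }
}
```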
  36. Same broker shown as outlier for multiple metrics
  37. Visualizing Outliers
  38. Kafka Monitoring Service ● Broker monitoring (Are you there?) ● Heart-beating & continuous message latency monitoring (Are you healthy?) ● Consumer partition count and offset monitoring (Are you delivering?)
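A bare-bones sketch of the heart-beat idea using the Java clients: produce a timestamped message to a monitoring topic and measure how long it takes to come back on a consumer. The topic name, group ID, and bootstrap servers are placeholders; the real service also covers broker liveness and consumer offset monitoring, which are not shown here.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HeartbeatMonitor {
    public static void main(String[] args) {
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "fronting-kafka:9092");    // placeholder
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties cons = new Properties();
        cons.put("bootstrap.servers", "fronting-kafka:9092");    // placeholder
        cons.put("group.id", "kafka-heartbeat-monitor");         // assumed group
        cons.put("auto.offset.reset", "earliest");
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(Collections.singletonList("heartbeat"));   // assumed topic
            // The heartbeat payload is simply the send timestamp.
            producer.send(new ProducerRecord<>("heartbeat",
                    Long.toString(System.currentTimeMillis())));
            producer.flush();
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, String> r : records) {
                long latencyMs = System.currentTimeMillis() - Long.parseLong(r.value());
                System.out.println("end-to-end latency ms: " + latencyMs); // would be reported as a metric
            }
        }
    }
}
```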
  39. Visualizing Kafka Metadata ● The default ZooKeeper tree view is undesirable
  40. Keystone Dashboard - Metadata View
  41. Keystone Dashboard - Metadata View
  42. Blogs for the Keystone Data Pipeline: http://techblog.netflix.com/search?q=keystone
  43. #kafkasummit @allenxwang
