@allenxwang
From Three Nines to Five Nines
A Kafka Journey
Allen Wang
At 10,000 Feet
Minimize your data loss under these conditions
● Huge volume of data
● Limited configuration options
● A less-than-ideal and constantly changing environment
● Balanced against cost
The State Of Kafka at Netflix
● Daily average
○ 1 trillion events
○ 3 petabytes of data processed
● At peak
○ 1.26 trillion events / day
○ 20 million events / sec
○ 55 GB / sec
The State Of Kafka at Netflix
● Managing 3,000+ brokers and ~50 clusters
● Currently on 0.9
● In AWS VPC
Powered By Kafka
A NETFLIX ORIGINAL SERVICE
Keystone Data Pipeline
[Architecture diagram: Event Producer → Fronting Kafka → Router → Consumer Kafka / EMR / Stream Consumers, with Kafka Management and an HTTP Proxy alongside]
Deployment Configuration
                            Fronting Kafka Clusters    Consumer Kafka Clusters
Number of clusters          24                         15
Total number of instances   1700+                      1100+
Instance type               d2.2xl                     i2.2xl
Replication factor          2                          2
Retention period            8 to 24 hours              2 to 4 hours
A Peek into the Data
● Business related
○ Session information
○ Device logs
○ Feedback to recommendation and streaming algorithms
● System and infrastructure related
○ Application logs and distributed tracing
The Data Loss Philosophy
● Not all data are created equal
● The spectrum of data loss
● Lossless data delivery is not a necessity and should
always be balanced against cost
[Diagram: the spectrum of data loss, from 0.1% to 0.5%, 1%, and 5% loss]
Data Loss Measurement
● Use producer send callback API
● Related counters
○ Send attempt
○ Send success
○ Send fail → Lost record
● Data loss rate = lost record / send attempt
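A minimal sketch of this measurement using the Java producer's send callback; the topic name, bootstrap address, and counter wiring are illustrative assumptions, not the Keystone implementation:

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LossRateCounter {
    private static final AtomicLong attempts = new AtomicLong();
    private static final AtomicLong successes = new AtomicLong();
    private static final AtomicLong lost = new AtomicLong();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "fronting-kafka:9092"); // illustrative address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            attempts.incrementAndGet();                       // send attempt
            producer.send(new ProducerRecord<>("events", "payload"), (metadata, exception) -> {
                if (exception == null) {
                    successes.incrementAndGet();              // send success
                } else {
                    lost.incrementAndGet();                   // send fail -> lost record
                }
            });
        }
        // Data loss rate = lost records / send attempts
        System.out.printf("loss rate = %.7f%n", lost.get() / (double) attempts.get());
    }
}
```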
Design Principles
● Priority is application availability and user
experience
○ Non-blocking event producing
● Minimize data loss into fronting Kafka at reasonable
cost
Key Configurations
● acks = 1 for producing
○ Reduce the chance that the producer buffer gets full
● max.block.ms = 0
● 2 replicas → 20% cost saving compared to 3
replicas
● Allow unclean leader election
○ Maximize availability for producers
○ Potential duplicates/loss for consumers
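As a hedged sketch, the producer side of these settings maps to configuration like the following; the bootstrap address is illustrative, and replication factor 2 plus unclean leader election are broker/topic settings, noted only in comments:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class KeyProducerConfigs {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "fronting-kafka:9092"); // illustrative address
        // acks=1: leader-only acknowledgement keeps sends fast, so the producer buffer drains quickly
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        // max.block.ms=0: never block the application thread when the buffer is full or metadata is
        // missing; send() fails immediately instead, trading possible loss for application availability
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "0");
        // Broker/topic side (not producer configs): topics use replication factor 2, and
        // unclean.leader.election.enable=true maximizes availability for producers at the
        // cost of potential duplicates/loss for consumers.
        return props;
    }
}
```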
The Cloud Reality
● Unpredictable instance lifecycle
● Unstable networking
○ Noisy neighbours
○ Cold start
● Little control over clients
ZooKeeper And Controller
● Inconsistent controller state upon session timeout
● Broker’s inability to recover from temporary
ZooKeeper outage
● Can cause big incidents with hard-to-identify root causes
Our Producer Data Delivery SLA
● Started from 99.9%
○ Loss was a little higher than the original Chukwa pipeline
○ “At three nines, we lose more data than you generate”
● Some big incidents …
Oh Boy ...
Nowadays ...
● Two weeks' data from the peak of the last holiday season
○ 8.4M lost events for all 7.6T attempts → 99.99989%
A Typical Day
Why Messages Are Dropped
● Producer buffer full
● Root causes
○ Slow response from broker
○ Metadata stale / unavailable
○ Client side problems (hardware, traffic)
What Has Been Done
● Improve broker availability
○ Optimize broker deployment strategy
○ Get rid of the “bad guys” - elimination of broker outliers
○ Move to AWS VPC - Better networking
● Automated producer configuration optimization
● When in trouble - failover!
Change in Deployment Strategy
● Kafka clusters
○ Big clusters with 500 brokers → Small to medium clusters
with 20 to 100 brokers
● ZooKeeper
○ Shared ZooKeeper cluster for all Kafka clusters →
Dedicated ZooKeeper cluster for each fronting Kafka cluster
● Data balancing
○ Uneven distribution of partitions → even distribution of
partitions among brokers
Rack Aware Partition Assignment
● Our contribution to Kafka 0.10
● Replicas of each partition are guaranteed to be
placed on different “racks”
○ A rack is logical and represents your failure protection domain
● Improved availability
○ OK to lose multiple brokers in the same rack
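Each broker declares its rack via broker.rack in server.properties (a logical failure domain, e.g. an AWS availability zone). Below is a rough verification sketch using the Java AdminClient; the AdminClient postdates the 0.9/0.10 clusters described here, and the topic name and address are assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

public class RackCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "fronting-kafka:9092"); // illustrative address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("events"))
                    .all().get().get("events");
            for (TopicPartitionInfo p : desc.partitions()) {
                // broker.rack reported by each replica's broker; rack-aware assignment
                // guarantees the replicas of a partition span different racks
                Set<String> racks = p.replicas().stream()
                        .map(Node::rack)
                        .collect(Collectors.toSet());
                System.out.printf("partition %d -> racks %s%n", p.partition(), racks);
            }
        }
    }
}
```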
Partition Assignment Without Considering Rack
[Diagram: Brokers 0 and 1 in Rack 0, Brokers 2 and 3 in Rack 1; for a topic with 2 replicas per partition, both replicas of partition 0 land in Rack 0, so losing Rack 0 leaves partition 0 offline]
Rack Aware Partition Assignment
[Diagram: the same brokers and topic with rack-aware assignment; each partition has one replica in each rack, so no partition goes offline when a rack is lost]
Overcome the “Co-location” Problem
● Multiple brokers “killed” at the same time by AWS.
Why?
● Definition
○ Multiple brokers in the same cluster are located on the
same physical host in the cloud
● Impact reduced by Rack Aware Partition
Assignment
● Manually apply the trick of detaching the instance from the ASG
Outliers
● Origins of outliers
○ Bad hardware
○ Noisy neighbours
○ Uneven workload
● Symptoms of outliers
○ Significantly higher response time
○ Frequent TCP timeouts/retransmissions
Cascading Effect of Outliers
[Diagram: a broker with a networking problem → slow replication → disk reads cause slow responses on other brokers → the event producer's buffer is exhausted and messages are dropped]
The Art Of Outlier Detection
[Screenshot: the same broker shown as an outlier for multiple metrics]
Visualizing Outliers
To Kill or Not To Kill, That Is the Question
● The dilemma of terminating brokers
● Automated termination with time-based suppression
○ Use 99th percentile of produce and fetch response time
○ Static threshold
○ Limit one per 24 hours per cluster
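A hypothetical sketch of that suppression logic; the threshold value, metric plumbing, and class names are assumptions for illustration, not Netflix's tooling:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Decides whether an outlier broker may be terminated: at most one termination per cluster per 24 hours. */
public class OutlierTerminationPolicy {
    private static final double P99_RESPONSE_MS_THRESHOLD = 500.0;   // static threshold; value assumed
    private static final Duration SUPPRESSION_WINDOW = Duration.ofHours(24);

    private final Map<String, Instant> lastTerminationPerCluster = new ConcurrentHashMap<>();

    public boolean shouldTerminate(String cluster, double p99ProduceMs, double p99FetchMs) {
        // A broker is an outlier if its 99th percentile produce or fetch response time breaches the threshold
        boolean outlier = p99ProduceMs > P99_RESPONSE_MS_THRESHOLD || p99FetchMs > P99_RESPONSE_MS_THRESHOLD;
        if (!outlier) {
            return false;
        }
        Instant last = lastTerminationPerCluster.get(cluster);
        if (last != null && Instant.now().isBefore(last.plus(SUPPRESSION_WINDOW))) {
            return false;   // time-based suppression: a broker in this cluster was terminated recently
        }
        lastTerminationPerCluster.put(cluster, Instant.now());
        return true;
    }
}
```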
Move To AWS VPC
● Huge improvement of networking vs. EC2 Classic
○ Fewer transient networking errors
○ Lower latency
○ Tolerates a higher packets-per-second rate
Producer Tuning
● Buffer size tuning
○ Handle transient traffic spike
○ The goal: buffer size large enough to hold 10 seconds of
send data
● “Eager” vs. “lazy” initialization of producers
● Re-instantiate the producer
● Termination of bad clients
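For example, sizing buffer.memory to hold roughly 10 seconds of send data; the throughput figure below is purely illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerBufferSizing {
    /** Returns producer properties with a buffer sized to absorb ~10 seconds of outgoing data. */
    public static Properties tunedProps(long observedBytesPerSecond) {
        long bufferBytes = observedBytesPerSecond * 10;    // goal from the slide: ~10 seconds of send data
        Properties props = new Properties();
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, Long.toString(bufferBytes));
        return props;
    }

    public static void main(String[] args) {
        // e.g. an application sending ~5 MB/s would get a ~50 MB buffer (the default is 32 MB)
        System.out.println(tunedProps(5_000_000L));
    }
}
```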
When Things Go Wrong
When Things Go Wrong - Failover
● Taking advantage of cloud elasticity
● Cold standby Kafka cluster with 0 instances, ready to
scale up
● Different ZooKeeper cluster with no state
● Replication factor = 1
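A hedged sketch of the failover step: recreate the failed cluster's topics on the freshly scaled-up standby and point producers at it. The topic name, partition count, address, and use of the modern AdminClient are assumptions for illustration, not the Keystone tooling:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class FailoverSketch {
    public static void main(String[] args) throws Exception {
        Properties standby = new Properties();
        standby.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "failover-kafka:9092"); // illustrative address

        try (AdminClient admin = AdminClient.create(standby)) {
            // Copy topic metadata: recreate the topic on the standby cluster with
            // replication factor 1, favoring failover speed over durability (per the slide)
            NewTopic topic = new NewTopic("events", 36, (short) 1);  // partition count assumed
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
        // Producers are then re-instantiated with bootstrap.servers pointing at the standby cluster.
    }
}
```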
Failover
[Diagram: the event producer's path to the failed fronting Kafka cluster is cut; topic metadata is copied to the failover cluster and producers switch to it, while the router continues feeding consumer Kafka and consumers]
Failover
● Time is of the essence - failover as fast as 5 minutes
● Fully automated
@allenxwang
Keystone Tech Blogs
http://techblog.netflix.com/search/label/keystone
@allenxwang
