@allenxwang
From Three Nines to Five Nines
A Kafka Journey
Allen Wang
At 10,000 Feet
Minimize your data loss under these conditions
● Huge volume of data
● Limited configuration options
● Less-than-ideal and constantly changing environment
● Balanced against cost
The State Of Kafka in Netflix
● Daily average
○ 1 trillion events
○ 3 petabytes of data processed
● At peak
○ 1.26 trillion events / day
○ 20 million events / sec
○ 55 GB / sec
The State Of Kafka in Netflix
● Managing 3,000+ brokers and ~50 clusters
● Currently on 0.9
● In AWS VPC
Powered By Kafka
A NETFLIX ORIGINAL SERVICE
Keystone Data Pipeline
[Diagram components: Event Producer, HTTP Proxy, Fronting Kafka, Router, Consumer Kafka, Stream Consumers, EMR, Management]
Deployment Configuration
Setting | Fronting Kafka Clusters | Consumer Kafka Clusters
Number of clusters | 24 | 15
Total number of instances | 1700+ | 1100+
Instance type | d2.2xl | i2.2xl
Replication factor | 2 | 2
Retention period | 8 to 24 hours | 2 to 4 hours
A Peek into the Data
● Business related
○ Session information
○ Device logs
○ Feedback to recommendation and streaming algorithms
● System and infrastructure related
○ Application logs and distributed tracing
The Data Loss Philosophy
● Not all data are created equal
● The spectrum of data loss
● Lossless data delivery is not a necessity and should always be balanced against cost
[Figure: spectrum of percent loss, marked at 0.1%, 0.5%, 1%, 5%]
Data Loss Measurement
● Use producer send callback API
● Related counters
○ Send attempt
○ Send success
○ Send fail → Lost record
● Data loss rate = lost record / send attempt
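
The counters above can be sketched as follows. This is a minimal illustration, not the Keystone producer code: `LossCounters` and its method names are hypothetical, and `on_send_complete` only mimics the shape of a producer send callback that receives an error (or nothing) for each record.

```python
from dataclasses import dataclass


@dataclass
class LossCounters:
    """Hypothetical counters mirroring the slide: attempt, success, fail."""
    attempts: int = 0
    successes: int = 0
    failures: int = 0

    def on_send_attempt(self) -> None:
        self.attempts += 1

    def on_send_complete(self, error) -> None:
        # Invoked from the producer's send callback; error is None on success.
        if error is None:
            self.successes += 1
        else:
            self.failures += 1  # a failed send counts as a lost record

    def loss_rate(self) -> float:
        # Data loss rate = lost records / send attempts
        return self.failures / self.attempts if self.attempts else 0.0


# Simulated outcomes: three successful sends and one failure.
counters = LossCounters()
for outcome in [None, None, RuntimeError("buffer full"), None]:
    counters.on_send_attempt()
    counters.on_send_complete(outcome)
print(counters.loss_rate())
```

Counting attempts separately from completions keeps the denominator correct even when a callback has not fired yet.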
Design Principles
● Priority is application availability and user
experience
○ Non-blocking event producing
● Minimize data loss into fronting Kafka at reasonable
cost
Key Configurations
● acks = 1 for producing
○ Reduce the chance that the producer buffer gets full
● max.block.ms = 0
● 2 replicas → 20% cost saving compared to 3
replicas
● Allow unclean leader election
○ Maximize availability for producers
○ Potential duplicates/loss for consumers
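
As a rough illustration, the settings above could be written down like this. The producer keys `acks` and `max.block.ms` are the ones named on the slide. Mapping "2 replicas" and "allow unclean leader election" to the broker settings `default.replication.factor` and `unclean.leader.election.enable` is an assumption, since the slide does not say whether these were set per broker or per topic. `to_properties` is a hypothetical helper.

```python
# Producer-side settings named on the slide.
producer_config = {
    "acks": "1",          # wait for the partition leader only
    "max.block.ms": "0",  # send() never blocks; a full buffer fails fast
}

# Broker-side settings (assumed mapping of the slide's bullets).
broker_config = {
    "default.replication.factor": "2",
    "unclean.leader.election.enable": "true",
}


def to_properties(config: dict) -> str:
    """Render a config dict in Java .properties form, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


print(to_properties(producer_config))
print(to_properties(broker_config))
```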
The Cloud Reality
● Unpredictable instance lifecycle
● Unstable networking
○ Noisy neighbours
○ Cold start
● Little control over clients
ZooKeeper And Controller
● Inconsistent controller state upon session timeout
● Broker’s inability to recover from temporary
ZooKeeper outage
● Can cause major incidents whose root cause is hard
to identify
Our Producer Data Delivery SLA
● Started from 99.9%
○ Loss was a little higher than the original Chukwa pipeline
○ “At three nines, we lose more data than you generate”
● Some big incidents …
Oh Boy ...
Nowadays ...
● Two weeks of data from the peak of the last holiday
season
○ 8.4M lost events for all 7.6T attempts → 99.99989%
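
The percentage above follows directly from the loss-rate definition (lost records / send attempts). A short check of the arithmetic, using only the two figures from the slide:

```python
lost_events = 8.4e6     # 8.4M lost events (from the slide)
send_attempts = 7.6e12  # 7.6T send attempts (from the slide)

loss_rate = lost_events / send_attempts  # roughly 1.1 lost events per million
delivery_pct = (1 - loss_rate) * 100
print(f"{delivery_pct:.5f}%")
```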
A Typical Day
Why Messages Are Dropped
● Producer buffer full
● Root causes
○ Slow response from broker
○ Metadata stale / unavailable
○ Client side problems (hardware, traffic)
What Has Been Done
● Improve broker availability
○ Optimize broker deployment strategy
○ Get rid of the “bad guys” - elimination of broker outliers
○ Move to AWS VPC - Better networking
● Automated producer configuration optimization
● When in trouble - failover!
Change in Deployment Strategy
● Kafka clusters
○ Big clusters with 500 brokers → Small to medium clusters
with 20 to 100 brokers
● ZooKeeper
○ Shared ZooKeeper cluster for all Kafka clusters →
Dedicated ZooKeeper cluster for each fronting Kafka cluster
● Data balancing
○ Uneven distribution of partitions → even distribution of
partitions among brokers
Rack Aware Partition Assignment
● Our contribution to Kafka 0.10
● Replicas of each partition are guaranteed to be
placed on different “racks”
○ A rack is logical and represents your failure protection domain
● Improved availability
○ OK to lose multiple brokers in the same rack
Partition Assignment Without Considering Rack
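N = partition N of a topic with 2 replicas
Rack 0 | Broker 0 | partitions 0, 3
Rack 0 | Broker 1 | partitions 0, 1
Rack 1 | Broker 2 | partitions 1, 2
Rack 1 | Broker 3 | partitions 2, 3
Offline partition: 0 (both replicas are in Rack 0)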
Rack Aware Partition Assignment
N = partition N of a topic with 2 replicas
Rack 0 | Broker 0 | partitions 0, 3
Rack 0 | Broker 1 | partitions 1, 2
Rack 1 | Broker 2 | partitions 0, 1
Rack 1 | Broker 3 | partitions 2, 3
No offline partition
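
One way to obtain a layout like the rack-aware one above is to interleave the brokers by rack and then assign replicas round-robin over that interleaved order. The code below is a simplified sketch of that idea, not the actual Kafka 0.10 implementation. The function names are hypothetical, and it assumes every rack holds the same number of brokers and the replication factor does not exceed the number of racks.

```python
from itertools import zip_longest


def interleave_by_rack(broker_racks: dict) -> list:
    """Order brokers so that consecutive entries sit on different racks."""
    by_rack = {}
    for broker, rack in sorted(broker_racks.items()):
        by_rack.setdefault(rack, []).append(broker)
    # Take one broker from each rack in turn.
    columns = zip_longest(*(by_rack[rack] for rack in sorted(by_rack)))
    return [b for column in columns for b in column if b is not None]


def assign(broker_racks: dict, num_partitions: int, replication_factor: int) -> dict:
    """Map each partition to a list of brokers that span distinct racks."""
    order = interleave_by_rack(broker_racks)
    return {
        p: [order[(p + r) % len(order)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


# Same topology as the slide: brokers 0-1 on rack 0, brokers 2-3 on rack 1.
racks = {0: "rack0", 1: "rack0", 2: "rack1", 3: "rack1"}
assignment = assign(racks, num_partitions=4, replication_factor=2)
for partition, brokers in assignment.items():
    print(partition, brokers, [racks[b] for b in brokers])
```

Because neighbours in the interleaved list are always on different racks, losing every broker in one rack still leaves one replica of each partition online.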
Overcome the “Co-location” Problem
● Multiple brokers “killed” at the same time by AWS.
Why?
● Definition
○ Multiple brokers in the same cluster are located on the
same physical host in the cloud
● Impact reduced by Rack Aware Partition
Assignment
● Manually apply the trick of “detach” from ASG
Outliers
● Origins of outliers
○ Bad hardware
○ Noisy neighbours
○ Uneven workload
● Symptoms of outliers
○ Significantly higher response time
○ Frequent TCP timeouts/retransmissions
Cascading Effect of Outliers
[Diagram labels: Event Producer (buffer exhausted and message drop); Kafka (broker with networking problem, slow replication, disk read causes slow responses); failures marked with X]
The Art Of Outlier Detection
Same broker shown as outlier for multiple metrics
Visualizing Outliers
To Kill or Not To Kill, That Is the Question
● The dilemma of terminating brokers
● Automated termination with time based
suppression
○ Use 99th percentile of produce and fetch response time
○ Static threshold
○ Limit one per 24 hours per cluster
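
The termination rule above can be sketched roughly as below. This is an illustration under stated assumptions, not the tooling described in the talk: `TerminationPolicy`, the 500 ms threshold, and the nearest-rank percentile are all hypothetical choices, and a single list of samples stands in for the produce and fetch response times.

```python
def p99(samples: list) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


class TerminationPolicy:
    """Static p99 threshold with at most one kill per cluster per window."""

    def __init__(self, threshold_ms: float, suppress_seconds: float = 24 * 3600):
        self.threshold_ms = threshold_ms
        self.suppress_seconds = suppress_seconds
        self.last_kill = {}  # cluster name -> time of the last termination

    def should_terminate(self, cluster: str, response_times_ms: list, now: float) -> bool:
        if p99(response_times_ms) <= self.threshold_ms:
            return False  # broker is within the static threshold
        last = self.last_kill.get(cluster)
        if last is not None and now - last < self.suppress_seconds:
            return False  # suppressed: a broker was already terminated recently
        self.last_kill[cluster] = now
        return True


policy = TerminationPolicy(threshold_ms=500)
slow_broker = [20] * 90 + [900] * 10   # 10% of requests are very slow
healthy_broker = [20] * 100

print(policy.should_terminate("cluster-a", slow_broker, now=0))
print(policy.should_terminate("cluster-a", slow_broker, now=3600))
print(policy.should_terminate("cluster-a", slow_broker, now=25 * 3600))
```

The time-based suppression is what keeps a cluster-wide slowdown from triggering a chain of terminations.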
Move To AWS VPC
● Huge improvement in networking vs. EC2 Classic
○ Fewer transient networking errors
○ Lower latency
○ Tolerates a higher packet rate (packets per second)
Producer Tuning
● Buffer size tuning
○ Handle transient traffic spike
○ The goal: buffer size large enough to hold 10 seconds of
send data
● “Eager” vs. “lazy” initialization of producers
● Re-instantiate the producer
● Termination of bad clients
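
The buffer sizing goal above is simple arithmetic: buffer size = send rate × 10 seconds. A minimal sketch follows. The 3 MB/s send rate is an invented example, `buffer_bytes_for` is a hypothetical helper, and using the result as the Kafka producer's `buffer.memory` setting is an assumption, since the slide does not name the setting.

```python
def buffer_bytes_for(send_rate_bytes_per_sec: float, seconds: float = 10.0) -> int:
    """Bytes needed to hold `seconds` worth of outgoing data."""
    return int(send_rate_bytes_per_sec * seconds)


# Hypothetical producer that sends 3 MB/s at peak.
peak_rate = 3 * 1024 * 1024
buffer_memory = buffer_bytes_for(peak_rate)
print(f"buffer.memory={buffer_memory}")
```

Sizing from the measured peak rate rather than the average is what lets the buffer absorb a transient spike.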
When Things Go Wrong
When Things Go Wrong - Failover
● Taking advantage of cloud elasticity
● Cold standby Kafka cluster with 0 instances and
ready to scale up
● Different ZooKeeper cluster with no state
● Replication factor = 1
Failover
[Diagram labels: Event Producer, Fronting Kafka, Router, Consumer Kafka, Consumer, X (failure marker), Copy topic metadata]
Failover
● Time is of the essence: failover in as little as 5 minutes
● Fully automated
Keystone Tech Blogs
http://techblog.netflix.com/search/label/keystone
@allenxwang

More Related Content

What's hot

Apache kafka
Apache kafkaApache kafka
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in Kafka
Jayesh Thakrar
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
emreakis
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
Ketan Gote
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
Avinash Ramineni
 
The RED Method: How to monitoring your microservices.
The RED Method: How to monitoring your microservices.The RED Method: How to monitoring your microservices.
The RED Method: How to monitoring your microservices.
Grafana Labs
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
Amita Mirajkar
 
import rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythonimport rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Python
groveronline
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Flink Forward
 
LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...
kawamuray
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
Chhavi Parasher
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Henning Jacobs
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
Yoshinori Matsunobu
 
kafka
kafkakafka
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
Amir Sedighi
 

What's hot (20)

Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in Kafka
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
Building zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafkaBuilding zero data loss pipelines with apache kafka
Building zero data loss pipelines with apache kafka
 
The RED Method: How to monitoring your microservices.
The RED Method: How to monitoring your microservices.The RED Method: How to monitoring your microservices.
The RED Method: How to monitoring your microservices.
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
import rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythonimport rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Python
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...LINE's messaging service architecture underlying more than 200 million monthl...
LINE's messaging service architecture underlying more than 200 million monthl...
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
kafka
kafkakafka
kafka
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 

Similar to From Three Nines to Five Nines - A Kafka Journey

Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Monal Daxini
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Peter Bakas
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
Ankur Bansal
 
Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
confluent
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
HostedbyConfluent
 
Kraken mesoscon 2018
Kraken mesoscon 2018Kraken mesoscon 2018
Kraken mesoscon 2018
joeyzhang1989928
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Monal Daxini
 
Instaclustr Kafka Meetup Sydney Presentation
Instaclustr Kafka Meetup Sydney PresentationInstaclustr Kafka Meetup Sydney Presentation
Instaclustr Kafka Meetup Sydney Presentation
Ben Slater
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
Hotstar
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ Netflix
Ido Shilon
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
HostedbyConfluent
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Ankur Bansal
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
DataWorks Summit
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016
Peter Bakas
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
Amazon Web Services
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Monal Daxini
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
Monal Daxini
 

Similar to From Three Nines to Five Nines - A Kafka Journey (20)

Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
 
Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
 
Kraken mesoscon 2018
Kraken mesoscon 2018Kraken mesoscon 2018
Kraken mesoscon 2018
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
 
Instaclustr Kafka Meetup Sydney Presentation
Instaclustr Kafka Meetup Sydney PresentationInstaclustr Kafka Meetup Sydney Presentation
Instaclustr Kafka Meetup Sydney Presentation
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ Netflix
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Keystone - ApacheCon 2016
Keystone - ApacheCon 2016Keystone - ApacheCon 2016
Keystone - ApacheCon 2016
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
 

Recently uploaded

Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
bmucuha
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 

Recently uploaded (20)

Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 

From Three Nines to Five Nines - A Kafka Journey

  • 1. @allenxwang From Three Nines to Five Nines A Kafka Journey Allen Wang
  • 2. At 10,000 Feet Minimize your data loss under these conditions ● Huge volume of data ● Limited configuration options ● Less ideal and constantly changing environment ● Balanced against cost
  • 3. The State Of Kafka in Netflix ● Daily average ○ 1 trillion events ○ 3 Petabyte of data processed ● At peak ○ 1.26 trillion events / day ○ 20 million events / sec ○ 55 GB / sec
  • 4. The State Of Kafka in Netflix ● Managing 3,000+ brokers and ~50 clusters ● Currently on 0.9 ● In AWS VPC
  • 5. Powered By Kafka A NETFLIX ORIGINAL SERVICE
  • 7. Deployment Configuration Fronting Kafka Clusters Consumer Kafka Clusters Number of clusters 24 15 Total number of instances 1700+ 1100+ Instance type d2.2xl i2.2xl Replication factor 2 2 Retention period 8 to 24 hours 2 to 4 hours
  • 8. A Peek into the Data ● Business related ○ Session information ○ Device logs ○ Feedback to recommendation and streaming algorithms ● System and infrastructure related ○ Application logs and distributed tracing
  • 9. The Data Loss Philosophy ● Not all data are created equal ● The spectrum of data loss ● Lossless data delivery is not a necessity and should be always balanced against cost 0.1% 0.5% 1% 5% Percent loss
  • 10. Data Loss Measurement ● Use producer send callback API ● Related counters ○ Send attempt ○ Send success ○ Send fail → Lost record ● Data loss rate = lost record / send attempt
  • 11. Design Principles ● Priority is application availability and user experience ○ Non-blocking event producing ● Minimize data loss into fronting Kafka at reasonable cost
  • 12. Key Configurations ● acks = 1 for producing ○ Reduce the chance that the producer buffer gets full ● max.block.ms = 0 ● 2 replicas → 20% cost saving compared to 3 replicas ● Allow unclean leader election ○ Maximize availability for producers ○ Potential duplicates/loss for consumers
  • 13. The Cloud Reality ● Unpredictable instance lifecycle ● Unstable networking ○ Noisy neighbours ○ Cold start ● Little control over clients
  • 14. ZooKeeper And Controller ● Inconsistent controller state upon session timeout ● Broker’s inability to recover from a temporary ZooKeeper outage ● Can cause big incidents whose root cause is hard to identify
  • 15. Our Producer Data Delivery SLA ● Started from 99.9% ○ Loss was a little higher than the original Chukwa pipeline ○ “At three nines, we lose more data than you generate” ● Some big incidents …
  • 17. Nowadays ... ● Two weeks of data from the peak of last holiday season ○ 8.4M lost events out of 7.6T attempts → 99.99989%
  • 19. Why Messages Are Dropped ● Producer buffer full ● Root causes ○ Slow response from broker ○ Metadata stale / unavailable ○ Client side problems (hardware, traffic)
  • 20. What Has Been Done ● Improve broker availability ○ Optimize broker deployment strategy ○ Get rid of the “bad guys” - elimination of broker outliers ○ Move to AWS VPC - Better networking ● Automated producer configuration optimization ● When in trouble - failover!
  • 21. Change in Deployment Strategy ● Kafka clusters ○ Big clusters with 500 brokers → Small to medium clusters with 20 to 100 brokers ● ZooKeeper ○ Shared ZooKeeper cluster for all Kafka clusters → Dedicated ZooKeeper cluster for each fronting Kafka cluster ● Data balancing ○ Uneven distribution of partitions → even distribution of partitions among brokers
  • 22. Rack Aware Partition Assignment ● Our contribution to Kafka 0.10 ● Replicas of each partition are guaranteed to be placed on different “racks” ○ A rack is logical and represents your failure protection domain ● Improved availability ○ OK to lose multiple brokers in the same rack
  • 23. Partition Assignment Without Considering Rack [Diagram: partitions 0 to 3 of a topic with 2 replicas placed on Broker 0 to Broker 3 across Rack 0 and Rack 1; partition 0 is marked as an offline partition]
  • 24. Rack Aware Partition Assignment [Diagram: partitions 0 to 3 of a topic with 2 replicas placed on Broker 0 to Broker 3 across Rack 0 and Rack 1; no offline partition]
  • 25. Overcome the “Co-location” Problem ● Multiple brokers “killed” at the same time by AWS. Why? ● Definition ○ Multiple brokers in the same cluster are located on the same physical host in the cloud ● Impact reduced by Rack Aware Partition Assignment ● Manually apply the trick of “detach” from ASG
  • 26. Outliers ● Origins of outliers ○ Bad hardware ○ Noisy neighbours ○ Uneven workload ● Symptoms of outliers ○ Significantly higher response time ○ Frequent TCP timeouts/retransmissions
  • 27. Cascading Effect of Outliers [Diagram: Event Producer and Kafka brokers. Labels: broker with networking problem; slow replication; disk read causes slow responses; buffer exhausted and message drop]
  • 28. The Art Of Outlier Detection
  • 29. Same broker shown as outlier for multiple metrics
  • 31. To Kill or Not To Kill, That Is the Question ● The dilemma of terminating brokers ● Automated termination with time based suppression ○ Use 99th percentile of produce and fetch response time ○ Static threshold ○ Limit one per 24 hours per cluster
  • 32. Move To AWS VPC ● Huge improvement in networking vs. EC2 classic ○ Fewer transient networking errors ○ Lower latency ○ Tolerates a higher packet-per-second rate
  • 33. Producer Tuning ● Buffer size tuning ○ Handle transient traffic spike ○ The goal: buffer size large enough to hold 10 seconds of send data ● “Eager” vs. “lazy” initialization of producers ● Re-instantiate the producer ● Termination of bad clients
  • 34. When Things Go Wrong
  • 35. When Things Go Wrong - Failover ● Taking advantage of cloud elasticity ● Cold standby Kafka cluster with 0 instances and ready to scale up ● Different ZooKeeper cluster with no state ● Replication factor = 1
  • 37. Failover ● Time is of the essence - failover in as little as 5 minutes ● Fully automated
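
The loss measurement on slide 10 and the figure on slide 17 can be sketched as follows. This is a minimal illustration, not the Keystone implementation: the class and function names are hypothetical, and `on_completion` stands in for whatever the producer send callback does in practice.

```python
from dataclasses import dataclass


@dataclass
class LossCounters:
    """Counters updated around a producer send callback (names are illustrative)."""
    attempts: int = 0
    successes: int = 0
    failures: int = 0

    def on_send(self) -> None:
        # Call once for every record handed to the producer.
        self.attempts += 1

    def on_completion(self, error) -> None:
        # Call from the send callback; a non-None error counts as a lost record.
        if error is None:
            self.successes += 1
        else:
            self.failures += 1

    def loss_rate(self) -> float:
        # Data loss rate = lost records / send attempts.
        return self.failures / self.attempts if self.attempts else 0.0


def delivery_percent(lost: float, attempts: float) -> float:
    """Delivery rate as a percentage, i.e. 100 * (1 - loss rate)."""
    return 100.0 * (1.0 - lost / attempts)


# Slide 17's figures: 8.4M lost events out of 7.6T attempts.
print(f"{delivery_percent(8.4e6, 7.6e12):.5f}%")  # prints 99.99989%
```

Counting attempts separately from callback outcomes also covers records dropped before reaching a broker, since those still fail through the callback path.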
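
The settings on slide 12 could be written down roughly as below. The slide names only `acks` and `max.block.ms`; mapping "2 replicas" to `default.replication.factor` and "allow unclean leader election" to `unclean.leader.election.enable` is an assumption about which Kafka keys were used, and the `to_properties` helper is purely illustrative.

```python
# Producer-side settings named on slide 12.
PRODUCER_CONFIG = {
    "acks": "1",          # wait for the leader only
    "max.block.ms": "0",  # never block the application thread in send()
}

# Broker-side settings; the key names are an assumed mapping of the slide's bullets.
BROKER_CONFIG = {
    "default.replication.factor": "2",
    "unclean.leader.election.enable": "true",
}


def to_properties(config: dict) -> str:
    """Render a config dict as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


print(to_properties(PRODUCER_CONFIG))
print(to_properties(BROKER_CONFIG))
```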
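
The guarantee on slides 22 to 24 (replicas of a partition land on different racks) can be illustrated with a small sketch. This is not the algorithm that shipped in Kafka 0.10; it is a simplified stand-in that interleaves brokers by rack and skips any broker whose rack already holds a replica. The function names and the broker-to-rack map are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_rack(broker_racks: dict) -> list:
    """Order brokers so that neighbours in the list sit on different racks where possible."""
    by_rack = defaultdict(list)
    for broker in sorted(broker_racks):
        by_rack[broker_racks[broker]].append(broker)
    ordered = []
    for group in zip_longest(*(by_rack[rack] for rack in sorted(by_rack))):
        ordered.extend(b for b in group if b is not None)
    return ordered


def assign(num_partitions: int, replication_factor: int, broker_racks: dict) -> dict:
    """Return {partition: [broker, ...]} with every replica on a distinct rack."""
    if replication_factor > len(set(broker_racks.values())):
        raise ValueError("replication factor exceeds the number of racks")
    order = interleave_by_rack(broker_racks)
    assignment = {}
    for partition in range(num_partitions):
        replicas, racks_used = [], set()
        # Walk the interleaved ring, starting at a different broker per partition.
        for step in range(len(order)):
            broker = order[(partition + step) % len(order)]
            if broker_racks[broker] not in racks_used:
                replicas.append(broker)
                racks_used.add(broker_racks[broker])
            if len(replicas) == replication_factor:
                break
        assignment[partition] = replicas
    return assignment


# Four brokers on two racks, as in the slide diagrams.
racks = {0: "rack0", 1: "rack0", 2: "rack1", 3: "rack1"}
print(assign(num_partitions=4, replication_factor=2, broker_racks=racks))
```

With this layout, losing both brokers of one rack still leaves one replica of every partition online, which is the availability property slide 22 describes.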
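
Slide 31's policy (99th percentile response time against a static threshold, at most one termination per 24 hours per cluster) might look like the sketch below. The class name, the nearest-rank percentile, and the 500 ms threshold in the usage lines are assumptions for illustration; the slides do not give the actual threshold or detection code.

```python
def percentile_99(samples: list) -> float:
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


class TerminationGate:
    """Decides whether an outlier broker may be terminated right now."""

    def __init__(self, threshold_ms: float, suppress_seconds: int = 24 * 3600):
        self.threshold_ms = threshold_ms
        self.suppress_seconds = suppress_seconds
        self._last_termination = {}  # cluster name -> timestamp of last termination

    def should_terminate(self, cluster: str, response_times_ms: list, now: float) -> bool:
        if not response_times_ms:
            return False
        if percentile_99(response_times_ms) <= self.threshold_ms:
            return False  # not an outlier by the static threshold
        last = self._last_termination.get(cluster)
        if last is not None and now - last < self.suppress_seconds:
            return False  # suppressed: this cluster already lost a broker recently
        self._last_termination[cluster] = now
        return True


gate = TerminationGate(threshold_ms=500.0)  # hypothetical threshold
slow = [900.0] * 100
print(gate.should_terminate("fronting-1", slow, now=0))     # first outlier in this cluster
print(gate.should_terminate("fronting-1", slow, now=3600))  # within the 24 hour window
```

Passing `now` explicitly keeps the gate deterministic and easy to test; a real deployment would presumably feed it wall-clock time.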
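
The buffer sizing goal on slide 33 (a buffer large enough to hold 10 seconds of send data) is simple arithmetic. The sketch below assumes the result would feed the producer's `buffer.memory` setting, which the slides do not state, and the 2 MiB/s send rate is a made-up example value.

```python
def buffer_bytes_for(peak_send_bytes_per_sec: int, seconds_to_hold: int = 10) -> int:
    """Buffer size needed to absorb `seconds_to_hold` seconds of sends at peak rate."""
    if peak_send_bytes_per_sec < 0 or seconds_to_hold < 0:
        raise ValueError("rate and duration must be non-negative")
    return peak_send_bytes_per_sec * seconds_to_hold


MIB = 1024 * 1024
# Hypothetical instance producing 2 MiB/s at peak.
print(buffer_bytes_for(2 * MIB))
```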