@allenxwang
From Three Nines to Five Nines
A Kafka Journey
Allen Wang
At 10,000 Feet
Minimize your data loss under these conditions
● Huge volume of data
● Limited configuration options
● A less-than-ideal and constantly changing environment
● Balanced against cost
The State Of Kafka at Netflix
● Daily average
○ 1 trillion events
○ 3 petabytes of data processed
● At peak
○ 1.26 trillion events / day
○ 20 million events / sec
○ 55 GB / sec
The State Of Kafka at Netflix
● Managing 3,000+ brokers and ~50 clusters
● Currently on 0.9
● In AWS VPC
Powered By Kafka
A NETFLIX ORIGINAL SERVICE
Keystone Data Pipeline
[Architecture diagram: Event Producer → Fronting Kafka → Router → Consumer Kafka / EMR / Stream Consumers, with Kafka Management and an HTTP Proxy alongside]
Deployment Configuration
                            Fronting Kafka Clusters    Consumer Kafka Clusters
Number of clusters          24                         15
Total number of instances   1700+                      1100+
Instance type               d2.2xl                     i2.2xl
Replication factor          2                          2
Retention period            8 to 24 hours              2 to 4 hours
A Peek into the Data
● Business related
○ Session information
○ Device logs
○ Feedback to recommendation and streaming algorithms
● System and infrastructure related
○ Application logs and distributed tracing
The Data Loss Philosophy
● Not all data are created equal
● The spectrum of data loss
● Lossless data delivery is not a necessity and should
always be balanced against cost
[Diagram: the spectrum of data loss, from 0.1% to 0.5%, 1%, and 5% loss]
Data Loss Measurement
● Use producer send callback API
● Related counters
○ Send attempt
○ Send success
○ Send fail → Lost record
● Data loss rate = lost record / send attempt
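A minimal sketch of this measurement using the Java producer's send callback; the topic name, bootstrap address, and counter wiring are illustrative assumptions, not the Keystone implementation:

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LossRateCounter {
    private static final AtomicLong attempts = new AtomicLong();
    private static final AtomicLong successes = new AtomicLong();
    private static final AtomicLong lost = new AtomicLong();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "fronting-kafka:9092"); // illustrative address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            attempts.incrementAndGet();                       // send attempt
            producer.send(new ProducerRecord<>("events", "payload"), (metadata, exception) -> {
                if (exception == null) {
                    successes.incrementAndGet();              // send success
                } else {
                    lost.incrementAndGet();                   // send fail -> lost record
                }
            });
        }
        // Data loss rate = lost records / send attempts
        System.out.printf("loss rate = %.7f%n", lost.get() / (double) attempts.get());
    }
}
```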
Design Principles
● Priority is application availability and user
experience
○ Non-blocking event producing
● Minimize data loss into fronting Kafka at reasonable
cost
Key Configurations
● acks = 1 for producing
○ Reduce the chance that the producer buffer gets full
● max.block.ms = 0
● 2 replicas → 20% cost saving compared to 3
replicas
● Allow unclean leader election
○ Maximize availability for producers
○ Potential duplicates/loss for consumers
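As a hedged sketch, the producer side of these settings maps to configuration like the following; the bootstrap address is illustrative, and replication factor 2 plus unclean leader election are broker/topic settings, noted only in comments:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class KeyProducerConfigs {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "fronting-kafka:9092"); // illustrative address
        // acks=1: leader-only acknowledgement keeps sends fast, so the producer buffer drains quickly
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        // max.block.ms=0: never block the application thread when the buffer is full or metadata is
        // missing; send() fails immediately instead, trading possible loss for application availability
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "0");
        // Broker/topic side (not producer configs): topics use replication factor 2, and
        // unclean.leader.election.enable=true maximizes availability for producers at the
        // cost of potential duplicates/loss for consumers.
        return props;
    }
}
```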
The Cloud Reality
● Unpredictable instance lifecycle
● Unstable networking
○ Noisy neighbours
○ Cold start
● Little control over clients
ZooKeeper And Controller
● Inconsistent controller state upon session timeout
● Broker’s inability to recover from temporary
ZooKeeper outage
● Can cause big incidents with hard-to-identify root causes
Our Producer Data Delivery SLA
● Started from 99.9%
○ Loss was a little higher than the original Chukwa pipeline
○ “At three nines, we lose more data than you generate”
● Some big incidents …
Oh Boy ...
Nowadays ...
● Two weeks' data from the peak of the last holiday season
○ 8.4M lost events for all 7.6T attempts → 99.99989%
A Typical Day
Why Messages Are Dropped
● Producer buffer full
● Root causes
○ Slow response from broker
○ Metadata stale / unavailable
○ Client side problems (hardware, traffic)
What Has Been Done
● Improve broker availability
○ Optimize broker deployment strategy
○ Get rid of the “bad guys” - elimination of broker outliers
○ Move to AWS VPC - Better networking
● Automated producer configuration optimization
● When in trouble - failover!
Change in Deployment Strategy
● Kafka clusters
○ Big clusters with 500 brokers → Small to medium clusters
with 20 to 100 brokers
● ZooKeeper
○ Shared ZooKeeper cluster for all Kafka clusters →
Dedicated ZooKeeper cluster for each fronting Kafka cluster
● Data balancing
○ Uneven distribution of partitions → even distribution of
partitions among brokers
Rack Aware Partition Assignment
● Our contribution to Kafka 0.10
● Replicas of each partition are guaranteed to be
placed on different “racks”
○ A rack is logical and represents your failure protection domain
● Improved availability
○ OK to lose multiple brokers in the same rack
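Each broker declares its rack via broker.rack in server.properties (a logical failure domain, e.g. an AWS availability zone). Below is a rough verification sketch using the Java AdminClient; the AdminClient postdates the 0.9/0.10 clusters described here, and the topic name and address are assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

public class RackCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "fronting-kafka:9092"); // illustrative address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("events"))
                    .all().get().get("events");
            for (TopicPartitionInfo p : desc.partitions()) {
                // broker.rack reported by each replica's broker; rack-aware assignment
                // guarantees the replicas of a partition span different racks
                Set<String> racks = p.replicas().stream()
                        .map(Node::rack)
                        .collect(Collectors.toSet());
                System.out.printf("partition %d -> racks %s%n", p.partition(), racks);
            }
        }
    }
}
```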
Partition Assignment Without Considering Rack
[Diagram: Brokers 0 and 1 in Rack 0, Brokers 2 and 3 in Rack 1; for a topic with 2 replicas per partition, both replicas of partition 0 land in Rack 0, so losing Rack 0 leaves partition 0 offline]
Rack Aware Partition Assignment
[Diagram: the same brokers and topic with rack-aware assignment; each partition has one replica in each rack, so no partition goes offline when a rack is lost]
Overcome the “Co-location” Problem
● Multiple brokers “killed” at the same time by AWS.
Why?
● Definition
○ Multiple brokers in the same cluster are located on the
same physical host in the cloud
● Impact reduced by Rack Aware Partition
Assignment
● Manually apply the trick of detaching the instance from the ASG
Outliers
● Origins of outliers
○ Bad hardware
○ Noisy neighbours
○ Uneven workload
● Symptoms of outliers
○ Significantly higher response time
○ Frequent TCP timeouts/retransmissions
Cascading Effect of Outliers
[Diagram: a broker with a networking problem → slow replication → disk reads cause slow responses on other brokers → the event producer's buffer is exhausted and messages are dropped]
The Art Of Outlier Detection
[Screenshot: the same broker shown as an outlier for multiple metrics]
Visualizing Outliers
To Kill or Not To Kill, That Is the Question
● The dilemma of terminating brokers
● Automated termination with time-based suppression
○ Use 99th percentile of produce and fetch response time
○ Static threshold
○ Limit one per 24 hours per cluster
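A hypothetical sketch of that suppression logic; the threshold value, metric plumbing, and class names are assumptions for illustration, not Netflix's tooling:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Decides whether an outlier broker may be terminated: at most one termination per cluster per 24 hours. */
public class OutlierTerminationPolicy {
    private static final double P99_RESPONSE_MS_THRESHOLD = 500.0;   // static threshold; value assumed
    private static final Duration SUPPRESSION_WINDOW = Duration.ofHours(24);

    private final Map<String, Instant> lastTerminationPerCluster = new ConcurrentHashMap<>();

    public boolean shouldTerminate(String cluster, double p99ProduceMs, double p99FetchMs) {
        // A broker is an outlier if its 99th percentile produce or fetch response time breaches the threshold
        boolean outlier = p99ProduceMs > P99_RESPONSE_MS_THRESHOLD || p99FetchMs > P99_RESPONSE_MS_THRESHOLD;
        if (!outlier) {
            return false;
        }
        Instant last = lastTerminationPerCluster.get(cluster);
        if (last != null && Instant.now().isBefore(last.plus(SUPPRESSION_WINDOW))) {
            return false;   // time-based suppression: a broker in this cluster was terminated recently
        }
        lastTerminationPerCluster.put(cluster, Instant.now());
        return true;
    }
}
```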
Move To AWS VPC
● Huge improvement of networking vs. EC2 Classic
○ Fewer transient networking errors
○ Lower latency
○ Tolerates a higher packets-per-second rate
Producer Tuning
● Buffer size tuning
○ Handle transient traffic spike
○ The goal: buffer size large enough to hold 10 seconds of
send data
● “Eager” vs. “lazy” initialization of producers
● Re-instantiate the producer
● Termination of bad clients
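For example, sizing buffer.memory to hold roughly 10 seconds of send data; the throughput figure below is purely illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerBufferSizing {
    /** Returns producer properties with a buffer sized to absorb ~10 seconds of outgoing data. */
    public static Properties tunedProps(long observedBytesPerSecond) {
        long bufferBytes = observedBytesPerSecond * 10;    // goal from the slide: ~10 seconds of send data
        Properties props = new Properties();
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, Long.toString(bufferBytes));
        return props;
    }

    public static void main(String[] args) {
        // e.g. an application sending ~5 MB/s would get a ~50 MB buffer (the default is 32 MB)
        System.out.println(tunedProps(5_000_000L));
    }
}
```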
When Things Go Wrong
When Things Go Wrong - Failover
● Taking advantage of cloud elasticity
● Cold standby Kafka cluster with 0 instances, ready to
scale up
● Different ZooKeeper cluster with no state
● Replication factor = 1
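A hedged sketch of the failover step: recreate the failed cluster's topics on the freshly scaled-up standby and point producers at it. The topic name, partition count, address, and use of the modern AdminClient are assumptions for illustration, not the Keystone tooling:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class FailoverSketch {
    public static void main(String[] args) throws Exception {
        Properties standby = new Properties();
        standby.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "failover-kafka:9092"); // illustrative address

        try (AdminClient admin = AdminClient.create(standby)) {
            // Copy topic metadata: recreate the topic on the standby cluster with
            // replication factor 1, favoring failover speed over durability (per the slide)
            NewTopic topic = new NewTopic("events", 36, (short) 1);  // partition count assumed
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
        // Producers are then re-instantiated with bootstrap.servers pointing at the standby cluster.
    }
}
```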
Failover
[Diagram: the event producer's path to the failed fronting Kafka cluster is cut; topic metadata is copied to the failover cluster and producers switch to it, while the router continues feeding consumer Kafka and consumers]
Failover
● Time is of the essence - failover as fast as 5 minutes
● Fully automated
@allenxwang
Keystone Tech Blogs
http://techblog.netflix.com/search/label/keystone
@allenxwang
