This document summarizes Netflix's use of Kafka in its data pipeline. It describes how Netflix evolved from S3 and EMR to a pipeline built around Kafka, with producers and consumers handling 400 billion events per day. It covers the challenges of scaling Kafka clusters and of tuning Kafka clients and brokers, and closes with Netflix's roadmap, which includes contributing to open-source projects such as Kafka and testing failure resilience.
Overview of Netflix's data pipeline with Kafka. Netflix processes 400 billion events a day.
Transition from older data systems to a new pipeline with Kafka at its core, enhancing efficiency.
Strategies for managing consumers and data retention with various priority levels for event types.
Focus on maintaining service during Kafka outages and ensuring performance in cloud environments.
Insights on running Kafka on AWS, including health checks, metric reporting, and future roadmaps.
Utilizing an Auditor service for performance monitoring and configuration of producers and consumers.
Addressing challenges with ZooKeeper, scaling issues, and tuning strategies for optimal performance.
Plans for enhancing Kafka resilience, contributing to the community, and ongoing improvements.
Producer Resilience
● A Kafka outage should never disrupt existing instances from serving their business purpose
● A Kafka outage should never prevent new instances from starting up
● After the Kafka cluster is restored, event producing should resume automatically
Fail but Never Block
● block.on.buffer.full=false
● Handle potential blocking of the first metadata request
● Periodically check whether the KafkaProducer was opened successfully
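As an illustration of these three points, here is a minimal sketch of a producer wrapper that never blocks the calling service, assuming the 0.8.2/0.9-era Java producer where block.on.buffer.full and metadata.fetch.timeout.ms are valid configs; the class name and retry pattern are hypothetical, not Netflix's actual wrapper.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NonBlockingProducerHolder {
    private final AtomicReference<KafkaProducer<byte[], byte[]>> producer = new AtomicReference<>();
    private final Properties props = new Properties();

    public NonBlockingProducerHolder(String bootstrapServers) {
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Drop events instead of blocking the caller when the local buffer is full.
        props.put("block.on.buffer.full", "false");
        // Bound how long the first metadata request may block a send() (0.8.2/0.9-era config).
        props.put("metadata.fetch.timeout.ms", "5000");
    }

    /** Try to open the producer; callers can retry periodically until it succeeds. */
    public synchronized boolean tryOpen() {
        if (producer.get() != null) {
            return true;
        }
        try {
            producer.set(new KafkaProducer<>(props));
            return true;
        } catch (Exception e) {
            // Kafka unreachable at startup: stay up and retry later instead of failing the instance.
            return false;
        }
    }

    /** Fire-and-forget send; events are dropped (never block) while Kafka is unavailable. */
    public void send(String topic, byte[] payload) {
        KafkaProducer<byte[], byte[]> p = producer.get();
        if (p != null) {
            p.send(new ProducerRecord<>(topic, payload));
        }
    }
}
```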
What Does It Take to Run in the Cloud
● Support elasticity
● Respond to scaling events
● Resilience to failures
o Favors architecture without single point of failure
o Retries, smart routing, fallback ...
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Bootstrap
● Broker ID assignment (see the sketch after this list)
o Instances obtain sequential numeric IDs using Curator's locks recipe, persisted in ZK
o Entries for terminated instances are cleaned up and their IDs reused
o Same ID upon restart
● Bootstrap Kafka properties from Archaius
o Files
o System properties/Environment variables
o Persisted properties service
● Service registration
o Register with Eureka for internal service discovery
o Register with AWS Route53 DNS service
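The broker ID assignment above can be illustrated with a minimal Curator sketch. It assumes a hypothetical znode layout under /kafka-broker-ids and uses Curator's InterProcessMutex lock recipe; cleanup of terminated instances is omitted, and this is not Netflix's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BrokerIdAssigner {
    // Hypothetical znode layout: /kafka-broker-ids/<id> holds the owning instance id.
    private static final String IDS_PATH = "/kafka-broker-ids";
    private static final String LOCK_PATH = "/kafka-broker-ids-lock";

    public static int acquireBrokerId(CuratorFramework client, String instanceId) throws Exception {
        InterProcessMutex lock = new InterProcessMutex(client, LOCK_PATH);
        lock.acquire();
        try {
            if (client.checkExists().forPath(IDS_PATH) == null) {
                client.create().creatingParentsIfNeeded().forPath(IDS_PATH);
            }
            List<String> existing = client.getChildren().forPath(IDS_PATH);
            // Reuse our previous ID if this instance restarted ("same ID upon restart").
            for (String id : existing) {
                byte[] owner = client.getData().forPath(IDS_PATH + "/" + id);
                if (instanceId.equals(new String(owner, StandardCharsets.UTF_8))) {
                    return Integer.parseInt(id);
                }
            }
            // Otherwise claim the lowest free sequential ID under the lock.
            int id = 0;
            while (existing.contains(Integer.toString(id))) {
                id++;
            }
            client.create().forPath(IDS_PATH + "/" + id, instanceId.getBytes(StandardCharsets.UTF_8));
            return id;
        } finally {
            lock.release();
        }
    }

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        System.out.println("broker.id=" + acquireBrokerId(client, "i-0abc123"));
    }
}
```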
Metric Reporting
● We use Servo and Atlas from NetflixOSS
● Diagram: Kafka metrics → MetricsReporter (Yammer → Servo adaptor) and JMX → Atlas service
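A rough sketch of what a Yammer → Servo adaptor could look like, assuming Yammer Metrics 2.x (which 0.8-era Kafka brokers use to register their metrics) and NetflixOSS Servo; the class name is hypothetical and only gauges are bridged here.

```java
import java.util.Map;

import com.netflix.servo.DefaultMonitorRegistry;
import com.netflix.servo.monitor.BasicGauge;
import com.netflix.servo.monitor.MonitorConfig;
import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Gauge;
import com.yammer.metrics.core.Metric;
import com.yammer.metrics.core.MetricName;

public class YammerToServoBridge {
    /** Walk the Yammer registry that Kafka populates and mirror numeric gauges into Servo. */
    public void mirrorGauges() {
        for (Map.Entry<MetricName, Metric> e : Metrics.defaultRegistry().allMetrics().entrySet()) {
            if (e.getValue() instanceof Gauge) {
                final Gauge<?> gauge = (Gauge<?>) e.getValue();
                MonitorConfig config = MonitorConfig.builder(e.getKey().getName()).build();
                // Servo gauge polls the underlying Yammer gauge; Atlas then scrapes Servo.
                DefaultMonitorRegistry.getInstance().register(new BasicGauge<Double>(config, () -> {
                    Object v = gauge.value();
                    return (v instanceof Number) ? ((Number) v).doubleValue() : 0.0;
                }));
            }
        }
    }
}
```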
Health check service
● Use Curator to periodically read ZooKeeper data to find signs of unhealthiness
● Export metrics to Servo/Atlas
● Expose the service via embedded Jetty
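A minimal sketch of such a health check, assuming the broker registers itself under /brokers/ids in ZooKeeper and that an embedded Jetty server answers on a hypothetical port; metric export to Servo/Atlas is omitted.

```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class KafkaHealthCheckService {
    public static void main(String[] args) throws Exception {
        final String brokerId = args.length > 0 ? args[0] : "1";
        final CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Embedded Jetty endpoint: 200 if this broker's ID is registered in ZK, 500 otherwise.
        Server server = new Server(7101);
        server.setHandler(new AbstractHandler() {
            @Override
            public void handle(String target, Request baseRequest,
                               HttpServletRequest request, HttpServletResponse response) {
                try {
                    boolean healthy = zk.checkExists().forPath("/brokers/ids/" + brokerId) != null;
                    response.setStatus(healthy ? 200 : 500);
                } catch (Exception e) {
                    response.setStatus(500);
                } finally {
                    baseRequest.setHandled(true);
                }
            }
        });
        server.start();
        server.join();
    }
}
```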
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
ZooKeeper
● Dedicated 5-node cluster for our data pipeline services
● EIP based
● SSD instances
Auditor
● Highly configurable producers and consumers with their own set of topics and metadata in messages
● Built as a service deployable on single or multiple instances
● Runs as producer, consumer, or both
● Supports replay of a preconfigured set of messages
ZooKeeper Client
● Challenges
o Brokers/consumers cannot survive a ZooKeeper cluster rolling push due to caching of private IPs
o Temporary DNS lookup failure at new session initialization kills future communication
ZooKeeper Client
● Solutions (see the sketch after this list)
o Created our internal fork of the Apache ZooKeeper client
o Periodically refresh private IP resolution
o Fall back to the last good private IP resolution upon DNS lookup failure
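A minimal sketch of the refresh-and-fall-back idea (not Netflix's actual ZooKeeper client fork), using plain JDK name resolution; the class name and refresh interval are assumptions.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Keeps a ZooKeeper server hostname resolved to a private IP, refreshing it
 * periodically and falling back to the last good answer when DNS lookup fails.
 */
public class RefreshingResolver {
    private final String hostname;
    private final AtomicReference<InetAddress> lastGood = new AtomicReference<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public RefreshingResolver(String hostname) {
        this.hostname = hostname;
        refresh();                                         // resolve once up front
        scheduler.scheduleAtFixedRate(this::refresh, 30, 30, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            lastGood.set(InetAddress.getByName(hostname)); // pick up replaced ZK nodes
        } catch (UnknownHostException e) {
            // DNS hiccup: keep using the last good resolution instead of killing the session.
        }
    }

    public InetAddress currentAddress() {
        return lastGood.get();
    }
}
```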
Strategy #1 Add Partitions to New Brokers
● Caveats
o Most of our topics do not use keyed messages
o Number of partitions is still small
o Requires the high-level consumer
Strategy #1 Add Partitions to New Brokers
● Challenge: existing admin tools do not support atomically adding partitions and assigning them to new brokers
Strategy #1 Add Partitions to New Brokers
● Solution: created our own tool to do it in one ZooKeeper change, repeated for all or selected topics (see the sketch after this list)
● Reduced the time to scale up from a few hours to a few minutes
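A rough sketch of the single-ZooKeeper-change idea (not Netflix's actual tool), assuming the 0.8-era /brokers/topics/&lt;topic&gt; JSON layout ({"version":1,"partitions":{"0":[...],...}}); a real tool must also honor the replication factor and spread replicas sensibly across the new brokers.

```java
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.curator.framework.CuratorFramework;

public class AddPartitionsTool {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Adds `count` partitions to `topic`, all assigned to `newBrokerIds`,
     * by rewriting the topic's assignment znode in a single setData call.
     */
    @SuppressWarnings("unchecked")
    public static void addPartitions(CuratorFramework zk, String topic,
                                     int count, List<Integer> newBrokerIds) throws Exception {
        String path = "/brokers/topics/" + topic;
        Map<String, Object> node = MAPPER.readValue(zk.getData().forPath(path), Map.class);
        Map<String, Object> partitions = (Map<String, Object>) node.get("partitions");

        int next = partitions.size();                 // existing partitions are numbered 0..n-1
        for (int i = 0; i < count; i++) {
            partitions.put(Integer.toString(next + i), newBrokerIds);
        }
        // One ZooKeeper write: the controller notices the new partitions via its watch.
        zk.setData().forPath(path, MAPPER.writeValueAsBytes(node));
    }
}
```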
Strategy #2 Move Partitions
● Should work without preconditions, but ...
● Huge increase in network I/O affecting incoming traffic
● A much longer process than adding partitions
● Sometimes confusing error messages
● Would work if the pace of replication could be controlled
Scale-down strategy
● There is none
● Looking for more support to automatically move all partitions from one set of brokers to a different set
Client tuning
● Producer (see the sketch after this list)
o Batching is important to reduce CPU and network I/O on brokers
o Stick to one partition for a while when producing non-keyed messages
o "linger.ms" works well with a sticky partitioner
● Consumer
o With a huge number of consumers, set a proper fetch.wait.max.ms to reduce polling traffic on the brokers
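A minimal sketch of the producer side of this tuning, assuming the Java producer where batch.size and linger.ms are standard configs; the sticky partitioner class referenced in the comment is hypothetical here (newer Kafka clients, 2.4+, ship a sticky partitioner for non-keyed messages by default).

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;

public class TunedProducerFactory {
    public static KafkaProducer<byte[], byte[]> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Let records accumulate briefly so each request carries a bigger batch,
        // cutting per-record CPU and network overhead on the brokers.
        props.put("linger.ms", "100");
        props.put("batch.size", "65536");   // bytes per partition batch
        // For non-keyed messages, a "sticky" partitioner that keeps writing to one partition
        // for a while (instead of pure round-robin) makes linger.ms far more effective.
        // props.put("partitioner.class", "com.example.StickyPartitioner");  // hypothetical class
        return new KafkaProducer<>(props);
    }
}
```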
Effect of batching
partitioner                   | batched records per request | broker CPU util [1]
random without lingering      | 1.25                        | 75%
sticky without lingering      | 2.0                         | 50%
sticky with 100 ms lingering  | 15                          | 33%
[1] 10 MB & 10K msgs / second per broker, 1 KB per message
Broker tuning
● Use the G1 collector
● Use a large page cache and plenty of memory
● Increase the max file descriptor limit if you have thousands of producers or consumers
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Roadmap
● Work with the Kafka community on rack/zone aware replica assignment
● Failure resilience testing
o Chaos Monkey
o Chaos Gorilla
● Contribute to open source
o Kafka
o Schlep -- our messaging library including SQS and Kafka support
o Auditor