This document discusses building a fault-tolerant Kafka cluster on AWS to handle 2.5 billion requests per day. It covers choosing AWS instance types and broker counts, spreading brokers across availability zones, configuring replication and partitioning, automating fault tolerance, adding metrics and alerts, and testing the cluster's resilience. Key decisions include broker placement, topic partitioning, Zookeeper ensemble sizing, and automation to dynamically reassign partitions and change configurations in response to failures or added capacity.
2. A little bit about our production
● 2.5 billion requests per day and growing
● Hosted on AWS
● Microservice architecture
● Kafka is our main message bus
● Most of the code is written in Clojure
● Almost all of the services consume from and/or produce to Kafka
3. What this lecture includes
● A quick overview of Kafka
● Why did we choose Kafka?
● Decisions to make when building a Kafka cluster
● Planning for fault tolerance
● Setting the defaults
● Automating fault tolerance
● Reassigning partitions and changing retention on the fly
● Adding metrics
● Testing the cluster
● A demo of managing the cluster
4. Quick Kafka overview
● Open-source message bus developed by LinkedIn
● Designed as a distributed system
● Offers high throughput for both publishing and subscribing
● Persists messages on disk
● Supports multiple subscribers and automatically rebalances consumers on failure
TERMS
● A stream of messages of a particular type is defined as a topic
● A message is the payload; a topic is the category to which messages are published
● A producer is anyone who publishes messages to a topic
● Published messages are stored on a set of servers called brokers, which together form the cluster
● A consumer can subscribe to one or more topics and consume messages from the brokers
5. Why Kafka?
● It fits our architecture of event streams: most of our services consume, run logic, and then act or produce a new event
● We need a resilient solution, since it's our main message bus
● It scales out nicely
● The same messages are often consumed by different services; Kafka enables that natively, because messages are deleted only after the retention period, not when consumed
● Our large and growing number of messages requires high throughput
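The multi-subscriber point above can be illustrated with a toy in-memory log (not Kafka's actual API): each consumer tracks its own offset into an append-only partition, so reading never deletes messages.

```python
# Toy model of a Kafka partition: an append-only log that consumers
# read by offset. Consuming does not remove messages; they only
# disappear when the retention policy trims the log.
class Partition:
    def __init__(self):
        self.log = []

    def produce(self, msg):
        self.log.append(msg)

    def consume(self, offset):
        # Return everything from the consumer's own offset onward.
        return self.log[offset:]

p = Partition()
for m in ["signup", "click", "purchase"]:
    p.produce(m)

# Two independent services, each with its own offset, read the
# same messages without interfering with one another.
offsets = {"billing": 0, "analytics": 0}
billing = p.consume(offsets["billing"])
analytics = p.consume(offsets["analytics"])
assert billing == analytics == ["signup", "click", "purchase"]
```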
6. Key decisions when building a cluster
● Which instance type to use?
● How many brokers do we need?
● How to spread brokers between availability zones (AZs)?
● What are the right defaults for retention, number of partitions, replication factor, flush intervals, etc.?
● What is the right setting for each topic?
● How to split up the log directories?
● What size of Zookeeper ensemble?
7. Planning for fault tolerance
● Launch enough brokers to survive failures
● Spread brokers between AZs
● Set the replication factor to at least the number of AZs
● Guarantee that each partition is spread across all configured AZs
● Make sure that the Zookeeper instances are spread between AZs
● Add automation to add new brokers quickly
● Add alerts for failures
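The AZ-coverage guarantee above can be sketched as a check: given a (hypothetical) broker-to-AZ map, losing any single AZ must still leave a live replica of every partition.

```python
# Sketch: verify that every partition's replica set covers all AZs,
# so an entire-AZ outage still leaves a live replica everywhere.
# Broker IDs and AZ names below are hypothetical examples.
broker_az = {1: "us-east-1a", 2: "us-east-1b", 3: "us-east-1c",
             4: "us-east-1a", 5: "us-east-1b", 6: "us-east-1c"}

def survives_az_loss(assignment):
    """assignment: {partition: [broker ids of its replicas]}"""
    all_azs = set(broker_az.values())
    return all(set(broker_az[b] for b in replicas) == all_azs
               for replicas in assignment.values())

good = {0: [1, 2, 3], 1: [4, 5, 6]}
bad = {0: [1, 4, 2]}  # two replicas in us-east-1a, none in us-east-1c
assert survives_az_loss(good)
assert not survives_az_loss(bad)
```

A check like this can run after every reassignment to catch placements that silently concentrate replicas in one zone.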
8. Automate the fault tolerance
● Automatically calculate, per topic, a spread of brokers that guarantees at least one broker in each AZ, and evenly spread partitions between the brokers to balance load
● Generate a JSON file with the data above that is compatible with Kafka's reassignment format
● Script the steps above per topic, taking the topic and partitions as parameters
● Apply the same automation to newly created topics
● Automate broker addition with Chef and a prepared AWS AMI
● Enable automatic leader rebalancing
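The calculation and JSON generation above could look roughly like this sketch (broker IDs and AZ names are hypothetical): each partition gets one replica per AZ, the starting AZ rotates per partition to spread leaders, and the output follows the format that kafka-reassign-partitions.sh accepts.

```python
import json
from itertools import cycle

# Hypothetical layout: two brokers in each of three AZs.
brokers_by_az = {
    "us-east-1a": [1, 4],
    "us-east-1b": [2, 5],
    "us-east-1c": [3, 6],
}

def plan(topic, num_partitions):
    """Build a reassignment plan: one replica per AZ per partition,
    rotating the leading AZ and round-robining brokers within each
    AZ to balance load."""
    azs = sorted(brokers_by_az)
    pools = {az: cycle(brokers_by_az[az]) for az in azs}
    partitions = []
    for p in range(num_partitions):
        # Rotate the AZ order so partition leaders (first replica)
        # spread across zones.
        order = azs[p % len(azs):] + azs[:p % len(azs)]
        replicas = [next(pools[az]) for az in order]
        partitions.append({"topic": topic, "partition": p,
                           "replicas": replicas})
    # Same shape as the JSON kafka-reassign-partitions.sh consumes.
    return {"version": 1, "partitions": partitions}

print(json.dumps(plan("events", 3), indent=2))
```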
9. DEMO
● Show basic commands / scripts / Kafka usage
● Show the internal scripts that automate splitting a topic's brokers between AZs
● Show how to reassign partitions while keeping fault tolerance
● Show how to change retention
● Show how to run a console consumer for testing
● Show the AppsFlyer cluster / Kafka WebView / metrics / dashboards
10. Collecting metrics, building dashboards and alerts
● Metrics are sent to statsd and Graphite via the Airbnb reporter
● Additional application metrics are sent by an internal service that measures lag
● We create a dashboard with all the relevant metrics
● Alerts are set up on the relevant metrics
● A health check is set for each broker
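Emitting an application metric such as consumer lag to statsd is a one-line UDP datagram; a minimal sketch, assuming a hypothetical metric name and a local statsd on its default port 8125:

```python
import socket

# Sketch: send a gauge to statsd over UDP, the way a small internal
# lag-measuring service might. The metric name is hypothetical;
# statsd's plain-text line format is "name:value|type" ("g" = gauge).
def send_gauge(name, value, host="127.0.0.1", port=8125):
    line = f"{name}:{value}|g"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode(), (host, port))
    return line  # returned so the wire format is easy to inspect

assert send_gauge("kafka.consumer.lag.billing", 42) == \
    "kafka.consumer.lag.billing:42|g"
```

Because statsd is fire-and-forget UDP, emitting metrics like this adds negligible overhead to the service being measured.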
11. Testing the cluster
● Build a dashboard to see the effects of the tests
● Stop one or two brokers
● Kill an entire AZ
● Stop one Zookeeper node
● Reassign partitions at runtime
● Change retention at runtime
● Generate additional load and check performance
● Try combinations of all of the above