What’s up With Availability in Kafka? With Justine Olshan | Current 2022
How do we define and measure availability in a distributed system? A great thing about distributed systems is that they are built to tolerate failures in a way that limits downtime to users. However, this means that availability is a bit more complicated than "the system is up" or "the system is down."
Even if the system is built to tolerate failures, we may see individual components lose availability due to:
* cloud provider outages
* high latencies
* load balancer and/or routing issues
* storage failures
* hardware issues
Using Apache Kafka and Confluent Cloud as a case study, we will dig deeper into how to define good SLOs and SLAs for distributed systems. From there we will discuss ways to improve availability and the changes we made to Confluent Cloud to improve on Kafka's availability story.
2. Imagine this scenario….
I wasn’t able to talk to Apache Kafka® for 30 minutes!!
What do you mean? The servers were all up and running.
Well I know that my application was down! So something was wrong!
3. How do we define expectations?
● Service Level Indicator (SLI)
○ A measurement on a service
● Service Level Objective (SLO)
○ A goal for how we want our service to behave
● Service Level Agreement (SLA)
○ An agreement about expectations for the service
SRE fundamentals 2021: SLIs vs SLAs vs SLOs
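These definitions become concrete once an SLO is turned into an error budget. A minimal sketch (the 99.95% SLO and 30-day window are illustrative figures, not Confluent's actual SLA):

```python
# Sketch: turning an availability SLO into a monthly error budget.
# The 99.95% figure below is illustrative, not a real SLA value.

def error_budget_minutes(slo_percent: float, window_minutes: float) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return window_minutes * (1 - slo_percent / 100)

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200

budget = error_budget_minutes(99.95, MINUTES_PER_30_DAYS)
print(f"{budget:.1f} minutes of downtime allowed per 30 days")  # 21.6
```

The SLI is what you measure, the SLO is the target you compare it against, and the SLA is what you promise externally; the error budget is simply the SLO's slack expressed as time.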
6. Comparing Shutdowns…
Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper
5 Common Pitfalls When Using Apache Kafka
7. Gaps in Kafka’s Availability Story
● External network connectivity issues
○ Load balancers failing
○ Cloud provider outage
● Storage stuck on leader
● Intermittent issues
● High latency
8. What is up with Kafka?
● Metrics that truly measure availability
○ Can users interact with their data?
● Can we produce/consume? With a cluster or an individual partition?
● Can connections be made?
● Can we replicate?
● From there: define SLI, SLO, SLA
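The questions above suggest a probe-based SLI: synthetic clients repeatedly attempt a produce/consume round trip, and availability is the fraction that succeed. A minimal sketch (the probe mechanism itself is assumed, not shown):

```python
# Sketch: an availability SLI computed from synthetic probe results.
# Each entry records whether one produce/consume round trip succeeded.

def availability_sli(probe_results: list[bool]) -> float:
    """Fraction of successful probes; 1.0 means fully available."""
    if not probe_results:
        raise ValueError("no probes recorded")
    return sum(probe_results) / len(probe_results)

# 98 successes and 2 failures over the measurement window -> 0.98
probes = [True] * 98 + [False] * 2
print(f"SLI: {availability_sli(probes):.2%}")  # SLI: 98.00%
```

The same calculation can be scoped per cluster or per partition, which is exactly the distinction the bullet points draw.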
9. How can we mitigate unavailability?
● Detect misbehaving brokers and take action!
○ Transfer leadership – bin/kafka-reassign-partitions.sh
○ Restart or replace
[Diagram: partition leadership moves from the misbehaving broker to a healthy follower, which becomes the new leader]
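The leadership transfer above uses Kafka's partition reassignment tool, which consumes a JSON file describing the desired replica placement. A minimal example (topic name and broker IDs are illustrative):

```json
{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [2, 3, 4] }
  ]
}
```

This file is applied with `bin/kafka-reassign-partitions.sh --bootstrap-server <broker> --reassignment-json-file reassign.json --execute`. Listing the preferred new leader first in `replicas` lets a subsequent preferred-leader election move leadership away from the misbehaving broker.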
10. Confluent has cool tools in cloud!
● Broker Leadership Priority APIs
● Automatic External Network Mitigation
● Automatic Stuck Storage Mitigation
Note: Confluent Cloud is the only place to take advantage of all these availability features!
12. Automatic External Network Mitigation
Symptoms:
● External (user) connections and traffic lost
● Internal (replication, ZooKeeper) connections and traffic remain
Mitigation:
● Monitor external traffic and use explicit pings to detect connectivity loss
● Automatically demote when external traffic lost
● Automatically promote when external traffic returns
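The demote/promote logic above can be sketched as a small state machine (illustrative only, not Confluent's implementation): a broker loses leadership eligibility after N consecutive failed external health checks and regains it after N consecutive successes, so a single flaky probe does not trigger churn.

```python
# Sketch of demote-on-loss / promote-on-recovery for external
# connectivity. Thresholds and probe mechanics are assumptions.

class ExternalConnectivityMonitor:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0        # length of the current run of same results
        self.last_ok = True
        self.demoted = False

    def record_check(self, ok: bool) -> bool:
        """Record one probe result; return True if the broker may hold leadership."""
        if ok == self.last_ok:
            self.streak += 1
        else:
            self.last_ok = ok
            self.streak = 1
        if not ok and self.streak >= self.threshold:
            self.demoted = True    # external traffic lost: demote
        if ok and self.streak >= self.threshold:
            self.demoted = False   # external traffic returned: promote
        return not self.demoted

monitor = ExternalConnectivityMonitor(threshold=3)
for result in [True, False, False, False, True, True, True]:
    eligible = monitor.record_check(result)
print("eligible for leadership:", eligible)  # True after recovery
```

Requiring a streak in both directions is the key design choice: it trades a little detection latency for stability, avoiding leadership flapping on intermittent issues.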
13. Automatic Stuck Storage Mitigation
Symptoms:
● Storage threads on a leader get stuck, leader can’t replicate
● Followers fall out of ISR
● Leader crashes resulting in offline partitions
Mitigation:
● Detect when threads get stuck
● Automatically restart the broker, leaders move
● Leadership won’t return unless the broker comes up healthy
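The detection step can be sketched as a heartbeat watchdog (illustrative, not Kafka's actual mechanism): each storage thread records a timestamp whenever it makes progress, and a watchdog flags any thread whose last heartbeat is older than a deadline.

```python
# Sketch: stuck-thread detection via progress heartbeats.
# Thread names and the 30-second deadline are assumptions.

def find_stuck_threads(heartbeats: dict[str, float],
                       now: float, deadline_s: float) -> list[str]:
    """Return names of threads with no progress within deadline_s seconds."""
    return [name for name, last_progress in heartbeats.items()
            if now - last_progress > deadline_s]

now = 1_000.0
heartbeats = {"log-flusher": now - 2.0, "log-cleaner": now - 45.0}
stuck = find_stuck_threads(heartbeats, now, deadline_s=30.0)
print("stuck threads:", stuck)  # ['log-cleaner']
```

Once a stuck thread is flagged, the mitigation above takes over: restart the broker so leadership moves, and withhold leadership until the broker comes back healthy.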
14. Reimagine this scenario….
Our monitoring noticed external connectivity loss to part of Kafka. We limited the unavailability by moving your data to an available part of the system. Hopefully this caused minimal downtime for your clients.
Got it. Thanks for keeping my cluster available and meeting the SLA!
15. Get started with Confluent Cloud to take advantage of the availability features mentioned today!
https://developer.confluent.io