"Every Kafka admin’s worst nightmare is to be woken up by their client application’s teams for increased latencies. One of the main culprits? Degraded infrastructure.
Degraded infrastructure refers to the partial or full unavailability of broker components, such as storage volumes or the network. Degradation of storage, in particular, can lead to slower reads and writes on the broker, negatively impact performance, and quickly devolve into unavailability.
In this talk, we will discuss how we have tackled this problem head-on with a fully automated degraded storage detection and remediation system. We’ll highlight the importance of monitoring storage performance and take a deep dive into how we formulated the detection algorithm, created and fine-tuned our monitors, and tested this pipeline end to end. We will also discuss the tools and processes developed to mitigate storage degradation. Finally, we’ll share our insights on how this streamlined detection and mitigation system improved the performance and availability of clusters in Confluent Cloud."
Don’t Let Degradation Bring You Down: Automatically Detect & Remediate Degraded Storage in Kafka
1. Rittika Adhikari
Senior Software Engineer
Don’t Let Degradation Bring You Down!
Automatically Detect and Remediate Degraded Storage in Apache Kafka®
Twitter | @tikachu99
2. Agenda
What is Degraded Storage?
Persistent vs. Transient Degraded Storage
Impact
Mitigating Degraded Storage
How Does Kafka Write to the Log?
3. How does Kafka write to the Log?
[Diagram: Cluster with Broker 1, Broker 2, …]
4. How does Kafka write to the Log?
[Diagram: Producers; Cluster with Broker 1, Broker 2, …]
5. How does Kafka write to the Log?
[Diagram: Producers; Cluster with Broker 1, Broker 2, …; Consumers]
6. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …]
When you write to a topic, you’re really writing to a set of topic-partitions…
7. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …; segments: Segment 00, Segment 03, Segment 06, Segment 09, …]
Topic-partitions are broken down into segments.
8. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …; segments: Segment 00, Segment 03, Segment 06, Segment 09, …; segment files: 00.log, 00.index, 00.timeindex]
The log file stores messages up to a specific offset.
9. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …; segments: Segment 00, Segment 03, Segment 06, Segment 09, …; segment files: 00.log, 00.index, 00.timeindex]
The index file stores a mapping of message offset to byte position in the log file.
10. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …; segments: Segment 00, Segment 03, Segment 06, Segment 09, …; segment files: 00.log, 00.index, 00.timeindex]
The time index file stores a mapping of message timestamp to offset.
11. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …; segments: Segment 00 (r0, r1, r2), Segment 03 (r3, r4, r5), Segment 06 (r6, r7, r8), Segment 09 (r9, r10), …; labels: non-active segment, active segment]
12. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …; segments: Segment 00 (r0, r1, r2), Segment 03 (r3, r4, r5), Segment 06 (r6, r7, r8), Segment 09 (r9, r10), …; labels: non-active segment, active segment]
13. How does Kafka write to the Log?
[Diagram: Broker with TP-1, TP-2, …; segments: Segment 00 (r0, r1, r2), Segment 03 (r3, r4, r5), Segment 06 (r6, r7, r8), Segment 09 (r9, r10), …; label: produce message to TP-1]
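The segment layout on the slides above can be sketched in a few lines of Python. This is an illustration, not Kafka's implementation: the function names and the in-memory index entries are hypothetical. It relies on two documented behaviors: segment files are named after the zero-padded base offset of their first record, and the offset index is sparse, so a lookup finds the closest preceding entry and then scans forward in the log file.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def segment_filename(base_offset: int, suffix: str = "log") -> str:
    """Kafka names segment files by a 20-digit, zero-padded base offset."""
    return f"{base_offset:020d}.{suffix}"


def find_segment(base_offsets: Sequence[int], target_offset: int) -> int:
    """Return the base offset of the segment that would hold target_offset.

    base_offsets must be sorted ascending, e.g. [0, 3, 6, 9] as on the slides.
    """
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first segment")
    return base_offsets[i]


def lookup_position(
    index_entries: Sequence[Tuple[int, int]],
    base_offset: int,
    target_offset: int,
) -> int:
    """Return the byte position to start scanning from in the .log file.

    index_entries is a hypothetical in-memory view of a sparse .index file:
    sorted (relative_offset, byte_position) pairs. We pick the last entry
    at or before the target; with no such entry, scanning starts at byte 0.
    """
    relative = target_offset - base_offset
    keys = [rel for rel, _ in index_entries]
    i = bisect_right(keys, relative) - 1
    return index_entries[i][1] if i >= 0 else 0


if __name__ == "__main__":
    bases = [0, 3, 6, 9]
    base = find_segment(bases, 7)           # record r7 lives in segment 06
    print(segment_filename(base))           # name of the .log file to open
    print(lookup_position([(0, 0), (2, 180)], base, 7))
```

Because the last segment is the only one still being appended to, a slow disk shows up first on writes to the active segment, while reads of older segments are often served from the filesystem cache.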
14. Measuring Kafka Log Performance
E2E Latency: The amount of time it takes e2e to produce and consume a record.
Storage Latency: The amount of time it takes to write to disk.
15. Measuring Kafka Log Performance
E2E Latency: The amount of time it takes e2e to produce and consume a record.
Storage Latency: The amount of time it takes to write to disk.
[Chart: E2E Latency (99%)]
16. Measuring Kafka Log Performance
E2E Latency: The amount of time it takes e2e to produce and consume a record.
Storage Latency: The amount of time it takes to write to disk.
17. Measuring Kafka Log Performance
E2E Latency: The amount of time it takes e2e to produce and consume a record.
[Chart: Storage Latency (99%)]
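The charts above track the 99th percentile rather than the average. A minimal sketch of why, using a nearest-rank percentile over made-up latency samples (the numbers and the function name are illustrative assumptions, not Confluent's monitoring code):

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical write latencies in ms: mostly fast, with two slow outliers.
    latencies_ms = [5.0] * 98 + [50.0, 400.0]
    print("mean :", sum(latencies_ms) / len(latencies_ms))
    print("p50  :", percentile(latencies_ms, 50))
    print("p99  :", percentile(latencies_ms, 99))
    print("p99.9:", percentile(latencies_ms, 99.9))
```

The mean and median barely move when a handful of writes are slow, while the high percentiles expose them, which is why tail percentiles are the natural signal for storage degradation.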
18. Agenda
What is Degraded Storage?
Persistent vs. Transient Degraded Storage
Impact
Mitigating Degraded Storage
How Does Kafka Write to the Log?
21. Symptoms of Degraded Storage
Partial / Fully Unavailable Disk: Disk provided by cloud provider is “spotty”
Slower Reads / Writes: Broker reads / writes (when not from filesystem cache) are impacted
Impacted Performance: Higher e2e and storage latency
22. Symptoms of Degraded Storage
Partial / Fully Unavailable Disk: Disk provided by cloud provider is “spotty”
Slower Reads / Writes: Broker reads / writes (when not from filesystem cache) are impacted
Impacted Performance: Higher e2e and storage latency
23. Symptoms of Degraded Storage
Partial / Fully Unavailable Disk: Disk provided by cloud provider is “spotty”
Slower Reads / Writes: Broker reads / writes (when not from filesystem cache) are impacted
Impacted Performance: Higher e2e and storage latencies
24. Symptoms of Degraded Storage
Partial / Fully Unavailable Disk: Disk provided by cloud provider is “spotty”
Slower Reads / Writes: Broker reads / writes (when not from FS cache) are impacted
Impacted Performance: Higher e2e and storage latencies
[Chart: E2E Latency (99%)]
25. Symptoms of Degraded Storage
Partial / Fully Unavailable Disk: Disk provided by cloud provider is “spotty”
Slower Reads / Writes: Broker reads / writes (when not from FS cache) are impacted
Impacted Performance: Higher e2e and storage latencies
[Chart: Storage Latency (99.9%)]
26. Frequency of Degraded Storage
How many incidents of degraded storage do you think Confluent Cloud has experienced in the last 30 days?
27. Frequency of Degraded Storage
500 incidents in the past 30 days
29. Agenda
What is Degraded Storage?
Persistent vs. Transient Degraded Storage
Impact
Mitigating Degraded Storage
How Does Kafka Write to the Log?
31. Persistent Degraded Storage
➔ long-term instances of degraded infrastructure
➔ e2e latencies are not consistently bad
32. Persistent Degraded Storage
➔ long-term instances of degraded infrastructure
➔ e2e latencies are not consistently bad
➔ storage latencies may temporarily recover, but never fully return to a good state
35. Transient Degraded Storage
➔ short-term instances of degraded infrastructure
➔ may recover within a short period of time due to transient factors
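One way to make the persistent/transient distinction concrete is to look at whether storage latency has stayed healthy for a sustained stretch after a breach. The sketch below is an assumption for illustration only: the threshold, the recovery window, and the function name are hypothetical, and the talk does not describe the actual detection algorithm at this level of detail.

```python
from typing import Sequence


def classify_degradation(
    p99_latencies_ms: Sequence[float],
    threshold_ms: float = 100.0,   # hypothetical "bad latency" threshold
    recovery_window: int = 5,      # hypothetical: consecutive good samples that count as recovered
) -> str:
    """Classify a window of per-interval p99 storage latencies.

    Returns "healthy" if the threshold was never breached, "transient" if it
    was breached but the most recent samples show a sustained recovery, and
    "persistent" otherwise (including brief recoveries that did not last).
    """
    breached = [value > threshold_ms for value in p99_latencies_ms]
    if not any(breached):
        return "healthy"

    # Count how many of the most recent samples are below the threshold.
    trailing_ok = 0
    for is_breach in reversed(breached):
        if is_breach:
            break
        trailing_ok += 1

    return "transient" if trailing_ok >= recovery_window else "persistent"


if __name__ == "__main__":
    print(classify_degradation([10, 300, 400, 10, 10, 10, 10, 10]))
    print(classify_degradation([300, 20, 20, 350, 30, 400]))
```

The second series temporarily dips below the threshold but never stays there, which matches the slide's description of persistent degradation.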
37. Agenda
What is Degraded Storage?
Persistent vs. Transient Degraded Storage
Impact
Mitigating Degraded Storage
How Does Kafka Write to the Log?
38. Mitigating Degraded Storage
On-Prem: Identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh
Confluent Cloud: Automatically detect problematic brokers with bad disks. Mark the broker as “degraded” and trigger mitigation. (Maybe) Replace disk.
39. Mitigating Degraded Storage
On-Prem: Identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh
[Diagram: Broker 1 and Broker 2 each host replicas of TP-1 and TP-2; each partition has one leader (L) and one follower (F)]
40. Mitigating Degraded Storage
On-Prem: Identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh
[Diagram: Broker 1 and Broker 2 each host replicas of TP-1 and TP-2; each partition has one leader (L) and one follower (F)]
41. Mitigating Degraded Storage
On-Prem: Identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh
[Diagram: Broker 1 and Broker 2 before and after reassignment; afterwards one broker holds TP-1 (L) and TP-2 (L), the other TP-1 (F) and TP-2 (F)]
42. Mitigating Degraded Storage
On-Prem: Identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh
[Diagram: Broker 1 and Broker 2 before and after reassignment; afterwards one broker holds TP-1 (L) and TP-2 (L), the other TP-1 (F) and TP-2 (F); label: New leader elected]
43. Mitigating Degraded Storage
On-Prem: Identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh
[Diagram: Broker 1 and Broker 2 before and after reassignment; afterwards one broker holds TP-1 (L) and TP-2 (L), the other TP-1 (F) and TP-2 (F); labels: New leader elected, Old leader made follower]
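For the on-prem path, one common approach is to reorder each partition's replica list so the broker with the bad disk is no longer the preferred (first) replica, then trigger a preferred leader election. The sketch below only builds the reassignment JSON that kafka-reassign-partitions.sh accepts; the topic name, broker ids, and function name are hypothetical, and the commands in the comments should be checked against the Kafka version in use.

```python
import json
from typing import Dict, List, Tuple


def demote_broker(
    assignments: Dict[Tuple[str, int], List[int]],
    bad_broker: int,
) -> dict:
    """Build a reassignment plan that moves bad_broker to the end of the
    replica list for every partition where it is currently listed first.

    assignments maps (topic, partition) to the ordered replica list; the
    first replica in that list is the preferred leader.
    """
    partitions = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if len(replicas) > 1 and replicas[0] == bad_broker:
            reordered = [r for r in replicas if r != bad_broker] + [bad_broker]
            partitions.append(
                {"topic": topic, "partition": partition, "replicas": reordered}
            )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical layout mirroring the slides: broker 1 leads partition 0,
    # broker 2 leads partition 1, and broker 1 has the bad disk.
    current = {("tp", 0): [1, 2], ("tp", 1): [2, 1]}
    plan = demote_broker(current, bad_broker=1)
    print(json.dumps(plan, indent=2))
    # Save the output as plan.json, then apply it and elect preferred leaders:
    #   kafka-reassign-partitions.sh --bootstrap-server <host:port> \
    #       --reassignment-json-file plan.json --execute
    #   kafka-leader-election.sh --bootstrap-server <host:port> \
    #       --election-type PREFERRED --all-topic-partitions
```

The replica set itself is unchanged, so no data moves; only the preference order changes, and the old leader continues as a follower.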
44. Mitigating Degraded Storage
On-Prem: Identify brokers with bad disks and reassign leadership!
Confluent Cloud: Automatically detect problematic brokers with bad disks. Mark the broker as “degraded” and trigger mitigation. (Maybe) Replace disk.
46. Automated Detection & Remediation
Monitors for unhealthy brokers
Marks unhealthy broker as “degraded”
Automation
47. Automated Detection & Remediation
Monitors for unhealthy brokers
Marks unhealthy broker as “degraded”
Unhealthy broker’s disk recovers
Automation
48. Automated Detection & Remediation
Monitors for unhealthy brokers
Marks unhealthy broker as “degraded”
Unhealthy broker’s disk recovers
Marks recovered broker as “healthy”
Automation
Automation
49. Automated Detection & Remediation
Monitors for unhealthy brokers
Marks unhealthy broker as “degraded”
Unhealthy broker’s disk recovers
Marks recovered broker as “healthy”
Automation
Automation
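The loop on these slides (monitor, mark degraded, wait for the disk to recover, mark healthy) can be sketched as a small state machine. Everything here is an assumption for illustration: the class name, the threshold, and the consecutive-sample counts are hypothetical and are not the monitors used in Confluent Cloud. Requiring several consecutive samples before changing state is one simple way to avoid flapping on a single noisy reading.

```python
class BrokerHealthTracker:
    """Tracks one broker's state from a stream of p99 storage latencies."""

    def __init__(
        self,
        threshold_ms: float = 100.0,   # hypothetical latency threshold
        breaches_to_degrade: int = 3,  # consecutive bad samples before "degraded"
        oks_to_recover: int = 5,       # consecutive good samples before "healthy"
    ) -> None:
        self.threshold_ms = threshold_ms
        self.breaches_to_degrade = breaches_to_degrade
        self.oks_to_recover = oks_to_recover
        self.state = "healthy"
        self._streak = 0  # consecutive samples arguing for a state change

    def observe(self, p99_ms: float) -> str:
        """Feed one latency sample and return the resulting state."""
        breach = p99_ms > self.threshold_ms
        if self.state == "healthy":
            self._streak = self._streak + 1 if breach else 0
            if self._streak >= self.breaches_to_degrade:
                self.state = "degraded"   # this is where mitigation would be triggered
                self._streak = 0
        else:
            self._streak = self._streak + 1 if not breach else 0
            if self._streak >= self.oks_to_recover:
                self.state = "healthy"    # disk recovered; broker can take leadership again
                self._streak = 0
        return self.state


if __name__ == "__main__":
    tracker = BrokerHealthTracker(breaches_to_degrade=2, oks_to_recover=3)
    for sample in [150, 20, 150, 150, 20, 20, 150, 20, 20, 20]:
        print(sample, tracker.observe(sample))
```

In a real system the state transitions would call into the mitigation tooling rather than just updating a field.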
60. Agenda
What is Degraded Storage?
Persistent vs. Transient Degraded Storage
Impact
Mitigating Degraded Storage
How Does Kafka Write to the Log?
73. The Battle Against Degraded Infra…
😳 🥸 😌
[Charts: Storage Latency (99.9%), Storage Latency (99%)]
74. The Battle Against Degraded Infra…
🤑 😳 🥸 😌
[Charts: Storage Latency (99.9%), Storage Latency (99%)]
75. In Summary…
➔ there are two flavors of degraded storage: persistent and transient
76. In Summary…
➔ there are two flavors of degraded storage: persistent and transient
➔ degraded storage often goes unnoticed and affects tail latencies
77. In Summary…
➔ there are two flavors of degraded storage: persistent and transient
➔ degraded storage often goes unnoticed and affects tail latencies
➔ don’t let degradation bring you down