Don’t Let Degradation Bring You Down!
Automatically Detect and Remediate Degraded Storage in Apache Kafka®
Rittika Adhikari, Senior Software Engineer
Twitter | @tikachu99
Agenda
➔ How Does Kafka Write to the Log?
➔ What is Degraded Storage?
➔ Persistent vs. Transient Degraded Storage
➔ Mitigating Degraded Storage
➔ Impact
How does Kafka write to the Log?
[Diagram: producers write to a cluster of brokers (Broker 1, Broker 2, …); consumers read from the cluster.]
When you write to a topic, you’re really writing to a set of topic-partitions (TP-1, TP-2, …) on a broker.
Topic-partitions are broken down into segments (in the diagram: Segment 00, Segment 03, Segment 06, Segment 09, …).
Each segment is backed by three files, for example 00.log, 00.index, and 00.timeindex.
The log file stores messages up to a specific offset.
The index file stores a mapping of message offset to byte position in the log file.
The time index file stores a mapping of message timestamp to offset.
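To make the role of the offset index concrete, here is a minimal sketch of a lookup. It assumes a simplified in-memory model, not Kafka's on-disk format: LOG, INDEX, find_position, and read_record are illustrative names, and the byte positions are made up. The index is sparse (only some offsets are indexed), so a read jumps to the closest indexed entry at or before the target and scans forward from there.

```python
from bisect import bisect_right

# Hypothetical, simplified model of one segment (not Kafka's on-disk format).
# Each log record is (offset, byte_position, payload).
LOG = [
    (0, 0, b"r0"),
    (1, 40, b"r1"),
    (2, 95, b"r2"),
    (3, 130, b"r3"),
    (4, 190, b"r4"),
]
# Sparse index of (offset, byte_position) pairs, sorted by offset.
INDEX = [(0, 0), (3, 130)]


def find_position(index, target_offset):
    """Return the byte position of the closest indexed entry at or before target_offset."""
    offsets = [offset for offset, _ in index]
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before this segment")
    return index[i][1]


def read_record(log, index, target_offset):
    """Jump to the indexed position, then scan forward to the exact offset."""
    start = find_position(index, target_offset)
    for offset, position, payload in log:
        if position >= start and offset == target_offset:
            return payload
    raise KeyError(f"offset {target_offset} not found")
```

In this toy model, looking up offset 4 starts the scan at byte position 130 instead of at the beginning of the log.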
[Diagram: records r0 to r2 sit in Segment 00, r3 to r5 in Segment 03, r6 to r8 in Segment 06, and r9 to r10 in Segment 09. The earlier segments are non-active segments; the last one is the active segment.]
Producing a message to TP-1 appends it to the active segment.
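The segment layout in the diagram can be reproduced with a small toy model. This is a sketch under stated assumptions: PartitionLog, Segment, and max_records are hypothetical names, and rolling a new segment after a fixed record count is a simplification (Kafka itself rolls segments based on size and time settings). Each segment is identified by the offset of its first record, and appends always go to the last (active) segment.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    base_offset: int  # offset of the first record in this segment
    records: list = field(default_factory=list)


class PartitionLog:
    """Toy model of one topic-partition that rolls a segment every max_records appends."""

    def __init__(self, max_records=3):
        self.max_records = max_records  # hypothetical roll threshold, for illustration
        self.segments = [Segment(base_offset=0)]
        self.next_offset = 0

    @property
    def active_segment(self):
        # Only the newest segment accepts writes.
        return self.segments[-1]

    def append(self, payload):
        # Roll a new segment when the active one is full.
        if len(self.active_segment.records) >= self.max_records:
            self.segments.append(Segment(base_offset=self.next_offset))
        self.active_segment.records.append((self.next_offset, payload))
        self.next_offset += 1
        return self.next_offset - 1


log = PartitionLog()
for i in range(11):
    log.append(f"r{i}")
```

After appending r0 through r10, the model holds segments with base offsets 0, 3, 6, and 9, and the segment starting at 9 is the active one.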
Measuring Kafka Log Performance
➔ E2E Latency: the amount of time it takes, end to end (e2e), to produce and consume a record.
➔ Storage Latency: the amount of time it takes to write to disk.
[Charts: E2E Latency (99%) and Storage Latency (99%)]
What is Degraded Storage?
Symptoms of Degraded Storage
➔ Partial / Fully Unavailable Disk: the disk provided by the cloud provider is “spotty”
➔ Slower Reads / Writes: broker reads / writes (when not served from the filesystem cache) are impacted
➔ Impacted Performance: higher e2e and storage latencies
[Charts: E2E Latency (99%) and Storage Latency (99.9%)]
Frequency of Degraded Storage
How many incidents of degraded storage do you think Confluent Cloud has experienced in the last 30 days?
500 incidents in the past 30 days, which extrapolates to ~6000 incidents per year (500 × 12).
Persistent Degraded Storage
➔ long-term instances of degraded infrastructure
➔ e2e latencies are not consistently bad
➔ storage latencies may temporarily recover, but never fully return to a good state
[Charts: Storage Latency (99.9%) and E2E Latency (99%)]
Transient Degraded Storage
➔ short-term instances of degraded infrastructure
➔ may recover within a short period of time due to transient factors
[Charts: Storage Latency (99.9%) and E2E Latency (99%)]
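One rough way to tell the two flavors apart is to count how many latency samples in an observation window exceed a threshold. The sketch below is only an illustration: classify_degradation, threshold_ms, and max_transient_samples are hypothetical names and values, not the criteria used in the talk. A handful of bad samples is treated as transient; degradation that keeps coming back across the window, even with temporary recoveries, is treated as persistent.

```python
def classify_degradation(latencies_ms, threshold_ms=100.0, max_transient_samples=5):
    """Classify a window of storage-latency samples.

    Returns "healthy", "transient", or "persistent". The threshold and the
    sample budget are hypothetical tuning knobs chosen for illustration.
    """
    degraded_samples = sum(1 for latency in latencies_ms if latency > threshold_ms)
    if degraded_samples == 0:
        return "healthy"
    if degraded_samples <= max_transient_samples:
        return "transient"
    return "persistent"


# A short spike that recovers on its own.
spike = [10.0] * 20 + [500.0] * 3 + [10.0] * 20
# Repeated bad stretches with brief recoveries in between.
recurring = ([500.0] * 4 + [10.0] * 2) * 5
```

Here spike is classified as transient, while recurring is classified as persistent because the brief recoveries never last.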
Mitigating Degraded Storage
On-Prem:
➔ Identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh
Confluent Cloud:
➔ Automatically detect problematic brokers with bad disks.
➔ Mark the broker as “degraded” and trigger mitigation.
➔ (Maybe) Replace disk.
[Diagram, On-Prem leadership reassignment: initially Broker 1 holds TP-1 (L) and TP-2 (F), and Broker 2 holds TP-1 (F) and TP-2 (L). After leadership is moved off the broker with the bad disk, a new leader is elected and the old leader is made a follower, so one broker leads both TP-1 and TP-2 and the other follows both.]
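For the on-prem path, one possible approach is to move the bad broker to the end of each replica list (so it is no longer the preferred leader) and then trigger a preferred leader election. The sketch below only builds the reassignment JSON; demote_broker, the topic name orders, and the broker ids are hypothetical, and the shell commands in the comments assume the flags available in recent Apache Kafka releases.

```python
import json


def demote_broker(assignments, bad_broker):
    """Build a reassignment plan that moves bad_broker to the end of each replica
    list it currently heads, so it stops being the preferred leader."""
    plan = {"version": 1, "partitions": []}
    for (topic, partition), replicas in sorted(assignments.items()):
        if len(replicas) > 1 and replicas[0] == bad_broker:
            reordered = [b for b in replicas if b != bad_broker] + [bad_broker]
            plan["partitions"].append(
                {"topic": topic, "partition": partition, "replicas": reordered}
            )
    return plan


# Hypothetical current assignment: (topic, partition) -> replicas, preferred leader first.
current = {("orders", 0): [1, 2], ("orders", 1): [2, 1]}
plan = demote_broker(current, bad_broker=1)
print(json.dumps(plan, indent=2))

# Save the printed JSON as reassign.json, then run (assumed invocation):
#   kafka-reassign-partitions.sh --bootstrap-server <host:port> \
#       --reassignment-json-file reassign.json --execute
#   kafka-leader-election.sh --bootstrap-server <host:port> \
#       --election-type preferred --all-topic-partitions
```

Only partitions currently led by the bad broker appear in the plan, which keeps the reassignment small.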
Automated Detection & Remediation
➔ Monitors for unhealthy brokers
➔ Automation marks unhealthy broker as “degraded”
➔ Unhealthy broker’s disk recovers
➔ Automation marks recovered broker as “healthy”
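The detect, mark, recover, and unmark loop above can be sketched as a small state machine. This is an illustration only, not Confluent's implementation: BrokerHealthMonitor, threshold_ms, bad_checks, and good_checks are hypothetical. It requires several consecutive bad samples before marking a broker degraded and several consecutive good samples before marking it healthy again, so a single outlier does not flip the state.

```python
class BrokerHealthMonitor:
    """Tracks one broker's state from storage-latency samples, with hysteresis."""

    def __init__(self, threshold_ms=100.0, bad_checks=3, good_checks=5):
        # All three values are hypothetical, chosen only for illustration.
        self.threshold_ms = threshold_ms
        self.bad_checks = bad_checks
        self.good_checks = good_checks
        self.state = "healthy"
        self._streak = 0  # consecutive samples that contradict the current state

    def observe(self, latency_ms):
        """Record one latency sample and return the (possibly updated) state."""
        is_bad = latency_ms > self.threshold_ms
        contradicts = is_bad if self.state == "healthy" else not is_bad
        self._streak = self._streak + 1 if contradicts else 0
        needed = self.bad_checks if self.state == "healthy" else self.good_checks
        if self._streak >= needed:
            self.state = "degraded" if self.state == "healthy" else "healthy"
            self._streak = 0
        return self.state


monitor = BrokerHealthMonitor()
while_degrading = [monitor.observe(ms) for ms in [10.0, 500.0, 500.0, 500.0]]
while_recovering = [monitor.observe(10.0) for _ in range(5)]
```

Requiring more good samples than bad ones before recovery is a deliberate bias toward keeping a flaky broker demoted a little longer.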
Changing Broker State
➔ Broker 0 is the leader (L) for TP-1 and TP-2; their ISRs are 0, 2, 4 and 0, 1, 3
➔ Mark Broker 0 “DEGRADED”: Broker 0 becomes a follower (F) for TP-1 and TP-2 and drops out of the ISRs, which shrink to 2, 4 and 1, 3
➔ Mark Broker 0 “HEALTHY”: Broker 0 rejoins the ISRs as a follower, giving 2, 4, 0 and 1, 3, 0
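The state change can be traced with a toy model of the partition metadata. This is a sketch under assumptions: mark_degraded and mark_healthy are hypothetical helpers, the pairing of TP-1 with ISR 0, 2, 4 and TP-2 with ISR 0, 1, 3 is a guess from the slide layout, and picking the first remaining in-sync replica as the new leader is a simplification.

```python
def mark_degraded(partitions, broker):
    """Demote broker: drop it from each ISR and hand leadership to a remaining replica."""
    for state in partitions.values():
        state["isr"] = [b for b in state["isr"] if b != broker]
        if state["leader"] == broker and state["isr"]:
            state["leader"] = state["isr"][0]  # simplification: first in-sync replica


def mark_healthy(partitions, broker):
    """Readmit broker as a follower at the end of each ISR it replicates."""
    for state in partitions.values():
        if broker in state["replicas"] and broker not in state["isr"]:
            state["isr"].append(broker)


# Assumed starting point: Broker 0 leads both topic-partitions.
partitions = {
    "TP-1": {"replicas": [0, 2, 4], "isr": [0, 2, 4], "leader": 0},
    "TP-2": {"replicas": [0, 1, 3], "isr": [0, 1, 3], "leader": 0},
}

mark_degraded(partitions, 0)
after_degraded = {tp: (s["leader"], list(s["isr"])) for tp, s in partitions.items()}

mark_healthy(partitions, 0)
after_healthy = {tp: (s["leader"], list(s["isr"])) for tp, s in partitions.items()}
```

Note that marking the broker healthy does not hand leadership back; Broker 0 stays a follower, matching the slides.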
The Battle Against Degraded Infra…
[Charts: E2E Latency (99%), Storage Latency (99%), and Storage Latency (99.9%)]
Tail latencies (p99.9) matter!
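A small worked example shows why p99.9 can reveal what p99 hides. The data is made up: 0.5% of writes are very slow, which is too few to move the 99th percentile but enough to dominate the 99.9th. The percentile helper uses the nearest-rank method and is a hypothetical name; monitoring systems may compute percentiles differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical storage-latency samples (ms): 9,950 fast writes and 50 very slow ones.
latencies_ms = [2.0] * 9950 + [800.0] * 50

p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)
print(f"p99 = {p99} ms, p99.9 = {p999} ms")
```

With this data the p99 stays at the fast-write latency while the p99.9 jumps to the slow-write latency, so a dashboard that only tracks p99 would look healthy.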
In Summary…
➔ there are two flavors of degraded storage: persistent and transient
➔ degraded storage often goes unnoticed and affects tail latencies
➔ don’t let degradation bring you down
Questions?

 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Don’t Let Degradation Bring You Down: Automatically Detect & Remediate Degraded Storage in Kafka

  • 1. Rittika Adhikari Senior Software Engineer Don’t Let Degradation Bring You Down! Automatically Detect and Remediate Degraded Storage in Apache Kafka® Twitter | @tikachu99
  • 2. Agenda: (1) How Does Kafka Write to the Log? (2) What is Degraded Storage? (3) Persistent vs. Transient Degraded Storage (4) Mitigating Degraded Storage (5) Impact
  • 3. How does Kafka write to the Log? Cluster: Broker 1, Broker 2, …
  • 4. How does Kafka write to the Log? Producers. Cluster: Broker 1, Broker 2, …
  • 5. How does Kafka write to the Log? Producers. Cluster: Broker 1, Broker 2, … Consumers.
  • 6. How does Kafka write to the Log? Broker: TP-1, TP-2, … When you write to a topic, you’re really writing to a set of topic-partitions…
  • 7. How does Kafka write to the Log? Broker: TP-1, TP-2, … Segments: 00, 03, 06, 09, … Topic-partitions are broken down into segments.
  • 8. How does Kafka write to the Log? Broker: TP-1, TP-2, … Segments: 00, 03, 06, 09, … Files: 00.log, 00.index, 00.timeindex. The log file stores messages up to a specific offset.
  • 9. How does Kafka write to the Log? Broker: TP-1, TP-2, … Segments: 00, 03, 06, 09, … Files: 00.log, 00.index, 00.timeindex. The index file stores a mapping of message offset to its position in the log file.
  • 10. How does Kafka write to the Log? Broker: TP-1, TP-2, … Segments: 00, 03, 06, 09, … Files: 00.log, 00.index, 00.timeindex. The time index file stores a mapping of message timestamp to offset.
  • 11. How does Kafka write to the Log? Broker: TP-1, TP-2, … Segments: 00, 03, 06, 09, … Records: r0 to r10. Labels: non-active segment, active segment.
  • 12. How does Kafka write to the Log? Broker: TP-1, TP-2, … Segments: 00, 03, 06, 09, … Records: r0 to r10. Labels: non-active segment, active segment.
  • 13. How does Kafka write to the Log? Broker: TP-1, TP-2, … Segments: 00, 03, 06, 09, … Records: r0 to r10. Produce message to TP-1.
  • 14. Measuring Kafka Log Performance. E2E Latency: the amount of time it takes e2e to produce and consume a record. Storage Latency: the amount of time it takes to write to disk.
  • 15. Measuring Kafka Log Performance. E2E Latency: the amount of time it takes e2e to produce and consume a record. Storage Latency: the amount of time it takes to write to disk. Chart: E2E Latency (99%)
  • 16. Measuring Kafka Log Performance. E2E Latency: the amount of time it takes e2e to produce and consume a record. Storage Latency: the amount of time it takes to write to disk.
  • 17. Measuring Kafka Log Performance. E2E Latency: the amount of time it takes e2e to produce and consume a record. Chart: Storage Latency (99%)
  • 18. Agenda: (1) How Does Kafka Write to the Log? (2) What is Degraded Storage? (3) Persistent vs. Transient Degraded Storage (4) Mitigating Degraded Storage (5) Impact
  • 19. What is Degraded Storage?
  • 20. What is Degraded Storage?
  • 21. Symptoms of Degraded Storage. Partially / Fully Unavailable Disk: disk provided by cloud provider is “spotty”. Slower Reads / Writes: broker reads / writes (when not from filesystem cache) are impacted. Impacted Performance: higher e2e and storage latencies.
  • 22. Symptoms of Degraded Storage. Partially / Fully Unavailable Disk: disk provided by cloud provider is “spotty”. Slower Reads / Writes: broker reads / writes (when not from filesystem cache) are impacted. Impacted Performance: higher e2e and storage latencies.
  • 23. Symptoms of Degraded Storage. Partially / Fully Unavailable Disk: disk provided by cloud provider is “spotty”. Slower Reads / Writes: broker reads / writes (when not from filesystem cache) are impacted. Impacted Performance: higher e2e and storage latencies.
  • 24. Symptoms of Degraded Storage. Partially / Fully Unavailable Disk: disk provided by cloud provider is “spotty”. Slower Reads / Writes: broker reads / writes (when not from filesystem cache) are impacted. Impacted Performance: higher e2e and storage latencies. Chart: E2E Latency (99%)
  • 25. Symptoms of Degraded Storage. Partially / Fully Unavailable Disk: disk provided by cloud provider is “spotty”. Slower Reads / Writes: broker reads / writes (when not from filesystem cache) are impacted. Impacted Performance: higher e2e and storage latencies. Chart: Storage Latency (99.9%)
  • 26. Frequency of Degraded Storage. How many incidents of degraded storage do you think Confluent Cloud has experienced in the last 30 days?
  • 27. Frequency of Degraded Storage. 500 incidents in the past 30 days.
  • 28. Frequency of Degraded Storage. ~6000 incidents per year.
  • 29. Agenda: (1) How Does Kafka Write to the Log? (2) What is Degraded Storage? (3) Persistent vs. Transient Degraded Storage (4) Mitigating Degraded Storage (5) Impact
  • 30. Persistent Degraded Storage ➔ long-term instances of degraded infrastructure
  • 31. Persistent Degraded Storage ➔ long-term instances of degraded infrastructure ➔ e2e latencies are not consistently bad
  • 32. Persistent Degraded Storage ➔ long-term instances of degraded infrastructure ➔ e2e latencies are not consistently bad ➔ storage latencies may temporarily recover, but never fully return to a good state
  • 33. Persistent Degraded Storage. Charts: Storage Latency (99.9%), E2E Latency (99%)
  • 34. Transient Degraded Storage ➔ short-term instances of degraded infrastructure
  • 35. Transient Degraded Storage ➔ short-term instances of degraded infrastructure ➔ may recover within a short period of time due to transient factors
  • 36. Transient Degraded Storage. Charts: Storage Latency (99.9%), E2E Latency (99%)
  • 37. Agenda: (1) How Does Kafka Write to the Log? (2) What is Degraded Storage? (3) Persistent vs. Transient Degraded Storage (4) Mitigating Degraded Storage (5) Impact
  • 38. Mitigating Degraded Storage. On-Prem: identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh. Confluent Cloud: automatically detect problematic brokers with bad disks. Mark the broker as “degraded” and trigger mitigation. (Maybe) replace disk.
  • 39. Mitigating Degraded Storage. On-Prem: identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh. Broker 1: TP-1 (L), TP-2 (F). Broker 2: TP-1 (F), TP-2 (L).
  • 40. Mitigating Degraded Storage. On-Prem: identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh. Broker 1: TP-1 (L), TP-2 (F). Broker 2: TP-1 (F), TP-2 (L).
  • 41. Mitigating Degraded Storage. On-Prem: identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh. Before: Broker 1: TP-1 (L), TP-2 (F). Broker 2: TP-1 (F), TP-2 (L). After: one broker holds TP-1 (L), TP-2 (L); the other holds TP-1 (F), TP-2 (F).
  • 42. Mitigating Degraded Storage. On-Prem: identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh. Before: Broker 1: TP-1 (L), TP-2 (F). Broker 2: TP-1 (F), TP-2 (L). After: one broker holds TP-1 (L), TP-2 (L); the other holds TP-1 (F), TP-2 (F). New leader elected.
  • 43. Mitigating Degraded Storage. On-Prem: identify brokers with bad disks and reassign leadership using kafka-reassign-partitions.sh or kafka-leader-election.sh. Before: Broker 1: TP-1 (L), TP-2 (F). Broker 2: TP-1 (F), TP-2 (L). After: one broker holds TP-1 (L), TP-2 (L); the other holds TP-1 (F), TP-2 (F). New leader elected. Old leader made follower.
  • 44. Mitigating Degraded Storage. On-Prem: identify brokers with bad disks and reassign leadership! Confluent Cloud: automatically detect problematic brokers with bad disks. Mark the broker as “degraded” and trigger mitigation. (Maybe) replace disk.
  • 45. Automated Detection & Remediation. Monitors for unhealthy brokers.
  • 46. Automated Detection & Remediation. Monitors for unhealthy brokers. Marks unhealthy broker as “degraded” (automation).
  • 47. Automated Detection & Remediation. Monitors for unhealthy brokers. Marks unhealthy broker as “degraded” (automation). Unhealthy broker’s disk recovers.
  • 48. Automated Detection & Remediation. Monitors for unhealthy brokers. Marks unhealthy broker as “degraded” (automation). Unhealthy broker’s disk recovers. Marks recovered broker as “healthy” (automation).
  • 49. Automated Detection & Remediation. Monitors for unhealthy brokers. Marks unhealthy broker as “degraded” (automation). Unhealthy broker’s disk recovers. Marks recovered broker as “healthy” (automation).
  • 50. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L).
  • 51. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). ISR: 0, 2, 4. ISR: 0, 1, 3.
  • 52. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L).
  • 53. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). Mark “DEGRADED”.
  • 54. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). Mark “DEGRADED”. Broker 0: TP-1 (F), TP-2 (F).
  • 55. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). Mark “DEGRADED”. Broker 0: TP-1 (F), TP-2 (F). ISR: 2, 4. ISR: 1, 3.
  • 56. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). Mark “DEGRADED”. Broker 0: TP-1 (F), TP-2 (F).
  • 57. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). Mark “DEGRADED”. Broker 0: TP-1 (F), TP-2 (F). Mark “HEALTHY”.
  • 58. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). Mark “DEGRADED”. Broker 0: TP-1 (F), TP-2 (F). Mark “HEALTHY”. Broker 0: TP-1 (F), TP-2 (F).
  • 59. Changing Broker State. Broker 0: TP-1 (L), TP-2 (L). Mark “DEGRADED”. Broker 0: TP-1 (F), TP-2 (F). Mark “HEALTHY”. Broker 0: TP-1 (F), TP-2 (F). ISR: 1, 3, 0. ISR: 2, 4, 0.
  • 60. Agenda: (1) How Does Kafka Write to the Log? (2) What is Degraded Storage? (3) Persistent vs. Transient Degraded Storage (4) Mitigating Degraded Storage (5) Impact
  • 61. The Battle Against Degraded Infra… Chart: E2E Latency (99%)
  • 62. The Battle Against Degraded Infra… Chart: E2E Latency (99%)
  • 63. The Battle Against Degraded Infra… ⚔ Chart: E2E Latency (99%)
  • 64. The Battle Against Degraded Infra… ⚔ Chart: E2E Latency (99%)
  • 65. The Battle Against Degraded Infra… Chart: E2E Latency (99%)
  • 66. The Battle Against Degraded Infra… Chart: E2E Latency (99%)
  • 67. The Battle Against Degraded Infra… Chart: E2E Latency (99%)
  • 68. The Battle Against Degraded Infra… Chart: E2E Latency (99%)
  • 69. The Battle Against Degraded Infra… Chart: Storage Latency (99%)
  • 70. The Battle Against Degraded Infra… 🥸 Chart: Storage Latency (99%)
  • 71. The Battle Against Degraded Infra… 😳 🥸 Charts: Storage Latency (99.9%), Storage Latency (99%)
  • 72. The Battle Against Degraded Infra… 😳 🥸 Charts: Storage Latency (99.9%), Storage Latency (99%). Tail latencies (p99.9) matter!
  • 73. The Battle Against Degraded Infra… 😳 🥸 😌 Charts: Storage Latency (99.9%), Storage Latency (99%)
  • 74. The Battle Against Degraded Infra… 🤑 😳 🥸 😌 Charts: Storage Latency (99.9%), Storage Latency (99%)
  • 75. In Summary… ➔ there are two flavors of degraded storage: persistent and transient
  • 76. In Summary… ➔ there are two flavors of degraded storage: persistent and transient ➔ degraded storage often goes unnoticed and affects tail latencies
  • 77. In Summary… ➔ there are two flavors of degraded storage: persistent and transient ➔ degraded storage often goes unnoticed and affects tail latencies ➔ don’t let degradation bring you down
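
The segment diagram in slides 7 to 13 labels segments 00, 03, 06 and 09; in Kafka, a segment's files are named after the base offset of its first record. The following is a minimal Python sketch of how an offset could be located in its segment, assuming only those four base offsets. The helper names `segment_for_offset` and `is_active_segment` are illustrative and are not part of any Kafka API.

```python
from bisect import bisect_right

# Base offsets taken from the slide diagram; a real log has many more segments.
SEGMENT_BASE_OFFSETS = [0, 3, 6, 9]


def segment_for_offset(offset: int, base_offsets=SEGMENT_BASE_OFFSETS) -> int:
    """Return the base offset of the segment that would hold `offset`."""
    if offset < base_offsets[0]:
        raise ValueError(f"offset {offset} is below the log start offset")
    # bisect_right finds the first base offset greater than `offset`;
    # the segment just before it is the one containing the record.
    return base_offsets[bisect_right(base_offsets, offset) - 1]


def is_active_segment(base_offset: int, base_offsets=SEGMENT_BASE_OFFSETS) -> bool:
    """Only the newest segment (highest base offset) accepts appends."""
    return base_offset == base_offsets[-1]
```

With these assumptions, record r4 falls in segment 03 and a newly produced record is appended to segment 09, the active segment.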
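
Slide 72 notes that tail latencies (p99.9) matter. A small Python sketch can show why a p99 chart may look healthy while p99.9 does not. The latency samples below are invented for illustration and are not measurements from the talk; `percentile` is a plain nearest-rank implementation, which is only one of several common percentile definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical storage write latencies in milliseconds:
# 9,980 fast writes and 20 slow writes out of 10,000 (0.2% slow).
latencies_ms = [2.0] * 9980 + [500.0] * 20

p99 = percentile(latencies_ms, 99)     # still reports the fast path
p999 = percentile(latencies_ms, 99.9)  # exposes the slow writes
print(f"p99={p99} ms, p99.9={p999} ms")
```

In this made-up data set only 0.2% of writes are slow, so the p99 value stays at the fast-path latency while p99.9 jumps to the slow-write latency.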
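
For the on-prem path in slides 38 to 43, one possible way to move leadership off a broker with a bad disk is to reorder each partition's replica list so that the broker comes last, then trigger a preferred leader election (the preferred leader is the first replica in the list). The sketch below only builds the two JSON documents that the tools read. The topic name, broker ids and helper names are hypothetical, and this is not necessarily the procedure the speaker uses.

```python
import json


def demote_broker(assignments, degraded_broker):
    """Build a reassignment that moves `degraded_broker` to the end of
    every replica list it appears in, so it is no longer preferred leader.

    `assignments` maps (topic, partition) -> ordered list of broker ids.
    """
    partitions = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if degraded_broker not in replicas:
            continue
        reordered = [b for b in replicas if b != degraded_broker]
        reordered.append(degraded_broker)
        if reordered != replicas:  # skip partitions that need no change
            partitions.append(
                {"topic": topic, "partition": partition, "replicas": reordered}
            )
    return {"version": 1, "partitions": partitions}


def election_request(reassignment):
    """Build the partition list for a preferred leader election."""
    return {
        "partitions": [
            {"topic": p["topic"], "partition": p["partition"]}
            for p in reassignment["partitions"]
        ]
    }


# Hypothetical two-broker layout mirroring the slides: broker 1 leads
# partition 0, broker 2 leads partition 1, and broker 1 has the bad disk.
current = {("orders", 0): [1, 2], ("orders", 1): [2, 1]}
reassignment = demote_broker(current, degraded_broker=1)
election = election_request(reassignment)
print(json.dumps(reassignment))
print(json.dumps(election))
```

The first document can be passed to kafka-reassign-partitions.sh with `--reassignment-json-file` and `--execute`, and the second to kafka-leader-election.sh with `--election-type preferred` and `--path-to-json-file`; check each tool's `--help` output for the options available in your Kafka version.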
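
The automated loop in slides 45 to 59 (monitor, mark “degraded”, wait for recovery, mark “healthy”) can be sketched as a small state machine. This is an illustrative sketch, not Confluent's implementation: the threshold, the window counts and the class name `BrokerHealthTracker` are all assumptions. Requiring several consecutive windows before changing state is one way to avoid flapping on the transient spikes described in slides 34 to 36.

```python
from dataclasses import dataclass


@dataclass
class BrokerHealthTracker:
    # All three values are assumed for illustration only.
    degraded_threshold_ms: float = 100.0
    bad_windows_to_degrade: int = 3
    good_windows_to_recover: int = 5
    state: str = "HEALTHY"
    _streak: int = 0

    def observe(self, p999_storage_latency_ms: float) -> str:
        """Feed one monitoring window's p99.9 storage latency and
        return the broker state after taking it into account."""
        is_bad = p999_storage_latency_ms > self.degraded_threshold_ms
        # Count consecutive windows that argue for leaving the current state.
        argues_for_change = is_bad if self.state == "HEALTHY" else not is_bad
        self._streak = self._streak + 1 if argues_for_change else 0
        needed = (
            self.bad_windows_to_degrade
            if self.state == "HEALTHY"
            else self.good_windows_to_recover
        )
        if self._streak >= needed:
            self.state = "DEGRADED" if self.state == "HEALTHY" else "HEALTHY"
            self._streak = 0
        return self.state


tracker = BrokerHealthTracker()
degrade_path = [tracker.observe(ms) for ms in (150.0, 150.0, 150.0)]
recover_path = [tracker.observe(ms) for ms in (10.0, 10.0, 10.0, 10.0, 10.0)]
```

In a real system, the transition to DEGRADED would be the point at which leadership is moved away from the broker, and the transition back to HEALTHY the point at which it may rejoin as a leader candidate.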