URP? Excuse You!
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn
What This Talk Is Not
• What is Kafka
• Encyclopedia of Monitoring
• Automation
Why Talk About Monitoring?
Messages per Day at LinkedIn
What is Monitoring (not)?
Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life
Service Level Objectives
• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
Key Kafka Metrics
The Three Metrics You Need to Know
• URP: Partitions that are not fully replicated within the cluster
• Request Handlers: The overall utilization of an Apache Kafka broker
• Request Timing: How long requests are taking, in which stage of processing
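As a rough pointer to where these three metrics live, the sketch below lists the JMX object names under which Kafka brokers commonly expose them. The dictionary name and the `request_timing_mbean` helper are made up for illustration and are not part of the talk; verify the object names against the broker version you run.

```python
# JMX object names for the three metrics, as commonly exposed by Kafka brokers.
# Check these against your broker version before wiring them into a collector.
KEY_BROKER_MBEANS = {
    "urp": "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "request_handlers": (
        "kafka.server:type=KafkaRequestHandlerPool,"
        "name=RequestHandlerAvgIdlePercent"
    ),
}

# Stage prefixes used in the per-request timing metrics.
TIMING_STAGES = (
    "Total", "RequestQueue", "Local", "Remote", "ResponseQueue", "ResponseSend",
)


def request_timing_mbean(stage: str, request: str = "Produce") -> str:
    """Build the object name for one timing stage (hypothetical helper)."""
    if stage not in TIMING_STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return f"kafka.network:type=RequestMetrics,name={stage}TimeMs,request={request}"
```

For example, `request_timing_mbean("Remote", "FetchFollower")` gives the object name for time spent waiting on other brokers during replica fetches.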
Under-Replicated Partitions
• Highly discussed
• Overall cluster health
• Replication is a consumer and producer
Under-Replicated Partitions
EXAMPLE: FAILED BROKER
Under-Replicated Partitions
EXAMPLE: CONSUMER PROBLEMS
Under-Replicated Partitions
EXAMPLE: PRODUCER PROBLEMS
Under-Replicated Partitions
• Overrated
• Doesn’t map to SLO
• Often not actionable
• Collect, but don’t alert
Everybody In The Pool
• Specialized thread pools
• Clients deal with network and request pools
• Request handlers do most of the work
Request Handlers
• Decode and validate
• Perform task
• Wait for other brokers
• Assemble response
Request Handler Problems
CPU Time
• Anything that causes Kafka to expend CPU cycles
• Includes problems related to failing disks (IO wait)
• SSL and compression work both can use a lot of CPU
Timeout
• Most often due to failing to process controller requests
• Intra-cluster requests tend to be bound by partition counts
• Rapidly starves the pool of threads
Deadlock
• Should always be a code bug
• Usually looks exactly like a timeout problem
• Rare, but hard to identify
Request Handler Problems
EXAMPLE: TIMEOUT OR DEADLOCK
Brokers Don’t Do Compression
Brokers Shouldn’t Do Compression
Up Conversion
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Down Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to clients
• Producer clients update to new version
Request Timing
• Total – Request handling, end to end
• Request Queue – Waiting to process
• Local – Work local to the broker
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send – Send to client
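The stage breakdown above is most useful when you ask which stage dominates the total. A minimal sketch of that comparison follows, assuming you have already collected one timing value per stage in milliseconds; the dictionary keys and the sample numbers are invented for illustration.

```python
# Stage names mirror the request timing breakdown; keys are hypothetical.
STAGES = ("request_queue", "local", "remote", "response_queue", "response_send")


def slowest_stage(timings_ms: dict) -> tuple:
    """Return (stage, share_of_total) for the stage taking the most time."""
    total = timings_ms["total"]
    if total <= 0:
        raise ValueError("total time must be positive")
    stage = max(STAGES, key=lambda s: timings_ms.get(s, 0.0))
    return stage, timings_ms.get(stage, 0.0) / total


# Invented sample: a produce request that mostly waits on other brokers.
sample = {
    "total": 40.0,
    "request_queue": 1.0,
    "local": 4.0,
    "remote": 32.0,
    "response_queue": 1.0,
    "response_send": 2.0,
}
stage, share = slowest_stage(sample)
```

A large share in `remote` points at replication between brokers, while a large share in `request_queue` suggests the request handler pool is saturated.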
Request Timing
EXAMPLE: PRODUCE TOTAL TIME
Request Timing
EXAMPLE: PRODUCE LOCAL TIME
Request Timing
EXAMPLE: PRODUCE REMOTE TIME
Thank you?
What’s Missing?
Availability Monitoring
• SLO, part 2
• Measured externally
• Client focused
• github.com/linkedin/kafka-monitor
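To make "measured externally, client focused" concrete, here is a minimal sketch of turning external probe results into an availability figure and comparing it with an objective. It does not use kafka-monitor's API; the function names and the 99.9% target are assumptions chosen for illustration.

```python
def availability(probe_results: list) -> float:
    """Fraction of external probes (True = produce/consume succeeded) that passed."""
    if not probe_results:
        raise ValueError("no probe results")
    return sum(1 for ok in probe_results if ok) / len(probe_results)


def meets_slo(probe_results: list, target: float = 0.999) -> bool:
    """Compare measured availability with a hypothetical SLO target."""
    return availability(probe_results) >= target


# Invented sample: 1000 probes, one failure.
probes = [True] * 999 + [False]
```

Because the probes run outside the cluster and behave like a client, a failure here maps directly to something a customer would see, which is what makes it a reasonable thing to alert on.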
Operating System And Hardware Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
  • 100% clear signal
  • 100% clear response
Capacity Planning
• Plan in advance
• Multi-factor
• Don’t alert for capacity
Capacity Metrics
• Request Handler Idle Ratio
• Disk Utilization
• Partition Count
• Network Utilization
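Since capacity is multi-factor, one way to look at these four metrics together is to ask which one is closest to its planned limit. The sketch below does that; every limit and usage number is invented for illustration, and the request handler figure is expressed as busy ratio (1 minus the idle ratio) so that higher always means less headroom.

```python
def tightest_factor(usage: dict, limits: dict) -> tuple:
    """Return (factor, fraction_of_limit_used) for the factor with least headroom."""
    ratios = {name: usage[name] / limits[name] for name in limits}
    factor = max(ratios, key=ratios.get)
    return factor, ratios[factor]


# Hypothetical broker snapshot; all numbers are made up.
usage = {
    "request_handler_busy": 1.0 - 0.70,  # 1 - Request Handler Idle Ratio
    "disk_utilization": 0.55,
    "partition_count": 3000,
    "network_utilization": 0.40,
}
# Hypothetical planning limits, not recommendations.
limits = {
    "request_handler_busy": 0.60,
    "disk_utilization": 0.80,
    "partition_count": 4000,
    "network_utilization": 0.75,
}
factor, used = tightest_factor(usage, limits)
```

The result feeds a planning review rather than a pager, in line with the advice not to alert for capacity.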
Wrapping Up
If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide
Getting (and Giving) Help
LinkedIn Open Source
• Kafka Monitor
  • https://github.com/linkedin/kafka-monitor
• Burrow
  • https://github.com/linkedin/Burrow
• Cruise Control
  • https://github.com/linkedin/cruise-control
• kafka-tools
  • https://github.com/linkedin/kafka-tools
Get Involved
• Community
  • users@kafka.apache.org
  • dev@kafka.apache.org
• Bugs and Work:
  • https://issues.apache.org/jira/projects/KAFKA
Thank you

URP? Excuse You! The Three Kafka Metrics You Need to Know

Editor's Notes

  1. Let me start off by telling you what we’re not talking about today. I won’t be going into the basics of what Kafka is – I assume that if you’re attending Kafka Summit, you have an idea of what it does and how it works. Regardless, you’re going to get some good data here on monitoring, even if you have very limited Kafka knowledge. However, this also won’t be an encyclopedic look at monitoring. I’m going to discuss a few key sets of metrics, and how to use them. But I won’t even be covering all the Kafka metrics you should look at, never mind all that exist. I encourage you to spin up a JMX tool of choice and explore what’s exposed for sensors in Kafka. I also encourage you to share with the class, whether in posts, talks, or tweets, any gems that you have for your own monitoring. I’m also not going to talk about automation, even as it relates to handling alerts. There are many fine talks out there about automating responses and runbooks, and we could spend hours talking about just that.
  2. So why am I here today talking about monitoring? There are lots of topics that could be covered, especially in an ecosystem as large as Kafka. And I could always deliver yet another “here’s how we do it at LinkedIn” talk. However, today I’m choosing to share a look at where we’re moving right now. I recently wrote a post for DevOps.com about a term we use, “Code Yellow”. This is one of our tools for dealing with an application, or a team, in crisis. Typically this is due to something like communication problems, or a large amount of tech debt. Since I recently wrote this post, and you all know that I work on Kafka, you can probably guess that I’m currently in this state. In our case, it’s due to somewhat unexpected growth.
  3. LinkedIn started using Kafka back in 2010, before it was open sourced. In September of 2015, we announced that we had hit a milestone, at one trillion messages a day produced into our Kafka clusters. Last year, at Kafka Summit in San Francisco, I noted that we had passed two trillion messages a day. At the beginning of the year, we clocked in at three trillion. And now, we’re over five trillion messages a day. That hockey stick at the end is the current source of my long days and sleepless nights. Top this off with the fact that our monitoring is currently very noisy, partly due to scale problems around this growth, and partly because we alert on many things that are not providing clear signals. We’re currently overhauling our monitoring as a result of this.
  4. So why do we have such noisy alerting? We’ve forgotten that monitoring and alerting are not the same thing.
  5. Today, we're going to be talking about monitoring, not alerting. What is the difference, you ask? In our case, monitoring refers to all the data we have available to us from Kafka and our underlying systems, from high level metrics like partition counts down to the most minute sensor that is available. Alerting, on the other hand, we will use to refer to the metrics that are used to tell us about an imminent problem. They're the metrics that wake us up at night. These should be carefully chosen, and they should be clear signals that demand an immediate response 100% of the time. Another thing to keep in mind is that events are almost always superior to metrics when alerting. We know this, right? Kafka is all about events. And yet we still have measurements that are rates where they should be discrete counts of events. We normally can’t work with individual events, like a failed request, at scale. But we do want to know the actual number of failed requests, and not a requests per second metric where we miss data due to time windows. We also need to make sure that we’re testing the code before we deploy it. My team has fallen prey to reactive alerting – we find a new problem, like a socket leak, and we add a new alert for file handles in use so we can catch it before it goes critical. The bug gets fixed, but we keep the alert, just in case we run into it again. It would be much better for everyone if we added a release test that checks for the general case of increased file handle usage, and dropped the alert on the live systems. Alerting should always be aimed at maximizing the amount of sleep that your operations team gets. That means as few alerts as possible to keep everything running, and automating as much as possible.
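To make the point about counts versus rates concrete, here is a minimal sketch. It is not LinkedIn's tooling: the function name and the sample readings are hypothetical, and it assumes a cumulative failed-request counter that never resets between readings. The difference between two counter readings gives the exact number of failures, while a per-second rate sampled at the same instants can miss a short burst entirely.

```python
def failures_between(counter_samples):
    """Exact number of failed requests between the first and last reading.

    counter_samples: readings of a cumulative (monotonically increasing)
    failure counter. Assumption: the counter did not reset in between.
    """
    if len(counter_samples) < 2:
        return 0
    return counter_samples[-1] - counter_samples[0]


# Hypothetical readings of a cumulative failure counter, one per minute.
counter = [100, 100, 160, 160, 160]

# Hypothetical instantaneous failures-per-second readings taken at the same
# moments. The burst of 60 failures fell between samples, so each reads 0.
rate_samples = [0.0, 0.0, 0.0, 0.0, 0.0]

exact = failures_between(counter)
estimated_from_rate = sum(rate * 60 for rate in rate_samples[1:])
print("exact failures:", exact)
print("failures estimated from sampled rate:", estimated_from_rate)
```

The counter-based number is the one worth alerting on, because it does not depend on when the sample happened to be taken.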
  6. When we're talking about alerting, the most important thing to watch is the metrics related to your service level objectives, or SLOs. Just as a note, an SLO and an SLA are not the same thing. A service level agreement is a contract: it's basically an SLO with teeth - a penalty. The SLO is the level of service that we're promising to our customers. For Kafka, this is typically going to be that the system will be available, and it will perform at a certain level for produce and consume requests. We'll cover what metrics to use for this in a bit. In addition to these, your SLOs are whatever you’re guaranteeing to your customers. This may include a minimum amount of retention. If you’re working to GDPR, or another privacy standard, you may specify a maximum amount of time that data will be retained for (here’s a hint, that’s not necessarily the retention in time that you set for the topic).
  7. I've talked at length about the under replicated partition count metric. I dedicated a significant number of pages in a book you may have seen about how to respond to any non-zero value. At its heart, this number tells you that the replication within the cluster is having problems.
  8. A stable count on all but one broker tells you that that broker is not working. It's either down, or the replication is not started.
  9. A variable count on a single broker tells you that that broker is having a problem servicing consume requests.
  10. A variable count on multiple brokers indicates a more overall problem. In this case, you'll need to enumerate the partitions that are falling behind (using the CLI tools) and see if there is a common thread, such as a single broker that is having problems replicating from multiple cluster members.
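The three URP patterns just described can be sketched as a small classifier. This is only an illustration of the reasoning, not a tool that ships with Kafka. The function name, the input shape (broker id mapped to a list of URP count samples), and the returned messages are all hypothetical. It also assumes a broker that is down shows up in the input with all-zero samples.

```python
def classify_urp(samples):
    """Classify a URP pattern across brokers.

    samples: dict mapping broker id -> list of URP counts over time.
    Assumption: a broker that reports nothing is represented by zeros.
    """
    nonzero = {b: s for b, s in samples.items() if any(v > 0 for v in s)}
    if not nonzero:
        return "healthy"

    # A broker is "variable" if its count changes between samples.
    variable = [b for b, s in nonzero.items() if len(set(s)) > 1]
    stable = [b for b in nonzero if b not in variable]

    # Stable non-zero count on all but one broker: that one broker is suspect.
    if not variable and len(stable) == len(samples) - 1:
        (suspect,) = set(samples) - set(nonzero)
        return f"broker {suspect} is likely down or not replicating"

    # Variable count on a single broker: that broker is struggling.
    if len(variable) == 1:
        return f"broker {variable[0]} is likely struggling to serve fetch requests"

    # Variable count on several brokers: look for a common thread.
    if len(variable) > 1:
        return "cluster-wide problem: enumerate the lagging partitions"

    return "unclassified"


print(classify_urp({1: [4, 4, 4], 2: [6, 6, 6], 3: [0, 0, 0]}))
```

Real URP graphs are noisier than this, so treat the output as a starting point for investigation rather than a diagnosis.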
  11. But the most important thing to know about the URP metric is that it is overrated for alerting. That's right, I said it. I don't like getting paged for this metric. But why, you ask? If it illustrates so many problems, why wouldn't I want to get alerts for it? The problem is that it doesn't tell me that I'm breaching my SLO, and whatever problem it's telling me about is often not immediately actionable. More often than not, this metric tells me about two problems. The first is that a broker is down. I can detect that with a much clearer signal, however, by health checking the application. The other problem is that the cluster is operating over its capacity. I don't want to be paged for that either because capacity is a proactive monitoring problem, not a reactive problem. We'll talk about that more in a few slides. Still, you should be collecting this metric, and you might want to consider generating warnings for it. It does illustrate a risky situation, because we depend on replication in the cluster for redundancy. When it's not zero, you have a problem that needs some attention.
  12. As with most applications, Kafka has thread pools to do work. There are several different ones - network handlers, request handlers, log compaction, recovery (which are also used for handling log segments at startup and shutdown). When we’re talking about client traffic, the network and request handlers are the ones that do all the work, and the request handlers are far more important. This is because the network handlers just take care of the network connection, including reading and writing bytes on the wire.
  13. The request handler does everything else for the client - it decodes and validates the protocol, handles produce and consume work, and assembles the response to send back. It even performs all of the broker internal work, responding to controller requests. This means that if you want a single indicator of how busy the broker is, you couldn’t ask for a much better measure than the utilization of the request handlers. But as with under-replicated partitions, there are a lot of different problems that could be indicated here.
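As a small illustration of "utilization of the request handlers", here is a sketch that turns idle-ratio samples into a utilization figure. It assumes the broker reports the pool's idle ratio as a value between 0 and 1 (the RequestHandlerAvgIdlePercent sensor on the request handler pool, in the brokers I am aware of). The function name is hypothetical.

```python
def handler_utilization(idle_ratios):
    """Mean request handler utilization over a window of idle-ratio samples.

    idle_ratios: samples of the pool idle ratio, each in the range 0.0 to 1.0
    (assumption: that is how the broker reports it).
    """
    if not idle_ratios:
        raise ValueError("need at least one idle-ratio sample")
    for ratio in idle_ratios:
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"idle ratio out of range: {ratio}")
    mean_idle = sum(idle_ratios) / len(idle_ratios)
    # Utilization is simply the fraction of time the handlers were not idle.
    return 1.0 - mean_idle


print(handler_utilization([0.5, 0.25]))
```

Averaging over a window smooths out short spikes, which fits the way this number is used later for capacity trending rather than paging.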
  14. CPU - Slow disk performance, often due to a failing drive, is a particular problem for produce requests. As the request handler will have to take more time when writing to disk, it will manifest as higher utilization.
      Timeouts and deadlocks look very similar.
      Timeouts - All of the request handler threads are getting tied up. We most often see this when the broker is starting up, and it is failing to process requests from the controller within the controller socket timeout.
      Deadlock - But if that doesn’t solve it, you may have hit a deadlock condition in handling requests. We’ve seen this recently with some shutdown code, but it was related to the authorizer we were using and not Kafka directly.
  15. Here are the produce TotalTime graphs for a broker that is working perfectly well. (Include 50th, 99th, and 99.9th). If the broker is running well, why is there such a discrepancy? The reason is that the amount of time required for a produce request varies widely depending on the content of the request.
  16. Timeouts most often happen when controller requests are not processed within the controller socket timeout. What happens is that the controller sends the request, it times out, and then the controller sends the request again. You’ll see this especially when the broker is starting up, and the controller is trying to send it the state of the world with leader and ISR requests. Deadlocks look almost identical, but they’re much more rare. We’ve seen them recently during shutdown, but that was caused by an issue in the authorizer module that we use, and not something that was endemic to Kafka itself. However, they’re almost always code issues. This makes them pretty tricky to debug.
  17. Wait, the Kafka brokers don’t compress data anymore! We got rid of that with the bump to message format 1, and relative offsets in the produced batches. Right? Yeah, that’s what I thought, too. Turns out that there are a couple cases, which are not as rare as you might think, that will result in the broker having to rewrite the incoming message batches.
  18. Another common culprit for the request handlers being overutilized, even at a low traffic volume, is compression. This happens when the client versions do not match the message format on disk. The (config name) is settable via a broker configuration, and controls how messages are written to disk. In an ideal world, the producer client version matches this configuration, such that the producer is sending the same message format. If the producer is an older version, the broker will have to upconvert the messages, and if the producer is using a higher message format version the broker will need to down convert. Both of these situations mean the broker will be forced to recompress the message batch before writing it to disk (this also happens if your brokers are still using message format zero). This is an expensive operation, and should be avoided. It’s also worth noting that you can set the message format on disk as a per-topic override. You will want to be very careful if you feel the need to do this, as it means the logs on disk are inconsistent, and you could easily have compression you’re not expecting.
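The conversion rules in the note above can be written down as a tiny decision function. This is a sketch of the reasoning only, not broker code. The function name and the use of plain integers (0, 1, 2) as stand-ins for message format versions are assumptions made for illustration.

```python
def conversion_needed(producer_format, on_disk_format):
    """Describe the work the broker does for an incoming compressed batch.

    Both arguments are hypothetical integer labels for the message format
    version (0, 1, 2). The rules follow the note above, nothing more.
    """
    if producer_format < on_disk_format:
        # Older producer: broker up-converts, then recompresses.
        return "up-convert"
    if producer_format > on_disk_format:
        # Newer producer: broker down-converts, then recompresses.
        return "down-convert"
    if on_disk_format == 0:
        # Formats match, but format zero still forces a rewrite.
        return "recompress (format 0)"
    return "none"


for producer, disk in [(1, 2), (2, 1), (0, 0), (2, 2)]:
    print(producer, disk, conversion_needed(producer, disk))
```

Anything other than "none" means extra CPU on the request handlers for every compressed batch from that producer.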
  19. If you have slow request processing due to issues like this, you’re also going to have latency issues. Which gets us into the third set of metrics... For each protocol request type, Kafka provides a set of timing metrics. These describe the amount of time that the request spends in various states while being processed:
      • Total Time - the overall total time to process a request, from when it is received to when it is complete
      • Request Queue Time - how long the request sits in queue before being picked up by a request handler for processing
      • Local Time - the amount of local processing time required for the request. This can include a number of things, such as disk write time for produce requests
      • Remote Time - the amount of time that the request waits on non-local steps. This includes acknowledgements from followers for produce requests
      • Response Queue Time - how long the response for the request sits in queue before being sent to the client
      • Response Send Time - how long it takes to send the response to the client. This only covers getting it into the send buffers locally, not network time
      In addition to the time metrics, there is also a rate metric that gives you the number of requests of a particular type per second. The time metrics are provided as percentiles, and as such you can choose from 50th, 75th, 99th, and 99.9th percentiles, as well as an average and maximum value over the course of the running process. Request latency is typically going to be the first of your SLO measurements. Which means that you will probably want to be monitoring these metrics and possibly alerting off them. The problem comes in as you try to pick which attributes to monitor, and what the baseline values are.
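One way to use the per-stage timings is to ask which stage accounts for most of a slow request. The sketch below does that for a single request type. It assumes you feed it the mean value of each stage, since percentiles of individual stages do not add up to the percentile of the total. The function name and the sample numbers are hypothetical; the stage keys follow the sensor names I know from the broker's request metrics.

```python
# Stage sensors that make up a request's processing time.
STAGES = (
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
)


def dominant_stage(timings):
    """Return (stage, share) for the stage that takes the most time.

    timings: dict of stage name -> mean milliseconds for one request type.
    Missing stages are treated as zero.
    """
    total = sum(timings.get(stage, 0.0) for stage in STAGES)
    if total == 0:
        return None, 0.0
    stage = max(STAGES, key=lambda name: timings.get(name, 0.0))
    return stage, timings.get(stage, 0.0) / total


# Hypothetical mean timings for produce requests on one broker.
produce = {
    "RequestQueueTimeMs": 1.0,
    "LocalTimeMs": 2.0,
    "RemoteTimeMs": 6.0,
    "ResponseQueueTimeMs": 0.5,
    "ResponseSendTimeMs": 0.5,
}
print(dominant_stage(produce))
```

A large queue-time share points at saturated request handlers, a large local share at disk, and a large remote share at replication or acknowledgement settings.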
  20. Here are the produce TotalTime graphs, 50th percentile and 99.9th percentile, for a broker that is working perfectly well. It may be hard to see, but the scale of the first graph is in single digits, and the scale of the second is in thousands. If the broker is running well, why is there such a discrepancy? The reason is that the amount of time required for a produce request varies widely depending on the content of the request.
  21. Let’s consider the local time. Again, these are the 50th percentile and the 99.9th percentile, and the first graph goes from zero to one, while the second graph is again in the thousands. What would impact the amount of time required to process the produce requests locally? In this case, most of our produce requests are really small - small batches, single topic - but some of them are very large. The bigger the produce request, the more time it takes to write the data to disk.
  22. How about the remote time for the same produce requests? Yet again, these are the 50th and 99.9th percentile graphs, with the first one being from zero to two, and the second being in the thousands. The average value is small, but the 99.9th percentile is multiple orders of magnitude higher. The most common cause here is that most of our requests are being produced with the required acknowledgements being set to 1, while some are requesting all acknowledgements. That easily drives up the amount of time spent in the remote step. This isn’t to say that you can’t use these metrics effectively for alerting. It just means that you need to define your SLOs appropriately. Stating simply that produce requests will be handled in 20ms or less may not be reasonable, but specifying that value for the average produce request may be fine.
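The difference between an SLO on the average and an SLO on every request shows up clearly with a small worked example. This is a sketch with made-up latencies, not measurements from a real cluster. The 20 ms target comes from the note above; the function name and the nearest-rank percentile method are my own choices for illustration.

```python
import statistics


def percentile(values, numerator, denominator):
    """Nearest-rank percentile, with the percentile given as a fraction.

    For example numerator=999, denominator=1000 is the 99.9th percentile.
    Integer arithmetic avoids floating point surprises in the rank.
    """
    ordered = sorted(values)
    # Ceiling division: smallest rank covering the requested fraction.
    rank = -(-len(ordered) * numerator // denominator)
    return ordered[max(rank, 1) - 1]


# Hypothetical produce latencies in ms: mostly small requests, a few large.
latencies_ms = [2] * 998 + [3000] * 2

SLO_MS = 20
mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 1, 2)
p999_ms = percentile(latencies_ms, 999, 1000)

print("mean within SLO:", mean_ms <= SLO_MS)
print("p50 within SLO:", p50_ms <= SLO_MS)
print("p99.9 within SLO:", p999_ms <= SLO_MS)
```

With this data the average and the median sit well inside 20 ms while the 99.9th percentile does not, which is why the wording of the SLO matters as much as the number.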
  23. OK, so we’ve covered our three metrics, and we’ve still got X minutes left in this talk. I could sit here and just stare at my phone for the rest of the time. Or …
  24. We could talk about what’s missing, since we only covered a very small slice of monitoring for Kafka.
  25. The other side of your service level objectives is probably going to be the availability of Kafka to handle requests. But as with any system, you can’t truly measure the availability of a Kafka cluster from the brokers themselves. There are many factors that go into availability, including whether or not the network is working. Looking at the broker itself may tell you that everything’s fine, meanwhile none of your clients can connect. For monitoring availability, you need to use something external to the Kafka cluster to look at it from the client’s point of view. This is why LinkedIn created, and open sourced, kafka-monitor (https://github.com/linkedin/kafka-monitor). This runs a producer and a consumer for each cluster, and verifies that both requests work properly. It can ensure that there is at least one partition on each broker in the cluster, so you check the entire cluster. It also provides latency metrics for requests, so you have an objective view of the request timings we were just talking about.
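The idea of measuring availability from the client's side can be sketched without any Kafka code at all. The example below is not kafka-monitor and does not use its API. It assumes you supply a hypothetical `produce_and_consume` callable that sends one message, reads it back, and returns True on success. The sketch only shows how availability and latency fall out of repeated round trips.

```python
import time


def run_probe(produce_and_consume, attempts):
    """Run an external round-trip probe and summarize the results.

    produce_and_consume: hypothetical callable supplied by the caller. It
    should return True when a message was produced and consumed again.
    Returns (availability, latencies_ms for the successful attempts).
    """
    successes = 0
    latencies_ms = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            ok = bool(produce_and_consume())
        except Exception:
            # Any client-side failure counts against availability.
            ok = False
        if ok:
            successes += 1
            latencies_ms.append((time.monotonic() - start) * 1000.0)
    availability = successes / attempts if attempts else 0.0
    return availability, latencies_ms


# Stand-in for a real client round trip: three successes and one failure.
outcomes = iter([True, True, False, True])
availability, latencies = run_probe(lambda: next(outcomes), 4)
print("availability:", availability)
```

Because the probe runs outside the brokers, a network partition or a broken listener shows up as lost availability even when every broker-side metric looks healthy.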
  26. So what should we do about lower level OS and hardware metrics? Well, let me ask you this. I have a Kafka cluster that’s running at 95% CPU, what do I do? Well, if it’s serving requests properly and within the SLO, I go get a cup of coffee. I might need to look at it, but it’s not a crisis. Most metrics, OS or otherwise, are a great recipe for creating lots of alert noise that is not actionable. CPU and memory usage could be high due to other applications, and in most cases relate to overall capacity and not to the application’s performance or current state of functionality. You should definitely collect them so that you can go back and debug problems later. If you’re thinking about setting up an alert you need to ask yourself two things: Is this always actionable when the alert goes off? Is the action 100% clear? If the answer to either of these is something along the lines of “Yes, but…” you need to stop and rethink what you’re trying to accomplish. But, Todd! I need to monitor things like disk usage, don’t I? Yes, of course we do, but this falls under the heading of capacity planning.
  27. My Kafka environment, like many of yours, is shared between many different applications. You may even have some of the tech debt that we have, where you have little control over when someone starts using it for a new service. This means that we should be keeping an eye on the capacity of the system, and preemptively adding more. Preemptively is the key word here. You want to deploy new brokers before you’ve hit 100% capacity, which means that you need to order them earlier than that.
  28. I am no magician, contrary to the perception that many have of my ability to solve problems. It does me no good to get an alarm in the middle of the night that we’re approaching saturation, as I can’t magically make new hardware appear. And if I already have the hardware, it should have been added to the clusters so that I never hit a crisis point.
  29. The metrics that I’m mostly interested in for judging capacity are:
      • Request handler pool idle ratio
      • Disk utilization
      • Partition count
      • Network utilization
      You should be trending these metrics over time, and reviewing them on a regular basis. You may want to have some sort of alert once capacity is approaching a point where you need to get more, but that should be an email, or even better, an automatic work ticket in your system of choice. Additionally, make sure you’re making use of features like quotas and retention of messages by size so that you can minimize any surprises.
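Trending a capacity metric can be as simple as fitting a straight line and asking when it crosses a limit. The sketch below does that with an ordinary least squares fit. The function name, the sample data, and the 80% limit are hypothetical, and a straight line is an assumption that will not hold for hockey-stick growth.

```python
def days_until_limit(samples, limit):
    """Estimate days from the last sample until a metric reaches a limit.

    samples: list of (day, value) pairs, for example disk utilization in
    percent. Uses a least squares straight-line fit (an assumption about
    growth). Returns None when there is no upward trend to extrapolate.
    """
    count = len(samples)
    if count < 2:
        return None
    mean_day = sum(day for day, _ in samples) / count
    mean_value = sum(value for _, value in samples) / count
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        return None
    slope = sum(
        (day - mean_day) * (value - mean_value) for day, value in samples
    ) / spread
    if slope <= 0:
        return None
    intercept = mean_value - slope * mean_day
    crossing_day = (limit - intercept) / slope
    last_day = max(day for day, _ in samples)
    # Never report a negative number of days if the limit is already passed.
    return max(0.0, crossing_day - last_day)


# Hypothetical disk utilization (percent) sampled every ten days.
disk_usage = [(0, 50.0), (10, 55.0), (20, 60.0)]
print(days_until_limit(disk_usage, 80.0))
```

A result shorter than your hardware lead time is the cue to open a ticket, which is a much better outcome than a page at saturation.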
  30. If you take nothing else away from today’s talk, leave with this. First, you must define what your service level objectives are for Kafka within your organization. Even if you’re running at a small scale, and with a limited number of customers. Even if you’re the only customer of your cluster. Make it clear what the expectations are, and hold to them. Next, once you have those SLOs, that is what you need to be monitoring. David Henke, who led Engineering and Operations at LinkedIn for many years, would often say “What gets measured, gets fixed.” If you do not monitor your SLOs, then they do not really count. But beware of metrics that alert you to many different problems. They are typically noisy, and they often make it difficult to determine what the underlying problem is. They are attractive, because it’s a single number that says “something is wrong”, but they will drive you crazy in the end. And lastly, buy yourself a copy of Kafka: The Definitive Guide. In fact, you should buy two or three. Because reasons.