SlideShare a Scribd company logo
1 of 42
1
Monitoring Apache Kafka
Gwen Shapira, Principal Data Architect
@gwenshap
2
#1 Kafka Monitoring Tip:
Please Monitor
Apache Kafka.
3
Breaking tuning
production databases
since 1997
Apache Kafka PMC
Principal Data Architect
Tweeting a lot
@gwenshap
4
Apache Kafka is a distributed system and has many components
Controller
5
Things to keep an eye on
• Broker health
• Message delivery
• Performance
• Capacity
6
Broker Health Monitoring
7
Kafka & Metrics Reporters
• Pluggable interface
• JMX reporter built in
• TONS of metrics
8
TONS of Metrics
• Broker throughput
• Topic throughput
• Disk utilization
• Unclean leader elections
• Network pool usage
• Request pool usage
• Request latencies – 30 request types, 5 phases
each
• Topic partition status counts: online, under
replicated, offline
• Log flush rates
• ZK disconnects
• Garbage collection pauses
• Message delivery
• Consumer groups reading from topics
• …​
9
#2 Kafka Monitoring Tip:
Have a dashboard
that lets you know
“is everything ok?”
in one look
10
Is the broker process up?
• ps –ef | grep kafka
• Do we receive metrics?
11
Under-replicated partitions
• If you can monitor just one thing…
• Is it a specific broker?
• Cluster wide:
• Out of resources
• Imbalance
• Broker:
• Hardware
• Noisy neighbor
• Configuration
• Garbage collection
12
Drill Down into Broker and Topic: Do we see a problem right here?
13
Check partition placement - is the issue specific to one broker?
14
#3 Kafka Monitoring Tip:
Monitor
Under-replicated Partitions
15
Canary
• Produce an event
• Try to consume the event 1s later
• Did you get it?
16
Other important health metrics
• Active Controller
• ZK Disconnects
• Unclean leader elections
• ISR shrink/expand
• # brokers
17
#3 Kafka Monitoring Tip:
Don’t Watch the Dashboard
18
PSA regarding broker health
• Before 0.11.0.0 – restarting a broker is risky
• Even after…
• Before 1.0.0 – restarting a broker is slow
• Especially with 5000+ partitions per broker
Lesson:
Only restart if you know why this will fix the issue.
19
Message Delivery
20
Delivery Guarantees
• At most once
• At least once
• Exactly once
• … and within N milliseconds
21
Are you meeting your guarantees, right now?
22
#4 Kafka Monitoring Tip:
Monitoring brokers isn’t enough.
You need to monitor events
23
Every Service that uses Kafka is a Distributed System
Orders
Service
Stock
Service
Fulfilment
Service
Fraud Detection
Service
Mobile App
Kafka
24
How to monitor?
The infamous LinkedIn “Audit”:
• Count messages when they are produced
• Count messages when they are consumed
• Check timestamps when they are
consumed
• Compare the results
25
Under Consumption
• Reasons for under consumption:
• Producers not handling errors and retried correctly
• Misbehaving consumers, perhaps the consumer did not follow shutdown sequence
• Real-time apps intentionally skipping messages
26
#5 Kafka Monitoring Life Tips:
• producer.close();
• retries > 0
• Handle send() exceptions.
• Use new consumer (0.10 and up)
27
Over Consumption
• Reasons for over consumption
• Consumers may be processing a set of messages more than once
• Latency may be higher
28
Slow Consumers
• Identify consumers and consumer groups that are not keeping up with data production
• Compare a slow, lagging consumer (left) to a good consumer (right)
29
Other important client metrics
• Producer retries
• Producer errors
• Consumer message lag – especially trends
30
Performance Tuning
31
#5 Kafka Monitoring Tip:
Tune the bottlenecks
32
Wrong:
“OMG! Log Flush Time is 14ms”
Is this high?
How often we flush logs?
Is it blocking?
Who cares?
Right:
“OMG! We only process 50,000
requests per second. We need 10
times this”
Why? IO threads are busy.
Why? Waiting for flush.
Why are we flushing so often?
Log segments keep rotating.
Lets configure larger segments!
33
measure
performance
breakdown
time spent
change one
thing
34
Lifecycle of a request
• Client sends request to broker
• Network thread gets request and puts on queue
• IO thread / handler picks up request and processes
• Read / Write from/to local “disk”
• Wait for other brokers to ack messages
• Put response on queue
• Network thread sends response to client
35
Produce and Fetch Request Latencies
• Breakdown produce and fetch latencies
through the entire request lifecycle
• Each time segment correspond to a metric
36
How to make it faster?
• Are you network, cpu, disk bound?
• Do you need more threads?
• Where do the threads spend their
time?
37
Capacity Planning
Key metrics that indicate a cluster is near capacity:
• CPU
• Network and thread pool usage
• Request latencies
• Network utilization
• Disk utilization
38
#6 Kafka Monitoring Tip:
By the time you reach 90%
utilization.
It is too late.
39
Summary:
Please Monitor Kafka.
40
Few things to remember…
• Alert on what’s important: Under-Replicated Partitions is a good start
• DON’T JUST FIDDLE WITH STUFF
• AND DON’T RESTART KAFKA FOR LOLS
• If you don’t know what you are doing, it is ok. There’s support (and Cloud) for that.
41
We are hiring.
42
Thank You!

More Related Content

What's hot

What's hot (20)

Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at Uber
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for Prometheus
 
Apache Kafka Security
Apache Kafka Security Apache Kafka Security
Apache Kafka Security
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Apache Kafka® Security Overview
Apache Kafka® Security OverviewApache Kafka® Security Overview
Apache Kafka® Security Overview
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Introducing Confluent Cloud: Apache Kafka as a Service
Introducing Confluent Cloud: Apache Kafka as a Service Introducing Confluent Cloud: Apache Kafka as a Service
Introducing Confluent Cloud: Apache Kafka as a Service
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
 

Similar to Monitoring Apache Kafka

Similar to Monitoring Apache Kafka (20)

Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)
 
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming ApplicationsMetrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
 
10135 b 11
10135 b 1110135 b 11
10135 b 11
 
URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know URP? Excuse You! The Three Metrics You Have to Know
URP? Excuse You! The Three Metrics You Have to Know
 
OnPrem Monitoring.pdf
OnPrem Monitoring.pdfOnPrem Monitoring.pdf
OnPrem Monitoring.pdf
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 
Asynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per second
 
Distributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola ScaleDistributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola Scale
 
Stream Processing @ Lyft
Stream Processing @ LyftStream Processing @ Lyft
Stream Processing @ Lyft
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Perfug 20-11-2019 - Kafka Performances
Perfug 20-11-2019 - Kafka PerformancesPerfug 20-11-2019 - Kafka Performances
Perfug 20-11-2019 - Kafka Performances
 
Building Awesome APIs with Lumen
Building Awesome APIs with LumenBuilding Awesome APIs with Lumen
Building Awesome APIs with Lumen
 
Fixing Domino Server Sickness
Fixing Domino Server SicknessFixing Domino Server Sickness
Fixing Domino Server Sickness
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentation
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
BTV PHP - Building Fast Websites
BTV PHP - Building Fast WebsitesBTV PHP - Building Fast Websites
BTV PHP - Building Fast Websites
 

More from confluent

More from confluent (20)

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 

Recently uploaded

Recently uploaded (20)

Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
The UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoThe UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, Ocado
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 

Monitoring Apache Kafka

  • 1. 1 Monitoring Apache Kafka Gwen Shapira, Principal Data Architect @gwenshap
  • 2. 2 #1 Kafka Monitoring Tip: Please Monitor Apache Kafka.
  • 3. 3 Breaking tuning production databases since 1997 Apache Kafka PMC Principal Data Architect Tweeting a lot @gwenshap
  • 4. 4 Apache Kafka is a distributed system and has many components Controller
  • 5. 5 Things to keep an eye on • Broker health • Message delivery • Performance • Capacity
  • 7. 7 Kafka & Metrics Reporters • Pluggable interface • JMX reporter built in • TONS of metrics
  • 8. 8 TONS of Metrics • Broker throughput • Topic throughput • Disk utilization • Unclean leader elections • Network pool usage • Request pool usage • Request latencies – 30 request types, 5 phases each • Topic partition status counts: online, under replicated, offline • Log flush rates • ZK disconnects • Garbage collection pauses • Message delivery • Consumer groups reading from topics • …​
  • 9. 9 #2 Kafka Monitoring Tip: Have a dashboard that lets you know “is everything ok?” in one look
  • 10. 10 Is the broker process up? • ps –ef | grep kafka • Do we receive metrics?
  • 11. 11 Under-replicated partitions • If you can monitor just one thing… • Is it a specific broker? • Cluster wide: • Out of resources • Imbalance • Broker: • Hardware • Noisy neighbor • Configuration • Garbage collection
  • 12. 12 Drill Down into Broker and Topic: Do we see a problem right here?
  • 13. 13 Check partition placement - is the issue specific to one broker?
  • 14. 14 #3 Kafka Monitoring Tip: Monitor Under-replicated Partitions
  • 15. 15 Canary • Produce an event • Try to consume the event 1s later • Did you get it?
  • 16. 16 Other important health metrics • Active Controller • ZK Disconnects • Unclean leader elections • ISR shrink/expand • # brokers
  • 17. 17 #3 Kafka Monitoring Tip: Don’t Watch the Dashboard
  • 18. 18 PSA regarding broker health • Before 0.11.0.0 – restarting a broker is risky • Even after… • Before 1.0.0 – restarting a broker is slow • Especially with 5000+ partitions per broker Lesson: Only restart if you know why this will fix the issue.
  • 20. 20 Delivery Guarantees • At most once • At least once • Exactly once • … and within N milliseconds
  • 21. 21 Are you meeting your guarantees, right now?
  • 22. 22 #4 Kafka Monitoring Tip: Monitoring brokers isn’t enough. You need to monitor events
  • 23. 23 Every Service that uses Kafka is a Distributed System Orders Service Stock Service Fulfilment Service Fraud Detection Service Mobile App Kafka
  • 24. 24 How to monitor? The infamous LinkedIn “Audit”: • Count messages when they are produced • Count messages when they are consumed • Check timestamps when they are consumed • Compare the results
  • 25. 25 Under Consumption • Reasons for under consumption: • Producers not handling errors and retried correctly • Misbehaving consumers, perhaps the consumer did not follow shutdown sequence • Real-time apps intentionally skipping messages
  • 26. 26 #5 Kafka Monitoring Life Tips: • producer.close(); • retries > 0 • Handle send() exceptions. • Use new consumer (0.10 and up)
  • 27. 27 Over Consumption • Reasons for over consumption • Consumers may be processing a set of messages more than once • Latency may be higher
  • 28. 28 Slow Consumers • Identify consumers and consumer groups that are not keeping up with data production • Compare a slow, lagging consumer (left) to a good consumer (right)
  • 29. 29 Other important client metrics • Producer retries • Producer errors • Consumer message lag – especially trends
  • 31. 31 #5 Kafka Monitoring Tip: Tune the bottlenecks
  • 32. 32 Wrong: “OMG! Log Flush Time is 14ms” Is this high? How often we flush logs? Is it blocking? Who cares? Right: “OMG! We only process 50,000 requests per second. We need 10 times this” Why? IO threads are busy. Why? Waiting for flush. Why are we flushing so often? Log segments keep rotating. Lets configure larger segments!
  • 34. 34 Lifecycle of a request • Client sends request to broker • Network thread gets request and puts on queue • IO thread / handler picks up request and processes • Read / Write from/to local “disk” • Wait for other brokers to ack messages • Put response on queue • Network thread sends response to client
  • 35. 35 Produce and Fetch Request Latencies • Breakdown produce and fetch latencies through the entire request lifecycle • Each time segment correspond to a metric
  • 36. 36 How to make it faster? • Are you network, cpu, disk bound? • Do you need more threads? • Where do the threads spend their time?
  • 37. 37 Capacity Planning Key metrics that indicate a cluster is near capacity: • CPU • Network and thread pool usage • Request latencies • Network utilization • Disk utilization
  • 38. 38 #6 Kafka Monitoring Tip: By the time you reach 90% utilization. It is too late.
  • 40. 40 Few things to remember… • Alert on what’s important: Under-Replicated Partitions is a good start • DON’T JUST FIDDLE WITH STUFF • AND DON’T RESTART KAFKA FOR LOLS • If you don’t know what you are doing, it is ok. There’s support (and Cloud) for that.

Editor's Notes

  1. Lets start with the obvious: If I ask you, “Hey, how is your Kafka cluster doing right now?” you should be able to tell me how many brokers are running and how many partitions are unavailable.
  2. Lets start with the obvious: If I ask you, “Hey, how is your Kafka cluster doing right now?” you should be able to tell me how many brokers are running and how many partitions are unavailable.
  3. Two ways to know if Kafka is up… check the process directly, or check if the other metrics you are getting are up to speed
  4. Lets start with the obvious: If I ask you, “Hey, how is your Kafka cluster doing right now?” you should be able to tell me how many brokers are running and how many partitions are unavailable.
  5. Lets start with the obvious: If I ask you, “Hey, how is your Kafka cluster doing right now?” you should be able to tell me how many brokers are running and how many partitions are unavailable.
  6. Monitoring Kafka isn’t enough – Kafka is part of a system and producers and consumers are involved
  7. Lets start with the obvious: If I ask you, “Hey, how is your Kafka cluster doing right now?” you should be able to tell me how many brokers are running and how many partitions are unavailable.
  8. Lets start with the obvious: If I ask you, “Hey, how is your Kafka cluster doing right now?” you should be able to tell me how many brokers are running and how many partitions are unavailable.