1) Apache Kafka is a distributed system with many moving parts to monitor, including brokers, topics, partitions, and the applications that use Kafka. It is critical to monitor Kafka performance to ensure high availability and catch problems early.
2) Key metrics to monitor include partition replication, broker resource usage, request latencies, and end-to-end message delivery. Monitoring message rates and comparing production to consumption helps identify issues like under- or over-consumption.
3) Identifying performance bottlenecks like slow request handling or network saturation helps optimize the Kafka cluster. Drilling down on request latency metrics provides insight into where bottlenecks exist in the request lifecycle.
Monitor Kafka and Streaming Apps with Metrics and Logs
1. 1
Metrics Are Not Enough
Gwen Shapira, Product Manager
@gwenshap
Monitoring Apache Kafka and Streaming Applications
2. 2
Monitoring Distributed Systems is hard
“Google SRE team with 10–12 members
typically has one or sometimes two members
whose primary assignment is to build and maintain
monitoring systems for their service.”
https://www.oreilly.com/ideas/monitoring-distributed-systems
4. 4
Many Moving Parts to Watch
• Producers
• Consumers
• Consumer Groups
• Brokers
• Controller
• Zookeeper
• Topics
• Partitions
• Messages
• …
5. 5
And many metrics to monitor
• Broker throughput
• Topic throughput
• Disk utilization
• Unclean leader elections
• Network pool usage
• Request pool usage
• Request latencies – 30 request types, 5 phases each
• Topic partition status counts: online, under-replicated, offline
• Log flush rates
• ZK disconnects
• Garbage collection pauses
• Message delivery
• Consumer groups reading from topics
• …
6. 6
Every Service that uses Kafka is a Distributed System
Orders
Service
Stock
Service
Fulfilment
Service
Fraud Detection
Service
Mobile App
Kafka
7. 7
It is all CRITICAL to your business
• Real-time applications mean very little room for errors
• Is Kafka available and performing well? You need to know before your users do.
• You must detect and act on small problems before they escalate
• The business cares a lot about accuracy and SLAs
• It is 8:05am, does the dashboard reflect the status of the system up to 8am?
• Continuously improve performance
• Monitor Kafka cluster performance
• Identify and act on leading indicators of future problems
• Quick triage – can you identify likely causes of a problem quickly and effectively?
8. 8
So you may need a bit of help
• Operators must have visibility into the health of the Kafka cluster
• The business must have visibility into completeness and latency of message delivery
• Everyone needs to focus on the most meaningful metrics
9. 9
Types of monitoring
• Tailing logs
• OS metrics
• Kafka / Client metrics
• Tracing applications
• Event level sampling
• APM – Application performance from user perspective
• …
12. 12
The basics
• Whatever else you do: Check that the broker process is running
• External agent
• Or alert on stale metrics
• Don’t alert on everything. Fewer, high-level alerts are better.
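A minimal sketch of the "alert on stale metrics" idea: if a broker has not reported a metric sample recently, treat it as possibly down. The function names, the broker labels, and the 120-second default are illustrative assumptions, not part of Kafka or any monitoring product.

```python
from typing import Dict, List


def is_stale(last_report_ts: float, now: float, max_age_s: float = 120.0) -> bool:
    """Return True when the newest metric sample is older than max_age_s seconds."""
    return (now - last_report_ts) > max_age_s


def stale_brokers(last_seen: Dict[str, float], now: float,
                  max_age_s: float = 120.0) -> List[str]:
    """Return the (hypothetical) broker names whose metrics have gone stale."""
    return sorted(b for b, ts in last_seen.items() if is_stale(ts, now, max_age_s))
```

In practice `now` would come from `time.time()` and `last_seen` from whatever metrics store is in use; one alert for the resulting list keeps the alert count low.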
14. 14
Under-replicated partitions
• If you can monitor just one thing…
• Is it a specific broker?
• Cluster wide:
• Out of resources
• Imbalance
• Broker:
• Hardware
• Noisy neighbor
• Configuration
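One way to answer "is it a specific broker?" is to compare each partition's replica list with its in-sync replica (ISR) list and count which brokers are missing from the ISR. The sketch below assumes the replica and ISR lists have already been collected (for example from a topic description); the function names and the three result labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Partition = Tuple[List[int], List[int]]  # (replica broker ids, in-sync broker ids)


def out_of_sync_brokers(partitions: List[Partition]) -> Counter:
    """Count, per broker id, how many partitions it should replicate but is not in sync for."""
    lagging: Counter = Counter()
    for replicas, isr in partitions:
        for broker in set(replicas) - set(isr):
            lagging[broker] += 1
    return lagging


def classify(partitions: List[Partition]) -> str:
    """Label the situation as healthy, tied to one broker, or spread across the cluster."""
    lagging = out_of_sync_brokers(partitions)
    if not lagging:
        return "healthy"
    if len(lagging) == 1:
        return "specific broker"
    return "cluster-wide"
```

A "specific broker" result points toward hardware, a noisy neighbor, or configuration on that broker; "cluster-wide" points toward resource exhaustion or imbalance.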
17. 17
Don’t watch the dashboard
• Control Center detects anomalous events in monitoring data
• Users can define triggers
• Control Center performs customizable actions when triggers occur
• When troubleshooting Kafka issues, users can view previous alerts and historical message delivery data at the time the alert occurred
18. 18
Capacity Planning – Be Proactive
• Capacity planning ensures that your cluster can continue to meet business demands
• Control Center provides indicators if a cluster may need more brokers
• Key metrics that indicate a cluster is near capacity:
• CPU
• Network and thread pool usage
• Request latencies
• Network utilization - Throughput, per broker and per cluster
• Disk utilization - Disk space used by all log segments, per broker
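As a rough illustration of being proactive about disk utilization, the sketch below projects how many days remain before a broker's log disk fills up, assuming one usage sample per day and linear growth. This is a simplification for illustration only, not how Control Center computes its indicators; the function name and units are assumptions.

```python
from typing import List, Optional


def days_until_full(daily_used_gb: List[float], capacity_gb: float) -> Optional[float]:
    """Linear projection of days left, from one disk-usage sample per day.

    Returns None when there are too few samples or usage is not growing.
    """
    if len(daily_used_gb) < 2:
        return None
    growth_per_day = (daily_used_gb[-1] - daily_used_gb[0]) / (len(daily_used_gb) - 1)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - daily_used_gb[-1]) / growth_per_day
```

Retention settings cap how much data a topic keeps, so real growth often flattens; the projection is most useful as a leading indicator, not a precise forecast.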
21. 21
Are You Meeting SLAs?
• Stream monitoring helps you determine if all messages are delivered end-to-end in a timely manner
• This is important for several reasons:
• Ensure producers and consumers are not losing messages
• Check if consumers are consuming more than expected
• Verify low latency for real-time applications
• Identify slow consumers
22. 22
How to monitor?
The infamous LinkedIn “Audit”:
• Count messages when they are produced
• Count messages when they are consumed
• Check timestamps when they are consumed
• Compare the results
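The audit steps above can be sketched as follows: bucket messages by produce timestamp, count produced and consumed messages per bucket, record the worst produce-to-consume latency, and compare. The 15-second window, the input shapes, and the status labels are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bucket(ts_ms: int, window_ms: int) -> int:
    """Start of the time window that contains ts_ms."""
    return ts_ms - ts_ms % window_ms


def audit(produced_ts: List[int], consumed: List[Tuple[int, int]],
          window_ms: int = 15_000) -> Dict[int, dict]:
    """Compare produced and consumed counts per window.

    consumed holds (produce_ts_ms, consume_ts_ms) pairs, so consumption is
    attributed to the window in which the message was produced.
    """
    produced = Counter(bucket(t, window_ms) for t in produced_ts)
    consumed_counts: Counter = Counter()
    max_latency: Dict[int, int] = {}
    for produce_ts, consume_ts in consumed:
        b = bucket(produce_ts, window_ms)
        consumed_counts[b] += 1
        max_latency[b] = max(max_latency.get(b, 0), consume_ts - produce_ts)

    report = {}
    for b in sorted(set(produced) | set(consumed_counts)):
        p, c = produced[b], consumed_counts[b]
        status = "ok" if p == c else ("under" if c < p else "over")
        report[b] = {"produced": p, "consumed": c, "status": status,
                     "max_latency_ms": max_latency.get(b)}
    return report
```

A window is only meaningful once it has had time to be fully consumed, so a real audit would delay the comparison for recent windows.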
24. 24
Under Consumption
• Reasons for under consumption:
• Producers not handling errors and retries correctly
• Misbehaving consumers, for example a consumer that did not follow the shutdown sequence
• Real-time apps intentionally skipping messages
• Red bars indicate some messages were not consumed
• Herringbone pattern can indicate error in measurement
• Usually improper shutdown of client
25. 25
Over Consumption
• Reasons for over consumption
• Consumers may be processing a set of messages more than once, which may have an impact on their applications
• Consumption bars are higher than the expected consumption lines
• Latency may be higher
26. 26
Slow Consumers
• Identify consumers and consumer groups that are not keeping up with data production
• Use the per-consumer and per-consumer group metrics
• Compare a slow, lagging consumer (left) to a good consumer (right)
• The slow consumer (left) is processing all the messages, but with high latency
• Slow consumers may also process fewer messages in a given time window, so monitor "Expected consumption" (the top line)
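Outside of a dashboard, a common way to spot a consumer that is not keeping up is offset lag: the log end offset minus the group's committed offset, per partition. The sketch below assumes both offset maps have already been fetched; the function names are hypothetical, and treating a missing commit as offset 0 is a simplifying assumption. It is not necessarily how Control Center derives its per-consumer views.

```python
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]


def consumer_lag(end_offsets: Dict[TopicPartition, int],
                 committed: Dict[TopicPartition, int]) -> Dict[TopicPartition, int]:
    """Per-partition lag: log end offset minus committed offset, floored at zero."""
    return {tp: max(end - committed.get(tp, 0), 0) for tp, end in end_offsets.items()}


def is_falling_behind(total_lag_samples: List[int]) -> bool:
    """True when total lag grew across every pair of consecutive samples."""
    pairs = zip(total_lag_samples, total_lag_samples[1:])
    return len(total_lag_samples) >= 2 and all(later > earlier for earlier, later in pairs)
```

A steady non-zero lag usually means added latency; a lag that keeps growing means the consumer group cannot keep up with production.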
28. 28
Identify Performance Bottlenecks
• Real-time applications require high throughput or low latency
• Need to baseline where you are
• Monitor for changes to get ahead of the problem
• You may need to identify performance bottlenecks
• Break down the times for the end-to-end dataflow to give you pointers to where streams are taking the most processing time
• The key metrics to look at include:
• Request latencies
• Network pool usage
• Request pool usage
29. 29
Produce and Fetch Request Latencies
Break down produce and fetch latencies through the entire request lifecycle
Request latency values can be shown at the median, 95th, 99th, or 99.9th percentile
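To show why the higher percentiles matter, here is a small nearest-rank percentile sketch over raw latency samples. It is only an illustration: brokers and dashboards typically compute percentiles from histograms, and the function name is an assumption.

```python
import math
from typing import List


def percentile(samples_ms: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]
```

With 98 requests at 5 ms and two slow outliers, the median and 95th percentile still read 5 ms, while the 99th and 99.9th percentiles expose the outliers.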
30. 30
Request Latencies Explained (1)
• Total request latency (center)
• Total time of an entire request lifecycle, from the broker point of view
• Request queue
• The time the request is in the request queue waiting for an IO thread
• A high value can indicate there are not enough IO threads or CPU is a bottleneck
• Also check: What are those IO threads doing?
• Request local
• The time the request is being processed locally by the leader
• A high value can imply slow disk so monitor broker disk IO
31. 31
Request Latencies Explained (2)
• Response remote
• The time the request is waiting on other brokers
• Higher times are expected on high-reliability or high-throughput systems
• A high value can indicate a slow network connection, or the consumer is caught up to the end of the log
• Response queue
• The time the request is in the response queue waiting for a network thread
• A high value can imply there are not enough network threads
• Response send
• The time the response is being sent back to the client
• A high value can imply the CPU or network is a bottleneck
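The breakdown from the two slides above can be turned into a simple triage helper: find the phase that dominates the total and return the matching hint. The phase keys are hypothetical labels chosen for this sketch; in the broker's JMX request metrics the corresponding timings are, to the best of my knowledge, exposed as RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs.

```python
from typing import Dict, Tuple

# Hints paraphrase the slides; keys are illustrative labels, not metric names.
HINTS = {
    "request_queue": "not enough IO threads or CPU bottleneck; check what IO threads are doing",
    "local": "possibly slow disk; monitor broker disk IO",
    "remote": "waiting on other brokers; slow network, or consumer caught up to end of log",
    "response_queue": "possibly not enough network threads",
    "response_send": "CPU or network may be a bottleneck",
}


def dominant_phase(breakdown_ms: Dict[str, float]) -> Tuple[str, str]:
    """Return the phase with the largest latency and the hint associated with it."""
    phase = max(HINTS, key=lambda name: breakdown_ms.get(name, 0.0))
    return phase, HINTS[phase]
```

The hint is a starting point for investigation, not a diagnosis; a high remote time, for instance, is expected on high-reliability or high-throughput systems.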
32. 32
Network and Request Handler Threads
• Network pool usage
• Average network pool capacity usage across all brokers, i.e. the fraction of time the network processor threads are not idle
• If network pool usage is above 70%, isolate bottleneck with the request latency breakdown
• Consider increasing the broker configuration parameter num.network.threads, especially if Response queue metric is high and you have resources
• Request pool usage
• Average request handler capacity usage across all brokers, i.e. the fraction of time the request handler threads are not idle
• If request pool usage is above 70%, isolate bottleneck with the request latency breakdown
• Consider increasing the broker configuration parameter num.io.threads, especially if Request queue metric is high
• Why are all your handlers busy? Check GC, access patterns and disk IO
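Since pool usage is the fraction of time threads are not idle, it can be derived from an average-idle reading as 1 minus idle. The sketch below applies the 70% guideline from this slide; it assumes idle values are reported as fractions between 0 and 1 (as the broker's average-idle metrics are, as far as I know), and the function names and return shape are hypothetical.

```python
from typing import List, Tuple


def pool_usage(avg_idle_fraction: float) -> float:
    """Usage is the fraction of time the threads are not idle."""
    return 1.0 - avg_idle_fraction


def busy_pools(network_idle: float, handler_idle: float,
               threshold: float = 0.70) -> List[Tuple[str, str]]:
    """Return (pool, broker setting to review) for each pool above the usage threshold."""
    findings = []
    if pool_usage(network_idle) > threshold:
        findings.append(("network", "num.network.threads"))
    if pool_usage(handler_idle) > threshold:
        findings.append(("request", "num.io.threads"))
    return findings
```

A finding is a prompt to look at the request latency breakdown first; adding threads only helps if the matching queue latency is high and the broker has spare resources.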
34. 34
A few things to remember…
• Monitor Kafka
• Work with your developers to monitor critical applications end-to-end
• More data is better: Metrics + logs + OS + APM + …
• But fewer alerts are better
• Alert on what’s important – Under-Replicated Partitions is a good start
• DON’T JUST FIDDLE WITH STUFF
• AND DON’T RESTART KAFKA FOR LOLS
• If you don’t know what you are doing, it is ok. There’s support (and Cloud) for that.
35. 35
And as you start your Production Kafka Journey…
Plan
Validate
Deploy
Observe
Analyze