
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications



When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?

Experienced Apache Kafka users know what is important to monitor, which alerts are critical and how to respond to them. They don’t just collect metrics - they go the extra mile and use additional tools to validate availability and performance on both the Kafka cluster and their entire data pipelines.

In this presentation we’ll discuss best practices of monitoring Apache Kafka. We’ll look at which metrics are critical to alert on, which are useful in troubleshooting and what may actually be misleading. We’ll review a few “worst practices” - common mistakes that you should avoid. We’ll then look at what metrics don’t tell you - and how to cover those essential gaps.

Published in: Technology


  1. Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications. Gwen Shapira, Product Manager, @gwenshap
  2. Monitoring Distributed Systems is hard. “Google SRE team with 10–12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.”
  3. Apache Kafka is a distributed system and has many components.
  4. Many Moving Parts to Watch • Producers • Consumers • Consumer Groups • Brokers • Controller • ZooKeeper • Topics • Partitions • Messages • …
  5. And many metrics to monitor • Broker throughput • Topic throughput • Disk utilization • Unclean leader elections • Network pool usage • Request pool usage • Request latencies – 30 request types, 5 phases each • Topic partition status counts: online, under-replicated, offline • Log flush rates • ZooKeeper disconnects • Garbage collection pauses • Message delivery • Consumer groups reading from topics • …
  6. Every Service that uses Kafka is a Distributed System (diagram: an Orders Service, Stock Service, Fulfilment Service, Fraud Detection Service and Mobile App all exchanging data through Kafka)
  7. It is all CRITICAL to your business • Real-time applications mean very little room for error • Is Kafka available and performing well? You need to know before your users do • You must detect and act on small problems before they escalate • The business cares a lot about accuracy and SLAs: it is 8:05am; does the dashboard reflect the status of the system up to 8am? • Continuously improve performance: monitor Kafka cluster performance, and identify and act on leading indicators of future problems • Quick triage: can you identify likely causes of a problem quickly and effectively?
  8. So you may need a bit of help • Operators must have visibility into the health of the Kafka cluster • The business must have visibility into the completeness and latency of message delivery • Everyone needs to focus on the most meaningful metrics
  9. Types of monitoring • Tailing logs • OS metrics • Kafka / client metrics • Tracing applications • Event-level sampling • APM – application performance from the user’s perspective • …
  11. Monitor System Health of Your Cluster
  12. The basics • Whatever else you do, check that the broker process is running: use an external agent, or alert on stale metrics • Don’t alert on everything; fewer, high-level alerts are better
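The “alert on stale metrics” idea from slide 12 can be sketched as a simple freshness check, assuming broker metrics are scraped into some timestamped store (the broker names and the 60-second threshold here are illustrative, not from the deck):

```python
def broker_is_stale(last_metric_ts: float, now: float, max_age_s: float = 60.0) -> bool:
    """Treat a broker as suspect when its newest metric sample is too old."""
    return (now - last_metric_ts) > max_age_s

# Hypothetical last-scrape timestamps per broker, in epoch seconds.
now = 1_700_000_000.0
last_seen = {"broker-1": now - 15, "broker-2": now - 300}

# A stale broker is either down or unreachable; either way, someone should look.
stale = sorted(b for b, ts in last_seen.items() if broker_is_stale(ts, now))
```

This is the “or alert on stale metrics” alternative to an external liveness agent: if the pipeline that ships metrics stops, the silence itself becomes the alert.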
  13. First Things First
  14. Under-replicated partitions • If you can monitor just one thing, monitor this • Is it a specific broker? • Cluster-wide causes: out of resources, imbalance • Broker-specific causes: hardware, noisy neighbor, configuration
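The under-replicated check, and the “is it a specific broker?” question, can be sketched from partition metadata: a partition is under-replicated when its in-sync replica set (ISR) is smaller than its assigned replica set. The topic names and broker IDs below are made up for illustration:

```python
from collections import Counter

def under_replicated(partitions):
    """A partition is under-replicated when its ISR is smaller than its replica set."""
    return [p for p in partitions if len(p["isr"]) < len(p["replicas"])]

# Hypothetical partition metadata, as a cluster-describe might report it.
partitions = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "replicas": [2, 3, 1], "isr": [2, 1]},
    {"topic": "stock",  "partition": 0, "replicas": [3, 1, 2], "isr": [1, 2]},
]

urp = under_replicated(partitions)
# Count which brokers are missing from ISRs; one broker standing out
# suggests a broker-specific cause rather than a cluster-wide one.
missing = Counter(b for p in urp for b in set(p["replicas"]) - set(p["isr"]))
```

Here broker 3 is absent from both shrunken ISRs, which points the triage at that one broker (hardware, noisy neighbor, configuration) rather than at cluster-wide resource exhaustion.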
  15. Drill Down into Broker and Topic: do we see a problem right here?
  16. Check partition placement: is the issue specific to one broker?
  17. Don’t just watch the dashboard • Control Center detects anomalous events in monitoring data • Users can define triggers • Control Center performs customizable actions when triggers fire • When troubleshooting Kafka issues, users can view previous alerts and the historical message delivery data from the time an alert occurred
  18. Capacity Planning – Be Proactive • Capacity planning ensures that your cluster can continue to meet business demands • Control Center provides indicators that a cluster may need more brokers • Key metrics that indicate a cluster is near capacity: CPU; network and thread pool usage; request latencies; network utilization (throughput, per broker and per cluster); disk utilization (disk space used by all log segments, per broker)
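The proactive side of disk-utilization monitoring can be sketched as a simple linear projection: fit a growth rate from two usage samples and estimate how long until the log directories fill. The sample values and 1 TB capacity are made up; real capacity planning would use more samples and account for retention:

```python
def days_until_full(samples, capacity_bytes):
    """Linear projection from (day_index, bytes_used) samples to days of headroom."""
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (d1 - d0)  # bytes of growth per day
    if rate <= 0:
        return float("inf")      # usage flat or shrinking: no projected fill date
    return (capacity_bytes - u1) / rate

# Hypothetical weekly snapshot for one broker: 400 GB -> 470 GB in 7 days.
samples = [(0, 400e9), (7, 470e9)]
headroom_days = days_until_full(samples, capacity_bytes=1e12)
```

At ~10 GB/day of growth against 530 GB of remaining space, this broker has roughly 53 days of headroom, which is exactly the kind of leading indicator the slide argues for acting on before it becomes an outage.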
  19. Multi-Cluster Deployments • Monitor all clusters in one place
  20. Monitor End-to-End Message Delivery
  21. Are You Meeting SLAs? • Stream monitoring helps you determine whether all messages are delivered end-to-end in a timely manner • This matters for several reasons: ensure producers and consumers are not losing messages; check whether consumers are consuming more than expected; verify low latency for real-time applications; identify slow consumers
  22. How to monitor? The infamous LinkedIn “Audit” • Count messages when they are produced • Count messages when they are consumed • Check timestamps when they are consumed • Compare the results
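The audit’s “compare the results” step can be sketched as a per-bucket reconciliation of produce and consume counts. The minute buckets and counts below are illustrative; a real audit would key by topic and consumer group as well:

```python
def audit(produced, consumed):
    """Compare per-minute message counts from the produce and consume sides.
    A shortfall suggests lost or skipped messages; an excess suggests reprocessing."""
    report = {}
    for bucket in sorted(set(produced) | set(consumed)):
        p, c = produced.get(bucket, 0), consumed.get(bucket, 0)
        if c < p:
            report[bucket] = ("under-consumed", p - c)
        elif c > p:
            report[bucket] = ("over-consumed", c - p)
    return report

# Hypothetical per-minute counts reported from each side of the pipeline.
produced = {"08:00": 1000, "08:01": 1000, "08:02": 1000}
consumed = {"08:00": 1000, "08:01": 950,  "08:02": 1100}
report = audit(produced, consumed)
```

Buckets where the counts match simply drop out of the report, so what remains maps directly onto the under- and over-consumption cases the next two slides describe.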
  23. Message delivery metrics • Streaming message delivery metrics are available in aggregate, per consumer group, and per topic
  24. Under-Consumption • Reasons for under-consumption: producers not handling errors and retries correctly; misbehaving consumers (perhaps a consumer did not follow the shutdown sequence); real-time apps intentionally skipping messages • Red bars indicate that some messages were not consumed • A herringbone pattern can indicate an error in measurement, usually an improper shutdown of a client
  25. Over-Consumption • Reasons for over-consumption: consumers may be processing a set of messages more than once, which may impact their applications • Consumption bars are higher than the expected consumption lines • Latency may be higher
  26. Slow Consumers • Identify consumers and consumer groups that are not keeping up with data production • Use the per-consumer and per-consumer-group metrics • Compare a slow, lagging consumer (left) to a healthy consumer (right): the slow consumer is processing all the messages, but with high latency • Slow consumers may also process fewer messages in a given time window, so monitor “Expected consumption” (the top line)
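Identifying consumers that are not keeping up usually comes down to consumer lag: the broker’s log-end offset minus the group’s committed offset, per partition. The offsets and the 1,000-message alert threshold below are illustrative:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition: log-end offset minus the group's committed offset."""
    return {tp: log_end_offsets[tp] - committed_offsets.get(tp, 0)
            for tp in log_end_offsets}

# Hypothetical offsets for one consumer group, keyed by (topic, partition).
log_end   = {("orders", 0): 5_000, ("orders", 1): 5_200}
committed = {("orders", 0): 4_990, ("orders", 1): 1_200}

lag = consumer_lag(log_end, committed)
# A small, steady lag is normal; a large or growing one marks a slow consumer.
slow = {tp: n for tp, n in lag.items() if n > 1_000}
```

The absolute number matters less than the trend: a consumer 4,000 messages behind and falling further back is the “slow consumer (left)” pattern the slide describes.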
  27. Optimize Performance
  28. Identify Performance Bottlenecks • Real-time applications require high throughput or low latency • Baseline where you are, and monitor for changes to get ahead of problems • To identify performance bottlenecks, break down the times for the end-to-end dataflow to see where streams spend the most processing time • The key metrics to look at: request latencies, network pool usage, request pool usage
  29. Produce and Fetch Request Latencies • Break down produce and fetch latencies through the entire request lifecycle • Request latency values can be shown at the median, 95th, 99th, or 99.9th percentile
  30. Request Latencies Explained (1) • Total request latency: the total time of an entire request lifecycle, from the broker’s point of view • Request queue: the time the request waits in the request queue for an I/O thread; a high value can indicate there are not enough I/O threads or that CPU is a bottleneck (also check: what are those I/O threads doing?) • Request local: the time the request is processed locally by the leader; a high value can imply a slow disk, so monitor broker disk I/O
  31. Request Latencies Explained (2) • Response remote: the time the request waits on other brokers; higher times are expected on high-reliability or high-throughput systems; a high value can indicate a slow network connection, or that the consumer is caught up to the end of the log • Response queue: the time the response waits in the response queue for a network thread; a high value can imply there are not enough network threads • Response send: the time the response is sent back to the client; a high value can imply that CPU or network is a bottleneck
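Because total request latency decomposes into these phases, a first triage step is simply asking which phase dominates. The p99 numbers below are made-up illustrative values, not real measurements:

```python
def worst_phase(breakdown):
    """Given per-phase latency for one request type, return the dominant phase."""
    return max(breakdown, key=breakdown.get)

# Hypothetical p99 breakdown of a produce request, in milliseconds.
p99 = {
    "request-queue":  2.0,
    "local":          1.5,
    "remote":        40.0,   # time spent waiting on follower replicas
    "response-queue": 0.5,
    "response-send":  1.0,
}
bottleneck = worst_phase(p99)
```

Here “remote” dominating the total would steer the investigation toward replication and the network between brokers, rather than toward disks or thread pools.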
  32. Network and Request Handler Threads • Network pool usage: the average network pool capacity usage across all brokers, i.e. the fraction of time the network processor threads are not idle; if it is above 70%, isolate the bottleneck with the request latency breakdown, and consider increasing the broker configuration parameter num.network.threads, especially if the Response queue metric is high and you have spare resources • Request pool usage: the average request handler capacity usage across all brokers, i.e. the fraction of time the request handler threads are not idle; if it is above 70%, isolate the bottleneck with the request latency breakdown, and consider increasing num.io.threads, especially if the Request queue metric is high • Why are all your handlers busy? Check GC, access patterns and disk I/O
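The 70% rule of thumb from this slide is easy to turn into an automated check; the function names and alert strings below are illustrative, while `num.network.threads` and `num.io.threads` are the actual Kafka broker settings that size these two pools:

```python
def pool_alerts(network_usage, request_usage, threshold=0.7):
    """Flag thread pools whose average busy fraction exceeds the threshold."""
    alerts = []
    if network_usage > threshold:
        alerts.append("network pool high: consider raising num.network.threads")
    if request_usage > threshold:
        alerts.append("request pool high: consider raising num.io.threads")
    return alerts

# Hypothetical cluster-wide averages: network threads busy 85% of the time.
alerts = pool_alerts(network_usage=0.85, request_usage=0.40)
```

As the slide cautions, raising thread counts treats the symptom; the latency breakdown, GC behavior, and disk I/O should confirm the cause first.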
  33. Summary
  34. A few things to remember… • Monitor Kafka • Work with your developers to monitor critical applications end-to-end • More data is better: metrics + logs + OS + APM + … • But fewer alerts are better • Alert on what’s important: under-replicated partitions is a good start • DON’T JUST FIDDLE WITH STUFF • AND DON’T RESTART KAFKA FOR LOLS • If you don’t know what you are doing, that’s OK; there’s support (and Cloud) for that
  35. And as you start your production Kafka journey… Plan → Validate → Deploy → Observe → Analyze
  36. Thank You!