URP? Excuse You! The Three Kafka Metrics You Need to Know

What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.

We will thoroughly explore three key broker monitoring concepts that will make you an expert at identifying problems with the least amount of pain:

Under-replicated Partitions: The mother of all metrics
Request Latencies: Why your users complain
Thread pool utilization: How could 80% be a problem?
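
For orientation, each of these three concepts corresponds to a metric that the broker exposes over JMX. The following is a minimal Python sketch, assuming the MBean names listed in the Apache Kafka monitoring documentation; the dictionary keys and the helper names (`request_time_mbean`, `handler_utilization`) are illustrative and do not come from the talk.

```python
# MBean names as listed in the Apache Kafka monitoring documentation.
# The dictionary keys and helper names are illustrative assumptions.
BROKER_MBEANS = {
    "under_replicated_partitions":
        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "request_handler_idle_ratio":
        "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent",
}


def request_time_mbean(request: str, stage: str = "TotalTimeMs") -> str:
    """Build the timing MBean name for one request type, e.g. Produce."""
    return f"kafka.network:type=RequestMetrics,name={stage},request={request}"


def handler_utilization(idle_ratio: float) -> float:
    """Convert the request handler idle ratio (0.0 to 1.0) into utilization."""
    if not 0.0 <= idle_ratio <= 1.0:
        raise ValueError("idle ratio must be between 0.0 and 1.0")
    return 1.0 - idle_ratio


if __name__ == "__main__":
    for name in BROKER_MBEANS.values():
        print(name)
    print(request_time_mbean("Produce"))
```

The idle ratio is reported as idle time, so a collector that wants "utilization" has to invert it, as `handler_utilization` does.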

We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
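
One way to picture availability monitoring is as the share of successful external probes, compared against a service level objective. The sketch below is illustrative only: `ProbeResult`, `availability`, `meets_slo`, the 99.9% target, and the 500 ms latency limit are all assumptions made for the example, and the code does not talk to Kafka itself. It only scores results that an external produce-and-consume probe would have collected.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    """Outcome of one external probe (hypothetical structure)."""
    ok: bool            # did the probe message make the round trip?
    latency_ms: float   # how long the round trip took


def availability(results: Sequence[ProbeResult],
                 max_latency_ms: float = 500.0) -> float:
    """Fraction of probes that succeeded within the latency limit."""
    if not results:
        raise ValueError("no probe results to evaluate")
    good = sum(1 for r in results if r.ok and r.latency_ms <= max_latency_ms)
    return good / len(results)


def meets_slo(results: Sequence[ProbeResult],
              target: float = 0.999,
              max_latency_ms: float = 500.0) -> bool:
    """True when measured availability reaches the assumed SLO target."""
    return availability(results, max_latency_ms) >= target
```

Counting a slow probe as a failure is a design choice: it folds the latency objective into the availability number, which keeps the measurement close to what a client would experience.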

Published in: Engineering

  1. URP? Excuse You! Todd Palino, Senior Staff Engineer, Site Reliability, LinkedIn
  2. What This Talk Is Not • What is Kafka • Encyclopedia of Monitoring • Automation
  3. Why Talk About Monitoring?
  4. Messages per Day at LinkedIn
  5. What is Monitoring (not)?
  6. Monitoring is not Alerting • Collect everything • Alert on nothing • Events are better than metrics • Tests are better than alerts • Sleep is best in life
  7. Service Level Objectives • What’s an SLA? • Availability • Latency • Customer Guarantees
  8. Key Kafka Metrics
  9. The Three Metrics You Need to Know • URP: Partitions that are not fully replicated within the cluster • Request Handlers: The overall utilization of an Apache Kafka broker • Request Timing: How long requests are taking, in which stage of processing
  10. Under-Replicated Partitions • Highly discussed • Overall cluster health • Replication is a consumer and producer
  11. Under-Replicated Partitions • Example: Failed Broker
  12. Under-Replicated Partitions • Example: Consumer Problems
  13. Under-Replicated Partitions • Example: Producer Problems
  14. Under-Replicated Partitions • Overrated • Doesn’t map to SLO • Often not actionable • Collect, but don’t alert
  15. Everybody In The Pool • Specialized thread pools • Clients deal with network and request pools • Request handlers do most of the work
  16. Request Handlers • Decode and validate • Perform task • Wait for other brokers • Assemble response
  17. Request Handler Problems • CPU Time: Anything that causes Kafka to expend CPU cycles; includes problems related to failing disks (IO wait); SSL and compression work both can use a lot of CPU • Timeout: Most often due to failing to process controller requests; intra-cluster requests tend to be bound by partition counts; rapidly starves the pool of threads • Deadlock: Should always be a code bug; usually looks exactly like a timeout problem; rare, but hard to identify
  18. Request Handler Problems • Example: Timeout or Deadlock
  19. Request Handler Problems • CPU Time: Anything that causes Kafka to expend CPU cycles; includes problems related to failing disks (IO wait); SSL and compression work both can use a lot of CPU • Timeout: Most often due to failing to process controller requests; intra-cluster requests tend to be bound by partition counts; rapidly starves the pool of threads • Deadlock: Should always be a code bug; usually looks exactly like a timeout problem; rare, but hard to identify
  20. Brokers Don’t Do Compression
  21. Brokers Shouldn’t Do Compression • Up Conversion: Kafka brokers are running a new version; message format has been set to the new version; clients haven’t upgraded • Down Conversion: Kafka brokers are running a new version; message format is set to an older version due to clients; producer clients update to new version
  22. Request Timing • Total – Request handling, end to end • Request Queue – Waiting to process • Local – Work local to the broker • Remote – Waiting for other brokers • Response Queue – Waiting to send • Response Send – Send to client
  23. Request Timing • Example: Produce Total Time
  24. Request Timing • Example: Produce Local Time
  25. Request Timing • Example: Produce Remote Time
  26. Thank you?
  27. What’s Missing?
  28. Availability Monitoring • SLO, part 2 • Measured externally • Client focused • github.com/linkedin/kafka-monitor
  29. Operating System And Hardware Metrics • What do they mean? • What application is causing it? • Don’t alert unless: 100% clear signal; 100% clear response
  30. Capacity Planning • Plan in advance • Multi-factor • Don’t alert for capacity
  31. Capacity Metrics • Request Handler Idle Ratio • Disk Utilization • Partition Count • Network Utilization
  32. Wrapping Up
  33. If You Remember Nothing Else… • Define your service level objectives • Monitor your service level objectives • Metrics that cover many problems are noisy • Buy Kafka: The Definitive Guide
  34. Getting (and Giving) Help • LinkedIn Open Source: Kafka Monitor (https://github.com/linkedin/kafka-monitor); Burrow (https://github.com/linkedin/Burrow); Cruise Control (https://github.com/linkedin/cruise-control); kafka-tools (https://github.com/linkedin/kafka-tools) • Get Involved: Community (users@kafka.apache.org, dev@kafka.apache.org); Bugs and Work (https://issues.apache.org/jira/projects/KAFKA)
  35. Thank you
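
The request timing slide splits a request into queue, local, remote, and response stages. As a rough illustration of how that breakdown can be read, the sketch below picks the stage that accounts for the largest share of the time. The stage names match the broker's request metric names; the function name `dominant_stage` and the sample numbers are hypothetical, and throttle time is left out to match the stages listed on the slide.

```python
from typing import Mapping, Tuple

# Stages of request handling, in processing order, using the broker's
# request metric names. Throttle time is deliberately left out here.
STAGES = (
    "RequestQueueTimeMs",   # waiting to process
    "LocalTimeMs",          # work local to the broker
    "RemoteTimeMs",         # waiting for other brokers
    "ResponseQueueTimeMs",  # waiting to send
    "ResponseSendTimeMs",   # send to client
)


def dominant_stage(timings: Mapping[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its share of the summed stage time."""
    total = sum(timings[stage] for stage in STAGES)
    if total <= 0:
        raise ValueError("stage timings must sum to a positive value")
    slowest = max(STAGES, key=lambda stage: timings[stage])
    return slowest, timings[slowest] / total


if __name__ == "__main__":
    # Hypothetical produce request timings in milliseconds.
    sample = {
        "RequestQueueTimeMs": 1.0,
        "LocalTimeMs": 2.0,
        "RemoteTimeMs": 15.0,
        "ResponseQueueTimeMs": 1.0,
        "ResponseSendTimeMs": 1.0,
    }
    stage, share = dominant_stage(sample)
    print(f"{stage} accounts for {share:.0%} of the request time")
```

With numbers like these, most of the time is spent in the remote stage, which the slide describes as waiting for other brokers, so the place to look is replication rather than the broker that received the request.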