The document discusses Kafka monitoring at LinkedIn. It provides background on Kafka, including brokers, topics, partitions, producers and consumers. It then discusses the history of Kafka monitoring at LinkedIn, including the many host, JVM and broker metrics they collect. Key metrics to monitor include under-replicated partitions, offline partitions, active controller count, and partitions whose in-sync replica count is below the configured minimum. The future of monitoring at LinkedIn will focus on service level objectives and external availability monitoring. The guiding principle is to collect all data but alert on only a few key metrics.
17. Kafka as a service
● Broker/Cluster Health
● Message Delivery
● Performance
● Capacity
18. Metrics you need to know
● URP (under-replicated partitions): partitions that are not fully replicated within the cluster
● Offline Partitions: partitions unavailable for produce and consume
● Active Controller: should always be 1
● Under MinISR Count (UnderMinIsrPartitionCount): count of partitions with ISR < MinISR
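As a rough illustration of how these four metrics could drive a health check, here is a minimal sketch. It assumes the per-broker values have already been scraped (for example over JMX) into plain dictionaries; the function name `check_cluster` and the input layout are hypothetical, and only the metric names come from Kafka itself.

```python
from typing import Dict, List

# One sample per broker, keyed by metric name. This layout is an assumption
# of the sketch; how the values are scraped is out of scope here.
BrokerSample = Dict[str, int]


def check_cluster(samples: Dict[str, BrokerSample]) -> List[str]:
    """Return a list of problems; an empty list means the cluster looks healthy."""
    problems: List[str] = []

    def total(metric: str) -> int:
        # Sum a metric across all brokers, treating a missing value as 0.
        return sum(sample.get(metric, 0) for sample in samples.values())

    urp = total("UnderReplicatedPartitions")
    if urp > 0:
        problems.append(f"{urp} under-replicated partition(s)")

    offline = total("OfflinePartitionsCount")
    if offline > 0:
        problems.append(f"{offline} offline partition(s)")

    # Exactly one broker in the cluster should report itself as controller.
    controllers = total("ActiveControllerCount")
    if controllers != 1:
        problems.append(f"active controller count is {controllers}, expected 1")

    under_min_isr = total("UnderMinIsrPartitionCount")
    if under_min_isr > 0:
        problems.append(f"{under_min_isr} partition(s) with ISR < MinISR")

    return problems
```

Summing across brokers is a simplification: it makes both "no controller" (sum of 0) and "more than one controller" (sum above 1) show up as the same kind of problem.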
24. Offline partition count
● Partition(s) unavailable
● All brokers hosting a replica are down, OR
● no in-sync replica is alive and unclean.leader.election.enable=false
● Potential data loss
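The conditions on this slide can be expressed as a small decision function. This is a simplified model for illustration, not Kafka's actual controller code: the function name `partition_state` and the three state labels are hypothetical, and brokers are represented only by their integer IDs.

```python
from typing import Set


def partition_state(replicas: Set[int], isr: Set[int],
                    live_brokers: Set[int], unclean_election: bool) -> str:
    """Classify one partition from its replica set, ISR and the live brokers."""
    # A live in-sync replica can become leader without losing data.
    if isr & live_brokers:
        return "online"

    # No in-sync replica is alive. With unclean leader election enabled,
    # a live out-of-sync replica may take over, at the risk of data loss.
    if unclean_election and (replicas & live_brokers):
        return "online-unclean"

    # Either every replica is down, or only out-of-sync replicas survive
    # and unclean leader election is disabled.
    return "offline"
```

For example, a partition with replicas {1, 2, 3} and ISR {1, 2} is classified as offline when only broker 3 is alive and unclean election is disabled, and as online-unclean when it is enabled.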
26. Active Controller Count
● What is the Kafka controller?
● Partition management
● There should be only 1 controller
32. Operating System and Hardware Metrics
● Should I be worried?
● What application is causing it?
● Don’t alert unless:
• 100% clear signal
• 100% actionable
36. Future of Monitoring Kafka at LinkedIn
● SLO based
● Monitor and alert on SLO metrics
• Latency – Produce and consume
• Availability – Produce and consume
• Retention
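To make the SLO idea concrete, here is a minimal sketch of evaluating availability and latency objectives from a batch of external probe results. Everything in it is an assumption for illustration: the `Probe` record, the nearest-rank percentile, and the thresholds passed to `meets_slo` are hypothetical and not LinkedIn's actual tooling. Retention is left out because it would need a different kind of measurement.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    ok: bool           # did the produce (or consume) request succeed?
    latency_ms: float  # observed latency of the request


def availability(probes: List[Probe]) -> float:
    """Fraction of probes that succeeded."""
    if not probes:
        raise ValueError("no probes to evaluate")
    return sum(1 for p in probes if p.ok) / len(probes)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(probes: List[Probe], min_availability: float,
              max_p99_ms: float) -> bool:
    """Check both objectives; latency is measured on successful probes only."""
    if availability(probes) < min_availability:
        return False
    latencies = [p.latency_ms for p in probes if p.ok]
    if not latencies:
        return False
    return percentile(latencies, 99) <= max_p99_ms
```

The same check would be run separately for produce probes and consume probes, since the slide lists latency and availability for both paths.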