Kafka Monitoring Metrics

The document discusses Kafka monitoring at LinkedIn. It provides background on Kafka, including brokers, topics, partitions, producers and consumers. It then covers the history of Kafka monitoring at LinkedIn, including the many host, JVM and broker metrics collected. Key metrics to monitor include under-replicated partitions, offline partitions, active controller count, and partitions whose in-sync replica count is below the configured minimum. The future of monitoring at LinkedIn will focus on service level objectives and external availability monitoring. The central point: monitoring collects everything, while alerting fires on only a few key metrics.

Apache Kafka
Monitoring vs Alerting
Ratish Ravindran
Sr. Site Reliability Engineer - LinkedIn

Agenda
Kafka 101
History of Kafka monitoring at LinkedIn
Key Metrics to alert on and why
Future of Kafka monitoring at LinkedIn
Q&A

Brokers and Clusters
● Broker: a single Kafka server
● Cluster: multiple brokers

Topics and Partitions
● Topics: categorized messages
● Partitions: each topic is broken down into partitions

Leader and Replica
● Leader: broker owning the partition
● Replica: broker with a copy of the partition

Controller
● Broker with more responsibilities
● Partition management
• Leader election
• Leader Switch
• New topic and partition
• New broker

Common Metrics
● Host metrics
● JVM metrics
● Broker or Application metrics

Host Metrics
● Load average
● Memory usage (free, used, %)
● CPU
● Inbound/Outbound network stats
● Disk usage
● RAID stats
● Inode (free, used, %)
● IO stats
● Disk states
● TCP connections (established count, errors, etc.)
~ 320 host metrics

JVM Metrics
● Minor/Full GC time
● Minor/Full GC count
● Total thread count
● Used thread count
● Total heap size available
● Total heap size used

Broker Metrics
● Bytes and messages in/out rate
● Fetch/produce queue size
● Request handler idle ratio
● Request pool usage
● Request latencies – at different stages
● Log flush rate
● Produce and consume latency
2100 broker metrics and 165 graphs

Kafka at LinkedIn
● 2500+ Kafka brokers
● 1 PB In per day
● 3.9 PB Out per day
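
As a rough sense of scale, the figures above can be turned into per-broker averages. This is only a back-of-the-envelope sketch: it assumes decimal petabytes, exactly 2500 brokers (the slide says "2500+"), and traffic spread evenly across brokers and across the day, none of which the slides state.

```python
# Back-of-the-envelope averages from the slide's figures.
# Assumptions (not from the slides): 1 PB = 10**15 bytes, exactly 2500
# brokers, and traffic spread evenly over brokers and over the day.
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def per_broker_mb_per_s(pb_per_day: float, brokers: int) -> float:
    """Average throughput per broker in decimal megabytes per second."""
    bytes_per_second = pb_per_day * PB / SECONDS_PER_DAY
    return bytes_per_second / brokers / 10**6


inbound = per_broker_mb_per_s(1.0, 2500)
outbound = per_broker_mb_per_s(3.9, 2500)
out_to_in = 3.9 / 1.0  # bytes out per byte in

print(f"in: {inbound:.1f} MB/s, out: {outbound:.1f} MB/s, out/in: {out_to_in:.1f}x")
```

Real brokers will deviate from these averages, since load is rarely uniform across clusters or hours.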

Kafka as a service
● Broker/Cluster Health
● Message Delivery
● Performance
● Capacity

Metrics you need to know
● URP: partitions that are not fully replicated within the cluster
● Offline Partitions: partitions unavailable for produce and consume
● Active Controller: should always be 1
● Under MinISR Count: count of partitions with ISR < MinISR
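
The four metrics above can be checked with a few lines of code. The sketch below pairs each one with the JMX MBean name that Kafka brokers expose and a healthy-value test. The input dictionary of cluster-wide totals is a hypothetical shape chosen for illustration, and treating any nonzero partition count as a violation is an assumption; the slides only state outright that the active controller count should be 1.

```python
from typing import Callable, Dict, List, Tuple

# Metric name -> (JMX MBean name, predicate that is True when healthy).
# The zero thresholds are an assumption; "ActiveControllerCount == 1"
# comes from the slide.
KEY_METRICS: Dict[str, Tuple[str, Callable[[int], bool]]] = {
    "UnderReplicatedPartitions": (
        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
        lambda v: v == 0,
    ),
    "OfflinePartitionsCount": (
        "kafka.controller:type=KafkaController,name=OfflinePartitionsCount",
        lambda v: v == 0,
    ),
    "ActiveControllerCount": (
        "kafka.controller:type=KafkaController,name=ActiveControllerCount",
        lambda v: v == 1,
    ),
    "UnderMinIsrPartitionCount": (
        "kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount",
        lambda v: v == 0,
    ),
}


def violations(cluster_totals: Dict[str, int]) -> List[str]:
    """Return the key metrics whose cluster-wide total is unhealthy."""
    return [
        name
        for name, (_mbean, healthy) in KEY_METRICS.items()
        if not healthy(cluster_totals[name])
    ]


if __name__ == "__main__":
    sample = {
        "UnderReplicatedPartitions": 3,
        "OfflinePartitionsCount": 0,
        "ActiveControllerCount": 1,
        "UnderMinIsrPartitionCount": 0,
    }
    print(violations(sample))
```

How the totals are collected (a JMX exporter, an agent, or something else) is left out on purpose, since the slides do not describe that pipeline.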

Under Replicated Partitions
● Highly discussed
● Overall cluster health
● Replication acts as both a consumer and a producer
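
To make the definition concrete: a partition counts as under-replicated when its in-sync replica set (ISR) is smaller than its assigned replica set. The sketch below applies that rule to a hypothetical in-memory snapshot; the `PartitionState` type and its fields are invented for illustration and are not a Kafka API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition (not a Kafka API type)."""
    topic: str
    partition: int
    replicas: FrozenSet[int]  # broker ids assigned to hold a copy
    isr: FrozenSet[int]       # broker ids currently in sync with the leader


def under_replicated(partitions: List[PartitionState]) -> List[PartitionState]:
    """Partitions whose in-sync set is smaller than the assigned replica set."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


if __name__ == "__main__":
    snapshot = [
        PartitionState("events", 0, frozenset({1, 2, 3}), frozenset({1, 2, 3})),
        PartitionState("events", 1, frozenset({1, 2, 3}), frozenset({2})),
    ]
    for p in under_replicated(snapshot):
        print(f"{p.topic}-{p.partition} is under-replicated: isr={sorted(p.isr)}")
```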

Offline partition count
● Partition(s) unavailable
● All brokers hosting replica down OR
● unclean.leader.election.enable=false
● Potential data loss

Active Controller Count
● What is the Kafka controller?
● Partition management
● There should be only 1 controller
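
Each broker reports its own ActiveControllerCount (1 if it believes it is the controller, 0 otherwise), so the "only 1 controller" rule is a statement about the sum across the cluster. A minimal sketch of that check follows; the status labels and the input mapping of broker id to reported value are assumptions made for illustration.

```python
from typing import Dict


def controller_status(active_controller_count: Dict[int, int]) -> str:
    """Classify the cluster from each broker's reported ActiveControllerCount.

    Keys are broker ids, values are the 0/1 gauge each broker reports.
    The returned labels are hypothetical names, not Kafka terminology.
    """
    total = sum(active_controller_count.values())
    if total == 1:
        return "ok"
    if total < 1:
        return "no-controller"
    return "multiple-controllers"


if __name__ == "__main__":
    print(controller_status({1: 0, 2: 1, 3: 0}))
    print(controller_status({1: 0, 2: 0, 3: 0}))
    print(controller_status({1: 1, 2: 1, 3: 0}))
```

A brief sum of 0 can occur while a new controller is being elected, so a real check would likely require the condition to persist before alerting.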

UnderMinIsrPartitionCount
● min.insync.replicas
● in-sync replicas count < min.insync.replicas
● Per-broker metrics
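
The condition in the bullets above, sketched as code: a partition is under its minimum when the ISR size is smaller than `min.insync.replicas`, and the count is attributed to the broker leading that partition, which is what makes it a per-broker metric. The `Partition` tuple is a hypothetical snapshot type, not a Kafka API.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Set


class Partition(NamedTuple):
    """Hypothetical snapshot of one partition (not a Kafka API type)."""
    leader: int               # broker id of the current leader
    isr: Set[int]             # broker ids currently in sync
    min_insync_replicas: int  # effective min.insync.replicas for the topic


def under_min_isr_by_broker(partitions: Iterable[Partition]) -> Dict[int, int]:
    """Count partitions with len(ISR) < min.insync.replicas, keyed by leader."""
    counts: Counter = Counter()
    for p in partitions:
        if len(p.isr) < p.min_insync_replicas:
            counts[p.leader] += 1
    return dict(counts)


if __name__ == "__main__":
    snapshot = [
        Partition(leader=1, isr={1, 2, 3}, min_insync_replicas=2),
        Partition(leader=1, isr={1}, min_insync_replicas=2),
        Partition(leader=2, isr={2}, min_insync_replicas=2),
    ]
    print(under_min_isr_by_broker(snapshot))
```

This matters to clients because producers using `acks=all` cannot write to a partition while it is below the minimum.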

UnderMinIsrPartitionCount
[Screenshot: graphs of offline partitions, URP, and UnderMinIsrPartitionCount]

Operating System and Hardware Metrics
● Should I be worried?
● What application is causing it?
● Don’t alert unless:
• 100% clear signal
• 100% actionable

Capacity Planning
● Plan in advance
● Multi-factor
● Don’t alert for capacity

Capacity Metrics
● Request Handler Idle Ratio
● Disk Utilization
● Partition Count
● Network Utilization
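
One way to read "multi-factor" capacity planning is that the factor closest to its limit determines the remaining headroom. The sketch below illustrates that reading only; the factor names, the sample numbers, and the idea of normalizing every factor to a 0..1 utilization are assumptions, not something the slides specify.

```python
from typing import Dict, Tuple


def utilization_from_idle_ratio(idle_ratio: float) -> float:
    """Convert a request handler idle ratio (1.0 = fully idle) to utilization."""
    return 1.0 - idle_ratio


def tightest_factor(utilization: Dict[str, float]) -> Tuple[str, float]:
    """Return (factor, headroom) for the factor closest to its limit.

    Every value is a fraction of that factor's limit in the range 0..1.
    """
    name = max(utilization, key=lambda k: utilization[k])
    return name, 1.0 - utilization[name]


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    current = {
        "request_handlers": utilization_from_idle_ratio(0.70),
        "disk": 0.62,
        "partition_count": 0.40,
        "network": 0.35,
    }
    factor, headroom = tightest_factor(current)
    print(f"tightest factor: {factor}, headroom: {headroom:.0%}")
```

In keeping with "Don't alert for capacity", output like this would feed a planning report rather than a pager.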

Future of Monitoring Kafka at LinkedIn
● SLO based
● Monitor and alert on SLO metrics
• Latency – Produce and consume
• Availability – Produce and consume
• Retention
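
The latency and availability SLO metrics above can be derived from a series of produce/consume probe results. The following is a minimal sketch under stated assumptions: the `Probe` record is hypothetical, availability is taken as the fraction of successful probes, and the percentile uses the nearest-rank method. The slides do not say how LinkedIn computes these values.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    """Hypothetical result of one produce-then-consume round trip."""
    ok: bool
    latency_ms: Optional[float]  # None when the probe failed


def availability(probes: List[Probe]) -> float:
    """Fraction of probes that succeeded."""
    if not probes:
        raise ValueError("no probes recorded")
    return sum(1 for p in probes if p.ok) / len(probes)


def latency_percentile(probes: List[Probe], pct: float) -> float:
    """Nearest-rank percentile of latency over successful probes."""
    values = sorted(
        p.latency_ms for p in probes if p.ok and p.latency_ms is not None
    )
    if not values:
        raise ValueError("no successful probes")
    rank = max(1, math.ceil(pct / 100 * len(values)))
    return values[rank - 1]


if __name__ == "__main__":
    window = [Probe(True, 10.0), Probe(True, 20.0), Probe(True, 30.0), Probe(False, None)]
    print(f"availability: {availability(window):.2%}")
    print(f"p99 latency: {latency_percentile(window, 99)} ms")
```

An alert would then compare these values against the SLO targets, which the slides do not list.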

Availability Monitoring
● Measured externally
● Client focused
● github.com/linkedin/kafka-monitor

Monitoring is not alerting
● Collect everything
● Alert on almost nothing
● SLO based monitoring

Getting (and Giving) Help
• Community
users@kafka.apache.org
dev@kafka.apache.org
• Bugs and Work
https://issues.apache.org/jira/projects/KAFKA
• Kafka Monitor
https://github.com/linkedin/kafka-monitor
• Burrow
https://github.com/linkedin/Burrow
• Cruise Control
https://github.com/linkedin/cruise-control
• kafka-tools
https://github.com/linkedin/kafka-tools


