Kafka Metrics and Monitoring
With Prometheus, Grafana, Prometheus-jmx-exporter and graf-db, based on Docker
Touraj Ebrahimi
5/16/2017
Introduction:
In general, Kafka, Zookeeper, and Kafka Connect provide many metrics for monitoring, which can be
exposed and collected through JMX. The metrics fall into several categories, the most important of
which are:
• System Metrics
• Zookeeper Metrics
• Consumer Metrics
• Producer Metrics
• Connect Metrics
• Kafka-Server Metrics
• Kafka-Cluster Metrics
• Kafka-log Metrics
• Kafka-Network Metrics
Of the metrics above, we focus on those that give us tangible information and, in particular, can be
used for system health checks.
Below we describe several important health-check metrics and specify the conditions under which the
system is not in a healthy state.
The following metrics are also recommended for Kafka health checks:
Metric: UnderReplicatedPartitions
  MBean: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  Description: An alert should be emitted when the value is > 0.

Metric: IsrShrinksPerSec, IsrExpandsPerSec
  MBean: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
  MBean: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
  Description: The in-sync replica set should not shrink often. Investigate if it shrinks frequently.

Metric: Request rate
  Before version 0.9: kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestRateAndTimeMs,clientId=([-.\w]+)
  After version 0.9: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
  Description: Average number of requests sent per second.

Metric: BytesPerSec
  Before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=BytesPerSec,clientId=([-.\w]+)
  After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
  Description: Bytes consumed per second.

Metric: MessagesPerSec
  Before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec,clientId=([-.\w]+)
  After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
  Description: Messages consumed per second.

Metric: MinFetchRate
  Before version 0.9: kafka.consumer:type=ConsumerFetcherManager,name=MinFetchRate,clientId=([-.\w]+)
  After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) (attribute: fetch-rate)
  Description: Minimum rate at which a consumer sends fetch requests to the broker.
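The client-ID part of the MBean names above is a regular-expression group (`([-.\w]+)` in regex syntax). As a minimal sketch, assuming a made-up client ID `orders-consumer.1`, the following Python shows how that group can pull the client ID out of a post-0.9 consumer MBean name. The function name is hypothetical and not part of Kafka or any library.

```python
import re

# Pattern taken from the table above; the dot in "kafka.consumer" is escaped
# here so that it only matches a literal dot.
CONSUMER_MBEAN = re.compile(
    r"kafka\.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)"
)


def extract_client_id(mbean_name):
    """Return the client ID embedded in an MBean name, or None if it does not match."""
    match = CONSUMER_MBEAN.fullmatch(mbean_name)
    return match.group(1) if match else None


# "orders-consumer.1" is a made-up client ID used only for illustration.
print(extract_client_id(
    "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=orders-consumer.1"
))
```

The group accepts word characters, dots, and hyphens, so a name containing a space or another separator is rejected rather than partially matched.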
The metrics recommended above for health checks are described as follows.
Recommended by: Gwen Shapira, System Architect at Confluent
UnderReplicatedPartitions: In a healthy cluster, the number of in sync replicas (ISRs) should be
exactly equal to the total number of replicas. If partition replicas fall too far behind their leaders, the
follower partition is removed from the ISR pool, and you should see a corresponding increase in
IsrShrinksPerSec. Since Kafka’s high-availability guarantees cannot be met without replication,
investigation is certainly warranted should this metric value exceed zero for extended time periods.
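One way to express "exceeds zero for extended time periods" is to look at the most recent readings and alert only when all of them are above zero. The sketch below assumes the readings have already been collected into a list; the function name and the default window of three samples are illustrative choices, not values from Kafka or Prometheus.

```python
def under_replicated_too_long(samples, max_consecutive=3):
    """Return True if the last `max_consecutive` samples are all above zero.

    `samples` is a chronological list of UnderReplicatedPartitions readings.
    """
    if len(samples) < max_consecutive:
        return False  # not enough history to call it an extended period
    return all(value > 0 for value in samples[-max_consecutive:])


print(under_replicated_too_long([0, 0, 2, 1, 3]))
print(under_replicated_too_long([0, 2, 0, 1, 1]))
```

A short blip that returns to zero does not trigger the check, which matches the advice to investigate only sustained non-zero values.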
IsrShrinksPerSec/IsrExpandsPerSec: The number of in-sync replicas (ISRs) for a particular
partition should remain fairly static; the only exceptions are when you are expanding your broker
cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a
minimum number of ISRs for failover. A replica could be removed from the ISR pool for a couple of
reasons: it is too far behind the leader’s offset (user-configurable by setting the
replica.lag.max.messages configuration parameter, which was removed in Kafka 0.9 in favor of
replica.lag.time.max.ms), or it has not contacted the leader for some time
(configurable with the replica.socket.timeout.ms parameter). No matter the reason, an increase in
IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter is cause
for concern and requires user intervention. The Kafka documentation provides a wealth of
information on the user-configurable parameters for brokers.
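The "shrink without a corresponding expand shortly thereafter" condition can be sketched as a check over two series of per-interval event counts. Everything here is an assumption made for illustration: the function name, the equal-length input lists, and the grace period of two intervals.

```python
def unrecovered_isr_shrinks(shrink_counts, expand_counts, grace_intervals=2):
    """Return indexes of intervals whose ISR shrinks were not followed by expansions.

    Both arguments are equal-length lists of per-interval event counts in
    chronological order. A shrink at interval i counts as recovered if any
    expansion occurs in intervals i .. i + grace_intervals (inclusive).
    Intervals whose grace window has not fully elapsed are not judged yet.
    """
    flagged = []
    for i, shrinks in enumerate(shrink_counts):
        if i + grace_intervals >= len(expand_counts):
            break  # the grace window for this and later intervals is incomplete
        if shrinks <= 0:
            continue
        window = expand_counts[i:i + grace_intervals + 1]
        if sum(window) == 0:
            flagged.append(i)
    return flagged


shrinks = [0, 1, 0, 0, 2, 0, 0, 0]
expands = [0, 0, 1, 0, 0, 0, 0, 0]
print(unrecovered_isr_shrinks(shrinks, expands))
```

In this made-up series, the shrink in interval 1 is followed by an expansion in interval 2, while the shrinks in interval 4 never recover and are the ones worth investigating.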
Request rate: The request rate is the rate at which producers send data to brokers. Of course, what
constitutes a healthy request rate will vary drastically depending on the use case. Keeping an eye on
peaks and drops is essential to ensure continuous service availability. If rate-limiting is not enabled
(version 0.9+), in the event of a traffic spike brokers could slow to a crawl as they struggle to process
a rapid influx of data.
BytesPerSec: As with producers and brokers, you will want to monitor your consumer network
throughput. For example, a sudden drop in MessagesPerSec could indicate a failing consumer, but if
its BytesPerSec remains constant, it’s still healthy, just consuming fewer, larger-sized messages.
Observing traffic volume over time, in the context of other metrics, is important for diagnosing
anomalous network usage.
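The distinction drawn above (a drop in MessagesPerSec with a steady BytesPerSec versus a drop in both) can be sketched as a comparison of two consecutive samples. The function name, the labels it returns, and both thresholds are assumptions chosen for illustration and should be tuned per workload.

```python
def classify_message_rate_drop(prev_msgs, curr_msgs, prev_bytes, curr_bytes,
                               drop_ratio=0.5, steady_tolerance=0.1):
    """Classify a change in consumer throughput between two samples.

    Returns "ok" when the message rate has not dropped below
    `drop_ratio` of its previous value, "larger-messages" when it dropped
    but the byte rate stayed within `steady_tolerance` of its previous
    value, and "investigate" otherwise.
    """
    if prev_msgs <= 0 or curr_msgs >= prev_msgs * drop_ratio:
        return "ok"
    if prev_bytes > 0 and abs(curr_bytes - prev_bytes) <= prev_bytes * steady_tolerance:
        return "larger-messages"
    return "investigate"


print(classify_message_rate_drop(1000, 400, 5_000_000, 5_100_000))
print(classify_message_rate_drop(1000, 400, 5_000_000, 2_000_000))
```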
MessagesPerSec: The rate of messages consumed per second may not strongly correlate with the
rate of bytes consumed because messages can be of variable size. Depending on your producers
and workload, in typical deployments you should expect this number to remain fairly constant. By
monitoring this metric over time, you can discover trends in your data consumption and create a
baseline against which you can alert. Again, the shape of this graph depends entirely on your use
case, but in many cases, establishing a baseline and alerting on anomalous behavior is possible.
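A simple way to establish a baseline and alert on anomalous behavior is to compare each new reading with the mean and standard deviation of recent history. This is only a sketch of one common approach, not the method used by any particular monitoring product; the function name and the default of three standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, num_stddevs=3.0):
    """Return True if `current` lies more than `num_stddevs` sample standard
    deviations from the mean of `history` (earlier MessagesPerSec readings)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline  # perfectly flat history: any change stands out
    return abs(current - baseline) > num_stddevs * spread


recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 101))
print(is_anomalous(recent, 40))
```

A workload with strong daily or weekly cycles would need a baseline built per time window rather than one global mean.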
MinFetchRate: The fetch rate of a consumer can be a good indicator of overall consumer health. A
minimum fetch rate approaching a value of zero could potentially signal an issue on the consumer.
In a healthy consumer, the minimum fetch rate will usually be non-zero, so if you see this value
dropping, it could be a sign of consumer failure.
Monitoring System Health:
To get a better view of the metrics and their categories, we configured Kafka, Zookeeper, and Kafka
Connect in their Docker containers with the following JMX ports and JMX hosts:
Zookeeper :: JMXPORT=55001 :: JMXHOST=172.16.159.95
Kafka :: JMXPORT=55002 :: JMXHOST=172.16.159.95
Kafka Connect :: JMXPORT=55003 :: JMXHOST=172.16.159.95
We can now connect to them remotely with JConsole, monitor them, and inspect the available metrics.
To do this, use the MBeans tab in JConsole:
Grafana Suggested Dashboard for Monitoring Kafka:
Download link: https://grafana.com/api/dashboards/721/revisions/1/download
Download link: https://github.com/rama-nallamilli/kafka-prometheus-monitoring/blob/master/dashboards/Kafka.json
To configure Prometheus, JMX Exporter, Zookeeper, Kafka, and Grafana, we can take the following
workflow, which is in fact a Docker-Compose file, as a starting point and run it:
To collect metrics from Prometheus-jmx-exporter (named projmxexpo here), configure prometheus.yml
as follows:
prometheus.yml
global:
  scrape_interval: 10s
  evaluation_interval: 10s
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
          - projmxexpo:5556
The following config.yml must be provided to Prometheus-jmx-exporter, either by mounting it with
docker -v or by manually editing the default file inside the Docker container:
config.yml
lowercaseOutputName: true
jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
rules:
  - pattern : kafka.network<type=Processor, name=IdlePercent, networkProcessor=(.+)><>Value
  - pattern : kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>OneMinuteRate
  - pattern : kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>Value
  - pattern : kafka.server<type=ReplicaFetcherManager, name=MaxLag, clientId=(.+)><>Value
  - pattern : kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>OneMinuteRate
  - pattern : kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate
  - pattern : kafka.server<type=Produce><>queue-size
  - pattern : kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)
  - pattern : kafka.server<type=controller-channel-metrics, broker-id=(.+)><>(.*)
  - pattern : kafka.server<type=socket-server-metrics, networkProcessor=(.+)><>(.*)
  - pattern : kafka.server<type=Fetch><>queue-size
  - pattern : kafka.server<type=SessionExpireListener, name=(.+)><>OneMinuteRate
  - pattern : kafka.controller<type=KafkaController, name=(.+)><>Value
  - pattern : kafka.controller<type=ControllerStats, name=(.+)><>OneMinuteRate
  - pattern : kafka.cluster<type=Partition, name=UnderReplicated, topic=(.+), partition=(.+)><>Value
  - pattern : kafka.utils<type=Throttler, name=cleaner-io><>OneMinuteRate
  - pattern : kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value
  - pattern : java.lang<type=(.*)>
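To see what one of these rules selects, the sketch below applies the ReplicaManager pattern to a few candidate strings. It rests on two assumptions: that the exporter matches each rule, unanchored, against a string of the form domain<key=value, ...><>attribute, and that Python's re module is close enough to the Java regex engine for this simple pattern. The candidate strings and the function name are illustrative.

```python
import re

# Rule copied from config.yml above. The unescaped dot matches any character,
# exactly as it would in the original rule.
RULE = re.compile(r"kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)")


def match_rule(candidate):
    """Return (name, attribute) if the candidate string matches RULE, else None."""
    found = RULE.search(candidate)
    return found.groups() if found else None


print(match_rule("kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"))
print(match_rule("kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>OneMinuteRate"))
print(match_rule("kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>Count"))
```

Under these assumptions the rule keeps the Value and OneMinuteRate attributes of every ReplicaManager MBean and ignores other attributes such as Count.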
Example for JMXURL:
jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
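The same URL shape applies to each of the three JMX endpoints configured earlier. As a small sketch, the helper below (a hypothetical name) builds the jmxUrl value from a host and port, using the host and ports listed in this document.

```python
def jmx_service_url(host, port):
    """Build the RMI-based JMX service URL for the given host and port."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


# Host and ports as configured in this document.
JMX_HOST = "172.16.159.95"
JMX_PORTS = {"Zookeeper": 55001, "Kafka": 55002, "Kafka Connect": 55003}

for service, port in JMX_PORTS.items():
    print(service, jmx_service_url(JMX_HOST, port))
```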
Docker Commands:
Prometheus-jmx-exporter:
docker run -d --name projmxexpo -p 5556:5556 \
  -v "/root/config.yml:/opt/jmx_exporter/config.yml" \
  --link kafka:kafka --link zookeeper:zookeeper quay.io/toraj58/pro-jmx-exporter
Prometheus:
docker run -d --name prometheus -p 9090:9090 \
  -v "/root/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --link projmxexpo:projmxexpo quay.io/toraj58/prometheus
Grafana:
docker run -d --name grafanarc -p 3000:3000 --link prometheus:prometheus quay.io/toraj58/grafanarc
Prometheus:
After starting the Prometheus Docker container, we can open its UI at the following URL:
http://172.16.159.95:9090
We can then add any number of graphs to monitor the desired metrics.
Prometheus-jmx-exporter:
After starting the Prometheus-jmx-exporter Docker container and exposing port 5556 to the host, we
can open the following URL to see the metrics:
http://172.16.159.95:5556/metrics
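The /metrics page is plain text with one sample per line, plus comment lines that start with #. The sketch below parses such text into a dictionary. The metric names in SAMPLE are made up for illustration; the real names depend on the exporter rules. Optional trailing timestamps are deliberately not handled.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {metric_with_labels: float_value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. Each remaining
    line is split at its last space into the metric (with labels) and its value.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical sample; real metric names come from the exporter configuration.
SAMPLE = """\
# HELP example_under_replicated_partitions hypothetical gauge
# TYPE example_under_replicated_partitions gauge
example_under_replicated_partitions 0.0
example_bytes_in_rate{topic="orders"} 1234.5
"""

for metric, value in parse_metrics(SAMPLE).items():
    print(metric, value)
```

In practice the text would be fetched from the exporter URL above, for example with urllib.request.urlopen, before being passed to the parser.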
Grafana:
After starting the containers and configuring the whole system with the .yml files, JSON files, etc.
described in this document, we can see the customized Grafana dashboard for Kafka monitoring, as
follows:
If we run the docker ps and docker images commands, the output gives an overview of the containers
and images configured for the monitoring system:
Configured Grafana for monitoring our event bus with Kafka:
References:
https://github.com/rama-nallamilli/kafka-prometheus-monitoring
https://www.robustperception.io/monitoring-kafka-with-prometheus/
https://grafana.net/dashboards/721
https://blog.serverdensity.com/how-to-monitor-kafka/
https://www.serverdensity.com/
http://docs.confluent.io/3.0.0/kafka/monitoring.html
http://debezium.io/docs/monitoring/
http://126kr.com/article/6kaq7meq2pf
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Revolutionize Your Video Editing with InVideo.io: A Comprehensive ReviewRevolutionize Your Video Editing with InVideo.io: A Comprehensive Review
Revolutionize Your Video Editing with InVideo.io: A Comprehensive Review
 
full course of software engineering mid term.pdf
full course of software engineering mid term.pdffull course of software engineering mid term.pdf
full course of software engineering mid term.pdf
 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software Projects
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 

Kafka monitoring and metrics

  • 1. 5/16/2017 Kafka Metrics and Monitoring with Prometheus, Grafana, Prometheus-jmx-exporter and graf-db, based on Docker. Touraj Ebrahimi
  • 2. Introduction: In general, Kafka, Zookeeper and Kafka Connect provide a large number of metrics for monitoring, which can be exposed and collected through JMX. The metrics fall into several categories, the most important of which are:
     - System Metrics
     - Zookeeper Metrics
     - Consumer Metrics
     - Producer Metrics
     - Connect Metrics
     - Kafka-Server Metrics
     - Kafka-Cluster Metrics
     - Kafka-log Metrics
     - Kafka-Network Metrics
     Among the metrics above, we focus on those that give us tangible information and, in particular, can be used for health checks of the system.
     Below we describe several important health-check metrics and specify the conditions under which the system's state is not healthy.
     The following metrics are also suggested for Kafka health checks:

     Metric: UnderReplicatedPartitions
       MBean name: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
       Description: An alert should be emitted when the value is > 0.

     Metric: IsrShrinksPerSec, IsrExpandsPerSec
       MBean name: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
       MBean name: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
       Description: The in-sync replica set should not shrink often; frequent shrinking should be investigated.

     Metric: Request rate
       Before version 0.9: kafka.producer:type=ProducerRequestMetrics,name=ProducerRequestRateAndTimeMs,clientId=([-.\w]+)
       After version 0.9: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
       Description: Average number of requests sent per second.

     Metric: BytesPerSec
       Before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=BytesPerSec,clientId=([-.\w]+)
       After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
       Description: Bytes consumed per second.

     Metric: MessagesPerSec
       Before version 0.9: kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec,clientId=([-.\w]+)
       After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
       Description: Messages consumed per second.

     Metric: MinFetchRate
       Before version 0.9: kafka.consumer:type=ConsumerFetcherManager,name=MinFetchRate,clientId=([-.\w]+)
       After version 0.9: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) (attribute: fetch-rate)
       Description: Minimum rate at which a consumer sends fetch requests to the broker.
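The trailing group in the client MBean names above is a regular expression over the client ID (word characters, dots and hyphens). As a minimal sketch, assuming the object names are available as plain strings, the same expression could be used to recover the client ID from either naming scheme. The helper `client_id` and the sample names in the usage note are hypothetical and not part of the slides.

```python
import re

# Client IDs may contain word characters, dots and hyphens.
CLIENT_ID = r"([-.\w]+)"

# Older names use "clientId=", newer ones use "client-id=".
PATTERNS = {
    "before_0.9": re.compile(r"clientId=" + CLIENT_ID),
    "after_0.9": re.compile(r"client-id=" + CLIENT_ID),
}


def client_id(mbean_name):
    """Return (scheme, client id) for an MBean name, or None if neither scheme matches."""
    for scheme, pattern in PATTERNS.items():
        match = pattern.search(mbean_name)
        if match:
            return scheme, match.group(1)
    return None
```

For example, `client_id("kafka.consumer:type=consumer-fetch-manager-metrics,client-id=my-app.1")` yields the newer scheme with the ID `my-app.1`, while a broker MBean such as the ReplicaManager ones carries no client ID and yields `None`.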
  • 3. The explanations for the health-check metrics suggested above are as follows.
     Suggested by: Gwen Shapira, System Architect at Confluent

     UnderReplicatedPartitions: In a healthy cluster, the number of in-sync replicas (ISRs) should be exactly equal to the total number of replicas. If partition replicas fall too far behind their leaders, the follower partition is removed from the ISR pool, and you should see a corresponding increase in IsrShrinksPerSec. Since Kafka's high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods.

     IsrShrinksPerSec/IsrExpandsPerSec: The number of in-sync replicas (ISRs) for a particular partition should remain fairly static; the only exceptions are when you are expanding your broker cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a minimum number of ISRs for failover. A replica could be removed from the ISR pool for a couple of reasons: it is too far behind the leader's offset (user-configurable by setting the replica.lag.max.messages configuration parameter), or it has not contacted the leader for some time (configurable with the replica.socket.timeout.ms parameter). No matter the reason, an increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter is cause for concern and requires user intervention. The Kafka documentation provides a wealth of information on the user-configurable parameters for brokers.

     Request rate: The request rate is the rate at which producers send data to brokers. Of course, what constitutes a healthy request rate will vary drastically depending on the use case. Keeping an eye on peaks and drops is essential to ensure continuous service availability. If rate-limiting is not enabled (version 0.9+), in the event of a traffic spike brokers could slow to a crawl as they struggle to process a rapid influx of data.

     BytesPerSec: As with producers and brokers, you will want to monitor your consumer network throughput. For example, a sudden drop in MessagesPerSec could indicate a failing consumer, but if its BytesPerSec remains constant, it is still healthy, just consuming fewer, larger-sized messages. Observing traffic volume over time, in the context of other metrics, is important for diagnosing anomalous network usage.

     MessagesPerSec: The rate of messages consumed per second may not strongly correlate with the rate of bytes consumed because messages can be of variable size. Depending on your producers and workload, in typical deployments you should expect this number to remain fairly constant. By monitoring this metric over time, you can discover trends in your data consumption and create a baseline against which you can alert. Again, the shape of this graph depends entirely on your use case, but in many cases, establishing a baseline and alerting on anomalous behavior is possible.

     MinFetchRate: The fetch rate of a consumer can be a good indicator of overall consumer health. A minimum fetch rate approaching zero could signal an issue on the consumer. In a healthy consumer, the minimum fetch rate will usually be non-zero, so if you see this value dropping, it could be a sign of consumer failure.
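The conditions described above can be sketched as a small rule check over one snapshot of metric values. This is only an illustration under stated assumptions: the function `health_alerts`, the dictionary layout, and the `min_fetch_floor` threshold are hypothetical, and a real check would look at values over a time window (the text speaks of "extended time periods" and "shortly thereafter") rather than a single sample.

```python
def health_alerts(sample, min_fetch_floor=0.1):
    """Return a list of alert strings for one snapshot of metric values.

    `sample` is a dict keyed by the metric names used in the slides.
    `min_fetch_floor` is an illustrative threshold, not a Kafka default.
    """
    alerts = []

    # Any under-replicated partition is worth investigating.
    if sample.get("UnderReplicatedPartitions", 0) > 0:
        alerts.append("under-replicated partitions present")

    # ISR shrinking with no expansion means replicas are not catching up.
    shrinks = sample.get("IsrShrinksPerSec", 0.0)
    expands = sample.get("IsrExpandsPerSec", 0.0)
    if shrinks > 0 and expands == 0:
        alerts.append("ISR shrinking without a matching expansion")

    # A fetch rate close to zero may point to a failing consumer.
    if "MinFetchRate" in sample and sample["MinFetchRate"] < min_fetch_floor:
        alerts.append("consumer fetch rate near zero")

    return alerts
```

A snapshot with no under-replicated partitions, no ISR changes and a non-zero fetch rate produces an empty list; each violated condition adds one entry.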
  • 4. Monitoring System Health: To get a better view of the metrics and their categories, we configured Kafka, Zookeeper and Kafka Connect with the following JMX port and JMX host on their Docker containers:
     Zookeeper :: JMXPORT=55001 :: JMXHOST=172.16.159.95
     Kafka :: JMXPORT=55002 :: JMXHOST=172.16.159.95
     Kafka Connect :: JMXPORT=55003 :: JMXHOST=172.16.159.95
     We can now connect to them remotely with JConsole, monitor them, and inspect the available metrics. To do so, use the MBeans tab in JConsole:
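For remote connections it can help to have the full JMX service URL for each container. The sketch below assumes the standard RMI connector form, `service:jmx:rmi:///jndi/rmi://HOST:PORT/jmxrmi`, which is the same form the exporter configuration in this deck uses; the names `JMX_PORTS`, `jmx_url` and `urls` are hypothetical.

```python
# Host and ports as listed on the slide.
JMX_HOST = "172.16.159.95"
JMX_PORTS = {"zookeeper": 55001, "kafka": 55002, "kafka-connect": 55003}


def jmx_url(host, port):
    """Build a JMX service URL for the RMI connector."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


# One URL per service, e.g. urls["kafka"] for the broker.
urls = {name: jmx_url(JMX_HOST, port) for name, port in JMX_PORTS.items()}
```

JConsole's remote-process field should accept either such a URL or the shorter `host:port` form.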
  • 5.
  • 6.
  • 7.
  • 8.
  • 9. Grafana Suggested Dashboard for Monitoring Kafka:
     Download link: https://grafana.com/api/dashboards/721/revisions/1/download
     Download link: https://github.com/rama-nallamilli/kafka-prometheus-monitoring/blob/master/dashboards/Kafka.json
  • 10. To configure Prometheus, JMX Exporter, Zookeeper, Kafka and Grafana, we can take the workflow below, which is in fact a Docker Compose file, as a starting point and run it:
  • 11.
  • 12. We can configure prometheus.yml to scrape metrics from Prometheus-jmx-exporter (named projmxexpo here) as follows:
     prometheus.yml
       global:
         scrape_interval: 10s
         evaluation_interval: 10s
       scrape_configs:
         - job_name: 'kafka'
           static_configs:
             - targets:
                 - projmxexpo:5556
  • 13. The following config.yml must be provided to Prometheus-jmx-exporter (either by mounting it with docker -v or by manually altering the default file inside the Docker container):
     config.yml
       lowercaseOutputName: true
       jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
       rules:
         - pattern : kafka.network<type=Processor, name=IdlePercent, networkProcessor=(.+)><>Value
         - pattern : kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>OneMinuteRate
         - pattern : kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>Value
         - pattern : kafka.server<type=ReplicaFetcherManager, name=MaxLag, clientId=(.+)><>Value
         - pattern : kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>OneMinuteRate
         - pattern : kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate
         - pattern : kafka.server<type=Produce><>queue-size
         - pattern : kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)
         - pattern : kafka.server<type=controller-channel-metrics, broker-id=(.+)><>(.*)
         - pattern : kafka.server<type=socket-server-metrics, networkProcessor=(.+)><>(.*)
         - pattern : kafka.server<type=Fetch><>queue-size
         - pattern : kafka.server<type=SessionExpireListener, name=(.+)><>OneMinuteRate
         - pattern : kafka.controller<type=KafkaController, name=(.+)><>Value
         - pattern : kafka.controller<type=ControllerStats, name=(.+)><>OneMinuteRate
         - pattern : kafka.cluster<type=Partition, name=UnderReplicated, topic=(.+), partition=(.+)><>Value
         - pattern : kafka.utils<type=Throttler, name=cleaner-io><>OneMinuteRate
         - pattern : kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value
         - pattern : java.lang<type=(.*)>
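Each `pattern` above is a regular expression that the exporter tests against a string built from the MBean domain, its properties and the attribute name. The sketch below mimics that matching in Python for two of the rules. It assumes the match is unanchored and that the input has the shape `domain<key=value, ...><>attribute: value`; the sample input strings and the helper `first_match` are hypothetical, and Java and Python regex behaviour is only assumed to coincide for patterns this simple.

```python
import re

# Two rules copied from the config.yml on the slide.
RULES = [
    r"kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)",
    r"kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value",
]


def first_match(bean_string):
    """Return (rule, captured groups) for the first rule that matches, else None."""
    for rule in RULES:
        match = re.search(rule, bean_string)
        if match:
            return rule, match.groups()
    return None
```

An MBean attribute that no rule matches returns `None`, which corresponds to a metric that would not be exported by a rule list like the one above.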
  • 14. Example for JMXURL:
       jmxUrl: service:jmx:rmi:///jndi/rmi://172.16.159.95:55002/jmxrmi
     Docker Commands:
     Prometheus-jmx-exporter:
       docker run -d --name projmxexpo -p 5556:5556 -v "/root/config.yml:/opt/jmx_exporter/config.yml" --link kafka:kafka --link zookeeper:zookeeper quay.io/toraj58/pro-jmx-exporter
     Prometheus:
       docker run -d --name prometheus -p 9090:9090 -v "/root/prometheus.yml:/etc/prometheus/prometheus.yml" --link projmxexpo:projmxexpo quay.io/toraj58/prometheus
     Grafana:
       docker run -d --name grafanarc -p 3000:3000 --link prometheus:prometheus quay.io/toraj58/grafanarc
  • 15. Prometheus: After starting the Prometheus Docker container, we can open its UI at the following URL: http://172.16.159.95:9090 We can then add any number of graphs to monitor the desired metrics.
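Besides the web UI, Prometheus also answers instant queries over its HTTP API at `/api/v1/query`. The sketch below only builds such a query URL against the address from the slide. The metric name in the example is an assumption about how the exporter names the ReplicaManager gauge; the exporter's own metrics page shows the real names. The helper `instant_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Prometheus address as given on the slide.
PROMETHEUS = "http://172.16.159.95:9090"


def instant_query_url(expr, base=PROMETHEUS):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": expr})
    return f"{base}/api/v1/query?{params}"


# Assumed metric name; check the exporter output for the actual one.
url = instant_query_url('kafka_server_replicamanager_value{name="UnderReplicatedPartitions"}')
```

The resulting URL can be fetched with any HTTP client, and the response is a JSON document describing the matching series.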
  • 16. Prometheus-jmx-collector: After starting the Prometheus-jmx-collector Docker container and exposing port 5556 to the host, we can open the following URL to see the metrics: http://172.16.159.95:5556/metrics
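The metrics page is plain text with one sample per line, in the form `name{labels} value`, plus `#` comment lines. A minimal sketch of reading such output follows. It assumes no trailing timestamps on the lines, and the `SAMPLE` text is made up for illustration rather than captured from the exporter.

```python
def parse_metrics(text):
    """Parse simple `name{labels} value` lines; comments and blank lines are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is everything after the last space on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical excerpt of a metrics page.
SAMPLE = """\
# TYPE kafka_server_replicamanager_value gauge
kafka_server_replicamanager_value{name="UnderReplicatedPartitions"} 0.0
kafka_server_replicamanager_value{name="PartitionCount"} 12.0
"""

metrics = parse_metrics(SAMPLE)
```

The result maps each series string to its numeric value, which is enough for a quick manual check before wiring up dashboards.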
  • 17. Grafana: After starting the Docker containers and configuring the whole system with the .yml files, JSON files, etc. described in this document, we can see the customized Grafana dashboard for Kafka monitoring, like the following:
  • 18. If we run the docker ps and docker images commands, we should see something like the following, which gives an overview of the containers and images we have set up for the monitoring system: Configured Grafana for monitoring our event bus with Kafka:
  • 19.