Kafka infrastructure monitoring

Madrid offices: C/ Francisco Silvela, 54 Duplicado 1ºD 28028
Tel.: 91 080 82 44
Barcelona offices: C/ Madrazo 27-29 4ª 08006
Tel.: 933 68 52 46

$intro --help
1. Important metrics
2. Open source Kafka tools
3. The Landoop Stack
4. The Confluent Stack

KAFKA MONITORING: Brokers
Kafka metrics:
- UnderReplicatedPartitions: In a healthy cluster, the number of in-sync replicas
(ISRs) should equal the total number of replicas, so this metric should be zero. If
a follower replica falls too far behind its leader, it is removed from the ISR
pool, and you should see a corresponding increase in IsrShrinksPerSec.
- IsrShrinksPerSec/IsrExpandsPerSec: The number of in-sync replicas (ISRs) for a
particular partition should remain fairly static; the only exceptions are when you
are expanding your broker cluster or removing partitions. An increase in
IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly
thereafter is cause for concern and requires user intervention.
- ActiveControllerCount: The first node to boot in a Kafka cluster automatically
becomes the controller, and there can be only one, so the sum of this metric
across all brokers should be exactly 1. The controller is responsible for
maintaining the list of partition leaders and coordinating leadership transitions.
- OfflinePartitionsCount (controller only): This metric reports the number of
partitions without an active leader. Because all read and write operations are
performed only on partition leaders, any non-zero value means those partitions are
unavailable and should trigger an alert.
- LeaderElectionRateAndTimeMs: Reports the rate of leader elections (per second)
and the total time the cluster went without a leader (in milliseconds).
- UncleanLeaderElectionsPerSec: An unclean leader election is a special case in
which no available replicas are in sync. Because each partition must have a leader,
an election is held among the out-of-sync replicas and a leader is chosen, meaning
any messages that were not synced prior to the loss of the former leader are lost
forever.
- TotalTimeMs: The TotalTimeMs metric family measures the total time taken to
service a request (be it a produce, fetch-consumer, or fetch-follower request).
- BytesInPerSec/BytesOutPerSec: Tracking network throughput on your brokers
shows where potential bottlenecks may lie, and can inform decisions such as
whether to enable end-to-end compression of your messages.
- Disk usage: Kafka will fail should its disk become full, so keeping track of disk
growth over time is recommended.
- Network bytes sent/received: If you are monitoring Kafka’s bytes in/out metrics,
you are getting Kafka’s side of the story. To get a full picture of network usage on
your host, you also need to monitor host-level network throughput.
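The broker checks above can be sketched as a small function. This is a minimal illustration, not part of the original slides: it assumes the metrics have already been collected (for example over JMX) into one dictionary per broker, and the function name `broker_alerts` and the broker ids are hypothetical.

```python
from typing import Dict, List


def broker_alerts(metrics_by_broker: Dict[str, Dict[str, float]]) -> List[str]:
    """Return human-readable alerts for one snapshot of broker metrics.

    The input layout (broker id -> metric name -> value) is an assumption
    made for this sketch.
    """
    alerts: List[str] = []

    # Exactly one broker in the cluster should report itself as controller.
    controllers = sum(
        m.get("ActiveControllerCount", 0) for m in metrics_by_broker.values()
    )
    if controllers != 1:
        alerts.append(f"ActiveControllerCount sums to {controllers:g}, expected 1")

    # Metrics that should stay at zero on every broker.
    zero_metrics = (
        "UnderReplicatedPartitions",
        "OfflinePartitionsCount",
        "UncleanLeaderElectionsPerSec",
    )
    for broker in sorted(metrics_by_broker):
        for name in zero_metrics:
            value = metrics_by_broker[broker].get(name, 0)
            if value > 0:
                alerts.append(f"{broker}: {name} = {value:g}")
    return alerts
```

Calling `broker_alerts` on a healthy snapshot returns an empty list; every entry in a non-empty result names the broker and the metric that left its expected value.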

KAFKA MONITORING: Producers
- Response rate: For producers, the response rate represents the rate of responses
received from brokers. Brokers respond to producers when the data has been
received (exactly when depends on the producer's acks setting).
- Request rate: The request rate is the rate at which producers send data to
brokers. Keeping an eye on peaks and drops is essential to ensure continuous
service availability.
- Request latency average: The average request latency measures the time between
the call to KafkaProducer.send() and the moment the producer receives a response
from the broker.
- Outgoing byte rate: As with Kafka brokers, you will want to monitor your
producer network throughput. Observing traffic volume over time is essential to
determine if changes to your network infrastructure are needed.
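One way to keep an eye on peaks and drops in the producer request rate is to compare each sample with the average of the samples just before it. The sketch below is only an illustration: the function name `rate_anomalies`, the window size, and the 50% tolerance are assumptions, not values from the slides.

```python
from statistics import mean
from typing import List, Sequence


def rate_anomalies(
    samples: Sequence[float], window: int = 5, tolerance: float = 0.5
) -> List[int]:
    """Return indexes of samples that deviate from the recent average.

    A sample is flagged when it differs from the mean of the previous
    `window` samples by more than `tolerance` (a fraction of that mean).
    """
    flagged: List[int] = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline == 0:
            continue  # no meaningful relative deviation from a zero baseline
        if abs(samples[i] - baseline) / baseline > tolerance:
            flagged.append(i)
    return flagged
```

The same function can be applied to the response rate or the outgoing byte rate, since all three are plain time series.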

KAFKA MONITORING: Consumers
- ConsumerLag: ConsumerLag is the calculated difference between a consumer’s
current log offset and a producer’s current log offset, that is, how far the
consumer is behind the newest message in the partition.
- MaxLag: Goes hand-in-hand with ConsumerLag, and is the maximum observed
value of ConsumerLag.
- BytesPerSec: As with producers and brokers, you will want to monitor your
consumer network throughput.
- MessagesPerSec: The rate of messages consumed per second may not strongly
correlate with the rate of bytes consumed because messages can be of variable
size.
- MinFetchRate: The fetch rate of a consumer can be a good indicator of overall
consumer health; a minimum fetch rate approaching zero may signal a stalled
consumer.
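The lag calculation described above can be sketched as follows. It assumes the log end offsets and the committed offsets have already been fetched per partition; the function name `consumer_lag` and the dictionary layout are hypothetical.

```python
from typing import Dict


def consumer_lag(
    log_end_offsets: Dict[int, int], committed_offsets: Dict[int, int]
) -> Dict[int, int]:
    """Return the lag per partition as log end offset minus committed offset.

    Assumption for this sketch: a partition with no committed offset is
    treated as committed at 0, which overstates the lag if older log
    segments have already been deleted.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


# Example values are made up for illustration.
lag = consumer_lag({0: 120, 1: 80, 2: 50}, {0: 100, 1: 80})
max_lag = max(lag.values())
total_lag = sum(lag.values())
```

Here `max_lag` plays the role of MaxLag for a single snapshot, while `total_lag` gives a single number to graph per consumer group.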

KAFKA MONITORING: ZooKeeper
- zk_outstanding_requests: Clients can end up submitting requests faster than
ZooKeeper can process them. If you have a large number of clients, it’s almost a
given that this will happen occasionally. To avoid using up all available memory
on queued requests, ZooKeeper will throttle clients if its queue limit is
reached.
- zk_avg_latency: The average request latency is the average time it takes (in
milliseconds) for ZooKeeper to respond to a request. ZooKeeper will not respond
to a request until it has written the transaction to its transaction log.
- zk_num_alive_connections: ZooKeeper reports the number of clients connected
to it via the zk_num_alive_connections metric. This represents all connections,
including connections to non-ZooKeeper nodes.
- zk_followers (leader only): The number of followers should equal the total size of
your ZooKeeper ensemble - 1 (the leader is not included in the follower count).
- zk_pending_syncs (leader only): The transaction log is the most
performance-critical part of ZooKeeper. ZooKeeper must sync transactions to disk
before returning a response, so a large number of pending syncs will result in
latency increases across the board.
- Bytes sent/received (v0.8.x only): Brokers and consumers communicate with
ZooKeeper. In large-scale deployments with many consumers and partitions, this
constant communication means ZooKeeper could become a bottleneck.
- Usable memory: ZooKeeper should reside entirely in RAM and will suffer
considerably if it must page to disk. Therefore, keeping track of the amount of
usable memory is necessary to ensure ZooKeeper performs optimally.
- Disk latency: Although ZooKeeper should reside in RAM, it still makes use of the
filesystem for both periodically snapshotting its current state and for maintaining
logs of all transactions.
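The zk_* metrics above are reported by ZooKeeper's `mntr` command as tab-separated key/value lines. The sketch below parses such output and applies two of the checks; it assumes the text has already been retrieved from the server, and the function names and the `max_outstanding` threshold are hypothetical choices for illustration.

```python
from typing import Dict, List


def parse_mntr(text: str) -> Dict[str, str]:
    """Parse tab-separated `mntr` output into a dict of raw string values."""
    stats: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines that do not contain a key/value pair
            stats[key.strip()] = value.strip()
    return stats


def zk_warnings(
    stats: Dict[str, str], ensemble_size: int, max_outstanding: int = 10
) -> List[str]:
    """Return warnings for queued requests and a missing follower."""
    warnings: List[str] = []
    outstanding = int(float(stats.get("zk_outstanding_requests", "0")))
    if outstanding > max_outstanding:
        warnings.append(f"{outstanding} outstanding requests")
    # zk_followers is only reported by the leader.
    if stats.get("zk_server_state") == "leader":
        followers = int(float(stats.get("zk_followers", "0")))
        if followers != ensemble_size - 1:
            warnings.append(
                f"{followers} followers, expected {ensemble_size - 1}"
            )
    return warnings
```

A suitable value for `max_outstanding` depends on the deployment, so it is worth deriving it from the normal baseline rather than using the default shown here.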

MONITORING TOOLS: Open source
Yahoo Kafka Manager (https://github.com/yahoo/kafka-manager):
- Manage multiple clusters
- Easy inspection of cluster state (topics, consumers, offsets, brokers, replica
distribution, partition distribution)
- Run preferred replica election
- Generate partition assignments with option to select brokers to use
- Run reassignment of partitions (based on generated assignments)
- Create a topic with optional topic configs
- Delete a topic
- Topic list indicates topics marked for deletion (only supported on 0.8.2+)
- Batch generate partition assignments for multiple topics with option to select
brokers to use
- Batch run reassignment of partitions for multiple topics
- Add partitions to an existing topic
- Update config for an existing topic

LinkedIn Burrow (https://github.com/linkedin/Burrow):
Burrow is a monitoring companion for Apache Kafka that provides consumer lag checking as a service without
the need for specifying thresholds. It monitors committed offsets for all consumers and calculates the status of
those consumers on demand. An HTTP endpoint serves that status on request and also exposes
other Kafka cluster information. Configurable notifiers can send status out via email or HTTP
calls to another service.
- Multiple Kafka Cluster support
- Automatically monitors all consumers using Kafka-committed offsets
- Configurable support for Zookeeper-committed offsets
- Configurable support for Storm-committed offsets
- HTTP endpoint for consumer group status, as well as broker and consumer
information
- Configurable emailer for sending alerts for specific groups
- Configurable HTTP client for sending alerts to another system for all groups
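A client of Burrow's HTTP endpoint might look like the sketch below. The URL path and the `status` and `totallag` response fields follow Burrow's v3 HTTP API as I understand it and should be checked against the deployed version; the function names, host, cluster, and group are hypothetical. No request is sent here, only the URL construction and response parsing are shown.

```python
import json
from typing import Tuple


def burrow_lag_url(base_url: str, cluster: str, group: str) -> str:
    """Build the consumer group lag URL (path assumed from Burrow's v3 API)."""
    return f"{base_url.rstrip('/')}/v3/kafka/{cluster}/consumer/{group}/lag"


def summarize_status(payload: str) -> Tuple[str, int]:
    """Extract (status, total lag) from a Burrow lag response body.

    Field names are assumptions based on the v3 response format.
    """
    body = json.loads(payload)
    if body.get("error"):
        raise ValueError(body.get("message", "Burrow returned an error"))
    status = body["status"]
    return status["status"], int(status.get("totallag", 0))
```

Because Burrow evaluates the group status itself, a caller only needs to react to a status other than "OK" instead of maintaining its own lag thresholds.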

Kafdrop (https://github.com/HomeAdvisor/Kafdrop):
Kafdrop is a UI for monitoring Apache Kafka clusters. The tool displays information such as brokers, topics,
and partitions, and even lets you view messages. It is a lightweight application that runs on Spring Boot and
requires very little configuration.

LinkedIn’s Kafka Monitor (https://github.com/linkedin/kafka-monitor):
Kafka Monitor is a framework to implement and execute long-running Kafka system tests in a real cluster. It
complements Kafka’s existing system tests by capturing potential bugs or regressions that are only likely to
occur after a prolonged period of time or with low probability. Moreover, it allows you to monitor a Kafka cluster
using end-to-end pipelines to obtain a number of derived vital stats such as end-to-end latency, service
availability and message loss rate. You can easily deploy Kafka Monitor to test and monitor your Kafka cluster
without requiring any change to your application.
Kafka Monitor can automatically create the monitor topic with the specified config and increase the partition count
of the monitor topic to ensure partition# >= broker#. It can also reassign partitions and trigger preferred leader
election to ensure that each broker acts as leader of at least one partition of the monitor topic. This allows
Kafka Monitor to detect performance issues on every broker without requiring users to manually manage the
partition assignment of the monitor topic.
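To make the derived stats concrete, the sketch below computes a message loss rate and an average end-to-end latency from timestamps recorded at produce and consume time. It is an illustration of the idea only, not Kafka Monitor's own code; the function name, the sequence-number keys, and the sample values are hypothetical.

```python
from typing import Dict, Optional


def end_to_end_stats(
    produced: Dict[int, int], consumed: Dict[int, int]
) -> Dict[str, Optional[float]]:
    """Derive loss rate and average latency from two timestamp maps.

    Both maps go from a message sequence number to a timestamp in
    milliseconds (send time and receive time respectively).
    """
    delivered = [seq for seq in produced if seq in consumed]
    lost = len(produced) - len(delivered)
    latencies = [consumed[seq] - produced[seq] for seq in delivered]
    return {
        "loss_rate": lost / len(produced) if produced else 0.0,
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else None,
    }


# Message 3 was produced but never consumed.
stats = end_to_end_stats(
    {1: 1000, 2: 1010, 3: 1020, 4: 1030},
    {1: 1005, 2: 1025, 4: 1040},
)
```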

Grafana + Prometheus (https://github.com/grafana/grafana,
https://github.com/prometheus/prometheus):
- Grafana: Open source, feature-rich metrics dashboard and graph editor for
Graphite, Elasticsearch, OpenTSDB, Prometheus and InfluxDB.
- Prometheus: Prometheus, a Cloud Native Computing Foundation project, is a
systems and service monitoring system. It collects metrics from configured targets
at given intervals, evaluates rule expressions, displays the results, and can trigger
alerts if some condition is observed to be true.
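The "collect at intervals, evaluate a rule, alert if true" loop can be pictured with a toy evaluator. Real Prometheus alerts are written in its own rule language with an expression and a `for` duration; the Python below only mimics that idea, and the function name, threshold, and sample values are hypothetical.

```python
from typing import Sequence


def alert_is_firing(
    samples: Sequence[float], threshold: float, for_samples: int
) -> bool:
    """Return True when the last `for_samples` scrapes all exceed `threshold`.

    Requiring several consecutive breaches avoids alerting on a single spike.
    """
    if for_samples <= 0 or len(samples) < for_samples:
        return False
    return all(value > threshold for value in samples[-for_samples:])
```

For example, with one scrape per interval, `alert_is_firing(under_replicated, 0, 3)` would fire only after three consecutive scrapes reported under-replicated partitions.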
Demo Kafka repository (with Slack integration):
https://github.com/lucrussell/slack-chatops

ELK (https://github.com/elastic/elasticsearch, https://github.com/elastic/logstash,
https://github.com/elastic/kibana):
- Elasticsearch: Distributed RESTful search engine built for the cloud.
- Logstash: Part of the Elastic Stack along with Beats, Elasticsearch and
Kibana. Logstash is a server-side data processing pipeline that ingests data from a
multitude of sources simultaneously, transforms it, and then sends it to your
favorite "stash."
- Kibana: Window into the Elastic Stack. Specifically, it's a browser-based analytics
and search dashboard for Elasticsearch.

LANDOOP STACK
Landoop Lenses:
Lenses is an enterprise-grade product that integrates natively with Apache Kafka to
speed up the delivery of streaming applications and to manage data flows.
Lenses supports the core elements of Kafka with a rich user interface, endpoints and
vital enterprise capabilities that enable engineering and data teams to query real-time
data and to create and monitor Kafka topologies, with rich integrations with other systems.
Fast Data-dev:
Running a demo development environment:
$ docker run --rm --net=host landoop/fast-data-dev

CONFLUENT STACK
Confluent Platform: Streaming platform that enables you to organize and manage
data from many different sources with one reliable, high performance system.
Bundle:
- Kafka Connectors
- Kafka Clients
- Schema Registry
- REST Proxy
Enterprise:
- Automatic Data Balancing
- Multi Datacenter Replication
- Confluent Control Center
- JMS Client

Confluent Control Center:
Confluent Control Center is a GUI-based system for managing and monitoring Apache
Kafka. It allows you to easily manage Kafka Connect, to create, edit, and manage
connections to other systems. It also allows you to monitor data streams from
producer to consumer, assuring that every message is delivered, and measuring how
long it takes to deliver messages. Using Control Center, you can build a production
data pipeline based on Apache Kafka without writing a line of code. Control Center
also has the capability to define alerts on the latency and completeness statistics of
data streams, which can be delivered by email or queried from a centralized alerting
system.
