Monitor, Troubleshoot and Fix the Most Common Problems as Your Kafka Clusters Grow
Apr 2, 2019
Monitoring
Logging
Infrastructure
Scalability
Krunal Vora
Software Engineer, Observability, Tinder
Expectations from the Talk
● Intended for users with some Kafka experience
● Kafka Monitoring 101
● Questions / problems you are sure to encounter when working with Kafka
● Tools to make your life easier
GLOBAL PERFORMANCE
190+ countries
40+ languages
2.2B+ swipes daily
30B matches
Impact of Kafka @ Tinder
~86B events/day
~1M events/second
Cost effectiveness: ~90%
Using Kafka over SQS / Kinesis saves us approximately 90% on costs
>40TB data/day
Kafka delivers the performance and throughput needed to sustain this scale of data processing
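As a quick sanity check, the daily and per-second figures on this slide can be related with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and treats the ">40TB" figure as a lower bound; the derived per-event size and byte rate are estimates, not numbers quoted in the talk.

```python
# Back-of-the-envelope check of the slide's figures (decimal units assumed).
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400

events_per_day = 86e9                    # ~86B events/day (from the slide)
bytes_per_day = 40e12                    # >40TB/day, taken as a lower bound

# Average rates implied by the daily totals.
events_per_second = events_per_day / SECONDS_PER_DAY
avg_event_bytes = bytes_per_day / events_per_day
megabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 1e6

print(f"{events_per_second:,.0f} events/s on average")
print(f"{avg_event_bytes:.0f} bytes/event (lower bound)")
print(f"{megabytes_per_second:.0f} MB/s sustained (lower bound)")
```

The average works out to just under one million events per second, which is consistent with the "~1M events/second" figure.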
Monitoring for Kafka
Monitoring Options
A number of other paid monitoring stacks are provided by various companies, to name a few:
● Splunk
● CloudWatch
● Datadog
● Confluent Control Center
Monitoring with Prometheus / Grafana
JMX Exporter
java -javaagent:./jmx_prometheus_javaagent-0.11.0.jar=8080:config.yaml
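Once the agent is attached, the exporter serves metrics over HTTP in the Prometheus text format. The following is a minimal sketch for checking that endpoint by hand, assuming the agent listens on port 8080 as in the command above and that `/metrics` (the default Prometheus scrape path) is used. The metric names in the sample are hypothetical, since the real names depend on the rules in `config.yaml`.

```python
import urllib.request

def fetch_metrics(url="http://localhost:8080/metrics"):
    """Download the raw text exposition from the exporter (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")

def parse_metrics(text):
    """Return {sample name including labels: float value}.

    Skips blank lines and '# HELP' / '# TYPE' comments. Assumes samples carry
    no trailing timestamp, so the value is the last space-separated field.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples

# Hypothetical sample output; real metric names depend on config.yaml.
sample = """\
# HELP example_kafka_messages_in_total Hypothetical counter
# TYPE example_kafka_messages_in_total counter
example_kafka_messages_in_total{topic="events"} 1250.0
example_kafka_under_replicated_partitions 0.0
"""
metrics = parse_metrics(sample)
print(metrics)
```

Against a live broker, `parse_metrics(fetch_metrics())` would return the same kind of dictionary, which is enough to confirm that the agent is exporting before pointing Prometheus at it.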
Monitoring with Prometheus / Grafana
Ref: https://kafka.apache.org/intro
Monitoring with Prometheus / Grafana
● JMX Exporter
● Prometheus
● Kafka Consumer Group Exporter
Monitoring with Prometheus / Grafana
Demo!
Questions / Problems You Are Sure to Face When Working with Kafka
Partition Reassignment
Reasons to trigger Partition Reassignment:
● Cluster Expansion
● Selectively moving some partitions to a broker
Goal:
Achieve load balance across brokers
Some of the tools available:
● Partition Reassignment Tool
● Kafka Manager
● LinkedIn Tools
Let’s take a look at the CLI!
Partition Reassignment Tool
Can run in 3 mutually exclusive modes:
● --generate: generates a candidate reassignment to move partitions
● --execute: kicks off the reassignment of partitions
● --verify: verifies the status of the reassignment
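The tool exchanges its inputs and outputs as small JSON documents. The sketch below builds the two shapes involved: the topics-to-move document that --generate reads, and a reassignment plan of the kind that --execute and --verify consume. The topic name, broker IDs and the simple round-robin placement are illustrative assumptions; the tool's own --generate mode computes its candidate plan differently.

```python
import json

def topics_to_move(topics):
    """Build the topics-to-move document read by --generate."""
    return {"version": 1, "topics": [{"topic": t} for t in topics]}

def round_robin_plan(topic, num_partitions, brokers, replication_factor):
    """Hand-rolled plan: place each partition's replicas on consecutive brokers.

    This is a simplified stand-in for the tool's own candidate assignment.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds broker count")
    partitions = []
    for p in range(num_partitions):
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}

# Hypothetical topic and broker IDs, for illustration only.
move_doc = topics_to_move(["events"])
plan = round_robin_plan("events", num_partitions=4,
                        brokers=[1, 2, 3], replication_factor=2)

print(json.dumps(move_doc))
print(json.dumps(plan, indent=2))
```

Saved to files, these documents would be passed to the tool through its --topics-to-move-json-file and --reassignment-json-file options respectively.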
Data Retention
Reasons to trigger data retention changes:
● Cluster-level data retention policy setting
● Topic-level data retention as per the consumption rate
Goal:
Find the right balance between clearing out data once you are done with it and feeding your cluster to death.
log.retention.hours = time-based retention
log.retention.bytes = size-based retention (applied per partition)
Both can also be overridden on a per-topic basis (retention.ms, retention.bytes)
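To get a feel for what a retention setting costs in disk, the retained volume can be estimated from the ingest rate. This is a rough sketch that assumes steady ingest and ignores compression and index overhead; the replication factor of 3 is an assumption, the 40 TB/day figure is the one quoted earlier in the talk, and 168 hours is Kafka's default for log.retention.hours.

```python
def retained_bytes(ingest_bytes_per_day, retention_hours, replication_factor):
    """Approximate on-disk footprint once time-based retention is fully reached."""
    return ingest_bytes_per_day * (retention_hours / 24) * replication_factor

def to_retention_ms(hours):
    """Convert hours to the millisecond value used by the per-topic retention.ms."""
    return hours * 60 * 60 * 1000

# 40 TB/day (decimal), 7 days of retention, replication factor 3 (assumed).
footprint = retained_bytes(40e12, retention_hours=168, replication_factor=3)
print(f"{footprint / 1e12:.0f} TB retained across the cluster")
print(f"retention.ms for 168 hours: {to_retention_ms(168)}")
```

Halving the retention window halves the estimate, which is why topic-level overrides for fast-consumed topics can free a lot of disk.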
Consumer Group Rebalancing
Happens when:
● A consumer joins/leaves the group
● A consumer is considered dead by the group coordinator
● Partitions are added
Goal:
Bringing sanity to partition ownership among consumers of a consumer group.
Ideal scenario:
# of partitions = # of consumers
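The effect of the partition-to-consumer ratio can be illustrated with a toy assignment function. This is a simplified round-robin stand-in, not the logic of Kafka's actual assignors, and the consumer names are hypothetical; it only shows why extra consumers sit idle and how ownership shifts when a member leaves.

```python
def assign_round_robin(num_partitions, consumers):
    """Deal partitions out to consumers one at a time (toy model of a rebalance)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# Balanced: 6 partitions over 3 consumers.
balanced = assign_round_robin(6, ["c1", "c2", "c3"])

# More consumers than partitions: the extras receive nothing.
crowded = assign_round_robin(2, ["c1", "c2", "c3"])

# A consumer leaves: the remaining members absorb its partitions.
after_leave = assign_round_robin(6, ["c1", "c3"])

print(balanced)
print(crowded)
print(after_leave)
```

With equal counts every consumer owns exactly one partition; with fewer consumers each owns several, and with more consumers the surplus ones do no work.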
Messages Re-distribution
You might need to design a re-distribution strategy when:
● One or a few partitions hold far more messages than the other partitions
● Consumption is stuck due to this imbalance
Goal:
Provide a breather to a partition/consumer that is taking the load of the entire topic/group.
Preventing the imbalance up front is more viable than fixing it after the fact.
The right solution will differ according to your scenario.
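Before designing a fix, it helps to confirm which partitions are actually overloaded. The sketch below flags partitions whose message count is well above the mean; the counts are made-up sample data and the 2x threshold is an arbitrary assumption that would need tuning per topic.

```python
def skewed_partitions(counts, threshold=2.0):
    """Return partition IDs whose message count exceeds threshold x the mean."""
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(p for p, n in counts.items()
                  if mean > 0 and n > threshold * mean)

# Hypothetical per-partition message counts for one topic.
counts = {0: 1000, 1: 1100, 2: 950, 3: 9000}
hot = skewed_partitions(counts)
print(f"hot partitions: {hot}")
```

The same check works on per-partition consumer lag instead of message counts, which is often the more direct signal that consumption is stuck.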
Shoutout to Vijay Vangapandu @ Tinder for the demo!
Administration Tools
These tools help a great deal when troubleshooting issues with Kafka.
ZooNavigator UI
Kafka Manager by Yahoo
Apache Kafka Tools / Utilities:
Kafka Monitor
LinkedIn Cruise Control
Kafka Tool
KafkaCat
KafDrop
Kafka MirrorMaker
Resources to keep learning
● Apache Kafka Documentation / Books
● Online courses, e.g., Udemy courses such as Kafka Monitoring and Operations by Stephane Maarek
● Spin up your own cluster and experiment!
Thank you!
linkedin.com/in/krunalvora/
Troubleshooting as Your Kafka Clusters Grow (Krunal Vora, Tinder) Kafka Summit NYC 2019