Troubleshooting as Your Kafka Clusters Grow (Krunal Vora, Tinder) Kafka Summit NYC 2019

Monitor, Troubleshoot and Fix the
Most Common Problems as Your
Kafka Clusters Grow
Apr 2, 2019

Monitoring
Logging
Infrastructure
Scalability
Krunal Vora
Software Engineer,
Observability, Tinder

Expectations from the Talk
● Intended for users with some Kafka experience
● Kafka Monitoring 101
● Questions / problems you are sure to encounter
when working with Kafka
● Tools to make your life easier

GLOBAL PERFORMANCE
190+
countries
40+
languages
2.2B+
swipes daily
30B
matches

Impact of Kafka @ Tinder
● ~86B events/day (~1M events/second)
● >40TB data/day: Kafka delivers the performance and throughput
needed to sustain this scale of data processing
● Cost effectiveness (~90%): using Kafka over SQS / Kinesis saves us
approximately 90% on costs

Monitoring Options
Several paid monitoring stacks are available from various
companies, to name a few:
● Splunk
● CloudWatch
● Datadog
● Confluent Control Center

Monitoring with Prometheus / Grafana
JMX Exporter
java -javaagent:./jmx_prometheus_javaagent-0.11.0.jar=8080:config.yaml

Ref: https://kafka.apache.org/intro

JMX Exporter
Prometheus
Kafka Consumer
Group Exporter

Demo!

Questions / Problems You Are Sure to
Face When Working with Kafka

Partition Reassignment
Reasons to trigger Partition
Reassignment:
● Cluster Expansion
● Selectively moving some
partitions to a broker
Goal:
Achieve load balance across brokers
Some of the tools available:
● Partition Reassignment Tool
● Kafka Manager
● LinkedIn Tools
Let’s take a look at the CLI!

Partition Reassignment Tool
Can run in 3 mutually exclusive modes:
● --generate: generates a candidate reassignment
to move partitions
● --execute: kicks off the reassignment of partitions
● --verify: verifies the status of the reassignment
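As a rough illustration of what --execute consumes, the sketch below builds a reassignment plan in the JSON layout the tool reads through --reassignment-json-file. The topic name, partition count, broker IDs, and the round-robin placement are assumptions made up for this example; this is not the plan --generate would produce.

```python
import json

def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Spread replicas round-robin over broker_ids (illustrative placement only)."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for p in range(num_partitions):
        # The first broker in each replica list is the preferred leader.
        replicas = [broker_ids[(p + r) % len(broker_ids)]
                    for r in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}

# Hypothetical topic "events" with 4 partitions, moved onto brokers 1-3.
plan = build_reassignment("events", 4, [1, 2, 3], 2)
print(json.dumps(plan, indent=2))
```

Saved to a file, a plan like this could be passed to the tool with --reassignment-json-file and --execute, then checked with --verify.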

Partition Reassignment Tool

Data Retention
Reasons to trigger data
retention changes:
● Cluster level data
retention policy setting
● Topic level data retention
as per the consumption
rate
Goal:
Find the right balance between clearing out
data once you are done with it and
feeding your cluster to death.
log.retention.hours = time-based retention
Can also be set on a per-topic basis (retention.ms, retention.bytes)
log.retention.bytes = size-based retention
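To get a feel for the trade-off, here is a back-of-the-envelope sketch of how much disk a time-based retention setting implies. It assumes steady ingest, a replication factor of 3, and a 72-hour retention window, all of which are illustrative choices, and it ignores compression and index overhead. The 40 TB/day figure is the one quoted earlier in the talk.

```python
def retained_bytes(ingest_bytes_per_day, retention_hours, replication_factor):
    """Approximate bytes held across the cluster at steady state."""
    return ingest_bytes_per_day * (retention_hours / 24) * replication_factor

TB = 10**12

# Assumed values: 72 hours of retention, replication factor 3.
estimate = retained_bytes(40 * TB, retention_hours=72, replication_factor=3)
print(f"{estimate / TB:.0f} TB")  # prints "360 TB"
```

Halving the retention window halves the estimate, which is why per-topic overrides for fast-consumed topics can free a lot of space.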

Consumer Group
Rebalancing
Happens when:
● A consumer joins/leaves
the group
● A consumer is considered
dead by the group
coordinator
● Partitions are added
Goal:
Bringing sanity to partition ownership
among consumers of a consumer group.
Ideal scenario:
# of partitions = # of consumers
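The effect of membership changes on partition ownership can be sketched with a toy assignor. This is a simplified round-robin deal, not Kafka's actual assignor implementation, and the consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers in turn; returns consumer -> partitions."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Ideal case: one partition per consumer.
balanced = assign_round_robin(partitions, ["c1", "c2", "c3", "c4"])

# A consumer leaves: a rebalance hands its partition to a survivor.
after_leave = assign_round_robin(partitions, ["c1", "c2", "c3"])

# More consumers than partitions: the extra consumer sits idle.
over_provisioned = assign_round_robin(partitions, ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, owned in over_provisioned.items() if not owned]

print(balanced, after_leave, idle, sep="\n")
```

With fewer consumers than partitions some consumers carry extra load, and with more consumers than partitions the surplus ones receive nothing, which is why matching the two counts is the ideal.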

Messages Re-distribution
You might need to design this when:
● One or some partitions
have highly unbalanced
amounts of messages
compared to other
partitions
● Consumption is stuck due
to this imbalance
Goal:
Provide a breather to a
partition/consumer that is taking the load
of the entire topic/group.
Precautionary measures that avoid such a
situation in the first place are preferable.
The right solution depends on
your scenario.
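One precautionary check is to flag partitions that hold far more messages than their peers. The sketch below is a minimal example of that idea. The per-partition counts are made up, and the 2x-the-mean threshold is an arbitrary assumption; in practice the counts might come from end offsets or consumer lag metrics.

```python
from statistics import mean

def hot_partitions(message_counts, threshold=2.0):
    """Return partitions whose count exceeds threshold x the mean count."""
    avg = mean(message_counts.values())
    return sorted(p for p, n in message_counts.items() if n > threshold * avg)

# Hypothetical message counts per partition; partition 3 is skewed.
counts = {0: 1_000, 1: 1_200, 2: 900, 3: 12_000}
print(hot_partitions(counts))  # prints [3]
```

A flagged partition is a hint to revisit the message key or partitioning strategy before consumption gets stuck.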
Shoutout to Vijay Vangapandu @ Tinder for the demo!

Administration Tools
These tools help a great deal when
troubleshooting issues with Kafka

Apache Kafka Tools /
Utilities:
Kafka Monitor
LinkedIn Cruise Control
Kafka Tool
KafkaCat
KafDrop
Kafka MirrorMaker

Resources to keep learning
● Apache Kafka Documentation / Books
● Online courses
e.g., Udemy courses like Kafka Monitoring and Operations by Stephane Maarek
● Spin up your own cluster and
experiment!

Thank you!
linkedin.com/in/krunalvora/
