The graphical curve of ‘getting things done with Kafka’ vs. ‘time spent with Kafka’ rises quickly before starting to plateau. Kafka, as we know, is great to get started with out of the box. Eventually, though, anyone who has used Kafka for a reasonable amount of time will come across problems, both trivial and hardcore. More often than not, these problems feel tedious and repetitive, and could be addressed better by the Kafka community. Issues like skewed partitions and brokers, partition reassignment for a large number of topics, working with messed-up consumer offsets, distributing workload evenly across partitions, and monitoring and viewing the Kafka cluster sound familiar to the community. Come learn from our experiences overcoming and automating some of the most common problems in scaling Kafka clusters at Tinder. Let’s create an environment of sharing best practices for the tools around Kafka.
Expectations from the Talk
● Intended for users with some Kafka experience
● Kafka Monitoring 101
● Questions / problems you are sure to encounter when working with Kafka
● Tools to make your life easier
Impact of Kafka @ Tinder
● ~86B events/day (~1M events/second)
● ~90% cost effectiveness: using Kafka over SQS / Kinesis saves us approximately 90% on costs
● >40TB data/day: Kafka delivers the performance and throughput needed to sustain this scale of data processing
Monitoring Options
A number of paid monitoring stacks are offered by various companies, to name a few:
● Splunk
● CloudWatch
● Datadog
● Confluent Control Center
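Alongside these, Kafka’s own CLI gives a quick health check. A minimal sketch, assuming a recent Kafka release where kafka-topics.sh accepts --bootstrap-server (older releases take --zookeeper); host and port are placeholders:

  # Partitions whose in-sync replica set has shrunk below the replication factor
  kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

  # Partitions that currently have no leader at all
  kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions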
13. 13
Partition Reassignment
Reasons to trigger partition reassignment:
● Cluster expansion
● Selectively moving some partitions to a broker

Goal:
Achieve load balance across brokers

Some of the tools available:
● Partition Reassignment Tool
● Kafka Manager
● LinkedIn Tools

Let’s take a look at the CLI!
Partition Reassignment Tool
Can run in 3 mutually exclusive modes:
● --generate: generates a candidate reassignment to move partitions
● --execute: kicks off the reassignment of partitions
● --verify: verifies the status of the reassignment
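A sketch of the three-step flow, assuming a recent Kafka release where the tool accepts --bootstrap-server (older releases take --zookeeper); the topic name, file names, and broker IDs are placeholders:

  # topics.json lists what to move, e.g. {"version": 1, "topics": [{"topic": "my-topic"}]}
  # 1. Generate a candidate assignment onto brokers 0, 1 and 2
  kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --topics-to-move-json-file topics.json --broker-list "0,1,2" --generate

  # 2. Save the proposed assignment to reassignment.json, then kick it off
  kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --reassignment-json-file reassignment.json --execute

  # 3. Poll until every partition reports the reassignment completed
  kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --reassignment-json-file reassignment.json --verify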
Data Retention
Reasons to trigger data retention changes:
● Cluster-level data retention policy setting
● Topic-level data retention as per the consumption rate

Goal:
Find the right balance between uncluttering once you are done with your data and feeding your cluster to death.

log.retention.hours = time-based retention
log.retention.bytes = size-based retention
Both can also be set on a per-topic basis (retention.ms / retention.bytes).
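A minimal sketch of a per-topic override, assuming kafka-configs.sh accepts --bootstrap-server on your release (older ones take --zookeeper); the topic name and values are placeholders:

  # Keep at most 24 hours of data on this topic (retention.ms is in milliseconds)
  kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
    --entity-name my-topic --alter --add-config retention.ms=86400000

  # Additionally cap each partition of the topic at ~50 GB regardless of age
  kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
    --entity-name my-topic --alter --add-config retention.bytes=53687091200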
Consumer Group Rebalancing
Happens when:
● A consumer joins/leaves the group
● A consumer is considered dead by the group coordinator
● Partitions are added

Goal:
Bring sanity to partition ownership among consumers of a consumer group.

Ideal scenario:
# of partitions = # of consumers
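To see how partitions are actually assigned, and whether any consumer is lagging after a rebalance, a minimal sketch (the group name and broker address are placeholders):

  # Per partition: current offset, log-end offset, lag, and the owning consumer
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-consumer-group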
Messages Re-distribution
You might need to design a solution when:
● One or a few partitions carry a highly unbalanced share of messages compared to the other partitions
● Consumption is stuck due to this imbalance

Goal:
Give a breather to a partition/consumer that is taking the load of the entire topic/group.

A precautionary measure is more viable, so as to avoid such a situation in the first place. The right design differs according to your scenario; a quick way to spot skew is sketched below.
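One hedged way to spot skew from the shell: fetch the newest (-1) and oldest (-2) offset per partition and compare the differences; a partition whose gap dwarfs the others is the hot one. Topic and broker names are placeholders; on Kafka 3.x the same tool ships as kafka-get-offsets.sh:

  # Newest offset per partition
  kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 \
    --topic my-topic --time -1

  # Oldest offset per partition; (newest - oldest) ≈ messages currently held
  kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 \
    --topic my-topic --time -2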
Shoutout to Vijay Vangapandu @ Tinder for the demo!
Resources to keep learning
● Apache Kafka documentation / books
● Online courses, e.g. Udemy’s “Kafka Monitoring and Operations” by Stephane Maarek
● Spin up your own cluster and experiment!