Kafka Lag Monitoring For Human Beings
(Sorry, robots)
Elad Leev
Kafka Summit, Aug 2020
Who am I?
Elad Leev
Platform Engineer, Real-Time Infrastructure Team
AppsFlyer in a Nutshell
AppsFlyer is a mobile attribution and analytics platform.
> 1+ million incoming HTTP requests/sec
> 100+ billion events per day
> 20+ Kafka clusters
A small recap
Kafka - used for building real-time data pipelines and streaming applications
Offset - a simple integer used by Kafka to maintain the current position of a consumer
Lag - the delta between the last produced message and the last committed offset
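A minimal sketch of that delta, assuming the kafka-python client and placeholder broker/topic/group names ("localhost:9092", "events", "my-app"):

```python
# Lag per partition = last produced offset (high watermark) - last committed offset.
# All names below are placeholders; this is a sketch, not production code.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="my-app",
                         enable_auto_commit=False)

partitions = [TopicPartition("events", p)
              for p in consumer.partitions_for_topic("events")]
end_offsets = consumer.end_offsets(partitions)      # last produced position

for tp in partitions:
    committed = consumer.committed(tp) or 0         # last committed offset
    print(f"partition {tp.partition}: lag = {end_offsets[tp] - committed}")

consumer.close()
```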
__consumer_offsets
Offsets can be stored either in ZooKeeper or in a special topic called __consumer_offsets.
https://cwiki.apache.org/confluence/display/KAFKA/Offset+Management
ZooKeeper is not built for a high-write load such as offset storage.
A consistent, fault-tolerant, and partitioned way of storing offsets.
__consumer_offsets
[Diagram: the application sends an OffsetCommitRequest to the group coordinator broker, which stores the offset in the __consumer_offsets topic; consumer lag checks read from that same topic.]
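For context, a minimal sketch of what produces that OffsetCommitRequest, again with the kafka-python client and placeholder names: every explicit commit from the consumer goes to the group coordinator and ends up in __consumer_offsets.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer("events",                    # placeholder topic
                         bootstrap_servers="localhost:9092",
                         group_id="my-app",
                         enable_auto_commit=False)

for message in consumer:
    print(message.offset, message.value)              # stand-in for real processing
    consumer.commit()  # OffsetCommitRequest -> group coordinator -> __consumer_offsets
```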
Why Lag Matters
Lag is a major KPI when working with Kafka.
Lag indicates how far behind your application is in processing up-to-date information.
Kafka persistence is based on retention.
We want to keep the lag of our application as small as possible.
How did we previously monitor it?
We created a Clojure service: Kafka Monitor.
It parsed the values from kafka-consumer-groups.sh and sent them to our metrics stack.
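Roughly the idea, sketched in Python rather than Clojure (the column layout of kafka-consumer-groups.sh varies between Kafka versions, and the group name and metrics call are placeholders):

```python
import subprocess

output = subprocess.run(
    ["kafka-consumer-groups.sh", "--bootstrap-server", "localhost:9092",
     "--describe", "--group", "my-app"],               # placeholder group
    capture_output=True, text=True, check=True).stdout

for line in output.splitlines():
    cols = line.split()
    # Typical columns: GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
    if len(cols) > 5 and cols[5].isdigit():
        topic, partition, lag = cols[1], cols[2], int(cols[5])
        # here the real service shipped the value to our metrics stack
        print(f"kafka.lag {lag} topic={topic} partition={partition}")
```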
It was hard to maintain and not scalable.
(A small part of the config file)
You get a PagerDuty alert on Kafka lag.
You want to understand what's going on in your system.
You go to our shiny, knowledge-packed Grafana dashboard.
And then you realise that...
It's just a number.
What the hell does 40K lag mean?
What did we want to achieve?
Automatic
> No need to change a config file
> Filter consumer groups based on a regex
Scalable
> Small footprint
> Easy to scale
Simple, easy to use
> Support both ZK and the __consumer_offsets topic
The "raw" metrics we looked for:
> Per partition
> Per consumer group
> Per topic
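That metric shape maps naturally onto labels. A hedged sketch with the prometheus_client library (the metric name and labels are my own illustration, not what any of the tools below actually export):

```python
import time
from prometheus_client import Gauge, start_http_server

# One gauge, labelled per consumer group / topic / partition.
consumer_lag = Gauge("kafka_consumer_group_lag",
                     "Consumer group lag in messages",
                     ["group", "topic", "partition"])

start_http_server(8000)                       # expose /metrics on :8000
consumer_lag.labels(group="my-app", topic="events", partition="0").set(40000)

while True:                                   # keep the exporter process alive
    time.sleep(60)
```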
What are the options?
LinkedIn - Burrow
> A LinkedIn project
> More than 2.5K stars
> Active community
> Production ready
Lightbend - Kafka Lag Exporter
> Smart
> Time based
> Still in beta
Zalando - Remora
> Inspired by Burrow
> CloudWatch & DataDog integration
> A wrapper around the Kafka CLI
Burrow
Burrow is a monitoring solution for Apache Kafka that provides consumer lag checking as a service.
Burrow has a modular design:
> Cluster / Consumers - Kafka clients that periodically update cluster and consumer information
> Storage - stores Burrow's information
> Evaluator - calculates the status of each group
> Notifier - requests status on consumer groups and sends notifications
> HTTP Server - provides an API interface for Burrow
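As an illustration, polling that API might look like this. The paths follow Burrow's v3 HTTP API; the host, port, and exact response fields here are assumptions to verify against your Burrow version.

```python
import requests

BURROW = "http://burrow.internal:8000"        # placeholder host:port

clusters = requests.get(f"{BURROW}/v3/kafka").json()["clusters"]
for cluster in clusters:
    groups = requests.get(
        f"{BURROW}/v3/kafka/{cluster}/consumer").json()["consumers"]
    for group in groups:
        status = requests.get(
            f"{BURROW}/v3/kafka/{cluster}/consumer/{group}/lag").json()["status"]
        print(cluster, group, status["status"], status["totallag"])
```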
Associated Projects
> Burrow UI
> Burrow Dashboard
How do we use it?
Burrow Architecture
Deployment Process
> Clone the Burrow source
> Trigger GitLab CI
> Config file (.toml)
> Linter - check that all hosts are resolvable (see the sketch below)
> Build the Burrow container
> Deploy using our in-house deployment system
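The linter step, sketched. The .toml path is a placeholder; the [cluster.<name>] / "servers" layout matches Burrow's documented config format, but adjust the keys to your own files.

```python
import socket
import sys
import tomllib                                  # stdlib in Python 3.11+

def resolvable(host: str) -> bool:
    """True if DNS can resolve the host."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False

with open("burrow.toml", "rb") as f:            # placeholder path
    config = tomllib.load(f)

# Burrow clusters are defined as [cluster.<name>] blocks with a "servers"
# list of "host:port" strings.
hosts = {server.split(":")[0]
         for cluster in config.get("cluster", {}).values()
         for server in cluster.get("servers", [])}

unresolvable = sorted(h for h in hosts if not resolvable(h))
if unresolvable:
    sys.exit(f"unresolvable hosts in config: {unresolvable}")
print("all hosts resolve")
```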
Burrow Dashboard
[Dashboard panels: Lag By Number, Produce vs. Consume (producer rate vs. consumer rate), Partitions Analysis]
And the endgame: Time Based Metrics
Time Lag - How did we do it?
Time lag = Diff(Last_Consumed, Last_Produced) / Producer Rate
[Timeline: the producer's offset moves from 134 at 12:00AM to 144 at 12:10AM to 154 at 12:20AM; the gap between the consumer's committed offset and the producer's latest offset is the lag.]
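Plugging in the numbers from the timeline above (a worked sketch; the helper name is my own):

```python
def time_lag_minutes(last_produced, last_consumed, producer_rate_per_min):
    """Estimate how far behind the consumer is, expressed in minutes."""
    if producer_rate_per_min <= 0:
        return 0.0
    return (last_produced - last_consumed) / producer_rate_per_min

# Offsets 134 -> 154 between 12:00AM and 12:20AM => ~1 message/minute.
producer_rate = (154 - 134) / 20

# A consumer still sitting at offset 134 at 12:20AM is only "20 messages"
# behind, but in time terms that is 20 minutes of lag.
print(time_lag_minutes(last_produced=154,
                       last_consumed=134,
                       producer_rate_per_min=producer_rate))   # -> 20.0
```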
What's next?
> Smart Alerts - dynamic alerts based on lag and retention (see the sketch after this list)
> Decoupling - as we grow, Burrow will be deployed per cluster
> Migration - migrating a crucial part of the infrastructure is hard
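One way to read "dynamic alerts based on lag and retention" (a sketch of the idea, not Burrow's or AppsFlyer's implementation): fire when the estimated time lag eats too much of the topic's retention window.

```python
def should_alert(time_lag_minutes: float,
                 retention_minutes: float,
                 threshold: float = 0.5) -> bool:
    """Alert once the consumer has burned through `threshold` of retention."""
    return time_lag_minutes >= retention_minutes * threshold

# e.g. 7-day retention, consumer ~4 days behind -> messages are at risk of
# being deleted before they are consumed.
print(should_alert(time_lag_minutes=4 * 24 * 60,
                   retention_minutes=7 * 24 * 60))   # True
```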
@eladleev
Thank You!
linkedin.com/in/elad-leev
medium.com/eladleev
