WAR STORIES: DIY
KAFKA
NINA HANZLIKOVA
23-04-2018
2
● Zalando is Europe’s largest
online fashion retailer.
● "Reimagine fashion for the
good of all."
● Radical Agility:
○ Autonomy
○ Mastery
○ Purpose
WHO ARE WE?
3
● Zalando Dublin is the
company’s Fashion Insights
Centre.
● We build data science
applications to help us
understand fashion.
● We use these findings to drive
our business.
WHAT DO WE DO?
● My team works to obtain and
analyze Fashion Data from the
web.
● We aim to provide this data in
near real time.
● There is quite a lot of it.
As developers we loved Apache Kafka® (and Kafka Streams) pretty much
straight away!
4
● Zalando teams have a high level of technological autonomy.
○ Teams can choose their own technology but they have to run it,
too.
● Zalando teams are usually small (in our case 3 developers) and are
highly focused on delivering customer value.
○ We really have to minimise the amount of time spent on Ops.
THERE’S A CATCH THOUGH
5
6
A BRIEF (LIES!) INTRO TO APACHE KAFKA®
Image courtesy of http://cloudurable.com/blog/kafka-architecture/index.html
7
AN EVEN BRIEFER (LIES!) INTRO TO KAFKA
STREAMS
Image courtesy of https://docs.confluent.io/current/streams/architecture.html
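Since the original slide is just an architecture diagram, here is a minimal Kafka Streams topology as a stand-in; it is a sketch only, and the topic names and broker address are placeholders rather than anything from the talk.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // also used as the consumer group id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("input-topic");   // consume from Kafka
            input.mapValues(value -> value.toUpperCase())                    // transform each record value
                 .to("output-topic");                                        // produce results back to Kafka

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

The point of the diagram stands: a Streams app is just another Kafka client that consumes, transforms and produces, keeping any state locally and backing it with Kafka topics.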
8
KAFKA DATA STORAGE
9
We run our services on AWS. So:
● Our producer and consumer apps can shut down uncleanly.
● So can our brokers.
● Network partitions are a thing.
● Network saturation is also a
thing.
AVOIDING DATA LOSS THE MINIMALIST WAY
We started with a basic setup:
● 3 ZooKeeper Nodes (t2.large)
● 6 Kafka Brokers (m3.xlarge) with
EBS volumes
● Kafka server.properties:
unclean.leader.election.enable=false
min.insync.replicas=2
default.replication.factor=3
● Producer config: acks=all
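For context, the producer-side settings that pair with acks=all look roughly like this; the key names are standard producer configs, but the retry and in-flight values are illustrative assumptions rather than the values used in the talk.

    # producer config (illustrative values; acks=all matches the slide)
    acks=all
    # keep retrying transient send failures
    retries=2147483647
    # avoid reordering when a retry happens
    max.in.flight.requests.per.connection=1

Together with min.insync.replicas=2 and a replication factor of 3 on the broker side, a write is only acknowledged once at least two replicas have it.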
10
AND BASIC MONITORING
11
AND BASIC MONITORING
12
What happens if Kafka brokers can’t connect to ZooKeeper?
WHAT WENT WRONG?
● If they don’t need to access ZooKeeper, they keep on working (even for days).
● Eventually, though, they will need to move the controller. Or a partition will fall behind, becoming under-replicated or going offline. Or a topic will be created or deleted…
13
But connectivity will fix itself, right?
WHAT WENT WRONG?
Not exactly…
The ZooKeeper client currently still caches resolved hosts and does not re-resolve them (PR150 and PR451).
On AWS, with ZooKeeper behind a load balancer, this can even cache a
load balancer instance rather than ZooKeeper itself.
14
What about if a Kafka broker cannot properly communicate with
ZooKeeper?
WHAT WENT WRONG?
This happened sometimes when a broker was temporarily partitioned from ZooKeeper and then reconnected.
The broker would end up caching a stale zkVersion, which would then prevent ISR-modifying operations from completing.
Good news, though: this bug should be resolved from 1.1.0 onwards! [KAFKA-2729]
15
● For most of these problems the simplest and fastest fix was to restart the broker.
● In our setup, this would terminate the old broker and its EBS volume and bring up a new one.
● If the rest of the cluster is healthy, the data just replicates onto the new broker.
FIXES, FIXES, FIXES
16
● Unfortunately, with a lot of data
in the cluster, the initial
replication can saturate the
network.
● Producers and consumers start timing out when talking to Kafka, effectively causing downtime.
● What happens if multiple brokers lose connectivity? In that case a rolling restart is not always an option.
NOT SO FAST
Graph courtesy of Michal Michalski
17
● We stopped relying on simple broker replication for persistence and
started persisting our EBS volumes as well!
● Most of the data now survives a broker termination. When a broker restarts, only the messages written during its downtime need to be replicated.
● If multiple brokers report problems and a rolling restart is not feasible, we can restart them together without losing all their data (see the sketch below).
KAFKA CONFIGURATION MARK 2
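A minimal sketch of what this means on the broker side, assuming the persistent EBS volume is mounted at /var/lib/kafka; the path and id below are illustrative, not the talk's values.

    # server.properties (illustrative)
    # broker.id must stay stable across restarts so the reattached data is recognised as belonging to this broker
    broker.id=3
    # data directory on the persistent EBS mount
    log.dirs=/var/lib/kafka/data

On restart, the new instance attaches the surviving volume, finds its existing log segments under log.dirs, and only has to catch up on whatever was written while it was down.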
18
KAFKA STREAM STORAGE
19
In a nutshell:
● By default, Kafka Streams uses RocksDB for local storage.
● This storage can be quite large,
~200 MB per partition.
● RocksDB uses up a lot of
on-heap and off-heap memory.
KAFKA STREAM STORAGE
Basic setup:
● We used the memory-optimised EC2 instance (m4) family, keeping about half the memory free for off-heap usage.
● Instances had an EBS volume
attached for partition
information storage.
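A hedged sketch of how that memory usage can be bounded from inside a Streams app, using the standard RocksDBConfigSetter hook; the sizes below are illustrative assumptions, not the values we ran with.

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Options;

    public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {
        @Override
        public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
            // Cap the off-heap block cache and memtables for each state store (sizes are illustrative).
            BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
            tableConfig.setBlockCacheSize(16 * 1024 * 1024L);  // 16 MB block cache per store
            options.setTableFormatConfig(tableConfig);
            options.setWriteBufferSize(8 * 1024 * 1024L);      // 8 MB memtable
            options.setMaxWriteBufferNumber(2);                // at most two memtables per store
        }
    }

    // Registered via the Streams config:
    // props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class);

Because each partition gets its own store, the per-store numbers multiply by the number of partitions an instance hosts, which is why the off-heap footprint grows so quickly.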
20
● If there was a single must-monitor metric for our Kafka Streams apps, it was consumer lag (see the sketch below).
● We experimented with a number of lag monitors (Burrow, Kafka Lag Monitor and Kafka Manager) but in the end started using a small utility, built by our colleague Mark Kelly, called Remora.
THINGS TO KEEP AN EYE ON
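For intuition, lag is simply the partition's end offset minus the group's last committed offset. Below is a rough sketch of computing it by hand with a plain consumer, which is essentially what Remora and the other tools report; the group id, topic and broker address are placeholders.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class LagCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app");          // the group whose lag we inspect
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
                        .map(p -> new TopicPartition(p.topic(), p.partition()))
                        .collect(Collectors.toList());
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
                for (TopicPartition tp : partitions) {
                    OffsetAndMetadata committed = consumer.committed(tp); // null if nothing committed yet
                    long consumed = (committed == null) ? 0L : committed.offset();
                    System.out.printf("%s lag=%d%n", tp, endOffsets.get(tp) - consumed);
                }
            }
        }
    }

A steadily growing lag means the app is falling behind its input; a sudden jump usually means an instance has stalled or crashed.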
21
As the load on our system increased, we started noticing something odd.
Our stream app would run happily for a few hours.
Then CPU and memory would spike, the system would grind to a halt, and the instance would crash.
RUNTIME PERFORMANCE MYSTERY
22
● We used EBS volumes to provide storage space for RocksDB.
● EBS volumes operate using I/O credits.
○ I/O credits are allocated based on the size of the disk.
○ As they get used up, I/O on the disk gets throttled.
○ These I/O credits eventually replenish over time.
● Under the hood, our RocksDB was using up I/O credits faster than they could replenish.
● Increasing the size of the EBS volume also increased the number of I/O credits (see the worked example below).
REMEMBER EBS VOLUMES?
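To put rough numbers on this (standard gp2 figures, not from the talk): a gp2 volume earns a baseline of 3 IOPS per GiB and can burst to 3,000 IOPS by drawing from a bucket of 5.4 million I/O credits, which refills at the baseline rate. A 100 GiB volume therefore has a 300 IOPS baseline, and sustained bursting at 3,000 IOPS drains a full bucket in roughly 5,400,000 / (3,000 - 300) = 2,000 seconds, about 33 minutes, after which I/O is throttled back down to 300 IOPS. Growing the volume raises the baseline, which both slows the drain and speeds up the refill.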
23
REMEMBER EBS VOLUMES?
24
WHEN CATASTROPHE STRIKES
25
● Kafka uses ZooKeeper to store some coordination information.
● This includes storing information about other brokers and the cluster
controller.
● Perhaps most importantly, it uses ZooKeeper to store topic partition assignment mappings (see the znode sketch below).
● These mappings tell Kafka brokers what data they actually store.
● ZooKeeper is a stateful service, and needs to be managed as such.
● If brokers need to be restarted, this needs to be a rolling restart.
LET’S TALK A LITTLE ABOUT ZOOKEEPER
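For a feel of what actually lives in ZooKeeper, here are a few of the znodes Kafka keeps there (paths follow the standard layout; <id> and <topic> are placeholders):

    /controller                 the broker currently acting as controller
    /brokers/ids/<id>           ephemeral registration for each live broker
    /brokers/topics/<topic>     the partition-to-replica assignment for a topic
    /config/topics/<topic>      per-topic configuration overrides

Lose or corrupt /brokers/topics and the brokers can no longer tell which partitions they are supposed to hold, which is exactly the failure described next.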
26
Most of the services run by developers at Zalando are stateless, with a separate backing store, and the vast majority of our upgrade documentation reflects this.
During one such upgrade, a ZooKeeper cluster holding Kafka information had all its instances restarted at once. This corrupted the partition assignment mappings, and as a result brokers no longer knew what data they contained.
WHEN ZOOKEEPER STOPPED PLAYING NICE
27
Good News:
● The ZooKeeper appliance in
question ran under Exhibitor.
This is a popular supervisor
system for ZooKeeper, which
provides some backup and
restore capabilities.
● The Kafka cluster was also
being persisted by Secor.
ABOUT THOSE BACKUPS...
28
ABOUT THOSE BACKUPS...
Bad News:
● The Exhibitor backups are intended for rolling back bad transactions only. For this, a user has to index the transaction logs. They are also not intended to persist data across a teardown.
● Secor is really a last-resort
recovery solution, not a full
backup system. While the Secor
files were persisted, there was no
replay mechanism or procedure
for restoring from them.
29
LESSONS LEARNED
● Backups are only backups if you know how to restore them.
● Ensure that you understand what a service means when it talks about
backup and restore.
● Test that service-provided backups work correctly.
● Regularly test restoring from your stored backups.
30
BACKUP REQUIREMENTS SUMMARY
● We needed to be able to persist data on broker disks for when the
cluster has lost connectivity.
● We don’t have to worry about Kafka Streams apps, since they persist their state to Kafka changelog topics and rebuild their RocksDB stores from those on startup.
● We needed Kafka cluster data snapshots for when bad data is written,
topics are deleted, and other user errors.
● We needed ZooKeeper backups for when partition mapping
information is corrupted.
31
KAFKA BACKUPS
● Much like with Secor, we wanted a convenient way to store a Kafka
data snapshot in S3.
● However we also wanted a simple way to replay this data back into
Kafka.
● This is when we came across Kafka Connect, a convenient framework for transporting data between Kafka and many other stores.
● Using the Spredfast S3 connector, data can easily be backed up to an S3 bucket and later replayed onto a new topic (see the sketch below).
● We set up a daily cron job for this backup.
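As an illustration of the shape such a backup job takes, here is a hedged sketch of a Kafka Connect S3 sink configuration. It uses the widely available Confluent S3 sink connector as a stand-in, since the Spredfast connector's exact property names differ; the connector name, topic, bucket and sizes are all placeholders.

    # s3-backup-sink.properties (illustrative; Confluent S3 sink connector as a stand-in)
    name=kafka-backup-s3
    connector.class=io.confluent.connect.s3.S3SinkConnector
    tasks.max=1
    topics=important-topic
    s3.region=eu-west-1
    s3.bucket.name=my-kafka-backup-bucket
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.json.JsonFormat
    # records accumulated per S3 object
    flush.size=10000

Restoring is then a matter of reading the objects back and producing them onto a fresh topic, which is the replay path we wanted from the start.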
32
ZOOKEEPER BACKUPS
● Ideally we also wanted a daily snapshot of our ZooKeeper.
● After searching around a little we found Burry. Burry is a small backup
and recovery tool for ZooKeeper, etcd and Consul.
● It simply copies all ZooKeeper znode data to a file and stores it in a specified location. This can be the local filesystem, S3, Google Cloud Storage, or others.
● Similarly it can be used to replay all this data to a new ZooKeeper
cluster. It will not overwrite existing data in the cluster.
● Likewise we set up a daily backup cron job for our ZooKeeper.
33
SOME FINAL THOUGHTS
34
SOME FINAL THOUGHTS
● There are lots of things you can monitor on a Kafka cluster, but you don’t need to be a Kafka wizard (just yet) to understand yours effectively.
● It’s not quite enough to understand how Kafka and Kafka Streams work. To diagnose and remedy many issues, a deeper understanding of the underlying components (such as EBS I/O credits) is needed.
● In many cases backups don’t have to be overly sophisticated or hard to implement, but they always need to be replayable.
35
36
Nina Hanzlikova
@geekity2
https://github.com/geekity
