(Nina Hanzlikova, Zalando) Kafka Summit SF 2018
My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always been a smooth journey.
Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code?
In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka Streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize our time spent administering our Kafka brokers and various stream applications.
WHO ARE WE?
● Zalando is Europe’s largest online fashion retailer.
● "Reimagine fashion for the good of all."
● Radical Agility:
○ Autonomy
○ Mastery
○ Purpose
WHAT DO WE DO?
● Zalando Dublin is the company’s Fashion Insights Centre.
● We build data science applications to help us understand fashion.
● We use these findings to drive our business.
● My team works to obtain and analyze Fashion Data from the web.
● We aim to provide this data in near real time.
● There is quite a lot of it.
As developers we loved Apache Kafka® (and Kafka Streams) pretty much
straight away!
THERE’S A CATCH THOUGH
● Zalando teams have a high level of technological autonomy.
○ Teams can choose their own technology, but they have to run it, too.
● Zalando teams are usually small (in our case, 3 developers) and are highly focused on delivering customer value.
○ We really have to minimise the amount of time spent on Ops.
AVOIDING DATA LOSS THE MINIMALIST WAY
We run our services on AWS. So:
● Our producer and consumer apps can shut down uncleanly.
● So can our brokers.
● Network partitions are a thing.
● Network saturation is also a thing.
We started with a basic setup:
● 3 ZooKeeper nodes (t2.large)
● 6 Kafka brokers (m3.xlarge) with EBS volumes
● Kafka server.properties:
unclean.leader.election.enable=false
min.insync.replicas=2
default.replication.factor=3
● Producer config: acks=all
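How these settings interact can be sketched in a few lines of plain Python (an illustration of the acceptance rule, not Kafka code): with a replication factor of 3, `min.insync.replicas=2` and `acks=all`, a write is accepted as long as at least two replicas are in sync, so losing a single broker is tolerated without risking acknowledged data.

```python
# Sketch (not Kafka code): how acks=all interacts with min.insync.replicas.
# A produce request with acks=all is accepted only while the number of
# in-sync replicas is at least min.insync.replicas; otherwise the broker
# rejects it (NotEnoughReplicas) rather than silently risking data loss.

def write_accepted(in_sync_replicas: int, min_insync: int = 2) -> bool:
    """Return True if an acks=all produce would be accepted."""
    return in_sync_replicas >= min_insync

# default.replication.factor=3: we start with 3 in-sync replicas.
print(write_accepted(3))  # all replicas healthy -> True
print(write_accepted(2))  # one broker down -> still True
print(write_accepted(1))  # two brokers down -> False, write refused
```

The trade-off is availability for durability: with only one in-sync replica left, producers see errors instead of writing data that could be lost.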
WHAT WENT WRONG?
What happens if Kafka brokers can’t connect to ZooKeeper?
● If they don’t need to access ZooKeeper, they keep on working (even for days).
● Eventually, though, they will need to move the controller. Or a partition will fall behind, becoming under-replicated or going offline. Or a topic will be created or deleted…
WHAT WENT WRONG?
But connectivity will fix itself, right?
Not exactly…
The ZooKeeper client cached resolved hosts and did not re-resolve them (PR150 and PR451).
On AWS, with ZooKeeper behind a load balancer, this can even cache a load balancer instance rather than ZooKeeper itself.
This issue should be resolved in Kafka 2.0.0 [KAFKA-4041].
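To make the failure mode concrete, here is a small sketch (our own illustration, not the actual ZooKeeper client code): a client that caches resolved addresses forever keeps dialing stale IPs, while a cache with a TTL eventually re-resolves. The resolver and clock are injected so the behaviour is easy to demonstrate.

```python
import time

# Sketch of the caching bug: a resolver cache with no TTL keeps returning
# stale addresses forever; adding a TTL forces periodic re-resolution.
class CachingResolver:
    def __init__(self, resolve_fn, ttl_seconds=None, clock=time.monotonic):
        self._resolve = resolve_fn   # e.g. would wrap socket.getaddrinfo
        self._ttl = ttl_seconds      # None reproduces the buggy behaviour
        self._clock = clock
        self._cache = {}             # host -> (resolved_at, addresses)

    def addresses(self, host):
        entry = self._cache.get(host)
        now = self._clock()
        expired = entry and self._ttl is not None and now - entry[0] > self._ttl
        if entry is None or expired:
            entry = (now, self._resolve(host))
            self._cache[host] = entry
        return entry[1]

# Simulated DNS whose answer changes (e.g. a load balancer is replaced).
answers = {"zk.internal": ["10.0.0.1"]}
fake_clock = [0.0]

buggy = CachingResolver(lambda h: list(answers[h]), ttl_seconds=None,
                        clock=lambda: fake_clock[0])
fixed = CachingResolver(lambda h: list(answers[h]), ttl_seconds=30,
                        clock=lambda: fake_clock[0])

buggy.addresses("zk.internal"); fixed.addresses("zk.internal")
answers["zk.internal"] = ["10.0.0.9"]   # DNS now points somewhere else
fake_clock[0] = 60.0                     # a minute later
print(buggy.addresses("zk.internal"))    # ['10.0.0.1'] - stale forever
print(fixed.addresses("zk.internal"))    # ['10.0.0.9'] - re-resolved
```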
WHAT WENT WRONG?
What if a Kafka broker cannot properly communicate with ZooKeeper?
This happened sometimes when a broker temporarily partitioned from and then reconnected to ZooKeeper.
The broker would end up caching an old zkVersion. This would then prevent ISR-modifying operations from completing.
Good news though: this bug should be resolved from 1.1.0 on! [KAFKA-2729]
FIXES, FIXES, FIXES
● For most of these problems, the simplest and fastest solution was to restart the broker.
● In our setup, this would terminate the old broker and its EBS volume and bring up a new one.
● If the rest of the cluster is healthy, the data will just replicate onto the new broker.
NOT SO FAST
● Unfortunately, with a lot of data in the cluster, the initial replication can saturate the network.
● Producers and consumers start timing out talking to Kafka - effectively causing downtime.
● What happens if multiple brokers lose connectivity? In that case, a rolling restart is not always an option.
Graph courtesy of Michal Michalski
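A back-of-the-envelope sketch (our own illustration, with made-up numbers) shows why re-replicating a whole broker hurts: even at a sustained 1 Gb/s, copying terabytes takes hours, and all of that bandwidth competes with live producer and consumer traffic.

```python
# Rough estimate (illustration only): time to re-replicate a broker's data.
def replication_hours(data_terabytes: float, sustained_gbits_per_sec: float) -> float:
    """Hours to copy `data_terabytes` at `sustained_gbits_per_sec`."""
    bits = data_terabytes * 1e12 * 8                  # TB -> bits
    seconds = bits / (sustained_gbits_per_sec * 1e9)  # bits / (bits per second)
    return seconds / 3600

# An empty replacement broker pulling 2 TB at a sustained 1 Gb/s:
print(round(replication_hours(2, 1.0), 1))  # ~4.4 hours of saturated network
```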
KAFKA CONFIGURATION MARK 2
● We stopped relying on simple broker replication for persistence and started persisting our EBS volumes as well!
● Most of the data is now persisted during termination. When a broker restarts, only the messages written during its downtime need to be replicated.
● If multiple brokers report problems and a rolling restart is not feasible, we can restart multiple brokers without losing all their data.
KAFKA STREAMS STORAGE
In a nutshell:
● By default, Kafka Streams uses RocksDB for local storage.
● This storage can be quite large, ~200 MB per partition.
● RocksDB uses up a lot of on-heap and off-heap memory.
Basic setup:
● We used the memory-optimised EC2 instance family (m4), keeping about half the memory for off-heap usage.
● Instances had an EBS volume attached for partition information storage.
THINGS TO KEEP AN EYE ON
● If there were a single must-monitor metric for Kafka Streams apps, it would be consumer lag.
● We experimented with a number of lag monitors (Burrow, Kafka Lag Monitor and Kafka Manager) but in the end started using a small utility, built by our colleague Mark Kelly, called Remora.
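Consumer lag is just the per-partition gap between the broker’s log end offset and the consumer group’s committed offset. A minimal sketch of the computation (our illustration, not Remora’s actual code):

```python
# Sketch: consumer lag per partition = log end offset - committed offset.
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition; a partition with no commit counts from offset 0."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

end = {0: 1500, 1: 980, 2: 2100}   # broker-side log end offsets
done = {0: 1500, 1: 950}           # the consumer group's committed offsets
lag = consumer_lag(end, done)
print(lag)                # {0: 0, 1: 30, 2: 2100}
print(max(lag.values()))  # 2100 - the number to alert on
```

Partition 2 has never committed, so its entire log counts as lag - exactly the situation you want an alert for.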
RUNTIME PERFORMANCE MYSTERY
As the load on our system increased, we started noticing something odd.
Our stream app would run happily for a few hours.
Then CPU and memory would spike, the system would grind to a halt, and the instance would crash.
REMEMBER EBS VOLUMES?
● We used EBS volumes to provide storage space for RocksDB.
● EBS volumes operate using I/O credits.
○ I/O credits are allocated based on the size of the disk.
○ As they get used up, I/O on the disk gets throttled.
○ These I/O credits replenish over time.
● Under the hood, our RocksDB was using up I/O credits faster than they were replenishing.
● Increasing the size of the EBS volume also increased the number of I/O credits.
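A rough model makes the throttling visible. The constants below are taken from AWS’s documented gp2 behaviour at the time (baseline of 3 IOPS per GiB with a 100 IOPS floor, burst up to 3,000 IOPS, a bucket of 5.4 million I/O credits refilled at the baseline rate); treat them as assumptions when checking against your own volumes.

```python
# Rough model of EBS gp2 I/O credits (assumed constants from AWS gp2 docs):
#   baseline = 3 IOPS per GiB (min 100), burst = 3000 IOPS,
#   bucket   = 5.4 million I/O credits, refilled at the baseline rate.
def seconds_until_throttled(volume_gib: float, workload_iops: float) -> float:
    baseline = max(100.0, 3.0 * volume_gib)
    if workload_iops <= baseline:
        return float("inf")          # credits refill faster than they drain
    drain_per_sec = workload_iops - baseline
    return 5_400_000 / drain_per_sec

# RocksDB hammering the disk at a constant 3000 IOPS:
print(seconds_until_throttled(100, 3000) / 60)   # 100 GiB: ~33 min of burst
print(seconds_until_throttled(500, 3000) / 60)   # 500 GiB: ~60 min of burst
print(seconds_until_throttled(1000, 3000))       # 1 TiB: baseline 3000 -> inf
```

This matches what we saw: the app "ran happily for a few hours" while burning credits, then throttling made everything grind to a halt - and simply buying a bigger volume raised the baseline enough to avoid it.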
LET’S TALK A LITTLE ABOUT ZOOKEEPER
● Kafka uses ZooKeeper to store some coordination information.
● This includes storing information about other brokers and the cluster controller.
● Perhaps most importantly, it uses ZooKeeper to store topic partition assignment mappings.
● These mappings tell Kafka brokers what data they actually store.
● ZooKeeper is a stateful service, and needs to be managed as such.
● If brokers need to be restarted, this needs to be a rolling restart.
WHEN ZOOKEEPER STOPPED PLAYING NICE
Most of the services run by developers at Zalando are stateless, with a backing store. The vast majority of docs on upgrades reflect this.
During one such upgrade, a ZooKeeper cluster holding Kafka information had all its instances restarted at once. This corrupted the partition assignment mappings. As a result, brokers no longer knew what data they contained.
ABOUT THOSE BACKUPS...
Good News:
● The ZooKeeper appliance in question ran under Exhibitor. This is a popular supervisor system for ZooKeeper, which provides some backup and restore capabilities.
● The Kafka cluster was also being persisted by Secor.
ABOUT THOSE BACKUPS...
Bad News:
● The Exhibitor backups are intended for rolling back bad transactions only. For this, a user has to index transaction logs. They are also not intended to persist after teardown.
● Secor is really a last-resort recovery solution, not a full backup system. While the Secor files were persisted, there was no replay mechanism or procedure for restoring from them.
LESSONS LEARNED
● Backups are only backups if you know how to restore them.
● Ensure that you understand what a service means when it talks about backup and restore.
● Test that service-provided backups work correctly.
● Regularly test restoring from your stored backups.
BACKUP REQUIREMENTS SUMMARY
● We needed to be able to persist data on broker disks for when the cluster has lost connectivity.
● We don’t have to worry about Kafka Streams apps, since they persist their data in Kafka topics and rebuild their RocksDB state from those on startup.
● We needed Kafka cluster data snapshots for when bad data is written, topics are deleted, and other user errors occur.
● We needed ZooKeeper backups for when partition mapping information is corrupted.
KAFKA BACKUPS
● Much like with Secor, we wanted a convenient way to store a Kafka data snapshot in S3.
● However, we also wanted a simple way to replay this data back into Kafka.
● This is when we came across Kafka Connect. Kafka Connect is a convenient framework which enables transporting data between Kafka and many other stores.
● Using the Spredfast S3 connector, data can be easily backed up to a bucket in S3 and later replayed out onto a new topic.
● We set up a daily cron job for this backup.
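Connector details vary by version, so treat this as an illustrative sketch: a sink connector is registered by POSTing JSON to Kafka Connect’s REST API, and a daily cron job can drive the snapshot. The connector class name and the `s3.bucket` key below are assumptions for illustration, not copied from our actual configuration.

```python
import json

# Sketch: build the JSON payload for registering an S3 sink connector via
# Kafka Connect's REST API (POST http://<connect-host>:8083/connectors).
# The connector class and the s3.bucket key are illustrative assumptions.
def s3_backup_connector(name: str, topics: list, bucket: str) -> str:
    payload = {
        "name": name,
        "config": {
            "connector.class": "com.spredfast.kafka.connect.s3.sink.S3SinkConnector",
            "topics": ",".join(topics),  # topics to snapshot to S3
            "s3.bucket": bucket,         # destination bucket (assumed key)
            "tasks.max": "1",
        },
    }
    return json.dumps(payload)

body = s3_backup_connector("daily-backup", ["fashion-events"], "my-kafka-backups")
print(body)
# A daily cron job can then submit this payload, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        --data @connector.json http://connect:8083/connectors
```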
ZOOKEEPER BACKUPS
● Ideally, we also wanted a daily snapshot of our ZooKeeper data.
● After searching around a little, we found Burry. Burry is a small backup and recovery tool for ZooKeeper, etcd and Consul.
● It simply copies all ZooKeeper znode data to a file and stores it in a specified location. This can be the local filesystem, S3, Google Cloud Storage or others.
● Similarly, it can be used to replay all this data into a new ZooKeeper cluster. It will not overwrite existing data in the cluster.
● Likewise, we set up a daily backup cron job for our ZooKeeper.
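The core of such a backup is just a recursive walk of the znode tree, recording each path’s data. A minimal sketch (our own illustration, not Burry’s code), with the ZooKeeper cluster simulated by nested dicts so it runs standalone:

```python
# Sketch: back up a znode tree by walking it recursively, recording
# path -> data. A real tool would read via a ZooKeeper client and ship the
# result to S3; here the "cluster" is simulated with nested dicts.
def dump_tree(node: dict, path: str = "") -> dict:
    snapshot = {path or "/": node.get("data")}
    for name, child in node.get("children", {}).items():
        snapshot.update(dump_tree(child, f"{path}/{name}"))
    return snapshot

zk = {  # tiny simulated znode tree
    "data": b"",
    "children": {
        "brokers": {"data": b"", "children": {
            "ids": {"data": b"[1001,1002]", "children": {}},
        }},
        "controller": {"data": b'{"brokerid":1001}', "children": {}},
    },
}
backup = dump_tree(zk)
print(sorted(backup))
# ['/', '/brokers', '/brokers/ids', '/controller']
```

Restoring is the mirror image: replay each path in order, creating znodes that do not yet exist, which matches Burry’s no-overwrite behaviour described above.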
SOME FINAL THOUGHTS
● There are lots of things you can monitor in a Kafka cluster, but you don’t need to be a Kafka wizard (just yet) to understand your cluster effectively.
● It’s not quite enough to understand how Kafka and Kafka Streams work. To diagnose and remedy many issues, a deeper understanding of the underlying components (such as EBS I/O credits) is needed.
● In many cases backups don’t have to be overly sophisticated or hard to implement, but they always need to be replayable.