WAR STORIES: DIY
KAFKA
NINA HANZLIKOVA
23-04-2018
2
● Zalando is Europe’s largest
online fashion retailer.
● "Reimagine fashion for the
good of all."
● Radical Agility:
○ Autonomy
○ Mastery
○ Purpose
WHO ARE WE?
3
● Zalando Dublin is the
company’s Fashion Insights
Centre.
● We build data science
applications to help us
understand fashion.
● We use these findings to drive
our business.
WHAT DO WE DO?
● My team works to obtain and
analyze Fashion Data from the
web.
● We aim to provide this data in
near real time.
● There is quite a lot of it.
As developers we loved Apache Kafka® (and Kafka Streams) pretty much
straight away!
4
● Zalando teams have a high level of technological autonomy.
○ Teams can choose their own technology but they have to run it,
too.
● Zalando teams are usually small (in our case 3 developers) and are
highly focused on delivering customer value.
○ We really have to minimise the amount of time spent on Ops.
THERE’S A CATCH THOUGH
5
6
A BRIEF (LIES!) INTRO TO APACHE KAFKA®
Image courtesy of http://cloudurable.com/blog/kafka-architecture/index.html
7
AN EVEN BRIEFER (LIES!) INTRO TO KAFKA
STREAMS
Image courtesy of https://docs.confluent.io/current/streams/architecture.html
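Since the original slide is just an architecture diagram, here is a minimal Kafka Streams topology as a stand-in; it is a sketch only, and the topic names and broker address are placeholders rather than anything from the talk.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // also used as the consumer group id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("input-topic");   // consume from Kafka
            input.mapValues(value -> value.toUpperCase())                    // transform each record value
                 .to("output-topic");                                        // produce results back to Kafka

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

The point of the diagram stands: a Streams app is just another Kafka client that consumes, transforms and produces, keeping any state locally and backing it with Kafka topics.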
8
KAFKA DATA STORAGE
9
We run our services on AWS. So:
● Our producer and consumer apps can shut down uncleanly.
● So can our brokers.
● Network partitions are a thing.
● Network saturation is also a
thing.
AVOIDING DATA LOSS THE MINIMALIST WAY
We started with a basic setup:
● 3 ZooKeeper Nodes (t2.large)
● 6 Kafka Brokers (m3.xlarge) with
EBS volumes
● Kafka server.properties:
unclean.leader.election.enable=false
min.insync.replicas=2
default.replication.factor=3
● Producer config: acks=all
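For context, the producer-side settings that pair with acks=all look roughly like this; the key names are standard producer configs, but the retry and in-flight values are illustrative assumptions rather than the values used in the talk.

    # producer config (illustrative values; acks=all matches the slide)
    acks=all
    # keep retrying transient send failures
    retries=2147483647
    # avoid reordering when a retry happens
    max.in.flight.requests.per.connection=1

Together with min.insync.replicas=2 and a replication factor of 3 on the broker side, a write is only acknowledged once at least two replicas have it.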
10
AND BASIC MONITORING
11
AND BASIC MONITORING
12
What happens if Kafka brokers can’t connect to ZooKeeper?
WHAT WENT WRONG?
● If they don’t need to access ZooKeeper, they keep on working (even for days).
● Eventually, though, they will need to move the controller. Or a partition will fall behind, becoming under-replicated or going offline. Or a topic will be created or deleted…
13
But connectivity will fix itself, right?
WHAT WENT WRONG?
Not exactly…
The ZooKeeper client currently still caches resolved hosts and does not re-resolve them (PR150 and PR451).
On AWS, with ZooKeeper behind a load balancer, this can even cache a
load balancer instance rather than ZooKeeper itself.
14
What about if a Kafka broker cannot properly communicate with
ZooKeeper?
WHAT WENT WRONG?
This happened sometimes when a broker was temporarily partitioned from ZooKeeper and then reconnected.
The broker would end up caching a stale zkVersion, which would then prevent ISR-modifying operations from completing.
Good news, though: this bug should be resolved from 1.1.0 onwards! [KAFKA-2729]
15
● For most of these problems the simplest and fastest fix was to restart the broker.
● In our setup, this would terminate the old broker and its EBS volume and bring up a new one.
● If the rest of the cluster is healthy, the data just replicates onto the new broker.
FIXES, FIXES, FIXES
16
● Unfortunately, with a lot of data
in the cluster, the initial
replication can saturate the
network.
● Producers and consumers start timing out when talking to Kafka, effectively causing downtime.
● What happens if multiple brokers lose connectivity? In that case a rolling restart is not always an option.
NOT SO FAST
Graph courtesy of Michal Michalski
17
● We stopped relying on simple broker replication for persistence and
started persisting our EBS volumes as well!
● Most of the data now survives a broker termination. When a broker restarts, only the messages written during its downtime need to be replicated.
● If multiple brokers report problems and a rolling restart is not feasible, we can restart them together without losing all their data (see the sketch below).
KAFKA CONFIGURATION MARK 2
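A minimal sketch of what this means on the broker side, assuming the persistent EBS volume is mounted at /var/lib/kafka; the path and id below are illustrative, not the talk's values.

    # server.properties (illustrative)
    # broker.id must stay stable across restarts so the reattached data is recognised as belonging to this broker
    broker.id=3
    # data directory on the persistent EBS mount
    log.dirs=/var/lib/kafka/data

On restart, the new instance attaches the surviving volume, finds its existing log segments under log.dirs, and only has to catch up on whatever was written while it was down.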
18
KAFKA STREAM STORAGE
19
In a nutshell:
● By default, Kafka Streams uses RocksDB for local storage.
● This storage can be quite large,
~200 MB per partition.
● RocksDB uses up a lot of
on-heap and off-heap memory.
KAFKA STREAM STORAGE
Basic setup:
● We used the memory-optimised EC2 instance (m4) family, keeping about half the memory free for off-heap usage.
● Instances had an EBS volume
attached for partition
information storage.
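A hedged sketch of how that memory usage can be bounded from inside a Streams app, using the standard RocksDBConfigSetter hook; the sizes below are illustrative assumptions, not the values we ran with.

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Options;

    public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {
        @Override
        public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
            // Cap the off-heap block cache and memtables for each state store (sizes are illustrative).
            BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
            tableConfig.setBlockCacheSize(16 * 1024 * 1024L);  // 16 MB block cache per store
            options.setTableFormatConfig(tableConfig);
            options.setWriteBufferSize(8 * 1024 * 1024L);      // 8 MB memtable
            options.setMaxWriteBufferNumber(2);                // at most two memtables per store
        }
    }

    // Registered via the Streams config:
    // props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class);

Because each partition gets its own store, the per-store numbers multiply by the number of partitions an instance hosts, which is why the off-heap footprint grows so quickly.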
20
● If there was a single must-monitor metric for our Kafka Streams apps, it was consumer lag (see the sketch below).
● We experimented with a number of lag monitors (Burrow, Kafka Lag Monitor and Kafka Manager) but in the end started using a small utility, built by our colleague Mark Kelly, called Remora.
THINGS TO KEEP AN EYE ON
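For intuition, lag is simply the partition's end offset minus the group's last committed offset. Below is a rough sketch of computing it by hand with a plain consumer, which is essentially what Remora and the other tools report; the group id, topic and broker address are placeholders.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class LagCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app");          // the group whose lag we inspect
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
                        .map(p -> new TopicPartition(p.topic(), p.partition()))
                        .collect(Collectors.toList());
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
                for (TopicPartition tp : partitions) {
                    OffsetAndMetadata committed = consumer.committed(tp); // null if nothing committed yet
                    long consumed = (committed == null) ? 0L : committed.offset();
                    System.out.printf("%s lag=%d%n", tp, endOffsets.get(tp) - consumed);
                }
            }
        }
    }

A steadily growing lag means the app is falling behind its input; a sudden jump usually means an instance has stalled or crashed.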
21
As the load on our system increased, we started noticing something odd.
Our stream app would run happily for a few hours.
Then CPU and memory would spike, the system would grind to a halt, and the instance would crash.
RUNTIME PERFORMANCE MYSTERY
22
● We used EBS volumes to provide storage space for RocksDB.
● EBS volumes operate using I/O credits.
○ I/O credits are allocated based on the size of the disk.
○ As they get used up, I/O on the disk gets throttled.
○ These I/O credits eventually replenish over time.
● Under the hood, our RocksDB was using up I/O credits faster than they could replenish.
● Increasing the size of the EBS volume also increased the number of I/O credits (see the worked example below).
REMEMBER EBS VOLUMES?
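To put rough numbers on this (standard gp2 figures, not from the talk): a gp2 volume earns a baseline of 3 IOPS per GiB and can burst to 3,000 IOPS by drawing from a bucket of 5.4 million I/O credits, which refills at the baseline rate. A 100 GiB volume therefore has a 300 IOPS baseline, and sustained bursting at 3,000 IOPS drains a full bucket in roughly 5,400,000 / (3,000 - 300) = 2,000 seconds, about 33 minutes, after which I/O is throttled back down to 300 IOPS. Growing the volume raises the baseline, which both slows the drain and speeds up the refill.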
23
REMEMBER EBS VOLUMES?
24
WHEN CATASTROPHE STRIKES
25
● Kafka uses ZooKeeper to store some coordination information.
● This includes storing information about other brokers and the cluster
controller.
● Perhaps most importantly, it uses ZooKeeper to store topic partition assignment mappings (see the znode sketch below).
● These mappings tell Kafka brokers what data they actually store.
● ZooKeeper is a stateful service, and needs to be managed as such.
● If brokers need to be restarted, this needs to be a rolling restart.
LET’S TALK A LITTLE ABOUT ZOOKEEPER
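For a feel of what actually lives in ZooKeeper, here are a few of the znodes Kafka keeps there (paths follow the standard layout; <id> and <topic> are placeholders):

    /controller                 the broker currently acting as controller
    /brokers/ids/<id>           ephemeral registration for each live broker
    /brokers/topics/<topic>     the partition-to-replica assignment for a topic
    /config/topics/<topic>      per-topic configuration overrides

Lose or corrupt /brokers/topics and the brokers can no longer tell which partitions they are supposed to hold, which is exactly the failure described next.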
26
Most of the services run by developers at Zalando are stateless, with a separate backing store, and the vast majority of our upgrade documentation reflects this.
During one such upgrade, a ZooKeeper cluster holding Kafka information had all its instances restarted at once. This corrupted the partition assignment mappings, and as a result brokers no longer knew what data they contained.
WHEN ZOOKEEPER STOPPED PLAYING NICE
27
Good News:
● The ZooKeeper appliance in
question ran under Exhibitor.
This is a popular supervisor
system for ZooKeeper, which
provides some backup and
restore capabilities.
● The Kafka cluster was also
being persisted by Secor.
ABOUT THOSE BACKUPS...
28
ABOUT THOSE BACKUPS...
Bad News:
● The Exhibitor backups are intended for rolling back bad transactions only. For this, a user has to index the transaction logs. They are also not intended to persist data across a teardown.
● Secor is really a last-resort
recovery solution, not a full
backup system. While the Secor
files were persisted, there was no
replay mechanism or procedure
for restoring from them.
29
LESSONS LEARNED
● Backups are only backups if you know how to restore them.
● Ensure that you understand what a service means when it talks about
backup and restore.
● Test that service-provided backups work correctly.
● Regularly test restoring from your stored backups.
30
BACKUP REQUIREMENTS SUMMARY
● We needed to be able to persist data on broker disks for when the
cluster has lost connectivity.
● We don’t have to worry about Kafka Streams apps, since they persist their state to Kafka changelog topics and rebuild their RocksDB stores from those on startup.
● We needed Kafka cluster data snapshots for when bad data is written,
topics are deleted, and other user errors.
● We needed ZooKeeper backups for when partition mapping
information is corrupted.
31
KAFKA BACKUPS
● Much like with Secor, we wanted a convenient way to store a Kafka
data snapshot in S3.
● However we also wanted a simple way to replay this data back into
Kafka.
● This is when we came across Kafka Connect, a convenient framework for transporting data between Kafka and many other stores.
● Using the Spredfast S3 connector, data can easily be backed up to an S3 bucket and later replayed onto a new topic (see the sketch below).
● We set up a daily cron job for this backup.
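As an illustration of the shape such a backup job takes, here is a hedged sketch of a Kafka Connect S3 sink configuration. It uses the widely available Confluent S3 sink connector as a stand-in, since the Spredfast connector's exact property names differ; the connector name, topic, bucket and sizes are all placeholders.

    # s3-backup-sink.properties (illustrative; Confluent S3 sink connector as a stand-in)
    name=kafka-backup-s3
    connector.class=io.confluent.connect.s3.S3SinkConnector
    tasks.max=1
    topics=important-topic
    s3.region=eu-west-1
    s3.bucket.name=my-kafka-backup-bucket
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.json.JsonFormat
    # records accumulated per S3 object
    flush.size=10000

Restoring is then a matter of reading the objects back and producing them onto a fresh topic, which is the replay path we wanted from the start.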
32
ZOOKEEPER BACKUPS
● Ideally we also wanted a daily snapshot of our ZooKeeper.
● After searching around a little we found Burry. Burry is a small backup
and recovery tool for ZooKeeper, etcd and Consul.
● It simply copies all ZooKeeper znode data to a file and stores it in a specified location. This can be the local filesystem, S3, Google Cloud Storage, or others.
● Similarly it can be used to replay all this data to a new ZooKeeper
cluster. It will not overwrite existing data in the cluster.
● Likewise we set up a daily backup cron job for our ZooKeeper.
33
SOME FINAL THOUGHTS
34
SOME FINAL THOUGHTS
● There are lots of things you can monitor on a Kafka cluster, but you don’t need to be a Kafka wizard (just yet) to understand yours effectively.
● It’s not quite enough to understand how Kafka and Kafka Streams work. To diagnose and remedy many issues, a deeper understanding of the underlying components (such as EBS I/O credits) is needed.
● In many cases backups don’t have to be overly sophisticated or hard to implement, but they always need to be replayable.
35
36
Nina Hanzlikova
@geekity2
https://github.com/geekity
