When Bad Things Happen to
Good Kafka Clusters
True stories that actually happened to production Kafka clusters
As told by
Gwen Shapira, System Architect
@gwenshap 1
Disclaimer
I am talking about other people’s systems
Not yours.
I am sure you had perfectly good reasons to configure your system the
way you did.
This is not personal criticism
Just some stories and a few lessons we learned the hard way
2
POCs are super easy
It's time to go to production
3
We keep our data in
/tmp/logs
What can possibly go wrong?
4
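The trouble with /tmp, roughly: it is wiped on reboot (and often by tmp-cleanup jobs), so a broker restart can take the data with it. A minimal server.properties sketch, with made-up mount points:

  # Quick-start configs ship with something like log.dirs=/tmp/kafka-logs.
  # Fine for a demo, fatal in production. Point brokers at dedicated, persistent disks:
  log.dirs=/data/kafka-1,/data/kafka-2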
Replication-factor of 3 is way too much
5
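The joke, of course, is that it isn't: disks fail in correlated batches, and a broker is often down for maintenance at exactly the wrong moment. A sketch of creating a topic the safer way (topic name and partition count are made up; the flags are the 0.8/0.9-era ones):

  kafka-topics.sh --create --zookeeper localhost:2181 \
    --topic clicks --partitions 8 --replication-factor 3 \
    --config min.insync.replicas=2

Pairing min.insync.replicas=2 with acks=all on the producer is the usual way to make that third replica pay for itself.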
__consumer_offsets topic?
Never heard of it, so it's probably
ok to delete.
6
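__consumer_offsets is the internal topic where consumers commit their offsets (since 0.8.2). Delete it and every consumer group forgets where it was and falls back to auto.offset.reset. A sketch of inspecting it instead, plus the broker settings that govern it (host and values are illustrative):

  kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets

  # server.properties
  offsets.topic.replication.factor=3   # don't leave the offsets topic under-replicated
  offsets.retention.minutes=10080      # how long committed offsets survive for idle groups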
7
What’s wrong with running Kafka 0.7?
8
Remember that time when…
We accidentally lost all our data?
9
We added new partitions…
And immediately ran out of memory
10
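The mechanism, as far as we can tell in cases like this: consumers and replica fetchers budget up to a full fetch buffer per partition, so partition count multiplies straight into memory. Back-of-envelope, with illustrative numbers:

  # fetch buffer per partition (max.partition.fetch.bytes / fetch.message.max.bytes /
  # replica.fetch.max.bytes): roughly 1 MB by default
  #   500 partitions per consumer * 1 MB  =  ~0.5 GB of fetch buffers
  # 2,000 partitions per consumer * 1 MB  =  ~2 GB, easily past a modest JVM heap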
We wanted to look up records by time
The smaller the segments, the more accurate
the lookups
So we created 10k segments.
11
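Why this hurts, roughly: the old offset-by-time lookup only resolves to segment boundaries, which is what made tiny segments tempting, but every segment carries an index file, a memory-mapped region, and open file handles, multiplied across every partition. Illustrative topic-level settings (values are examples, not recommendations):

  segment.bytes=1073741824    # default is about 1 GB; shrinking this to a few MB is how you end up with 10k segments
  segment.ms=604800000        # roll on time if you must, but keep it coarse
  # And watch the OS limits that tiny segments blow through: ulimit -n (open files), vm.max_map_count (mmap'd indexes)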
We need REALLY LARGE messages
12
We just serialize JSON
and throw it into a topic.
It’s easy.
The consumers will figure something out.
13
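Spoiler: the consumers rarely figure it out. One common alternative, sketched here with made-up topic name, schema, and URLs, is an explicit schema with Avro and the Confluent Schema Registry, so the contract lives somewhere other than tribal knowledge:

  import java.util.Properties;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class SchemaExample {
    public static void main(String[] args) {
      // An explicit, versionable contract instead of "whatever JSON the producer felt like"
      Schema schema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
        + "{\"name\":\"user\",\"type\":\"string\"},{\"name\":\"ts\",\"type\":\"long\"}]}");

      Properties props = new Properties();
      props.put("bootstrap.servers", "broker1:9092");
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
      props.put("schema.registry.url", "http://schema-registry:8081");

      GenericRecord click = new GenericData.Record(schema);
      click.put("user", "gwen");
      click.put("ts", System.currentTimeMillis());

      try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
        // The serializer registers and validates the schema, so consumers know exactly what to expect
        producer.send(new ProducerRecord<>("clicks", click));
      }
    }
  }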
Log4J is a great way to
reliably send data to Kafka
14
Keep your Kafka safe!
“When it absolutely, positively has to be there:
Reliability guarantees in Apache Kafka”
Wednesday, 11:20am, Room 3D
15
Thank you
16
Visit Confluent in booth #929
Books, Kafka t-shirts & stickers, and more…
Gwen Shapira | gwen@confluent.io | @gwenshap

Editor's Notes

  • #6 Hard drive failures are somewhat correlated. There are bad batches. And Kafka brokers will crash on a single bad disk. Also, sometimes you lose a node *and* need to restart a controller.