I'll tell a story about how we hunted down a Heisenbug in a system that should
have prevented it by design in the first place, and how we finally fixed it.
The story involves Kafka, Kafka Connect, Elasticsearch, optimistic concurrency
control, data inconsistencies, and SREs with plenty of good intentions who,
through a series of unfortunate circumstances, caused a nasty bug.
A detailed description of the Elasticsearch indexing pipeline setup:
Setup (3): Elasticsearch
- Optimistic concurrency control
- The client sends the ‘_version’ number of the document in the indexing request
- Elasticsearch promises that the document with the highest version number is searchable
- A user changes the price of her listing on Vinted
- The change results in a new document version
- Elasticsearch stores only the newer version of the listing with the updated price
- Gist: Elasticsearch stores only the newest version of a listing
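The flow above maps onto Elasticsearch's external versioning API. A request shaped roughly like this (the `listings` index and document id `42` are hypothetical) asks Elasticsearch to keep the document only if `2` is higher than the version it already stores:

```
PUT /listings/_doc/42?version=2&version_type=external
{ "price": 8 }
```

If a stale request with a lower version arrives later, Elasticsearch rejects it with a version conflict instead of overwriting the newer document.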
Setup (4): Kafka
- Data is not deleted when it gets “old”
- retention.ms = -1
- Needed to support data reindexing into Elasticsearch
- Log compaction
- Kafka will always retain at least the last known value for each message key
- This makes sure that we are not running out of disk space
- Tombstone messages, i.e. messages with a null body, mark records for deletion
- Newer messages have higher offsets in a Kafka topic partition
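The retention and compaction behaviour above corresponds to per-topic Kafka configuration; a fragment like this (values taken from the bullets above) expresses it:

```properties
# Per-topic settings matching the setup above
retention.ms=-1          # never expire data by age; needed for reindexing
cleanup.policy=compact   # log compaction: keep at least the last value per key
```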
Setup (5): Kafka Connect
- Framework and a library
- Reads listing data from Kafka topics
- Indexes listings into Elasticsearch
- Error handling (e.g. dead letter queue)
- Configuration, management
- Indexing throughput
We use the Kafka topic partition offset as the Elasticsearch document _version.
This trick lets us parallelize indexing into Elasticsearch without worrying
about data consistency.
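The combination of the two ideas can be sketched in a few lines (the store and function names below are illustrative, not real Kafka Connect or Elasticsearch APIs). Because offsets grow monotonically within a partition, parallel or replayed consumers can index the same records in any order and the highest-offset (newest) state still wins:

```python
# Sketch: use the Kafka partition offset as the Elasticsearch document
# version. Writes with a lower offset than the stored one are ignored,
# so out-of-order delivery converges to the newest document.

index_store = {}  # doc_id -> (offset_as_version, document)

def index_with_offset(doc_id, offset, document):
    """Keep the document only if its offset beats the stored version."""
    current = index_store.get(doc_id)
    if current is None or offset > current[0]:
        index_store[doc_id] = (offset, document)

# Two updates for the same listing arrive out of order:
index_with_offset("listing-1", offset=17, document={"price": 8})
index_with_offset("listing-1", offset=5, document={"price": 10})  # older, ignored
print(index_store["listing-1"])  # (17, {'price': 8})
```

Note this only holds while a key stays in the same partition — offsets from different partitions are not comparable, which is exactly where the bug came from.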
The Technical Reason (1)
- Kafka assigns partitions to messages by hashing the key of the message
- But the increased partition count changed the function!
partition_nr = hash(message.key) % partition_count
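A toy version of the assignment function makes the problem visible (the byte-sum hash below is a stand-in — Kafka's default partitioner actually uses murmur2 on the key bytes):

```python
# Toy illustration: the partition a key maps to depends on
# partition_count, so changing the count remaps most keys.

def partition_nr(key: str, partition_count: int) -> int:
    # Stable stand-in hash; Kafka itself hashes the key bytes with murmur2.
    h = sum(key.encode())
    return h % partition_count

key = "listing-42"
print(partition_nr(key, 8))   # partition before the increase
print(partition_nr(key, 16))  # a different partition after the increase
```

After the remapping, new updates for the key land in a fresh partition whose offsets start low, so they can lose the version comparison against old documents indexed from the original, high-offset partition.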
The Technical Reason (2)
Most messages with a key were written to a different partition after the
partition count was increased:
probability_of_error = 1 - (1 / partition_count)
Why would one increase the partition count?
- Partition is a scalability unit in Kafka.
  - write scalability (a partition must fit on a single node)
  - read scalability (the number of active consumers in a group is capped by the partition count)
- Required a full re-ingestion of data from the primary datastore into Kafka.
- It would have been enough to just write the data to differently named topics.
- However, we used the opportunity to upgrade the Kafka cluster from 1.1.1 to
  2.4.0 (yes, another Kafka cluster)
How to prevent such a bug?
- Don’t increase the partition count if you rely on message ordering!
- Pick sensible defaults for Kafka settings.
- If you don't rely on offsets, e.g. your messages have no meaningful key (think
  logging), then increasing the partition count will not cause any big trouble
  (just a rebalance of consumer groups).