The document discusses reliability guarantees in Apache Kafka. It explains that Kafka ensures reliability through replication: each partition has a leader and follower replicas, and producers receive acknowledgments once data is committed to the in-sync replicas. Consumers can commit offsets to ensure they don't miss data on rebalance. The document closes with best practices for producer and consumer configuration, and for monitoring, to prevent data loss in Kafka.
Conceptually, this is our high-level consumer. In this diagram we have a topic with 6 partitions and an application running 4 threads.
Kafka provides two different paradigms for committing offsets. The first is “auto-committing” (more on this later). The second is to manually commit offsets in your application. But when is the right time? If we commit offsets as soon as we receive a message, we expose ourselves to data loss, since a process, machine, or thread failure could hit before we persist or otherwise process our data.
So what we’d really like to do is commit offsets only after we’ve done some amount of processing and/or persistence on the data. Typical examples: after producing a new message to Kafka, or after writing a record to HDFS.
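As a rough sketch of that pattern with the high-level consumer (the topic name, group id, and persist() helper are hypothetical), we disable auto-commit, do the durable work first, and only then commit:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class CommitAfterProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "reliable-group");
        props.put("auto.commit.enable", "false"); // we commit ourselves

        ConsumerConnector consumer =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            consumer.createMessageStreams(Collections.singletonMap("events", 1));

        ConsumerIterator<byte[], byte[]> it =
            streams.get("events").get(0).iterator();
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> record = it.next();
            persist(record.message());  // durable work first...
            consumer.commitOffsets();   // ...then record our progress
        }
    }

    // Stand-in for whatever makes the data durable, e.g. an HDFS write.
    static void persist(byte[] value) { /* ... */ }
}
```

Committing on every single message like this is simple but costly; in practice you’d commit every N messages or every few seconds, trading a little reprocessing on failure for throughput.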
So let’s say we have auto-commit enabled, and we are chugging along, counting on the consumer to commit our offsets for us. This is great because we don’t have to code anything, and we don’t have to think about the frequency of commits and the impact that might have on our throughput. Life is good. But now we’ve lost a thread or a process, and we don’t really know where we are in the processing, because the last auto-commit committed stuff that we hadn’t actually written to disk.
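For context, that convenience is just a couple of consumer properties (the values shown are illustrative); the commit happens on a timer, with no knowledge of how far our processing has actually gotten:

```java
Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "my-group");
props.put("auto.commit.enable", "true");        // auto-commit on
props.put("auto.commit.interval.ms", "60000");  // commit whatever we've read, every 60s
```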
So now we’re in a situation where we think we’ve read all of our data, but in fact we have gaps. Note the same risk applies if we lose a partition or broker and a new leader is elected. Or if we add more consumers to the same group and the partition assignment is rebalanced: imagine you are hanging in your processing, or there’s some other reason you have to exit before persisting to disk; the newly added consumer will just pick up from the last committed offset.
OK, so don’t use auto-commit if you care about this sort of thing.
One other thing to note: if you are running code akin to the ConsumerGroupExample on the wiki, with one consumer and multiple threads, then when you issue a commit from one thread, it commits across all threads. So this isn’t great, for all of the reasons we mentioned a few moments ago.
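To make that concrete, here is a sketch in the spirit of that example (shutdown and error handling elided, names hypothetical); the key point is that commitOffsets() lives on the shared connector, not on an individual stream:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class SharedCommitPitfall {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "my-group");
        props.put("auto.commit.enable", "false");

        final ConsumerConnector consumer =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One connector fanned out into 4 streams, one per worker thread.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            consumer.createMessageStreams(Collections.singletonMap("events", 4));

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final KafkaStream<byte[], byte[]> stream : streams.get("events")) {
            pool.submit(new Runnable() {
                public void run() {
                    ConsumerIterator<byte[], byte[]> it = stream.iterator();
                    while (it.hasNext()) {
                        process(it.next().message());
                        // Pitfall: commitOffsets() belongs to the shared
                        // connector, so this commits the current read position
                        // of ALL four streams, including messages the other
                        // threads have read but not yet finished processing.
                        consumer.commitOffsets();
                    }
                }
            });
        }
    }

    static void process(byte[] value) { /* application logic */ }
}
```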
So disable auto-commit, commit after your processing, and run the high-level consumer in its own thread.
To cement this:
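Below is a sketch that puts all three rules together, again assuming a hypothetical “events” topic and persist() step: auto-commit is off, offsets are committed only after the durable work, and each thread owns its own connector, so a commit can never cover another thread’s messages.

```java
import java.util.Collections;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class OneConnectorPerThread implements Runnable {
    private final ConsumerConnector consumer;

    OneConnectorPerThread() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "reliable-group");
        props.put("auto.commit.enable", "false"); // rule 1: no auto-commit
        consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
    }

    public void run() {
        // Rule 3: one connector, one stream, one thread. A commit here can
        // only ever cover messages this thread has itself consumed.
        KafkaStream<byte[], byte[]> stream = consumer
            .createMessageStreams(Collections.singletonMap("events", 1))
            .get("events").get(0);

        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            persist(it.next().message()); // rule 2: durable work first...
            consumer.commitOffsets();     // ...then commit
        }
    }

    // Stand-in for the real persistence step, e.g. an HDFS write.
    private void persist(byte[] value) { /* ... */ }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            new Thread(new OneConnectorPerThread()).start();
        }
    }
}
```

The cost is four connectors’ worth of sessions and rebalance chatter instead of one, but each thread’s commits now describe only work that thread has actually finished.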
Note that a lot of this changes in the next release with the new Consumer, but maybe we will revisit that once it’s released!