Apache Kafka Reliability

1
When it absolutely, positively, has to be there
Reliability Guarantees in Apache Kafka
@jeffholoman Cloudera
@gwenshap Confluent

2
Apache Kafka
High Throughput
Low Latency
Scalable
Centralized
Real-time

3
Streaming Platform
Producer Consumer
Streaming Applications
Connectors Connectors
Apache Kafka

4
Versions of Apache Kafka
• 0.7.0 <- Please don’t
• 0.8.0 <- Replication exists; it will continue evolving with every release
• 0.8.2 <- New producer, offset commits to Kafka
• 0.9.0 <- New consumer, Connect APIs
• 0.10.0 <- New consumer improvements, Streams APIs
• 0.11.0 <- Idempotent producer, transactional semantics, exactly once

5
Kafka Components
• Broker
• Java clients:
  • Producer
  • Consumers
  • Kafka Streams
  • Kafka Connect
• Non-Java:
  • librdkafka
  • librdkafka-based – Python, Go, NodeJS, C#...
  • Others

6
If Kafka is a critical piece of our pipeline
• Can we be 100% sure that our data will get there?
• Can we lose messages?
• How do we verify?
• Whose fault is it?

7
Distributed Systems
• Things fail
• Systems are designed to tolerate failure
• We must expect failures and design our code and configure our systems to handle them

8
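Data Flow - Producer
[Figure: on the client machine, data flows from the application thread through the Kafka client, the O/S socket buffer and the NIC; it crosses the network to the broker machine's NIC, O/S socket buffer, broker, page cache and disk, and on to replication. An ack or exception returns to the client's callback. A ✗ marks each of these points as a place where things can fail.]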

10
Kafka is super reliable.
Stores data, on disk. Replicated.
… if you know how to configure it that way.

11
Replication is your friend
• Kafka protects against failures by replicating data
• The unit of replication is the partition
• One replica is designated as the Leader
• Follower replicas fetch data from the leader
• The leader holds the list of “in-sync” replicas

12
Replication and ISRs
Topic: my_topic   Partitions: 3   Replicas: 3
Brokers: 100, 101, 102 (each holds a replica of partitions 0, 1 and 2; the producer writes to the leaders)
Partition: 0   Leader: 100   ISR: 101,102
Partition: 1   Leader: 101   ISR: 100,102
Partition: 2   Leader: 102   ISR: 101,100

13
ISR
2 things make a replica in-sync
• Lag behind leader
  • replica.lag.time.max.ms – how long a replica may go without fetching, or stay behind the leader, before it drops out of the ISR
  • replica.lag.max.messages – removed in 0.9
• Connection to Zookeeper
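The time-based lag rule can be illustrated with a toy model. This is only a sketch of the idea, not Kafka's broker code: the class `IsrTracker` and its methods are hypothetical names, and the model ignores the Zookeeper session condition entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the time-based ISR check (hypothetical, not Kafka's implementation). */
final class IsrTracker {
    private final long replicaLagTimeMaxMs;
    // broker id -> last time (ms) the follower was fully caught up with the leader
    private final Map<Integer, Long> lastCaughtUpMs = new LinkedHashMap<>();

    IsrTracker(long replicaLagTimeMaxMs) {
        this.replicaLagTimeMaxMs = replicaLagTimeMaxMs;
    }

    /** Record that a follower fetched up to the leader's log end at time nowMs. */
    void recordCaughtUp(int brokerId, long nowMs) {
        lastCaughtUpMs.put(brokerId, nowMs);
    }

    /** Followers that were caught up within the last replicaLagTimeMaxMs milliseconds. */
    List<Integer> inSync(long nowMs) {
        List<Integer> isr = new ArrayList<>();
        for (Map.Entry<Integer, Long> e : lastCaughtUpMs.entrySet()) {
            if (nowMs - e.getValue() <= replicaLagTimeMaxMs) {
                isr.add(e.getKey());
            }
        }
        return isr;
    }
}
```

With a 10-second limit, a follower that last caught up 12 seconds ago is out of the ISR, while one that caught up 3 seconds ago stays in.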

14
Terminology
Acked
• Producers will not retry sending.
• Depends on producer setting.
Committed
• Only when the message got to all ISR (future leaders have it).
• Consumers can read.
• replica.lag.time.max.ms controls: how long can a dead replica prevent consumers from reading?
Committed Offsets
• Consumer told Kafka the latest offsets it read. By default the consumer will not see these events again.

15
Replication
Acks = all
• Waits for all in-sync replicas to reply.
Replica 3: 100
Replica 2: 100
Replica 1: 100
Time →

16
Replication
Replica 3 stopped replicating for some reason
Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101
Time →
100: acked in acks = all, “committed”
101: acked in acks = 1 but not “committed”
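The distinction between acked and committed can be made concrete with a small calculation: a message is committed once every replica still in the ISR has it. The sketch below is a simplified model with hypothetical names (`CommittedOffsets`, `lastCommittedOffset`); it tracks the last offset each replica holds rather than Kafka's real high-watermark bookkeeping.

```java
import java.util.Map;
import java.util.Set;

/** Simplified model: the last committed offset is the minimum over the in-sync replicas. */
final class CommittedOffsets {
    /**
     * @param lastOffsetByReplica last offset stored on each replica
     * @param isr                 names of the replicas currently in sync
     * @return highest offset present on every in-sync replica, or -1 if the ISR is empty
     */
    static long lastCommittedOffset(Map<String, Long> lastOffsetByReplica, Set<String> isr) {
        if (isr.isEmpty()) {
            return -1L;
        }
        long committed = Long.MAX_VALUE;
        for (String replica : isr) {
            // A replica missing from the map is treated as holding nothing.
            committed = Math.min(committed, lastOffsetByReplica.getOrDefault(replica, -1L));
        }
        return committed;
    }
}
```

With Replica 1 and Replica 2 at 101 and Replica 3 at 100, the result is 100 while all three are in the ISR, and 101 once Replica 3 drops out of it.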

17
Replication
One replica drops out of ISR, or goes offline
Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101
Time →
All messages are now acked and committed

18
Replication
2nd Replica drops out, or is offline
Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101 102 103 104
Time →

19
Replication
Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101 102 103 104
Time →
Now we’re in trouble ✗

20
Replication
If Replica 2 or 3 comes back online before the leader, you will lose data.
Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101 102 103 104
Time →
All those are “acked” and “committed”

21
So what to do?
Disable Unclean Leader Election
• unclean.leader.election.enable = false
• Default from 0.11.0
Set replication factor
• default.replication.factor = 3
Set minimum ISRs
• min.insync.replicas = 2

22
Warning!
min.insync.replicas is applied at the topic-level.
Must alter the topic configuration manually if created before the
server level change
Must manually alter the topic < 0.9.0 (KAFKA-2114)

23
Replication
Replication = 3
Min ISR = 2
Replica 3: 100
Replica 2: 100
Replica 1: 100
Time →

24
Replication
One replica drops out of ISR, or goes offline
Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101
Time →

25
Replication
2nd Replica fails out, or is out of sync
Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101
102 103 104: Buffers in Producer
Time →
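The effect of min.insync.replicas on acks = all writes can be sketched as a toy model. It assumes a single partition and uses hypothetical names (`PartitionLeaderModel`, `appendWithAcksAll`); in a real cluster the rejected write surfaces to the producer as a NotEnoughReplicas error, and the messages wait in the producer's buffer while it retries.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a partition leader enforcing min.insync.replicas for acks=all writes. */
final class PartitionLeaderModel {
    private final int minInsyncReplicas;
    private int isrSize; // number of in-sync replicas, leader included
    private final List<Integer> log = new ArrayList<>();

    PartitionLeaderModel(int replicationFactor, int minInsyncReplicas) {
        this.isrSize = replicationFactor;
        this.minInsyncReplicas = minInsyncReplicas;
    }

    /** A follower falls out of the ISR or goes offline; the leader itself always remains. */
    void replicaDropsOut() {
        if (isrSize > 1) {
            isrSize--;
        }
    }

    /** Returns true if the write is accepted, false if the ISR is too small to accept it. */
    boolean appendWithAcksAll(int message) {
        if (isrSize < minInsyncReplicas) {
            return false; // models the broker rejecting the write
        }
        log.add(message);
        return true;
    }

    /** Copy of the messages accepted so far. */
    List<Integer> log() {
        return new ArrayList<>(log);
    }
}
```

With replication factor 3 and a minimum of 2, writes survive the first replica dropping out, and are refused after the second one does, so the log never holds data that exists on the leader alone.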

27
Producer Internals
Producer sends batches of messages to a buffer
[Figure: application threads call send(); messages M0–M3 are collected into batches (Batch 1–3) in the buffer; batches are drained and sent; on a response, a failed batch is retried, otherwise the Future is updated and the callback is invoked with metadata or an exception.]

28
Basics
Durability can be configured with the producer configuration request.required.acks (acks in the new producer)
• 0 The message is written to the network (buffer)
• 1 The message is written to the leader
• all The producer gets an ack after all ISRs receive the data; the message is committed
Make sure producer doesn’t just throw messages away!
• For clients < 0.9.0, block.on.buffer.full = true
• From 0.9.0 on, max.block.ms = Long.MAX_VALUE
• Or handle the BufferExhaustedException / TimeoutException yourself
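As a minimal sketch, these durability settings for the new Java producer can be collected in a plain `Properties` object. The helper class `ReliableProducerConfig` and the String serializers are assumptions chosen for illustration; the keys are the new producer's configuration names.

```java
import java.util.Properties;

/** Builds producer settings for the durable case (hypothetical helper, for illustration). */
final class ReliableProducerConfig {
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Assumption: keys and values are plain strings.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Ack only after all in-sync replicas have the message.
        props.put("acks", "all");
        // Block in send() rather than fail when the buffer is full.
        props.put("max.block.ms", String.valueOf(Long.MAX_VALUE));
        return props;
    }
}
```

The result would be passed to the producer's constructor, for example `new KafkaProducer<String, String>(ReliableProducerConfig.build("localhost:9092"))`, where the broker address is a placeholder.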

29
Producer
All calls are non-blocking async
2 Options for checking for failures:
• Immediately block for response: send().get()
• Do follow-up work in Callback, close producer after error threshold
  • Be careful about buffering these failures. Future work? KAFKA-1955
• Don’t forget to close the producer! producer.close() will block until in-flight requests complete
retries (producer config) defaults to 0
In flight requests could lead to message re-ordering (max.in.flight.requests.per.connection)
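The "close producer after error threshold" option can be sketched with a small thread-safe counter that a producer callback would update. The class `SendFailureTracker` is a hypothetical helper, not part of the Kafka client, and it has no Kafka dependency so the counting logic can be read on its own.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Counts failed sends reported by producer callbacks (hypothetical helper). */
final class SendFailureTracker {
    private final int threshold;
    private final AtomicInteger failures = new AtomicInteger();

    SendFailureTracker(int threshold) {
        this.threshold = threshold;
    }

    /** Call from the send callback; exception is null when the send succeeded. */
    void onCompletion(Exception exception) {
        if (exception != null) {
            failures.incrementAndGet();
        }
    }

    /** True once the number of failed sends has reached the threshold. */
    boolean shouldClose() {
        return failures.get() >= threshold;
    }

    int failureCount() {
        return failures.get();
    }
}
```

In a real callback, which receives the record metadata and an exception, the exception would be handed to `onCompletion`, and the application thread would check `shouldClose()` between sends and then call `producer.close()`.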

31
Consumer
Three choices for Consumer API
•Simple Consumer
•High Level Consumer (ZookeeperConsumer)
•New KafkaConsumer

32
New Consumer – auto commit
props.put("enable.auto.commit", "true");
// Commit automatically every 10 seconds
props.put("auto.commit.interval.ms", "10000");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        // What if we crash after 8 seconds?
        processAndUpdateDB(record);
    }
}

33
New Consumer – manual commit
props.put("enable.auto.commit", "false");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        processAndUpdateDB(record);
    }
    // Commit entire batch outside the loop!
    consumer.commitSync();
}

34
Minimize Duplicates for At Least Once Consuming
1. Commit your own offsets - Set enable.auto.commit = false
2. Use Rebalance Listener to limit duplicates
3. Make sure you commit only what you are done processing
4. Note: New consumer is single threaded – one consumer per thread.
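Points 2 and 3 come down to remembering, per partition, exactly how far processing has got, so that this position can be committed when partitions are revoked. Below is a sketch of that bookkeeping with a hypothetical class `ProcessedOffsets`; it uses string keys instead of the client's partition type to stay self-contained.

```java
import java.util.HashMap;
import java.util.Map;

/** Tracks the next offset to commit per partition (hypothetical helper). */
final class ProcessedOffsets {
    // "topic-partition" -> offset of the next record to read
    private final Map<String, Long> nextOffsets = new HashMap<>();

    /** Call only after the record at this offset has been fully processed. */
    void markProcessed(String topic, int partition, long offset) {
        // The committed position is the last processed offset plus one.
        nextOffsets.merge(topic + "-" + partition, offset + 1, Long::max);
    }

    /** Snapshot of the positions that are safe to commit. */
    Map<String, Long> offsetsToCommit() {
        return new HashMap<>(nextOffsets);
    }

    /** Forget tracked positions, for example after partitions are reassigned. */
    void clear() {
        nextOffsets.clear();
    }
}
```

In a rebalance listener, the snapshot would be converted to the client's offset map and committed synchronously in the partitions-revoked callback, so the next owner of a partition resumes after the last processed record instead of reprocessing a whole batch.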

35
Exactly Once Semantics
At most once is easy
At least once is not bad either – commit after 100% sure data is safe
Exactly once is tricky
• Commit data and offsets in one transaction
• Idempotent producer
Kafka Connect – many connectors (especially Confluent’s) are exactly once by using an external database to write events and store offsets in one transaction
Kafka Streams – starting at 0.11.0, has easy-to-configure exactly once (processing.guarantee=exactly_once).
Other stream processing systems – have their own thing.
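"Commit data and offsets in one transaction" can be illustrated with an in-memory stand-in for the external database. This is a sketch under simplifying assumptions: one partition, a hypothetical class `TransactionalSink`, and a synchronized method playing the role of a database transaction.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for a database that stores records and their offset together. */
final class TransactionalSink {
    private final List<String> rows = new ArrayList<>();
    private long lastOffset = -1L; // offset of the last record stored

    /**
     * Stores the value and its offset in one atomic step.
     * Returns false for a replayed record, which is ignored.
     */
    synchronized boolean write(long offset, String value) {
        if (offset <= lastOffset) {
            return false; // already stored: redelivery after a crash or rebalance
        }
        rows.add(value);
        lastOffset = offset;
        return true;
    }

    /** Offset to seek to on restart, read from the same store as the data. */
    synchronized long resumeFrom() {
        return lastOffset + 1;
    }

    synchronized List<String> rows() {
        return new ArrayList<>(rows);
    }
}
```

Because the offset lives next to the data, a consumer that restarts seeks to `resumeFrom()` rather than to the offset committed in Kafka, and any record delivered twice is dropped.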

36
How do we test Kafka?
Replication Tests:
These tests verify that replication provides simple durability guarantees by checking that data acked by brokers is still available for consumption in the face of various failure scenarios
Setup:
• 1 Zookeeper Node
• 3 Kafka Nodes
• 1 Topic with partitions=3, replication-factor=3 and min.insync.replicas=2
Procedure:
• Produce messages in the background
• Consume messages in the background
• Initiate broker failures (shutdown, or bounce repeatedly with kill -15 or kill -9)
• When done driving failures, stop producing and finish consuming
• Validate that every acked message was consumed
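The final validation step amounts to a set difference between what was acked and what was consumed. Here is a minimal sketch, assuming each message carries a unique numeric id; the class `AckedMessageValidator` is a hypothetical name, not part of Kafka's test suite.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Checks that every acked message id was also consumed (hypothetical helper). */
final class AckedMessageValidator {
    /** Returns the ids that were acked but never consumed; empty means no data loss. */
    static Set<Long> missing(Collection<Long> acked, Collection<Long> consumed) {
        Set<Long> missing = new TreeSet<>(acked);
        // Duplicates in consumed are fine: at-least-once delivery allows them.
        missing.removeAll(consumed);
        return missing;
    }
}
```

Messages that were consumed but never acked are not an error here: a send can succeed on the broker and still fail to report back to the producer.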

37
Monitoring for Data Loss
• Monitor for producer errors – watch the retry numbers
• Monitor consumer lag – MaxLag or via offsets
• Standard schema:
  • Each message should contain timestamp and originating service and host
• Each producer can report message counts and offsets to a special topic
• “Monitoring consumer” reports message counts to another special topic
• “Important consumers” also report message counts
• Reconcile the results
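The reconciliation step can be sketched as a comparison of per-source message counts reported by producers and by the monitoring consumer. The class `CountReconciler` and the idea of keying counts by originating service are assumptions for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Compares produced and consumed message counts per source (hypothetical helper). */
final class CountReconciler {
    /**
     * Returns produced minus consumed for every source where the counts differ.
     * A positive value suggests lost messages, a negative one suggests duplicates.
     */
    static Map<String, Long> discrepancies(Map<String, Long> produced, Map<String, Long> consumed) {
        Set<String> sources = new TreeSet<>(produced.keySet());
        sources.addAll(consumed.keySet());
        Map<String, Long> result = new TreeMap<>();
        for (String source : sources) {
            long diff = produced.getOrDefault(source, 0L) - consumed.getOrDefault(source, 0L);
            if (diff != 0) {
                result.put(source, diff);
            }
        }
        return result;
    }
}
```

Counts should be compared over the same time window, using the timestamp carried in each message, since messages still in flight would otherwise show up as false discrepancies.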

38
Be Safe, Not Sorry
acks = all
max.block.ms = Long.MAX_VALUE
retries = MAX_INT
( max.in.flight.requests.per.connection = 1 )
producer.close()
Replication factor >= 3
min.insync.replicas = 2
unclean.leader.election.enable = false
enable.auto.commit = false
Commit after processing
Monitor!
