1
Exactly Once Semantics in
Apache Kafka
2
Apache Kafka: A Distributed Streaming Platform
(diagram labels: Consumers, Producers, Connectors, Processing)
3
A Distributed What???
A Streaming Platform is a little like…
• A messaging system
• Except it scales horizontally, stores the streams persistently, and allows continuous stream processing
• Hadoop
• But not batch oriented
4
Logs: A data structure for continuous streams
5
APIs
1. Producer and Consumer: Read and write streams
2. Connect: Managed connectors that connect existing systems
3. Streams: Transformations of streams
6
Producer & Consumer API
7
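A minimal sketch of the producer side, assuming a local broker; the address, topic, and record values are illustrative:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Append one record to the "page-views" topic; the key determines the partition.
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
}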
Consumers Scale With Groups
8
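And a sketch of the consumer side (newer client API; names illustrative): every consumer that subscribes with the same group.id gets its own share of the topic's partitions, so adding consumers scales out the reading.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "page-view-counters"); // consumers sharing this id split the partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("page-views"));
    while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1)))
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), record.value());
    }
}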
Connect API
9
Kafka Connect Does The Hard Parts
1. Scale out
2. Fault Tolerance
3. Central Management
4. Schemas
10
11
12
13
14
Streams API
• Full power of a modern stream processing framework
• Distributed and fault-tolerant
• Natively uses event-time
• Stateful processing: joins, aggregations, etc.
• Integrates tables and streams
• Easy re-processing
• Just a library
15
Wordcount Example
16
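The screenshots show the canonical word-count topology. A sketch of roughly what they contain, written in the current Streams DSL (topic names are illustrative):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> text = builder.stream("text-input");

KTable<String, Long> counts = text
    .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)  // re-key each record by the word itself
    .count();                      // stateful aggregation: the stream becomes a table

counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));
new KafkaStreams(builder.build(), props).start();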
Deploy as you wish
24
Not Limited To Java
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
25
The Semantics of Working With Streams
26
Two Problems
1. Duplicate writes
2. Duplicate processing
27
Problem #1:
Duplicate Writes
28
Duplicate Writes
(slides 29–37: image sequence in which an acknowledgment is lost, the producer retries the send, and the same message ends up in the log twice)
Problem #2:
Duplicate Processing
38
Read from offset=0
39
Process and Update State
40
Commit offset 0 as processed
41
Read from offset=1
42
Process and Update State
43
App crashes!
44
Restore from last committed offset (offset=0), resume processing: offset 1 gets processed twice
45
Choose your undesirable semantics
1. Update state, then save offset => At-Least-Once Delivery
2. Save offset, then update state => At-Most-Once Delivery
46
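A sketch of the two orderings, with stateStore standing in as a hypothetical holder of the application's state:

// 1. Update state, then save offsets: a crash between the two steps
//    means the batch is read and processed again on restart.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records)
    stateStore.update(record);
consumer.commitSync();   // crash before this line => duplicates

// 2. Save offsets, then update state: a crash between the two steps
//    means the batch is never processed.
ConsumerRecords<String, String> records2 = consumer.poll(Duration.ofSeconds(1));
consumer.commitSync();   // crash after this line => records2 may be lost
for (ConsumerRecord<String, String> record : records2)
    stateStore.update(record);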
Two workarounds
1. Make processing idempotent
• Much harder than it sounds in practice
2. Store offsets in the application DB and update transactionally (see the sketch below)
• Not all stores support transactions
• Must handle zombies
47
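A sketch of workaround 2, with db, stateStore, and offsetStore as hypothetical stand-ins for one transactional application store:

// On startup, take the partition and seek to the offset we stored ourselves.
consumer.assign(Collections.singletonList(partition));
consumer.seek(partition, offsetStore.load(partition));

for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
    db.beginTransaction();
    stateStore.update(record);                         // the state change...
    offsetStore.save(partition, record.offset() + 1);  // ...and the next offset to read
    db.commit();                                       // land together, or not at all
}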
Solving These Problems
48
Solving Problem #1:
Avoiding Duplicate Writes with an Idempotent Producer
49
Basic idea
1. Unique ID for each message
2. Server deduplicates
50
Basic idea has problems
1. Random-access database of all message IDs?
2. Message IDs would be bulky
3. Must handle server fail-over
51
Better idea: Do it like TCP
1. Unique producer id for each producer (PID)
2. Each producer assigns a sequential number to each message it sends
3. The unique identifier is the PID + sequence number
4. Sequence number and PID both stored in the log
52
The idempotent producer
(slides 53–60: image sequence in which the broker uses the PID and sequence number to recognize a retried message and discard the duplicate)
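A conceptual sketch of the check the slides walk through; this is not the actual broker code, just the idea:

// The broker remembers, per PID and partition, the last sequence number it appended.
static boolean shouldAppend(long lastSeq, long incomingSeq) {
    if (incomingSeq <= lastSeq)
        return false;               // a retry of a message already in the log: ack, don't append
    if (incomingSeq == lastSeq + 1)
        return true;                // the expected next message: append it
    throw new IllegalStateException("sequence gap: reject the produce request");
}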
Idempotent Producer
• Works transparently – no API changes.
• Fast enough you don’t need to worry about it
• Will be on by default in the future
61
Solving Problem #2:
Avoiding Duplicate Processing with Transactions
62
63
It’s More Complex Than I’ve Let On
• Multiple partitions
• Multiple input streams
• Non-determinism
• Diverse data stores
• Zombies
64
Transactions in Kafka
65
Introducing transactions
producer.initTransactions();          // one-time setup: registers transactional.id, fences old instances
try {
    producer.beginTransaction();
    producer.send(record0);
    producer.send(record1);
    producer.commitTransaction();     // both records become visible atomically
} catch (KafkaException e) {
    producer.abortTransaction();      // neither record is exposed to read_committed consumers
}
66
Introducing ‘transactions’
67
Initializing ‘transactions’
68
Transactional sends – part 1
69
Transactional sends – part 2
70
Commit – phase 1
71
Commit – phase 2
72
Success!
74
Consumer returns only committed messages
75
Transactions => Stream Processing
76
Factor the problem into two parts
1. Transforming input streams to output streams (Streams)
2. Connecting output streams to data systems (Connect)
77
Stream processing with Kafka
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save offsets
78
Stream processing as a sequence of transactions
BEGIN
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save Offsets
COMMIT
79
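A sketch of that loop using the producer API shown earlier. sendOffsetsToTransaction is what moves the offset save into the transaction; transform and the topic name are illustrative, and the groupMetadata() overload is the one in recent clients:

producer.initTransactions();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) continue;
    producer.beginTransaction();
    try {
        for (ConsumerRecord<String, String> record : records)
            producer.send(new ProducerRecord<>("output-topic", transform(record.value())));

        // The input offsets go into the same transaction as the output records.
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, String>> recs = records.records(tp);
            offsets.put(tp, new OffsetAndMetadata(recs.get(recs.size() - 1).offset() + 1));
        }
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();  // output and offsets become visible together
    } catch (KafkaException e) {
        producer.abortTransaction();   // output and offsets roll back together
    }
}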
The Theory
• Two Generals
• Atomic Broadcast
• Consensus
80
In Practice
81
Performance
• Up to +20% producer throughput
• Up to +50% consumer throughput
• Up to -20% disk utilization
• Savings start when you batch
• Details: https://bit.ly/kafka-eos-perf
82
Cool!
But how do I use this?
83
Producer Configs
• enable.idempotence = true
• acks = “all”
• retries > 1 (preferably MAX_INT)
• max.in.flight.requests.per.connection = 1 (required for idempotence)
• transactional.id = “some unique id”
84
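The same settings as properties; the id and address are illustrative:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("retries", Integer.toString(Integer.MAX_VALUE));
props.put("max.in.flight.requests.per.connection", "1");
props.put("transactional.id", "payments-app-1");  // unique per logical producer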
Consumer Configs
• isolation.level:
• “read_committed”, or
• “read_uncommitted”
85
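For a consumer that should see only committed transactional data:

props.put("isolation.level", "read_committed");  // the default is "read_uncommitted"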
Streams Config
• processing.guarantee = “exactly_once”
86
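In code, via the StreamsConfig constant (it resolves to the string "exactly_once"):

props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);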
Confluent
• Founded by the original creators of Apache Kafka
• Headquarters based in Palo Alto, CA
KSQL: Streaming SQL for Apache Kafka
Developer Preview
(https://github.com/confluentinc/ksql)
87
Thank You!


Editor's Notes

  • #11 First example is loading data into Hadoop. Note that Kafka is a big, scale-out system and Hadoop is also a big, scale-out system, so it’s important in this instance that we can scale out Kafka Connect. It’s also important that we capture the metadata or schema information for the records we have in our topics so that we can replicate that into HDFS as well and load data in a structured format like Parquet.
  • #12 Another example is loading data out of a relational database using JDBC. (Note that you can also go the other way, loading data into a DB…that would be a sink, and we have one of those coming soon too.) So here you would think you don’t really need the scale-out capability of Connect because you just have a single centralized relational database. But in reality you face a similar problem: instead of replicating one big data system into another big data system, you likely have something like this...
  • #13 …here you have lots of little relational databases and you need to manage the replication with all of these. This is part of how the Connect API in Kafka makes this really manageable: even if you have hundreds of databases to pull data from, you can manage these dynamically off a small set of Connect workers. You don’t need to set up one process per database.
  • #14 And of course the idea isn’t that you run connectors to one or another system, but rather that you are able to manage lots of these connections to all different kinds of systems. We’ll dive into Kafka Connect in more detail in the third installment of this talk series, which goes far deeper into the practice of building streaming pipelines with Kafka.
  • #61 Stress the application-level resends point. Encourage people to rely on the producer’s retries and not re-send messages from their apps.
  • #72 New concept – control messages. Mention that commit markers are special messages which log the producer ID and the result of the transaction. These messages are not passed on to the application; the client interprets them and acts accordingly.
  • #75 Mention ’read_uncommitted’. Mention that the buffering is broker-side.
  • #77 Transformations may be complex and stateful; connectors are pretty simple and reusable.
  • #78 The solution is to do all of these steps in one transaction.
  • #79 The solution is to do all of these steps in one transaction.
  • #84 max.in.flight.requests.per.connection = 1 is required for idempotence. It will cause a slowdown because you now effectively have a synchronous producer.