1
Exactly Once Semantics in
Apache Kafka
2
Apache Kafka: A Distributed Streaming Platform
(diagram labels: Consumers, Producers, Connectors, Processing)
3
A Distributed What???
A Streaming Platform is a little like…
• A messaging system
• Except it scales horizontally, stores the streams persistently, and allows continuous stream processing
• Hadoop
• But not batch oriented
4
Logs: A data structure for continuous streams
5
APIs
1. Producer and Consumer: Read and write streams
2. Connect: Managed connectors that connect existing systems
3. Streams: Transformations of streams
6
Producer & Consumer API
7
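A minimal sketch of the producer side, assuming a local broker; the address, topic, and record values are illustrative:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Append one record to the "page-views" topic; the key determines the partition.
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
}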
Consumers Scale With Groups
8
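And a sketch of the consumer side (newer client API; names illustrative): every consumer that subscribes with the same group.id gets its own share of the topic's partitions, so adding consumers scales out the reading.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "page-view-counters"); // consumers sharing this id split the partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("page-views"));
    while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1)))
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), record.value());
    }
}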
Connect API
9
Kafka Connect Does The Hard Parts
1. Scale out
2. Fault Tolerance
3. Central Management
4. Schemas
10
11
12
13
14
Streams API
• Full power of a modern stream processing framework
• Distributed and fault-tolerant
• Natively uses event-time
• Stateful processing: joins, aggregations, etc.
• Integrates tables and streams
• Easy re-processing
• Just a library
15
Wordcount Example
16
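The screenshots show the canonical word-count topology. A sketch of roughly what they contain, written in the current Streams DSL (topic names are illustrative):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> text = builder.stream("text-input");

KTable<String, Long> counts = text
    .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)  // re-key each record by the word itself
    .count();                      // stateful aggregation: the stream becomes a table

counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));
new KafkaStreams(builder.build(), props).start();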
Deploy as you wish
24
Not Limited To Java
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
25
The Semantics of Working With Streams
26
Two Problems
1. Duplicate writes
2. Duplicate processing
27
Problem #1:
Duplicate Writes
28
Duplicate Writes
(slides 29–37: image sequence in which an acknowledgment is lost, the producer retries the send, and the same message ends up in the log twice)
Problem #2:
Duplicate Processing
38
Read from offset=0
39
Process and Update State
40
Commit offset 0 as processed
41
Read from offset=1
42
Process and Update State
43
App crashes!
44
Restore from last committed offset (offset=0), resume processing: offset 1 gets processed twice
45
Choose your undesirable semantics
1. Update state, then save offset => At-Least-Once Delivery
2. Save offset, then update state => At-Most-Once Delivery
46
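A sketch of the two orderings, with stateStore standing in as a hypothetical holder of the application's state:

// 1. Update state, then save offsets: a crash between the two steps
//    means the batch is read and processed again on restart.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records)
    stateStore.update(record);
consumer.commitSync();   // crash before this line => duplicates

// 2. Save offsets, then update state: a crash between the two steps
//    means the batch is never processed.
ConsumerRecords<String, String> records2 = consumer.poll(Duration.ofSeconds(1));
consumer.commitSync();   // crash after this line => records2 may be lost
for (ConsumerRecord<String, String> record : records2)
    stateStore.update(record);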
Two workarounds
1. Make processing idempotent
• Much harder than it sounds in practice
2. Store offsets in the application DB and update transactionally (see the sketch below)
• Not all stores support transactions
• Must handle zombies
47
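A sketch of workaround 2, with db, stateStore, and offsetStore as hypothetical stand-ins for one transactional application store:

// On startup, take the partition and seek to the offset we stored ourselves.
consumer.assign(Collections.singletonList(partition));
consumer.seek(partition, offsetStore.load(partition));

for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
    db.beginTransaction();
    stateStore.update(record);                         // the state change...
    offsetStore.save(partition, record.offset() + 1);  // ...and the next offset to read
    db.commit();                                       // land together, or not at all
}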
Solving These Problems
48
Solving Problem #1:
Avoiding Duplicate Writes with an Idempotent Producer
49
Basic idea
1. Unique ID for each message
2. Server deduplicates
50
Basic idea has problems
1. Random-access database of all message IDs?
2. Message IDs would be bulky
3. Must handle server fail-over
51
Better idea: Do it like TCP
1. Unique producer id for each producer (PID)
2. Each producer assigns a sequential number to each message it sends
3. The unique identifier is the PID + sequence number
4. Sequence number and PID both stored in the log
52
The idempotent producer
(slides 53–60: image sequence in which the broker uses the PID and sequence number to recognize a retried message and discard the duplicate)
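A conceptual sketch of the check the slides walk through; this is not the actual broker code, just the idea:

// The broker remembers, per PID and partition, the last sequence number it appended.
static boolean shouldAppend(long lastSeq, long incomingSeq) {
    if (incomingSeq <= lastSeq)
        return false;               // a retry of a message already in the log: ack, don't append
    if (incomingSeq == lastSeq + 1)
        return true;                // the expected next message: append it
    throw new IllegalStateException("sequence gap: reject the produce request");
}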
Idempotent Producer
• Works transparently – no API changes.
• Fast enough you don’t need to worry about it
• Will be on by default in the future
61
Solving Problem #2:
Avoiding Duplicate Processing with Transactions
62
63
It’s More Complex Than I’ve Let On
• Multiple partitions
• Multiple input streams
• Non-determinism
• Diverse data stores
• Zombies
64
Transactions in Kafka
65
Introducing transactions
producer.initTransactions();          // one-time setup: registers transactional.id, fences old instances
try {
    producer.beginTransaction();
    producer.send(record0);
    producer.send(record1);
    producer.commitTransaction();     // both records become visible atomically
} catch (KafkaException e) {
    producer.abortTransaction();      // neither record is exposed to read_committed consumers
}
66
Introducing ‘transactions’
67
Initializing ‘transactions’
68
Transactional sends – part 1
69
Transactional sends – part 2
70
Commit – phase 1
71
Commit – phase 2
72
Success!
74
Consumer returns only committed messages
75
Transactions => Stream Processing
76
Factor the problem into two parts
1. Transforming input streams to output streams (Streams)
2. Connecting output streams to data systems (Connect)
77
Stream processing with Kafka
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save offsets
78
Stream processing as a sequence of transactions
BEGIN
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save Offsets
COMMIT
79
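A sketch of that loop using the producer API shown earlier. sendOffsetsToTransaction is what moves the offset save into the transaction; transform and the topic name are illustrative, and the groupMetadata() overload is the one in recent clients:

producer.initTransactions();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) continue;
    producer.beginTransaction();
    try {
        for (ConsumerRecord<String, String> record : records)
            producer.send(new ProducerRecord<>("output-topic", transform(record.value())));

        // The input offsets go into the same transaction as the output records.
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, String>> recs = records.records(tp);
            offsets.put(tp, new OffsetAndMetadata(recs.get(recs.size() - 1).offset() + 1));
        }
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();  // output and offsets become visible together
    } catch (KafkaException e) {
        producer.abortTransaction();   // output and offsets roll back together
    }
}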
The Theory
• Two Generals
• Atomic Broadcast
• Consensus
80
In Practice
81
Performance
• Up to +20% producer throughput
• Up to +50% consumer throughput
• Up to -20% disk utilization
• Savings start when you batch
• Details: https://bit.ly/kafka-eos-perf
82
Cool!
But how do I use this?
83
Producer Configs
• enable.idempotence = true
• acks = “all”
• retries > 1 (preferably MAX_INT)
• max.in.flight.requests.per.connection = 1 (required for idempotence)
• transactional.id = “some unique id”
84
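The same settings as properties; the id and address are illustrative:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("retries", Integer.toString(Integer.MAX_VALUE));
props.put("max.in.flight.requests.per.connection", "1");
props.put("transactional.id", "payments-app-1");  // unique per logical producer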
Consumer Configs
• isolation.level:
• “read_committed”, or
• “read_uncommitted”
85
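For a consumer that should see only committed transactional data:

props.put("isolation.level", "read_committed");  // the default is "read_uncommitted"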
Streams Config
• processing.guarantee = “exactly_once”
86
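In code, via the StreamsConfig constant (it resolves to the string "exactly_once"):

props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);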
Confluent
• Founded by the original creators of Apache Kafka
• Headquarters based in Palo Alto, CA
KSQL: Streaming SQL for Apache Kafka
Developer Preview
(https://github.com/confluentinc/ksql)
87
Thank You!


Editor's Notes

  • #11 First example is loading data into Hadoop. Note that Kafka is a big, scale-out system and Hadoop is also a big, scale-out system, so it’s important in this instance that we can scale out Kafka Connect. It’s also important that we capture the metadata or schema information for the records we have in our topics so that we can replicate that into HDFS as well and load data in a structured format like Parquet.
  • #12 Another example is loading data out of a relational database using JDBC. (Note that you can also go the other way, loading data into a DB…that would be a sink, and we have one of those coming soon too.) So here you would think you don’t really need the scale-out capability of Connect because you just have a single centralized relational database. But in reality you face a similar problem: instead of replicating one big data system into another big data system, you likely have something like this...
  • #13 …here you have lots of little relational databases and you need to manage the replication with all of these. This is part of how the Connect API in Kafka makes this really manageable: even if you have hundreds of databases to pull data from, you can manage these dynamically off a small set of Connect workers. You don’t need to set up one process per database.
  • #14 And of course the idea isn’t that you run connectors to one or another system, but rather that you are able to manage lots of these connections to all different kinds of systems. We’ll dive into Kafka Connect in more detail in the third installment of this talk series, which goes far deeper into the practice of building streaming pipelines with Kafka.
  • #61 Stress the application-level resends point. Encourage people to rely on the producer’s retries and not re-send messages from their apps.
  • #72 New concept – control messages. Mention that commit markers are special messages which log the producer ID and the result of the transaction. These messages are not passed on to the application; the client interprets them and acts accordingly.
  • #75 Mention ’read_uncommitted’. Mention that the buffering is broker-side.
  • #77 Transformations may be complex and stateful; connectors are pretty simple and reusable.
  • #78 The solution is to do all of these steps in one transaction.
  • #79 The solution is to do all of these steps in one transaction.
  • #84 max.in.flight.requests.per.connection = 1 is required for idempotence. It will cause a slowdown because you now effectively have a synchronous producer.