Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and Scalability

1
Exactly-Once Made Fast
Boyang Chen: Engineer@Conﬂuent
Guozhang Wang: Engineer@Conﬂuent

2
- Recap: exactly-once semantics (EOS) for Kafka
- What cost are we paying for EOS today
- Closing the gaps: usability and scalability
Overview

13
The Kafka Approach for Exactly Once
1) Idempotent writes in order within a single topic partition
2) Transactional writes across multiple output topic partitions
3) Guarantee single writer for any input topic partitions
[KIP-98, KIP-129]

14
The Kafka Approach for Exactly Once
[KIP-98, KIP-129]

17
txn-status: non-exist, partitions: {}

18
txn-status: non-exist, partitions: {}
producer.initTxn();

19
txn-status: empty, partitions: {}
producer.initTxn();
non-exist

20
try {
producer.beginTxn();
records = consumer.poll();
for (Record rec <- records) {
// process ..
}
} catch (KafkaException e) {
}txn-status: empty, partitions: {}
producer.initTxn();

21
try {
// process ..
producer.send(“output”, ..);
}
}txn-status: empty, partitions: {}
producer.initTxn();

22
try {
// process ..
}
}txn-status: on-going, partitions: {output-0}
producer.initTxn();
non-exist {}

23
try {
// process ..
}
producer.initTxn();

24
try {
// process ..
}
producer.initTxn();

25
try {
// process ..
}
producer.initTxn();

26
try {
// process ..
}
producer.sendOffsets(..);
producer.initTxn();

27
try {
// process ..
}
}txn-status: on-going, partitions: {output-0, offset-0}
producer.initTxn();
{output-0}

28
try {
// process ..
}
producer.commitTxn();
producer.initTxn();

29
try {
// process ..
}
}txn-status: prep-commit, partitions: {output-0, offset-0}
producer.initTxn();
on-going

30
try {
// process ..
}
producer.initTxn();

31
try {
// process ..
}
producer.initTxn();

32
try {
// process ..
}
}txn-status: committed, partitions: {output-0, offset-0}
producer.initTxn();
prep-commit

33
try {
// process ..
}
producer.initTxn();

34
try {
// process ..
}
producer.abortTxn();
producer.initTxn();

35
try {
// process ..
}
}txn-status: prep-abort, partitions: {output-0, offset-0}
producer.initTxn();
on-going

36
try {
// process ..
}
producer.initTxn();

37
try {
// process ..
}
producer.initTxn();

38
try {
// process ..
}
}txn-status: aborted, partitions: {output-0, offset-0}
producer.initTxn();
prep-abort

39
The Kafka Approach for Exactly Once:

42
At a given time, an input partition should only be processed by a single client

43
Consumer Group

44
Consumer Group

45
Consumer Group

46
The “Single Writer” Fencing Problem

47
When Taking Over the Partition:
1) The previous txn must have completed commit or abort so there are no
concurrent transactions.
2) Other clients will be fenced write processing results for those input
partitions, a.k.a we have a “single writer”.

48
Transactional ID: deﬁnes single writer scope
1) Conﬁgured by the unique producer `transactional.id` property.
2) Enforced fencing by a monotonic epoch for each id.
3) Producer initialization await pending transaction completion.

49

50

51
Consumer Group
txn.Id = A, epoch = 1
txn.Id = B, epoch = 1

52
Consumer Group

53
Consumer Group

54
Consumer Group
txn.Id = B, initializing...

55
Consumer Group

56
Consumer Group
Num. producer transaction IDs ~= num. input partitions
Producers need to be dynamically created when rebalance

58
Number of Input
Partitions
Growth of Producers
Number of
Applications
5 10 15 20 25 30
600
500
400
300
200
1
00

59
Number of Input
Partitions
At Least Once
Growth of Producers
Number of
Applications
5 10 15 20 25 30
600
500
400
300
200
1
00

60
Number of Input
Partitions
At Least Once
Growth of Producers
Number of
Applications
5 10 15 20 25 30
600
500
400
300
200
1
00
Exactly Once

62
What problems
are KIP-447
solving ?

63
What problems
are KIP-447
solving ?
● Make one producer per process model work

64
What problems
are KIP-447
solving ?
● Make one producer per process model work
● Unblock technical challenges
○ Offset commit fencing
○ Concurrent transaction

65
What problems
are KIP-447
solving ?
● Offset commit fencing
● Concurrent transaction

66
● We are fencing on the transactional producer side,
which assumes a static partition assignment
What problems
are KIP-447
solving ?

67
● Consumer group partition assignments are dynamic
in practice
What problems
are KIP-447
solving ?

74
in practice
● Action: fence zombie producer commit
What problems
are KIP-447
solving ?

75
in practice
○ Different from epoch fencing
○ Utilize consumer group generation ~= epoch
What problems
are KIP-447
solving ?

83
in practice
● Add new APIs
What problems
are KIP-447
solving ?

84
try {
// process ..
}
producer.sendOffsets(offsets);
}txn-status: on-going, partitions: {output, offset}

85
try {
// process ..
}
producer.sendOffsets(offsets,
consumer.groupMetadata());
}
txn-status: on-going, partitions: {output, offset}

86
in practice
● Add new APIs
○ Expose group generation through
consumer#groupMetadata()
○ Commit transaction with consumer metadata through
producer#sendOffsetsToTransaction(offsets,
groupMetadata)
What problems
are KIP-447
solving ?

87
What problems
are KIP-447
solving ?

88
What problems
are KIP-447
solving ?
● Only one open transaction allowed for each input
partition

89
What problems
are KIP-447
solving ?
partition
● Offset commit is the only critical section

90
What problems
are KIP-447
solving ?
partition
○ Observed: Broker uses pending offsets to indicate
other ongoing transaction

91
What problems
are KIP-447
solving ?
partition
○ Observed: consumer always needs to fetch offset after
rebalance

92
What problems
are KIP-447
solving ?
partition
○ Observed: consumer always needs to fetch offset after
rebalance
○ Action: OffsetFetchRequest will back-off until pending
offsets are cleared, either by previous transaction
complete or timeout

100
447 Summary
● Resolve the semantic mismatch between producer
and consumer

101
447 Summary
● Resolve the semantic mismatch between producer
and consumer
● Make the one producer per processing unit possible

102
Number of Input
Partitions
At Least Once
Growth of Producers
Number of
Applications
5 10 15 20 25 30
600
500
400
300
200
1
00
Exactly Once

103
Number of Input
Partitions
At Least Once
Growth of Producers
Number of
Applications
5 10 15 20 25 30
600
500
400
300
200
1
00
Exactly Once
Exactly Once After 447

105
Prove the 447 scalability improvement to
break the limit
- At_least_once
- Exactly_once
- Exactly_once_beta (post KIP-447 EOS)
Scale Testing

107
Scale Testing
● Num.brokers = 3
● Num.input.partitions = 200
● Num.output.partitions = 100
● Test.interval.ms = 4 min
● Num.threads = 3
● Num.records.second = 1000
● Commit.interval.ms = 1 second
● Num.instances = 10, 20, 30...

108
Scale Testing
● Num.brokers = 3
● Num.input.partitions = 200
● Num.output.partitions = 100
● Test.interval.ms = 4 min
● Num.threads = 3
● Num.records.second = 1000
● Commit.interval.ms = 1 second
● Num.instances = 10, 20, 30...

109
Scale Testing
● At_least_once and
exactly_once_beta perform
steadily
● Exactly_once (pre KIP-447)
throughput degrades
signiﬁcantly around 20-25
applications

112
Upgrade Procedure
● Rolling bounce brokers to >= Apache Kafka 2.5

113
Upgrade Procedure
● Upgrade the stream application binary and keep the
PROCESSING_GUARATNEE setting at "exactly_once". Do the ﬁrst rolling
bounce, and make sure the group is stable with every instance on 2.6 binary.

114
Upgrade Procedure
● Upgrade the stream application binary and keep the
PROCESSING_GUARATNEE setting at "exactly_once". Do the ﬁrst rolling
bounce, and make sure the group is stable with every instance on 2.6 binary.
● Upgrade the PROCESSING_GUARANTEE setting to "exactly_once_beta" and do
a second rolling bounce to start using new thread producer for EOS.

116
1. Walkthrough Kafka transaction model

117
2. The usability and scalability issues with “single writer”

118
3. How KIP-447 solves the challenges

119
3. How KIP-447 solves the challenges
4. How to adopt KIP-447 in Kafka Streams

121
- KIP-98 - Exactly Once Delivery and Transactional
Messaging - Apache Kafka
- KIP-447: Producer scalability for exactly once
semantics - Apache Kafka
Resources

Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and Scalability

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and Scalability

Similar to Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and Scalability (20)

More from Guozhang Wang

More from Guozhang Wang (9)

Recently uploaded

Recently uploaded (20)

Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and Scalability