The document proposes an incremental cooperative rebalancing protocol to address issues with Kafka's current eager rebalancing algorithm. The current algorithm revokes all partitions from consumers and immediately triggers a full reassignment, causing stop-the-world behavior. The new protocol would only reassign partitions that need to change ownership after membership changes, reducing unnecessary rebalances and revocations. It aims to rebalance incrementally and cooperatively rather than eagerly.
The Silver Bullet for Endless Rebalancing
1. A. Sophie Blee-Goldman, Guozhang Wang
Bay Area Kafka Meetup, Dec. 5, 2019
The Silver Bullet for Endless Rebalances
Introduction to the Incremental Cooperative Protocol
2. Outline
• Review of the current eager rebalance algorithm
• Identify the known issues with common scenarios
• A new proposal: incremental cooperative rebalancing
3. A Short History of Consumer Groups
[Diagram: producers write to the partitions of Topic 1 and Topic 2 on the brokers; consumers read from them.]
4. A Short History of Consumer Groups
[Diagram, Kafka 0.8.2 and earlier: consumers fetch from the brokers. The group has to track two things: 1) assignment (who owns what) and 2) offset (consumed up to where).]
5. A Short History of Consumer Groups
[Diagram, Kafka 0.9.0+: consumers fetch from the brokers. A Group Coordinator now handles 1) assignment (who owns what) and 2) offset (consumed up to where).]
7. Consumer Rebalance Protocol
• A rebalance happens when:
  • Membership changes
    • Member crash: failure of a consumer
    • Scaling in: a member leaves the group
    • Scaling out: a new member joins
  • Partition resources change
    • Topics are created or deleted
    • More partitions are added to topics
27. Consumer Rebalance Protocol
• During the rebalance:
  • Existing consumers re-join the group
  • A single member is chosen as group leader
  • The leader determines the partition assignment (user customizable)
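The steps on this slide can be sketched as a small simulation. This is a minimal model, not Kafka's implementation: `assign_round_robin` and `eager_rebalance` are hypothetical names, picking the first member as leader is an arbitrary assumption, and round-robin stands in for whatever assignor the user plugs in.

```python
from typing import Dict, Iterable, List, Tuple


def assign_round_robin(members: List[str], partitions: Iterable[int]) -> Dict[str, List[int]]:
    """Hypothetical leader-side assignor: deal partitions out to members in turn."""
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


def eager_rebalance(members: List[str], partitions: Iterable[int]) -> Tuple[str, Dict[str, List[int]]]:
    """All members re-join, one leader is chosen, and the leader computes the assignment."""
    leader = members[0]  # assumption: the slides do not say how the leader is picked
    return leader, assign_round_robin(members, partitions)


leader, assignment = eager_rebalance(["C1", "C2", "C3"], range(1, 7))
print(leader, assignment)
```

Swapping `assign_round_robin` for another function is the "user customizable" part: only the leader runs it, and every member receives its share of the result.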
55. Known Issue #1: Stop-the-world Rebalance
[Sequence diagram: C3 sends join-group to the Group Coordinator (broker side). C1 and C2 re-join the group, each calling #onPartitionsRevoked(all partitions); the leader runs #assign(…); after sync-group each member calls #onPartitionsAssigned(given partitions). Result: revoked all, re-assigned most.]
Before: C1 owns partitions 1 2 3, C2 owns 4 5 6. After: C1 owns 1 2, C2 owns 4 5, C3 owns 3 6.
Eager rebalance: all partitions are revoked before the rebalance, and most of them are assigned back to their previous owners afterwards.
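To make the cost concrete, the sketch below compares what the eager protocol revokes with what actually changes owner. It reads the slide's numbers as C1 = {1, 2, 3}, C2 = {4, 5, 6} before and C1 = {1, 2}, C2 = {4, 5}, C3 = {3, 6} after; `owner_map` and `eager_cost` are hypothetical helper names.

```python
from typing import Dict, Set, Tuple


def owner_map(assignment: Dict[str, Set[int]]) -> Dict[int, str]:
    """Invert member -> partitions into partition -> member."""
    return {p: member for member, parts in assignment.items() for p in parts}


def eager_cost(before: Dict[str, Set[int]], after: Dict[str, Set[int]]) -> Tuple[Set[int], Set[int]]:
    """Return (partitions revoked by an eager rebalance, partitions that really moved)."""
    old, new = owner_map(before), owner_map(after)
    revoked = set(old)  # eager protocol: every owned partition is given up first
    moved = {p for p in old if new.get(p) != old[p]}
    return revoked, moved


before = {"C1": {1, 2, 3}, "C2": {4, 5, 6}}
after = {"C1": {1, 2}, "C2": {4, 5}, "C3": {3, 6}}
revoked, moved = eager_cost(before, after)
print(f"revoked {len(revoked)} partitions, only {len(moved)} changed owner")
```

Under this reading, six partitions stop being processed so that two of them can move.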
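61. Known Issue #2: Back-and-forth Rebalance
[Sequence diagram: a consumer (C3) is bounced. It sends leave-group to the Group Coordinator (broker side); C1 and C2 re-join, calling #onPartitionsRevoked(all partitions), the leader runs #assign(…), and after sync-group each member calls #onPartitionsAssigned(given partitions). When C3 comes back it sends join-group, and the group goes through #onPartitionsRevoked(all partitions) and #assign(…) a second time.]
Assignment before and after the bounce: C1 owns partitions 1 2, C2 owns 4 5, C3 owns 3 6.
Unnecessary rebalances: the first one moves partitions from C3 to C1/C2, the second one moves them back to C3 from C1/C2.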
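The bounce scenario can be replayed in a few lines. This is an illustrative sketch only: the intermediate assignment (partition 3 going to C1 and 6 going to C2 while C3 is away) is an assumption, and `movements` is a hypothetical helper.

```python
from typing import Dict, List, Set


def movements(before: Dict[str, List[int]], after: Dict[str, List[int]]) -> Set[int]:
    """Partitions whose owner differs between two assignments."""
    old = {p: m for m, parts in before.items() for p in parts}
    new = {p: m for m, parts in after.items() for p in parts}
    return {p for p in old if new.get(p) != old[p]}


start = {"C1": [1, 2], "C2": [4, 5], "C3": [3, 6]}
after_leave = {"C1": [1, 2, 3], "C2": [4, 5, 6]}             # assumed result of rebalance 1
after_rejoin = {"C1": [1, 2], "C2": [4, 5], "C3": [3, 6]}    # rebalance 2 restores the start

first = movements(start, after_leave)
second = movements(after_leave, after_rejoin)
net = movements(start, after_rejoin)
print(first, second, net)
```

Two rebalances each move partitions 3 and 6, yet the net change is empty, which is why the slide calls them unnecessary.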
62. Let’s Revisit:
When to trigger a rebalance,
Who participates in a rebalance,
What to reassign during a rebalance
64. Rebalance Protocols

Protocol                        | When                                          | Who                                     | What
Current Protocol (Eager)        | Immediately                                   | Everyone                                | Everything
Proposed Protocol (Cooperative) | After determining what needs to be reassigned | Only those whose assignment will change | Only those partitions that change ownership
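The cooperative row can be illustrated with a simplified model, assuming a two-round scheme: in the first round each member gives up only the partitions it must hand over and keeps processing the rest; in the second round the freed partitions go to their new owners. `cooperative_rebalance` is a hypothetical name, and the sketch ignores members that leave the group.

```python
from typing import Dict, Set, Tuple

Assignment = Dict[str, Set[int]]


def cooperative_rebalance(owned: Assignment, target: Assignment) -> Tuple[Assignment, Assignment, Assignment]:
    """Return (revoked per member, assignment after round 1, assignment after round 2)."""
    revoked = {m: owned.get(m, set()) - target[m] for m in target}
    # Round 1: keep everything that stays put; nothing new is handed out yet.
    round1 = {m: owned.get(m, set()) & target[m] for m in target}
    # Round 2: the partitions freed in round 1 reach their new owners.
    round2 = {m: set(target[m]) for m in target}
    return revoked, round1, round2


owned = {"C1": {1, 2, 3}, "C2": {4, 5, 6}}
target = {"C1": {1, 2}, "C2": {4, 5}, "C3": {3, 6}}
revoked, round1, round2 = cooperative_rebalance(owned, target)
print(revoked)
```

In this model only partitions 3 and 6 are ever revoked; partitions 1, 2, 4 and 5 stay with their owners throughout.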
65. Rebalance Protocols
• [KIP-415]: incremental rebalance in Connect (2.3+)
• [KIP-345]: static membership in Consumer / Streams (2.3+)
• [KIP-429]: incremental rebalance in Consumer / Streams (2.4+)
99. Augmented Listener Interface
ConsumerRebalanceListener
• #onPartitionsRevoked (will not be triggered if there is nothing to revoke)
• #onPartitionsAssigned (triggered at completion of every rebalance, even if no partitions were newly added)
• #onPartitionsLost (triggered instead of #onPartitionsRevoked when a member falls out of the group)
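The callback rules above can be written down as a small decision function. This is a Python model of the semantics listed on the slide, not the Java `ConsumerRebalanceListener` itself: `callbacks` is a hypothetical helper, and passing only the newly added partitions to `onPartitionsAssigned` is an assumption of the sketch.

```python
from typing import List, Set, Tuple

Event = Tuple[str, List[int]]


def callbacks(owned: Set[int], assigned: Set[int], fell_out_of_group: bool = False) -> List[Event]:
    """Which listener callbacks fire for one member, following the slide's rules."""
    if fell_out_of_group:
        # Lost replaces revoked: ownership can no longer be handed over cleanly.
        return [("onPartitionsLost", sorted(owned))]
    events: List[Event] = []
    revoked = owned - assigned
    if revoked:  # skipped entirely when there is nothing to revoke
        events.append(("onPartitionsRevoked", sorted(revoked)))
    # Always fired at the end of a rebalance, possibly with an empty list.
    events.append(("onPartitionsAssigned", sorted(assigned - owned)))
    return events


print(callbacks({1, 2, 3}, {1, 2}))
print(callbacks({1, 2}, {1, 2}))
print(callbacks({1, 2}, set(), fell_out_of_group=True))
```

A listener written for the eager protocol may assume that `onPartitionsRevoked` always precedes `onPartitionsAssigned`; under these rules that assumption no longer holds.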
100. Switch to Cooperative Rebalancing
In Consumer
• first rolling bounce: add “cooperative-sticky” / “my-cooperative” to [partition.assignment.strategy]
• second rolling bounce: remove the old assignor (e.g., “range”) from the config
In Streams
• first rolling bounce: set [upgrade.from = old version (“2.3”)]
• second rolling bounce: remove the [upgrade.from] config
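Why two rolling bounces rather than one can be seen in a toy model of the Consumer case. It assumes only that a group needs at least one assignor supported by every member at all times; `common_protocols` and `rolling_bounce` are hypothetical helpers, and the strategy names are illustrative strings rather than real config values.

```python
from typing import Dict, List, Set


def common_protocols(group: Dict[str, List[str]]) -> Set[str]:
    """Assignors supported by every member of the group."""
    return set.intersection(*(set(strategies) for strategies in group.values()))


def rolling_bounce(group: Dict[str, List[str]], new_config: List[str]) -> Dict[str, List[str]]:
    """Restart members one at a time, checking that a common assignor always exists."""
    group = dict(group)
    for member in sorted(group):
        group[member] = list(new_config)
        if not common_protocols(group):
            raise RuntimeError(f"no common assignor after bouncing {member}")
    return group


group = {m: ["range"] for m in ("C1", "C2", "C3")}
group = rolling_bounce(group, ["cooperative-sticky", "range"])  # first bounce: add the new assignor
group = rolling_bounce(group, ["cooperative-sticky"])           # second bounce: drop the old one
print(common_protocols(group))
```

Jumping straight from `["range"]` to `["cooperative-sticky"]` fails in this model as soon as the first member restarts, because the bounced member and the remaining members share no assignor.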
102. Take-aways
• We have extended the rebalance protocol to enable smarter assignment (when, who, and what)
• No more stop-the-world rebalances with the incremental cooperative protocol!
103. THANKS!
Guozhang Wang | guozhang@confluent.io | @guozhangwang
A. Sophie Blee-Goldman | sophie@confluent.io | @ableegoldman