Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Were Afraid to Ask (Matthias J. Sax, Confluent) Kafka Summit London 2019

Apache Kafka is a scalable streaming platform with built-in dynamic client scaling. The elastic scale-in/scale-out feature leverages Kafka’s “rebalance protocol”, which was designed in the 0.9 release and has been improved ever since. The original design targets on-prem deployments of stateless clients. However, it does not always align with modern deployment tools like Kubernetes or with stateful stream processing clients like Kafka Streams. Those shortcomings led to two major recent improvement proposals, namely static group membership and incremental rebalancing (which will hopefully be available in version 2.3). This talk provides a deep dive into the details of the rebalance protocol, from its original design in version 0.9 up to the latest improvements and future work. We discuss internal technical details and the pros and cons of the existing approaches, and explain how to configure your client correctly for your use case. Additionally, we discuss configuration tradeoffs for stateless, stateful, on-prem, and containerized deployments.
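
As a rough illustration of the client configuration the talk refers to, the sketch below collects the rebalance-related settings named on the slides (heartbeat.interval.ms, session.timeout.ms, max.poll.interval.ms, group.instance.id) in a plain Python dict and checks the documented relationship between the heartbeat interval and the session timeout. The helper `check_rebalance_config` and all numeric values are assumptions made for this example; they are not part of any Kafka client and not recommendations from the talk.

```python
# Plain-dict sketch of rebalance-related consumer settings.
# All values are illustrative assumptions, not recommendations from the talk.


def check_rebalance_config(config):
    """Return a list of warnings for the rebalance-related settings in `config`.

    Hypothetical helper for illustration; it is not part of any Kafka client.
    """
    warnings = []
    heartbeat = config["heartbeat.interval.ms"]
    session = config["session.timeout.ms"]
    if heartbeat >= session:
        warnings.append(
            "heartbeat.interval.ms must be lower than session.timeout.ms"
        )
    elif heartbeat * 3 > session:
        warnings.append(
            "heartbeat.interval.ms is typically at most 1/3 of session.timeout.ms"
        )
    if not config.get("group.instance.id"):
        warnings.append(
            "no group.instance.id: the member joins dynamically (no static membership)"
        )
    return warnings


# A static member: a longer session timeout can cover a pod restart.
static_member = {
    "group.id": "grp",
    "group.instance.id": "A",
    "heartbeat.interval.ms": 3_000,
    "session.timeout.ms": 30_000,
    "max.poll.interval.ms": 300_000,
}

# A dynamic member with a short session timeout for quick failure detection.
dynamic_member = {
    key: value for key, value in static_member.items() if key != "group.instance.id"
}
dynamic_member["session.timeout.ms"] = 6_000

for name, cfg in (("static", static_member), ("dynamic", dynamic_member)):
    print(name, check_rebalance_config(cfg))
```

The two dicts mirror the tradeoff listed under "Lessons Learned": a short session timeout detects failures quickly, while a static member with a longer timeout can restart without triggering a rebalance.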

  1. Everything you always wanted to know about Kafka’s rebalance protocol but you were afraid to ask (Matthias J. Sax, Software Engineer, @MatthiasJSax)
  2. What is rebalancing about? ● Group membership ● Resource assignment ● Example: KafkaConsumer ○ Consumer group ○ Partition ownership
  3. Design Decisions ● Broker side: membership ○ JoinGroup ○ Heartbeat ○ LeaveGroup ● Client side: assignment ○ SyncGroup ○ Leader
  4. Rebalancing happens if (a) a member joins/leaves the group, or (b) resources need to be reassigned
  5. Let’s Rebalance [diagram: consumers C1, C2, C3 send heartbeats to the GroupCoordinator (broker side), which replies "heartbeat ok"; configs shown: session.timeout.ms, heartbeat.interval.ms]
  6. Let’s Rebalance [diagram: members C1, C2, C3, C4 and the GroupCoordinator (broker side); labels: heartbeat, rebalance, JoinGroup (subscription), synchronization barrier, JoinResponse; timeouts shown: max.poll.interval.ms (consumer), rebalance.timeout.ms (connect)]
  7. Let’s Rebalance [diagram: members C1, C2, C3, C4 and the GroupCoordinator (broker side); labels: rebalance, synchronization barrier, JoinGroup (subscription), JoinResponse, Leader Selection, Group Leader (receives all subscriptions), SyncGroup, SyncResponse (assignment)]
  8. Brokers (one GroupCoordinator per broker in the diagram): maintain/monitor groups; store group metadata (__consumer_offsets). Clients: assign resources. AbstractCoordinator: WorkerCoordinator (Connect API), ConsumerCoordinator (Consumer / Streams API). Assignors: Range, RR, Sticky, custom, StreamsAssignor.
  9. Kafka Connect ● Cluster of workers ● Single group ● Resources: ○ Tasks ○ Configuration
  10. Kafka Streams ● Application instances / threads ● Tasks plus Standbys ○ Stateful ○ Co-partitioning ● Interactive Queries ○ Endpoint metadata
  11. Issues
  12. Unnecessary Rebalance [diagram: three pods, each running the application with group.id=“grp”, holding member.id=1, 2, and 3; the GroupCoordinator (broker side) maps group.id “grp” to the list of member.ids 1,2,3]
  13. Unnecessary Rebalance [diagram: two pods remain, with member.id=2 and member.id=3; the GroupCoordinator (broker side) maps “grp” to member.ids 2,3]
  14. Unnecessary Rebalance [diagram: three pods again, with member.id=4, 2, and 3; the GroupCoordinator (broker side) maps “grp” to member.ids 2,3,4]
  15. Unnecessary Rebalance [diagram: three pods with member.id=4, 2, and 3] Why rebalance if we know that the application is restarted anyway?
  16. Stop-the-World Effect [diagram: three application instances] Why stop processing if partitions are reassigned anyway?
  17. Looking into the Future ● Work in progress ○ Static group membership ○ Incremental rebalancing ● Future work ○ Smooth scale-out for Kafka Streams
  18. Static Group Membership [diagram: three pods with group.id=“grp” and group.instance.id=“A”, “B”, “C”, holding member.id=1, 2, 3; the GroupCoordinator (broker side) keeps the mapping group.instance.id → member.id: A → 1, B → 2, C → 3]
  19. Static Group Membership [diagram: the same three pods; coordinator mapping A → 1, B → 2, C → 3]
  20. Static Group Membership [diagram: only pods “B” (member.id=2) and “C” (member.id=3) are shown; the coordinator mapping still reads A → 1, B → 2, C → 3]
  21. Static Group Membership [diagram: pod “A” is shown again with member.id=1; the coordinator mapping still reads A → 1, B → 2, C → 3]
  22. Incremental Rebalancing [diagram: members C1, C2, C3 and the GroupCoordinator (broker side); labels: heartbeat, rebalance, JoinGroup (subscription), JoinResponse]
  23. Incremental Rebalancing [diagram: C1, C2, C3, GroupCoordinator (broker side), Group Leader (received all subscriptions); labels: JoinResponse, SyncGroup (intended assignment), SyncResponse (enforce revoke)]
  24. Incremental Rebalancing [diagram: C1, C2, C3, GroupCoordinator (broker side), Group Leader (received all subscriptions); labels: JoinResponse, SyncGroup (intended assignment), SyncResponse (enforce revoke), synchronization barrier, JoinGroup, JoinResponse]
  25. Incremental Rebalancing [diagram: C1, C2, C3, GroupCoordinator (broker side), Group Leader (received all subscriptions); labels: SyncGroup (intended assignment), SyncResponse (enforce revoke), synchronization barrier, JoinGroup, JoinResponse, SyncGroup, SyncResponse]
  26. Summary ● Deep dive into rebalance protocol ○ Powerful and flexible ○ Stop-the-world property ● Current Work ○ Static group membership ○ Incremental rebalancing
  27. Lessons Learned ● heartbeat.interval.ms ● session.timeout.ms vs max.poll.interval.ms ● rebalance.timeout.ms ● Static group membership ○ group.instance.id ● Quick vs “considered” rebalance
  28. Resources
      • KIP-345: Introduce static membership protocol to reduce consumer rebalances (accepted) https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances
      • Design doc: Incremental Cooperative Rebalancing https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies
      • KIP-415: Incremental Cooperative Rebalancing in Kafka Connect (accepted) https://cwiki.apache.org/confluence/display/KAFKA/KIP-415%3A+Incremental+Cooperative+Rebalancing+in+Kafka+Connect
      • KIP-429: Kafka Consumer Incremental Rebalance Protocol (under discussion) https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol
      • KIP-441: Smooth Scaling Out for Kafka Streams (under discussion) https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams
      • "The Magical Rebalance Protocol of Apache Kafka" by Gwen Shapira (Strange Loop Talk, Sep 2018) https://www.youtube.com/watch?v=MmLezWRI3Ys&t=8s
      • Thanks to Jason Gustafson and Guozhang Wang
  29. We are hiring! @MatthiasJSax matthias@confluent.io | mjsax@apache.org
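
The difference between the stop-the-world rebalance (slides 6, 7, and 16) and incremental rebalancing (slides 22 to 25) can be sketched as a small simulation. This is a toy model under stated assumptions, not Kafka's implementation: `sticky_assign` is a hypothetical stand-in for the group leader's assignor (it is not Kafka's StickyAssignor), and the two-round structure follows the slide labels "SyncResponse (enforce revoke)" followed by a second JoinGroup. Since KIP-429 was still under discussion at the time of the talk, the final protocol may differ in detail.

```python
def sticky_assign(current, members, partitions):
    """Hypothetical leader-side assignor: keep current owners, then balance.

    Each member keeps at most ceil(partitions / members) of the partitions
    it already owns; leftovers go to the least-loaded member.
    """
    members = sorted(members)
    quota = -(-len(partitions) // len(members))  # ceiling division
    target = {
        m: [p for p in current.get(m, []) if p in partitions][:quota]
        for m in members
    }
    kept = {p for owned in target.values() for p in owned}
    for p in sorted(partitions):
        if p not in kept:
            least_loaded = min(members, key=lambda m: len(target[m]))
            target[least_loaded].append(p)
    return target


def eager_revocations(current):
    """Stop-the-world model: every member gives up everything it owns."""
    return {m: list(owned) for m, owned in current.items()}


def incremental_rounds(current, target):
    """Two-round model of incremental rebalancing.

    Round 1: members keep what stays with them and revoke only partitions
    that move; a moving partition is not handed out yet.
    Round 2: after the revocations, the full target assignment applies.
    """
    owner = {p: m for m, owned in current.items() for p in owned}
    revoked = {
        m: [p for p in owned if p not in target.get(m, [])]
        for m, owned in current.items()
    }
    first_round = {
        m: [p for p in assigned if owner.get(p, m) == m]
        for m, assigned in target.items()
    }
    return revoked, first_round, target


# C1 and C2 own three partitions each; C3 joins the group.
current = {"C1": [0, 1, 2], "C2": [3, 4, 5]}
partitions = [0, 1, 2, 3, 4, 5]
target = sticky_assign(current, ["C1", "C2", "C3"], partitions)

eager = eager_revocations(current)
revoked, first_round, second_round = incremental_rounds(current, target)

print("target:        ", target)
print("eager revokes: ", eager)
print("incr. revokes: ", revoked)
print("after round 1: ", first_round)
print("after round 2: ", second_round)
```

In this example the eager model revokes all six partitions, while the incremental model revokes only the two that actually move to C3, so C1 and C2 can keep processing the partitions they retain.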
