At its core, Kafka organizes data as immutable, append-only logs, and it has relied on an external consensus service (ZooKeeper) to manage the metadata of these logs: topic-level configs, leader replicas and ISR information, received admin requests, and so on. In this talk, I will discuss a recent core initiative that migrates the management of this metadata from the external service into Kafka itself, as its own special log. More specifically, I will cover the following:
1. Why we believe an internal consensus protocol brings Kafka more benefit than an external consensus service.
2. Why we chose to build this internal "metadata log" on the Raft protocol, instead of Kafka's existing leader-follower replication mechanism.
3. The key design decisions we made in its implementation, and how it differs from the standard Raft algorithm (KIP-595).
4. How this Raft-based metadata log is leveraged by the new Quorum Controller (KIP-500).
4-5. The (old) Controller
[Diagram: a single Controller node among the Brokers, with cluster metadata persisted in Zookeeper]
• Single node, elected via ZK
• Maintains cluster metadata, persisted in ZK as the source of truth
  • Broker membership
  • Partition assignment
  • Configs, Auth, ACLs, etc.
• Handles (most) cluster-level admin operations
  • (Mostly) single-threaded
  • Events triggered from ZK watches (sketched below)
• Push-based metadata propagation
• Handling logic often requires further ZK reads/writes
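As a rough illustration of that watch-driven event model, here is a minimal sketch; the znode path and ZooKeeper client calls are real, but the handler logic is hypothetical, not the actual controller code:

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of the old, watch-driven controller loop (illustrative only).
public class ZkControllerSketch implements Watcher {
    private final ZooKeeper zk;

    ZkControllerSketch(ZooKeeper zk) { this.zk = zk; }

    @Override
    public void process(WatchedEvent event) {
        // A ZK watch fires once; the handler must re-read state and re-arm it.
        if ("/brokers/ids".equals(event.getPath())) {
            try {
                // Re-read the live broker set and re-register the watch.
                List<String> liveBrokers = zk.getChildren("/brokers/ids", this);
                // ... on the (mostly) single-threaded event loop: recompute
                // leaders/ISRs, write the results back to ZK, then push
                // LeaderAndIsr / UpdateMetadata RPCs to affected brokers.
            } catch (KeeperException | InterruptedException e) {
                // ... handle session expiry, retry, etc.
            }
        }
    }
}
```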
22-23. Controller Scalability Limitations
• Controller failover and broker start / shutdown events all require RPCs that are O(num.partitions) (see the estimate below)
• Metadata persistence in ZK is synchronous and also O(num.partitions)
• ZK as the source of truth
  • Znode size limit, max num.watchers limit, watcher fire orderings, etc.
  • The controller's metadata view is often out of date, as all brokers (& admin clients) write to ZK
  • Brokers' metadata views can diverge over time due to push-based propagation
How to get to 1K brokers and 1M partitions in a cluster?
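To make the O(num.partitions) cost concrete, here is a back-of-envelope estimate; the per-partition size is an assumption for illustration, not a figure from the talk. If each partition contributes on the order of 100 bytes of leader/ISR/replica state, a controller failover in a 1M-partition cluster must reload and re-push on the order of 100 MB of metadata, and the ZK writes along the way are synchronous. This is why failovers and rolling restarts slow down as partition counts grow.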
25-28. Rethinking the Controller based on the LOG
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller (see the sketch after this list)
  • Multiple operations in flight
  • Async APIs to write to the log
  • Controller is co-located with the metadata log, as an internal Kafka topic
• Metadata propagation not via RPC, but via log replication
  • Brokers locally cache metadata from the log, fixing divergence
  • Isolate the control plane from the data path: separate ports, queues, metrics, etc.
• A quorum of controllers, not a single controller
  • Controller failover to a standby is O(1)
  • KRaft protocol to favor latency over failure tolerance
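A minimal sketch of the single-writer, log-as-source-of-truth idea; the interfaces and record type below are hypothetical stand-ins (in the spirit of the metadata records the KIPs define), not Kafka's actual classes:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical interfaces sketching the design, not Kafka's actual API.
interface MetadataRecord {}

interface MetadataLog {
    // Async append: only the single writer (the active controller) calls this;
    // the future completes once the batch is committed by the quorum.
    CompletableFuture<Long> appendAsync(List<MetadataRecord> records);
}

// Example record type (illustrative).
record BrokerRegistration(int brokerId, String endpoint) implements MetadataRecord {}

class ControllerSketch {
    private final MetadataLog log;

    ControllerSketch(MetadataLog log) { this.log = log; }

    CompletableFuture<Long> registerBroker(int brokerId, String endpoint) {
        // Single writer => all metadata changes are strictly ordered by log
        // offset. Multiple operations can be pipelined; each caller's future
        // completes when its own batch commits.
        return log.appendAsync(List.of(new BrokerRegistration(brokerId, endpoint)));
    }
}

// Brokers replicate the same log and replay committed records into a local
// cache: propagation is log replication, not per-broker RPC, and the cache
// is naturally versioned by the offset it has applied up to.
class BrokerMetadataCache {
    private long appliedOffset = -1L;

    void replay(MetadataRecord rec, long offset) {
        // ... apply the record to the local metadata view ...
        appliedOffset = offset;
    }
}
```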
31. Primary-backup vs. Quorum
• Kafka replication's failure model is f+1: tolerating f failures takes f+1 replicas.
  A quorum's (Paxos, Raft, etc.) failure model is 2f+1: tolerating f failures takes 2f+1 replicas.
• Kafka's replication needs to wait for all followers in the ISR,
  while quorum replication waits only for a majority, and so has better latency.
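As a worked comparison (my numbers, not from the slides): with 3 replicas and min.insync.replicas=1, Kafka's ISR replication can keep committing with only 1 replica alive, tolerating 2 failures, but an acks=all produce must wait for every in-sync follower, so a single slow replica stalls commits until it falls out of the ISR. A 3-node Raft quorum tolerates only 1 failure, yet each commit needs acknowledgements from just 2 of the 3 nodes, so the slowest node is never on the critical path. That is the trade-off the slide summarizes: quorums give up some failure tolerance per replica in exchange for latency.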
33. KRaft: Kafka's Raft Implementation
• Piggy-backs on Kafka's log replication utilities
  • Record schema versioning
  • Batching and compression
  • Log recovery, segmentation, indexing, checksumming
  • NIO network transfer, leader epoch caching, tooling
• Non-ZK-based leader election protocol
  • No "split-brain" with multiple leaders
  • No "grid-locking" with no leader being elected
34-37. KRaft: Kafka's Raft Implementation
• Leader election allows only one leader at each epoch
[Diagram sequence across slides 34-37: Voter-1, Voter-2, and Voter-3 each hold a copy of the log. Voter-1 becomes Candidate-1 and sends "Vote for Me (epoch=3, end=6)"; Voter-2 and Voter-3 reply "Yes!", so Candidate-1 becomes Leader-1 and announces "Begin Epoch (3)". In the contrasting case, a candidate with a stale log sends "Vote for Me (epoch=2, end=3)" and both voters reply "No..".]
38. KRaft: Kafka's Raft Implementation
• Leader election allows only one leader at each epoch (see the vote-granting sketch below)
  • A follower becomes a candidate after a timeout and asks the others for votes
  • Voters give only one vote per epoch: the vote decision is persisted locally
  • A voter only grants its vote if its own log is not "longer" than the candidate's
  • Simple majority wins; candidates back off to avoid gridlocked elections
  • As a result, elected leaders must have all committed records up to their epoch
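A minimal sketch of that vote-granting rule; the class and field names are illustrative, not Kafka's actual raft classes:

```java
// Illustrative voter-side vote-granting logic (names are hypothetical).
class VoterStateSketch {
    private int currentEpoch;           // highest epoch seen so far
    private Integer votedCandidateId;   // vote cast in currentEpoch, persisted locally
    private int lastLogEpoch;           // epoch of this voter's last log record
    private long logEndOffset;          // this voter's log end offset

    synchronized boolean maybeGrantVote(int candidateId, int candidateEpoch,
                                        int candidateLastEpoch, long candidateEndOffset) {
        if (candidateEpoch < currentEpoch) {
            return false;  // stale candidate
        }
        if (candidateEpoch == currentEpoch && votedCandidateId != null
                && votedCandidateId != candidateId) {
            return false;  // only one vote per epoch
        }
        // Grant only if our own log is not "longer" than the candidate's:
        // compare the last record's epoch first, then the end offset.
        boolean candidateLogUpToDate =
                candidateLastEpoch > lastLogEpoch
                        || (candidateLastEpoch == lastLogEpoch
                            && candidateEndOffset >= logEndOffset);
        if (!candidateLogUpToDate) {
            return false;
        }
        currentEpoch = candidateEpoch;
        votedCandidateId = candidateId;
        persistVote();  // must survive restarts, or we might double-vote
        return true;
    }

    private void persistVote() { /* ... write (currentEpoch, votedCandidateId) to disk ... */ }
}
```

Because a majority must grant votes, and each granter's log is no longer than the winner's, the elected leader is guaranteed to hold every committed record up to its epoch.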
39-45. KRaft: Kafka's Raft Implementation
• Pull-based replication (like Kafka) instead of push-based as in the literature
  • Needs a specific API to begin a new epoch (in the literature this rides on PushRecords)
  • Log reconciliation happens at follower fetch, similar to the Raft literature (see the sketch below)
[Diagram sequence across slides 39-44: a follower sends Fetch (epoch=3, end=7) and the leader replies OK (data). A follower whose log has diverged sends Fetch (epoch=2, end=8); the leader rejects it with No.. (epoch=2, end=6), indicating where their logs last agree; the follower truncates, retries with Fetch (epoch=2, end=6), and the leader replies OK (data).]
• Pros: fewer round-trips for log reconciliation, no "disruptive servers"
• Cons: replication commit requires the follower's next fetch
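A minimal sketch of the leader-side reconciliation the diagrams walk through; the types and names here are hypothetical, not Kafka's actual raft module:

```java
// Illustrative leader-side fetch handling (names are hypothetical).
record FetchRequest(int fetchEpoch, long fetchOffset) {}

sealed interface FetchResponse permits Records, Diverging {}
record Records(byte[] data, long fromOffset) implements FetchResponse {}
// The "No.." response: tells the follower the end offset of its fetch epoch
// on the leader, so it can truncate its log there and retry.
record Diverging(int epoch, long endOffset) implements FetchResponse {}

class LeaderSketch {
    private final LogView log;

    LeaderSketch(LogView log) { this.log = log; }

    FetchResponse handleFetch(FetchRequest req) {
        // Validate the follower's position against our epoch lineage
        // (Kafka already maintains such an epoch -> end-offset cache).
        long endOfEpoch = log.endOffsetForEpoch(req.fetchEpoch());
        if (req.fetchOffset() > endOfEpoch) {
            // The follower holds records beyond what that epoch committed
            // (e.g. appended under a failed leader): one round-trip tells it
            // exactly where to truncate, instead of probing backwards.
            return new Diverging(req.fetchEpoch(), endOfEpoch);
        }
        // Valid position: serve records. The fetch offset also acts as the
        // follower's ack for everything before it, which is what advances
        // the commit point -- hence the "commit needs the next fetch" con.
        return new Records(log.read(req.fetchOffset()), req.fetchOffset());
    }
}

interface LogView {
    long endOffsetForEpoch(int epoch);  // epoch -> end offset lookup
    byte[] read(long offset);           // records starting at offset
}
```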
46. KRaft: Kafka's Raft Implementation
• More details
  • Controller module isolation: separate ports and queues, separate node ID space
  • Snapshots: a consistent view of the full state; lets new brokers fetch a snapshot
  • Delta records: think "PartitionChangeRecord" vs. "PartitionRecord"
  • State Machine API: triggered upon commit, used for metadata caching (see the sketch below)
  • And more..
[KIP-500, KIP-595, KIP-630, KIP-640]
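A minimal sketch of how a commit-triggered state machine could feed the broker's metadata cache; the listener shape is loosely modeled on Kafka's RaftClient.Listener but is simplified and illustrative, not the actual interface:

```java
import java.util.List;

// Illustrative state-machine listener (not Kafka's actual interface).
interface CommitListener<T> {
    void handleCommit(long baseOffset, List<T> records);
    void handleSnapshot(long snapshotEndOffset, Iterable<T> snapshotRecords);
}

interface MetadataRecord {}  // stand-in for the versioned metadata record types

class MetadataCacheSketch implements CommitListener<MetadataRecord> {
    private long highestAppliedOffset = -1L;

    @Override
    public void handleCommit(long baseOffset, List<MetadataRecord> records) {
        // Invoked only for *committed* batches, so the cache never applies
        // records that could later be truncated away.
        long offset = baseOffset;
        for (MetadataRecord rec : records) {
            apply(rec);  // deltas like "PartitionChangeRecord" mutate in place
            highestAppliedOffset = offset++;
        }
    }

    @Override
    public void handleSnapshot(long snapshotEndOffset, Iterable<MetadataRecord> snapshotRecords) {
        // A new or lagging broker first loads a consistent full-state snapshot
        // (full records like "PartitionRecord"), then resumes replaying deltas.
        // ... reset local state and apply snapshotRecords ...
        highestAppliedOffset = snapshotEndOffset;
    }

    private void apply(MetadataRecord rec) { /* ... update the cached view ... */ }
}
```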
47. Quorum Controller on top of KRaft Logs
[Diagram: a Quorum of controllers (one Leader, two Voters) replicating the Metadata Log, with the brokers following it as Observers]
• Controllers run in a broker JVM or standalone
  • A single-node Kafka cluster is possible
  • Controller operations can be pipelined
• Brokers cache metadata read from the log
  • Metadata is naturally "versioned" by log offset
  • Divergence / corner cases are resolvable
• Broker liveness is checked via heartbeats (see the sketch below)
  • Controlled shutdown is piggy-backed on the heartbeat request/response
  • Simpler implementation and fewer timeout risks
  • Fence brokers during startup until they are ready
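A minimal sketch of the heartbeat-driven liveness, fencing, and controlled-shutdown flow; the request/response fields and method names are hypothetical, not Kafka's actual schema:

```java
// Illustrative heartbeat handling on the active controller (names hypothetical).
record HeartbeatRequest(int brokerId, boolean caughtUp, boolean wantShutdown) {}
record HeartbeatResponse(boolean fenced, boolean shouldShutdown) {}

class QuorumControllerSketch {
    HeartbeatResponse handleHeartbeat(HeartbeatRequest req, long nowMs) {
        recordLiveness(req.brokerId(), nowMs);  // enough missed heartbeats => fenced

        // Startup fencing: a broker stays fenced (receives no partitions)
        // until it has caught up with the metadata log and reports ready.
        if (!req.caughtUp()) {
            return new HeartbeatResponse(true, false);
        }
        // Controlled shutdown piggy-backs on the heartbeat: the broker asks,
        // the controller moves leadership away via metadata-log appends, and
        // the response tells the broker when it may actually shut down.
        if (req.wantShutdown()) {
            boolean leadershipMoved = moveLeadersOffBroker(req.brokerId());
            return new HeartbeatResponse(false, leadershipMoved);
        }
        return new HeartbeatResponse(false, false);
    }

    private void recordLiveness(int brokerId, long nowMs) { /* ... */ }

    private boolean moveLeadersOffBroker(int brokerId) {
        // ... append partition-change records to the metadata log ...
        return true;
    }
}
```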
48. Quorum Controller: Broker Shutdown
[Diagram: the same quorum (Leader, Voters, Observers) over the Metadata Log; on a broker's shutdown, the controller batch-appends the partition / broker changes to the log]
• Batch append partition / broker changes (a small sketch follows)
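A small, illustrative sketch of that batch append; the record types are hypothetical stand-ins. The point is that all changes from one shutdown land in a single batch, so log readers never observe a half-applied transition:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: building one atomic batch for a broker shutdown.
class ShutdownBatchSketch {
    interface MetadataRecord {}
    record PartitionChange(String topic, int partition, int newLeader) implements MetadataRecord {}
    record BrokerFenced(int brokerId) implements MetadataRecord {}

    List<MetadataRecord> buildShutdownBatch(int brokerId, List<PartitionChange> newLeaders) {
        List<MetadataRecord> batch = new ArrayList<>(newLeaders);  // leadership moves off the broker
        batch.add(new BrokerFenced(brokerId));                     // then the broker itself is fenced
        return batch;  // appended to the metadata log as one batch => atomic commit
    }
}
```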
51. Recap and Roadmap
• KRaft: the source-of-truth metadata log for all Kafka data logs!
  • Co-locates metadata storage with processing
  • Better replication latency with pull-based leader election and log reconciliation
• Quorum Controller: built on top of KRaft
  • Strictly ordered metadata with a single writer
  • Fast controller failover and broker restarts
• Roadmap
  • Early access in the 2.8 release: KRaft replication, state snapshots
  • Ongoing work on missing features: quorum reconfiguration, security, JBOD, EOS, etc.
  • ZK mode will first be deprecated in an upcoming bridge release and then removed in a future release