At its core, Kafka organizes data as immutable, append-only logs, and it has relied on an external consensus service (ZooKeeper) to manage the metadata of these logs: topic-level configs, leader replicas and ISR information, received admin requests, and so on. In this talk, I will discuss a recent core initiative that migrates the management of this metadata from the external service into Kafka itself, as its own special log. More specifically, I will cover the following:
1. Why we believe an internal consensus protocol brings Kafka more benefit than an external consensus service.
2. Why we chose to build this internal "metadata log" on the Raft protocol, instead of Kafka's existing leader-follower replication mechanism.
3. The key design decisions we made in its implementation, and how it differs from the standard Raft algorithm (KIP-595).
4. How this Raft-based metadata log is leveraged by the new Quorum Controller (KIP-500).
4-5. The (old) Controller
[Diagram: a single Controller node among the Brokers, with cluster metadata persisted in Zookeeper]
• Single node, elected via ZK
• Maintains cluster metadata, persisted in ZK as the source of truth
  • Broker membership
  • Partition assignment
  • Configs, Auth, ACLs, etc.
• Handles (most) cluster-level admin operations
  • (Mostly) single-threaded
  • Events triggered from ZK watches (sketched below)
• Push-based metadata propagation
• Handling logic often requires further ZK reads/writes
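As a rough illustration of that watch-driven event model, here is a minimal sketch; the znode path and ZooKeeper client calls are real, but the handler logic is hypothetical, not the actual controller code:

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of the old, watch-driven controller loop (illustrative only).
public class ZkControllerSketch implements Watcher {
    private final ZooKeeper zk;

    ZkControllerSketch(ZooKeeper zk) { this.zk = zk; }

    @Override
    public void process(WatchedEvent event) {
        // A ZK watch fires once; the handler must re-read state and re-arm it.
        if ("/brokers/ids".equals(event.getPath())) {
            try {
                // Re-read the live broker set and re-register the watch.
                List<String> liveBrokers = zk.getChildren("/brokers/ids", this);
                // ... on the (mostly) single-threaded event loop: recompute
                // leaders/ISRs, write the results back to ZK, then push
                // LeaderAndIsr / UpdateMetadata RPCs to affected brokers.
            } catch (KeeperException | InterruptedException e) {
                // ... handle session expiry, retry, etc.
            }
        }
    }
}
```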
22-23. Controller Scalability Limitations
• Controller failover and broker start / shutdown events all require RPCs that are O(num.partitions) (see the estimate below)
• Metadata persistence in ZK is synchronous and also O(num.partitions)
• ZK as the source of truth
  • Znode size limit, max num.watchers limit, watcher fire orderings, etc.
  • The controller's metadata view is often out of date, as all brokers (& admin clients) write to ZK
  • Brokers' metadata views can diverge over time due to push-based propagation
How to get to 1K brokers and 1M partitions in a cluster?
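To make the O(num.partitions) cost concrete, here is a back-of-envelope estimate; the per-partition size is an assumption for illustration, not a figure from the talk. If each partition contributes on the order of 100 bytes of leader/ISR/replica state, a controller failover in a 1M-partition cluster must reload and re-push on the order of 100 MB of metadata, and the ZK writes along the way are synchronous. This is why failovers and rolling restarts slow down as partition counts grow.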
25-28. Rethinking the Controller based on the LOG
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller (see the sketch after this list)
  • Multiple operations in flight
  • Async APIs to write to the log
  • Controller is co-located with the metadata log, as an internal Kafka topic
• Metadata propagation not via RPC, but via log replication
  • Brokers locally cache metadata from the log, fixing divergence
  • Isolate the control plane from the data path: separate ports, queues, metrics, etc.
• A quorum of controllers, not a single controller
  • Controller failover to a standby is O(1)
  • KRaft protocol to favor latency over failure tolerance
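A minimal sketch of the single-writer, log-as-source-of-truth idea; the interfaces and record type below are hypothetical stand-ins (in the spirit of the metadata records the KIPs define), not Kafka's actual classes:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical interfaces sketching the design, not Kafka's actual API.
interface MetadataRecord {}

interface MetadataLog {
    // Async append: only the single writer (the active controller) calls this;
    // the future completes once the batch is committed by the quorum.
    CompletableFuture<Long> appendAsync(List<MetadataRecord> records);
}

// Example record type (illustrative).
record BrokerRegistration(int brokerId, String endpoint) implements MetadataRecord {}

class ControllerSketch {
    private final MetadataLog log;

    ControllerSketch(MetadataLog log) { this.log = log; }

    CompletableFuture<Long> registerBroker(int brokerId, String endpoint) {
        // Single writer => all metadata changes are strictly ordered by log
        // offset. Multiple operations can be pipelined; each caller's future
        // completes when its own batch commits.
        return log.appendAsync(List.of(new BrokerRegistration(brokerId, endpoint)));
    }
}

// Brokers replicate the same log and replay committed records into a local
// cache: propagation is log replication, not per-broker RPC, and the cache
// is naturally versioned by the offset it has applied up to.
class BrokerMetadataCache {
    private long appliedOffset = -1L;

    void replay(MetadataRecord rec, long offset) {
        // ... apply the record to the local metadata view ...
        appliedOffset = offset;
    }
}
```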
31. Primary-backup vs. Quorum
• Kafka replication's failure model is f+1: tolerating f failures takes f+1 replicas.
  A quorum's (Paxos, Raft, etc.) failure model is 2f+1: tolerating f failures takes 2f+1 replicas.
• Kafka's replication needs to wait for all followers in the ISR,
  while quorum replication waits only for a majority, and so has better latency.
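As a worked comparison (my numbers, not from the slides): with 3 replicas and min.insync.replicas=1, Kafka's ISR replication can keep committing with only 1 replica alive, tolerating 2 failures, but an acks=all produce must wait for every in-sync follower, so a single slow replica stalls commits until it falls out of the ISR. A 3-node Raft quorum tolerates only 1 failure, yet each commit needs acknowledgements from just 2 of the 3 nodes, so the slowest node is never on the critical path. That is the trade-off the slide summarizes: quorums give up some failure tolerance per replica in exchange for latency.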
33. KRaft: Kafka's Raft Implementation
• Piggy-backs on Kafka's log replication utilities
  • Record schema versioning
  • Batching and compression
  • Log recovery, segmentation, indexing, checksumming
  • NIO network transfer, leader epoch caching, tooling
• Non-ZK-based leader election protocol
  • No "split-brain" with multiple leaders
  • No "grid-locking" with no leader being elected
34-37. KRaft: Kafka's Raft Implementation
• Leader election allows only one leader at each epoch
[Diagram sequence across slides 34-37: Voter-1, Voter-2, and Voter-3 each hold a copy of the log. Voter-1 becomes Candidate-1 and sends "Vote for Me (epoch=3, end=6)"; Voter-2 and Voter-3 reply "Yes!", so Candidate-1 becomes Leader-1 and announces "Begin Epoch (3)". In the contrasting case, a candidate with a stale log sends "Vote for Me (epoch=2, end=3)" and both voters reply "No..".]
38. KRaft: Kafka's Raft Implementation
• Leader election allows only one leader at each epoch (see the vote-granting sketch below)
  • A follower becomes a candidate after a timeout and asks the others for votes
  • Voters give only one vote per epoch: the vote decision is persisted locally
  • A voter only grants its vote if its own log is not "longer" than the candidate's
  • Simple majority wins; candidates back off to avoid gridlocked elections
  • As a result, elected leaders must have all committed records up to their epoch
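A minimal sketch of that vote-granting rule; the class and field names are illustrative, not Kafka's actual raft classes:

```java
// Illustrative voter-side vote-granting logic (names are hypothetical).
class VoterStateSketch {
    private int currentEpoch;           // highest epoch seen so far
    private Integer votedCandidateId;   // vote cast in currentEpoch, persisted locally
    private int lastLogEpoch;           // epoch of this voter's last log record
    private long logEndOffset;          // this voter's log end offset

    synchronized boolean maybeGrantVote(int candidateId, int candidateEpoch,
                                        int candidateLastEpoch, long candidateEndOffset) {
        if (candidateEpoch < currentEpoch) {
            return false;  // stale candidate
        }
        if (candidateEpoch == currentEpoch && votedCandidateId != null
                && votedCandidateId != candidateId) {
            return false;  // only one vote per epoch
        }
        // Grant only if our own log is not "longer" than the candidate's:
        // compare the last record's epoch first, then the end offset.
        boolean candidateLogUpToDate =
                candidateLastEpoch > lastLogEpoch
                        || (candidateLastEpoch == lastLogEpoch
                            && candidateEndOffset >= logEndOffset);
        if (!candidateLogUpToDate) {
            return false;
        }
        currentEpoch = candidateEpoch;
        votedCandidateId = candidateId;
        persistVote();  // must survive restarts, or we might double-vote
        return true;
    }

    private void persistVote() { /* ... write (currentEpoch, votedCandidateId) to disk ... */ }
}
```

Because a majority must grant votes, and each granter's log is no longer than the winner's, the elected leader is guaranteed to hold every committed record up to its epoch.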
39-45. KRaft: Kafka's Raft Implementation
• Pull-based replication (like Kafka) instead of push-based as in the literature
  • Needs a specific API to begin a new epoch (in the literature this rides on PushRecords)
  • Log reconciliation happens at follower fetch, similar to the Raft literature (see the sketch below)
[Diagram sequence across slides 39-44: a follower sends Fetch (epoch=3, end=7) and the leader replies OK (data). A follower whose log has diverged sends Fetch (epoch=2, end=8); the leader rejects it with No.. (epoch=2, end=6), indicating where their logs last agree; the follower truncates, retries with Fetch (epoch=2, end=6), and the leader replies OK (data).]
• Pros: fewer round-trips for log reconciliation, no "disruptive servers"
• Cons: replication commit requires the follower's next fetch
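A minimal sketch of the leader-side reconciliation the diagrams walk through; the types and names here are hypothetical, not Kafka's actual raft module:

```java
// Illustrative leader-side fetch handling (names are hypothetical).
record FetchRequest(int fetchEpoch, long fetchOffset) {}

sealed interface FetchResponse permits Records, Diverging {}
record Records(byte[] data, long fromOffset) implements FetchResponse {}
// The "No.." response: tells the follower the end offset of its fetch epoch
// on the leader, so it can truncate its log there and retry.
record Diverging(int epoch, long endOffset) implements FetchResponse {}

class LeaderSketch {
    private final LogView log;

    LeaderSketch(LogView log) { this.log = log; }

    FetchResponse handleFetch(FetchRequest req) {
        // Validate the follower's position against our epoch lineage
        // (Kafka already maintains such an epoch -> end-offset cache).
        long endOfEpoch = log.endOffsetForEpoch(req.fetchEpoch());
        if (req.fetchOffset() > endOfEpoch) {
            // The follower holds records beyond what that epoch committed
            // (e.g. appended under a failed leader): one round-trip tells it
            // exactly where to truncate, instead of probing backwards.
            return new Diverging(req.fetchEpoch(), endOfEpoch);
        }
        // Valid position: serve records. The fetch offset also acts as the
        // follower's ack for everything before it, which is what advances
        // the commit point -- hence the "commit needs the next fetch" con.
        return new Records(log.read(req.fetchOffset()), req.fetchOffset());
    }
}

interface LogView {
    long endOffsetForEpoch(int epoch);  // epoch -> end offset lookup
    byte[] read(long offset);           // records starting at offset
}
```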
46. KRaft: Kafka's Raft Implementation
• More details
  • Controller module isolation: separate ports and queues, separate node ID space
  • Snapshots: a consistent view of the full state; lets new brokers fetch a snapshot
  • Delta records: think "PartitionChangeRecord" vs. "PartitionRecord"
  • State Machine API: triggered upon commit, used for metadata caching (see the sketch below)
  • And more..
[KIP-500, KIP-595, KIP-630, KIP-640]
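A minimal sketch of how a commit-triggered state machine could feed the broker's metadata cache; the listener shape is loosely modeled on Kafka's RaftClient.Listener but is simplified and illustrative, not the actual interface:

```java
import java.util.List;

// Illustrative state-machine listener (not Kafka's actual interface).
interface CommitListener<T> {
    void handleCommit(long baseOffset, List<T> records);
    void handleSnapshot(long snapshotEndOffset, Iterable<T> snapshotRecords);
}

interface MetadataRecord {}  // stand-in for the versioned metadata record types

class MetadataCacheSketch implements CommitListener<MetadataRecord> {
    private long highestAppliedOffset = -1L;

    @Override
    public void handleCommit(long baseOffset, List<MetadataRecord> records) {
        // Invoked only for *committed* batches, so the cache never applies
        // records that could later be truncated away.
        long offset = baseOffset;
        for (MetadataRecord rec : records) {
            apply(rec);  // deltas like "PartitionChangeRecord" mutate in place
            highestAppliedOffset = offset++;
        }
    }

    @Override
    public void handleSnapshot(long snapshotEndOffset, Iterable<MetadataRecord> snapshotRecords) {
        // A new or lagging broker first loads a consistent full-state snapshot
        // (full records like "PartitionRecord"), then resumes replaying deltas.
        // ... reset local state and apply snapshotRecords ...
        highestAppliedOffset = snapshotEndOffset;
    }

    private void apply(MetadataRecord rec) { /* ... update the cached view ... */ }
}
```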
47. Quorum Controller on top of KRaft Logs
[Diagram: a Quorum of controllers (one Leader, two Voters) replicating the Metadata Log, with the brokers following it as Observers]
• Controllers run in a broker JVM or standalone
  • A single-node Kafka cluster is possible
  • Controller operations can be pipelined
• Brokers cache metadata read from the log
  • Metadata is naturally "versioned" by log offset
  • Divergence / corner cases are resolvable
• Broker liveness is checked via heartbeats (see the sketch below)
  • Controlled shutdown is piggy-backed on the heartbeat request/response
  • Simpler implementation and fewer timeout risks
  • Fence brokers during startup until they are ready
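A minimal sketch of the heartbeat-driven liveness, fencing, and controlled-shutdown flow; the request/response fields and method names are hypothetical, not Kafka's actual schema:

```java
// Illustrative heartbeat handling on the active controller (names hypothetical).
record HeartbeatRequest(int brokerId, boolean caughtUp, boolean wantShutdown) {}
record HeartbeatResponse(boolean fenced, boolean shouldShutdown) {}

class QuorumControllerSketch {
    HeartbeatResponse handleHeartbeat(HeartbeatRequest req, long nowMs) {
        recordLiveness(req.brokerId(), nowMs);  // enough missed heartbeats => fenced

        // Startup fencing: a broker stays fenced (receives no partitions)
        // until it has caught up with the metadata log and reports ready.
        if (!req.caughtUp()) {
            return new HeartbeatResponse(true, false);
        }
        // Controlled shutdown piggy-backs on the heartbeat: the broker asks,
        // the controller moves leadership away via metadata-log appends, and
        // the response tells the broker when it may actually shut down.
        if (req.wantShutdown()) {
            boolean leadershipMoved = moveLeadersOffBroker(req.brokerId());
            return new HeartbeatResponse(false, leadershipMoved);
        }
        return new HeartbeatResponse(false, false);
    }

    private void recordLiveness(int brokerId, long nowMs) { /* ... */ }

    private boolean moveLeadersOffBroker(int brokerId) {
        // ... append partition-change records to the metadata log ...
        return true;
    }
}
```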
48. Quorum Controller: Broker Shutdown
[Diagram: the same quorum (Leader, Voters, Observers) over the Metadata Log; on a broker's shutdown, the controller batch-appends the partition / broker changes to the log]
• Batch append partition / broker changes (a small sketch follows)
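A small, illustrative sketch of that batch append; the record types are hypothetical stand-ins. The point is that all changes from one shutdown land in a single batch, so log readers never observe a half-applied transition:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: building one atomic batch for a broker shutdown.
class ShutdownBatchSketch {
    interface MetadataRecord {}
    record PartitionChange(String topic, int partition, int newLeader) implements MetadataRecord {}
    record BrokerFenced(int brokerId) implements MetadataRecord {}

    List<MetadataRecord> buildShutdownBatch(int brokerId, List<PartitionChange> newLeaders) {
        List<MetadataRecord> batch = new ArrayList<>(newLeaders);  // leadership moves off the broker
        batch.add(new BrokerFenced(brokerId));                     // then the broker itself is fenced
        return batch;  // appended to the metadata log as one batch => atomic commit
    }
}
```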
51. Recap and Roadmap
• KRaft: the source-of-truth metadata log for all Kafka data logs!
  • Co-locates metadata storage with processing
  • Better replication latency with pull-based leader election and log reconciliation
• Quorum Controller: built on top of KRaft
  • Strictly ordered metadata with a single writer
  • Fast controller failover and broker restarts
• Roadmap
  • Early access in the 2.8 release: KRaft replication, state snapshots
  • Ongoing work on missing features: quorum reconfiguration, security, JBOD, EOS, etc.
  • ZK mode will first be deprecated in an upcoming bridge release and then removed in a future release