In this talk I'd like to cover an everlasting story in distributed systems: consensus. More specifically, the consensus challenges in Apache Kafka, and how we addressed them, from theory in papers to production in the cloud.
4. Apache Kafka: Streaming Platform
• Source-of-truth stream data storage
• De-facto programming paradigm for real-time events
• Kafka’s architecture:
• Data organized as partitioned topics
• Partitions are replicated & log-structured
• Clients produce to / consume from topics via sequential log IOs
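To make the client model concrete, here is a minimal sketch using the standard Kafka Java clients; the topic name, broker address, and consumer group are illustrative assumptions, not from the talk:

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class SequentialLogIoSketch {
    public static void main(String[] args) {
      Properties prod = new Properties();
      prod.put("bootstrap.servers", "localhost:9092");
      prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      // Produce: each record is appended to the end of one partition's log.
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
        producer.send(new ProducerRecord<>("foo", "key", "value"));
      }

      Properties cons = new Properties();
      cons.put("bootstrap.servers", "localhost:9092");
      cons.put("group.id", "example-group");
      cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      // Consume: sequential reads from the partition log, positioned by offset.
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
        consumer.subscribe(List.of("foo"));
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> r : records) {
          System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
        }
      }
    }
  }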
5. Distributed Consensus: An Everlasting Tale
• Kafka needs consensus on:
• Broker metadata
• Topic metadata
• Client metadata (offsets, txns)
• And of course, replicated data itself
• Consensus access patterns vary:
• Control metadata propagation: low throughput (relatively), strict consistency
• Data replication: high throughput, low latency
6. Kafka Circa 2013
• Apache ZooKeeper for metadata
• Single controller elected to broadcast changes
• Control operations executed as ZK writes
• Leader-follower replication for data [VLDB 2015]
• Configurable latency / durability tradeoff
• Leader (re-)selected from in-sync replicas
[Diagram: Controller, Brokers, ZooKeeper]
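To make the ZooKeeper-era control plane concrete, here is a minimal sketch of ephemeral-znode controller election in the style Kafka used: whichever broker creates /controller first becomes the controller, and the znode disappears when that broker's session dies, triggering re-election. The connection string, payload format, and error handling are simplified assumptions:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class ControllerElectionSketch {
    public static void main(String[] args) throws Exception {
      // Each broker races to create the ephemeral /controller znode.
      ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
      int brokerId = 0; // illustrative broker id
      try {
        zk.create("/controller",
                  ("{\"brokerid\":" + brokerId + "}").getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
        System.out.println("broker " + brokerId + " won the election");
      } catch (KeeperException.NodeExistsException e) {
        System.out.println("another broker is already the controller");
      } finally {
        zk.close();
      }
    }
  }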
8. Challenges at Cloud Scale
• Single-controller syndromes
• Slow failover, ops latency, split-brain brokers, etc.
• Listener-based metadata propagation limits
• Exploding metadata state machines [SIGMOD 2021]
• New features == new metadata
• Metadata scattered on multiple “sources”
• Yet another system to operate
• Deployment and monitoring
• Security, networking, interface evolutions, etc.
[Diagram: Controller, Brokers, ZooKeeper]
How to scale Kafka clusters efficiently in the Cloud?
9. What do we really need for Consensus?
• A unified, locally replicable metadata LOG!
/brokers/topics/foo/partitions/0/state changed
/topics changed
/brokers/ids/0 changed
/config/topics/bar changed
/kafka-acl/group/grp1 changed
…
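For contrast, here is a hedged illustration of the same kinds of changes folded into one offset-ordered metadata log. The record names are in the style of KRaft's metadata records, but the offsets and fields shown are illustrative, not taken from the talk:

  0  RegisterBrokerRecord     { brokerId: 0 }
  1  TopicRecord              { name: "foo" }
  2  PartitionRecord          { topicId: <foo>, partitionId: 0 }
  3  ConfigRecord             { resourceType: TOPIC, resourceName: "bar" }
  4  AccessControlEntryRecord { principal: "grp1" }
  …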
13. Rethinking Kafka Control Plane on the LOG
• Why not have the local metadata changelog as the source of truth?
• Unified metadata replication APIs
• Async, multi in-flight log appends
• Pull-based log reads
• Versioned metadata state machines
• Local log offset == version numbers
• Easy membership management and split brain resolution
• Flexibility in consensus trade-offs
• Quorum controllers vs. single controller
• Selective metadata materialization
[Diagram: Metadata Quorum, Metadata Log, Metadata Listeners]
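A hedged sketch of the "versioned metadata state machine" idea above (not Kafka's actual controller code), where the offset of the last applied log record doubles as the version of the materialized state; the record type and fields are illustrative assumptions:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class MetadataStateMachineSketch {
    // Illustrative record type; real KRaft records are typed and schema-versioned.
    record MetadataRecord(long offset, String key, String value) {}

    private final Map<String, String> state = new HashMap<>();
    private long version = -1; // offset of the last applied record == state version

    // Apply a batch of records in log order; replay is idempotent because
    // records at or below the current version are skipped.
    public void apply(List<MetadataRecord> batch) {
      for (MetadataRecord rec : batch) {
        if (rec.offset() <= version) {
          continue;
        }
        state.put(rec.key(), rec.value());
        version = rec.offset();
      }
    }

    // Two replicas that report the same version hold identical state, so
    // membership checks and split-brain detection reduce to comparing offsets.
    public long version() {
      return version;
    }
  }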
14. KRaft: Kafka’s Log of All Logs [Kafka Summit APAC 2021]
• Log-based leader election
• No “split-brain” with multiple leaders
• No “gridlock” where no leader can be elected
• Quorum-based replication
• Favor latency over failure tolerance
• O(1) controller failover
• Piggy-back on Kafka’s log replication utilities
• Schema, NIO layer, log recovery algorithm
• Batching / compression / indexing / segmentation, etc.
• However, isolated access from data path: separate ports, queues, metrics
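A minimal configuration sketch of a combined-mode KRaft node, showing the controller quorum and the separate controller listener/port mentioned above; node ids, hosts, and ports are illustrative:

  # This node acts as both broker and controller
  # (a standalone controller would set process.roles=controller)
  process.roles=broker,controller
  node.id=1
  # Voters of the metadata quorum, as id@host:port
  controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093
  # Data-path and control-path traffic isolated on separate listeners/ports
  listeners=PLAINTEXT://:9092,CONTROLLER://:9093
  controller.listener.names=CONTROLLER
  inter.broker.listener.name=PLAINTEXT
  listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT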
15. Quorum Controller on top of KRaft Logs
[Diagram: Metadata Quorum, Observers, Metadata Log]
• Controllers run in a broker JVM or standalone
• Single-node Kafka cluster is possible
• Controller quorum can be isolated on the network
• Controller operations can be pipelined
• Brokers cache metadata read from the log
• Consistent snapshots
• Potential for clients to reason about consistent metadata as well
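A hedged sketch of the broker-side caching pattern above (illustrative types, not Kafka's internals): restore the newest consistent snapshot, then replay only the log records appended after the snapshot's last offset:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class BrokerMetadataCacheSketch {
    record Record(long offset, String key, String value) {}
    record Snapshot(long lastContainedOffset, Map<String, String> state) {}

    private final Map<String, String> cache = new HashMap<>();
    private long appliedOffset = -1;

    // Restore the newest consistent snapshot (state as of a known log offset)...
    public void restore(Snapshot snapshot) {
      cache.clear();
      cache.putAll(snapshot.state());
      appliedOffset = snapshot.lastContainedOffset();
    }

    // ...then replay only the records appended after that offset to catch up.
    public void replay(List<Record> tail) {
      for (Record r : tail) {
        if (r.offset() > appliedOffset) {
          cache.put(r.key(), r.value());
          appliedOffset = r.offset();
        }
      }
    }
  }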
16. KRaft Made Live
Hurdles to bring KRaft to production:
• Model Checking for Correctness: TLA+
• Performance tuning: fsync, leader/broker session timeouts, broker forwarding
• Integration challenges: JBOD, SCRAM, delegation tokens, metadata versioning
• ZK migration path: dynamic configuration, API compatibility
• Robustness: client quotas, disaster recovery
• Hardening…
21. KRaft in Production
• Default for new clusters in all regions in AWS, GCP, and Azure
• 2000+ clusters
• 20% of all partitions
• ~50ms p99 metadata log latency
22. Kora: The Cloud Native Engine for Kafka [VLDB 2023]
• KRaft: simple metadata consensus for the control plane
• Tiered storage: low-cost, predictable-performance data plane
• Multi-tenant resource isolation and management
• Automated upgrade and mitigation
• Elasticity, observability, durability, and more…