In this talk I'd like to cover an everlasting story in distributed systems: consensus. More specifically, the consensus challenges in Apache Kafka, and how we addressed them, from theory in papers to production in the cloud.
4. Apache Kafka: Streaming Platform
• Source-of-truth stream data storage
• De-facto programming paradigm for real-time events
• Kafka’s architecture:
• Data organized as partitioned topics
• Partitions are replicated & log-structured
• Clients produce to / consume from topics via sequential log IOs
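To make the client model concrete, here is a minimal sketch using the standard Kafka Java clients; the topic name, broker address, and consumer group are illustrative assumptions, not from the talk:

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class SequentialLogIoSketch {
    public static void main(String[] args) {
      Properties prod = new Properties();
      prod.put("bootstrap.servers", "localhost:9092");
      prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      // Produce: each record is appended to the end of one partition's log.
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
        producer.send(new ProducerRecord<>("foo", "key", "value"));
      }

      Properties cons = new Properties();
      cons.put("bootstrap.servers", "localhost:9092");
      cons.put("group.id", "example-group");
      cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      // Consume: sequential reads from the partition log, positioned by offset.
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
        consumer.subscribe(List.of("foo"));
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> r : records) {
          System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
        }
      }
    }
  }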
5. Distributed Consensus: An Everlasting Tale
• Kafka needs consensus on:
• Broker metadata
• Topic metadata
• Client metadata (offsets, txns)
• And of course, replicated data itself
• Consensus access patterns vary:
• Control metadata propagation: low throughput (relatively), strict consistency
• Data replication: high throughput, low latency
6. Kafka Circa 2013
• Apache ZooKeeper for metadata
• Single controller elected to broadcast changes
• Control operations executed as ZK writes
• Leader-follower replication for data [VLDB 2015]
• Configurable latency / durability tradeoff
• Leader (re-)selected from in-sync replicas
[Diagram: Controller, Brokers, ZooKeeper]
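To make the ZooKeeper-era control plane concrete, here is a minimal sketch of ephemeral-znode controller election in the style Kafka used: whichever broker creates /controller first becomes the controller, and the znode disappears when that broker's session dies, triggering re-election. The connection string, payload format, and error handling are simplified assumptions:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class ControllerElectionSketch {
    public static void main(String[] args) throws Exception {
      // Each broker races to create the ephemeral /controller znode.
      ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
      int brokerId = 0; // illustrative broker id
      try {
        zk.create("/controller",
                  ("{\"brokerid\":" + brokerId + "}").getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
        System.out.println("broker " + brokerId + " won the election");
      } catch (KeeperException.NodeExistsException e) {
        System.out.println("another broker is already the controller");
      } finally {
        zk.close();
      }
    }
  }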
8. Challenges at Cloud Scale
• Single-controller syndromes
• Slow failover, ops latency, split-brain brokers, etc.
• Listener-based metadata propagation limits
• Exploding metadata state machines [SIGMOD 2021]
• New features == new metadata
• Metadata scattered on multiple “sources”
• Yet another system to operate
• Deployment and monitoring
• Security, networking, interface evolutions, etc.
[Diagram: Controller, Brokers, ZooKeeper]
How to scale Kafka clusters efficiently in the Cloud?
9. What do we really need for Consensus?
• A unified, locally replicable metadata LOG!
/brokers/topics/foo/partitions/0/state changed
/topics changed
/brokers/ids/0 changed
/config/topics/bar changed
/kafka-acl/group/grp1 changed
…
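For contrast, here is a hedged illustration of the same kinds of changes folded into one offset-ordered metadata log. The record names are in the style of KRaft's metadata records, but the offsets and fields shown are illustrative, not taken from the talk:

  0  RegisterBrokerRecord     { brokerId: 0 }
  1  TopicRecord              { name: "foo" }
  2  PartitionRecord          { topicId: <foo>, partitionId: 0 }
  3  ConfigRecord             { resourceType: TOPIC, resourceName: "bar" }
  4  AccessControlEntryRecord { principal: "grp1" }
  …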
13. Rethinking Kafka Control Plane on the LOG
• Why not have the local metadata changelog as the source of truth?
• Unified metadata replication APIs
• Async, multi in-flight log appends
• Pull-based log reads
• Versioned metadata state machines
• Local log offset == version numbers
• Easy membership management and split brain resolution
• Flexibility in consensus trade-offs
• Quorum controllers vs. single controller
• Selective metadata materialization
[Diagram: Metadata Quorum, Metadata Log, Metadata Listeners]
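A hedged sketch of the "versioned metadata state machine" idea above (not Kafka's actual controller code), where the offset of the last applied log record doubles as the version of the materialized state; the record type and fields are illustrative assumptions:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class MetadataStateMachineSketch {
    // Illustrative record type; real KRaft records are typed and schema-versioned.
    record MetadataRecord(long offset, String key, String value) {}

    private final Map<String, String> state = new HashMap<>();
    private long version = -1; // offset of the last applied record == state version

    // Apply a batch of records in log order; replay is idempotent because
    // records at or below the current version are skipped.
    public void apply(List<MetadataRecord> batch) {
      for (MetadataRecord rec : batch) {
        if (rec.offset() <= version) {
          continue;
        }
        state.put(rec.key(), rec.value());
        version = rec.offset();
      }
    }

    // Two replicas that report the same version hold identical state, so
    // membership checks and split-brain detection reduce to comparing offsets.
    public long version() {
      return version;
    }
  }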
14. KRaft: Kafka’s Log of All Logs [Kafka Summit APAC 2021]
• Log-based leader election
• No “split-brain” with multiple leaders
• No “gridlock” where no leader can be elected
• Quorum-based replication
• Favor latency over failure tolerance
• O(1) controller failover
• Piggy-back on Kafka’s log replication utilities
• Schema, NIO layer, log recovery algorithm
• Batching / compression / indexing / segmentation, etc.
• However, isolated access from data path: separate ports, queues, metrics
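A minimal configuration sketch of a combined-mode KRaft node, showing the controller quorum and the separate controller listener/port mentioned above; node ids, hosts, and ports are illustrative:

  # This node acts as both broker and controller
  # (a standalone controller would set process.roles=controller)
  process.roles=broker,controller
  node.id=1
  # Voters of the metadata quorum, as id@host:port
  controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093
  # Data-path and control-path traffic isolated on separate listeners/ports
  listeners=PLAINTEXT://:9092,CONTROLLER://:9093
  controller.listener.names=CONTROLLER
  inter.broker.listener.name=PLAINTEXT
  listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT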
15. Quorum Controller on top of KRaft Logs
[Diagram: Metadata Quorum, Observers, Metadata Log]
• Controllers run in a broker JVM or standalone
• Single-node Kafka cluster is possible
• Controller quorum can be isolated on the network
• Controller operations can be pipelined
• Brokers cache metadata read from the log
• Consistent snapshots
• Potential for clients to reason about consistent metadata as well
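A hedged sketch of the broker-side caching pattern above (illustrative types, not Kafka's internals): restore the newest consistent snapshot, then replay only the log records appended after the snapshot's last offset:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class BrokerMetadataCacheSketch {
    record Record(long offset, String key, String value) {}
    record Snapshot(long lastContainedOffset, Map<String, String> state) {}

    private final Map<String, String> cache = new HashMap<>();
    private long appliedOffset = -1;

    // Restore the newest consistent snapshot (state as of a known log offset)...
    public void restore(Snapshot snapshot) {
      cache.clear();
      cache.putAll(snapshot.state());
      appliedOffset = snapshot.lastContainedOffset();
    }

    // ...then replay only the records appended after that offset to catch up.
    public void replay(List<Record> tail) {
      for (Record r : tail) {
        if (r.offset() > appliedOffset) {
          cache.put(r.key(), r.value());
          appliedOffset = r.offset();
        }
      }
    }
  }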
16. KRaft Made Live
Hurdles to bring KRaft to production:
• Model Checking for Correctness: TLA+
• Performance tuning: fsync, leader/broker session timeouts, broker forwarding
• Integration challenges: JBOD, SCRAM, delegation tokens, metadata versioning
• ZK migration path: dynamic configuration, API compatibility
• Robustness: client quotas, disaster recovery
• Hardening…
21. KRaft in Production
• Default for new clusters in all regions in AWS, GCP, and Azure
• 2000+ clusters
• 20% of all partitions
• ~50ms p99 metadata log latency
22. Kora: The Cloud Native Engine for Kafka [VLDB 2023]
• KRaft: simple metadata consensus for the control plane
• Tiered storage: low-cost, predictable-performance data plane
• Multi-tenant resource isolation and management
• Automated upgrade and mitigation
• Elasticity, observability, durability, and more…