Consensus in Apache Kafka:
From Theory to Production
Guozhang Wang, Jason Gustafson
SIGMOD 2023
01 Kafka’s Control Plane Needs
02 The Quorum Controller: KRaft
03 KRaft Implementation
04 KRaft in Prod (in Cloud)
Apache Kafka: Streaming Platform
• Source-of-truth stream data storage
• De facto programming paradigm for real-time events
• Kafka’s architecture:
• Data organized as partitioned topics
• Partitions are replicated & log-structured
• Clients produce to / consume from topics via sequential log I/O
Distributed Consensus: An Everlasting Tale
• Kafka needs consensus on:
• Broker metadata
• Topic metadata
• Client metadata (offsets, txns)
• And of course, replicated data itself
• Consensus access patterns vary:
• Control metadata propagation: low throughput (relatively), strict consistency
• Data replication: high throughput, low latency
Kafka Circa 2013
• Apache ZooKeeper for metadata
• Single controller elected to broadcast changes
• Control operations executed as ZK writes
• Leader-follower replication for data [VLDB 2015]
• Configurable latency / durability tradeoff
• Leader (re-)selected from in-sync replicas
[Figure: Controller, Brokers, ZooKeeper]
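As a rough illustration of the bullet on leader (re-)selection, the sketch below picks a new partition leader from the in-sync replica set. The function name `elect_leader` and the rule of taking the first eligible replica in assignment order are assumptions made for this sketch, not a statement of Kafka's exact election code.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[int], isr: Set[int], live: Set[int]) -> Optional[int]:
    """Pick the first assigned replica that is both in sync and alive.

    `replicas` is the assignment order for one partition; `isr` and `live`
    are the in-sync and currently reachable broker ids (hypothetical inputs).
    """
    for broker_id in replicas:
        if broker_id in isr and broker_id in live:
            return broker_id
    return None  # no eligible replica, so the partition has no leader


# Broker 1 was the leader and has failed; broker 3 is alive but out of sync.
new_leader = elect_leader(replicas=[1, 2, 3], isr={1, 2}, live={2, 3})
```

Restricting candidates to the in-sync set is what ties the durability setting to failover: a replica that may be missing acknowledged records is never chosen in this sketch.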
Challenges for the Cloud Scale
• Single-controller syndromes
• Slow failover, ops latency, split-brain brokers, etc.
• Listener-based metadata propagation limits
• Exploding metadata state machines [SIGMOD 2021]
• New features == new metadata
• Metadata scattered on multiple “sources”
• Yet another system to operate
• Deployment and monitoring
• Security, networking, interface evolutions, etc.
[Figure: Controller, Brokers, ZooKeeper]
How to scale Kafka clusters efficiently in the Cloud?
What do we really need for Consensus?
• A unified, locally replicable metadata LOG!
/brokers/topics/foo/partitions/0/state changed
/topics changed
/brokers/ids/0 changed
/config/topics/bar changed
/kafka-acl/group/grp1 changed
…
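The idea of a single metadata log can be illustrated with a minimal sketch. The record shape (`MetadataRecord`), the `MetadataLog` class, and the key names are assumptions for illustration and are not Kafka's actual record schema. The sketch only shows that replaying an ordered log of changes yields the same state wherever it is replayed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class MetadataRecord:
    """One change event; the fields are hypothetical, not Kafka's schema."""
    key: str              # e.g. "/brokers/ids/0"
    value: Optional[Any]  # None means the key was deleted


@dataclass
class MetadataLog:
    """Append-only list of records; the list index plays the role of the offset."""
    records: List[MetadataRecord] = field(default_factory=list)

    def append(self, record: MetadataRecord) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the appended record


def replay(log: MetadataLog, up_to_offset: Optional[int] = None) -> Dict[str, Any]:
    """Materialize state by applying records in offset order."""
    end = len(log.records) if up_to_offset is None else up_to_offset + 1
    state: Dict[str, Any] = {}
    for record in log.records[:end]:
        if record.value is None:
            state.pop(record.key, None)
        else:
            state[record.key] = record.value
    return state


log = MetadataLog()
log.append(MetadataRecord("/brokers/ids/0", {"host": "broker-0"}))
log.append(MetadataRecord("/config/topics/bar", {"retention.ms": 1000}))
log.append(MetadataRecord("/brokers/ids/0", None))

state_at_1 = replay(log, up_to_offset=1)  # state as of offset 1
state_latest = replay(log)                # state after the whole log
```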
Rethinking Kafka Control Plane on the LOG
• Why not have the local metadata changelog as the source of truth?
• Unified metadata replication APIs
• Async, multi in-flight log appends
• Pull-based log reads
• Versioned metadata state machines
• Local log offset == version numbers
• Easy membership management and split brain resolution
• Flexibility in consensus trade-offs
• Quorum controllers vs. single controller
• Selective metadata materialization
[Figure: Metadata Listeners, Metadata Log, Metadata Quorum]
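The combination of pull-based log reads and "local log offset == version numbers" can be sketched as follows. `LeaderLog`, `MetadataReplica`, and the `(key, value)` record shape are hypothetical names for this illustration, not Kafka classes. Each replica pulls from its own position, and the offset of the last record it applied serves as the version of its local state.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value); a hypothetical record shape


class LeaderLog:
    """Holds the log; readers pull from it instead of being pushed to."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, record: Record) -> int:
        self._records.append(record)
        return len(self._records) - 1

    def fetch(self, from_offset: int, max_records: int) -> List[Record]:
        return self._records[from_offset:from_offset + max_records]


class MetadataReplica:
    """Caches state locally; `version` is the offset of the last applied record."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.version = -1  # nothing applied yet

    def poll(self, leader: LeaderLog, max_records: int = 2) -> int:
        batch = leader.fetch(self.version + 1, max_records)
        for key, value in batch:
            self.state[key] = value
            self.version += 1
        return len(batch)


leader = LeaderLog()
for record in [("topic/foo", "created"), ("topic/bar", "created"), ("topic/foo", "deleted")]:
    leader.append(record)

fast, slow = MetadataReplica(), MetadataReplica()
while fast.poll(leader):
    pass           # the fast replica catches up fully
slow.poll(leader)  # the slow replica has pulled only one batch so far
```

Because the version is just a log offset, comparing `fast.version` with `slow.version` tells which replica is behind and by how many records, which is one way to read the "easy split brain resolution" bullet.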
KRaft: Kafka’s Log of All Logs [Kafka Summit APAC 2021]
• Log-based leader election
• No “split-brain” with multiple leaders
• No “gridlock” in which no leader gets elected
• Quorum-based replication
• Favor latency over failure tolerance
• O(1) controller failover
• Piggy-back on Kafka’s log replication utilities
• Schema, NIO layer, log recovery algorithm
• Batching / compression / indexing / segmentation, etc.
• However, isolated access from data path: separate ports, queues, metrics
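One common way to express quorum-based replication is to treat a record as committed once a majority of voters hold it. The sketch below computes that offset. `majority_committed_offset` and the voter ids are hypothetical names, and the logic is a generic majority-quorum rule rather than KRaft's actual implementation. The comparison with waiting for every replica shows why a single slow voter does not hold back commits.

```python
from typing import Dict


def majority_committed_offset(end_offsets: Dict[int, int]) -> int:
    """Largest offset that is replicated on a majority of voters.

    `end_offsets` maps a hypothetical voter id to the last offset it holds.
    """
    if not end_offsets:
        raise ValueError("quorum must have at least one voter")
    ranked = sorted(end_offsets.values(), reverse=True)
    majority = len(ranked) // 2 + 1
    # The majority-th highest end offset is held by at least `majority` voters.
    return ranked[majority - 1]


# Voter 3 is lagging; voters 1 and 2 already form a majority of three.
end_offsets = {1: 10, 2: 8, 3: 5}
quorum_commit = majority_committed_offset(end_offsets)
all_replica_commit = min(end_offsets.values())  # what waiting for everyone would give
```

The price of the lower commit latency is that a quorum of 2f+1 voters tolerates only f failures, which is the trade-off named in the "Favor latency over failure tolerance" bullet.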
Quorum Controller on top of KRaft Logs
[Figure: Metadata Quorum, Observers, Metadata Log]
• Controller runs in a broker JVM or standalone
• Single-node Kafka cluster is possible
• Controller quorum can be isolated on the network
• Controller operations can be pipelined
• Brokers cache metadata read from the log
• Consistent snapshots
• Potential for clients to reason about consistent metadata as well
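The relationship between consistent snapshots and the metadata log can be sketched as below. The helper names (`apply_records`, `take_snapshot`, `load`) and the record shape are assumptions for illustration only. The point is that a snapshot taken at some offset, plus the records after that offset, yields the same state as replaying the log from the beginning.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # hypothetical (key, value) metadata record


def apply_records(state: Dict[str, str], records: List[Record]) -> Dict[str, str]:
    """Return a new state with the records applied in order."""
    new_state = dict(state)
    for key, value in records:
        new_state[key] = value
    return new_state


def take_snapshot(log: List[Record], end_offset: int) -> Tuple[int, Dict[str, str]]:
    """Snapshot covering offsets [0, end_offset]; returns (end_offset, state)."""
    return end_offset, apply_records({}, log[:end_offset + 1])


def load(snapshot: Tuple[int, Dict[str, str]], log: List[Record]) -> Dict[str, str]:
    """Start from the snapshot, then replay only the records after it."""
    end_offset, state = snapshot
    return apply_records(state, log[end_offset + 1:])


log = [("broker/0", "registered"), ("topic/foo", "created"),
       ("broker/0", "fenced"), ("topic/bar", "created")]
snapshot = take_snapshot(log, end_offset=1)
from_snapshot = load(snapshot, log)
from_scratch = apply_records({}, log)
```

A broker that restarts or joins late could, under this model, load the latest snapshot and fetch only the tail of the log instead of the full history.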
KRaft Made Live
Hurdles to bring KRaft to production:
• Model checking for correctness: TLA+
• Performance tuning: fsync, leader/broker session timeouts, broker forwarding
• Integration challenges: JBOD, SCRAM, delegation tokens, metadata versioning
• ZK migration path: dynamic configuration, API compatibility
• Robustness: client quotas, disaster recovery
• Hardening…
Production Incident
[Figure: Brokers, Controller Quorum, Broker Session (heartbeats)]
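The figure shows brokers holding sessions with the controller quorum through heartbeats. As a minimal sketch of that mechanism in general (not a reconstruction of the incident itself), the class below tracks the last heartbeat per broker and reports sessions that have timed out. `BrokerSessions`, its methods, and the timeout value are hypothetical and chosen only for this example.

```python
from typing import Dict, List


class BrokerSessions:
    """Tracks the last heartbeat per broker; the timeout is an arbitrary example value."""

    def __init__(self, session_timeout_ms: int) -> None:
        self.session_timeout_ms = session_timeout_ms
        self._last_heartbeat_ms: Dict[int, int] = {}

    def heartbeat(self, broker_id: int, now_ms: int) -> None:
        """Record that a heartbeat from `broker_id` arrived at `now_ms`."""
        self._last_heartbeat_ms[broker_id] = now_ms

    def expired(self, now_ms: int) -> List[int]:
        """Broker ids whose last heartbeat is older than the session timeout."""
        return sorted(
            broker_id
            for broker_id, last in self._last_heartbeat_ms.items()
            if now_ms - last > self.session_timeout_ms
        )


sessions = BrokerSessions(session_timeout_ms=9000)
sessions.heartbeat(1, now_ms=0)
sessions.heartbeat(2, now_ms=0)
sessions.heartbeat(1, now_ms=6000)  # broker 1 keeps its session alive
expired_at_10s = sessions.expired(now_ms=10000)
```

Time is passed in explicitly so the behavior is deterministic. In this model a delayed heartbeat is indistinguishable from a dead broker, which is why session timeouts appear among the tuning items on the "KRaft Made Live" slide.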
KRaft in Production
• Default for new clusters in all regions in AWS, GCP, and Azure
• 2000+ clusters
• 20% of all partitions
• ~50ms p99 metadata log latency
Kora: The Cloud Native Engine for Kafka [VLDB 2023]
• KRaft: simple metadata consensus for control plane
• Tiered storage: low-cost, predictable-performance data plane
• Multi-tenant resource isolation and management
• Automated upgrade and mitigation
• Elasticity, observability, durability, and more
Thank you!
cnfl.io/meetups cnfl.io/slack
cnfl.io/blog
