The Log of All Logs
Raft-based Consensus inside Kafka
Guozhang Wang, Engineer @ Confluent
Kafka Summit APAC
01
Why Replace
Zookeeper?
02
Why Choose
Raft?
03
KRaft vs. Raft
What’s the Difference?
04
The Quorum
Controller
(on top of KRaft)
The (old) Controller
3
Controller
Brokers
Zookeeper
• Single node, elected via ZK
The (old) Controller
4
Controller
Brokers
Zookeeper
• Single node, elected via ZK
• Maintains cluster metadata, persisted
in ZK as source of truth
• Broker membership
• Partition assignment
• Configs, Auth, ACLs, etc
The (old) Controller
5
Controller
Brokers
Zookeeper
• Single node, elected via ZK
• Maintains cluster metadata, persisted
in ZK as source of truth
• Broker membership
• Partition assignment
• Configs, Auth, ACLs, etc
• Handles (most) cluster-level admin
operations
• (Mostly) single-threaded
• Events triggered from ZK watches
• Push-based metadata propagation
• Handling logic often requires more ZK read/write
The (old) Controller: Broker Shutdown
6
Controller
Brokers
Zookeeper
ISR {1, 2, 3}
The (old) Controller: Broker Shutdown
7
Controller
Brokers
Zookeeper
SIGTERM
ISR {1, 2, 3}
The (old) Controller: Broker Shutdown
8
Controller
Brokers
Zookeeper
ISR {1, 2, 3}
ControlledShutdown
ISR {2, 3}
The (old) Controller: Broker Shutdown
9
Controller
Brokers
Zookeeper
ISR {2, 3}
Write ZK
The (old) Controller: Broker Shutdown
10
Controller
Brokers
Zookeeper
ISR {2, 3}
MetadataUpdate
LeaderAndISR
The (old) Controller: Broker Shutdown
11
Controller
Brokers
Zookeeper
ISR {2, 3}
ControlledShutdown
The (old) Controller: Broker Shutdown
12
Controller
Brokers
Zookeeper
Writes to ZK
are synchronous
Impact: longer
shutdown time
Metadata propagation req/resp
is per-partition
Impact: client timeout
The (old) Controller: Controller Failover
13
Controller
Brokers
Zookeeper
The (old) Controller: Controller Failover
14
Brokers
Zookeeper
The (old) Controller: Controller Failover
15
Brokers
Zookeeper
The (old) Controller: Controller Failover
16
Brokers
Zookeeper
Controller
The (old) Controller: Controller Failover
17
Brokers
Zookeeper
Controller
Read ZK
ISR {1, 2, 3}
The (old) Controller: Controller Failover
18
Brokers
Zookeeper
Controller ISR {1, 2, 3}
ISR {2, 3}
Write ZK
The (old) Controller: Controller Failover
19
Brokers
Zookeeper
Controller ISR {2, 3}
MetadataUpdate
LeaderAndISR
The (old) Controller: Controller Failover
20
Brokers
Zookeeper
Controller
Reads from ZK are
O(num.partitions)
Impact: longer
unavailability window
Controller Scalability Limitations
21
• Controller failover, broker start / shutdown events all require RPCs that
are O(num.partitions)
Controller Scalability Limitations
22
• Controller failover, broker start / shutdown events all require RPCs that
are O(num.partitions)
• Metadata persistence in ZK is synchronous and also O(num.partitions)
Controller Scalability Limitations
23
• Controller failover, broker start / shutdown events all require RPCs that
are O(num.partitions)
• Metadata persistence in ZK is synchronous and also O(num.partitions)
• ZK as the source-of-truth
• Znode Size limit, max num.watchers limit, watcher fire orderings, etc
• Controller’s metadata view is often out of date as all brokers (& admin clients) write to ZK
• Broker’s metadata view can diverge over time due to push-based propagation
How to get to 1K brokers and 1M partitions in a cluster?
What’s really behind the Controller / Zookeeper?
24
• A metadata LOG!
/brokers/topics/foo/partitions/0/state changed
/topics changed
/brokers/ids/0 changed
/config/topics/bar changed
/kafka-acl/group/grp1 changed
…
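Each of these ZK change events can be read as one record in an ordered, append-only metadata log. A minimal Python sketch of that idea (the `MetadataRecord`/`MetadataLog` names are illustrative, not the actual KRaft record schemas):

```python
from dataclasses import dataclass, field

@dataclass
class MetadataRecord:
    key: str    # e.g. "/brokers/topics/foo/partitions/0/state"
    value: dict # the new metadata content for that key

@dataclass
class MetadataLog:
    records: list = field(default_factory=list)

    def append(self, record: MetadataRecord) -> int:
        """Append a change event; its offset is its position in the log."""
        self.records.append(record)
        return len(self.records) - 1

    def replay(self) -> dict:
        """Rebuild the latest metadata view by replaying the log in order."""
        state = {}
        for rec in self.records:
            state[rec.key] = rec.value
        return state
```

Any reader that replays the log to the same offset reconstructs exactly the same metadata view, which is the property the next slides build on.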
Rethinking the Controller based on the LOG
25
• Why not make the metadata log itself the source of truth?
Rethinking the Controller based on the LOG
26
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller
• Multiple operations in flight
• Async APIs to write to the log
• Controller co-locates with the metadata log, an internal Kafka topic
Rethinking the Controller based on the LOG
27
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller
• Multiple operations in flight
• Async APIs to write to the log
• Controller co-locates with the metadata log, an internal Kafka topic
• Metadata propagation not via RPC, but log replication
• Brokers locally cache metadata from the log, fix divergence
• Isolate control plane from data path: separate ports, queues, metrics, etc
Rethinking the Controller based on the LOG
28
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller
• Multiple operations in flight
• Async APIs to write to the log
• Controller co-locates with the metadata log, an internal Kafka topic
• Metadata propagation not via RPC, but log replication
• Brokers locally cache metadata from the log, fix divergence
• Isolate control plane from data path: separate ports, queues, metrics, etc
• A quorum of controllers, not a single controller
• Controller failover to standby is O(1)
• KRaft protocol to favor latency over failure tolerance
Log Replication: Primary-backup
29
Logs
Leader
Logs
Logs
Follower-1 Follower-2
Write
Log Replication: Quorum
30
Logs
Leader
Logs
Logs
Follower-1 Follower-2
Write
Primary-backup vs. Quorum
31
• Kafka replication tolerates f failures with f+1 replicas;
quorum protocols (Paxos, Raft, etc.) need 2f+1 replicas to tolerate f failures
• Kafka’s replication needs to wait for all followers in the ISR;
quorum replication waits only for a majority, giving better latency
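The f+1 vs. 2f+1 trade-off above can be made concrete with a little arithmetic (function names here are illustrative):

```python
def primary_backup_tolerance(replicas: int) -> int:
    # Kafka ISR-style replication: with f+1 replicas the cluster can
    # lose f of them and still keep one copy of every committed record.
    return replicas - 1

def quorum_tolerance(voters: int) -> int:
    # Majority quorum (Raft/Paxos): 2f+1 voters tolerate f failures,
    # since a majority must stay up to commit.
    return (voters - 1) // 2

# With 3 replicas, primary-backup survives 2 failures; a quorum only 1.
assert primary_backup_tolerance(3) == 2
assert quorum_tolerance(3) == 1
# To tolerate 2 failures, a quorum needs 5 voters.
assert quorum_tolerance(5) == 2
```

The flip side is latency: primary-backup must hear from every ISR member before committing, while a quorum commits as soon as a majority has the record.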
KRaft: Kafka’s Raft Implementation
32
• Piggy-back on Kafka’s log replication utilities
• Record schema versioning
• Batching and compression
• Log recovery, segmentation, indexing, checksumming
• NIO network transfer, leader epoch caching, tooling
KRaft: Kafka’s Raft Implementation
33
• Piggy-back on Kafka’s log replication utilities
• Record schema versioning
• Batching and compression
• Log recovery, segmentation, indexing, checksumming
• NIO network transfer, leader epoch caching, tooling
• Non-ZK based leader election protocol
• No “split-brain” where multiple leaders are active
• No “grid-locking” where no leader can be elected
KRaft: Kafka’s Raft Implementation
34
• Leader election to allow only one leader at each epoch
Logs
Voter-1
Logs
Logs
Voter-2 Voter-3
KRaft: Kafka’s Raft Implementation
35
• Leader election to allow only one leader at each epoch
Logs
Candidate-1
Logs
Logs
Voter-2 Voter-3
Vote for Me (epoch=3, end=6)
Voter-1
Yes!
KRaft: Kafka’s Raft Implementation
36
• Leader election to allow only one leader at each epoch
Logs
Candidate-1
Logs
Logs
Voter-2 Voter-3
Leader-1
Begin Epoch (3)
KRaft: Kafka’s Raft Implementation
37
• Leader election to allow only one leader at each epoch
Logs
Candidate-1
Logs
Logs
Voter-2 Voter-3
Vote for Me (epoch=2, end=3)
No..
No..
KRaft: Kafka’s Raft Implementation
38
• Leader election to allow only one leader at each epoch
• A follower becomes a candidate after an election timeout and asks others for votes
• Voters give only one vote per epoch: the vote decision is persisted locally
• A voter only grants its vote if its own log is not “longer” than the candidate’s
• Simple majority wins; randomized backoff avoids gridlocked elections
• As a result, elected leaders must have all committed records up to their epoch
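The vote-granting rule above can be sketched as follows. This is a simplified model, not the actual implementation; names like `VoterState` and `maybe_grant_vote` are hypothetical (the real checks are specified in KIP-595):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoterState:
    epoch: int                       # highest epoch this voter has seen
    log_end_epoch: int               # epoch of the last record in its log
    log_end_offset: int              # end offset of its log
    voted_for: Optional[str] = None  # persisted: at most one vote per epoch

def maybe_grant_vote(voter: VoterState, candidate_id: str,
                     candidate_epoch: int,
                     candidate_end_epoch: int,
                     candidate_end_offset: int) -> bool:
    """Grant a vote only if (1) the candidate's epoch is not stale and we
    haven't voted for anyone else in it, and (2) our own log is not 'longer'
    than the candidate's (compare last-record epoch first, then end offset)."""
    if candidate_epoch < voter.epoch:
        return False  # stale epoch, as on slide 37
    if candidate_epoch == voter.epoch and voter.voted_for not in (None, candidate_id):
        return False  # only one vote per epoch
    own = (voter.log_end_epoch, voter.log_end_offset)
    theirs = (candidate_end_epoch, candidate_end_offset)
    if own > theirs:
        return False  # our log is longer; candidate may be missing committed records
    voter.epoch = candidate_epoch
    voter.voted_for = candidate_id  # persisted locally in the real protocol
    return True
```

Because a candidate can only win with a majority of such votes, any elected leader’s log covers every record committed by a previous majority.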
KRaft: Kafka’s Raft Implementation
39
• Pull-based replication (like Kafka) instead of push-based as in the literature
• Need a specific API to begin a new epoch (in the literature it is via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
Fetch (epoch=3, end=7)
KRaft: Kafka’s Raft Implementation
40
• Pull-based replication (like Kafka) instead of push-based as in the literature
• Need a specific API to begin a new epoch (in the literature it is via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
OK (data)
KRaft: Kafka’s Raft Implementation
41
• Pull-based replication (like Kafka) instead of push-based as in the literature
• Need a specific API to begin a new epoch (in the literature it is via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
Fetch (epoch=2, end=8)
KRaft: Kafka’s Raft Implementation
42
• Pull-based replication (like Kafka) instead of push-based as in the literature
• Need a specific API to begin a new epoch (in the literature it is via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
No.. (epoch=2, end=6)
KRaft: Kafka’s Raft Implementation
43
• Pull-based replication (like Kafka) instead of push-based as in the literature
• Need a specific API to begin a new epoch (in the literature it is via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
Fetch (epoch=2, end=6)
KRaft: Kafka’s Raft Implementation
44
• Pull-based replication (like Kafka) instead of push-based as in the literature
• Need a specific API to begin a new epoch (in the literature it is via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
OK (data)
KRaft: Kafka’s Raft Implementation
45
• Pull-based replication (like Kafka) instead of push-based as in the literature
• Need a specific API to begin a new epoch (in the literature it is via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
• Pros: fewer round-trips for log reconciliation, no “disruptive servers”
• Cons: replication commit requires the follower’s next fetch
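The diverging-follower flow on slides 41–44 (fetch at end=8, rejected with end=6, truncate, refetch) can be sketched as below. This is a simplified model under illustrative names; the real protocol validates a full chain of epochs (KIP-595):

```python
def leader_handle_fetch(leader_log, fetch_epoch, fetch_offset):
    """Leader side: validate the follower's (epoch, offset) fetch position.
    leader_log is a list of (epoch, value) entries, indexed by offset.
    Returns ('ok', data) or ('diverged', epoch, leader_end_of_that_epoch)."""
    epoch_end = 0
    for off, (epoch, _) in enumerate(leader_log):
        if epoch <= fetch_epoch:
            epoch_end = off + 1  # offset just past the last record of fetch_epoch
    if fetch_offset > epoch_end:
        # Follower's log diverged: tell it where that epoch really ends.
        return ("diverged", fetch_epoch, epoch_end)
    return ("ok", leader_log[fetch_offset:])

def follower_fetch_loop(leader_log, follower_log):
    """Follower side: fetch, truncate on divergence, then fetch again."""
    while True:
        last_epoch = follower_log[-1][0] if follower_log else 0
        resp = leader_handle_fetch(leader_log, last_epoch, len(follower_log))
        if resp[0] == "ok":
            follower_log.extend(resp[1])
            return follower_log
        _, _, end = resp
        del follower_log[end:]  # truncate the diverged suffix, then retry
```

Reconciliation needs only the round-trips shown in the slides: one rejected fetch to learn the truncation point, one successful fetch to catch up.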
KRaft: Kafka’s Raft Implementation
46
• More details
• Controller module isolation: separate ports and queues, separate node ID space
• Snapshots: a consistent view of the full state; new brokers can fetch snapshots
• Delta records: think “PartitionChangeRecord” vs. “PartitionRecord”
• State Machine API: triggered upon commit, used for metadata caching
• And more..
[KIP-500, KIP-595, KIP-630, KIP-640]
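The delta-vs-full distinction above can be illustrated with hypothetical record shapes (the actual schemas are defined by the KRaft metadata records, e.g. in KIP-631; the dicts and `apply_change` helper here are only a sketch):

```python
# Full record: a complete snapshot of one partition's state.
partition_record = {
    "topic": "foo", "partition": 0,
    "replicas": [1, 2, 3], "isr": [1, 2, 3], "leader": 1,
}

# Delta record: only the fields that changed (e.g. after broker 1 shuts down).
partition_change_record = {
    "topic": "foo", "partition": 0,
    "isr": [2, 3], "leader": 2,
}

def apply_change(state: dict, delta: dict) -> dict:
    """Apply a delta record on top of the last known full state."""
    merged = dict(state)
    merged.update({k: v for k, v in delta.items()
                   if k not in ("topic", "partition")})  # keys only identify the partition
    return merged
```

Writing deltas keeps each metadata append small, while snapshots periodically fold the accumulated deltas back into full records.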
Quorum Controller on top of KRaft Logs
47
Quorum
Observers
Leader
Voter
Voter
Metadata Log
• Controllers run in a broker JVM or standalone
• Single-node Kafka cluster is possible
• Controller operations can be pipelined
• Brokers cache metadata read from the log
• Metadata is naturally “versioned” by log offset
• Divergence / corner cases resolvable
• Broker liveness is checked via heartbeats
• Controlled shutdown piggy-backed on heartbeat req/resp
• Simpler implementation and fewer timeout risks
• Fence brokers during startup until they are ready
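The heartbeat-based liveness and fencing described above can be sketched controller-side as follows (the class and method names are illustrative, not the actual quorum controller API):

```python
class BrokerHeartbeats:
    """Sketch of controller-side broker liveness tracking: a broker is
    fenced until it reports ready, and again whenever its heartbeat goes
    stale past the session timeout."""

    def __init__(self, session_timeout_s: float = 18.0):
        self.session_timeout_s = session_timeout_s
        self.last_heartbeat = {}  # broker_id -> timestamp of last heartbeat
        self.ready = set()        # brokers that have finished startup

    def heartbeat(self, broker_id: int, ready: bool, now: float) -> dict:
        """Handle one heartbeat request; the response carries fencing status.
        Controlled-shutdown coordination can be piggy-backed on this req/resp."""
        self.last_heartbeat[broker_id] = now
        if ready:
            self.ready.add(broker_id)
        return {"fenced": not self.is_alive(broker_id, now)}

    def is_alive(self, broker_id: int, now: float) -> bool:
        last = self.last_heartbeat.get(broker_id)
        return (broker_id in self.ready and last is not None
                and now - last <= self.session_timeout_s)
```

Folding liveness, startup fencing, and controlled shutdown into one periodic request is what makes the implementation simpler and less timeout-prone than the old per-event RPCs.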
Quorum Controller: Broker Shutdown
48
Quorum
Observers
Leader
Voter
Voter
Metadata Log
Batch append partition / broker changes
Quorum Controller: Failover
49
Quorum
Observers
Leader
Voter
Voter
Metadata Log
Newly Elected Leader already
has committed metadata
Leader
Quorum Controller Scalability
50
Recap and Roadmap
51
• KRaft: the source-of-truth metadata log for all Kafka data logs!
• Co-locate metadata storage with processing
• Better replication latency with pull-based leader election and log reconciliation
• Quorum Controller: built on top of KRaft
• Strictly ordered metadata with single writer
• Fast controller failover and broker restarts
• Roadmap
• Early access in the 2.8 release: KRaft replication, state snapshot
• Ongoing work on missing features: quorum reconfiguration, security, JBOD, EOS, etc.
• ZK mode will first be deprecated in a coming bridge release and then removed in a future release
52
Thank you!
Guozhang Wang
guozhang@confluent.io in/guozhangwang @guozhangwang
cnfl.io/meetups cnfl.io/slack
cnfl.io/blog