The Log of All Logs
Raft-based Consensus inside Kafka
Guozhang Wang, Engineer @ Confluent
Kafka Summit APAC
01
Why Replace
Zookeeper?
02
Why Choose
Raft?
03
KRaft vs. Raft
What’s the Difference?
04
The Quorum
Controller
(on top of KRaft)
The (old) Controller
3
Controller
Brokers
Zookeeper
• Single node, elected via ZK
The (old) Controller
4
Controller
Brokers
Zookeeper
• Single node, elected via ZK
• Maintains cluster metadata, persisted
in ZK as source of truth
• Broker membership
• Partition assignment
• Configs, Auth, ACLs, etc
The (old) Controller
5
Controller
Brokers
Zookeeper
• Single node, elected via ZK
• Maintains cluster metadata, persisted
in ZK as source of truth
• Broker membership
• Partition assignment
• Configs, Auth, ACLs, etc
• Handles (most) cluster-level admin
operations
• (Mostly) single-threaded
• Events triggered from ZK watches
• Push-based metadata propagation
• Handling logic often requires more ZK read/write
The (old) Controller: Broker Shutdown
6
Controller
Brokers
Zookeeper
ISR {1, 2, 3}
The (old) Controller: Broker Shutdown
7
Controller
Brokers
Zookeeper
SIG_TERM
ISR {1, 2, 3}
The (old) Controller: Broker Shutdown
8
Controller
Brokers
Zookeeper
ISR {1, 2, 3}
ControlledShutdown
ISR {2, 3}
The (old) Controller: Broker Shutdown
9
Controller
Brokers
Zookeeper
ISR {2, 3}
Write ZK
The (old) Controller: Broker Shutdown
10
Controller
Brokers
Zookeeper
ISR {2, 3}
MetadataUpdate
LeaderAndISR
The (old) Controller: Broker Shutdown
11
Controller
Brokers
Zookeeper
ISR {2, 3}
ControlledShutdown
The (old) Controller: Broker Shutdown
12
Controller
Brokers
Zookeeper
Writes to ZK
are synchronous
Impact: longer
shutdown time
Metadata propagation req/resp
is per-partition
Impact: client timeout
The (old) Controller: Controller Failover
13
Controller
Brokers
Zookeeper
The (old) Controller: Controller Failover
14
Brokers
Zookeeper
The (old) Controller: Controller Failover
15
Brokers
Zookeeper
The (old) Controller: Controller Failover
16
Brokers
Zookeeper
Controller
The (old) Controller: Controller Failover
17
Brokers
Zookeeper
Controller
Read ZK
ISR {1, 2, 3}
The (old) Controller: Controller Failover
18
Brokers
Zookeeper
Controller ISR {1, 2, 3}
ISR {2, 3}
Write ZK
The (old) Controller: Controller Failover
19
Brokers
Zookeeper
Controller ISR {2, 3}
MetadataUpdate
LeaderAndISR
The (old) Controller: Controller Failover
20
Brokers
Zookeeper
Controller
Reads from ZK are
O(num.partitions)
Impact: longer
unavailability window
Controller Scalability Limitations
21
• Controller failover, broker start / shutdown events all require RPCs that
are O(num.partitions)
Controller Scalability Limitations
22
• Controller failover, broker start / shutdown events all require RPCs that
are O(num.partitions)
• Metadata persistence in ZK is synchronous and also O(num.partitions)
Controller Scalability Limitations
23
• Controller failover, broker start / shutdown events all require RPCs that
are O(num.partitions)
• Metadata persistence in ZK is synchronous and also O(num.partitions)
• ZK as the source-of-truth
• Znode Size limit, max num.watchers limit, watcher fire orderings, etc
• Controller’s metadata view is often out of date as all brokers (& admin clients) write to ZK
• Broker’s metadata view can diverge over time due to push-based propagation
How to get to 1K brokers and 1M partitions in a cluster?
What’s really behind the Controller / Zookeeper?
24
• A metadata LOG!
/brokers/topics/foo/partitions/0/state changed
/topics changed
/brokers/ids/0 changed
/config/topics/bar changed
/kafka-acl/group/grp1 changed
…
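The znode changes above are really just entries in an ordered log. A minimal sketch of that idea (the class and record shapes are illustrative, not Kafka's actual metadata record format):

```python
from dataclasses import dataclass, field

@dataclass
class MetadataLog:
    """Metadata changes as an append-only, offset-addressed log."""
    records: list = field(default_factory=list)

    def append(self, record: dict) -> int:
        """Append a metadata change; its offset is its position in the log."""
        self.records.append(record)
        return len(self.records) - 1

log = MetadataLog()
log.append({"type": "PartitionChange", "path": "/brokers/topics/foo/partitions/0/state"})
log.append({"type": "BrokerChange", "path": "/brokers/ids/0"})
offset = log.append({"type": "ConfigChange", "path": "/config/topics/bar"})
# Every change now has a total order and a unique offset, unlike scattered znode watches.
```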
Rethinking the Controller based on the LOG
25
• Why not make the metadata log itself the source of truth?
Rethinking the Controller based on the LOG
26
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller
• Multiple operations in flight
• Async APIs to write to the log
• Controller co-locates with the metadata log, stored as an internal Kafka topic
Rethinking the Controller based on the LOG
27
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller
• Multiple operations in flight
• Async APIs to write to the log
• Controller co-locates with the metadata log, stored as an internal Kafka topic
• Metadata propagation not via RPC, but via log replication
• Brokers locally cache metadata from the log, fixing divergence
• Isolate the control plane from the data path: separate ports, queues, metrics, etc
Rethinking the Controller based on the LOG
28
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller
• Multiple operations in flight
• Async APIs to write to the log
• Controller co-locates with the metadata log, stored as an internal Kafka topic
• Metadata propagation not via RPC, but via log replication
• Brokers locally cache metadata from the log, fixing divergence
• Isolate the control plane from the data path: separate ports, queues, metrics, etc
• A quorum of controllers, not a single controller
• Controller failover to a standby is O(1)
• KRaft protocol favors latency over failure tolerance
Log Replication: Primary-backup
29
Logs
Leader
Logs
Logs
Follower-1 Follower-2
Write
Log Replication: Quorum
30
Logs
Leader
Logs
Logs
Follower-1 Follower-2
Write
Primary-backup vs. Quorum
31
• Kafka replication’s failure model is f+1
Quorum (Paxos, Raft, etc)’s failure model is 2f+1
• Kafka’s replication needs to wait for all followers (in ISR)
Quorum (Paxos, Raft, etc)’s replication waits for majority and has better latency
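The trade-off above can be made concrete with a little arithmetic (a sketch; the function names are ours, not Kafka's):

```python
def isr_min_replicas(f: int) -> int:
    # Kafka ISR replication: f + 1 replicas tolerate f failures,
    # but a write must reach every replica in the ISR.
    return f + 1

def quorum_min_replicas(f: int) -> int:
    # Majority quorum (Raft, Paxos): 2f + 1 replicas tolerate f failures,
    # and a write only needs a majority (f + 1) of acknowledgements,
    # so latency is not gated on the slowest replica.
    return 2 * f + 1

# Tolerating 1 failure: ISR needs 2 replicas, a quorum needs 3.
assert isr_min_replicas(1) == 2
assert quorum_min_replicas(1) == 3
```

For the same failure tolerance, quorums need more replicas but commit faster, which is the latency-over-failure-tolerance choice KRaft makes for metadata.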
KRaft: Kafka’s Raft Implementation
32
• Piggy-back on Kafka’s log replication utilities
• Record schema versioning
• Batching and compression
• Log recovery, segmentation, indexing, checksumming
• NIO network transfer, leader epoch caching, tooling
KRaft: Kafka’s Raft Implementation
33
• Piggy-back on Kafka’s log replication utilities
• Record schema versioning
• Batching and compression
• Log recovery, segmentation, indexing, checksumming
• NIO network transfer, leader epoch caching, tooling
• Non-ZK based leader election protocol
• No “split-brain” with multiple leaders
• No “grid-locking” with no leaders being elected
KRaft: Kafka’s Raft Implementation
34
• Leader election to allow only one leader at each epoch
Logs
Voter-1
Logs
Logs
Voter-2 Voter-3
KRaft: Kafka’s Raft Implementation
35
• Leader election to allow only one leader at each epoch
Logs
Candidate-1
Logs
Logs
Voter-2 Voter-3
Vote for Me (epoch=3, end=6)
Voter-1
Yes!
KRaft: Kafka’s Raft Implementation
36
• Leader election to allow only one leader at each epoch
Logs
Candidate-1
Logs
Logs
Voter-2 Voter-3
Leader-1
Begin Epoch (3)
KRaft: Kafka’s Raft Implementation
37
• Leader election to allow only one leader at each epoch
Logs
Candidate-1
Logs
Logs
Voter-2 Voter-3
Vote for Me (epoch=2, end=3)
No..
No..
KRaft: Kafka’s Raft Implementation
38
• Leader election to allow only one leader at each epoch
• A follower becomes a candidate after a timeout and asks others for votes
• Voters give only one vote per epoch: the vote decision is persisted locally
• A voter only grants its vote if its own log is not “longer” than the candidate’s
• Simple majority wins; backoff avoids gridlocked elections
• As a result, elected leaders must have all committed records up to their epoch
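The vote-granting rules above can be sketched as follows (simplified and hypothetical: real KRaft compares the epoch and offset of the last log record, and these names are not Kafka's actual API):

```python
def grant_vote(voter_state: dict, candidate_id: str,
               candidate_epoch: int, candidate_log_end: int) -> bool:
    """voter_state holds 'epoch', 'voted_for', and 'log_end' (illustrative)."""
    if candidate_epoch < voter_state["epoch"]:
        return False  # stale candidate epoch: reject
    if (candidate_epoch == voter_state["epoch"]
            and voter_state["voted_for"] not in (None, candidate_id)):
        return False  # only one vote per epoch
    if voter_state["log_end"] > candidate_log_end:
        return False  # voter's own log is "longer" than the candidate's
    # Persist the decision before replying (sketched here as a dict update).
    voter_state.update(epoch=candidate_epoch, voted_for=candidate_id)
    return True

voter = {"epoch": 2, "voted_for": None, "log_end": 5}
grant_vote(voter, "candidate-1", 3, 6)   # granted: higher epoch, longer log
grant_vote(voter, "candidate-2", 2, 10)  # rejected: epoch 2 is now stale
```

Because a voter rejects any candidate with a shorter log and a majority is required, the elected leader must hold all committed records, matching the last bullet above.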
KRaft: Kafka’s Raft Implementation
39
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this is done via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
Fetch (epoch=3, end=7)
KRaft: Kafka’s Raft Implementation
40
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this is done via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
OK (data)
KRaft: Kafka’s Raft Implementation
41
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this is done via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
Fetch (epoch=2, end=8)
KRaft: Kafka’s Raft Implementation
42
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this is done via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
No.. (epoch=2, end=6)
KRaft: Kafka’s Raft Implementation
43
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this is done via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
Fetch (epoch=2, end=6)
KRaft: Kafka’s Raft Implementation
44
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this is done via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
Logs
Leader-1
Logs
Logs
Voter-2 Voter-3
OK (data)
KRaft: Kafka’s Raft Implementation
45
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this is done via PushRecords)
• Log reconciliation happens at follower fetch (similar to the Raft literature)
• Pros: fewer round-trips for log reconciliation, no “disruptive servers”
• Cons: replication commit requires the follower’s next fetch
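The fetch exchanges walked through on the preceding slides can be sketched from the leader's side (illustrative, not Kafka's real Fetch API; the log is a list of `(epoch, value)` records and epochs are assumed non-decreasing):

```python
def handle_fetch(leader_log, follower_epoch, follower_end):
    """Leader-side sketch: validate the follower's (epoch, end-offset) position.

    If the follower's position diverges from the leader's log, reply with
    the leader's end offset for that epoch so the follower can truncate;
    otherwise serve the records after the follower's end offset.
    """
    diverged = follower_end > len(leader_log) or (
        follower_end > 0 and leader_log[follower_end - 1][0] != follower_epoch
    )
    if diverged:
        end_for_epoch = sum(1 for e, _ in leader_log if e <= follower_epoch)
        return ("diverging", follower_epoch, end_for_epoch)
    return ("ok", leader_log[follower_end:])

# Leader at epoch 3 with 7 records; epoch-2 portion ends at offset 6.
log = [(1, "a"), (2, "b"), (2, "c"), (2, "d"), (2, "e"), (2, "f"), (3, "g")]
handle_fetch(log, 2, 8)  # follower ahead on a stale epoch: told to truncate to 6
handle_fetch(log, 2, 6)  # after truncating, the follower receives the epoch-3 record
```

Reconciliation happens in the fetch round-trip itself, which is why no extra AppendEntries-style back-off probing is needed.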
KRaft: Kafka’s Raft Implementation
46
• More details
• Controller module isolation: separate ports and queues, separate node ID space
• Snapshots: a consistent view of the full state; new brokers can fetch snapshots
• Delta records: think “PartitionChangeRecord” vs. “PartitionRecord”
• State Machine API: trigger upon commit, used for metadata caching
• And more..
[KIP-500, KIP-595, KIP-630, KIP-640]
Quorum Controller on top of KRaft Logs
47
Quorum
Observers
Leader
Voter
Voter
Metadata Log
• Controller runs in a broker JVM or standalone
• Single-node Kafka cluster is possible
• Controller operations can be pipelined
• Brokers cache metadata read from the log
• Metadata is naturally “versioned” by log offset
• Divergence / corner cases resolvable
• Broker liveness checked via heartbeats
• Controlled shutdown piggy-backed on heartbeat req/resp
• Simpler implementation and fewer timeout risks
• Fence brokers during startup until they are ready
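The heartbeat-based liveness and fencing described above might look roughly like this (a sketch; the class, method names, and timeout value are all illustrative, not Kafka's implementation):

```python
class BrokerLiveness:
    """Quorum-controller-style broker liveness tracking (sketch)."""

    def __init__(self, session_timeout_s: float = 9.0):
        self.session_timeout_s = session_timeout_s
        self.last_heartbeat = {}   # broker_id -> timestamp of last heartbeat
        self.fenced = set()        # brokers that must not serve traffic

    def register(self, broker_id: str, now: float) -> None:
        # New brokers start fenced until they report ready via heartbeat.
        self.fenced.add(broker_id)
        self.last_heartbeat[broker_id] = now

    def heartbeat(self, broker_id: str, now: float, ready: bool = True) -> None:
        self.last_heartbeat[broker_id] = now
        if ready:
            self.fenced.discard(broker_id)

    def expire(self, now: float) -> None:
        # Fence any broker whose heartbeat session has timed out.
        for broker_id, ts in self.last_heartbeat.items():
            if now - ts > self.session_timeout_s:
                self.fenced.add(broker_id)
```

A controlled shutdown request would simply piggy-back on the same heartbeat request/response, instead of needing a separate per-partition RPC exchange.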
Quorum Controller: Broker Shutdown
48
Quorum
Observers
Leader
Voter
Voter
Metadata Log
Batch append partition / broker changes
Quorum Controller: Failover
49
Quorum
Observers
Leader
Voter
Voter
Metadata Log
Newly Elected Leader already
has committed metadata
Leader
Quorum Controller Scalability
50
Recap and Roadmap
51
• KRaft: the source-of-truth metadata log for all Kafka data logs!
• Co-locate metadata storage with processing
• Better replication latency with pull-based leader election and log reconciliation
• Quorum Controller: built on top of KRaft
• Strictly ordered metadata with single writer
• Fast controller failover and broker restarts
• Roadmap
• Early access in the 2.8 release: KRaft replication, state snapshot
• Ongoing work on missing features: quorum reconfiguration, security, JBOD, EOS, etc.
• ZK mode will first be deprecated in a coming bridge release and then removed in a future release
52
Thank you!
Guozhang Wang
guozhang@confluent.io in/guozhangwang @guozhangwang
cnfl.io/meetups cnfl.io/slack
cnfl.io/blog