At its core, Kafka organizes data as immutable append-only logs, and it has relied on an external consensus service (namely ZooKeeper) to manage the metadata of these logs, such as topic-level configs, leader replicas and ISR information, and received admin requests. In this talk, I will discuss a recent core initiative that migrates the management of this metadata from the external service into Kafka itself, as its own special logs. More specifically, I will cover the following:
1. Why we believe an internal consensus protocol provides Kafka more benefit than an external consensus service.
3. Why we chose to build this internal "metadata log" on the Raft protocol, instead of Kafka's current leader-follower replication mechanism.
4. The key design decisions we made in its implementation, and how it differs from the standard Raft algorithm (KIP-595).
4. How this Raft-based metadata log is leveraged by the new Quorum Controller (KIP-500).
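As a rough illustration of points 2-4, the following toy model (a minimal sketch with hypothetical names, not KIP-595's actual implementation) combines the two ingredients the talk builds on: pull-based log replication in the style of Kafka's fetch protocol, and Raft-style leader epochs with majority-based commit:

```python
from dataclasses import dataclass

@dataclass
class Record:
    epoch: int    # leader epoch (Raft's "term") the record was written under
    payload: str

class Replica:
    def __init__(self):
        self.log = []

    def fetch_from(self, leader, epoch):
        # Pull-based replication, in the style of Kafka's fetch protocol:
        # the follower asks for everything past its own log end offset.
        if epoch != leader.epoch:
            raise ValueError("stale leader epoch: find the new leader first")
        self.log.extend(leader.log[len(self.log):])

class Leader(Replica):
    def __init__(self, epoch):
        super().__init__()
        self.epoch = epoch

    def append(self, payload):
        self.log.append(Record(self.epoch, payload))

    def high_watermark(self, followers):
        # A metadata record is committed once a majority of the quorum
        # (leader included) has replicated it.
        sizes = sorted([len(self.log)] + [len(f.log) for f in followers],
                       reverse=True)
        majority = len(sizes) // 2 + 1
        return sizes[majority - 1]

leader = Leader(epoch=7)
f1, f2 = Replica(), Replica()
leader.append("register-broker-0")
leader.append("create-topic-A")
f1.fetch_from(leader, epoch=7)               # f1 catches up; f2 still lags
assert leader.high_watermark([f1, f2]) == 2  # 2 of 3 replicas hold both records
```

Note the contrast with Kafka's ISR-based data replication: commit here depends on a majority of the fixed quorum, not on a dynamically shrinking in-sync set.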
Consensus in Apache Kafka: From Theory to Production.pdf (Guozhang Wang)
In this talk I'd like to cover an everlasting story in distributed systems: consensus. More specifically, the consensus challenges in Apache Kafka, and how we addressed them, starting from theory in papers to production in the cloud.
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa... (Guozhang Wang)
We present Apache Kafka’s core design for stream processing, which relies on its persistent log architecture as the storage and inter-processor communication layers to achieve correctness guarantees. Kafka Streams, a scalable stream processing client library in Apache Kafka, defines the processing logic as read process-write cycles in which all processing state updates and result outputs are captured as log appends. Idempotent and transactional write protocols are utilized to guarantee exactly once semantics. Furthermore, revision-based speculative processing is employed to emit results as soon as possible while handling out-of-order data. We also demonstrate how Kafka Streams behaves in practice with large-scale deployments and performance insights exhibiting its flexible and low-overhead trade-offs.
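The read-process-write cycle described above can be illustrated with a toy word count, using in-memory lists as stand-ins for Kafka topics (a heavily simplified sketch of the changelog and state-restore mechanics, not Kafka Streams code):

```python
# In-memory stand-ins for Kafka topics (hypothetical; real Kafka Streams
# reads from and writes to broker-hosted topics).
input_log = ["a", "b", "a"]
changelog = []    # every state update captured as a log append
output_log = []   # every result output captured as a log append
state = {}

for record in input_log:                # read
    count = state.get(record, 0) + 1    # process: running count per key
    state[record] = count
    changelog.append((record, count))   # write: state update as a log append
    output_log.append((record, count))  # write: result as a log append

assert output_log == [("a", 1), ("b", 1), ("a", 2)]

# Recovery: local state can be rebuilt by replaying the changelog from the start.
restored = {}
for key, count in changelog:
    restored[key] = count
assert restored == state
```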
Exactly-Once Made Easy: Transactional Messaging Improvement for Usability and... (Guozhang Wang)
1) The document discusses Kafka transactions and exactly-once processing in Kafka.
2) It describes the current approach Kafka uses to achieve exactly-once semantics, including idempotent writes within a partition and transactional writes across partitions.
3) It also discusses challenges with the current approach, such as lack of scalability due to the need to create a producer for each input partition, and proposes solutions in KIP-447 to address these challenges.
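The idempotent-write idea in point 2 can be sketched as broker-side deduplication keyed on a producer id and a per-producer sequence number (a simplified toy model, not Kafka's actual broker code):

```python
class PartitionLog:
    """Toy broker-side dedup: accept a batch only if its sequence number is
    the next one expected from that producer id (the idempotent-producer idea)."""
    def __init__(self):
        self.records = []
        self.next_seq = {}   # producer_id -> next expected sequence number

    def append(self, producer_id, seq, payload):
        expected = self.next_seq.get(producer_id, 0)
        if seq < expected:
            return False          # duplicate (e.g. a retried batch): dropped
        if seq > expected:
            raise RuntimeError("sequence gap: out-of-order batch")
        self.records.append(payload)
        self.next_seq[producer_id] = expected + 1
        return True

log = PartitionLog()
assert log.append("p1", 0, "x") is True
assert log.append("p1", 1, "y") is True
assert log.append("p1", 1, "y") is False   # retry of seq 1 is deduplicated
assert log.records == ["x", "y"]
```

This is what makes producer retries safe within a partition; transactions then extend the guarantee across partitions.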
Introduction to the Incremental Cooperative Protocol of Kafka (Guozhang Wang)
Anyone who has used Kafka consumer groups or operated a Kafka Streams application is likely familiar with the rebalancing protocol, which is used to (re)distribute partitions among the consumers of a group whenever there is a change in membership or in the topics subscribed to. The current protocol takes the safest possible approach of pausing all work and revoking ownership of all partitions so that a new assignment can be made. This “stop-the-world” approach can be frustrating especially when the mapping of partitions to the consumer that owns them barely changes. In KIP-429 we introduce incremental cooperative rebalancing for the consumer client, a new rebalancing protocol that allows consumers to retain ownership and continue fetching for their owned partitions while a rebalance is in progress. This proposal trades extra rebalances for the ability to revoke only those partitions which are to be migrated to another consumer for overall workload balance.
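The core idea of the incremental cooperative protocol, computing a new assignment but revoking only the partitions that actually move, can be sketched as a simple diff (a hypothetical helper, not the real consumer-client code):

```python
def cooperative_rebalance(old_assignment, new_assignment):
    """Return, per consumer, only the partitions it must revoke; every other
    owned partition keeps being fetched while the rebalance is in progress."""
    revoked = {}
    for consumer, old_parts in old_assignment.items():
        keep = new_assignment.get(consumer, set())
        revoked[consumer] = old_parts - keep
    return revoked

old = {"c1": {0, 1, 2, 3}, "c2": set()}   # c2 has just joined the group
new = {"c1": {0, 1}, "c2": {2, 3}}        # target assignment for even load
assert cooperative_rebalance(old, new) == {"c1": {2, 3}, "c2": set()}
# c1 never stops fetching partitions 0 and 1; only 2 and 3 pause to migrate.
```

Under the old eager protocol, c1 would instead have revoked all four partitions before the new assignment was computed.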
Performance Analysis and Optimizations for Kafka Streams Applications (Guozhang Wang)
High-speed, low-footprint data stream processing is in high demand for Kafka Streams applications. However, writing an efficient streaming application with the Streams DSL requires some deep knowledge of Kafka Streams internals, and many users have asked how to do so. In this talk, I will discuss how to analyze your Kafka Streams applications, pinpoint performance bottlenecks and unnecessary storage costs, and optimize your application code accordingly using the Streams DSL.
In addition, I will talk about the new optimization framework that we have been developing inside Kafka Streams since the 2.1 release, which replaces the in-place translation of the Streams DSL with a comprehensive process composed of stream topology compilation and rewriting phases, focusing on reducing the various storage footprints of Streams applications, such as state stores and internal topics.
Exactly-once Stream Processing with Kafka Streams (Guozhang Wang)
I will present the recent additions to Kafka to achieve exactly-once semantics (0.11.0) within its Streams API for stream processing use cases. This is achieved by leveraging the underlying idempotent and transactional client features. The main focus will be the specific semantics that Kafka distributed transactions enable in Streams and the underlying mechanics to let Streams scale efficiently.
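One piece of the underlying mechanics, how a read_committed consumer filters out records from aborted transactions using control markers, can be sketched as follows (a toy two-pass model; the real consumer buffers records until the marker arrives):

```python
# Toy partition contents: data records tagged with a transactional id,
# followed by a COMMIT or ABORT control marker for that transaction.
log = [
    ("txn1", "data", "a"),
    ("txn2", "data", "b"),
    ("txn1", "marker", "COMMIT"),
    ("txn2", "marker", "ABORT"),
]

def read_committed(log):
    # First pass: find each transaction's outcome from its control marker.
    outcomes = {txn: val for txn, kind, val in log if kind == "marker"}
    # Second pass: deliver only records whose transaction committed.
    return [val for txn, kind, val in log
            if kind == "data" and outcomes.get(txn) == "COMMIT"]

assert read_committed(log) == ["a"]   # txn2's record is filtered out
```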
Apache Kafka, and the Rise of Stream Processing (Guozhang Wang)
For a long time, a substantial portion of the data processing that companies did ran as big batch jobs. But businesses operate in real time, and the software they run is catching up. Today, processing data in a streaming fashion is becoming more and more popular in many companies, displacing the more "traditional" way of batch-processing big data sets available as a whole.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming (Guozhang Wang)
Spark Streaming makes it easy to build scalable, robust stream processing applications, but only once you've made your data accessible to the framework. Spark Streaming solves the realtime data processing problem, but to build a large-scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka (Guozhang Wang)
To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center, powered by Apache Kafka. But what if one data center is not enough? I will describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence, and provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication, and mirroring as well as disaster scenarios and failure handling.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
Building a Replicated Logging System with Apache Kafka (Guozhang Wang)
Apache Kafka is a scalable publish-subscribe messaging system with a distributed commit log as its core architecture. It was originally built at LinkedIn as its centralized event pipelining platform for online data integration tasks. Over the past years of developing and operating Kafka, we have extended its log-structured architecture as a replicated logging backbone for much wider application scopes in the distributed environment. I am going to talk about our design and engineering experience in replicating Kafka logs for various distributed data-driven systems, including source-of-truth data storage and stream processing.
The document discusses Apache Kafka, a distributed publish-subscribe messaging system developed at LinkedIn. It describes how LinkedIn uses Kafka to integrate large amounts of user activity and other data across its products. Key aspects of Kafka's design allow it to scale to LinkedIn's high throughput requirements, including using a log structure and data partitioning for parallelism. LinkedIn relies on Kafka to transport over 500 billion messages per day between systems and for real-time analytics.
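The key-based partitioning for parallelism mentioned above can be sketched in a few lines (note: Kafka's default partitioner actually hashes keys with murmur2; CRC32 here is just a stand-in):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Keyed records are spread across partitions by hashing the key, so the
    # partitions can be produced to and consumed from in parallel, while all
    # records with the same key stay ordered within a single partition.
    return zlib.crc32(key) % num_partitions

p = partition_for(b"user-42", 6)
assert 0 <= p < 6
assert partition_for(b"user-42", 6) == p   # same key, same partition, every time
```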
This document discusses behavioral simulations in MapReduce. It introduces behavioral simulations as simulations of individuals that interact to create emerging behavior in complex systems, such as traffic, ecology, and sociology systems. It then discusses the challenges of scaling behavioral simulations to large data sizes. The document proposes a new simulation platform that combines ease of programming through a state-effect programming pattern and scripting language called BRASIL with scalability through executing simulations in the MapReduce model using a special-purpose MapReduce engine called BRACE. Key aspects of BRACE include spatial partitioning of the simulation space and optimizations to minimize communication between partitions.
Automatic Scaling Iterative Computations (Guozhang Wang)
This document discusses iterative graph computations and limitations of MapReduce for such computations. It proposes GRACE, a graph processing framework that separates the vertex-centric computation logic from execution policies to allow both synchronous and asynchronous execution. As an example, it shows how belief propagation can be implemented in a vertex-centric manner and executed asynchronously using GRACE. This provides easier programming while enabling performance benefits of asynchronous execution.
4-5. The (old) Controller
[Diagram: a single Controller node, the cluster's Brokers, and Zookeeper]
• Single node, elected via ZK
• Maintains cluster metadata, persisted in ZK as source of truth
• Broker membership
• Partition assignment
• Configs, Auth, ACLs, etc.
• Handles (most) cluster-level admin operations
• (Mostly) single-threaded
• Events triggered from ZK watches
• Push-based metadata propagation
• Handling logic often requires more ZK reads/writes
22-23. Controller Scalability Limitations
• Controller failover and broker start / shutdown events all require RPCs that are O(num.partitions)
• Metadata persistence in ZK is synchronous and also O(num.partitions)
• ZK as the source-of-truth
• Znode size limit, max num.watchers limit, watcher fire orderings, etc.
• Controller's metadata view is often out of date, as all brokers (& admin clients) write to ZK
• Broker's metadata view can diverge over time due to push-based propagation
How to get to 1K brokers and 1M partitions in a cluster?
25. Rethinking the Controller based on the LOG
• Why not make the metadata log itself the source of truth?
• Single writer, a.k.a. the controller
  • Multiple operations in flight
  • Async APIs to write to the log
• Controller co-located with the metadata log, stored as an internal Kafka topic
• Metadata propagation via log replication, not RPCs
  • Brokers locally cache metadata from the log and can fix divergence
  • Isolate the control plane from the data path: separate ports, queues, metrics, etc.
• A quorum of controllers, not a single controller
  • Controller failover to a standby is O(1)
  • KRaft protocol favors latency over failure tolerance
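The broker-side caching above is essentially a replicated state machine: each broker applies committed records from the metadata log in offset order, and the last applied offset serves as the cache's "version". A minimal sketch in Java (class, method, and record names here are hypothetical simplifications, not Kafka's actual classes):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a broker-side metadata cache that applies committed
// records from the metadata log in offset order (not Kafka's real classes).
class MetadataCache {
    private final Map<String, Integer> partitionLeaders = new HashMap<>();
    private long highestAppliedOffset = -1L;

    // Apply a committed record; the log offset gives the cache a natural version.
    void apply(long offset, String topicPartition, int leaderId) {
        if (offset <= highestAppliedOffset) {
            return; // already applied; replays after a restart are idempotent
        }
        partitionLeaders.put(topicPartition, leaderId);
        highestAppliedOffset = offset;
    }

    long version() { return highestAppliedOffset; }

    Integer leaderOf(String topicPartition) {
        return partitionLeaders.get(topicPartition);
    }
}
```

Because every broker applies the same committed records in the same order, any divergence is bounded by replication lag and fixes itself as the fetch catches up.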
31. Primary-backup vs. Quorum
• Kafka replication tolerates f failures with f+1 replicas;
  quorum protocols (Paxos, Raft, etc.) need 2f+1 replicas to tolerate f failures
• Kafka’s replication must wait for all followers in the ISR;
  quorum replication waits only for a majority, and thus has better latency
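The trade-off above can be made concrete with a little arithmetic (the class and method names below are purely illustrative, not Kafka APIs):

```java
// Illustrative arithmetic for the two failure models discussed above.
class FailureModels {
    // Primary-backup (Kafka ISR replication): f+1 replicas tolerate f failures,
    // but every in-sync replica must acknowledge each write.
    static int primaryBackupReplicas(int f) { return f + 1; }

    // Quorum (Paxos, Raft, etc.): 2f+1 replicas tolerate f failures,
    // and only a majority must acknowledge each write.
    static int quorumReplicas(int f) { return 2 * f + 1; }

    // Majority of a 2f+1 quorum: f+1 acknowledgements per write.
    static int quorumAcksPerWrite(int f) { return f + 1; }
}
```

For example, tolerating two failures costs three replicas under primary-backup but five under a quorum; in exchange, a quorum write waits on only three of the five, so a single slow replica no longer sits on the commit path.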
33. KRaft: Kafka’s Raft Implementation
• Piggy-backs on Kafka’s log replication utilities
  • Record schema versioning
  • Batching and compression
  • Log recovery, segmentation, indexing, checksumming
  • NIO network transfer, leader epoch caching, tooling
• Non-ZK-based leader election protocol
  • No “split brain” with multiple leaders
  • No “gridlock” with no leader being elected
34. KRaft: Kafka’s Raft Implementation
• Leader election allows only one leader per epoch
[Diagram sequence: three voters, each with its own log.
 1) Candidate-1 sends “Vote for Me (epoch=3, end=6)” to Voter-2 and Voter-3; Voter-2 replies “Yes!”
 2) Having won a majority, Leader-1 sends “Begin Epoch (3)” to Voter-2 and Voter-3.
 3) A candidate with a stale log sends “Vote for Me (epoch=2, end=3)” and both voters reply “No..”]
38. KRaft: Kafka’s Raft Implementation
• Leader election allows only one leader per epoch
  • A follower becomes a candidate after a timeout and asks the others for votes
  • Voters give only one vote per epoch; the vote decision is persisted locally
  • Voters grant a vote only if their own log is not “longer” than the candidate’s
  • Simple majority wins; candidates back off to avoid gridlocked elections
  • As a result, elected leaders must have all committed records up to their epoch
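The vote-granting rules above can be sketched as follows (an illustrative simplification, not Kafka's actual RaftClient code; `maybeGrantVote` and its parameters are hypothetical names):

```java
// Illustrative sketch of a voter's decision rules: at most one vote per epoch,
// and only for candidates whose log is at least as complete as the voter's own.
class VoterState {
    private int lastVotedEpoch = -1;   // persisted locally in the real protocol
    private final int logEndEpoch;     // epoch of this voter's last record
    private final long logEndOffset;   // this voter's log end offset

    VoterState(int logEndEpoch, long logEndOffset) {
        this.logEndEpoch = logEndEpoch;
        this.logEndOffset = logEndOffset;
    }

    boolean maybeGrantVote(int candidateEpoch,
                           int candidateLogEndEpoch,
                           long candidateLogEndOffset) {
        if (candidateEpoch <= lastVotedEpoch) {
            return false; // only one vote per epoch
        }
        // Reject if our own log is "longer" than the candidate's.
        boolean candidateUpToDate =
            candidateLogEndEpoch > logEndEpoch
                || (candidateLogEndEpoch == logEndEpoch
                        && candidateLogEndOffset >= logEndOffset);
        if (!candidateUpToDate) {
            return false;
        }
        lastVotedEpoch = candidateEpoch; // persisted before replying in practice
        return true;
    }
}
```

Because a majority of voters enforce the "not longer than mine" check, any winning candidate's log must contain every record that was committed on a majority, which is exactly the property the last bullet states.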
39. KRaft: Kafka’s Raft Implementation
• Pull-based replication (like Kafka) instead of the push-based approach in the literature
• Needs a specific API to begin a new epoch (in the literature this happens via PushRecords)
• Log reconciliation happens on follower fetch (similar to the Raft literature)
[Diagram sequence: Leader-1 serving fetches from Voter-2 and Voter-3.
 1) A caught-up follower sends “Fetch (epoch=3, end=7)” and receives “OK (data)”.
 2) A diverged follower sends “Fetch (epoch=2, end=8)”; the leader replies “No.. (epoch=2, end=6)”.
 3) The follower truncates, retries with “Fetch (epoch=2, end=6)”, and receives “OK (data)”.]
• Pros: fewer round trips for log reconciliation, no “disruptive servers”
• Cons: a replication commit requires the follower’s next fetch
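A rough sketch of the leader-side divergence check behind this exchange, assuming a simplified log that tracks the epoch of each record (all names here are hypothetical, not Kafka's actual Fetch API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of how a leader handles a pull-based Fetch(epoch, end),
// mirroring the diagram sequence above (not Kafka's real Fetch protocol code).
class LeaderLog {
    // epochAtOffset.get(i) = epoch of the record at offset i.
    private final List<Integer> epochAtOffset = new ArrayList<>();

    void append(int epoch) { epochAtOffset.add(epoch); }

    // Returns the offset the follower must truncate to and refetch from,
    // or -1 if (fetchEpoch, fetchOffset) is consistent and data can be served.
    long handleFetch(int fetchEpoch, long fetchOffset) {
        // The leader's end offset for the follower's epoch: the first offset
        // whose record carries a larger epoch (or the log end if none does).
        long endOfFetchEpoch = epochAtOffset.size();
        for (int i = 0; i < epochAtOffset.size(); i++) {
            if (epochAtOffset.get(i) > fetchEpoch) {
                endOfFetchEpoch = i;
                break;
            }
        }
        if (fetchOffset > endOfFetchEpoch) {
            return endOfFetchEpoch; // diverged: tell the follower where to truncate
        }
        return -1; // consistent: serve records starting at fetchOffset
    }
}
```

Because the fetch already carries the follower's epoch and end offset, one round trip both detects the divergence and tells the follower exactly where to truncate, which is the "fewer round trips" advantage listed above.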
46. KRaft: Kafka’s Raft Implementation
• More details
  • Controller module isolation: separate ports and queues, separate node ID space
  • Snapshots: a consistent view of the full state; new brokers can fetch snapshots
  • Delta records: think “PartitionChangeRecord” vs. “PartitionRecord”
  • State machine API: triggered upon commit, used for metadata caching
  • And more..
[KIP-500, KIP-595, KIP-630, KIP-640]
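The delta-record idea can be illustrated with simplified stand-ins for the two record types named above (these classes are hypothetical sketches, not the actual Kafka metadata record schemas):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative contrast between a full record that rewrites the whole
// partition state and a delta record that carries only the changed fields.
class PartitionState {
    int leader;
    List<Integer> isr;

    PartitionState(int leader, List<Integer> isr) {
        this.leader = leader;
        this.isr = new ArrayList<>(isr);
    }

    // Full record (PartitionRecord-style): replaces the entire state.
    void applyFullRecord(int leader, List<Integer> isr) {
        this.leader = leader;
        this.isr = new ArrayList<>(isr);
    }

    // Delta record (PartitionChangeRecord-style): null means "unchanged",
    // so routine changes append far fewer bytes to the metadata log.
    void applyChangeRecord(Integer newLeader, List<Integer> newIsr) {
        if (newLeader != null) this.leader = newLeader;
        if (newIsr != null) this.isr = new ArrayList<>(newIsr);
    }
}
```

A leader-only change thus costs one small delta append instead of re-serializing every field of the partition, which matters when a cluster holds a million partitions.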
47. Quorum Controller on top of KRaft Logs
[Diagram: a quorum of controllers (one Leader, two Voters) replicating the Metadata Log, with brokers as Observers]
• Controllers run in a broker JVM or standalone
  • A single-node Kafka cluster is possible
• Controller operations can be pipelined
• Brokers cache metadata read from the log
  • Metadata is naturally “versioned” by log offset
  • Divergence / corner cases are resolvable
• Broker liveness is checked via heartbeats
  • Controlled shutdown is piggy-backed on heartbeat requests / responses
  • Simpler implementation and fewer timeout risks
  • Brokers are fenced during startup until they are ready
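The heartbeat-based liveness and fencing described above might look roughly like this (an illustrative sketch with hypothetical state and method names, not Kafka's actual broker heartbeat implementation):

```java
// Illustrative sketch of a controller-side broker session: heartbeats keep the
// session alive, unfence the broker once it is ready, and carry shutdown intent.
class BrokerSession {
    enum State { FENCED, ACTIVE, SHUTTING_DOWN }

    private State state = State.FENCED; // fenced until caught up on metadata
    private long lastHeartbeatMs;
    private final long sessionTimeoutMs;

    BrokerSession(long nowMs, long sessionTimeoutMs) {
        this.lastHeartbeatMs = nowMs;
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Controlled shutdown is piggy-backed on the heartbeat: the broker reports
    // whether it wants to shut down, and the controller replies with its state.
    State onHeartbeat(long nowMs, boolean caughtUp, boolean wantsShutdown) {
        lastHeartbeatMs = nowMs;
        if (wantsShutdown) {
            state = State.SHUTTING_DOWN;
        } else if (state == State.FENCED && caughtUp) {
            state = State.ACTIVE; // unfence once the broker is ready to serve
        }
        return state;
    }

    boolean isAlive(long nowMs) {
        return nowMs - lastHeartbeatMs <= sessionTimeoutMs;
    }
}
```

Folding shutdown intent into the regular heartbeat avoids a separate RPC with its own timeout, which is the "less timeout risk" point on the slide.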
48. Quorum Controller: Broker Shutdown
[Diagram: the controller quorum (Leader and Voters) batch-appends partition and broker change records to the Metadata Log, which broker Observers replicate]
51. Recap and Roadmap
• KRaft: the source-of-truth metadata log for all Kafka data logs!
  • Co-locates metadata storage with processing
  • Better replication latency with pull-based leader election and log reconciliation
• Quorum Controller: built on top of KRaft
  • Strictly ordered metadata with a single writer
  • Fast controller failover and broker restarts
• Roadmap
  • Early access in the 2.8 release: KRaft replication, state snapshots
  • Ongoing work on missing features: quorum reconfiguration, security, JBOD, EOS, etc.
  • ZK mode will first be deprecated in an upcoming bridge release and then removed in a future release