The story of Kafka began at LinkedIn, where the engineering team was tasked with redefining the company's infrastructure. Breaking monolithic applications down into microservices allowed LinkedIn to scale search, profiles, communications, and other services efficiently. However, these services then needed a way to share data with one another.

Data sources:
1. User activity: page views, ad impressions, etc.
2. Server log metrics & monitoring data
3. Computationally derived data from downstream systems

Data-driven products:
1. Recommendation engine: connections, endorsements
2. Profile stats: how many searches did you appear in this week? Who viewed your profile?
3. Visualizations: graphs showing increases or dips in profile views
4. Infrastructure monitoring: restarts, upgrades, utilization, etc.

Challenges: varied systems/applications, volume & durability.
What did this unified log/data pipeline model bring?
- Decoupled producers & consumers
- Simplified addition of new producers or consumers
- A single source of truth for both real-time and batch processing applications
- Distributed scaling for varying volume demands
Traditional messaging systems follow a push-based mechanism, pushing messages to consumers. If the push rate is faster than the consumer's processing speed, consumers may become overwhelmed, forcing a backpressure protocol. Such systems also do not offer a centralized data pipeline that serves real-time and batch processing consumers uniformly. RabbitMQ uses a push model and prevents overwhelming consumers via a consumer-configured prefetch limit. This is great for low-latency messaging and works well with RabbitMQ's queue-based architecture. Kafka, on the other hand, uses a pull model where consumers request batches of messages from a given offset. To avoid tight loops when no messages exist beyond the current offset, Kafka allows for long polling. A pull model makes sense for Kafka because of its partitions: since Kafka guarantees message order within a partition with no competing consumers, it can batch messages for more efficient delivery and higher throughput.
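Below is a minimal sketch of such a pull loop using the Java client. The broker address, group id, and topic name are placeholders, and fetch.min.bytes / fetch.max.wait.ms are the consumer settings that drive the broker-side long poll: the broker holds a fetch request open until enough data accumulates or the wait time expires.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "demo-group");               // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("fetch.min.bytes", "1024");   // broker waits for at least this much data...
        props.put("fetch.max.wait.ms", "500");  // ...or until this much time has passed

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));  // hypothetical topic
            while (true) {
                // poll() requests a batch of records past the consumer's current offset
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```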
In these traditional systems, messages are transient at the exchange/routing layer, which assumes consumers are available; messages are also removed once the consumer has consumed them.
Distributed: a distributed system, in its simplest definition, is a group of computers working together so as to appear as a single computer to the end user. These machines share state, operate concurrently, and can fail independently without affecting the whole system's uptime.
To achieve high throughput, Kafka implements a three-tiered architecture and can scale out any of the tiers: in the middle sits the Kafka cluster, which runs one or more brokers; the producer tier publishes data into the cluster; and the consumer tier consumes data from the cluster.
Topics in Kafka are logical categories to which messages/records are published and from which they are consumed. A topic can be split into multiple topic partitions, and a topic partition is the unit that is distributed across the brokers.
A message, also called a record, is the unit of data within Kafka. As far as Kafka is concerned, a message is simply an array of bytes, so the data contained within it has no specific format or meaning to Kafka. A message can have an optional key, which provides the option of controlled distribution to partitions: the hash of the key is computed, and the hash modulo the number of partitions for the topic selects the target partition.
Example: assuming there are 5 partitions and the key hashes compute to 0, 2, 4, 7, 9, and 10:

0 % 5 = 0
2 % 5 = 2
4 % 5 = 4
7 % 5 = 2
9 % 5 = 4
10 % 5 = 0
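A simplified sketch of that hash-modulo selection is shown below. Kafka's real default partitioner hashes the serialized key with murmur2; plain Java array hashing stands in for it here purely to illustrate the idea.

```java
import java.util.Arrays;

class KeyPartitioner {
    // Map a serialized key onto one of numPartitions partitions.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        int hash = Arrays.hashCode(keyBytes);        // stand-in for murmur2
        return (hash & 0x7fffffff) % numPartitions;  // clear the sign bit so the result is non-negative
    }
}
```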
A topic is a named stream of records/messages. Topics are stored in commit logs as partition segments. Topic partitions are the units of parallelism. Record order is guaranteed only within a partition. Records in a partition are appended in sequential order and are assigned a sequential id called the offset. Offsets are monotonically growing numbers; the offset identifies each record's location within a partition.
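Because an offset is a stable position within a partition's log, a consumer can rewind to one and replay. A small sketch, assuming a hypothetical topic name and an already-constructed consumer:

```java
import java.util.List;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;

class Replay {
    // Rewind the consumer so its next poll() re-reads records starting at `offset`.
    static void replayFrom(Consumer<String, String> consumer, long offset) {
        TopicPartition tp = new TopicPartition("page-views", 0);  // hypothetical topic, partition 0
        consumer.assign(List.of(tp));  // direct assignment, bypassing group management
        consumer.seek(tp, offset);     // subsequent polls start from this offset
    }
}
```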
A commit log is not a new concept. It has long been used in the database world: a database writes out information about the records it is about to modify before applying the changes to the various data structures it maintains (transaction logs). The log is the record of what happened, and each table or index is a projection of this history into some useful data structure. Since the log is immediately persisted, it is used as the authoritative source for restoring all other persistent structures in the event of a crash. Eventually, logs were also used for replicating data between databases; many databases allow transmitting portions of the log to replica databases.
To handle retention, Kafka often needs to find messages that are due to be purged. With a single long partition, this would be slow, so a partition is split into multiple segments. On disk, a partition is a directory and each segment is a commit log. When Kafka writes to a partition, it writes to the active segment. When the segment's size limit is reached, a new segment is opened and becomes the new active segment. Segments are named by their base offset: the base offset of a segment is greater than every offset in previous segments and less than or equal to every offset in that segment.

Each message carries its value, offset, timestamp, key, message size, compression codec, checksum, and message format version. The data format on disk is exactly the same as what the broker receives from the producer over the network and sends to its consumers, which allows Kafka to transfer data efficiently with zero copy.

A segment consists of two files: a log and an index. The segment index maps offsets to their message positions in the log. The index file is memory mapped, and an offset lookup uses binary search to find the nearest offset less than or equal to the target offset. The index file is made up of 8-byte entries: 4 bytes storing the offset relative to the base offset and 4 bytes storing the position. Because the offset is stored relative to the base offset, only 4 bytes are needed. For example, if the base offset is 10000000000, the subsequent offsets 10000000001 and 10000000002 are stored as just 1 and 2.
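A toy model of that lookup, assuming the index entries have already been read into a buffer (Kafka memory-maps the actual file):

```java
import java.nio.ByteBuffer;

// Model of a segment's offset index: fixed 8-byte entries, each holding a
// 4-byte offset relative to the segment's base offset and a 4-byte file position.
class OffsetIndex {
    private final ByteBuffer index;  // 8 bytes per entry, offsets ascending
    private final long baseOffset;

    OffsetIndex(ByteBuffer index, long baseOffset) {
        this.index = index;
        this.baseOffset = baseOffset;
    }

    // Binary search for the entry with the greatest offset <= targetOffset,
    // returning that message's byte position in the segment's log file.
    int positionFor(long targetOffset) {
        int rel = (int) (targetOffset - baseOffset);  // relative offset fits in 4 bytes
        int lo = 0, hi = index.limit() / 8 - 1, position = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.getInt(mid * 8) <= rel) {       // entry's relative offset
                position = index.getInt(mid * 8 + 4); // entry's file position
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return position;
    }
}
```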
Kafka relies on the filesystem for storage and caching. The problem is that disks are slower than RAM, because the seek time of a disk is large compared to the time required to actually read the data. But if you can avoid seeking, you can achieve latencies as low as RAM in some cases. Kafka does this through sequential I/O. One advantage of sequential I/O is that you get a cache without writing any caching logic in your application: modern operating systems allocate most of their free memory to disk caching, so if you read in an ordered fashion, the OS can read ahead and keep data cached on each disk read. This is much better than maintaining a cache in a JVM application, because JVM objects are "heavy" and can lead to long garbage collection pauses, which get worse as the data size increases.
One of the major inefficiencies of data processing systems is the serialization and deserialization of data into formats suitable for storage and transmission (e.g., JSON). How does Kafka avoid this? By using a standardized binary data format shared between producers, consumers, and brokers. Zero copy: keeping data in the same format in which it is sent over the network lets the broker copy it directly from the page cache to the socket buffer, cutting the application (user-space) context out of the transfer.
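On the JVM, the zero-copy primitive behind this is FileChannel.transferTo, which hands bytes from the page cache to the socket without passing them through a user-space buffer. A sketch with a placeholder segment path:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

class ZeroCopySend {
    // Stream an on-disk log segment to a connected socket without
    // copying the bytes into application memory.
    static void send(SocketChannel socket) throws IOException {
        try (FileChannel log =
                 new FileInputStream("/tmp/00000000000000000000.log").getChannel()) {
            long position = 0, remaining = log.size();
            while (remaining > 0) {
                long sent = log.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```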
The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. A partition may be assigned to multiple brokers, which will result in the partition being replicated. This provides redundancy of messages in the partition, such that another broker can take over leadership if there is a broker failure. However, all consumers and producers operating on that partition must connect to the leader.
Kafka brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g., 7 days) or until the topic reaches a certain size in bytes (e.g., 1 GB). Once these limits are reached, messages are expired and deleted, so the retention configuration defines the minimum amount of data available at any time. Individual topics can also be configured with their own retention settings, so that messages are stored only for as long as they are useful.
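A sketch of setting such a per-topic override with the Java AdminClient; the topic name, broker address, and the 7-day/1 GB values are examples only:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

class RetentionOverride {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-views");
            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),     // 7 days
                                  AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"), // 1 GB
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```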
Producers balance load across partitions as described above. There are three ways to send messages:

1. Fire-and-forget: we send a message to the server and don't really care whether it arrives successfully or not. Most of the time it will arrive, since Kafka is highly available and the producer retries sending messages automatically; still, some messages will be lost with this method.
2. Synchronous send: we send a message, the send() method returns a Future object, and we use get() to wait on the future and see whether the send() succeeded.
3. Asynchronous send: we call the send() method with a callback function, which is triggered when a response is received from the Kafka broker.
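The three styles side by side, sketched with the Java client (topic, key, value, and broker address are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

class SendStyles {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-1", "profile-view");

            // 1. Fire & forget: ignore the returned Future; failures may go unnoticed.
            producer.send(record);

            // 2. Synchronous: block on the Future; get() throws if the send failed.
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("sync: partition=%d offset=%d%n", meta.partition(), meta.offset());

            // 3. Asynchronous: the callback runs once the broker responds.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();  // e.g. log and retry elsewhere
                } else {
                    System.out.printf("async: partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```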
A producer object can be used by multiple threads to send messages. It is typical to start with one producer and one thread. If better throughput is needed, more threads using the same producer can be added; once that ceases to increase throughput, more producers can be added to the application to achieve even higher throughput.
The main way we scale data consumption from a Kafka topic is by adding more consumers to a consumer group. However, if we add more consumers to a single group subscribed to a single topic than the topic has partitions, some of the consumers will be idle and get no messages at all. It is common for Kafka consumers to perform high-latency operations, such as writing to a database or running a time-consuming computation on the data. In these cases a single consumer can't possibly keep up with the rate at which data flows into a topic, and adding more consumers that share the load, each owning just a subset of the partitions and messages, is our main method of scaling. This is a good reason to create topics with a large number of partitions: it allows adding more consumers when the load increases. In addition to having multiple consumers within a group, we may also have multiple consumer groups subscribed to the same topic. Kafka scales to large numbers of consumer groups and consumers without impacting performance.
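A sketch of one group member: every instance runs the same code with the same group.id, and the group coordinator spreads the topic's partitions across the instances, rebalancing whenever a member joins or leaves. Names and addresses are placeholders.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "db-writers");  // same id in every instance => one shared group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // finish in-flight work before these partitions move to another member
                    System.out.println("revoked: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofSeconds(1)).forEach(record -> {
                    // high-latency work, e.g. a database write, goes here
                });
            }
        }
    }
}
```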
In summary:

Data infrastructure: a centralized data pipeline.

Why not traditional messaging systems for the centralized pipeline?
- Transient vs durable messages
- Push vs pull based delivery
- Offset tracking: replay messages on the consumer side
- Distributed: partitioning & replication
Key Idea 1: Data parallelism leads to scale-out. Randomly distribute clients across partitions.

Key Idea 2: Disks are fast when used sequentially. Store messages as a write-ahead log.

Key Idea 3: Batching makes the best use of the network. Batched transfer, compression, no JVM caching (low memory footprint) & zero copy.
Why the file system & not memory?
- With sequential access, the speed difference between the file system and memory is lean.
- Kafka runs on the JVM:
  - heavy object overheads for data stored in memory
  - increased GC time
[Zero-copy diagram: data flows from the page cache to the socket buffer and on to the NIC buffer, bypassing the user-space buffer.]
Brokers:
- Receive messages from producers, assign offsets & write them to disk.
- Fetch messages for consumers, reading partitions & responding with committed messages.
- One broker is elected as Controller: handles admin operations, assigns partitions to brokers & monitors for broker failures.
- Topic retention: by time or size.
[Replication diagram: Topic A partitions 0 and 1 each hosted on Broker 0 (the Controller) and a second broker; producers send messages for A/0 and A/1 to the partition leaders, and follower replicas fetch those messages from the leaders.]
Producers:
- Accept a ProducerRecord.
- The ProducerRecord's key & value are serialized into byte arrays by the serializer.
- The partitioner chooses a partition (by key, if none is specified explicitly) & adds the record to a batch bound for the same topic and partition.
- A separate thread handles sending batches to the brokers.
- Three methods: 1. fire & forget, 2. synchronous, 3. asynchronous.
Consumers:
- Consumer groups for consumption scaling.
- Topic partitions are distributed among the consumers in a group.
- Partitions are rebalanced on consumer additions or crashes (consumer unavailability & loss of consumer cache).