Kafka Fundamentals
2. Index
1. What is Apache Kafka?
2. What is a Messaging Queue?
3. Two types of Messaging Queues
4. Components of Kafka
5. How Kafka combines both types of Queue
3. What is Apache Kafka?
Apache Kafka is a distributed streaming platform (let's ignore
the other components of Kafka here). It is currently the most popular and
widely used messaging queue. It started as a pub-sub
model messaging queue, and it leverages both the pub-sub model
and parallelism using consumer groups.
5. What is a Messaging Queue?
1. Stores messages to be processed.
2. A short-term storage system.
3. Conventional systems are not designed to handle data at
high velocity. Here the messaging queue comes to the
rescue: one application can write data at high
velocity, and another application can read from the
queue, in the order the data were written, and do the further
processing.
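As a minimal sketch of the idea above (plain Python, no broker involved): one application enqueues messages at its own pace, and another dequeues and processes them in the order they were written.

```python
from queue import Queue

# A minimal FIFO messaging-queue sketch: the producer writes a burst
# of messages; the consumer later reads them in write order.
q = Queue()

# Fast producer: writes a burst of messages.
for i in range(5):
    q.put(f"event-{i}")

# Consumer: reads messages in the order they were written.
consumed = []
while not q.empty():
    consumed.append(q.get())

print(consumed)  # FIFO order preserved
```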
7. Traditional Messaging Queue
1. An application writes data into the queue and another
application reads data from it in sequence.
2. Messages are deleted from the queue once read.
3. To increase parallelism (speed up consumption),
multiple consumers are run. E.g. RabbitMQ.
8. Pub-Sub Model
1. Messages are persisted even after they are read.
2. Multiple consumers can read the same messages.
3. With the pub-sub model, we can use multiple consumers to
consume data from the same topic and increase parallelism.
E.g. Apache Kafka.
9. How Kafka combines both types of
Messaging Queue into one
1. Kafka is a pub-sub model queue,
2. where multiple consumers can consume the same messages
from a topic.
3. It persists data for a configured period (168 hours by
default in Kafka), whereas a traditional queueing system
deletes messages once consumed.
4. No partition of a topic will be read by more than one
consumer from the same consumer group.
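The combination above can be sketched in plain Python (no Kafka client; names like `topic_log` are illustrative): the topic is an append-only log that retains messages, and each consumer group keeps its own offset, so every group sees all messages independently.

```python
# Sketch of Kafka's hybrid model: reads do not delete from the log,
# and each consumer group tracks its own read position (offset).
topic_log = []                              # retained messages
group_offsets = {"analytics": 0, "billing": 0}

def produce(msg):
    topic_log.append(msg)

def consume(group):
    """Return the next unread message for this group, or None."""
    off = group_offsets[group]
    if off < len(topic_log):
        group_offsets[group] = off + 1
        return topic_log[off]
    return None

produce("order-1")
produce("order-2")

# Both groups independently read the same messages (pub-sub),
# while within a group each message is delivered once (queue).
a = [consume("analytics"), consume("analytics")]
b = [consume("billing"), consume("billing")]
print(a, b)
```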
10. Producers, Brokers, Consumers
Producers: produce
messages to the Kafka cluster.
Brokers: the Kafka servers,
where messages are received, stored,
and retrieved from.
Consumers: consume data
from the Kafka cluster.
11. Topics, Partitions
Topics are like buckets that messages
are produced to and consumed from.
A topic can be divided into multiple
partitions; the number of partitions is defined while creating the topic.
Partitions help us parallelize message consumption by running multiple
consumers of the same group.
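How a record lands in a particular partition can be sketched as key-based hashing (illustrative only: `crc32` is used here for simplicity, whereas Kafka's Java client uses murmur2). Records with the same key always land in the same partition, preserving per-key ordering while different keys spread across partitions.

```python
import zlib

# Sketch of a key-based partitioner: partition = hash(key) % n.
NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # crc32 for illustration; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode()) % num_partitions

for key in ["user-1", "user-2", "user-1"]:
    print(key, "->", partition_for(key))  # same key -> same partition
```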
12. Groups and Partitions
The concept of groups is used when we want to increase parallelism while
consuming data.
Consumers are assigned a group. When consumers
from the same consumer group start consuming
data from a topic, they are assigned certain partitions.
No two consumers will be assigned the same partition. A single consumer can be
assigned multiple partitions, but not vice versa. So it is advised to keep the number of
consumers in a consumer group less than or equal to the number of partitions of the topic it is
consuming data from.
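The assignment rule above can be sketched with a simple round-robin assignor (illustrative; Kafka ships several assignment strategies such as range and round-robin): no partition goes to two consumers, one consumer may own several partitions, and surplus consumers sit idle.

```python
from itertools import cycle

# Sketch of round-robin partition assignment within one consumer group.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for p, c in zip(partitions, cycle(consumers)):
        assignment[c].append(p)
    return assignment

# 4 partitions, 3 consumers: c1 owns two partitions.
print(assign(range(4), ["c1", "c2", "c3"]))
# 2 partitions, 3 consumers: the extra consumer gets nothing.
print(assign(range(2), ["c1", "c2", "c3"]))
```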
13. Replication
In Kafka, we can maintain replicas of the data stored in a topic (per partition,
in fact).
While creating a topic, we can specify its replication factor.
Every partition has one server acting as the leader and the rest of the
replicas as followers.
Kafka exposes a message to the consumer only once it has been committed
to the defined number of in-sync replicas (ISR).
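A toy model of the commit rule (not Kafka's actual implementation; the class and method names are invented for illustration): the leader appends a record, followers copy it, and the record becomes readable only once `min.insync.replicas` replicas hold it.

```python
# Sketch of leader/follower replication with a commit (high-water) mark.
MIN_INSYNC_REPLICAS = 2

class Partition:
    def __init__(self, replication_factor):
        # replica 0 acts as the leader; the rest are followers
        self.replicas = [[] for _ in range(replication_factor)]
        self.committed = 0  # offsets below this are visible to consumers

    def append(self, record):
        self.replicas[0].append(record)  # write goes to the leader

    def replicate(self, follower):
        self.replicas[follower] = list(self.replicas[0])
        # commit once enough replicas fully match the leader
        in_sync = sum(1 for r in self.replicas if len(r) == len(self.replicas[0]))
        if in_sync >= MIN_INSYNC_REPLICAS:
            self.committed = len(self.replicas[0])

    def readable(self):
        return self.replicas[0][:self.committed]

p = Partition(replication_factor=3)
p.append("m1")
print(p.readable())   # [] : written to the leader, not yet committed
p.replicate(1)
print(p.readable())   # ['m1'] : held by 2 replicas, now committed
```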
14. Offset
The offset is a unique identifier of a record within a partition. It also denotes the
position of a consumer in the partition.
15. Apache ZooKeeper
Apache ZooKeeper is a distributed, open-source configuration and synchronization
service, along with a naming registry, for distributed applications.
It is used in distributed systems for synchronizing configuration.
16. ZooKeeper in Kafka
1. ZooKeeper is the default storage engine for consumer offsets (in older Kafka
versions; newer versions store offsets in an internal Kafka topic).
2. Leadership election of Kafka brokers and topic-partition pairs.
3. It sends topology changes to Kafka, so each node in the cluster knows
when a new broker joined, a broker died, a topic was removed, a topic was
added, etc.
17. Use cases of Kafka (Messaging Queue)
1. Real-time data ingestion, e.g. a Spark Streaming and Kafka
pipeline.
2. Pushing API requests to a queue when the application cannot
handle the request volume directly. (@MobiKwik)
3. Sending mails, notifications, etc. (@MobiKwik Bill
Reminder)
4. Real-time data replication.
19. Configurations
<kafka-home>/config/server.properties
broker.id=0 When running multiple brokers in the same cluster, assign a different id to each
num.partitions=1 Default number of partitions per topic
log.retention.hours=168 Number of hours to retain the logs
min.insync.replicas Specifies the minimum number of replicas that must acknowledge a write for the
write to be considered successful
<kafka-home>/config/consumer.properties
group.id= Assigns a group to the consumer
<kafka-home>/config/producer.properties
linger.ms= The producer will wait up to the given delay to allow other records to be sent, so that the sends
can be batched together
partitioner.class= Custom class for partitioning events; the default partitioner hashes the record key and
spreads keyless records across partitions
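As an illustration, the server-side options above could appear together in server.properties like this (the values are examples for a small cluster, not recommendations):

```properties
# server.properties (illustrative values)
# broker.id must be unique per broker in the cluster
broker.id=1
# default partition count for newly created topics
num.partitions=3
# retain messages for 7 days
log.retention.hours=168
# writes succeed only after this many replicas acknowledge
min.insync.replicas=2
```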