Apache Kafka is a distributed messaging system originally developed by LinkedIn to handle high volumes of log data with low latency. It allows for both online and offline data analysis and is highly scalable and efficient. Kafka uses a "pull model" where consumers pull messages from brokers in a distributed, fault-tolerant way coordinated by Zookeeper. Producers push messages to topics which are partitioned across brokers for scalability.
2. WHAT IS KAFKA?
• KAFKA is a distributed messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn's activity stream and operational data processing pipeline. It is now used at a variety of companies for various data pipeline and messaging uses.
• LinkedIn developed Kafka for collecting and delivering high volumes of log data with low latency for real-time log processing.
• Written in Scala; uses Zookeeper for coordination.
3. WHAT IS KAFKA? CONTD..
• KAFKA can be used for both ONLINE (real-time) and OFFLINE (integration with Hadoop) analysis of log data.
• Distributed and scalable.
• High throughput.
• Prioritizes efficiency over fancy features.
• Simple API.
• Low overhead.
5. Where to use APACHE KAFKA?
• User activity events: logins, page views, clicks, “likes”, sharing, comments, and search queries.
• Operational metrics (health monitoring).
• Search relevance.
• Recommendations.
• Ad targeting and reporting.
• Protection against abnormal behavior.
• Newsfeed.
• Etc.
6. FLAWS in Traditional Messaging Systems
• More focus on delivery guarantees (often overkill for log data).
• Less focus on throughput.
• Weak distributed support.
• Performance degrades as the backlog of unconsumed messages in the queue grows.
9. ARCHITECTURE contd..
• A stream of messages of a particular type is defined by a topic.
• A topic is divided into multiple partitions.
• Each partition is a directory containing a list of segment files that store the message data.
• A producer publishes (pushes) messages to a topic; a minimal producer sketch follows this list.
• Published messages are stored on a set of servers called brokers.
• Each broker stores one or more partitions.
• Consumers can subscribe to any topic and consume messages by pulling data from the brokers.
• Supports both the point-to-point and publish/subscribe delivery models.
• Uses a “pull-based” system model.
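A minimal sketch of the push side using today's Java producer client (this API postdates the original design described here; the broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Push a message to the hypothetical "user-activity" topic;
            // the key determines which partition the message lands in.
            producer.send(new ProducerRecord<>("user-activity", "user42", "page-view:/home"));
        }
    }
}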
10. WHY “PULL MODEL”?
• A “push-based” system has difficulty dealing with diverse consumers because the broker controls the rate at which data is transferred; an overwhelmed consumer (in effect, a denial-of-service on the consumer) can lose data.
• In a pull-based system, a consumer simply falls behind and catches up when it can (a minimal pull-loop sketch follows this list).
• A consumer can rewind to consume old messages.
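A minimal sketch of the pull loop with today's Java consumer client (again a modern API; broker address, group id, and topic are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "analytics");               // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-activity"));
            while (true) {
                // The consumer pulls at its own pace; if it is slow it simply falls behind.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
                }
            }
        }
    }
}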
14. Storage
• Simple storage.
• Each partition corresponds to a logical log (a log is a list of files).
• Physically, a log is implemented as a set of segment files of approximately equal size.
• When a message is published, the broker simply appends it to the last segment file.
• Messages are flushed to disk in batches for better performance.
• A message is only exposed to consumers after it has been flushed.
• Messages are addressed by their log offset. #lessoverhead
15. Storage contd..
• A message id is its byte offset in the log, so: id of next message = id of current message + length of current message (a worked sketch follows).
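A worked sketch of that arithmetic (the numbers are invented):

public class OffsetArithmetic {
    public static void main(String[] args) {
        // In this design a message id is its byte offset in the partition's log.
        long currentId = 1000;     // current message starts at byte 1000
        long currentLength = 120;  // current message occupies 120 bytes on disk
        long nextId = currentId + currentLength;
        System.out.println(nextId); // 1120: where the next message begins
    }
}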
16. Efficient Transfer
• A producer can submit multiple messages in a single send request (#End-to-endmessagebatching); a batching sketch follows this list.
• Each pull request can consume multiple messages, up to a certain size.
• No caching of messages in the Kafka process layer: messages are cached only in the OS page cache, so there is “no double buffering” and very little garbage-collection overhead. (#filesystemcaching) #lessoverhead
• Producers and consumers access the data files sequentially (disks are fast when accessed sequentially).
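In today's Java producer this end-to-end batching is tuned with the batch.size and linger.ms settings (modern configuration keys, not the original paper's); a minimal sketch with placeholder broker address and topic:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 65536); // accumulate up to 64 KB per partition per request
        props.put("linger.ms", 10);     // wait up to 10 ms for a batch to fill

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // These sends are grouped into a handful of network requests, not 1000.
                producer.send(new ProducerRecord<>("user-activity", "event-" + i));
            }
        }
    }
}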
17. Efficient Transfer: from local file to remote socket
1. Read data from the storage media into the OS page cache.
2. Copy the data in the page cache to an application buffer.
3. Copy the application buffer to another kernel buffer.
4. Send the kernel buffer to the socket.
• This process takes 4 data copies and 2 system calls.
• The sendfile API avoids 2 of the copies and 1 system call (sketch below). #zerocopytransfer
(Figure: zero-copy transfer from broker to consumer.)
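In Java the same zero-copy path is exposed as FileChannel.transferTo(), which maps to sendfile(2) on Linux and moves bytes from the page cache straight to the socket; a minimal sketch (file name, host, and port are placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel segment = new FileInputStream("segment-00000000.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("consumer-host", 9000))) {
            long pos = 0;
            long remaining = segment.size();
            while (remaining > 0) {
                // No copy through user space: page cache -> socket buffer.
                long sent = segment.transferTo(pos, remaining, socket);
                pos += sent;
                remaining -= sent;
            }
        }
    }
}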
18. Stateless Broker
• Each consumer maintains its own consumption state, reducing complexity and overhead on the broker.
• The disadvantage is that the broker doesn't know whether all subscribers have consumed a message. This is solved with a simple time-based SLA as the retention policy: a message is automatically deleted after it has been retained for a configured period (typically 7 days).
• A consumer can deliberately rewind to an old offset and re-consume data (a rewind sketch follows this list). #violateslogicofqueue
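A minimal rewind sketch with today's Java consumer (topic, partition, and target offset are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("user-activity", 0);
            consumer.assign(List.of(tp));
            consumer.seek(tp, 42L); // rewind: re-read everything from offset 42 onward
            consumer.poll(Duration.ofMillis(500)).forEach(r ->
                System.out.printf("re-read offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}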
20. Distributed Model contd..
• A consumer group consists of one or more consumers.
• All messages from a given partition are consumed by a single consumer within a consumer group (sketch below). #lessoverheadonbroker
• No master node: less complexity and no master failures to worry about.
• Zookeeper coordinates between the producers, consumers, and brokers.
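A sketch of the group model, assuming a hypothetical four-partition "user-activity" topic: two consumers sharing a group.id form one group, and after a rebalance each partition belongs to exactly one of them:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMembers {
    static KafkaConsumer<String, String> newMember() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "analytics"); // same group id => same consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        // Once both members subscribe and start polling, the group rebalances so
        // that, e.g., member1 owns partitions 0-1 and member2 owns partitions 2-3;
        // no partition is ever consumed by two members of the same group.
        KafkaConsumer<String, String> member1 = newMember();
        KafkaConsumer<String, String> member2 = newMember();
        member1.subscribe(List.of("user-activity"));
        member2.subscribe(List.of("user-activity"));
    }
}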
21. Zookeeper functions
• Detection of the addition and removal of brokers, producers, and consumers.
• Triggering a rebalance process when such a change is detected.
• Tracking the following registries (a registration sketch follows this list):
  - Broker registry: the broker's hostname, port, and set of topics (ephemeral).
  - Consumer registry: the consumer's consumer group (ephemeral).
  - Ownership registry: the consumer currently consuming a partition (ephemeral).
  - Offset registry: the offset of the last consumed message for each subscribed partition (persistent).
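A minimal sketch of ephemeral registration with the ZooKeeper Java API (the znode path and payload are illustrative, not Kafka's actual layout). An ephemeral znode disappears when its owner's session dies, which is what makes broker and consumer departures detectable:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BrokerRegistration {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
        // Ephemeral: removed automatically if this broker's session dies,
        // letting Zookeeper watchers detect the failure and trigger a rebalance.
        zk.create("/brokers/0",
                  "host1:9092,topicA,topicB".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }
}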
22. Delivery Guarantee
• At-least-once delivery. #cancauseduplication #costeffective
• Messages from a single partition are delivered in order.
• Messages from multiple partitions are not necessarily in order. #noguarantee
• A CRC is stored for each message in the log (a validation sketch follows this list). #avoidlogcorruption
• On an I/O error, the broker removes messages with inconsistent CRCs.
• If the storage system is completely damaged before a consumer has consumed a message, that message is lost forever. #futureplanstoaddreplication
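A minimal sketch of such a CRC check using Java's built-in CRC32 (the stored checksum here is computed on the spot for illustration; the real on-disk layout is the broker's concern):

import java.util.zip.CRC32;

public class MessageCheck {
    // True if the payload's CRC32 matches the checksum stored alongside it in the log.
    static boolean isIntact(byte[] payload, long storedCrc) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue() == storedCrc;
    }

    public static void main(String[] args) {
        byte[] msg = "page-view:/home".getBytes();
        CRC32 crc = new CRC32();
        crc.update(msg, 0, msg.length);
        long stored = crc.getValue();              // checksum written with the message
        System.out.println(isIntact(msg, stored)); // true: log entry is consistent
    }
}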