Fundamentals and Architecture of
Apache Kafka®
Angelo Cesaro
Who am I?
• I’m Angelo!
• Consultant and Data Engineer at Cesaro.io
• More than 10 years of experience
• Worked at ServiceNow, Sky
• Follow me on
https://www.linkedin.com/in/angelocesaro
https://twitter.com/angelocesaro
https://github.com/cesaroangelo
Apache Kafka – Overview
• A distributed streaming platform used for building real-time data
pipelines and mission-critical streaming applications, with the
following characteristics:
1. Horizontally scalable
2. Fault tolerant
3. Really fast
4. Used by thousands of companies in production
Kafka’s benefits over traditional
message queues
There are a few key differences between Kafka and other
traditional message queues
• Durability and availability
1. The cluster can handle broker failures
2. Messages are replicated for reliability
• Very high throughput
• Data retention
• Excellent scalability
1. Even a small Kafka cluster can process a large number of messages
• Support for both real-time and batch consumption
1. Kafka was born for real-time processing of data, but can also handle
batch-oriented jobs, for example feeding data to Hadoop or a data
warehouse
High-level view of a Kafka cluster
• Producers send data to the Kafka cluster
• Consumers read data from the Kafka cluster
• Brokers are the main storage and messaging components of the
Kafka cluster
Note: the components above can be physical machines, VMs, or Docker containers; Kafka works the
same on any of those platforms.
Messages
• The basic unit of data in Kafka is the message; messages are the
atomic unit of data sent by producers
• A message is a key-value pair:
• All data is stored in Kafka as byte arrays (very
important!)
• The producer provides serializers to convert the key and value
to byte arrays
• Key and value can be any data type
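The serializer idea above can be sketched in a few lines. This is a minimal illustration, not Kafka's actual serializer classes: the function names are hypothetical stand-ins for the string and JSON serializers that Kafka clients typically provide.

```python
import json

def serialize_string(s: str) -> bytes:
    """A string serializer: encode text as UTF-8 bytes."""
    return s.encode("utf-8")

def serialize_json(obj) -> bytes:
    """A custom value serializer: any JSON-able object to bytes."""
    return json.dumps(obj).encode("utf-8")

# The broker only ever sees byte arrays, whatever the original types were:
key = serialize_string("user-42")
value = serialize_json({"action": "login", "ts": 1700000000})
assert isinstance(key, bytes) and isinstance(value, bytes)
```

The consumer side mirrors this with deserializers that turn the byte arrays back into usable types.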
Topic
• Kafka keeps streams of messages called topics, and they categorize
messages into groups
• Developers can decide which topics have to exist; by
default, Kafka auto-creates topics when they are first used
• Kafka has no limit on the number of topics that can be used
• Topics are logical representations that span across brokers
Note: by analogy, we can think of topics as tables in a DBMS: just as
we separate data in a database into different tables, we do the same with
topics
Data partitioning
• Producers shard data over a group of partitions; this allows
parallel access to the topic for increased throughput
• Each partition contains a subset of the messages, and messages
within a partition are ordered and immutable
• Usually the message key is used to control which partition a
message is assigned to
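The key-to-partition mapping can be sketched as below. Note the hedge: Kafka's default partitioner hashes the key bytes with murmur2; this sketch substitutes an MD5-based hash purely to illustrate the idea that the same key always maps to the same partition.

```python
import hashlib

NUM_PARTITIONS = 3  # hypothetical topic with 3 partitions

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner uses murmur2 on the key bytes;
    # a stable MD5-based hash stands in here for illustration.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands in the same partition,
# which is what preserves per-key ordering:
assert partition_for(b"user-42") == partition_for(b"user-42")
```

This is also why choosing a key with enough distinct values matters: a low-cardinality key concentrates traffic on a few partitions.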
Kafka components
• There are 4 key components in a Kafka system:
• Brokers
• Producers
• Consumers
• Zookeeper
Kafka broker
• Brokers receive and store data sent by the producers
• Brokers are server-class systems that provide messages to the
consumers when requested
• Messages are spread across multiple partitions on different brokers
• Kafka provides a configurable retention policy for messages, and each
message is identified by its offset number
• The commit log is an append-only data structure that lives in RAM for
fast access and is flushed to disk periodically
• Producers send requests to the brokers, which append messages to the
end of the log
• Consumers consume from a specific offset (usually the lowest
available) and read all messages sequentially
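The append-only log and offset model above can be captured in a toy sketch. This is an in-memory illustration of the concept, not how the broker is implemented:

```python
class PartitionLog:
    """Toy model of one partition's append-only commit log."""

    def __init__(self):
        self._messages = []

    def append(self, message: bytes) -> int:
        """Producers append at the end; the returned offset
        uniquely identifies the message within the partition."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset: int):
        """Consumers read sequentially from a chosen offset onward."""
        return self._messages[offset:]

log = PartitionLog()
log.append(b"m0"); log.append(b"m1"); log.append(b"m2")
# Reading from the lowest available offset yields everything, in order:
assert log.read_from(0) == [b"m0", b"m1", b"m2"]
assert log.read_from(2) == [b"m2"]
```

Retention then simply means dropping messages below some offset after a configured time or size threshold.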
Kafka producers
• Each producer writes data as messages to the Kafka cluster
• Producers can be written in any language
• Kafka provides a command-line tool to send messages to the cluster
• Confluent develops a REST (REpresentational State Transfer) server
which can be used by clients written in any language
• Confluent Enterprise includes an MQTT (Message Queuing Telemetry
Transport) proxy that allows direct ingestion of IoT data
Kafka consumers
• Each consumer pulls events from topics as they are written
• The latest messages read are tracked in a special ‘consumer
offsets’ topic
• If necessary, consumers can be reset to start reading from a
specific offset (a parameter to set in the configuration for the
default behavior)
Note: other similar solutions tend to push events
Distributed consumption
• Kafka scales consumption by combining
multiple consumers into consumer groups
• Each consumer in a group is assigned a subset of
partitions for consumption
It’s important to know that traditional systems tend to be point-to-
point, meaning a message is gone once it has been
consumed and can’t be read again. Kafka was designed to work
differently, allowing the data to be consumed multiple times
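The group assignment idea can be sketched as follows. Kafka ships several assignment strategies (range, round-robin, and others); this hedged sketch shows only the round-robin flavor and the core invariant: within a group, each partition goes to exactly one consumer.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions to consumers in a group.
    Illustrative only: Kafka's real assignors also handle rebalances
    when consumers join or leave the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions shared between 2 consumers in the same group:
a = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
assert a == {"c1": [0, 2], "c2": [1, 3]}
```

A second consumer group gets its own independent assignment and offsets, which is how the same data can be consumed multiple times.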
Zookeeper
• Zookeeper is a centralized and distributed service that can be
used to enable highly reliable distributed coordination
• It maintains configuration information (in this context, Kafka
cluster configurations)
• It provides distributed synchronization
• It runs as a cluster and provides resiliency against failures
Kafka & Zookeeper
Kafka uses Zookeeper for various important features
• Cluster management
• Storage of ACLs and passwords
• Failure detection and recovery
Note:
1. Kafka can’t run without Zookeeper
2. In previous Kafka releases (<0.11), the clients had to access
Zookeeper; from 0.11 only the brokers need that access,
so the cluster is isolated from the clients for better security and
performance
Advantages of a pull architecture
• Ability to add more consumers to the system without
reconfiguring the cluster
• Ability for a consumer to go offline and come back later,
resuming from where it left off
• Consumers won’t get overwhelmed by data: each consumer decides
at what speed to fetch data, and slow consumers won’t affect fast
producers
Speeding up data transfer
Kafka is fast, but why?
• Kafka uses the system page cache (a Linux kernel feature) for
producing and consuming messages
• The use of the page cache enables zero-copy, a feature that allows
data to be transferred directly from a local file channel to a remote
socket, saving CPU cycles and memory bandwidth
Kafka metrics
• Kafka metrics can be exposed via JMX and shown through JMX clients
• The types of metrics exposed are:
1. Gauge: instantaneous measurement of one value
2. Meter: measurement of ticks in a time range, e.g. one-minute rate, 10-minute
rate, etc.
3. Histogram: measurement of the distribution of a value, e.g. 50th percentile,
98th percentile, etc.
4. Timer: measurement of timings (meter + histogram)
Kafka uses Yammer metrics on the broker and in the older (<0.9) clients.
New clients use a new internal metrics package. Confluent plans to
consolidate the JMX metrics packages in the future.
Why Replication?
• Each partition is stored on a broker
• Without replication, if a broker goes offline,
the partitions stored on that broker become unavailable and
permanent data loss can occur
• Without redundancy, partitions are not available for reads and
writes while the server is offline, and if the server has a fatal crash,
the data is gone permanently
Kafka uses replication for durability and availability
Replica
• Each partition can have replicas
• Each replica is placed on a different broker
• Replicas are spread evenly across brokers for load balancing
We specify the replication factor at topic creation time
Rack awareness of replicas
• Rack awareness enables each replica to be placed on brokers in
different racks. That helps to improve performance and fault
tolerance.
• Each broker can be configured with a broker.rack property, e.g.
rack-1, us-east-1a
• It’s useful if we need to deploy Kafka on AWS across availability
zones
• Rack awareness was introduced in Confluent 3.0
Replica configurations
• Increase the replication factor for better durability
• For auto-created topics, Kafka uses replication factor 1 by default;
that needs to be configured accordingly in server.properties
kafka-topics --create --zookeeper zookeeper:2181 --partitions 1 --replication-factor 3 --topic mytopic
How brokers are involved in
replication
• Brokers ensure strongly consistent replicas
• One replica is on the leader broker
• All messages produced go to the leader
• The leader propagates those messages to the follower brokers
• All consumers read messages from the leader
Note: very important to understand the above in case of
troubleshooting ;)
Leaders and followers
• Leader:
1. Accepts all reads and writes
2. Manages replicas
3. Leader election rate (meter metric):
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
• Follower:
• Provides fault tolerance
• Keeps up with the leader
• There is a special thread running in the cluster that manages the current list
of leaders and followers for every partition. It’s a complex and mission-
critical task; for this reason there is a replica of this information in
Zookeeper, which is then cached on every broker for faster access
Partition leaders
• Leaders have to be evenly distributed across all brokers for 2
main reasons:
• Leaders can change in case of failure
• Leaders do more work, as discussed in the previous slides
Preferred replica
• When we create a topic, the preferred replica is set automatically
• It’s the first replica in the list of assigned replicas
• kafka-topics --zookeeper zookeeper:2181 --describe --topic my-topic
Topic: my-topic PartitionCount: 1 ReplicationFactor: 3 Configs:
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
In Sync Replica (ISR)
• The in-sync replica list is the list of replicas that are caught up:
leader + followers
• A message is committed once it has been received by every replica in
the list
Note for troubleshooting: where is the ISR list kept? On
the leader
What does committed mean?
• Committed means, in this context, that the message has been
received and written to disk by all in-sync replicas
• Data is not available for consuming until it has been
committed
• Who decides when to commit a message? The leader has
this responsibility
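The leader's commit decision can be reduced to one rule. This is a hedged sketch of the high-watermark idea under a simplified model (every follower in the ISR reports how far it has replicated), not the broker's actual code:

```python
def committed_offset(leader_log_end: int, follower_offsets: list) -> int:
    """The committed point (high watermark) is the minimum log
    position across the leader and its in-sync followers: a message
    counts as committed only once every ISR member has it."""
    return min([leader_log_end] + follower_offsets)

# Leader has written up to offset 10; followers have reached 8 and 10.
# Only messages up to offset 8 are committed and visible to consumers:
assert committed_offset(10, [8, 10]) == 8
```

This is why a slow follower in the ISR delays visibility for consumers, and why Kafka evicts followers that fall too far behind from the ISR.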
Using Kafka command line tools
#create topic with replication factor 1 and partition 1
• kafka-topics.sh --create --zookeeper localhost:2181 --replication-
factor 1 --partitions 1 --topic test
#delete topic with name test
• kafka-topics.sh --delete --zookeeper localhost:2181 --topic test
#list info regarding topic
• kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
#list topics
• kafka-topics.sh --list --zookeeper localhost:2181
Links!
• https://kafka.apache.org
• https://www.confluent.io
• https://www.cesaro.io
