Fundamentals and Architecture of
Apache Kafka®
Angelo Cesaro
Who am I?
• I’m Angelo!
• Consultant and Data Engineer at Cesaro.io
• More than 10 years of experience
• Worked at ServiceNow, Sky
• Follow me on
https://www.linkedin.com/in/angelocesaro
https://twitter.com/angelocesaro
https://github.com/cesaroangelo
Apache Kafka – Overview
• A distributed streaming platform used for building real-time data
pipelines and mission-critical streaming applications, with the
following characteristics:
1. Horizontally scalable
2. Fault tolerant
3. Really fast
4. Used by thousands of companies in production
Kafka’s benefits over traditional
message queues
There are a few key differences between Kafka and other
traditional message queues
• Durability and availability
1. The cluster can handle broker failures
2. Messages are replicated for reliability
• Very high throughput
• Data retention
• Excellent scalability
1. Even a small Kafka cluster can process a large number of messages
• Support for both real-time and batch consumption
1. Kafka was born for real-time processing of data, but can also handle
batch-oriented jobs, for example feeding data to Hadoop or a data
warehouse
High-level view of a Kafka cluster
• Producers send data to the Kafka cluster
• Consumers read data from the Kafka cluster
• Brokers are the main storage and messaging components of the
Kafka cluster
Note: the components above can be physical machines, VMs, or Docker containers; Kafka works the
same on any of those platforms.
Messages
• The basic unit of data in Kafka is the message; messages are the
atomic unit of data sent by producers
• A message is a key-value pair:
• All data is stored in Kafka as byte arrays (very
important!)
• The producer provides serializers to convert the key and value
to byte arrays
• Key and value can be any data type
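The serializer idea above can be sketched in a few lines. This is a minimal illustration, not Kafka's actual serializer classes: the function names are hypothetical stand-ins for the string and JSON serializers that Kafka clients typically provide.

```python
import json

def serialize_string(s: str) -> bytes:
    """A string serializer: encode text as UTF-8 bytes."""
    return s.encode("utf-8")

def serialize_json(obj) -> bytes:
    """A custom value serializer: any JSON-able object to bytes."""
    return json.dumps(obj).encode("utf-8")

# The broker only ever sees byte arrays, whatever the original types were:
key = serialize_string("user-42")
value = serialize_json({"action": "login", "ts": 1700000000})
assert isinstance(key, bytes) and isinstance(value, bytes)
```

The consumer side mirrors this with deserializers that turn the byte arrays back into usable types.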
Topic
• Kafka keeps streams of messages called topics, and they categorize
messages into groups
• Developers can decide which topics have to exist; by
default, Kafka auto-creates topics when they are first used
• Kafka has no limit on the number of topics that can be used
• Topics are logical representations that span across brokers
Note: by analogy, we can think of topics as tables in a DBMS: just as
we separate data in a database into different tables, we do the same with
topics
Data partitioning
• Producers shard data over a group of partitions; this allows
parallel access to the topic for increased throughput
• Each partition contains a subset of the messages, and messages
within a partition are ordered and immutable
• Usually the message key is used to control which partition a
message is assigned to
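The key-to-partition mapping can be sketched as below. Note the hedge: Kafka's default partitioner hashes the key bytes with murmur2; this sketch substitutes an MD5-based hash purely to illustrate the idea that the same key always maps to the same partition.

```python
import hashlib

NUM_PARTITIONS = 3  # hypothetical topic with 3 partitions

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner uses murmur2 on the key bytes;
    # a stable MD5-based hash stands in here for illustration.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands in the same partition,
# which is what preserves per-key ordering:
assert partition_for(b"user-42") == partition_for(b"user-42")
```

This is also why choosing a key with enough distinct values matters: a low-cardinality key concentrates traffic on a few partitions.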
Kafka components
• There are 4 key components in a Kafka system:
• Brokers
• Producers
• Consumers
• Zookeeper
Kafka broker
• Brokers receive and store data sent by the producers
• Brokers are server-class systems that provide messages to the
consumers when requested
• Messages are spread across multiple partitions on different brokers
• Kafka provides a configurable retention policy for messages, and each
message is identified by its offset number
• The commit log is an append-only data structure that lives in RAM for
fast access and is flushed to disk periodically
• Producers send requests to the brokers, which append messages to the
end of the log
• Consumers consume from a specific offset (usually the lowest
available) and read all messages sequentially
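The append-only log and offset model above can be captured in a toy sketch. This is an in-memory illustration of the concept, not how the broker is implemented:

```python
class PartitionLog:
    """Toy model of one partition's append-only commit log."""

    def __init__(self):
        self._messages = []

    def append(self, message: bytes) -> int:
        """Producers append at the end; the returned offset
        uniquely identifies the message within the partition."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset: int):
        """Consumers read sequentially from a chosen offset onward."""
        return self._messages[offset:]

log = PartitionLog()
log.append(b"m0"); log.append(b"m1"); log.append(b"m2")
# Reading from the lowest available offset yields everything, in order:
assert log.read_from(0) == [b"m0", b"m1", b"m2"]
assert log.read_from(2) == [b"m2"]
```

Retention then simply means dropping messages below some offset after a configured time or size threshold.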
Kafka producers
• Each producer writes data as messages to the Kafka cluster
• Producers can be written in any language
• Kafka provides a command-line tool to send messages to the cluster
• Confluent develops a REST (REpresentational State Transfer) server
which can be used by clients written in any language
• Confluent Enterprise includes an MQTT (Message Queuing Telemetry
Transport) proxy that allows direct ingestion of IoT data
Kafka consumers
• Each consumer pulls events from topics as they are written
• The latest messages read are tracked in a special ‘consumer
offsets’ topic
• If necessary, consumers can be reset to start reading from a
specific offset (a parameter to set in the configuration for the
default behavior)
Note: other similar solutions tend to push events
Distributed consumption
• Kafka scales consumption by combining
multiple consumers into consumer groups
• Each consumer in a group is assigned a subset of
partitions for consumption
It’s important to know that traditional systems tend to be point-to-
point, meaning a message is gone once it has been
consumed and can’t be read again. Kafka was designed to work
differently, allowing the data to be consumed multiple times
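The group assignment idea can be sketched as follows. Kafka ships several assignment strategies (range, round-robin, and others); this hedged sketch shows only the round-robin flavor and the core invariant: within a group, each partition goes to exactly one consumer.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions to consumers in a group.
    Illustrative only: Kafka's real assignors also handle rebalances
    when consumers join or leave the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions shared between 2 consumers in the same group:
a = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
assert a == {"c1": [0, 2], "c2": [1, 3]}
```

A second consumer group gets its own independent assignment and offsets, which is how the same data can be consumed multiple times.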
Zookeeper
• Zookeeper is a centralized and distributed service that can be
used to enable highly reliable distributed coordination
• It maintains configuration information (in this context, Kafka
cluster configurations)
• It provides distributed synchronization
• It runs as a cluster and provides resiliency against failures
Kafka & Zookeeper
Kafka uses Zookeeper for various important features
• Cluster management
• Storage of ACLs and passwords
• Failure detection and recovery
Note:
1. Kafka can’t run without Zookeeper
2. In previous Kafka releases (<0.11), the clients had to access
Zookeeper; from 0.11 only the brokers need that access,
so the cluster is isolated from the clients for better security and
performance
Advantages of a pull architecture
• Ability to add more consumers to the system without
reconfiguring the cluster
• Ability for a consumer to go offline and come back later,
resuming from where it left off
• Consumers won’t get overwhelmed by data: each consumer decides
at what speed to fetch data, and slow consumers won’t affect fast
producers
Speeding up data transfer
Kafka is fast, but why?
• Kafka uses the system page cache (a Linux kernel feature) for
producing and consuming messages
• The use of the page cache enables zero-copy, a feature that allows
data to be transferred directly from a local file channel to a remote
socket, saving CPU cycles and memory bandwidth
Kafka metrics
• Kafka metrics can be exposed via JMX and shown through JMX clients
• The types of metrics exposed are:
1. Gauge: instantaneous measurement of one value
2. Meter: measurement of ticks in a time range, e.g. one-minute rate, 10-minute
rate, etc.
3. Histogram: measurement of the distribution of a value, e.g. 50th percentile,
98th percentile, etc.
4. Timer: measurement of timings (meter + histogram)
Kafka uses Yammer metrics on the broker and in the older (<0.9) clients.
New clients use a new internal metrics package. Confluent plans to
consolidate the JMX metrics packages in the future.
Why Replication?
• Each partition is stored on a broker
• Without replication, if a broker goes offline,
the partitions stored on that broker become unavailable and
permanent data loss can occur
• Without redundancy, partitions are not available for reads and
writes while the server is offline, and if the server has a fatal crash,
the data is gone permanently
Kafka uses replication for durability and availability
Replica
• Each partition can have replicas
• Each replica is placed on a different broker
• Replicas are spread evenly across brokers for load balancing
We specify the replication factor at topic creation time
Rack awareness of replicas
• Rack awareness enables each replica to be placed on brokers in
different racks. That helps to improve performance and fault
tolerance.
• Each broker can be configured with a broker.rack property, e.g.
rack-1, us-east-1a
• It’s useful if we need to deploy Kafka on AWS across availability
zones
• Rack awareness was introduced in Confluent 3.0
Replica configurations
• Increase the replication factor for better durability
• For auto-created topics, Kafka uses replication factor 1 by default;
that needs to be configured accordingly in server.properties
kafka-topics --create --zookeeper zookeeper:2181 --partitions 1 --replication-factor 3 --topic mytopic
How brokers are involved in
replication
• Brokers ensure strongly consistent replicas
• One replica is on the leader broker
• All messages produced go to the leader
• The leader propagates those messages to the follower brokers
• All consumers read messages from the leader
Note: very important to understand the above in case of
troubleshooting ;)
Leaders and followers
• Leader:
1. Accepts all reads and writes
2. Manages replicas
3. Leader election rate (meter metric):
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
• Follower:
• Provides fault tolerance
• Keeps up with the leader
• There is a special thread running in the cluster that manages the current list
of leaders and followers for every partition. It’s a complex and mission-
critical task; for this reason there is a replica of this information in
Zookeeper, which is then cached on every broker for faster access
Partition leaders
• Leaders have to be evenly distributed across all brokers for 2
main reasons:
• Leaders can change in case of failure
• Leaders do more work, as discussed in the previous slides
Preferred replica
• When we create a topic, the preferred replica is set automatically
• It’s the first replica in the list of assigned replicas
• kafka-topics --zookeeper zookeeper:2181 --describe --topic my-topic
Topic: my-topic PartitionCount: 1 ReplicationFactor: 3 Configs:
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
In Sync Replica (ISR)
• The in-sync replica list is the list of replicas that are caught up:
leader + followers
• A message is committed once it has been received by every replica in
the list
Note for troubleshooting: where is the ISR list kept? On
the leader
What does committed mean?
• Committed means, in this context, that the message has been
received and written to disk by all in-sync replicas
• Data is not available for consuming until it has been
committed
• Who decides when to commit a message? The leader has
this responsibility
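The leader's commit decision can be reduced to one rule. This is a hedged sketch of the high-watermark idea under a simplified model (every follower in the ISR reports how far it has replicated), not the broker's actual code:

```python
def committed_offset(leader_log_end: int, follower_offsets: list) -> int:
    """The committed point (high watermark) is the minimum log
    position across the leader and its in-sync followers: a message
    counts as committed only once every ISR member has it."""
    return min([leader_log_end] + follower_offsets)

# Leader has written up to offset 10; followers have reached 8 and 10.
# Only messages up to offset 8 are committed and visible to consumers:
assert committed_offset(10, [8, 10]) == 8
```

This is why a slow follower in the ISR delays visibility for consumers, and why Kafka evicts followers that fall too far behind from the ISR.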
Using Kafka command line tools
#create topic with replication factor 1 and partition 1
• kafka-topics.sh --create --zookeeper localhost:2181 --replication-
factor 1 --partitions 1 --topic test
#delete topic with name test
• kafka-topics.sh --delete --zookeeper localhost:2181 --topic test
#list info regarding topic
• kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
#list topics
• kafka-topics.sh --list --zookeeper localhost:2181
Links!
• https://kafka.apache.org
• https://www.confluent.io
• https://www.cesaro.io
