Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Kafka's basic terminology, its architecture, its protocol, and how it works.
Kafka at scale: its caveats, the guarantees it makes, and the use cases it offers.
How we use it @ZaprMediaLabs.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Kafka Tutorial - Introduction to Apache Kafka (Part 1), by Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. It covers Kafka architecture with small examples from the command line, then expands on this with a multi-server example to demonstrate failover of brokers as well as consumers. It then goes through simple Java client examples for a Kafka producer and a Kafka consumer. We have also expanded the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka producers.
Kafka Tutorial - Introduction to the Kafka streaming platform, by Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka?
Introduction to the Kafka streaming platform. Covers Kafka architecture with small examples from the command line, then expands on this with a multi-server example. Lastly, we added simple Java client examples for a Kafka producer and a Kafka consumer. We have started to expand the Java examples to correlate with the design discussion of Kafka, and have also expanded the Kafka design section and added references.
Case Studies on Big-Data Processing and Streaming - Iranian Java User Group, by Amir Sedighi
During recent years, data science has undergone a big shift towards big data processing. As a result, a change in our methodology seems inevitable. This change, however, does not necessarily translate to a loss of decades of investment in classical data processing technologies and data warehousing. Instead, it supports adapting to the new environment of mass-produced business data by adopting modern practices.
In this talk we review some frameworks and solutions for modern big data processing, along with a few case studies that have been carried out in Iran.
Uber has one of the largest Kafka deployments in the industry. To improve scalability and availability, we developed and deployed a novel federated Kafka cluster setup which hides the cluster details from producers and consumers. Users do not need to know which cluster a topic resides on; clients see a single "logical cluster". The federation layer maps clients to the actual physical clusters and keeps the location of the physical cluster transparent to the user. Cluster federation brings several benefits that support our business growth and ease our daily operations:
Client control. Inside Uber there are a large number of applications and clients on Kafka, and it is challenging to migrate a topic with live consumers between clusters. Coordination with the users is usually needed to shift their traffic to the migrated cluster. Cluster federation enables much more control of the clients from the server side, allowing consumer traffic to be redirected to another physical cluster without restarting the application.
Scalability. With federation, the Kafka service can scale horizontally by adding more clusters when a cluster is full. Topics can freely migrate to a new cluster without notifying the users or restarting the clients. Moreover, no matter how many physical clusters we manage per topic type, users see only one logical cluster.
Availability. With a topic replicated to at least two clusters, we can tolerate a single cluster failure by redirecting the clients to the secondary cluster without performing a region failover. This also gives us much more freedom and alleviates the risk of carrying out important maintenance on a critical cluster: before the maintenance, we mark the cluster as secondary and migrate off the live traffic and consumers.
We will present the details of the architecture and several interesting technical challenges we overcame.
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp, by José Román Martín Gil
Apache Kafka is the most widely used data streaming broker among companies. It can manage millions of messages easily, and it is the base of many architectures built on events, microservices, orchestration, and now cloud environments. OpenShift is the most widespread Platform as a Service (PaaS). It is based on Kubernetes, and it helps companies easily deploy any kind of workload in a cloud environment. Thanks to many of its features, it is the base of many architectures built on stateless applications for new Cloud Native Applications. Strimzi is an open source community that implements a set of Kubernetes Operators to help you manage and deploy Apache Kafka brokers in OpenShift environments.
These slides introduce Strimzi as a new component on OpenShift for managing your Apache Kafka clusters.
Slides used at OpenShift Meetup Spain:
- https://www.meetup.com/es-ES/openshift_spain/events/261284764/
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015, by Monal Daxini
Keystone processes over half a trillion events per day, with peaks of 8 million events and 17 GB per second, with at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state, and at where the pipeline is headed next: offering self-service stream processing infrastructure atop the Kafka-based pipeline and supporting Spark Streaming.
This is the first part of the presentation.
Here is the second part of this presentation:
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter (Hosted by Confluent)
Until recently, the Messaging team at Twitter had been running an in-house-built pub/sub system, EventBus (built on top of Apache DistributedLog and Apache BookKeeper, and similar in architecture to Apache Pulsar), to cater to our pub/sub needs. In 2018, we made the decision to move to Apache Kafka, migrating existing use cases as well as onboarding new use cases directly onto Apache Kafka. Fast forward to today: Kafka is now an essential piece of Twitter infrastructure and processes over 200M messages per second. In this talk, we will share the learnings and challenges from our journey moving to Apache Kafka.
In this session you will learn:
1. Kafka Overview
2. Need for Kafka
3. Kafka Architecture
4. Kafka Components
5. ZooKeeper Overview
6. Leader Node
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
Apache Kafka - Scalable Message-Processing and more!, by Guido Schmutz
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present its role in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka into the Oracle stack, with products such as GoldenGate, Service Bus, and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
10. Message Delivery Semantics
● At most once
– Messages may be lost but are never redelivered.
● At least once
– Messages are never lost but may be redelivered.
● Exactly once
– This is what people actually want.
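In client code, the difference between the first two semantics comes down to when the consumer commits its offsets relative to processing. Below is a minimal sketch using the standard Java kafka-clients consumer API (the topic name "test" matches the CLI examples later in the deck; the group id is a placeholder): committing after processing gives at-least-once, while committing before processing would give at-most-once.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemanticsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");      // placeholder group id
        props.put("enable.auto.commit", "false"); // commit manually to control semantics
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // At-least-once: process first, commit after. A crash between the two
                // re-delivers the batch, so processing must tolerate duplicates.
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync();
                // At-most-once would instead commit *before* processing:
                // a crash after the commit silently loses the batch.
            }
        }
    }
}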
11. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

12. Apache Kafka
● Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
– Kafka is super fast.
– Kafka is scalable.
– Kafka is durable.
– Kafka is distributed by design.
15. Apache Kafka
● A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.
17. Apache Kafka
● Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
19. Apache Kafka
● Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
21. Apache Kafka
● Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
26. Topic
● Topic
● Producer
● Consumer
● Broker

● Kafka maintains feeds of messages in categories called topics.
● Topics are the highest level of abstraction that Kafka provides.
34. Consumer
● Topic
● Producer
● Consumer
● Broker

● We'll call processes that subscribe to topics and process the feed of published messages, consumers.
– Hadoop Consumer
39. Topics
● A topic is a category or feed name to which messages are published.
● The Kafka cluster maintains a partitioned log for each topic.
40. Partition
● A partition is an ordered, immutable sequence of messages that is continually appended to, like a commit log.
● The messages in the partitions are each assigned a sequential id number called the offset (illustrated in the sketch below).
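Offsets are how a consumer addresses its position within a partition. Here is a minimal sketch, assuming the standard Java kafka-clients API and an existing topic "test" (the partition number and the offset 42 are purely illustrative), that rewinds a consumer to a chosen offset:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetSeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "seek-demo"); // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("test", 0); // partition 0 of topic "test"
            consumer.assign(List.of(tp)); // manual assignment, no group rebalance
            consumer.seek(tp, 42L);       // jump to offset 42 (illustrative)
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}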
44. Producer
● The producer is responsible for choosing which message to assign to which partition within the topic.
– Round-Robin
– Load-Balanced
– Key-Based (Semantic-Oriented); see the sketch below
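Below is a minimal keyed-producer sketch using the standard Java kafka-clients API (the broker address and topic match the CLI examples that follow; the key "user-42" and the value are placeholders). With the default partitioner, records sharing a key hash to the same partition, which is the key-based strategy above; a null key instead spreads records across partitions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key-based (semantic) partitioning: the default partitioner hashes the
            // key, so "user-42" always maps to the same partition of "test".
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("test", "user-42", "clicked-checkout");
            RecordMetadata meta = producer.send(record).get(); // block for the ack
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}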
53. Create Topic
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> Created topic "test".
54. List all Topics
● bin/kafka-topics.sh --list --zookeeper localhost:2181
55. Send some Messages by Producer
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Hello DatisPars Guys!
How is it going with you?
56. Start a Consumer
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
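Note that these commands reflect the ZooKeeper-era tooling this deck was written against. In newer Kafka releases the same tools connect directly to a broker with --bootstrap-server instead of --zookeeper, for example:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning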
59. Use Cases
● Messaging
– Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ.
● Kafka provides customizable latency.
● Kafka has better throughput.
● Kafka is highly fault-tolerant.
60. Use Cases
● Log Aggregation
– Many people use Kafka as a replacement for a log aggregation solution.
– Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS, perhaps) for processing.
– In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
● Lower latency
● Easier support
61. Use Cases
● Stream Processing
– Storm and Samza are popular frameworks for stream processing. They both use Kafka.
● Event Sourcing
– Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
● Commit Log
– Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
62. Message Format
/**
 * A message. The format of an N byte message is the following:
 * If magic byte is 0
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 4 byte CRC32 of the payload
 *   3. N - 5 byte payload
 * If magic byte is 1
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 1 byte "attributes" identifier to allow annotations on the message
 *      independent of the version (e.g. compression enabled, type of codec used)
 *   3. 4 byte CRC32 of the payload
 *   4. N - 6 byte payload
 */
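To make the layout concrete, here is a small sketch that decodes one such message from a ByteBuffer and verifies its checksum. It is written directly from the comment above, so the field order and sizes assume exactly that format:

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class MessageParserSketch {
    // Parses the layout documented above and verifies the CRC32 of the payload.
    public static byte[] parse(ByteBuffer buf) {
        byte magic = buf.get();          // 1 byte "magic" identifier (format version)
        if (magic == 1) {
            byte attributes = buf.get(); // 1 byte "attributes" (e.g. compression codec)
        }
        long expectedCrc = buf.getInt() & 0xFFFFFFFFL; // 4 byte CRC32, read as unsigned
        byte[] payload = new byte[buf.remaining()];    // N - 5 (or N - 6) byte payload
        buf.get(payload);

        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != expectedCrc) {
            throw new IllegalStateException("CRC mismatch: message corrupted");
        }
        return payload;
    }
}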