Distributed messaging with Apache Kafka

Saumitra Srivastav
@_saumitra_
http://www.meetup.com/Bangalore-Apache-Kafka-Group/

Introduction
Kafka is a:
• distributed
• replicated
• persistent
• partitioned
• high throughput
• pub-sub
messaging system.
Originally developed at LinkedIn. Written in Scala.

Demo Application
Twitter stream analytics

[Architecture diagram: Twitter Streaming API -> Stream Producer -> Kafka Cluster (Broker-1, Broker-2, Broker-3) -> Solr-1 and Solr-2 (realtime search), Cassandra-1 and Cassandra-2 (data store for longer retention), and Sentiment Analysis]

Terminology
Topics: categories in which feeds of
messages are maintained.
Producers: processes that publish
messages to a Kafka topic.
Consumers: processes that subscribe
to topics and process the feed of
published messages.
Brokers: servers that form a Kafka
cluster and act as the data transport
channel between producers and
consumers.
[Diagram: two producers -> Kafka Cluster (three brokers) -> two consumers]

Simplified View of a Kafka System
[Diagram: Producer 1, Producer 2 -> Kafka cluster (Broker 1, Broker 2, Broker 3) with Zookeeper -> Consumer 1, Consumer 2, Consumer 3]

Topics and Partitions
[Diagram: Topic 1 (error log) and Topic 2 (security log), each divided into partitions]

Partitions
• Each partition is an ordered, immutable sequence of
messages.
• Messages are continuously appended to it.
• Each message in a partition is assigned a unique
sequential ID number called the offset.
• Any message in a partition can be accessed using its
offset.
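
The offset idea above can be sketched as a toy in-memory log. This is only an illustration, not Kafka's storage code: the `Partition` class and its `append`/`read` methods are hypothetical names, and a real partition lives on disk.

```python
class Partition:
    """Toy in-memory model of one partition: an append-only message log."""

    def __init__(self):
        self._log = []  # the list index doubles as the offset

    def append(self, message: bytes) -> int:
        """Append a message and return the offset assigned to it."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset: int) -> bytes:
        """Return the message stored at the given offset."""
        if offset < 0 or offset >= len(self._log):
            raise IndexError(f"offset {offset} out of range")
        return self._log[offset]


partition = Partition()
# Offsets are handed out sequentially as messages are appended.
offsets = [partition.append(m) for m in (b"first", b"second", b"third")]
```

Existing entries are never modified; a reader only needs an offset to fetch a message, for example `partition.read(1)`.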

Partitions
• Partitions serve two purposes:
1. Scaling
2. Parallelism
• Scaling
A topic can be divided into multiple partitions, and
each partition can be on a different server.
• Parallelism
A consumer can consume from multiple partitions at
the same time (ordering is guaranteed within each partition).

Distribution & Replication
• The partitions of the log are distributed over the servers
in the Kafka cluster.
• Each server handles data and requests for a share of the
partitions.
• Each partition is replicated for fault tolerance.
• Each partition has one server which acts as the leader.
• The leader handles all read and write requests for the
partition.
• Followers passively replicate the leader.
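
One way to picture the leader/follower layout is a simple round-robin placement of replicas over brokers. This is a simplified sketch under that assumption, not Kafka's actual assignment algorithm; `assign_replicas` and the broker names are hypothetical.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin.

    Returns {partition: [leader, follower, ...]}; the first replica in each
    list is treated as the leader, the rest as followers.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed number of brokers")
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        ]
    return assignment


layout = assign_replicas(4, ["broker-1", "broker-2", "broker-3"], 2)
```

With this layout every broker leads some partitions and follows others, so both data and request load are spread over the cluster.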

Producers
• Producers publish data to the topics of their choice.
• A producer chooses which partition of the topic each
message is assigned to.
• Partitions can be selected round-robin for load balancing,
or by a partitioning function (for example, on a message key).
• Kafka doesn’t care about the serialization format. All it
needs is a byte array.
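
A minimal sketch of producer-side partition choice and serialization. It does not use the Kafka client API: `choose_partition`, `serialize`, the partition count, the CRC32 key hash and the JSON encoding are all assumptions made for illustration.

```python
import itertools
import json
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for the topic
_round_robin = itertools.cycle(range(NUM_PARTITIONS))


def choose_partition(key=None):
    """Pick a partition: hash of the key if one is given, else round-robin."""
    if key is None:
        return next(_round_robin)
    return zlib.crc32(key) % NUM_PARTITIONS


def serialize(event: dict) -> bytes:
    """The producer decides the encoding; here it is JSON encoded as UTF-8."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


unkeyed = [choose_partition() for _ in range(4)]  # cycles through 0, 1, 2, 0
payload = serialize({"user": "alice", "text": "hello"})
```

Hashing a key sends every message with that key to the same partition, which keeps those messages in order relative to each other; round-robin simply spreads the load.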

Consumers
• Messaging systems traditionally follow one of two models:
• Queuing
• Publish-Subscribe
• Kafka uses the concept of a consumer group, which generalizes
both of these models.
• Consumers label themselves with a consumer group name.
• Each message published to a topic is delivered to one
consumer instance within each subscribing consumer group.
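
The delivery rule can be sketched by giving each partition exactly one owner per group. This is a toy model; the group and consumer names and the modulo-based `assign_partitions` are assumptions, not Kafka's rebalancing logic.

```python
def assign_partitions(partitions, consumers):
    """Split partitions among one group's consumers; each has one owner."""
    return {p: consumers[i % len(consumers)] for i, p in enumerate(partitions)}


groups = {
    "group-A": ["consumer-1", "consumer-2"],  # members share the load
    "group-B": ["consumer-3"],                # sole member sees everything
}
partitions = [0, 1, 2, 3]
owners = {g: assign_partitions(partitions, members) for g, members in groups.items()}


def receivers(partition):
    """Consumers that receive a message written to the given partition."""
    return {g: owners[g][partition] for g in groups}


who = receivers(2)  # one consumer per subscribing group
```

If all consumers share one group the topic behaves like a queue; if every consumer has its own group it behaves like publish-subscribe.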

Consumer Groups
[Diagram: Zookeeper with Broker 1, Broker 2 and Broker 3; Consumer 1, Consumer 2 and Consumer 3 arranged into Consumer-Group A and Consumer-Group B]

Consumer Groups
[Diagram: Topic-1 spread across Broker 1, Broker 2 and Broker 3 (with Zookeeper); partitions P0, P3, P5, P2 and P4 are divided among Consumer 1, Consumer 2 and Consumer 3 in Consumer-Group A and Consumer-Group B]

Message Persistence
• Unlike in many other messaging systems, messages are not
deleted on consumption.
• Messages are retained for a configurable period of
time, after which they are deleted (even if they have
NOT been consumed).
• Consumers can re-consume any chunk of older
messages using message offsets.
• Kafka performance is effectively constant with respect
to data size, so retaining large volumes of data is not an issue.
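
Time-based retention and replay by offset can be sketched with a toy log. The `RetainedLog` class is hypothetical, timestamps are passed in explicitly instead of read from a clock, and Kafka's real on-disk retention mechanics are not modeled.

```python
class RetainedLog:
    """Toy log that expires messages by age, not by consumption."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._entries = []      # tuples of (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message, now):
        self._entries.append((self._next_offset, now, message))
        self._next_offset += 1

    def expire(self, now):
        """Drop entries older than the retention period, consumed or not."""
        self._entries = [
            e for e in self._entries if now - e[1] <= self.retention_seconds
        ]

    def read_from(self, offset):
        """Return all retained messages at or after the given offset."""
        return [m for (o, _, m) in self._entries if o >= offset]


log = RetainedLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=30)
log.append("c", now=50)

first_pass = log.read_from(0)    # reading does not delete anything
replay = log.read_from(1)        # a consumer can rewind to an older offset
log.expire(now=80)               # "a" is now 80s old and exceeds retention
after_expiry = log.read_from(0)
```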

Demo
Running a multi-broker Kafka cluster

Guarantees
1. Ordering guarantee
• Messages sent by a producer to a particular topic partition will be
appended in the order they are sent.
• A consumer instance sees messages in the order they are stored in the
log.
2. At-least-once delivery
3. Fault tolerance
For a topic with replication factor N, up to N-1 server failures will not cause
any loss of committed messages.
4. No corruption of data:
• Over the network
• On the disk
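
The slide does not say how at-least-once delivery is obtained. A common way to get it, assumed here, is for the consumer to record its offset only after processing a message, so a crash in between causes a re-read rather than a loss. The `consume` function and its `crash_before_commit_at` switch are hypothetical.

```python
def consume(log, committed_offset, process, crash_before_commit_at=None):
    """Process messages from committed_offset, committing after each one.

    Returns the committed offset. If crash_before_commit_at is set, the
    consumer "crashes" after processing that offset but before committing it.
    """
    offset = committed_offset
    while offset < len(log):
        process(log[offset])
        if offset == crash_before_commit_at:
            return offset       # crash: this offset was never committed
        offset += 1             # commit: the next read starts after this message
    return offset


log = ["m0", "m1", "m2"]
seen = []
committed = consume(log, 0, seen.append, crash_before_commit_at=1)
committed = consume(log, committed, seen.append)  # restart from the last commit
```

After the restart `m1` has been processed twice and nothing has been skipped, which is why consumers in this model need to tolerate duplicates.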

Demo
Consumer/Producer Java API

Misc Design features
1. Stateless broker
• Each consumer maintains its own state (offset)
2. Load balancing
3. Asynchronous send
4. Push/pull model (producers push, consumers pull) instead of push/push
5. Consumer Position
6. Offline Data Load
7. Simple API
8. Low Overhead
9. Batch send and receive
10. No message caching in JVM
11. Rely on file system buffering
• mostly sequential access patterns
12. Zero-copy transfer: file->socket
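
As an illustration of the batch send item in the list above, here is a minimal sketch of a sender that buffers messages and ships them in groups. `BatchingSender` is a hypothetical class, and the `send_batch` callback stands in for whatever would actually write to the network.

```python
class BatchingSender:
    """Buffers messages and hands them to `send_batch` in groups."""

    def __init__(self, send_batch, batch_size):
        self.send_batch = send_batch
        self.batch_size = batch_size
        self._buffer = []

    def send(self, message):
        """Queue a message; flush automatically once the batch is full."""
        self._buffer.append(message)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as one batch, if anything."""
        if self._buffer:
            self.send_batch(list(self._buffer))
            self._buffer.clear()


requests = []  # each element represents one simulated network request
sender = BatchingSender(requests.append, batch_size=3)
for i in range(7):
    sender.send(f"msg-{i}")
sender.flush()  # push out the final partial batch
```

Seven messages go out in three requests instead of seven, which is the kind of overhead reduction batching is meant to provide.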

Use Cases
1. Messaging
2. Website Activity Tracking
3. Metrics
4. Log Aggregation
5. Stream Processing

Thanks
Website: http://kafka.apache.org/
Doc: http://kafka.apache.org/documentation.html
Mailing Lists: users@kafka.apache.org
Questions?
