Distributed messaging with
Apache Kafka
Saumitra Srivastav
@_saumitra_
http://www.meetup.com/Bangalore-Apache-Kafka-Group/...
Introduction
Kafka is a:
• distributed
• replicated
• persistent
• partitioned
• high throughput
• pub-sub
messaging syste...
Demo Application
Twitter stream analytics
Stream
Producer
Broker-1 Broker-2 Broker-3
Twitter
Streaming API
Kafka Cluster
Solr-1
Realtime search
Solr-2 Cassandra-1
D...
Terminology
Topics: categories under which message
feeds are maintained
Producers: processes that publish
messages to a Kafka to...
Simplified View of a Kafka System
Zookeeper Broker 1 Broker 2 Broker 3
Producer 1 Producer 2
Consumer 1 Consumer 2 Consumer...
Topics and Partitions
TOPIC – 1
(error log)
TOPIC – 2
(security log)
Partitions
• Each partition is an ordered, immutable sequence of
messages.
• Messages are continuously appended to it.
• E...
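The partition model on this slide can be sketched as a toy in-memory log. This is only an illustration, not Kafka's storage code: the `Partition` class and its method names are hypothetical, and a real broker persists the log to disk.

```python
class Partition:
    """Toy append-only log (hypothetical class, not part of any Kafka API)."""

    def __init__(self):
        self._log = []  # stored messages; a message's index is its offset

    def append(self, message: bytes) -> int:
        """Append a message to the end of the log and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset: int) -> bytes:
        """Return the message stored at the given offset."""
        if not 0 <= offset < len(self._log):
            raise IndexError(f"offset {offset} out of range")
        return self._log[offset]


partition = Partition()
offsets = [partition.append(m) for m in (b"first", b"second", b"third")]
second = partition.read(1)  # any stored message can be looked up by its offset
```

Because messages are only ever appended, an offset handed out once keeps pointing at the same message.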
Partitions
• Partitions serve two purposes:
1. Scaling
2. Parallelism
• Scaling
A topic can be divided into multiple partit...
Distribution & Replication
• The partitions of the log are distributed over the Kafka cluster
• Each server handles data and r...
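A minimal sketch of the leader/follower idea, under loud assumptions: followers copy the leader synchronously on every write, and the first surviving replica becomes the new leader. Real Kafka tracks in-sync replicas and elects leaders differently; `ReplicatedPartition` and the broker names are hypothetical.

```python
class ReplicatedPartition:
    """Toy replicated partition (hypothetical, simplified failover rules)."""

    def __init__(self, brokers):
        self._order = list(brokers)             # fixed order used for failover
        self.logs = {b: [] for b in brokers}    # each broker's copy of the log
        self.alive = set(brokers)
        self.leader = self._order[0]

    def write(self, message):
        """All writes go to the leader; live followers then copy its log."""
        if self.leader is None:
            raise RuntimeError("no replica left to serve the partition")
        self.logs[self.leader].append(message)
        for broker in self.alive - {self.leader}:
            self.logs[broker] = list(self.logs[self.leader])

    def read(self):
        """All reads are served from the leader's copy."""
        if self.leader is None:
            raise RuntimeError("no replica left to serve the partition")
        return list(self.logs[self.leader])

    def fail(self, broker):
        """Mark a broker as failed; promote a survivor if it was the leader."""
        self.alive.discard(broker)
        if broker == self.leader:
            survivors = [b for b in self._order if b in self.alive]
            self.leader = survivors[0] if survivors else None


replicated = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
replicated.write("m0")
replicated.write("m1")
replicated.fail("broker-1")
replicated.fail("broker-2")
surviving_leader = replicated.leader
surviving_data = replicated.read()
```

With three replicas, two failures still leave one complete copy of the log, which matches the "replication factor N tolerates N-1 failures" guarantee stated later in the deck.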
Producers
• Producers publish data to the topics of their choice.
• A producer can choose the partition of the topic to which
mes...
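The producer behaviour described here can be sketched without any Kafka client. The `RoundRobinProducer` class below is hypothetical: it accepts only byte arrays, lets the caller pin a partition, and otherwise rotates through partitions for load balancing. JSON is just one possible serialization; the log itself never inspects the payload.

```python
import itertools
import json


class RoundRobinProducer:
    """Toy producer (hypothetical, not the Kafka client API)."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = itertools.cycle(range(num_partitions))

    def send(self, value: bytes, partition=None) -> int:
        """Store value in the chosen partition and return that partition."""
        if not isinstance(value, bytes):
            raise TypeError("payload must be a byte array")
        if partition is None:
            partition = next(self._next)  # round robin when none is given
        self.partitions[partition].append(value)
        return partition


producer = RoundRobinProducer(num_partitions=3)
chosen = [
    producer.send(json.dumps({"id": i}).encode("utf-8"))
    for i in range(6)
]
pinned = producer.send(b"pinned", partition=2)  # explicit partition choice
```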
Consumers
• Other messaging systems basically follow 2 models:
• Queuing
• Publish-Subscribe
• Kafka uses a concept of con...
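To see how consumer groups cover both models, here is a small simulation. It assumes partitions are spread over a group's consumers in simple round-robin order, which is an illustrative choice rather than Kafka's actual assignment logic; the function names are hypothetical.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Give each partition exactly one owner within a group (round robin)."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)


def deliver(messages, groups):
    """Deliver each (partition, payload) to one consumer in every group."""
    partitions = sorted({p for p, _ in messages})
    received = defaultdict(list)
    for group, consumers in groups.items():
        owner = {
            p: consumer
            for consumer, owned in assign_partitions(partitions, consumers).items()
            for p in owned
        }
        for partition, payload in messages:
            received[(group, owner[partition])].append(payload)
    return dict(received)


messages = [(0, "m0"), (1, "m1"), (2, "m2"), (0, "m3")]
groups = {"A": ["a1", "a2"], "B": ["b1"]}
received = deliver(messages, groups)
```

Group A splits the messages between its two consumers (queue-like behaviour), while group B's single consumer sees every message (publish-subscribe behaviour across groups).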
Consumers
Consumer Groups
Zookeeper Broker 1 Broker 2 Broker 3
Producer 1 Producer 2
Consumer 1 Consumer 2 Consumer 3
Consumer-Group ...
Consumer groups
Zookeeper
Broker 1
Topic-1
Broker 2
Topic-1
Broker 3
Topic-1
Producer 1 Producer 2
Consumer 1
Consumer-Gro...
Message Persistence
• Unlike in other messaging systems, messages are not
deleted on consumption.
• Messages are retained until ...
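A rough sketch of time-based retention, assuming a single retention window applied per message. Real Kafka expires whole log segments rather than individual messages, so treat `RetainedLog` as a hypothetical simplification. Timestamps are passed in explicitly to keep the example deterministic.

```python
class RetainedLog:
    """Toy log with time-based retention (hypothetical simplification)."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self._entries = []       # (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message, now: float) -> int:
        offset = self._next_offset
        self._entries.append((offset, now, message))
        self._next_offset += 1
        return offset

    def read_from(self, offset: int):
        """Reading never deletes anything, so it can be repeated freely."""
        return [(o, m) for o, _, m in self._entries if o >= offset]

    def expire(self, now: float) -> None:
        """Drop messages older than the retention window, consumed or not."""
        cutoff = now - self.retention_seconds
        self._entries = [e for e in self._entries if e[1] >= cutoff]


log = RetainedLog(retention_seconds=60)
log.append("old", now=0)
log.append("mid", now=30)
log.append("new", now=90)
first_pass = log.read_from(0)
second_pass = log.read_from(0)   # re-consuming from an older offset
log.expire(now=100)              # cutoff is t=40, so "old" and "mid" go
after_expiry = log.read_from(0)
```

Offsets keep their original values after expiry, so a consumer's saved position stays meaningful.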
Demo
Running a multi-broker Kafka cluster
Guarantees
1. Ordering guarantee
• Messages sent by a producer to a particular topic partition will be
appended in the ord...
Demo
Consumer/Producer Java API
Misc Design features
1. Stateless broker
• Each consumer maintains its own state (offset)
2. Load balancing
3. Asynchronous...
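The "stateless broker" point can be illustrated with a pull-based toy. This is a sketch under assumptions: the broker below only answers fetch requests and remembers nothing about who has read what, while each consumer tracks its own offset. `Broker` and `Consumer` are hypothetical classes, not the Kafka Java API.

```python
class Broker:
    """Toy broker that keeps no per-consumer state (hypothetical)."""

    def __init__(self, messages):
        self._log = list(messages)

    def fetch(self, offset: int, max_messages: int):
        """Return up to max_messages starting at offset (batch receive)."""
        return self._log[offset:offset + max_messages]


class Consumer:
    """Toy consumer that owns its position in the log."""

    def __init__(self, broker: Broker, batch_size: int = 2):
        self.broker = broker
        self.batch_size = batch_size
        self.offset = 0  # consumer-side state

    def poll(self):
        """Pull the next batch and advance the local offset."""
        batch = self.broker.fetch(self.offset, self.batch_size)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Rewind or skip by changing the locally stored offset."""
        self.offset = offset


broker = Broker(["a", "b", "c"])
fast = Consumer(broker, batch_size=2)
slow = Consumer(broker, batch_size=1)
fast_batches = [fast.poll(), fast.poll()]
slow_batch = slow.poll()
fast.seek(0)
replayed = fast.poll()
```

Because position lives with the consumer, two consumers can read the same log at different speeds, and either can rewind without the broker's involvement.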
Use Cases
1. Messaging
2. Website Activity Tracking
3. Metrics
4. Log Aggregation
5. Stream Processing
Thanks
Website: http://kafka.apache.org/
Doc: http://kafka.apache.org/documentation.html
Mailing Lists: users@kafka.apache...

Slides from the Kafka meetup at http://www.meetup.com/Bangalore-Apache-Kafka-Group


Distributed messaging with Apache Kafka

  1. Distributed messaging with Apache Kafka. Saumitra Srivastav, @_saumitra_, http://www.meetup.com/Bangalore-Apache-Kafka-Group/
  2. Introduction
     Kafka is a distributed, replicated, persistent, partitioned, high-throughput, pub-sub messaging system. Originally developed at LinkedIn. Written in Scala.
  3. Demo Application
     Twitter stream analytics
  4. [Architecture diagram. Labels: Twitter Streaming API; Stream Producer; Kafka Cluster (Broker-1, Broker-2, Broker-3); Solr-1, Solr-2 (realtime search); Cassandra-1, Cassandra-2 (data store for longer retention); Sentiment Analysis]
  5. Terminology
     • Topics: categories under which message feeds are maintained.
     • Producers: processes that publish messages to a Kafka topic.
     • Consumers: processes that subscribe to topics and process the feed of published messages.
     • Brokers: servers that form a Kafka cluster and act as a data transport channel between producers and consumers.
     [Diagram labels: Producer, Producer; Kafka Cluster (Broker, Broker, Broker); Consumer, Consumer]
  6. Simplified View of a Kafka System
     [Diagram labels: Zookeeper; Broker 1, Broker 2, Broker 3; Producer 1, Producer 2; Consumer 1, Consumer 2, Consumer 3]
  7. Topics and Partitions
     [Diagram labels: TOPIC – 1 (error log); TOPIC – 2 (security log)]
  8. Partitions
     • Each partition is an ordered, immutable sequence of messages.
     • Messages are continuously appended to it.
     • Each message in a partition is assigned a unique sequential id number called the offset.
     • Any message in a partition can be accessed using this offset.
  9. Partitions
     • Partitions serve two purposes: scaling and parallelism.
     • Scaling: a topic can be divided into multiple partitions, and each partition can be on a different server.
     • Parallelism: a consumer can consume from multiple partitions at the same time (while maintaining the ordering guarantee within each partition).
  10. Distribution & Replication
      • The partitions of the log are distributed over the Kafka cluster.
      • Each server handles data and requests for some number of partitions.
      • Each partition is replicated for fault tolerance.
      • Each partition has one server which acts as the leader.
      • The leader handles all read and write requests for the partition.
      • Followers keep replicating the leader.
  11. Producers
      • Producers publish data to the topics of their choice.
      • A producer can choose the partition of the topic to which a message should be assigned.
      • The partition can be selected in a round-robin manner for load balancing.
      • Kafka doesn't care about the serialization format. All it needs is a byte array.
  12. Consumers
      • Other messaging systems basically follow two models: queuing and publish-subscribe.
      • Kafka uses the concept of a consumer group, which generalizes both of these models.
      • Consumers label themselves with a consumer group name.
      • Each message published to a topic is delivered to one consumer instance within each subscribing consumer group.
  13. Consumers
  14. Consumer Groups
      [Diagram labels: Zookeeper; Broker 1, Broker 2, Broker 3; Producer 1, Producer 2; Consumer 1, Consumer 2, Consumer 3; Consumer-Group A, Consumer-Group B]
  15. Consumer groups
      [Diagram labels: Zookeeper; Broker 1, Broker 2, Broker 3 (each labeled Topic-1); partitions P0, P3, P5, P2, P4; Producer 1, Producer 2; Consumer 1, Consumer 2, Consumer 3; Consumer-Group A, Consumer-Group B]
  16. Message Persistence
      • Unlike in other messaging systems, messages are not deleted on consumption.
      • Messages are retained for a configurable period of time, after which they are deleted (even if they have NOT been consumed).
      • Consumers can re-consume any chunk of older messages using the message offset.
      • Kafka performance is effectively constant with respect to data size, so a huge data size is not an issue.
  17. Demo
      Running a multi-broker Kafka cluster
  18. Guarantees
      1. Ordering guarantee
         • Messages sent by a producer to a particular topic partition will be appended in the order they are sent.
         • A consumer instance sees messages in the order they are stored in the log.
      2. At-least-once delivery
      3. Fault tolerance: for a topic with replication factor N, up to N-1 server failures will not cause any data loss.
      4. No corruption of data, either over the network or on the disk
  19. Demo
      Consumer/Producer Java API
  20. Misc Design features
      1. Stateless broker: each consumer maintains its own state (offset)
      2. Load balancing
      3. Asynchronous send
      4. Push/pull model instead of push/push
      5. Consumer position
      6. Offline data load
      7. Simple API
      8. Low overhead
      9. Batch send and receive
      10. No message caching in the JVM
      11. Rely on file system buffering (mostly sequential access patterns)
      12. Zero-copy transfer: file -> socket
  21. Use Cases
      1. Messaging
      2. Website Activity Tracking
      3. Metrics
      4. Log Aggregation
      5. Stream Processing
  22. Thanks
      Website: http://kafka.apache.org/
      Doc: http://kafka.apache.org/documentation.html
      Mailing Lists: users@kafka.apache.org
      Questions?