APACHE KAFKA DEMYSTIFIED
Shanki Singh Gandhi
@shankisg
OVERVIEW
Apache Kafka is an open-source stream-processing platform developed by
the Apache Software Foundation and written in Scala and Java. The project
aims to provide a unified, high-throughput, low-latency platform for handling
real-time data feeds.
KEY POINTS
 Kafka is run as a cluster on one or more servers.
 The Kafka cluster stores streams of records in categories called topics.
 Each record consists of a key, a value, and a timestamp.
CONCEPTS
 Producer: Application that sends the messages.
 Consumer: Application that receives the messages.
 Message: Information that is sent from the producer to a consumer through Apache Kafka.
 Connection: A connection is a TCP connection between your application and the Kafka broker.
 Topic: A Topic is a category/feed name to which messages are stored and published.
 Topic partition: Kafka topics are divided into a number of partitions, which allows you to split data across
multiple brokers.
 Replica: A replica of a partition is a "backup" of a partition. Follower replicas do not serve client reads or
writes; they exist to prevent data loss.
 Consumer Group: A consumer group includes the set of consumer processes that are subscribing to a
specific topic.
 Offset: The offset is a unique identifier of a record within a partition. It denotes the position of the consumer
in the partition.
 Node: A node is a single computer in the Apache Kafka cluster.
 Cluster: A cluster is a group of nodes i.e., a group of computers.
KAFKA ARCHITECTURE
KAFKA TOPIC
 Topic is a category or feed name to which records are published.
 Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that
subscribe to the data written to it.
 Each partition is an ordered, immutable sequence of records that is continually appended to a structured
commit log.
 The records in the partitions are each assigned a sequential id number called the offset that uniquely
identifies each record within the partition.
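The offset-per-partition idea can be sketched in plain Python. This is an illustrative model, not Kafka's implementation: a partition behaves like an append-only list, and a record's offset is simply its position in that list.

```python
# Illustrative model of one partition: an append-only commit log
# where each record's offset is its sequential position in the log.
partition = []

def append(record):
    offset = len(partition)   # next sequential id in this partition
    partition.append(record)
    return offset

first = append({"value": "a"})   # offset 0
second = append({"value": "b"})  # offset 1
```

Because the log is append-only and immutable, a consumer's position in the partition is fully described by a single integer offset.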
PARTITION AND BROKER
KAFKA APIS
• Producer API
• Consumer API
• Streams API
• Connector API
PRODUCER
 Producers publish data to the topics of their choice.
 Producers write to the leader of a partition; this load-balances production so that
each write can be serviced by a separate broker and machine.
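How a producer picks a partition can be sketched as follows. This is an illustrative hash-mod scheme (Kafka's default partitioner actually uses murmur2 hashing of the key bytes), but the principle is the same: records with the same key always land on the same partition, preserving per-key ordering.

```python
NUM_PARTITIONS = 3

def pick_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Illustrative only: hash the key bytes and take it modulo the
    # partition count. Kafka's default partitioner uses murmur2.
    return sum(key) % num_partitions

# The same key always maps to the same partition.
p1 = pick_partition(b"user-42")
p2 = pick_partition(b"user-42")
```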
CONSUMERS AND CONSUMER GROUPS
 Consumers label themselves with a consumer group name, and each record published to a topic is
delivered to one consumer instance within each subscribing consumer group. Consumer instances
can be in separate processes or on separate machines.
 If all the consumer instances have the same consumer group, then the records will effectively be
load balanced over the consumer instances.
 If all the consumer instances have different consumer groups, then each record will be broadcast to
all the consumer processes.
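The load-balancing behavior within a group can be sketched with a simple round-robin assignment. This is illustrative only; in real Kafka the group coordinator runs a configurable assignor (range or round-robin strategies), but the invariant is the same: each partition is consumed by exactly one member of the group.

```python
def assign(partitions, consumers):
    # Each partition goes to exactly one consumer in the group,
    # so records are load-balanced across the group's members.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

groups = assign([0, 1, 2, 3], ["consumer-a", "consumer-b"])
```

With two groups subscribed to the same topic, each group gets its own full copy of this assignment, which is why records are broadcast across groups but load-balanced within one.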
START KAFKA SERVER
 Download Kafka 1.0.0 (kafka_2.11-1.0.0.tgz) from the Apache Kafka downloads page
 Extract the code
tar -xzf kafka_2.11-1.0.0.tgz
cd kafka_2.11-1.0.0
 Start zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
 Start kafka server
bin/kafka-server-start.sh config/server.properties
BASIC KAFKA CLI COMMANDS
Create topic
 bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
List topics
 bin/kafka-topics.sh --list --zookeeper localhost:2181
Start producer
 bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Start consumer
 bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
CONNECTING KAFKA FROM PYTHON
 kafka-python
 Install kafka-python: pip install kafka-python
 Github: https://github.com/dpkp/kafka-python
 Documentation: https://kafka-python.readthedocs.io/en/master/index.html
PRODUCER SAMPLE CODE
import json
from kafka import KafkaProducer
# Send json data to a kafka topic
producer = KafkaProducer(value_serializer=json.dumps, bootstrap_servers=[kafka_url])
data = {key: value}
producer.send(“my-topic”, data )
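kafka-python transmits values as raw bytes on the wire, so a JSON value serializer should ultimately produce bytes. The dict-to-bytes step can be checked standalone, without a running broker:

```python
import json

# A JSON value serializer: dict -> JSON string -> UTF-8 bytes.
# This is the shape kafka-python expects value_serializer to have.
def serialize(value):
    return json.dumps(value).encode("utf-8")

payload = serialize({"user": "alice", "clicks": 3})
```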
CONSUMER SAMPLE CODE
from kafka import KafkaConsumer
# Connecting to kakfa and subscribing to a topic
consumer = KafkaConsumer(“my-topic”, group_id=“my-group”, bootstrap_servers=[kafka_url])
# Start consuming data
for msg in consumer:
print msg
IMPORTANT LINKS
 https://kafka.apache.org/intro
 https://kafka.apache.org/quickstart
 https://www.cloudkarafka.com/blog/2016-11-30-part1-kafka-for-beginners-what-is-apache-kafka.html
 http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
 https://kafka-python.readthedocs.io/en/master/usage.html
DEMO
Q & A
THANKS
