Kafka for Big Data Streaming

Big Data
Big data is a term for collections of data sets, both structured and unstructured, so large and
complex that they are difficult to process using on-hand database management tools or traditional
processing applications.
Lots of data (terabytes and petabytes).
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.

A stock market generates about one terabyte of new trade data per day, which feeds stock trading
analytics that determine trends for optimal trades.

Unstructured data is exploding
International Data Corporation (IDC) predicts that by 2020 the volume of data will exceed
40,000 EB, or 40 zettabytes.
The world's information is doubling every two years.

IBM definition of big data

What is Kafka
A distributed publish-subscribe messaging system.
Developed at LinkedIn.
Provides a solution for handling all activity stream data.
Integrates with the Hadoop platform.
Partitions real-time consumption across a cluster of machines.
Provides a mechanism for parallel load into Hadoop.

Need of Kafka
Feature            Description
High throughput    Supports hundreds of thousands of messages per second on modest hardware.
Scalability        Highly scalable with no downtime.
Replication        Messages can be replicated across the cluster.
Durability         Messages are persisted on disk.
Stream processing  Can be used for real-time streaming.
Data loss          With proper configuration, Kafka can ensure zero data loss.

Kafka Terminology
Producer
Consumer
Broker
Cluster
Topic
Partition
Offset

What is a producer
An application that sends data.

Producer
An application that publishes messages to a topic in the Kafka cluster.
Can be any kind of front end or streaming application.
When writing a message, the producer can attach a key to it.
Attaching a key guarantees that all messages with the same key are written to the same
partition.
Supports both synchronous and asynchronous sends.

Consumer
An application that subscribes to topics and consumes messages from brokers in the Kafka cluster.
A consumer group with multiple consumers can be configured to consume messages from a topic.
Each consumer in a consumer group reads messages from different partitions of the topic.

[Figure: two producers and two consumers connected to a Kafka server]

Pull Mechanism
[Figure: the producer sends messages to the broker; the consumer requests the next message from
the broker and receives it]

Brokers
Each server in the cluster is called a broker.
Handles hundreds of megabytes of writes from producers and reads from consumers per second.
Retains all published messages irrespective of whether they have been consumed.
If retention is configured for n days, a published message remains available for consumption
for n days and is discarded thereafter.
Works like a queue if all consumer instances belong to the same consumer group; otherwise works
like publish-subscribe.

Clusters
A group of computers sharing a workload for a common purpose.
A Kafka cluster is a fast, highly scalable messaging system.
Effective for applications that involve large-scale message processing.

Cluster set up
[Figure: a producer sends messages to a cluster of broker1, broker2 and broker3 coordinated by
ZooKeeper; a consumer requests the next message from the cluster]

Why Kafka Cluster
Kafka can easily handle hundreds of thousands of messages per second, which makes it a
high-throughput system.
The cluster can be expanded with no downtime, which makes Kafka highly scalable.
Messages are replicated, which provides reliability and durability.
Fault tolerant.

Topic
A user-defined category to which messages are published.
For each topic, a partitioned log is maintained.
Each partition is an ordered, immutable sequence of messages, and each message is assigned a
sequential ID number called the offset.
Writes to a partition are generally sequential, which reduces the number of hard disk seeks.
A consumer can read a partition from the beginning, or rewind or skip to any point in it by
supplying an offset value.
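
The offset behaviour described above can be sketched with a tiny in-memory log. This is only an illustration under stated assumptions: `PartitionLogSketch`, `append` and `readFrom` are hypothetical names, not part of the Kafka API, and a real partition lives on disk on a broker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for one partition log; not part of the Kafka API.
final class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Appends at the end of the log and returns the offset assigned to the message.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns every message from the given offset up to the end of the log.
    List<String> readFrom(long offset) {
        return new ArrayList<>(messages.subList((int) offset, messages.size()));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        for (String m : new String[] {"m1", "m2", "m3", "m4"}) {
            System.out.println(m + " -> offset " + log.append(m));
        }
        System.out.println("from the beginning: " + log.readFrom(0));
        System.out.println("rewound to offset 2: " + log.readFrom(2));
    }
}
```

Because existing entries are never modified, a reader only needs to remember one number, its offset, to resume or rewind.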

Topic
[Figure: producers send messages to the topics "Global orders" and "Other orders"; consumers
read messages from those topics]

What is an Offset
A sequence ID given to messages as they arrive in a partition.
[Figure: messages m1 to m8 in a partition, numbered with offsets 0, 1, 2, ...]

Offset
[Figure: the sent offset and the committed offset within a partition]

Offset
The committed offset is used to avoid resending already processed data to the new consumer
after a partition rebalance.
Auto commit: enable.auto.commit = true
Manual commit: enable.auto.commit = false
auto.commit.interval.ms = 4: what is the purpose of this property? It sets how often, in
milliseconds, the consumer commits offsets automatically when enable.auto.commit is true.

What is a Consumer group
A group of consumers acting as single logical unit.

One consumer in clustered setup example

Multiple consumers within a group.

There will not be any duplicate reads.
Each consumer within a consumer group is assigned one or more partitions and reads messages
only from the partitions assigned to it.
Once a partition is assigned to a consumer, it is not assigned to another consumer within the
same group unless a rebalance takes place.

When consumers are added to a group, the partitions are redistributed among the members.
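
A minimal sketch of how partitions might be spread over the members of a group. It deals partitions out in turn, which is a simplification and not Kafka's actual assignor; `GroupAssignmentSketch` and `assign` are hypothetical names, and the consumer list is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration only; Kafka's real assignment strategies differ in detail.
final class GroupAssignmentSketch {
    // Deals partitions 0..numPartitions-1 out to the consumers in turn.
    // Each partition ends up with exactly one owner within the group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList("c1"), 4));
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
        // With more consumers than partitions, the extra consumer stays idle.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3", "c4", "c5"), 4));
    }
}
```

Calling `assign` again with a longer consumer list plays the role of a rebalance: every partition still has exactly one owner, but the owners change.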

Group Coordinator
One of the Kafka brokers is elected as the group coordinator.
When a new consumer joins the group, it sends a message to the coordinator.
The first consumer to join the consumer group becomes the group leader.
Roles and responsibilities:
◦ The coordinator manages the list of group members.
◦ The coordinator initiates a rebalance whenever the list is modified.
◦ The consumer leader executes the rebalance.
◦ The consumer leader assigns partitions to the members and sends the assignment back to the coordinator.
◦ The coordinator communicates each member's new assignment to it.

Production Scenario: problem statement
◦ Imagine that poll() pulls a large amount of data that takes a long time to process, so the
next poll is delayed.
◦ If the next poll is delayed long enough, the group coordinator assumes that the consumer is
dead and triggers a rebalance. How will you know that a rebalance has been triggered, and how
will you commit your offsets in such a case?

Bala | 7/14/2017
© Robert Bosch Engineering and Business Solutions Private Limited 2017. All rights reserved, also regarding any disposal, exploitation, reproduction, editing, distribution, as well as in the event of applications for industrial property rights.
Solution
In this scenario we have to commit everything processed so far before the rebalance happens, so
that when the partition is assigned to a new consumer, records already processed by this consumer
are not sent again to the newly assigned consumer.
The ConsumerRebalanceListener interface provides callback methods for this:
onPartitionsRevoked, which is invoked just before the rebalance starts.
onPartitionsAssigned, which is invoked after the rebalance is complete.
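
The effect of committing in the revoke callback can be sketched without a broker. The class below is a plain-Java model, not Kafka client code: `RebalanceCommitSketch` and `replayedToNewConsumer` are hypothetical names, and offsets are simple counters.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a rebalance; it does not use the Kafka client library.
final class RebalanceCommitSketch {
    // committed:      offset the old consumer had committed before the rebalance began
    // processedUpTo:  next offset the old consumer would process (everything below is done)
    // commitOnRevoke: whether the old consumer commits its progress when partitions are revoked
    // Returns the offsets that the new owner of the partition will process a second time.
    static List<Long> replayedToNewConsumer(long committed, long processedUpTo, boolean commitOnRevoke) {
        long resumeFrom = commitOnRevoke ? processedUpTo : committed;
        List<Long> duplicates = new ArrayList<>();
        for (long offset = resumeFrom; offset < processedUpTo; offset++) {
            duplicates.add(offset);
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Six records processed, nothing committed yet.
        System.out.println("no commit on revoke: " + replayedToNewConsumer(0, 6, false));
        System.out.println("commit on revoke:    " + replayedToNewConsumer(0, 6, true));
    }
}
```

In a real consumer the commit would be issued from onPartitionsRevoked, typically as a synchronous commit so that it completes before the partition changes hands.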

Replication Factor
The replication factor is the total number of copies made of a partition.
The purpose of replication in Kafka is stronger durability and higher availability. We
want to guarantee that any successfully published message will not be lost and can be
consumed, even when there are server failures. Such failures can be caused by machine error,
program error, or more commonly, software upgrades.

Leader and Follower
For each partition, one broker is chosen as the leader.
The other replicas (followers) copy the data from the leader.

Client applications send messages only to the leader.

Zookeeper
An open source Apache project.
Provides centralised infrastructure and services that enable synchronisation across a cluster.
Common objects used across a large cluster environment, such as configuration and the
hierarchical naming space, are maintained in ZooKeeper.
ZooKeeper services are used by large-scale applications to coordinate distributed processing
across a large cluster.

Installing Kafka
Kafka can be downloaded from https://kafka.apache.org/downloads.html
At the time of writing, the current version of Kafka in the documentation is 0.11.0.0.

Kafka Configurations

Property: broker.id
Default: (none)
Description: Each broker is uniquely identified by a non-negative integer id. This id serves as
the broker's "name" and allows the broker to be moved to a different host/port without confusing
consumers. You can choose any number you like as long as it is unique.

Property: log.dirs
Default: /tmp/kafka-logs
Description: A comma-separated list of one or more directories in which Kafka data is stored.
Each new partition that is created is placed in the directory that currently has the fewest
partitions.

Property: port
Default: 6667
Description: The port on which the server accepts client connections.

Property: zookeeper.connect
Default: null
Description: Specifies the ZooKeeper connection string in the form hostname:port, where hostname
and port are the host and port of a node in your ZooKeeper cluster. To allow connecting through
other ZooKeeper nodes when that host is down, you can also specify multiple hosts in the form
hostname1:port1,hostname2:port2,hostname3:port3. ZooKeeper also allows you to add a "chroot" path,
which makes all Kafka data for the cluster appear under a particular path. This is a way to set
up multiple Kafka clusters or ...

Testing Kafka Cluster
A Kafka cluster can run in either of the following broker models:
Single broker cluster
Multi broker cluster
A single broker cluster runs only one broker instance, whereas a multi broker cluster runs
several.
The following shell scripts can be used to test the Kafka cluster.
Kafka Shell Scripts
zookeeper-server-start.sh
kafka-server-start.sh
kafka-topics.sh
kafka-console-producer.sh
kafka-console-consumer.sh

Demo

Start Kafka server
Create topic
Start a console producer
Start a console consumer.
Send and receive message.
Set up clustered broker

1. auto.create.topics.enable
2. default.replication.factor
3. num.partitions
4. log.retention.ms
5. log.retention.bytes

Maven dependencies for Kafka Java API
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.8.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.2.1</version>
</dependency>

Producer Java API
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducerToLearn {
    public static void main(String[] args) {
        String topicName = "SimpleTopic";
        String key = "Key1";
        String value = "Value-1";
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        // The Java producer needs one serializer for keys and one for values.
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("acks", "1"); // wait for the partition leader to acknowledge
        Producer<String, String> producer = new KafkaProducer<>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<>(topicName, key, value);
        producer.send(record); // fire and forget: the returned Future is ignored
        producer.close();
    }
}

Producer Record
Kafka comes with a default partitioner.
Messages with the same message key go to the same partition.
The key is optional; if a message has no key, Kafka distributes messages evenly across
the partitions.
If you pass a partition in the constructor, the default partitioner is not used for that record.
The timestamp field in the constructor denotes the time at which the message was sent. If you
do not pass it, the timestamp is set to the time at which the message is received by the broker.

3 Different Send Requests
Fire and forget
Synchronous send
Asynchronous send

Sync
package com.Learning.co;

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SynchronousProducerToLearn {
    public static void main(String[] args) {
        String topicName = "SimpleTopic";
        String key = "Key1";
        String value = "Value-1";
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("acks", "1");
        Producer<String, String> producer = new KafkaProducer<String, String>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, key, value);
        try {
            // get() blocks until the broker responds, which makes the send synchronous.
            RecordMetadata metadata = producer.send(record).get();
            System.out.println("Synchronous send succeeded: partition " + metadata.partition()
                    + ", offset " + metadata.offset());
        } catch (Exception e) {
            System.out.println("Synchronous send failed: " + e);
        }
        producer.close();
    }
}

Async
producer.send(record, new MessageCallBack());

// Requires org.apache.kafka.clients.producer.Callback and RecordMetadata.
class MessageCallBack implements Callback {
    // Invoked once the broker has responded or the send has failed.
    @Override
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) {
            System.out.println("Failed");
        } else {
            System.out.println("Success");
        }
    }
}

Production Scenario – Problem statement.
Assume the auto commit interval is set to 60 seconds (the default of the old consumer; the newer
Java consumer defaults to 5 seconds).
The poll method in consumer A is invoked and receives 6 records. All 6 records are processed in
less than 10 seconds. Since the 60-second interval has not yet elapsed, their offsets are not
committed in Kafka.
Another set of records is then received via poll.
Now assume that, for some reason, a rebalance is triggered. The first 6 records, although already
processed, are still not committed.
After the rebalance, the partition that was assigned to consumer A goes to a new consumer B.
Since consumer A committed none of the records, the first 6 messages are sent again to consumer B.
This is a clear case of data duplication. How do we handle it?

Critical Configs
batch.size (size-based batching)
linger.ms (time-based batching)
compression.type
max.in.flight.requests.per.connection (affects ordering)
acks (affects durability)
retries

Acks
◦ acks = 0
  ◦ The producer does not wait for a response from the broker.
  ◦ High throughput.
  ◦ No retries.
  ◦ Loss of messages is possible.
◦ acks = 1
  ◦ The producer waits for a response from the broker.
  ◦ The response is sent by the leader after it receives the message from the producer.
  ◦ Message loss is still possible, for example if the leader fails before the followers have
    copied the message.
◦ acks = -1
  ◦ The response is sent after the leader receives acknowledgement from all its in-sync replicas.
  ◦ Slow.
  ◦ Highly reliable.

Comparisons
Acks mode
Acks  Throughput  Latency  Durability
0     High        Low      No guarantee
1     Medium      Medium   Leader
-1    Low         High     ISR
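
As a rough sketch, the producer settings discussed above could be collected in a Properties object like this. The property names are real producer configs; the chosen values, the broker address, and the `ProducerConfigSketch` class are assumptions for illustration and favour durability and ordering over throughput.

```java
import java.util.Properties;

// Illustrative values only; tune them for your own workload.
final class ProducerConfigSketch {
    static Properties reliableProducerConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("acks", "all");            // same meaning as -1: wait for the in-sync replicas
        props.put("retries", "3");           // retry transient send failures
        props.put("max.in.flight.requests.per.connection", "1"); // keep ordering across retries
        props.put("batch.size", "16384");    // size-based batching, in bytes
        props.put("linger.ms", "5");         // time-based batching, in milliseconds
        props.put("compression.type", "gzip");
        return props;
    }

    public static void main(String[] args) {
        reliableProducerConfig().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Serializer settings are left out here; a real producer also needs key.serializer and value.serializer.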

Partitioner
Default Partitioner
◦ If a partition is specified in the record, use it.
◦ If no partition is specified but a key is present, choose a partition based on a hash of the key.
◦ If neither a partition nor a key is present, choose a partition in round-robin fashion.

Partitioner
Code snippet from the default partitioner:
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
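
The three rules of the default partitioner can be sketched in plain Java. Two loud assumptions: `Arrays.hashCode` stands in for Kafka's murmur2 hash, so the partition numbers will not match a real cluster, and `DefaultPartitionerSketch` and `choose` are hypothetical names. Masking with 0x7fffffff mirrors what Utils.toPositive does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the decision order only; the hash is NOT murmur2.
final class DefaultPartitionerSketch {
    private final AtomicInteger roundRobin = new AtomicInteger(0);

    int choose(Integer explicitPartition, String key, int numPartitions) {
        if (explicitPartition != null) {
            return explicitPartition;                       // rule 1: partition given in the record
        }
        if (key != null) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            int hash = Arrays.hashCode(keyBytes);           // stand-in for Utils.murmur2(keyBytes)
            return (hash & 0x7fffffff) % numPartitions;     // rule 2: hash of the key
        }
        int next = roundRobin.getAndIncrement();
        return (next & 0x7fffffff) % numPartitions;         // rule 3: round robin
    }

    public static void main(String[] args) {
        DefaultPartitionerSketch partitioner = new DefaultPartitionerSketch();
        System.out.println("explicit partition 2 -> " + partitioner.choose(2, "TSS", 4));
        System.out.println("key TSS -> " + partitioner.choose(null, "TSS", 4));
        System.out.println("key TSS again -> " + partitioner.choose(null, "TSS", 4));
        for (int i = 0; i < 4; i++) {
            System.out.println("no key -> " + partitioner.choose(null, null, 3));
        }
    }
}
```

The same key always maps to the same partition only while the number of partitions stays unchanged.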

Custom Partitioner

Partition - Producer
import java.util.*;
import org.apache.kafka.clients.producer.*;

public class SensorProducer {
    public static void main(String[] args) throws Exception {
        String topicName = "SensorTopic";
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("partitioner.class", "SensorPartitioner");
        props.put("speed.sensor.name", "TSS");
        Producer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "SSP" + i, "500" + i));
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "TSS", "500" + i));
        producer.close();
        System.out.println("SimpleProducer Completed.");
    }
}

Custom Partitioner
import java.util.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.*;
import org.apache.kafka.common.utils.*;
import org.apache.kafka.common.record.*;

public class SensorPartitioner implements Partitioner {
    private String speedSensorName;

    // Reads the custom "speed.sensor.name" property set on the producer.
    public void configure(Map<String, ?> configs) {
        speedSensorName = configs.get("speed.sensor.name").toString();
    }

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        // Reserve 30% of the partitions for the speed sensor.
        // Note: the topic needs at least 4 partitions, otherwise sp is 0 and the modulo below fails.
        int sp = (int) Math.abs(numPartitions * 0.3);
        int p = 0;
        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("All messages must have sensor name as key");
        if (((String) key).equals(speedSensorName))
            p = Utils.toPositive(Utils.murmur2(valueBytes)) % sp;   // spread by value over the reserved partitions
        else
            p = Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - sp) + sp;  // all other sensors
        System.out.println("Key = " + (String) key + " Partition = " + p);
        return p;
    }

    public void close() {}
}

max.in.flight.requests.per.connection
Definition: how many requests the producer can send to a broker on one connection without having
received a response.
Default value is 5.
A high value gives higher throughput but also uses more memory.
Values above 1 may cause out-of-order delivery when a retry occurs.
With asynchronous sends, set this property to 1 to maintain the ordering of messages.

Scenario
A side effect of asynchronous sends is that the ordering of data may be lost when messages are
processed in batches.
[Figure: Record1 to Record5 form the first batch and Record6 to Record10 the second. The first
batch fails (callback with exception) while the second is committed successfully; the first batch
is then retried and succeeds, so the partition on the broker holds Record6 to Record10 followed by
Record1 to Record5.]
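
The reordering in this scenario can be reproduced with a toy model. This is a simulation under stated assumptions, not producer code: `InFlightOrderingSketch` and `deliver` are hypothetical names, exactly one batch fails on its first attempt, and the retry always succeeds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model: batches are sent in windows of maxInFlight; one batch fails once and is retried.
final class InFlightOrderingSketch {
    // Returns the order in which records end up in the partition.
    static List<String> deliver(List<List<String>> batches, int failingBatch, int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be at least 1");
        }
        Deque<Integer> pending = new ArrayDeque<>();
        for (int i = 0; i < batches.size(); i++) {
            pending.addLast(i);
        }
        boolean alreadyFailed = false;
        List<String> partition = new ArrayList<>();
        while (!pending.isEmpty()) {
            // Send up to maxInFlight batches without waiting for a response.
            List<Integer> inFlight = new ArrayList<>();
            while (inFlight.size() < maxInFlight && !pending.isEmpty()) {
                inFlight.add(pending.pollFirst());
            }
            Integer toRetry = null;
            for (int batch : inFlight) {
                if (batch == failingBatch && !alreadyFailed) {
                    alreadyFailed = true;   // first attempt fails
                    toRetry = batch;
                } else {
                    partition.addAll(batches.get(batch));   // acknowledged and written
                }
            }
            if (toRetry != null) {
                pending.addFirst(toRetry);  // the retry goes out before any batch not yet sent
            }
        }
        return partition;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("Record1", "Record2"),
                Arrays.asList("Record3", "Record4"));
        System.out.println("max in flight 2: " + deliver(batches, 0, 2));
        System.out.println("max in flight 1: " + deliver(batches, 0, 1));
    }
}
```

With two batches in flight, the second batch is written before the retried first one; with a single in-flight request the original order is kept, at the cost of throughput.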

Async Producer
The async producer sends messages in the background, with no blocking in the client.
Provides more powerful batching of messages.
Wraps a sync producer, or rather a pool of them.
Communication from async to sync happens via a queue,
which explains why you may see kafka.producer.async.QueueFullException.
The async producer may drop messages if its queue is full.
◦ Solution 1: don't push messages to the producer faster than it is able to send them.
◦ Solution 2: queue full == need more brokers.
◦ Solution 3: set queue.enqueue.timeout.ms to -1. The producer then blocks indefinitely and never drops messages.
◦ Solution 4: increase queue.buffering.max.messages.
For a more detailed study: https://engineering.gnip.com/kafka-async-producer/

Other producer config properties
retries: number of times the producer retries sending a message. Default value: 0.
retry.backoff.ms: time to wait between retries. Default value: 100 ms.

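Lag
Lag = how far your consumer is behind the producer.
[Figure: a partition running from older to newer messages; the producer writes at the newest
end, the consumer reads further back, and the gap between them is the lag]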

Lag
Lag is a consumer problem.
Too slow, too much GC, losing the connection to ZooKeeper or Kafka.
Bug or design flaw in the consumer.
Operational mistakes, e.g. you brought up 6 Kafka servers in parallel, each one in turn triggering
a rebalance, and then hit Kafka's rebalance limit (cf. rebalance.max.retries).
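
Lag for one partition is usually computed as the difference between the newest offset in the log and the consumer's committed offset. A minimal sketch, assuming both numbers have already been obtained from monitoring; `ConsumerLagSketch` and `lag` are hypothetical names.

```java
// Arithmetic only; fetching the two offsets from a cluster is outside this sketch.
final class ConsumerLagSketch {
    // logEndOffset:    offset the next produced message will receive
    // committedOffset: next offset the consumer group will read
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    public static void main(String[] args) {
        System.out.println("lag = " + lag(100L, 40L));
        System.out.println("caught up, lag = " + lag(100L, 100L));
    }
}
```

A lag that keeps growing over time, rather than its absolute value at one moment, is the usual sign that the consumer cannot keep up.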

Under replicated partitions.
◦ For example because a broker is down.
Offline partitions
◦ Even worse than under replicated.
◦ Serious problem if anything but 0 offline partitions.

Assume the replication factor is set to 3 for this topic.
Partition    Leader broker  ISR
partition1   0              1,2
One of the replica brokers, say 2, goes down: the partition is under-replicated.
partition1   0              1
Then another replica, say 1, goes down: the partition is still under-replicated.
partition1   0              0
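
The two health states above can be expressed as a small check. This sketch counts the leader as an in-sync replica, which is how Kafka itself reports the ISR; `ReplicationHealthSketch` and `classify` are hypothetical names, and treating zero in-sync replicas as offline assumes no out-of-sync replica is allowed to take over as leader.

```java
// Classifies one partition from its replication factor and the size of its ISR (leader included).
final class ReplicationHealthSketch {
    static String classify(int replicationFactor, int inSyncReplicas) {
        if (inSyncReplicas == 0) {
            return "offline";            // no replica can serve as leader
        }
        if (inSyncReplicas < replicationFactor) {
            return "under-replicated";   // still available, but with fewer copies than configured
        }
        return "healthy";
    }

    public static void main(String[] args) {
        for (int isr = 3; isr >= 0; isr--) {
            System.out.println("replication factor 3, ISR size " + isr + ": " + classify(3, isr));
        }
    }
}
```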

replica.lag.max.messages
Leader  In-sync replica 1  In-sync replica 2
0       0                  0
1       1                  1
2       2
3       3
4       4
5       5
6       6
For some reason messages are not being copied to in-sync replica 2, which is now lagging by 5
messages. That is more than the value of replica.lag.max.messages, taken as 4 in this example
(the actual default in Kafka 0.8.x was 4000), so this broker (replica 2) goes out of sync.

replica.lag.max.messages
1. What happens when messages arrive in batches?
2. Assume the value of the property is set to 3.
3. Assume batch one has 5 messages and is replicated to all brokers.
4. The second batch has another 5 messages. As soon as the leader appends it, replica 1 is more
than 3 messages behind, so it goes out of sync.
5. Hence replica 1 goes out of sync even though it is not dead.
6. The solution is to use the time-based setting replica.lag.time.max.ms.
[Figure: the leader holds two batches of Record1 to Record5; in-sync replica 1 has committed only
the first batch and goes out of sync]
Production Scenario 1
What happens when a broker goes down and comes up again?
Partition    Leader broker  Leader after broker 1 goes down  Leader after broker 1 comes up
partition1   0              0                                0
partition2   1              2                                2
partition3   2              2                                2
partition4   3              3                                3
partition5   1              0                                0
partition6   0              0                                0
The sad reality is that broker 1 does not become a leader again on its own; it simply stays in
the ISR.
kafka-preferred-replica-election.sh comes to your rescue, so that the load is evenly balanced again.

How to increase or decrease the number of nodes in Kafka?
Increase, or add a new broker
◦ Just start a new instance of Kafka. Existing partitions are not moved to the new broker
automatically, so on its own it will never be a leader. Reassign partitions to it with
kafka-reassign-partitions.sh and then run kafka-preferred-replica-election.sh.
Decrease, or cut down a broker
◦ Run kafka-reassign-partitions.sh.
◦ It shows the current replica assignment and a proposed replica assignment.
◦ kafka-reassign-partitions.sh --broker-list <<list of brokers you want to keep>> --generate
(together with the usual --zookeeper and --topics-to-move-json-file arguments).
◦ Suppose you have 5 brokers 1,2,3,4,5 and you want to bring down 5: kafka-reassign-partitions.sh --broker-list "1,2,3,4" --generate
◦ This prints the proposed assignment as JSON; save it to a file.
◦ Now run the script again:
◦ kafka-reassign-partitions.sh --execute --reassignment-json-file <<json file name>>
◦ After this, run kafka-preferred-replica-election.sh.
◦ Cross-check using the describe command.

What to do if broker 2 goes down and is not recoverable?
◦ Simply start a new broker with the same broker.id as the one that is not recoverable.
◦ Then run kafka-preferred-replica-election.sh.

Twitter Code URL
Viyaan | 7/14/2017
Git Hub Links
https://github.com/Viyaan/TwitterKafkaProducer
https://github.com/Viyaan/StormKafkaStreamingWordCount
