2. Big Data
The term describes volumes of structured and unstructured data so massive that they are difficult
to process with traditional database and software techniques.
Lots of data (terabytes and petabytes).
Big data is a collection of data sets so large and complex that it is difficult to process
using on-hand database management tools or traditional processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
3. The stock market generates about one terabyte of new trade data per day, which feeds
stock trading analytics that determine trends for optimal trades.
4. Unstructured data is exploding
International Data Corporation (IDC) predicts that by 2020 the volume of data will exceed 40,000 EB, or 40
zettabytes.
The world's information is doubling every 2 years.
6. What is Kafka
A distributed publish-subscribe messaging system.
Developed at LinkedIn.
Provides a solution for handling all activity stream data.
Fully supported on the Hadoop platform.
Partitions real-time consumption across a cluster of machines.
Provides a mechanism for parallel load into Hadoop.
9. Need of Kafka
Feature            Description
High throughput    Supports hundreds of thousands of messages per second on modest hardware.
Scalability        Highly scalable with no downtime.
Replication        Messages can be replicated across clusters.
Durability         Supports persistence of messages on disk.
Stream processing  Can be used for real-time streaming.
Data loss          With proper configuration, Kafka can ensure zero data loss.
12. What is a producer
An application that sends data.
13. Producer
An application that publishes messages to a topic in the Kafka cluster.
Can be any kind of front end or streaming application.
While writing a message, it is also possible to attach a key to it.
By attaching a key, the producer guarantees that all messages with the same key are written to the same
partition.
Supports both sync and async modes.
14. Consumer
An application that subscribes to and consumes messages from the brokers in the Kafka cluster.
A consumer group with multiple consumers can be configured to consume messages from a topic.
Each consumer in the consumer group reads messages from a different partition of the topic.
17. Brokers
Each server is called a broker.
Handles hundreds of megabytes of writes from producers and reads from consumers.
Retains all published messages irrespective of whether they have been consumed or not.
If retention is configured for n days, a published message is available for
consumption for n days and is discarded thereafter.
Works like a queue if the consumer instances belong to the same consumer group; otherwise works like
publish-subscribe.
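The retention rule above can be sketched in plain Java. This is a toy model under stated assumptions, not Kafka's log cleaner: the class RetentionPolicy and its method are hypothetical names, and the 7-day period is only an example value. On a real broker the period is set through settings such as log.retention.hours.

```java
import java.time.Duration;
import java.time.Instant;

// Toy model of time-based retention; hypothetical names, no Kafka dependency.
class RetentionPolicy {
    private final Duration retention;

    RetentionPolicy(Duration retention) {
        this.retention = retention;
    }

    // A message stays readable, consumed or not, until it is older than the retention period.
    boolean isAvailable(Instant publishedAt, Instant now) {
        return Duration.between(publishedAt, now).compareTo(retention) <= 0;
    }

    public static void main(String[] args) {
        RetentionPolicy policy = new RetentionPolicy(Duration.ofDays(7));
        Instant published = Instant.parse("2017-07-01T00:00:00Z");
        // Four days after publishing: still inside the retention window.
        System.out.println(policy.isAvailable(published, Instant.parse("2017-07-05T00:00:00Z")));
        // Eight days after publishing: past the window, so the message is discarded.
        System.out.println(policy.isAvailable(published, Instant.parse("2017-07-09T00:00:00Z")));
    }
}
```

Note that availability depends only on the age of the message, never on whether a consumer has read it.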
18. Clusters
A group of computers sharing a workload for a common purpose.
A Kafka cluster is a fast, highly scalable messaging system.
Effective for applications that involve large-scale message processing.
20. Why Kafka Cluster
With Kafka we can easily handle hundreds of thousands of messages per second, which makes
Kafka a high-throughput system.
The cluster can be expanded with no downtime, making Kafka highly scalable.
Messages are replicated, which provides reliability and durability.
Fault tolerant.
21. Topic
A user-defined category to which messages are published.
For each topic, a partitioned log is maintained.
Each partition is an ordered, immutable sequence of messages, each assigned a
sequential id number called the offset.
Writes to a partition are sequential, which reduces the number of hard disk seeks.
A consumer can read a partition from the beginning, or rewind or skip
to any point in it by supplying an offset value.
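A minimal sketch of the offset idea, assuming a single partition held in memory. PartitionLog, append and readFrom are hypothetical names for illustration; real Kafka stores the log on disk and serves reads through the consumer API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy in-memory model of one partition; hypothetical names, no Kafka dependency.
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Appends to the end of the log and returns the offset assigned to the message.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns every message from the given offset to the end of the log.
    List<String> readFrom(long offset) {
        if (offset < 0 || offset > messages.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return Collections.unmodifiableList(messages.subList((int) offset, messages.size()));
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        log.append("a");
        log.append("b");
        long third = log.append("c");
        System.out.println(third);            // offset assigned to the third message
        System.out.println(log.readFrom(0));  // read from the beginning
        System.out.println(log.readFrom(2));  // skip ahead to offset 2
    }
}
```

Because messages are only ever appended, an offset always identifies the same message, which is what makes rewinding safe.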
26. Offset
A committed offset is used to avoid resending already processed data to the new consumer
in the event of a partition rebalance.
Auto commit: enable.auto.commit = true
Manual commit: enable.auto.commit = false
auto.commit.interval.ms = 4: what is the purpose of this property? It sets how often, in milliseconds, the consumer commits offsets automatically when enable.auto.commit is true.
27. What is a Consumer group
A group of consumers acting as a single logical unit.
32. There will not be any duplicate reads.
Each consumer within a consumer group is assigned one or more partitions, so it reads messages
only from the partitions assigned to it.
Once a partition is assigned to a consumer, it is not assigned to another consumer within the
same group unless a rebalance takes place.
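The assignment rule can be illustrated with a simple round-robin spread. This is only a sketch: GroupAssignment is a hypothetical class, and Kafka's real partition assignors are pluggable and more involved. What it shows is that every partition has exactly one owner within the group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy round-robin assignment; hypothetical names, no Kafka dependency.
class GroupAssignment {
    // Spreads partitions 0..numPartitions-1 over the consumers of one group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            // Each partition gets exactly one owner, so no two members read the same partition.
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList("consumer-A", "consumer-B"), 5));
        // A rebalance after consumer-C joins produces a fresh assignment.
        System.out.println(assign(Arrays.asList("consumer-A", "consumer-B", "consumer-C"), 5));
    }
}
```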
35. Group Coordinator
One of the Kafka brokers is elected as the group coordinator.
When a new consumer joins the group, it sends a message to the coordinator.
The first consumer to join the consumer group becomes the leader of the group.
Roles and responsibilities:
◦ The coordinator manages the list of group members.
◦ The coordinator initiates rebalance activity once the list is modified.
◦ The consumer leader executes the rebalance activity.
◦ The consumer leader assigns partitions to the new member and sends the assignment back to the coordinator.
◦ The coordinator communicates the new assignment to each member consumer.
36. Production Scenario – Problem statement.
◦ Imagine that poll() pulls a large amount of data that takes a long time to process, which
means the next poll is delayed.
◦ If the next poll is delayed, the group coordinator assumes that the consumer is dead and
triggers a rebalance. How will you know that a rebalance is triggered, and how will you commit
your offsets in such cases?
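One common answer is to pass a ConsumerRebalanceListener when subscribing: its onPartitionsRevoked method is called before partitions are taken away, which is the moment to commit with commitSync. The sketch below models only the bookkeeping side in plain Java, with no Kafka dependency. OffsetTracker and its methods are hypothetical names; in a real consumer the keys would be TopicPartition objects and the values would be wrapped in OffsetAndMetadata.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping helper; it has no Kafka dependency.
class OffsetTracker {
    // partition number -> next offset to read (last processed offset + 1)
    private final Map<Integer, Long> nextOffsets = new HashMap<>();

    // Call after each record has been fully processed.
    void markProcessed(int partition, long offset) {
        nextOffsets.put(partition, offset + 1);
    }

    // Call when a rebalance revokes partitions: returns what to commit, then resets.
    Map<Integer, Long> drainForCommit() {
        Map<Integer, Long> toCommit = new HashMap<>(nextOffsets);
        nextOffsets.clear();
        return toCommit;
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        for (long offset = 0; offset < 6; offset++) {
            tracker.markProcessed(0, offset); // six records processed on partition 0
        }
        // In a real consumer this map would be committed from onPartitionsRevoked.
        System.out.println(tracker.drainForCommit());
    }
}
```

The committed value is the offset of the next record to read, not of the last record processed, so the consumer that takes over starts exactly where this one stopped.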
40. Replication Factor
The total number of copies made of a partition is its replication factor.
The purpose of adding replication to Kafka is stronger durability and higher availability. We
want to guarantee that any successfully published message will not be lost and can be
consumed, even when there are server failures. Such failures can be caused by machine error,
program error, or more commonly, software upgrades.
41. Leader and Follower
For each partition, one broker is chosen as the leader.
The other replicas are followers, which copy data from the leader.
43. Zookeeper
An open source Apache project.
Provides centralised infrastructure and services that enable synchronisation across clusters.
Common objects used across a large cluster environment, such as configuration and the hierarchical naming space, are maintained in Zookeeper.
Zookeeper services are used by large-scale applications to coordinate distributed processing
across a large cluster.
45. Installing Kafka
Kafka can be downloaded from https://kafka.apache.org/downloads.html
As per the current documentation, the version of Kafka is 0.11.0.0.
46. Kafka Configurations
Property           Default           Description
broker.id          -                 Each broker is uniquely identified by a non-negative integer id. This id serves as the broker's "name" and allows the broker to be moved to a different host/port without confusing consumers. You can choose any number you like so long as it is unique.
log.dirs           /tmp/kafka-logs   A comma-separated list of one or more directories in which Kafka data is stored. Each new partition that is created will be placed in the directory which currently has the fewest partitions.
port               6667              The port on which the server accepts client connections.
zookeeper.connect  null              Specifies the Zookeeper connection string in the form hostname:port, where hostname and port are the host and port for a node in your Zookeeper cluster. To allow connecting through other Zookeeper nodes when that host is down you can also specify multiple hosts in the form hostname1:port1,hostname2:port2,hostname3:port3. Zookeeper also allows you to add a "chroot" path which will make all Kafka data for the cluster appear under a particular path. This is a way to set up multiple Kafka clusters or
47. Testing Kafka Cluster
A Kafka cluster can run in either of the following broker models.
Single Broker Cluster
Multi Broker Cluster
A single broker cluster runs only one broker instance, whereas a multi broker cluster runs
multiple brokers.
To test the Kafka cluster the following shell scripts can be used.
Kafka Shell Scripts
zookeeper-server-start.sh
kafka-server-start.sh
kafka-topics.sh
kafka-console-producer.sh
kafka-console-consumer.sh
56. Producer Record
Kafka comes with a default partitioner.
Messages with the same message key go to the same partition.
The key is optional; if a message has no key, Kafka will distribute messages evenly across
the partitions.
If you pass a partition in the constructor, the default partitioner is not used.
The timestamp field in the constructor denotes the time at which the message is sent to the broker. If you don't
pass it, the producer stamps the record with its current time; the broker's receive time is used instead only when the topic is configured for log-append time.
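The selection order described above (explicit partition first, then key, then even spreading) can be sketched as follows. PartitionChooser is a hypothetical class, and Arrays.hashCode is only a stand-in: Kafka's default partitioner hashes key bytes with murmur2, so real partition numbers will differ from the ones this sketch produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Toy partition chooser; hypothetical names, and the hash is NOT Kafka's murmur2.
class PartitionChooser {
    private final AtomicInteger counter = new AtomicInteger();

    int choose(Integer explicitPartition, String key, int numPartitions) {
        if (explicitPartition != null) {
            return explicitPartition;                  // caller picked the partition
        }
        if (key != null) {
            int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
            return Math.floorMod(hash, numPartitions); // same key -> same partition
        }
        // No key: hand out partitions in turn so messages spread evenly.
        return Math.floorMod(counter.getAndIncrement(), numPartitions);
    }

    public static void main(String[] args) {
        PartitionChooser chooser = new PartitionChooser();
        System.out.println(chooser.choose(null, "Key1", 3) == chooser.choose(null, "Key1", 3));
        System.out.println(chooser.choose(2, "Key1", 3));
        System.out.println(chooser.choose(null, null, 3));
        System.out.println(chooser.choose(null, null, 3));
    }
}
```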
60. Sync
package com.Learning.co;

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SynchronousProducerToLearn {
    public static void main(String[] args) {
        String topicName = "SimpleTopic";
        String key = "Key1";
        String value = "Value-1";
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        // The Java producer needs one serializer for the key and one for the value.
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for the partition leader to acknowledge each write.
        properties.put("acks", "1");
        Producer<String, String> producer = new KafkaProducer<String, String>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, key, value);
        try {
            // get() blocks until the broker responds, which makes the send synchronous.
            RecordMetadata metadata = producer.send(record).get();
            System.out.println("Synchronous send completed with success: sent to partition "
                    + metadata.partition() + ", offset " + metadata.offset());
        } catch (Exception e) {
            System.out.println("Synchronous send completed with failure: " + e);
        } finally {
            producer.close();
        }
    }
}
61. Async
// Requires: import org.apache.kafka.clients.producer.Callback;
// send() returns immediately; the callback runs once the broker responds.
producer.send(record, new MessageCallBack());

class MessageCallBack implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) {
            System.out.println("Failed");
        } else {
            System.out.println("Success");
        }
    }
}
62. Production Scenario – Problem statement.
Bala | 7/14/2017
Assume the auto-commit interval is set to 60 seconds.
The poll method in consumer A is invoked and receives 6 records. All 6 records are processed in less than
10 seconds. Since the 60-second interval has not elapsed, their offsets are not yet committed to Kafka.
Another set of records is then received via the poll method.
Now assume that, for some reason, a rebalance is triggered. The first 6 records, which are already processed, are still
not committed.
After the rebalance, the partition that was assigned to consumer A goes to a new consumer B. Since none
of the records were committed by consumer A, the first 6 messages are sent again to consumer B.
This is a clear case of data duplication. How do we handle it?
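One way to handle it is to turn off auto commit and commit the processed offsets yourself before the partition is handed over, for example from ConsumerRebalanceListener.onPartitionsRevoked. The toy replay below, in plain Java with no Kafka dependency, compares the two outcomes. RebalanceReplay and the figure of 12 records in the partition are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy replay of the scenario; hypothetical names, no Kafka dependency.
class RebalanceReplay {
    // Returns the offsets consumer B reads after the rebalance, given what A committed.
    static List<Long> deliveredToB(long committedOffset, long logEndOffset) {
        List<Long> delivered = new ArrayList<>();
        for (long offset = committedOffset; offset < logEndOffset; offset++) {
            delivered.add(offset);
        }
        return delivered;
    }

    // Counts delivered offsets that consumer A had already processed.
    static long countBelow(List<Long> offsets, long processedUpTo) {
        return offsets.stream().filter(o -> o < processedUpTo).count();
    }

    public static void main(String[] args) {
        long logEndOffset = 12;   // assumed: 12 records in the partition
        long processedByA = 6;    // A finished offsets 0..5 before the rebalance

        // Auto commit had not fired yet, so the committed offset is still 0.
        List<Long> withoutCommit = deliveredToB(0, logEndOffset);
        // A committed on revocation, so the committed offset is 6.
        List<Long> withCommit = deliveredToB(processedByA, logEndOffset);

        System.out.println("duplicates without commit: " + countBelow(withoutCommit, processedByA));
        System.out.println("duplicates with commit:    " + countBelow(withCommit, processedByA));
    }
}
```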
70. Partition - Producer
import java.util.*;
import org.apache.kafka.clients.producer.*;

public class SensorProducer {
    public static void main(String[] args) throws Exception {
        String topicName = "SensorTopic";
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Use the custom partitioner and tell it which sensor gets the reserved partitions.
        props.put("partitioner.class", "SensorPartitioner");
        props.put("speed.sensor.name", "TSS");
        Producer<String, String> producer = new KafkaProducer<>(props);
        // Ten messages from ten ordinary sensors (keys SSP0 to SSP9).
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "SSP" + i, "500" + i));
        // Ten messages from the speed sensor (key TSS).
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "TSS", "500" + i));
        producer.close();
        System.out.println("SimpleProducer Completed.");
    }
}
71. Custom Partitioner
import java.util.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.*;
import org.apache.kafka.common.utils.*;
import org.apache.kafka.common.record.*;

public class SensorPartitioner implements Partitioner {
    private String speedSensorName;

    // Reads the custom "speed.sensor.name" property set on the producer.
    public void configure(Map<String, ?> configs) {
        speedSensorName = configs.get("speed.sensor.name").toString();
    }

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        // Reserve 30% of the partitions for the speed sensor.
        // The topic needs at least 4 partitions, otherwise sp is 0 and "% sp" fails.
        int sp = (int) Math.abs(numPartitions * 0.3);
        int p = 0;
        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("All messages must have sensor name as key");
        if (((String) key).equals(speedSensorName))
            // Speed sensor: hash the value into the reserved partitions [0, sp).
            p = Utils.toPositive(Utils.murmur2(valueBytes)) % sp;
        else
            // Other sensors: hash the key into the remaining partitions [sp, numPartitions).
            p = Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - sp) + sp;
        System.out.println("Key = " + (String) key + " Partition = " + p);
        return p;
    }

    public void close() {}
}
72. max.in.flight.requests.per.connection:
Definition: how many requests you can send to a broker without getting a response.
Default value is 5.
A high value gives high throughput but also uses more memory.
A value above 1 may cause out-of-order delivery when a retry occurs.
For async sends, set this property to 1 to maintain the ordering of messages.
73. Scenario
A side effect of async sends is that the ordering of data may be lost when messages are processed in batches.
[Figure: the partition buffer holds Record1 to Record10 in two batches. The batch Record6 to Record10 commits successfully at the broker. The batch Record1 to Record5 gets a callback with an exception, is retried, and then succeeds. The broker therefore stores Record6 to Record10 ahead of Record1 to Record5.]
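The reordering can be replayed with a toy model in plain Java. This is not the Kafka producer: InFlightOrdering is a hypothetical class that assumes one batch fails on its first attempt and that its retry lands only after the other requests in flight at the same time have been acknowledged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of in-flight requests with one failing batch; hypothetical names.
class InFlightOrdering {
    // Returns the order in which records end up in the broker log.
    static List<String> brokerLog(List<List<String>> batches, int failingIndex, int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be at least 1");
        }
        List<String> log = new ArrayList<>();
        for (int start = 0; start < batches.size(); start += maxInFlight) {
            int end = Math.min(start + maxInFlight, batches.size());
            List<List<String>> retries = new ArrayList<>();
            for (int i = start; i < end; i++) {
                if (i == failingIndex) {
                    retries.add(batches.get(i));   // first attempt fails
                } else {
                    log.addAll(batches.get(i));    // acknowledged by the broker
                }
            }
            for (List<String> retried : retries) {
                log.addAll(retried);               // the retry succeeds
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("Record1", "Record2", "Record3", "Record4", "Record5"),
                Arrays.asList("Record6", "Record7", "Record8", "Record9", "Record10"));
        System.out.println(brokerLog(batches, 0, 5)); // several requests in flight
        System.out.println(brokerLog(batches, 0, 1)); // one request in flight at a time
    }
}
```

With a single request in flight, the second batch cannot be sent until the retry of the first has succeeded, so the order is preserved at the cost of throughput.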
74. Async Producer
The async producer sends messages in the background, with no blocking in the client.
Provides more powerful batching of messages.
Wraps a sync producer, or rather a pool of them.
Communication from async to sync happens via a queue,
which explains why you may see kafka.producer.async.QueueFullException.
The async producer may drop messages if its queue is full.
◦ Solution 1: don't push messages to the producer faster than it is able to drain its queue.
◦ Solution 2: queue full == need more brokers.
◦ Solution 3: set queue.enqueue.timeout.ms to -1. The producer will then block indefinitely and never drop messages.
◦ Solution 4: increase queue.buffering.max.messages.
For a more detailed study: https://engineering.gnip.com/kafka-async-producer/
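The difference between dropping and blocking on a full queue can be shown with a bounded queue from the JDK. This is a stand-in sketch, not Kafka's producer code: BoundedSendQueue and its methods are hypothetical, and the capacity of 2 is chosen only to make the queue fill quickly.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy stand-in for the async producer's internal queue; hypothetical names.
class BoundedSendQueue {
    private final BlockingQueue<String> queue;
    private int dropped = 0;

    BoundedSendQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking enqueue: a full queue drops the message instead of waiting.
    boolean enqueueOrDrop(String message) {
        boolean accepted = queue.offer(message);
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    // Blocking enqueue: waits for free space, so nothing is ever dropped.
    void enqueueBlocking(String message) throws InterruptedException {
        queue.put(message);
    }

    int dropped() { return dropped; }

    int size() { return queue.size(); }

    public static void main(String[] args) {
        BoundedSendQueue q = new BoundedSendQueue(2);
        for (int i = 1; i <= 3; i++) {
            System.out.println("Record" + i + " accepted: " + q.enqueueOrDrop("Record" + i));
        }
        System.out.println("dropped: " + q.dropped());
    }
}
```

The blocking variant trades message loss for back-pressure on the caller, which is the same trade made by an enqueue timeout of -1.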
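77. Lag
Lag = how far your consumer is behind the producer.
[Figure: a log running from older messages to newer messages. The producer writes at the newest end, the consumer reads further back, and the gap between them is the lag.]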
78. Lag
Lag is a consumer problem.
The consumer is too slow, spends too much time in GC, or is losing its connection to ZK or Kafka.
A bug or design flaw in the consumer.
Operational mistakes, e.g. you brought up 6 Kafka servers in parallel, each one in turn triggering
a rebalance, then hit Kafka's rebalance limit; cf. rebalance.max.retries.
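Lag is usually measured per partition as the log end offset minus the group's committed offset. The sketch below assumes both sets of offsets are already known and passed in by hand. ConsumerLag is a hypothetical helper, and treating a partition with no committed offset as starting from 0 is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy lag calculator; offsets are supplied by hand rather than read from Kafka.
class ConsumerLag {
    // Lag per partition = log end offset (next offset the producer writes)
    // minus committed offset (next offset the consumer group reads).
    static Map<Integer, Long> lag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            // Assumption: a partition with no committed offset counts from 0.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            result.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = new LinkedHashMap<>();
        end.put(0, 100L);
        end.put(1, 250L);
        Map<Integer, Long> committed = new LinkedHashMap<>();
        committed.put(0, 100L);
        committed.put(1, 190L);
        // Partition 0 is fully caught up; partition 1 is 60 messages behind.
        System.out.println(lag(end, committed));
    }
}
```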
79. Under-replicated partitions.
◦ For example, because a broker is down.
Offline partitions.
◦ Even worse than under-replicated.
◦ A serious problem if there is anything but 0 offline partitions.
80. Assume the replication factor is set to 3 for this topic.
Partition    Leader broker   ISR
partition1   0               1,2
One of the replica brokers, say 2, goes down: the partition is under-replicated.
partition1   0               1
Then the other replica, say 1, goes down: the partition is still under-replicated.
partition1   0               0 (leader only)
81. replica.lag.max.messages
Leader   In-sync replica 1   In-sync replica 2
0        0                   0
1        1                   1
2        2
3        3
4        4
5        5
6        6
For some reason, messages are not being copied to in-sync replica 2. In this case
replica 2 is lagging by 5 messages, which is more than the value of
replica.lag.max.messages (4 in this example). This broker (replica 2) will go out of sync.
82. replica.lag.max.messages
1. What happens when messages arrive in batches?
2. Assume the value of the property is set to 3.
3. Assume batch one has 5 messages, and the first batch is replicated to all brokers.
4. The second batch has another 5 messages. Until replica 1 has copied them, it is lagging by more
than 3 messages, so it goes out of sync.
5. Hence, although replica 1 is not dead, it goes out of sync.
6. The solution is to use replica.lag.time.max.ms.
[Figure: the leader holds two batches of Record1 to Record5. In-sync replica 1 holds only the first, committed batch, so replica 1 goes out of sync.]
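The two ways of judging a replica can be compared in a few lines of plain Java. This is a toy model, not broker code: ReplicaLagCheck and its methods are hypothetical, and the 10-second limit is an example value. The time-based rule corresponds to the broker setting replica.lag.time.max.ms.

```java
// Toy model of the two lag checks; hypothetical names, no Kafka dependency.
class ReplicaLagCheck {
    // Message-count rule: out of sync when the replica trails by more than maxMessages.
    static boolean inSyncByMessages(long leaderEndOffset, long replicaEndOffset, long maxMessages) {
        return leaderEndOffset - replicaEndOffset <= maxMessages;
    }

    // Time rule: out of sync only when the replica has not caught up for longer than maxMs.
    static boolean inSyncByTime(long nowMs, long lastCaughtUpMs, long maxMs) {
        return nowMs - lastCaughtUpMs <= maxMs;
    }

    public static void main(String[] args) {
        // Batch one (5 messages) is on both brokers; batch two (5 more) just reached the leader.
        long leaderEnd = 10;
        long replicaEnd = 5;
        // With a limit of 3 messages, a healthy replica is flagged as out of sync.
        System.out.println(inSyncByMessages(leaderEnd, replicaEnd, 3));
        // The replica was fully caught up 20 ms ago, well within an example limit of 10 s.
        System.out.println(inSyncByTime(1_000_020L, 1_000_000L, 10_000L));
    }
}
```

A time-based limit does not depend on batch size, so a burst of messages alone cannot push a live replica out of sync.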
83. Production Scenario 1
What happens when a broker goes down and comes up again?
Partition    Leader broker   Leader after broker 1 goes down   Leader after broker 1 comes up
partition1   0               0                                 0
partition2   1               2                                 2
partition3   2               2                                 2
partition4   3               3                                 3
partition5   1               0                                 0
partition6   0               0                                 0
The sad reality is that broker 1 does not become a leader again just by coming back up; it simply rejoins the ISR.
kafka-preferred-replica-election.sh comes to your rescue: it restores the preferred leaders, so the load is evenly balanced again.
84. Production Scenario 2
How to increase or decrease the number of nodes in Kafka?
Increase, or add a new broker
◦ Just start a new instance of Kafka. On its own, this new instance will never be a leader. Hence, after starting the
broker, run kafka-preferred-replica-election.sh.
Decrease, or cut down a broker
◦ Run kafka-reassign-partitions.sh.
◦ This shows the current replica assignment and a proposed replica assignment.
◦ kafka-reassign-partitions.sh <<list of brokers you want to keep>> --generate
◦ Suppose you have 5 brokers 1,2,3,4,5 and you want to bring down 5: kafka-reassign-partitions.sh <<1,2,3,4>> --generate
◦ This generates the proposed assignment as JSON, which you save to a file.
◦ Now run the script again:
◦ kafka-reassign-partitions.sh --execute --reassignment-json-file <<json file name>>
◦ After this, run kafka-preferred-replica-election.sh.
◦ Cross-check using the describe command.
85. Production Scenario 3
What to do if broker 2 goes down and is not recoverable?
◦ Simply start a new broker with the same broker.id as the one that is not recoverable.
◦ Then run kafka-preferred-replica-election.sh.