Kafka RealTime Streaming
Big Data
Big data describes massive volumes of both structured and unstructured data, so large that it is difficult to process with traditional database and software techniques.
Lots of data (terabytes and petabytes).
Big data is a term for collections of data sets so large and complex that they are difficult to process using on-hand database management tools or traditional processing applications.
The challenges involved include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The stock market generates about one terabyte of new trade data per day, used for stock trading analytics to determine trends for optimal trades.
Unstructured data is exploding.
By 2020, International Data Corporation predicts the volume will exceed 40,000 EB, or 40 zettabytes.
The world's information is doubling every two years.
IBM definition of big data.
What is Kafka
A distributed publish-subscribe messaging system.
Developed by LinkedIn.
Provides a solution to handle all activity-stream data.
Fully supports the Hadoop platform.
Partitions real-time consumption across a cluster of machines.
Provides a mechanism for parallel load into Hadoop.
What it offers.
Need of Kafka
Feature and description:
High throughput – supports hundreds of thousands of messages per second on modest hardware.
Scalability – highly scalable with no downtime.
Replication – messages can be replicated across clusters.
Durability – supports persistence of messages on disk.
Stream processing – can be used for real-time streaming.
Data loss – with proper configuration, Kafka can ensure zero data loss.
Kafka Core Concepts
Kafka Terminology
Producer
Consumer
Broker
Cluster
Topic
Partition
Offset
• What is a producer: an application that sends data.
Producer
An application that publishes messages to a topic in the Kafka cluster.
Can be any kind of application: a front end or a streaming source.
While writing messages it is also possible to attach a key to each message.
By attaching a key, the producer guarantees that all messages with the same key are written to the same partition.
Supports both sync and async modes.
Consumer
An application that subscribes to and consumes messages from brokers in the Kafka cluster.
When consuming messages from a topic, a consumer group can be configured with multiple consumers.
Each consumer in a consumer group reads messages from a different partition of the topic.
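As a sketch, a minimal consumer might look like this (assumes a local broker; the topic and group names are illustrative):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumerToLearn {
    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("group.id", "simple-group"); // consumers sharing a group.id split the partitions
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        consumer.subscribe(Arrays.asList("SimpleTopic"));
        try {
            while (true) {
                // poll() pulls the next batch of records from the partitions assigned to this consumer
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records)
                    System.out.println(record.key() + " = " + record.value() + " @ offset " + record.offset());
            }
        } finally {
            consumer.close();
        }
    }
}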
[Diagram: pull mechanism – producers send messages to the Kafka server (broker); consumers request the next message, i.e. Kafka consumers pull.]
Brokers
Each server in the cluster is called a broker.
Handles hundreds of megabytes of writes from producers and reads from consumers.
Retains all published messages irrespective of whether they are consumed or not.
If retention is configured for n days, then once a message is published it is available for consumption for n days, and thereafter it is discarded.
Works like a queue if consumer instances belong to the same consumer group; otherwise works like publish-subscribe.
Clusters
A group of computers sharing a workload for a common purpose.
A Kafka cluster is a fast, highly scalable messaging system.
Effective for applications that involve large-scale message processing.
Cluster set up
[Diagram: a producer sends messages to a cluster of broker1, broker2, and broker3 coordinated by ZooKeeper; a consumer requests the next message.]
Why Kafka Cluster
With Kafka we can easily handle hundreds of thousands of messages per second, which makes Kafka a high-throughput system.
The cluster can be expanded with no downtime, making Kafka highly scalable.
Messages are replicated, which provides reliability and durability.
Fault tolerant.
Topic
A user-defined category to which messages are published.
For each topic, a partitioned log is maintained.
Each partition maintains an ordered, immutable sequence of messages, each assigned a sequential id number called the offset.
Writes to a partition are generally sequential, thereby reducing the number of hard disk seeks.
Messages can be read from the beginning of a partition, and a consumer can also rewind or skip to any point in a partition by supplying an offset value.
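A sketch of that rewind/skip behaviour, reusing the consumer object from the earlier sketch (partition number and offset are illustrative):

import java.util.Arrays;
import org.apache.kafka.common.TopicPartition;

// assign a specific partition explicitly (instead of subscribing via a group)
TopicPartition partition = new TopicPartition("SimpleTopic", 0);
consumer.assign(Arrays.asList(partition));
consumer.seekToBeginning(Arrays.asList(partition)); // read from the beginning
consumer.seek(partition, 42L);                      // or skip/rewind to any offset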
[Diagram: topic partitions – producers send messages to partitions such as "Global orders" and "Other orders"; a consumer reads from them.]
What is an Offset
A sequence id given to messages as they arrive in a partition.
[Diagram: messages m1–m8 in a partition, numbered with offsets 0, 1, 2, …]
Offset
There is a sent offset and a committed offset.
The committed offset is used to avoid resending already processed data to a new consumer in the event of a partition rebalance.
Auto commit: enable.auto.commit = true
Manual commit: enable.auto.commit = false
auto.commit.interval.ms – what is the purpose of this property? It controls how frequently (in milliseconds) offsets are auto-committed when auto commit is enabled.
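A minimal manual-commit sketch, continuing the consumer example above (process() is a hypothetical handler, not a Kafka API):

properties.put("enable.auto.commit", "false"); // disable auto commit

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical processing step
    }
    consumer.commitSync(); // commit the polled offsets only after processing succeeded
}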
What is a Consumer Group
A group of consumers acting as a single logical unit.
One consumer example.
One consumer in a clustered setup example.
Multiple consumers within a group.
There will not be any duplicate reads.
Each consumer within a consumer group is assigned partitions, and it reads messages only from the partitions assigned to it.
Once a partition is assigned to a consumer, it will not be assigned to another consumer within the same group unless a rebalancing takes place.
When the number of consumers in a group increases, the partitions are redistributed.
Rebalancing
Group Coordinator
One of the Kafka brokers is elected as the group coordinator.
When a new consumer joins a group, it sends a message to the coordinator.
The first consumer to join the consumer group becomes the leader of the group.
Roles and responsibilities:
◦ The coordinator manages the list of group members.
◦ The coordinator initiates a rebalance activity once the list is modified.
◦ The consumer leader executes the rebalance activity.
◦ The consumer leader assigns partitions to the members and sends the assignment back to the coordinator.
◦ The coordinator communicates to each member consumer its new assignment.
Production Scenario – Problem Statement
◦ Imagine poll() pulls a large amount of data and it takes a lot of time to process, which means there will be a delay before the next poll.
◦ If there is a delay before the next poll, the group coordinator will assume the consumer is dead and will issue a rebalancing. How will you know a rebalance is triggered, and how will you commit your offset in such cases?
Solution
In such a scenario, before the rebalance happens we have to commit whatever has been processed so far, so that when the partition is assigned to a new consumer, records already processed by this consumer won't be sent again to the newly assigned consumer.
The ConsumerRebalanceListener interface has methods for exactly this:
onPartitionsRevoked, which is invoked just before the rebalance is issued.
onPartitionsAssigned, which is invoked after the rebalance is complete.
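A hedged sketch of committing processed offsets from onPartitionsRevoked (the listener interface is Kafka's; the class name and markProcessed helper are illustrative):

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Tracks offsets as records are processed and commits them just before a rebalance.
class CommitOnRebalance implements ConsumerRebalanceListener {
    private final KafkaConsumer<String, String> consumer;
    private final Map<TopicPartition, OffsetAndMetadata> processedOffsets = new HashMap<>();

    CommitOnRebalance(KafkaConsumer<String, String> consumer) { this.consumer = consumer; }

    void markProcessed(TopicPartition tp, long offset) {
        processedOffsets.put(tp, new OffsetAndMetadata(offset + 1)); // commit the *next* offset to read
    }

    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // invoked just before the rebalance: commit whatever has been processed so far
        consumer.commitSync(processedOffsets);
    }

    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // invoked after the rebalance completes; nothing to do in this sketch
    }
}

The listener is attached when subscribing, e.g. consumer.subscribe(Arrays.asList("SimpleTopic"), new CommitOnRebalance(consumer));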
Fault Tolerance
Replication Factor
The total number of copies made of a partition is the replication factor.
The purpose of adding replication to Kafka is stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or, more commonly, software upgrades.
Leader and Follower
For each partition, one broker is chosen as the leader.
The leader copies data to all its replicas.
Client applications send messages only to the leader.
Zookeeper
An open-source Apache project.
Provides centralized infrastructure and services that enable synchronization across a cluster.
Common objects used across a large cluster environment, such as configuration and the hierarchical naming space, are maintained in ZooKeeper.
ZooKeeper services are used by large-scale applications to coordinate distributed processing across large clusters.
Installing Kafka
Kafka can be downloaded from https://kafka.apache.org/downloads.html
As per the current documentation, the version of Kafka is 0.11.0.0.
Kafka Configurations
Property (default) – description:
broker.id (no default) – Each broker is uniquely identified by a non-negative integer id. This id serves as the broker's "name" and allows the broker to be moved to a different host/port without confusing consumers. You can choose any number you like as long as it is unique.
log.dirs (default /tmp/kafka-logs) – A comma-separated list of one or more directories in which Kafka data is stored. Each new partition that is created is placed in the directory which currently has the fewest partitions.
port (default 6667) – The port on which the server accepts client connections.
zookeeper.connect (default null) – Specifies the ZooKeeper connection string in the form hostname:port, where hostname and port are the host and port of a node in your ZooKeeper cluster. To allow connecting through other ZooKeeper nodes when that host is down, you can also specify multiple hosts in the form hostname1:port1,hostname2:port2,hostname3:port3. ZooKeeper also allows you to add a "chroot" path which will make all Kafka data for the cluster appear under a particular path. This is a way to set up multiple Kafka clusters or …
Testing Kafka Cluster
A Kafka cluster can run with either of the following broker models:
Single-broker cluster
Multi-broker cluster
A single-broker cluster runs only one broker instance, whereas a multi-broker cluster runs multiple brokers.
To test the Kafka cluster, the following shell scripts can be used.
Kafka Shell Scripts
zookeeper-server-start.sh
kafka-server-start.sh
kafka-topics.sh
kafka-console-producer.sh
kafka-console-consumer.sh
Demo
Start the Kafka server.
Create a topic.
Start a console producer.
Start a console consumer.
Send and receive a message.
Set up a clustered broker (one possible command sequence is sketched below).
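A possible command sequence for the demo, assuming a local Kafka 0.11 install with the default config files (the topic name is illustrative):

# start ZooKeeper, then the Kafka broker
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
# create a topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic SimpleTopic
# console producer: type messages and press enter to send
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic SimpleTopic
# console consumer: prints received messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic SimpleTopic --from-beginning
# clustered setup: copy server.properties, change broker.id/port/log.dirs, and start additional brokers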
Broker Configuration
1. auto.create.topics.enable
2. default.replication.factor
3. num.partitions
4. log.retention.ms
5. log.retention.bytes
Producer API
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.1</version>
</dependency>
Maven dependencies for Kafka Java API
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducerToLearn {
    public static void main(String[] args) {
        String topicName = "SimpleTopic";
        String key = "Key1";
        String value = "Value-1";
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        // the new producer API expects key.serializer / value.serializer
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("acks", "1"); // wait for the leader's acknowledgement
        Producer<String, String> producer = new KafkaProducer<>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<>(topicName, key, value);
        producer.send(record); // fire and forget
        producer.close();
    }
}
Producer Java API
Producer Record
Kafka comes with a default partitioner.
Messages with the same message key go to the same partition.
The key is optional; if a message has no key, Kafka will evenly distribute messages across the partitions.
If you pass a partition in the constructor, the default partitioner is bypassed.
The timestamp field in the constructor denotes the time when the message is sent to the broker. If you don't pass it, the broker sets the timestamp to the time at which the message was received by the broker.
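The ProducerRecord constructors make these options concrete (topic, values, partition, and timestamp are illustrative):

// full constructor: topic, partition, timestamp, key, value – the explicit partition bypasses the partitioner
ProducerRecord<String, String> r1 =
    new ProducerRecord<>("SimpleTopic", 0, System.currentTimeMillis(), "Key1", "Value-1");
// key only: the default partitioner hashes the key to pick the partition
ProducerRecord<String, String> r2 = new ProducerRecord<>("SimpleTopic", "Key1", "Value-1");
// no key: messages are spread evenly across partitions
ProducerRecord<String, String> r3 = new ProducerRecord<>("SimpleTopic", "Value-1");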
Producer Workflow
3 Different Send Requests:
◦ Fire and forget
◦ Synchronous send
◦ Asynchronous send (with callback and acknowledgement)
Sync
package com.Learning.co;

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SynchronousProducerToLearn {
    public static void main(String[] args) {
        String topicName = "SimpleTopic";
        String key = "Key1";
        String value = "Value-1";
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("acks", "1");
        Producer<String, String> producer = new KafkaProducer<>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<>(topicName, key, value);
        try {
            // send() returns a Future; get() blocks until the broker acknowledges
            RecordMetadata metadata = producer.send(record).get();
            System.out.println("Synchronous send completed with success, sent to partition "
                    + metadata.partition() + " offset " + metadata.offset());
        } catch (Exception e) {
            System.out.println("Synchronous send completed with failure");
        }
        producer.close();
    }
}
Async
// the callback variant of send(); requires org.apache.kafka.clients.producer.Callback
producer.send(record, new MessageCallBack());

class MessageCallBack implements Callback {
    // invoked on the producer's I/O thread once the broker responds (or the send fails)
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) {
            System.out.println("Failed");
        } else {
            System.out.println("Success");
        }
    }
}
Production Scenario – Problem Statement
Assume the auto-commit interval is set to 60 seconds.
The poll method in consumer A is invoked and receives 6 records. All 6 records are processed in less than 10 seconds. Since the 60-second interval is not over, these records are not committed to Kafka.
Now another set of records is received via the poll method.
Now let's assume that for some reason a rebalance is triggered. The first 6 records, which are already processed, are still not committed.
After the rebalance, the partition that was assigned to consumer A goes to a new consumer B. Since none of the records were committed by consumer A, the first 6 messages are resent to the new consumer B.
This is a clear case of data duplication. How do we handle it?
Producer Configs
Critical Configs
batch.size (size-based batching)
linger.ms (time-based batching)
compression.type
max.in.flight.requests.per.connection (affects ordering)
acks (affects durability)
retries
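As a hedged sketch, those configs set on a producer (the values are illustrative, not recommendations):

Properties props = new Properties();
props.put("batch.size", "16384");    // size-based batching: bytes buffered per partition before sending
props.put("linger.ms", "5");         // time-based batching: wait up to 5 ms to fill a batch
props.put("compression.type", "snappy");
props.put("max.in.flight.requests.per.connection", "1"); // 1 preserves ordering under retries
props.put("acks", "all");            // durability: wait for all in-sync replicas
props.put("retries", "3");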
Acks
◦ acks = 0
◦ The producer doesn't wait for a response from the broker.
◦ High throughput.
◦ No retries.
◦ Loss of messages is possible.
◦ acks = 1
◦ The producer waits for a response from the broker.
◦ The response is sent by the leader after it receives the message from the producer.
◦ Message loss is still possible.
◦ acks = -1
◦ The response is sent after the leader receives acknowledgements from all its in-sync replicas.
◦ Slow.
◦ Highly reliable.
Comparisons
Acks mode
Acks Throughtput Latency Durability
0 High Low No Gurantee
1 Medium Medium Leader
-1 Low High ISR
Partitioner
Default Partitioner:
◦ If a partition is specified in the record, use it.
◦ If no partition is specified but a key is present, choose a partition based on a hash of the key.
◦ If no partition or key is present, choose a partition in a round-robin fashion.
Partitioner
Code snippet from the default partitioner:
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
Custom Partitioner
import java.util.*;
import org.apache.kafka.clients.producer.*;

public class SensorProducer {
    public static void main(String[] args) throws Exception {
        String topicName = "SensorTopic";
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("partitioner.class", "SensorPartitioner"); // plug in the custom partitioner
        props.put("speed.sensor.name", "TSS");               // custom config read by the partitioner
        Producer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "SSP" + i, "500" + i));
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "TSS", "500" + i));
        producer.close();
        System.out.println("SimpleProducer Completed.");
    }
}
Custom Partitioner – Producer
import java.util.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.*;
import org.apache.kafka.common.utils.*;
import org.apache.kafka.common.record.*;

public class SensorPartitioner implements Partitioner {
    private String speedSensorName;

    public void configure(Map<String, ?> configs) {
        speedSensorName = configs.get("speed.sensor.name").toString();
    }

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        int sp = (int) Math.abs(numPartitions * 0.3); // reserve ~30% of partitions for the speed sensor
        int p = 0;
        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("All messages must have sensor name as key");
        if (((String) key).equals(speedSensorName))
            p = Utils.toPositive(Utils.murmur2(valueBytes)) % sp; // spread speed-sensor messages over the reserved partitions
        else
            p = Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - sp) + sp; // hash everything else into the rest
        System.out.println("Key = " + (String) key + " Partition = " + p);
        return p;
    }

    public void close() {}
}
Custom Partitioner
max.in.flight.requests.per.connection:
Definition: how many requests you can send to the broker without waiting for responses.
The default value is 5.
A high value gives higher throughput but also higher memory consumption.
For asynchronous sends, set this property to 1 to maintain the ordering of messages.
A value greater than 1 may cause out-of-order delivery when a retry occurs.
Scenario
A side effect of asynchronous send is that it may lose the ordering of data when processing messages in batches.
[Diagram: two batches of five records (Record1–Record5, Record6–Record10) sit in the partition buffer; one batch commits successfully at the broker, the other gets a callback with an exception, is retried, and succeeds – landing out of order.]
Async Producer
The async producer sends messages in the background – no blocking in the client.
Provides more powerful batching of messages.
Wraps a sync producer, or rather a pool of them.
Communication from async to sync happens via a queue.
This explains why you may see kafka.producer.async.QueueFullException.
The async producer may drop messages if its queue is full.
◦ Solution 1: don't push messages to the producer faster than it is able to send them to the queue.
◦ Solution 2: queue full == need more brokers.
◦ Solution 3: set queue.enqueue.timeout.ms to -1. Now the producer will block indefinitely and will never drop messages.
◦ Solution 4: increase queue.buffering.max.messages.
For a more detailed study: https://engineering.gnip.com/kafka-async-producer/
Other producer config properties
retries: the number of times the producer retries sending messages. Default value 0.
retry.backoff.ms: the time between retries. Default value 100 ms.
Kafka Monitoring
Lag
Lag = how far your consumer is behind the producer.
[Diagram: the producer writes at the newer-message end of the log, the consumer reads at the older-message end; the distance between them is the lag.]
Lag
Lag is a consumer problem: the consumer is too slow, there is too much GC, or it is losing its connection to ZK or Kafka.
A bug or design flaw in the consumer.
Operational mistakes, e.g. you brought up 6 Kafka servers in parallel, each one in turn triggering a rebalancing, and then hit Kafka's rebalance limit, cf. rebalance.max.retries.
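A common way to observe lag, assuming the consumer-groups tool shipped with this Kafka version (the group name is illustrative):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group
# the output reports, per partition, the consumer's current offset, the log-end offset, and the lag between them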
Under-replicated partitions
◦ For example, because a broker is down.
Offline partitions
◦ Even worse than under-replicated.
◦ A serious problem if there is anything but 0 offline partitions.
Assume the replication factor is set to 3 for this topic.
partition1: leader broker 0, ISR 1,2.
One of the replica brokers, say 2, goes down – under-replicated: partition1: leader broker 0, ISR 1.
Again one of the replicas, say 1, goes down – still under-replicated: partition1: leader broker 0, ISR 0 (only the leader itself remains in sync).
replica.lag.max.messages
[Diagram: the leader holds messages at offsets 0–6; in-sync replica 1 has copied all of them up to the commit point; in-sync replica 2 has copied only offsets 0–1.]
For some reason, messages are not being copied to in-sync replica 2. In this case replica 2 is lagging by 5 messages, which is more than the value of the property replica.lag.max.messages = 4 (the default value). This broker (replica 2) will go out of sync.
replica.lag.max.messages
What happens when messages come in batches?
1. Suppose the value of the property is set to 3.
2. Assume batch one has 5 messages, and this first batch is replicated to all brokers.
3. The second batch has another 5 messages. Since replica 1 is now lagging behind by more than 3 messages, it goes out of sync.
4. Hence, even though replica 1 is not dead, it goes out of sync.
5. The solution is to use replica.lag.time.max.ms instead.
[Diagram: the leader holds both batches of Record1–Record5 past the commit point; in-sync replica 1 holds only the first batch and goes out of sync when the second batch arrives.]
Production Scenario 1
What happens when a broker goes down and comes up again?
Partition: leader broker / leader after broker 1 goes down / leader after broker 1 comes up:
partition1: 0 / 0 / 0
partition2: 1 / 2 / 2
partition3: 2 / 2 / 2
partition4: 3 / 3 / 3
partition5: 1 / 0 / 0
partition6: 0 / 0 / 0
The sad reality is that broker 1 never becomes a leader again; it simply remains a follower in the ISR.
kafka-preferred-replica-election.sh comes to your rescue, and the load is evenly balanced again.
Production Scenario 2
How to increase or decrease the number of nodes in Kafka?
Increase or add a new broker:
◦ Just start a new instance of Kafka. But this new instance will never be a leader; hence, after starting the broker, run kafka-preferred-replica-election.sh.
Decrease or cut down a broker:
◦ Run kafka-reassign-partitions.sh.
◦ This will show the current replica assignment and a proposed replica assignment.
◦ kafka-reassign-partitions.sh <<list of brokers you want to keep>> --generate
◦ Suppose you have 5 brokers 1,2,3,4,5 and you want to bring down broker 5: kafka-reassign-partitions.sh <<1,2,3,4>> --generate
◦ This will generate a JSON file with the proposed assignment.
◦ Now run the script again: kafka-reassign-partitions.sh --execute --reassignment-json-file <<json file name>>
◦ After this, run kafka-preferred-replica-election.sh.
◦ Cross-check using the describe command.
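For reference, the generated reassignment JSON has roughly this shape (the topic name and broker ids are illustrative):

{"version": 1,
 "partitions": [{"topic": "SensorTopic", "partition": 0, "replicas": [1, 2]},
                {"topic": "SensorTopic", "partition": 1, "replicas": [2, 3]}]}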
Production Scenario 2
What to do if broker 2 goes down and is not recoverable?
◦ Simple start a new broker with broker.id similar to one which is currently not recoverable.
◦ Then start kafka-preferred-replica-election.sh
Production Scenario 3
Twitter Live Streaming Demo
Twitter Code URL
GitHub Links
https://github.com/Viyaan/TwitterKafkaProducer
https://github.com/Viyaan/StormKafkaStreamingWordCount
Editor's Notes
1. If there are more consumers in a group than partitions, the extra consumers will be idle. In no case will a partition be assigned to more than one consumer within the same group.
2. When a new consumer enters or exits the group, a rebalance is triggered by the group coordinator.