Kafka
Big Data
Big data describes volumes of structured and unstructured data so large that they are
difficult to process with traditional database and software techniques.
Lots of Data (Terabytes and Petabytes)
Big data is a term for a collection of data sets so large and complex that they are difficult to
process using on-hand database management tools or traditional processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
The stock market generates about one terabyte of new trade data per day, which is
analyzed to determine trends for optimal trades.
Unstructured data is exploding
The International Data Corporation (IDC) predicts that by 2020 the volume of data will exceed
40,000 EB, or 40 zettabytes.
The world's information is doubling every 2 years.
IBM definition of big data
What is Kafka
A distributed publish-subscribe messaging system.
Developed at LinkedIn.
Provides a solution for handling all activity stream data.
Integrates with the Hadoop platform.
Partitions real-time consumption across a cluster of machines.
Provides a mechanism for parallel load into Hadoop.
What it offers.
Need of Kafka
Feature: Description
High throughput: Supports hundreds of thousands of messages per second on moderate hardware.
Scalability: Highly scalable with no downtime.
Replication: Messages can be replicated across the brokers of a cluster.
Durability: Messages are persisted on disk.
Stream processing: Can be used for real-time streaming.
Data loss: With proper configuration, Kafka can ensure zero data loss.
Kafka Core Concepts
Kafka Terminology
Producer
Consumer
Broker
Cluster
Topic
Partition
Offset
• What is a producer?
An application that sends data.
Producer
An application that publishes messages to a topic in the Kafka cluster.
Can be any kind of application, such as a front end or a streaming job.
When writing a message, the producer can attach a key to it.
Attaching a key guarantees that all messages with the same key are written to the same
partition.
Supports both sync and async modes.
An application that subscribes to topics and consumes messages from brokers in the Kafka cluster.
A consumer group with multiple consumers can be configured to consume messages from a topic.
Each consumer in a consumer group reads messages from a different partition of the topic.
Consumer
[Diagram: two producers send messages to the Kafka server; two consumers read from it.]
Pull Mechanism
[Diagram: the producer sends a message to the broker; the consumer requests the next message from the broker.]
Each server is called a broker.
Handles hundreds of megabytes of writes from producers and reads from consumers.
Retains all published messages, whether or not they have been consumed.
If retention is configured for n days, a published message is available for consumption for
n days and is discarded thereafter.
Works like a queue if the consumer instances belong to the same consumer group; otherwise works
like publish-subscribe.
Brokers
A cluster is a group of computers sharing a workload for a common purpose.
A Kafka cluster is a fast, highly scalable messaging system.
Effective for applications that involve large-scale message processing.
Clusters
[Diagram: a producer sends messages to a cluster of three brokers (broker1, broker2, broker3) coordinated by zookeeper; a consumer requests the next message from the cluster.]
Cluster set up
Kafka can easily handle hundreds of thousands of messages per second, which makes it a
high-throughput system.
The cluster can be expanded with no downtime, which makes Kafka highly scalable.
Messages are replicated, which provides reliability and durability.
Fault tolerant.
Why Kafka Cluster
Topic
A user-defined category to which messages are published.
For each topic, a partitioned log is maintained.
Each partition is an ordered, immutable sequence of messages, each assigned a
sequential id number called the offset.
Writes to a partition are sequential, which reduces the number of hard disk seeks.
A consumer can read a partition from the beginning, or rewind or skip to any point in it
by supplying an offset value.
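The offset behaviour described above can be sketched with a toy in-memory log. This is only an illustration, assuming a hypothetical PartitionLog class; it is not how Kafka stores data on disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one partition: an append-only list where the index is the offset. */
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message and returns the offset assigned to it. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns up to maxMessages messages, starting at the given offset. */
    List<String> read(long fromOffset, int maxMessages) {
        int start = (int) Math.min(Math.max(fromOffset, 0), messages.size());
        int end = Math.min(start + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(start, end));
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        log.append("m1");
        log.append("m2");
        log.append("m3");
        System.out.println(log.read(0, 10)); // read from the beginning
        System.out.println(log.read(2, 10)); // skip ahead to offset 2
    }
}
```

Because reads never remove anything, the same consumer can rewind simply by asking for a smaller offset again.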
Topic
[Diagram: producers send messages to two topics, "Global orders" and "Other orders"; consumers read messages from them.]
Partition
What is an Offset
A sequence id given to messages as they arrive in a partition.
[Diagram: messages m1 to m8 in a partition, numbered with offsets 0, 1, 2 and so on. Two positions are marked: the sent offset and the committed offset.]
Offset
The committed offset is used to avoid resending already processed data to the new consumer
after a partition rebalance.
Auto commit: enable.auto.commit = true
Manual commit: enable.auto.commit = false
auto.commit.interval.ms = 4: what is the purpose of this property? It sets how often, in
milliseconds, offsets are committed automatically when enable.auto.commit is true.
What is a Consumer group
A group of consumers acting as a single logical unit.
One consumer example
One consumer in clustered setup example
Multiple consumers within a group.
Within a group there are no duplicate reads.
Each consumer within a consumer group is assigned one or more partitions and reads messages
only from the partitions assigned to it.
Once a partition is assigned to a consumer, it is not assigned to another consumer in the
same group unless a rebalance takes place.
When the number of consumers in a group increases, the partitions are redistributed.
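A minimal sketch of how partitions could be spread over the members of one group. The GroupAssignment class and its round-robin rule are assumptions for illustration; Kafka's own assignors are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy assignment: each partition goes to exactly one consumer of the group. */
class GroupAssignment {
    /** Distributes partitions 0..numPartitions-1 over the consumers in round-robin order. */
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One consumer owns every partition.
        System.out.println(assign(Arrays.asList("c1"), 4));
        // A second consumer joins: the partitions are redistributed.
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
    }
}
```

With more consumers than partitions, the extra consumers in this sketch receive an empty list and stay idle.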
Rebalancing
One of the Kafka brokers is elected as the group coordinator.
When a new consumer joins the group, it sends a message to the coordinator.
The first consumer to join the consumer group becomes the leader of the group.
Roles and Responsibilities.
◦ The coordinator manages the list of group members.
◦ The coordinator initiates a rebalance once the list is modified.
◦ The consumer leader executes the rebalance.
◦ The consumer leader assigns partitions to the members and sends the assignment back to the coordinator.
◦ The coordinator communicates its new assignment to each member consumer.
Group Coordinator
◦ Imagine that poll() pulls a large amount of data that takes a long time to process, which
delays the next poll.
◦ If the next poll is delayed, the group coordinator assumes that the consumer is dead and
triggers a rebalance. How will you know that a rebalance has been triggered, and how will you
commit your offsets in that case?
Production Scenario – Problem statement.
Bala | 7/14/2017
© Robert Bosch Engineering and Business Solutions Private Limited 2017. All rights reserved, also regarding any disposal, exploitation, reproduction, editing, distribution, as well as in the event of applications for industrial property rights.
Solution
Before the rebalance happens, commit the offsets of everything processed so far. When the
partition is then assigned to a new consumer, records already processed by this consumer are not sent
again to the newly assigned consumer.
The ConsumerRebalanceListener interface has the following methods:
onPartitionsRevoked, which is invoked just before the rebalance.
onPartitionsAssigned, which is invoked after the rebalance is complete.
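The idea of committing on revoke can be sketched without a broker. The OffsetTracker class below is a hypothetical stand-in: in a real consumer the same logic would sit in a ConsumerRebalanceListener and commit through the consumer, which this sketch only imitates with a map.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: tracks processed offsets and commits them when partitions are revoked. */
class OffsetTracker {
    // partition -> offset of the next record to read (last processed + 1)
    private final Map<Integer, Long> processed = new HashMap<>();
    // Stand-in for the offsets stored by the broker.
    private final Map<Integer, Long> committed = new HashMap<>();

    /** Call after a record has been fully processed. */
    void markProcessed(int partition, long offset) {
        processed.put(partition, offset + 1);
    }

    /** Stand-in for the revoke callback: commit what was processed, then forget it. */
    void onPartitionsRevoked() {
        committed.putAll(processed);
        processed.clear();
    }

    /** Offset from which the next owner of the partition would start reading. */
    long committedOffset(int partition) {
        return committed.getOrDefault(partition, 0L);
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        for (long offset = 0; offset < 6; offset++) {
            tracker.markProcessed(0, offset); // six records processed, nothing committed yet
        }
        tracker.onPartitionsRevoked();        // rebalance is about to happen
        System.out.println(tracker.committedOffset(0));
    }
}
```

Since the commit happens before the partition changes hands, the new owner starts after the six processed records instead of receiving them again.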
Fault Tolerance
The replication factor is the total number of copies kept of a partition.
The purpose of adding replication in Kafka is for stronger durability and higher availability. We
want to guarantee that any successfully published message will not be lost and can be
consumed, even when there are server failures. Such failures can be caused by machine error,
program error, or more commonly, software upgrades.
Replication Factor
Leader and Follower
For each partition, one broker is chosen as the leader.
The other replicas (followers) copy the data from the leader.
Client applications send messages only to the leader.
An open source Apache project.
Provides centralised infrastructure and services that enable synchronisation across clusters.
Common objects used across a large cluster environment, such as configuration and the
hierarchical naming space, are maintained in zookeeper.
Zookeeper services are used by large-scale applications to coordinate distributed processing
across a large cluster.
Zookeeper
Kafka can be downloaded from https://kafka.apache.org/downloads.html
At the time of writing, the current version of Kafka in the documentation is 0.11.0.0.
Installing Kafka
Kafka Configurations
Property: broker.id
Default: (none)
Description: Each broker is uniquely identified by a non-negative integer id. This id serves as the broker's "name" and allows the broker to be moved to a different host/port without confusing consumers. You can choose any number you like as long as it is unique.
Property: log.dirs
Default: /tmp/kafka-logs
Description: A comma-separated list of one or more directories in which Kafka data is stored. Each new partition that is created will be placed in the directory which currently has the fewest partitions.
Property: port
Default: 6667
Description: The port on which the server accepts client connections.
Property: zookeeper.connect
Default: null
Description: Specifies the zookeeper connection string in the form hostname:port, where hostname and port are the host and port for a node in your zookeeper cluster. To allow connecting through other zookeeper nodes when that host is down you can also specify multiple hosts in the form hostname1:port1,hostname2:port2,hostname3:port3. Zookeeper also allows you to add a "chroot" path which will make all Kafka data for the cluster appear under a particular path. This is a way to setup multiple kafka clusters or
A Kafka cluster can run in either of the following broker models.
Single Broker Cluster
Multi Broker Cluster
A single-broker cluster runs only one broker instance, whereas a multi-broker cluster runs
multiple brokers.
The following shell scripts can be used to test the Kafka cluster.
Testing Kafka Cluster
Kafka Shell Scripts
zookeeper-server-start.sh
kafka-server-start.sh
kafka-topics.sh
kafka-console-producer.sh
kafka-console-consumer.sh
Demo
Start Kafka server
Create topic
Start a console producer
Start a console consumer.
Send and receive message.
Set up clustered broker
Broker
Configuration
1. auto.create.topics.enable
2. default.replication.factor
3. num.partitions
4. log.retention.ms
5. log.retention.bytes
Producer
API
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.8.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.2.1</version>
</dependency>
Maven dependencies for Kafka Java API
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducerToLearn {
    public static void main(String[] args) {
        String topicName = "SimpleTopic";
        String key = "Key1";
        String value = "Value-1";

        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        // KafkaProducer requires a serializer for both the key and the value.
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for the partition leader to acknowledge each write.
        properties.put("acks", "1");

        Producer<String, String> producer = new KafkaProducer<>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<>(topicName, key, value);
        producer.send(record); // fire and forget: the returned Future is ignored
        producer.close();
    }
}
Producer Java API
Documentation
Producer Record
Kafka comes with a default partitioner.
Messages with the same message key go to the same partition.
The key is optional; if a message has no key, Kafka distributes messages evenly across
the partitions.
If you pass a partition in the constructor, the default partitioner is not used for that record.
The timestamp field in the constructor denotes the time at which the message was created. If you
do not pass it, the producer stamps the record with its current time; a topic configured for log append time uses the time at which the broker received the message instead.
Producer Workflow
Fire and forget
Synchronous send
Asynchronous send, with callback and acknowledgement
3 Different Send Requests
package com.Learning.co;

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SynchronousProducerToLearn {
    public static void main(String[] args) {
        String topicName = "SimpleTopic";
        String key = "Key1";
        String value = "Value-1";

        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("acks", "1");

        Producer<String, String> producer = new KafkaProducer<String, String>(properties);
        ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, key, value);
        try {
            // get() blocks until the broker responds, which makes the send synchronous.
            RecordMetadata metadata = producer.send(record).get();
            System.out.println("Synchronous send completed with success, sent to partition "
                    + metadata.partition() + " offset " + metadata.offset());
        } catch (Exception e) {
            System.out.println("Synchronous send completed with failure: " + e.getMessage());
        } finally {
            producer.close();
        }
    }
}
Sync
Async
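// Requires: import org.apache.kafka.clients.producer.Callback;
//           import org.apache.kafka.clients.producer.RecordMetadata;

// Asynchronous send: send() returns immediately and the callback runs when the broker responds.
producer.send(record, new MessageCallBack());

class MessageCallBack implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception e) {
        // e is null when the send succeeded.
        if (e != null) {
            System.out.println("Failed");
        } else {
            System.out.println("Success");
        }
    }
}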
Production Scenario – Problem statement.
Assume the auto commit interval is set to 60 seconds (the default for auto.commit.interval.ms is 5 seconds).
The poll method in consumer A receives 6 records, all of which are processed in less than
10 seconds. Since the 60-second interval has not yet elapsed, their offsets are not committed to Kafka.
Another set of records is then received via poll.
Now assume that, for some reason, a rebalance is triggered. The first 6 records, although already processed, are still
not committed.
After the rebalance, the partition that was assigned to consumer A goes to a new consumer B. Since consumer A
committed none of the records, the first 6 messages are sent again to consumer B.
This is a clear case of data duplication. How should it be handled?
Producer
Configs
Critical Configs
batch.size (size-based batching)
linger.ms (time-based batching)
compression.type
max.in.flight.requests.per.connection (affects ordering)
acks (affects durability)
retries
Acks
◦ acks = 0
◦ The producer does not wait for a response from the broker.
◦ High throughput.
◦ No retries.
◦ Loss of messages is possible.
◦ acks = 1
◦ The producer waits for a response from the broker.
◦ The response is sent by the leader after it receives the message from the producer.
◦ Message loss is still possible, for example if the leader fails before the followers have copied the message.
◦ acks = -1 (equivalent to acks = all)
◦ The response is sent after the leader receives acknowledgement from all its in-sync replicas.
◦ Slow.
◦ Highly reliable.
Comparisons
Acks mode
Acks   Throughput   Latency   Durability
0      High         Low       No guarantee
1      Medium       Medium    Leader
-1     Low          High      ISR
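A small sketch of how the acks level could be set when building producer properties. The AcksConfig helper is hypothetical; only the property names are Kafka's, and the result would be passed to a KafkaProducer constructor.

```java
import java.util.Properties;

/** Hypothetical helper that builds producer properties for a chosen acks level. */
class AcksConfig {
    /** Accepts "0", "1", "-1" or "all"; anything else is rejected. */
    static Properties producerProps(String bootstrapServers, String acks) {
        boolean valid = acks.equals("0") || acks.equals("1")
                || acks.equals("-1") || acks.equals("all");
        if (!valid) {
            throw new IllegalArgumentException("Unsupported acks value: " + acks);
        }
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", acks);
        return props;
    }

    public static void main(String[] args) {
        // Favour durability over latency: wait for all in-sync replicas.
        Properties durable = producerProps("localhost:9092", "all");
        System.out.println(durable.getProperty("acks"));
    }
}
```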
Partitioner
Default Partitioner
• If a partition is specified in the record, use it.
• If no partition is specified but a key is present, choose a partition based on a hash of the key.
• If neither a partition nor a key is present, choose a partition in a round-robin fashion.
Partitioner
Code snippet from the default partitioner
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
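The three rules of the default partitioner can be sketched as a toy model. PartitionChooser is a hypothetical class, and Arrays.hashCode stands in for Kafka's murmur2 hash, so the partition numbers it produces will differ from Kafka's.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the partition-selection rules; not Kafka's implementation. */
class PartitionChooser {
    private final AtomicInteger counter = new AtomicInteger(0);

    /**
     * @param explicitPartition partition set on the record, or null
     * @param key               record key, or null
     * @param numPartitions     number of partitions in the topic
     */
    int choose(Integer explicitPartition, String key, int numPartitions) {
        if (explicitPartition != null) {
            return explicitPartition;                        // rule 1: use the given partition
        }
        if (key != null) {
            // Assumption: Arrays.hashCode replaces murmur2 for this sketch.
            int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
            return (hash & 0x7fffffff) % numPartitions;       // rule 2: hash of the key
        }
        return (counter.getAndIncrement() & 0x7fffffff) % numPartitions; // rule 3: round robin
    }

    public static void main(String[] args) {
        PartitionChooser chooser = new PartitionChooser();
        System.out.println(chooser.choose(2, "TSS", 4));    // explicit partition wins
        System.out.println(chooser.choose(null, "TSS", 4)); // same key, same partition every time
        System.out.println(chooser.choose(null, null, 4));  // no key: round robin
    }
}
```

Masking with 0x7fffffff clears the sign bit, which mirrors what toPositive does in the snippet above.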
Custom Partitioner
import java.util.*;
import org.apache.kafka.clients.producer.*;

public class SensorProducer {
    public static void main(String[] args) throws Exception {
        String topicName = "SensorTopic";
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("partitioner.class", "SensorPartitioner");
        props.put("speed.sensor.name", "TSS");
        Producer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "SSP" + i, "500" + i));
        for (int i = 0; i < 10; i++)
            producer.send(new ProducerRecord<>(topicName, "TSS", "500" + i));
        producer.close();
        System.out.println("SimpleProducer Completed.");
    }
}
Partition - Producer
import java.util.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.*;
import org.apache.kafka.common.utils.*;
import org.apache.kafka.common.record.*;

public class SensorPartitioner implements Partitioner {
    private String speedSensorName;

    public void configure(Map<String, ?> configs) {
        // Read the custom property set by the producer.
        speedSensorName = configs.get("speed.sensor.name").toString();
    }

    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        // Reserve 30% of the partitions for the speed sensor.
        // With fewer than 4 partitions sp is 0 and the modulo below fails.
        int sp = (int) Math.abs(numPartitions * 0.3);
        int p = 0;
        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("All messages must have sensor name as key");
        if (((String) key).equals(speedSensorName))
            // Speed sensor: spread by value hash over the reserved partitions.
            p = Utils.toPositive(Utils.murmur2(valueBytes)) % sp;
        else
            // Other sensors: spread by key hash over the remaining partitions.
            p = Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - sp) + sp;
        System.out.println("Key = " + (String) key + " Partition = " + p);
        return p;
    }

    public void close() {}
}
Custom Partitioner
max.in.flight.requests.per.connection:
Definition: the number of requests you can send to a broker on one connection without receiving a response.
Default value is 5.
A high value gives higher throughput but also uses more memory.
With asynchronous sends, set this property to 1 to maintain the ordering of messages.
A value greater than 1 may cause out-of-order delivery when a retry occurs.
Scenario
A side effect of asynchronous sends is that the ordering of data may be lost when messages are processed in batches.
[Diagram: the producer's partition buffer holds two batches, Record1 to Record5 and Record6 to Record10. The first batch fails (callback with exception) while the second commits successfully. The first batch is then retried and succeeds, so the broker stores Record6 to Record10 followed by Record1 to Record5.]
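The reordering can be reproduced with a small simulation. InFlightDemo is a hypothetical model that sends batches in windows of maxInFlight and retries failed batches once, after the rest of the window; a real producer pipelines requests differently, so treat this only as an illustration of the effect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy simulation of how retries reorder batches when several are in flight. */
class InFlightDemo {
    /**
     * @param batches       batches in send order
     * @param failsFirstTry indexes of batches whose first attempt fails
     * @param maxInFlight   number of batches sent before waiting for responses
     * @return records in the order they are written to the partition
     */
    static List<String> deliver(List<List<String>> batches, Set<Integer> failsFirstTry, int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be at least 1");
        }
        List<String> written = new ArrayList<>();
        for (int start = 0; start < batches.size(); start += maxInFlight) {
            int end = Math.min(start + maxInFlight, batches.size());
            List<Integer> retries = new ArrayList<>();
            for (int i = start; i < end; i++) {        // first attempts of this window
                if (failsFirstTry.contains(i)) {
                    retries.add(i);
                } else {
                    written.addAll(batches.get(i));
                }
            }
            for (int i : retries) {                    // retries land after the window
                written.addAll(batches.get(i));
            }
        }
        return written;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("Record1", "Record2"),
                Arrays.asList("Record3", "Record4"));
        Set<Integer> firstBatchFails = Collections.singleton(0);
        System.out.println(deliver(batches, firstBatchFails, 5)); // order is lost
        System.out.println(deliver(batches, firstBatchFails, 1)); // order is kept
    }
}
```

With a window of 1, the failed batch is retried before the next batch is sent, which is why that setting preserves ordering.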
The async producer sends messages in the background, without blocking the client.
Provides more powerful batching of messages.
Wraps a sync producer, or rather a pool of them.
Communication from async to sync happens via a queue.
This explains why you may see kafka.producer.async.QueueFullException.
The async producer may drop messages if its queue is full.
◦ Solution 1: do not push messages to the producer faster than it is able to send them.
◦ Solution 2: queue full == need more brokers.
◦ Solution 3: set queue.enqueue.timeout.ms to -1. The producer will then block indefinitely and never drop messages.
◦ Solution 4: increase queue.buffering.max.messages.
For a more detailed study: https://engineering.gnip.com/kafka-async-producer/
Async Producer
Other producer config properties
retries: number of times the producer retries sending a message. Default value: 0.
retry.backoff.ms: time to wait between retries. Default value: 100 ms.
Kafka
Monitoring
Lag
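Lag = how far your consumer is behind the producer.
[Diagram: a partition log running from older to newer messages. The producer writes at the newer end, the consumer reads further back, and the distance between the two positions is the lag.]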
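Lag can be computed per partition as the log end offset minus the consumer's committed offset. A minimal sketch, assuming a hypothetical ConsumerLag helper that is handed both offset maps (in practice a monitoring tool would fetch them from the cluster):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy lag calculation from two offset maps keyed by partition number. */
class ConsumerLag {
    /** Lag per partition = log end offset - committed offset, never negative. */
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> logEndOffsets,
                                              Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new LinkedHashMap<>();
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            // A partition with no committed offset counts from offset 0.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            lag.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
        }
        return lag;
    }

    /** Sum of the lag over all partitions. */
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        return lagPerPartition(logEndOffsets, committedOffsets)
                .values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = new HashMap<>();
        end.put(0, 100L);
        end.put(1, 50L);
        Map<Integer, Long> committed = new HashMap<>();
        committed.put(0, 90L);
        System.out.println(lagPerPartition(end, committed));
        System.out.println(totalLag(end, committed));
    }
}
```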
Lag is a consumer problem.
The consumer is too slow, spends too much time in GC, or is losing its connection to ZK or Kafka.
A bug or design flaw in the consumer.
Operational mistakes, e.g. you brought up 6 Kafka servers in parallel, each one in turn triggering
a rebalance, and then hit Kafka's rebalance limit; cf. rebalance.max.retries.
Lag
Under-replicated partitions.
◦ For example, because a broker is down.
Offline partitions
◦ Even worse than under-replicated.
◦ Serious problem if there are anything but 0 offline partitions.
Assume the replication factor is set to 3 for this topic.
Partition    Leader broker   ISR
partition1   0               1,2
One of the replica brokers, say 2, goes down: under-replicated.
partition1   0               1
Another replica, say 1, goes down: still under-replicated.
partition1   0               0
replica.lag.max.messages
Leader   In-sync replica 1   In-sync replica 2
0        0                   0
1        1                   1
2        2
3        3
4        4
5        5
6        6
For some reason, messages are not being copied to in-sync replica 2, which is now lagging
by 5 messages. That is more than the value of replica.lag.max.messages, 4 in this example
(the actual default was 4000), so this broker (replica 2) goes out of sync.
replica.lag.max.messages
1. What happens when messages arrive in batches?
2. Assume the value of the property is set to 3.
3. Batch one has 5 messages (Record1 to Record5) and is replicated to all brokers.
4. Batch two has another 5 messages. When it arrives at the leader, replica 1 is immediately
more than 3 messages behind, so it goes out of sync.
5. Hence replica 1 goes out of sync even though it is not dead.
6. The solution is to use the time-based setting replica.lag.time.max.ms instead.
[Diagram: the leader holds batch one and batch two (Record1 to Record5 each); in-sync replica 1 holds only batch one and goes out of sync.]
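The two in-sync rules can be compared side by side. IsrCheck is a hypothetical sketch of the checks, not broker code; the time-based variant follows the idea behind replica.lag.time.max.ms, where a replica stays in sync as long as it caught up with the leader recently enough.

```java
/** Toy versions of the message-count and time-based in-sync checks. */
class IsrCheck {
    /** Message-count rule: in sync while the replica is at most maxLagMessages behind. */
    static boolean inSyncByMessages(long leaderEndOffset, long replicaEndOffset, long maxLagMessages) {
        return leaderEndOffset - replicaEndOffset <= maxLagMessages;
    }

    /** Time rule: in sync while the replica last caught up within maxLagMs. */
    static boolean inSyncByTime(long nowMs, long lastCaughtUpMs, long maxLagMs) {
        return nowMs - lastCaughtUpMs <= maxLagMs;
    }

    public static void main(String[] args) {
        // Leader has 7 messages, replica 2 has 2: lag of 5 exceeds the limit of 4.
        System.out.println(inSyncByMessages(7, 2, 4));
        // A batch of 5 just arrived: the count rule with limit 3 drops a healthy replica ...
        System.out.println(inSyncByMessages(10, 5, 3));
        // ... while the time rule keeps it, because it caught up 100 ms ago.
        System.out.println(inSyncByTime(1_000L, 900L, 10_000L));
    }
}
```

The count rule reacts to burst size, whereas the time rule only drops a replica that has really stopped keeping up.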
What happens when a broker goes down and comes up again?
Production Scenario 1
Partition    Leader broker   Leader after broker 1 goes down   Leader after broker 1 comes back up
partition1   0               0                                 0
partition2   1               2                                 2
partition3   2               2                                 2
partition4   3               3                                 3
partition5   1               0                                 0
partition6   0               0                                 0
The sad reality is that broker 1 does not become leader again on its own in this scenario.
It simply remains a follower in the ISR.
kafka-preferred-replica-election.sh comes to your rescue: it moves leadership back to the
preferred replicas, so the load is evenly balanced again.
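What a preferred replica election decides for one partition can be sketched as follows. PreferredLeader is a hypothetical helper and the replica lists in the example are assumed values; the one fact it relies on is that the first broker in a partition's replica list is its preferred leader.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a preferred replica election for a single partition. */
class PreferredLeader {
    /**
     * @param replicas replica list of the partition; the first entry is the preferred leader
     * @param isr      brokers currently in sync
     * @param current  current leader
     * @return the leader after the election
     */
    static int elect(List<Integer> replicas, Set<Integer> isr, int current) {
        int preferred = replicas.get(0);
        // Leadership only moves back if the preferred replica is in sync again.
        return isr.contains(preferred) ? preferred : current;
    }

    public static void main(String[] args) {
        List<Integer> replicas = Arrays.asList(1, 2, 3);   // assumed replica list
        // Broker 1 is back and in sync: leadership returns from broker 2 to broker 1.
        System.out.println(elect(replicas, new HashSet<>(Arrays.asList(1, 2, 3)), 2));
        // Broker 1 is still catching up: broker 2 stays leader.
        System.out.println(elect(replicas, new HashSet<>(Arrays.asList(2, 3)), 2));
    }
}
```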
How to increase or decrease the number of nodes in Kafka?
Increase or Add new Broker
◦ Just start a new instance of Kafka. Existing partitions are not moved to the new broker automatically, so it will not lead
any of them. Use kafka-reassign-partitions.sh to move partitions to it, then run kafka-preferred-replica-election.sh.
Decrease or Cut down a Broker
◦ Run kafka-reassign-partitions.sh.
◦ With --generate, it shows the current replica assignment and a proposed replica assignment.
◦ kafka-reassign-partitions.sh --zookeeper <<zookeeper host:port>> --topics-to-move-json-file <<json file listing the topics>> --broker-list <<list of brokers you want to keep>> --generate
◦ Suppose you have 5 brokers 1,2,3,4,5 and you want to bring down 5: pass --broker-list "1,2,3,4".
◦ Save the proposed assignment that it prints as a JSON file.
◦ Now run the script again:
◦ kafka-reassign-partitions.sh --zookeeper <<zookeeper host:port>> --execute --reassignment-json-file <<json file name>>
◦ After this, run kafka-preferred-replica-election.sh.
◦ Cross-check using the describe command.
Production Scenario 2
What to do if broker 2 goes down and is not recoverable?
◦ Simply start a new broker with the same broker.id as the one that is not recoverable.
◦ Then run kafka-preferred-replica-election.sh.
Production Scenario 3
Twitter Live Streaming Demo
Twitter Code URL
Viyaan | 7/14/2017
Git Hub Links
https://github.com/Viyaan/TwitterKafkaProducer
https://github.com/Viyaan/StormKafkaStreamingWordCount

More Related Content

What's hot

Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, ConfluentMaking Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, ConfluentHostedbyConfluent
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?confluent
 
What's new in Confluent 3.2 and Apache Kafka 0.10.2
What's new in Confluent 3.2 and Apache Kafka 0.10.2 What's new in Confluent 3.2 and Apache Kafka 0.10.2
What's new in Confluent 3.2 and Apache Kafka 0.10.2 confluent
 
Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...
Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...
Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...Kai Wähner
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Introducing Kafka's Streams API
Introducing Kafka's Streams APIIntroducing Kafka's Streams API
Introducing Kafka's Streams APIconfluent
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafkaconfluent
 
Fault Tolerance with Kafka
Fault Tolerance with KafkaFault Tolerance with Kafka
Fault Tolerance with KafkaEdureka!
 
Message Driven and Event Sourcing
Message Driven and Event SourcingMessage Driven and Event Sourcing
Message Driven and Event SourcingPaolo Castagna
 
Leveraging Microservice Architectures & Event-Driven Systems for Global APIs
Leveraging Microservice Architectures & Event-Driven Systems for Global APIsLeveraging Microservice Architectures & Event-Driven Systems for Global APIs
Leveraging Microservice Architectures & Event-Driven Systems for Global APIsconfluent
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumChengKuan Gan
 
Kafka Connect by Datio
Kafka Connect by DatioKafka Connect by Datio
Kafka Connect by DatioDatio Big Data
 
dotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede
dotScale 2017 Keynote: The Rise of Real Time by Neha NarkhededotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede
dotScale 2017 Keynote: The Rise of Real Time by Neha Narkhedeconfluent
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformconfluent
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...HostedbyConfluent
 

What's hot (20)

Kafka connect
Kafka connectKafka connect
Kafka connect
 
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, ConfluentMaking Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
Making Kafka Cloud Native | Jay Kreps, Co-Founder & CEO, Confluent
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
 
What's new in Confluent 3.2 and Apache Kafka 0.10.2
What's new in Confluent 3.2 and Apache Kafka 0.10.2 What's new in Confluent 3.2 and Apache Kafka 0.10.2
What's new in Confluent 3.2 and Apache Kafka 0.10.2
 
Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...
Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...
Apache Kafka + Apache Mesos + Kafka Streams - Highly Scalable Streaming Micro...
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Introducing Kafka's Streams API
Kafka for Big Data Streaming

  • 2. Big Data: The term describes volumes of structured and unstructured data so large (terabytes to petabytes) that they are difficult to process with traditional database and software techniques. Big data is a collection of data sets so large and complex that it is difficult to process using on-hand database management tools or traditional processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
  • 3. The stock market generates about one terabyte of new trade data per day, which stock trading analytics use to determine trends for optimal trades.
  • 4. Unstructured data is exploding: International Data Corporation predicts the volume will exceed 40,000 EB (40 zettabytes) by 2020. The world's information is doubling every 2 years.
  • 5. IBM definition of big data.
  • 6. What is Kafka? A distributed publish-subscribe messaging system, developed by LinkedIn Corporation. It provides a solution for handling all activity stream data, is fully supported on the Hadoop platform, partitions real-time consumption across a cluster of machines, and provides a mechanism for parallel load into Hadoop.
  • 7.
  • 9. Need of Kafka (Feature: Description)
    ◦ High throughput: Supports hundreds of thousands of messages per second on moderate hardware.
    ◦ Scalability: Highly scalable with no downtime.
    ◦ Replication: Messages can be replicated across clusters.
    ◦ Durability: Supports persistence of messages on disk.
    ◦ Stream processing: Can be used for real-time streaming.
    ◦ Data loss: With proper configuration, Kafka can ensure zero data loss.
  • 12. What is a producer? An application that sends data.
  • 13. Producer: An application that publishes messages to a topic in the Kafka cluster. It can be any kind of front end or streaming application. When writing messages, the producer can attach a key to each message. Attaching a key guarantees that all messages with the same key are written to the same partition. Supports both sync and async modes.
  • 14. Consumer: An application that subscribes to and consumes messages from a broker in the Kafka cluster. To consume messages from a topic, a consumer group can be configured with multiple consumers. Each consumer in a consumer group reads messages from a different partition of the topic.
  • 17. Brokers: Each server is called a broker. A broker handles hundreds of megabytes of writes from producers and reads from consumers. It retains all published messages, irrespective of whether they have been consumed. If retention is configured for n days, a published message is available for consumption for n days and is discarded thereafter. Kafka works like a queue if the consumer instances belong to the same consumer group; otherwise it works like publish-subscribe.
  • 18. Clusters: A group of computers sharing a workload for a common purpose. A Kafka cluster is a fast, highly scalable messaging system, effective for applications that involve large-scale message processing.
  • 19. Cluster set up (diagram): a producer sends messages to a cluster of three brokers (broker1, broker2, broker3) coordinated by zookeeper; a consumer requests the next message and receives it from the cluster.
  • 20. Why Kafka Cluster: Kafka can easily handle hundreds of thousands of messages per second, which makes it a high-throughput system. The cluster can be expanded with no downtime, which makes Kafka highly scalable. Messages are replicated, which provides reliability and durability. It is fault tolerant.
  • 21. Topic: A user-defined category to which messages are published. A partition log is maintained for each topic. Each partition is an ordered, immutable sequence of messages, each assigned a sequential id number called the offset. Writes to a partition are generally sequential, which reduces the number of hard disk seeks. A consumer can read a partition from the beginning, or rewind or skip to any point in it by supplying an offset value.
  • 24. What is an offset? A sequence id given to messages as they arrive in a partition. (Diagram: messages m1 to m8 in a partition, numbered with offsets 0, 1, 2, and so on.)
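The offset behaviour described above can be sketched with a small in-memory model. This is not Kafka code: `PartitionLogSketch` and its methods are hypothetical names, used only to show how an append-only log hands out sequential offsets and lets a reader start from any of them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a single partition; not part of the Kafka API.
class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Appends a message to the end of the log and returns the offset it was assigned.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns every message from the given offset to the end of the log.
    List<String> readFrom(long offset) {
        return new ArrayList<>(messages.subList((int) offset, messages.size()));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        for (String m : new String[] {"m1", "m2", "m3", "m4"}) {
            System.out.println(m + " -> offset " + log.append(m));
        }
        // A reader can rewind or skip simply by choosing a different starting offset.
        System.out.println("from offset 2: " + log.readFrom(2));
    }
}
```

Because messages are never modified after they are appended, an offset always refers to the same message, which is what makes rewinding safe.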
  • 26. Offset: The committed offset is used to avoid resending already processed data to a new consumer in the event of a partition rebalance. Auto commit: enable.auto.commit = true. Manual commit: enable.auto.commit = false. auto.commit.interval.ms = 4: what is the purpose of this property?
  • 27. What is a consumer group? A group of consumers acting as a single logical unit.
  • 29. One consumer in clustered setup example
  • 31.
  • 32. There will not be any duplicate reads. Each consumer within a consumer group is assigned partitions and reads messages only from the partitions assigned to it. Once a partition is assigned to a consumer, it is not assigned to another consumer within the same group unless a rebalance takes place.
  • 33. When the number of consumers in a group increases, the partitions are redistributed among them.
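How partitions spread across a growing group can be illustrated with a simplified round-robin model. This is a sketch only: `GroupAssignmentSketch` is a hypothetical name, and Kafka's real assignors differ in detail. The model just shows that each partition has exactly one owner within a group and that adding consumers changes who owns what.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of partition assignment inside one consumer group.
class GroupAssignmentSketch {
    // Hands out partitions 0..numPartitions-1 to the consumers in turn.
    // Assumes at least one consumer is present.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One consumer owns every partition.
        System.out.println(assign(Arrays.asList("c1"), 4));
        // A second consumer joins: the partitions are split between the two.
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
        // More consumers than partitions: the extra consumer stays idle.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3", "c4", "c5"), 4));
    }
}
```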
  • 35. Group Coordinator: One of the Kafka brokers is elected as group coordinator. When a new consumer joins the group, it sends a message to the coordinator. The first consumer to join the consumer group becomes the leader of the group. Roles and responsibilities:
    ◦ The coordinator manages the list of group members.
    ◦ The coordinator initiates a rebalance once the list is modified.
    ◦ The consumer leader executes the rebalance.
    ◦ The consumer leader assigns partitions to the new member and sends the assignment back to the coordinator.
    ◦ The coordinator communicates the new assignment to each member consumer.
  • 36. Production Scenario: Problem statement.
    ◦ Imagine that poll() pulls a large amount of data that takes a long time to process, so the next poll is delayed.
    ◦ If the next poll is delayed, the group coordinator assumes the consumer is dead and triggers a rebalance. How will you know a rebalance has been triggered, and how will you commit your offsets in such a case?
  • 37. Solution: In this scenario we have to commit whatever has been processed before the rebalance happens, so that when the partition is assigned to a new consumer, the records already processed by this consumer are not sent again to the newly assigned consumer. The ConsumerRebalanceListener interface has the method onPartitionsRevoked, which is invoked just before a rebalance takes place, and onPartitionsAssigned, which is invoked after the rebalance is complete. (Bala | 7/14/2017 © Robert Bosch Engineering and Business Solutions Private Limited 2017. All rights reserved, also regarding any disposal, exploitation, reproduction, editing, distribution, as well as in the event of applications for industrial property rights.)
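The commit-before-rebalance idea can be sketched without a broker. The class below is a hypothetical stand-in, not Kafka's API: its `onPartitionsRevoked` method only mirrors the role of the rebalance-listener callback of that name, and the `committed` map plays the part of the offsets stored by Kafka. It assumes, as Kafka does, that the committed value is the next offset to read.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a consumer that commits processed offsets when partitions are revoked.
class RevokeCommitSketch {
    // Stand-in for committed offsets (partition -> next offset to read).
    private final Map<Integer, Long> committed = new HashMap<>();
    // Offsets processed by this consumer but not yet committed (partition -> next offset to read).
    private final Map<Integer, Long> pending = new HashMap<>();

    // Called after each record has been fully processed.
    void recordProcessed(int partition, long offset) {
        pending.put(partition, offset + 1);
    }

    // Mirrors the role of the revoke callback: commit everything processed so far.
    void onPartitionsRevoked(Iterable<Integer> partitions) {
        for (int partition : partitions) {
            Long next = pending.remove(partition);
            if (next != null) {
                committed.put(partition, next);
            }
        }
    }

    // Where a newly assigned consumer would start reading the partition.
    long startOffsetForNewOwner(int partition) {
        return committed.getOrDefault(partition, 0L);
    }

    public static void main(String[] args) {
        RevokeCommitSketch consumerA = new RevokeCommitSketch();
        for (long offset = 0; offset < 6; offset++) {
            consumerA.recordProcessed(0, offset);
        }
        System.out.println("before revoke: " + consumerA.startOffsetForNewOwner(0));
        consumerA.onPartitionsRevoked(Arrays.asList(0));
        System.out.println("after revoke:  " + consumerA.startOffsetForNewOwner(0));
    }
}
```

Without the commit in the revoke step, the new owner would start from the old committed offset and receive the six processed records again.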
  • 40. Replication Factor: The total number of copies made of a partition is its replication factor. The purpose of replication in Kafka is stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or, more commonly, software upgrades.
  • 41. Leader and Follower: For each partition, one broker is chosen as the leader. Data written to the leader is copied to all its replicas.
  • 42. Client applications send messages only to the leader.
  • 43. Zookeeper: An open source Apache project. It provides centralised infrastructure and services that enable synchronisation across clusters. Common objects used across a large cluster environment, such as configuration and the hierarchical naming space, are maintained in zookeeper. Zookeeper services are used by large-scale applications to coordinate distributed processing across a large cluster.
  • 45. Installing Kafka: Kafka can be downloaded from https://kafka.apache.org/downloads.html. As per the current documentation, the version of Kafka is 0.11.0.0.
  • 46. Kafka Configurations (Property, Default, Description)
    ◦ broker.id (no default): Each broker is uniquely identified by a non-negative integer id. This id serves as the broker's "name" and allows the broker to be moved to a different host/port without confusing consumers. You can choose any number you like as long as it is unique.
    ◦ log.dirs (default /tmp/kafka-logs): A comma-separated list of one or more directories in which Kafka data is stored. Each new partition that is created will be placed in the directory which currently has the fewest partitions.
    ◦ port (default 6667): The port on which the server accepts client connections.
    ◦ zookeeper.connect (default null): Specifies the zookeeper connection string in the form hostname:port, where hostname and port are the host and port for a node in your zookeeper cluster. To allow connecting through other zookeeper nodes when that host is down you can also specify multiple hosts in the form hostname1:port1,hostname2:port2,hostname3:port3. Zookeeper also allows you to add a "chroot" path which will make all Kafka data for the cluster appear under a particular path. This is a way to setup multiple kafka clusters or
  • 47. Testing Kafka Cluster: A Kafka cluster can run in one of the following broker models: single broker cluster or multi broker cluster. A single broker cluster runs only one broker instance, whereas a multi broker cluster runs several. The following Kafka shell scripts can be used to test the cluster: zookeeper-server-start.sh, kafka-server-start.sh, kafka-topics.sh, kafka-console-producer.sh, kafka-console-consumer.sh.
  • 48. Demo
  • 49. Start the Kafka server. Create a topic. Start a console producer. Start a console consumer. Send and receive messages. Set up a clustered broker.
  • 51. 1. auto.create.topics.enable 2. default.replication.factor 3. num.partitions 4. log.retention.ms 5. log.retention.bytes
  • 54. Producer Java API
      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.Producer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class SimpleProducerToLearn {
          public static void main(String[] args) {
              String topicName = "SimpleTopic";
              String key = "Key1";
              String value = "Value-1";

              Properties properties = new Properties();
              properties.put("bootstrap.servers", "localhost:9092");
              // KafkaProducer requires a serializer for both the key and the value.
              properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              // Wait for the partition leader to acknowledge each write.
              properties.put("acks", "1");

              Producer<String, String> producer = new KafkaProducer<>(properties);
              ProducerRecord<String, String> record = new ProducerRecord<>(topicName, key, value);
              producer.send(record); // fire and forget: the result is not checked
              producer.close();      // waits for pending sends before closing
          }
      }
  • 56. Producer Record: Kafka comes with a default partitioner. Messages with the same message key go to the same partition. The key is optional; if a message has no key, Kafka distributes messages evenly across the partitions. If you pass a partition in the constructor, the default partitioner is not used. The timestamp field in the constructor denotes the time at which the message is sent to the broker. If you do not pass it, the broker sets the timestamp to the time at which it received the message.
  • 60. Sync
      package com.Learning.co;

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.Producer;
      import org.apache.kafka.clients.producer.ProducerRecord;
      import org.apache.kafka.clients.producer.RecordMetadata;

      public class SynchronousProducerToLearn {
          public static void main(String[] args) {
              String topicName = "SimpleTopic";
              String key = "Key1";
              String value = "Value-1";

              Properties properties = new Properties();
              properties.put("bootstrap.servers", "localhost:9092");
              // KafkaProducer requires a serializer for both the key and the value.
              properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              properties.put("acks", "1");

              Producer<String, String> producer = new KafkaProducer<String, String>(properties);
              ProducerRecord<String, String> record = new ProducerRecord<String, String>(topicName, key, value);
              try {
                  // get() blocks until the broker responds, which makes the send synchronous.
                  RecordMetadata metadata = producer.send(record).get();
                  System.out.println("Synchronous send completed with success: sent to partition "
                          + metadata.partition() + " offset " + metadata.offset());
              } catch (Exception e) {
                  System.out.println("Synchronous send completed with failure");
              }
              producer.close();
          }
      }
  • 61. Async
      producer.send(record, new MessageCallBack());

      // Callback is org.apache.kafka.clients.producer.Callback.
      class MessageCallBack implements Callback {
          @Override
          public void onCompletion(RecordMetadata metadata, Exception e) {
              // e is non-null when the send failed.
              if (e != null) {
                  System.out.println("Failed");
              } else {
                  System.out.println("Success");
              }
          }
      }
  • 62. Production Scenario: Problem statement. Assume the auto commit interval is set to 60 seconds (the Java consumer's default is 5 seconds). The poll method in consumer A is invoked and receives 6 records. All 6 records are processed in less than 10 seconds. Since the 60-second interval has not yet elapsed, these records are not committed in Kafka. Another set of records is then received via the poll method. Now assume that, for some reason, a rebalance is triggered. The first 6 records, which have already been processed, are still not committed. After the rebalance, the partition that was assigned to consumer A goes to a new consumer B. Since none of the records were committed by consumer A, the first 6 messages are sent again to consumer B. This is a clear case of data duplication. How should it be handled?
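The duplication in this scenario can be reproduced with a toy model of interval-based auto commit. `AutoCommitSketch` is a hypothetical class, not Kafka's consumer; it assumes that a commit of everything returned so far happens at the start of a poll once the interval has elapsed, which is a simplification of the real behaviour.

```java
// Hypothetical model of a consumer that auto-commits on a fixed interval.
class AutoCommitSketch {
    private final long intervalMs;
    private long lastCommitMs = 0;
    private long committed = 0; // next offset a new owner of the partition would start from
    private long position = 0;  // next offset this consumer will read

    AutoCommitSketch(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    // Simulates a poll at time nowMs that returns (and processes) count records.
    void poll(long nowMs, int count) {
        // Assumption: offsets are committed only when the interval has elapsed.
        if (nowMs - lastCommitMs >= intervalMs) {
            committed = position;
            lastCommitMs = nowMs;
        }
        position += count;
    }

    // Number of already processed records a new owner would receive again.
    long duplicatesAfterRebalance() {
        return position - committed;
    }

    public static void main(String[] args) {
        AutoCommitSketch slow = new AutoCommitSketch(60_000);
        slow.poll(0, 6);      // 6 records, processed within 10 seconds
        slow.poll(10_000, 4); // next batch arrives, still nothing committed
        System.out.println("60 s interval, duplicates: " + slow.duplicatesAfterRebalance());

        AutoCommitSketch fast = new AutoCommitSketch(5_000);
        fast.poll(0, 6);
        fast.poll(10_000, 4); // interval elapsed: the first 6 records are committed here
        System.out.println("5 s interval, duplicates:  " + fast.duplicatesAfterRebalance());
    }
}
```

A shorter interval narrows the window but does not close it; committing manually after processing, or on partition revocation, is the usual way to narrow it further.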
  • 64. Critical Configs: batch.size (size-based batching), linger.ms (time-based batching), compression.type, max.in.flight.requests.per.connection (affects ordering), acks (affects durability), retries.
  • 65. Acks
    ◦ acks = 0: The producer does not wait for a response from the broker. High throughput. No retries. Loss of messages is possible.
    ◦ acks = 1: The producer waits for a response from the broker. The response is sent by the leader after it receives the message from the producer. Message loss is still possible.
    ◦ acks = -1: The response is sent after the leader receives acknowledgement from all its replicas. Slow. Highly reliable.
  • 66. Comparisons: Acks mode (Acks: Throughput, Latency, Durability)
    ◦ 0: High throughput, low latency, no durability guarantee.
    ◦ 1: Medium throughput, medium latency, durable on the leader.
    ◦ -1: Low throughput, high latency, durable on the ISR (in-sync replicas).
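The durability column can be read as "how many copies of the record exist when the producer is told the send succeeded". The helper below is a hypothetical illustration of that reading, not a Kafka API, and it ignores broker-side settings that can further constrain acks = -1.

```java
// Hypothetical helper that restates the acks comparison as a function.
class AcksSketch {
    // Number of replicas that hold the record before the producer gets its acknowledgement.
    static int copiesBeforeAck(int acks, int inSyncReplicas) {
        switch (acks) {
            case 0:
                return 0;              // the producer does not wait for any response
            case 1:
                return 1;              // only the leader has written the record
            case -1:
                return inSyncReplicas; // every in-sync replica has the record
            default:
                throw new IllegalArgumentException("acks must be 0, 1 or -1");
        }
    }

    public static void main(String[] args) {
        int isr = 3; // assumed size of the in-sync replica set
        for (int acks : new int[] {0, 1, -1}) {
            System.out.println("acks=" + acks + " -> copies before ack: " + copiesBeforeAck(acks, isr));
        }
    }
}
```

With acks = 1 a record can still be lost if the leader fails before any follower has copied it, which is why the table lists its durability as "leader" only.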
  • 67. Partitioner: Default partitioner
    ◦ If a partition is specified in the record, use it.
    ◦ If no partition is specified but a key is present, choose a partition based on a hash of the key.
    ◦ If no partition or key is present, choose a partition in a round-robin fashion.
  • 68. Partitioner: Code snippet from the default partitioner: return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
  • 69. Custom Partitioner
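The hash-then-modulo step in that snippet can be tried out without the Kafka library. In the sketch below `Arrays.hashCode` is a stand-in for `Utils.murmur2`, so the partition numbers will not match what Kafka computes; only the property that equal keys always map to the same partition carries over. `KeyPartitionSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical model of key-based partition selection.
class KeyPartitionSketch {
    // Clears the sign bit so the result is never negative.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes); // stand-in hash, NOT murmur2
        return toPositive(hash) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        for (String key : new String[] {"Key1", "Key2", "Key1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

Note that the mapping depends on the number of partitions, so adding partitions to a topic changes where existing keys land.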
  • 70. Partition - Producer
      import java.util.*;
      import org.apache.kafka.clients.producer.*;

      public class SensorProducer {
          public static void main(String[] args) throws Exception {
              String topicName = "SensorTopic";

              Properties props = new Properties();
              props.put("bootstrap.servers", "localhost:9092,localhost:9093");
              props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
              // Route records through the custom partitioner, which reads speed.sensor.name.
              props.put("partitioner.class", "SensorPartitioner");
              props.put("speed.sensor.name", "TSS");

              Producer<String, String> producer = new KafkaProducer<>(props);
              // Ten records from ten different sensors (keys SSP0 to SSP9).
              for (int i = 0; i < 10; i++) {
                  producer.send(new ProducerRecord<>(topicName, "SSP" + i, "500" + i));
              }
              // Ten records from the speed sensor (key TSS).
              for (int i = 0; i < 10; i++) {
                  producer.send(new ProducerRecord<>(topicName, "TSS", "500" + i));
              }
              producer.close();
              System.out.println("SimpleProducer Completed.");
          }
      }
  • 71. Custom Partitioner
      import java.util.*;
      import org.apache.kafka.clients.producer.*;
      import org.apache.kafka.common.*;
      import org.apache.kafka.common.utils.*;
      import org.apache.kafka.common.record.*;

      public class SensorPartitioner implements Partitioner {
          private String speedSensorName;

          public void configure(Map<String, ?> configs) {
              speedSensorName = configs.get("speed.sensor.name").toString();
          }

          public int partition(String topic, Object key, byte[] keyBytes,
                               Object value, byte[] valueBytes, Cluster cluster) {
              List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
              int numPartitions = partitions.size();
              // Reserve 30% of the partitions for the speed sensor.
              // Note: sp is 0 when the topic has fewer than 4 partitions, and the modulo below then fails.
              int sp = (int) Math.abs(numPartitions * 0.3);
              int p = 0;

              if ((keyBytes == null) || (!(key instanceof String)))
                  throw new InvalidRecordException("All messages must have sensor name as key");

              if (((String) key).equals(speedSensorName))
                  // Speed sensor: spread across the reserved partitions by hashing the value.
                  p = Utils.toPositive(Utils.murmur2(valueBytes)) % sp;
              else
                  // Other sensors: hash the key into the remaining partitions.
                  p = Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - sp) + sp;

              System.out.println("Key = " + (String) key + " Partition = " + p);
              return p;
          }

          public void close() {}
      }
  • 72. max.in.flight.requests.per.connection: Definition: how many requests the producer can send to a broker on one connection without receiving a response. Default value is 5. A high value gives higher throughput but also higher memory consumption. A value above 1 may cause out-of-order delivery when a retry occurs, so for asynchronous sends set this property to 1 to maintain the ordering of messages.
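As a minimal sketch of the ordering-safe setup described on this slide, the class below only builds the `Properties` object that would be passed to a `KafkaProducer`. `OrderedProducerSettings` is a hypothetical helper name, and the `retries` value of 3 is an illustrative assumption rather than a recommendation from the deck.

```java
import java.util.Properties;

public class OrderedProducerSettings {
    // Builds producer properties that keep per-partition ordering even when sends are retried.
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Illustrative assumption: allow a few retries so that reordering could otherwise occur.
        props.put("retries", "3");
        props.put("retry.backoff.ms", "100");
        // Only one unacknowledged request per connection, so a retried batch cannot be overtaken.
        props.put("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        build("localhost:9092").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The trade-off is throughput: with a single in-flight request the producer waits for each response before sending the next batch on that connection.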
  • 73. Scenario: A side effect of asynchronous sends is that the ordering of data may be lost when messages are processed in batches. [Diagram: the batch Record6 to Record10 commits successfully, while the batch Record1 to Record5 gets a callback with an exception and succeeds only on retry, so the broker's partition buffer holds Record6 to Record10 followed by Record1 to Record5.]
  • 74. Async Producer: The async producer sends messages in the background, with no blocking in the client. Provides more powerful batching of messages. Wraps a sync producer, or rather a pool of them. Communication from async to sync happens via a queue, which explains why you may see kafka.producer.async.QueueFullException. The async producer may drop messages if its queue is full. ◦ Solution 1: don't push messages to the producer faster than it is able to send them to the brokers. ◦ Solution 2: queue full == need more brokers. ◦ Solution 3: set queue.enqueue.timeout.ms to -1. Now the producer will block indefinitely and will never drop messages. ◦ Solution 4: increase queue.buffering.max.messages. For a more detailed study: https://engineering.gnip.com/kafka-async-producer/
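The drop-versus-block behaviour of a full queue can be illustrated with a plain bounded queue. This is a sketch of the idea only, not the producer's actual implementation: `BoundedSendQueue` is a hypothetical class, its two constructor arguments merely mirror the roles of queue.buffering.max.messages and queue.enqueue.timeout.ms, and it returns `false` where the real producer would raise QueueFullException.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedSendQueue {
    private final BlockingQueue<String> queue;
    private final long enqueueTimeoutMs;

    // maxMessages plays the role of queue.buffering.max.messages,
    // enqueueTimeoutMs the role of queue.enqueue.timeout.ms.
    public BoundedSendQueue(int maxMessages, long enqueueTimeoutMs) {
        this.queue = new ArrayBlockingQueue<>(maxMessages);
        this.enqueueTimeoutMs = enqueueTimeoutMs;
    }

    // Returns true if the message was queued, false if it was dropped.
    public boolean enqueue(String message) {
        try {
            if (enqueueTimeoutMs < 0) {
                queue.put(message);            // negative timeout: block until there is room, never drop
                return true;
            }
            if (enqueueTimeoutMs == 0) {
                return queue.offer(message);   // zero timeout: drop immediately when full
            }
            return queue.offer(message, enqueueTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedSendQueue q = new BoundedSendQueue(2, 0);
        for (String m : new String[] {"m1", "m2", "m3"}) {
            System.out.println(m + " queued: " + q.enqueue(m));
        }
    }
}
```

In this sketch nothing drains the queue, so with a capacity of 2 and a timeout of 0 the third message is dropped; a timeout of -1 would instead block the caller until a sender thread made room.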
  • 75. Other producer config properties. retries: number of times the producer retries sending a message. Default value 0. retry.backoff.ms: time between retries. Default value 100 ms.
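  • 77. Lag: Lag = how far your consumer is behind the producer. [Diagram: a log running from older messages to newer messages; the producer writes at the newest end, the consumer reads further back, and the gap between them is the lag.]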
  • 78. Lag: Lag is a consumer problem. ◦ Too slow, too much GC, losing connection to ZooKeeper or Kafka. ◦ Bug or design flaw in the consumer. ◦ Operational mistakes, e.g. you brought up 6 Kafka servers in parallel, each one in turn triggering rebalancing, then hit Kafka's rebalance limit; cf. rebalance.max.retries.
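Lag per partition is commonly computed as the log end offset (the producer's position) minus the consumer's committed offset. The sketch below assumes those two numbers have already been obtained somehow; `ConsumerLagSketch` and its method names are hypothetical, and the offsets in `main` are made-up sample values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConsumerLagSketch {
    // Lag for one partition: how many messages the consumer still has to read.
    public static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Total lag over all partitions; both maps are keyed by partition number.
    public static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            // A partition with no committed offset is treated as unread from offset 0.
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            total += lag(e.getValue(), committed);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = new LinkedHashMap<>();
        Map<Integer, Long> committed = new LinkedHashMap<>();
        end.put(0, 100L);
        committed.put(0, 90L);
        end.put(1, 50L);
        committed.put(1, 50L);
        System.out.println("total lag = " + totalLag(end, committed));
    }
}
```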
  • 79. Under-replicated partitions ◦ For example, because a broker is down. Offline partitions ◦ Even worse than under-replicated. ◦ Any number of offline partitions other than 0 is a serious problem.
  • 80. Assume the replication factor is set to 3 for this topic.
      Partition    Leader broker    ISR
      partition1   0                1,2
      One of the replica brokers, say 2, goes down: under-replicated.
      partition1   0                1
      Then another replica, say 1, goes down: still under-replicated.
      partition1   0                0
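The two conditions from the last two slides can be written as simple checks. This is a sketch under assumptions: `ReplicaHealthSketch` is a hypothetical class, the ISR list here includes the leader (as Kafka's own topic description does, unlike the table above, which lists followers only), and a leader id of -1 is assumed to mean that the partition has no leader.

```java
import java.util.Arrays;
import java.util.List;

public class ReplicaHealthSketch {
    // Under-replicated: fewer replicas are in sync than are assigned to the partition.
    public static boolean isUnderReplicated(List<Integer> replicas, List<Integer> isr) {
        return isr.size() < replicas.size();
    }

    // Offline: no broker currently leads the partition (assumed to be signalled by leader id -1).
    public static boolean isOffline(int leader) {
        return leader < 0;
    }

    public static void main(String[] args) {
        List<Integer> replicas = Arrays.asList(0, 1, 2);   // replication factor 3
        System.out.println("all in sync:   " + isUnderReplicated(replicas, Arrays.asList(0, 1, 2)));
        System.out.println("broker 2 down: " + isUnderReplicated(replicas, Arrays.asList(0, 1)));
        System.out.println("only leader:   " + isUnderReplicated(replicas, Arrays.asList(0)));
        System.out.println("no leader:     " + isOffline(-1));
    }
}
```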
  • 81. replica.lag.max.messages: [Diagram: the leader and in-sync replica 1 both hold messages 0 to 6; in-sync replica 2 holds only messages 0 and 1.] For some reason messages are not being copied to in-sync replica 2, so in this case replica 2 is lagging by 5 messages. That is more than the value of replica.lag.max.messages, taken as 4 in this example (the documented default in Kafka 0.8 was 4000). This broker (replica 2) will go out of sync.
  • 82. replica.lag.max.messages: what happens when messages arrive in batches?
      1. Assume the value of the property is set to 3.
      2. Assume batch one has 5 messages (Record1 to Record5) and this first batch is replicated to all brokers.
      3. The second batch has another 5 messages. As soon as the leader appends it, replica 1 is lagging by more than 3 messages, so it goes out of sync.
      4. Hence replica 1 goes out of sync even though it is not dead.
      5. The solution is to use the time-based setting replica.lag.time.max.ms instead.
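The difference between the two criteria can be sketched as two small predicates. `InSyncCriteriaSketch` and its methods are hypothetical names, and the numbers in `main` are assumptions chosen to match the batch scenario above (threshold of 3 messages, batches of 5); this is not the broker's actual code.

```java
public class InSyncCriteriaSketch {
    // Count-based rule (replica.lag.max.messages): in sync while the follower is at most maxLagMessages behind.
    public static boolean inSyncByMessages(long leaderEndOffset, long followerEndOffset, long maxLagMessages) {
        return leaderEndOffset - followerEndOffset <= maxLagMessages;
    }

    // Time-based rule (replica.lag.time.max.ms): in sync while the follower caught up recently enough.
    public static boolean inSyncByTime(long lastCaughtUpMs, long nowMs, long maxLagMs) {
        return nowMs - lastCaughtUpMs <= maxLagMs;
    }

    public static void main(String[] args) {
        // The leader has just appended a second batch of 5; the follower still holds only the first 5.
        long leaderEnd = 10;
        long followerEnd = 5;
        System.out.println("by messages (max 3): " + inSyncByMessages(leaderEnd, followerEnd, 3));

        // The follower was fully caught up 50 ms ago; the allowed time lag is assumed to be 10 s.
        long now = 1_000_000L;
        System.out.println("by time (max 10 s):  " + inSyncByTime(now - 50, now, 10_000));
    }
}
```

A healthy follower that is merely one batch behind fails the count-based rule but passes the time-based one, which is the reason the slide recommends the time-based setting.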
  • 83. Production Scenario 1: What happens when a broker goes down and comes up again?
      Partition    Leader broker    Leader after broker 1 goes down    Leader after broker 1 comes up
      partition1   0                0                                  0
      partition2   1                2                                  2
      partition3   2                2                                  2
      partition4   3                3                                  3
      partition5   1                0                                  0
      partition6   0                0                                  0
      The sad reality is that broker 1 does not regain leadership on its own. It simply rejoins as an in-sync replica. kafka-preferred-replica-election.sh comes to your rescue: it moves leadership back to the preferred replicas, and hence load is evenly balanced again.
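What a preferred replica election does for one partition can be sketched as follows. The preferred replica is the first broker in the partition's replica list. `PreferredLeaderSketch` is a hypothetical class, and the replica list used in `main` is an assumption, since the slide shows only leaders and not the full replica assignment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PreferredLeaderSketch {
    // Returns the leader after a preferred replica election for one partition.
    public static int electPreferred(int currentLeader, List<Integer> replicas, Set<Integer> isr) {
        int preferred = replicas.get(0);   // the first replica in the assignment is the preferred one
        // Leadership moves only if the preferred replica is back in the in-sync set.
        return isr.contains(preferred) ? preferred : currentLeader;
    }

    public static void main(String[] args) {
        // Assumed assignment for partition2: replicas [1, 2, 0]; broker 2 took over when broker 1 went down.
        List<Integer> replicas = Arrays.asList(1, 2, 0);

        Set<Integer> isrWhileDown = new HashSet<>(Arrays.asList(2, 0));
        System.out.println("broker 1 still down: leader = " + electPreferred(2, replicas, isrWhileDown));

        Set<Integer> isrAfterRecovery = new HashSet<>(Arrays.asList(2, 0, 1));
        System.out.println("broker 1 recovered:  leader = " + electPreferred(2, replicas, isrAfterRecovery));
    }
}
```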
  • 84. Production Scenario 2: How to increase or decrease the number of nodes in Kafka?
      Increase, or add a new broker
      ◦ Just start a new instance of Kafka. This new instance will not become a leader on its own, so after starting the broker run kafka-preferred-replica-election.sh.
      Decrease, or cut down a broker
      ◦ Run kafka-reassign-partitions.sh.
      ◦ This will show the current replica assignment and the proposed replica assignment.
      ◦ kafka-reassign-partitions.sh << list of brokers you want to keep >> --generate
      ◦ Suppose you have 5 brokers 1,2,3,4,5 and you want to bring down 5: kafka-reassign-partitions.sh <<1,2,3,4>> --generate
      ◦ This will generate the proposed assignment as a JSON file.
      ◦ Now run the script again:
      ◦ kafka-reassign-partitions.sh --execute --reassignment-json-file << json file name >>
      ◦ After this, run kafka-preferred-replica-election.sh.
      ◦ Cross-check using the describe command.
  • 85. Production Scenario 3: What to do if broker 2 goes down and is not recoverable? ◦ Simply start a new broker with the same broker.id as the one that is not recoverable. ◦ Then run kafka-preferred-replica-election.sh.
  • 87. Twitter Code URL (Viyaan | 7/14/2017). GitHub links: https://github.com/Viyaan/TwitterKafkaProducer https://github.com/Viyaan/StormKafkaStreamingWordCount

Editor's Notes

  1. If there are more consumers in a group than partitions, the extra consumers will be idle. In no case is a partition assigned to more than one consumer in the same group.
  2. When a consumer joins or leaves the group, a rebalance is triggered by the group coordinator.