Streaming kafka search utility for Mozilla's Bagheera
KAFKA Quickstart
Introduction:
Kafka is a distributed publish-subscribe messaging system that is designed to be
fast, scalable, and durable.
Kafka maintains feeds of messages in categories called topics.
Kafka messages are generated by processes called producers.
The processes that subscribe to topics and process the feed of published
messages are called consumers.
Kafka is run as a cluster comprising one or more servers, each of which
is called a broker.
Quick Start:
1. Create a topic
/usr/bin/kafka-topics --create --zookeeper zookeeperIP:2181 --replication-factor 1 --partitions 1 --topic testTopic
2. Publish a message via the producer to a topic
/usr/bin/kafka-console-producer --broker-list producerIP:9092 --topic
testTopic
This is the first kafka message
3. Start a consumer
/usr/bin/kafka-console-consumer --zookeeper zookeeperIP:2181 --topic
testTopic --from-beginning
If you start several consumers in different PuTTY sessions, you can see
messages being delivered to all of the consumers as soon as the producer
publishes them.
A bit more detail:
A topic is a category or feed name to which messages are published. For each
topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually
appended to, i.e. a commit log. Each message in a partition is assigned a
sequential id number called the offset that uniquely identifies it
within the partition.
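The partition/offset relationship can be sketched with a toy model in plain Java (not the Kafka API; PartitionLog is a made-up class for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition: an append-only log where a message's offset
// is simply its position in the sequence. Nothing is ever changed in place.
class PartitionLog {
    private final List<String> messages = new ArrayList<String>();

    // Appending assigns the next sequential offset and returns it.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // A consumer reads by offset; the same offset always yields the same message.
    String read(long offset) {
        return messages.get((int) offset);
    }
}
```

This is why consuming a message does not delete it: reading is just indexing into the sequence, so many consumers can read the same partition independently.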
The Kafka cluster retains all published messages, whether or not they have been
consumed, for a configurable period. Log retention can be configured in either
of two ways: time-based retention or size-based retention. Kafka's performance is
effectively constant with respect to data size, so retaining lots of data is not a
problem.
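On the broker side, these two retention modes correspond to settings in server.properties such as the following (the values shown are illustrative, not recommendations):

```properties
# Time-based retention: delete log segments older than 7 days (168 hours)
log.retention.hours=168

# Size-based retention: start deleting the oldest segments once a
# partition's log exceeds roughly 1 GB (-1 disables the size limit)
log.retention.bytes=1073741824
```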
QnA:
1. What type of messages can be sent from a producer?
The producer class takes two generic parameters, i.e.
Producer<K, V>
K: type of the optional key associated with the message
V: type of the message
So any kind of message can be sent, for example String, JSON, or Avro.
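The optional key K also determines which partition a message goes to: Kafka's default partitioner hashes the key modulo the partition count, so messages with the same key always land in the same partition and stay ordered relative to each other. A rough sketch of that behavior (KeyPartitioner is an illustrative class, not the Kafka API):

```java
// Illustration of key-based partitioning, not actual Kafka code.
class KeyPartitioner {
    // Map a key to a partition the way the default partitioner roughly does:
    // a non-negative hash of the key, modulo the number of partitions.
    static int partitionFor(String key, int numPartitions) {
        if (key == null) {
            return 0; // keyless messages: Kafka spreads these across partitions itself
        }
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

For example, all messages keyed by the same truck id would be routed to one partition, preserving per-truck ordering.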
2. How can a consumer start reading from a particular offset?
Kafka does not track the offset up to which a particular consumer has
already read. The consumer has to track its offset on its own side; the
offset up to which it has consumed messages has to be
stored elsewhere, e.g. in HDFS, a database, or HBase.
Kafka itself only provides two starting points: from the beginning or from the latest message.
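A minimal sketch of such external offset bookkeeping, using a local file as the store (OffsetStore is a hypothetical helper; HDFS, HBase, or a database would follow the same pattern):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Kafka: persists the last consumed
// offset so a restarted consumer can resume where it left off.
class OffsetStore {
    private final Path path;

    OffsetStore(Path path) {
        this.path = path;
    }

    // Checkpoint after the consumer has processed messages up to this offset.
    void save(long offset) {
        try {
            Files.write(path, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // On restart, resume from the checkpoint; fall back to the beginning.
    long load() {
        try {
            if (!Files.exists(path)) {
                return 0L;
            }
            return Long.parseLong(
                new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether you checkpoint before or after processing each batch decides whether failures cause duplicates or gaps, so the save point deserves some thought.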
3. So when to use Kafka?
Cloudera recommends using Kafka if the data will be consumed by multiple
applications.
API Examples:
A sample Producer
import java.sql.Timestamp;
import java.util.Date;
import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

Properties props = new Properties();
props.put("metadata.broker.list", args[0]);
props.put("zk.connect", args[1]);
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("request.required.acks", "1");

String TOPIC = "event";
ProducerConfig config = new ProducerConfig(props);
Producer<String, String> producer = new Producer<String, String>(config);

String[] events = {"Normal", "Normal", "Normal", …};
String[] truckIds = {"1", "2", "3", "4"};
String[] driverIds = {"11", "12", "13", "14"};

Random random = new Random();
int evtCnt = events.length;
String message = new Timestamp(new Date().getTime()) + "|"
        + truckIds[2] + "|" + driverIds[2] + "|" + events[random.nextInt(evtCnt)];
try {
    KeyedMessage<String, String> data = new KeyedMessage<String, String>(TOPIC, message);
    producer.send(data);
    Thread.sleep(1000);
} catch (Exception e) {
    e.printStackTrace();
}
A sample Consumer
Kafka provides a simple consumer (the SimpleConsumer API) which can be modified as per requirement.
Steps for using a SimpleConsumer:
Find an active Broker and find out which Broker is the leader for your topic
and partition
Determine who the replica Brokers are for your topic and partition
Build the request defining what data you are interested in
Fetch the data
Identify and recover from leader changes
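The last step, recovering from a leader change, boils down to asking the remaining replica brokers which of them is the new leader and retrying against it. A simplified, self-contained sketch of that fallback logic (not the actual SimpleConsumer metadata calls, which issue TopicMetadataRequests against each broker):

```java
import java.util.List;

// Simplified stand-in for leader recovery, for illustration only.
class LeaderFinder {
    // When the current leader stops responding, fall back to the first
    // replica broker that is not the dead leader and treat it as the new
    // contact point for metadata/fetch requests.
    static String findNewLeader(String deadLeader, List<String> replicaBrokers) {
        for (String broker : replicaBrokers) {
            if (!broker.equals(deadLeader)) {
                return broker;
            }
        }
        throw new IllegalStateException("No replica available to take over");
    }
}
```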
Data fetch pseudo code
FetchRequest req = new FetchRequestBuilder().clientId(clientName)
        .addFetch(a_topic, a_partition, readOffset, 100000).build();
FetchResponse fetchResponse = consumer.fetch(req);
if (fetchResponse.hasError()) {
    // Error handling code here
}
for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(a_topic, a_partition)) {
    long currentOffset = messageAndOffset.offset();
    if (currentOffset < readOffset) {
        // Proper logger here
        continue;
    }
    readOffset = messageAndOffset.nextOffset();
    ByteBuffer payload = messageAndOffset.message().payload();
    byte[] bytes = new byte[payload.limit()];
    payload.get(bytes);
    System.out.println(String.valueOf(messageAndOffset.offset()) + ": " + new String(bytes, "UTF-8"));
    numRead++;
    a_maxReads--;
}
Conclusion:
As you can see, Kafka has a unique design that makes it very useful for solving a
wide range of architectural challenges. It is important to make sure you use the
right approach for your use case and use it correctly to ensure high throughput,
low latency, high availability, and no loss of data.