1. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-1
SLO 1 – Getting Started with Kafka
2. Getting Started with
Kafka
Apache Kafka was open sourced as an Apache project in 2011 and became a first-class (top-level) Apache project in 2012.
Kafka is written in Scala and Java.
Apache Kafka is a publish-subscribe based, fault-tolerant messaging system.
It is fast, scalable, and distributed by design.
3. Why Kafka? Publish
Subscribe messaging model
In Big Data, an enormous volume of data is used.
Regarding data, we have two main challenges: the first is how to collect the large volume of data, and the second is how to analyze the collected data.
To overcome those challenges, you need a messaging system.
Kafka is designed for distributed, high-throughput systems.
Kafka tends to work very well as a replacement for a more traditional message broker.
In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message-processing applications.
4. Why Kafka? Publish
Subscribe messaging model
Why Kafka?
Multiple Producers
Multiple Consumers
Disk Retention
Scalable
High Performance
5. Why Kafka? Publish
Subscribe messaging model
What is a Messaging System?
A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it.
Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system.
Two types of messaging patterns are available: one is point-to-point, and the other is the publish-subscribe (pub-sub) messaging system. Most messaging systems follow the pub-sub pattern.
6. Why Kafka? Publish
Subscribe messaging model
Publish-Subscribe Messaging System
In the publish-subscribe system, messages are persisted in a topic.
Unlike the point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics.
In the publish-subscribe system, message producers are called publishers and message consumers are called subscribers.
A real-life example is Dish TV, which publishes different channels (sports, movies, music, etc.); anyone can subscribe to their own set of channels and receive them whenever the subscribed channels are available.
8. Why Kafka? Publish
Subscribe messaging model
Following are a few benefits of Kafka −
Reliability − Kafka is distributed, partitioned, replicated, and fault tolerant.
Scalability − The Kafka messaging system scales easily without downtime.
Durability − Kafka uses a distributed commit log, which means messages are persisted on disk as quickly as possible; hence, it is durable.
Performance − Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored.
Kafka is very fast and is designed for zero downtime and zero data loss.
9. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-1
SLO 2 – Why Kafka? Publish Subscribe
Messaging Model
10. Why Kafka? Publish
Subscribe messaging model
In Big Data, an enormous volume of data is used.
Regarding data, we have two main challenges: the first is how to collect the large volume of data, and the second is how to analyze the collected data.
To overcome those challenges, you need a messaging system.
Kafka is designed for distributed, high-throughput systems.
Kafka tends to work very well as a replacement for a more traditional message broker.
In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message-processing applications.
11. Why Kafka? Publish
Subscribe messaging model
Why Kafka?
Multiple Producers
Multiple Consumers
Disk Retention
Scalable
High Performance
12. Why Kafka? Publish
Subscribe messaging model
What is a Messaging System?
A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it.
Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system.
Two types of messaging patterns are available: one is point-to-point, and the other is the publish-subscribe (pub-sub) messaging system. Most messaging systems follow the pub-sub pattern.
13. Why Kafka? Publish
Subscribe messaging model
Publish-Subscribe Messaging System
In the publish-subscribe system, messages are persisted in a topic.
Unlike the point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics.
In the publish-subscribe system, message producers are called publishers and message consumers are called subscribers.
A real-life example is Dish TV, which publishes different channels (sports, movies, music, etc.); anyone can subscribe to their own set of channels and receive them whenever the subscribed channels are available.
15. Why Kafka? Publish
Subscribe messaging model
Following are a few benefits of Kafka −
Reliability − Kafka is distributed, partitioned, replicated, and fault tolerant.
Scalability − The Kafka messaging system scales easily without downtime.
Durability − Kafka uses a distributed commit log, which means messages are persisted on disk as quickly as possible; hence, it is durable.
Performance − Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored.
Kafka is very fast and is designed for zero downtime and zero data loss.
16. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-2
SLO 1 – Kafka Architecture
17. Kafka Architecture
Topics, partitions, producers, consumers, etc., together form the Kafka architecture.
Because different applications design their Kafka architecture accordingly, the following essential parts are required to design an Apache Kafka architecture.
18. Kafka Architecture
o Data Ecosystem: The several applications that use Apache Kafka form an ecosystem. This ecosystem is built for data processing. It takes inputs in the form of applications that create data, and its outputs are defined in the form of metrics, reports, etc. The diagram represents a circulatory data ecosystem for Kafka.
o Kafka Cluster: A Kafka cluster is a system that comprises different brokers, topics, and their respective partitions. Data is written to topics within the cluster and read from the cluster by consumers.
o Producers: A producer sends or writes data/messages to a topic within the cluster. In order to store a huge amount of data, different producers within an application send data to the Kafka cluster.
19. Kafka Architecture
o Consumers: A consumer is the one that reads or consumes messages from the Kafka cluster. There can be several consumers consuming different types of data from the cluster. The beauty of Kafka is that each consumer knows where it needs to consume the data from.
o Brokers: A Kafka server is known as a broker. A broker is a bridge between producers and consumers: if a producer wishes to write data to the cluster, it is sent to a Kafka server. All brokers lie within a Kafka cluster, and there can be multiple brokers in a cluster.
o Topics: A topic is a common name, or heading, given to represent a similar type of data. In Apache Kafka, there can be multiple topics in a cluster, and each topic specifies a different type of message.
20. Kafka Architecture
o Partitions: The data or messages are divided into small subparts known as partitions. Each partition carries data within it, with each message having an offset value. Data is always written in a sequential manner. A topic can have a large number of partitions, and offset values grow without bound. However, it is not guaranteed which partition a message will be written to (unless the message carries a key).
21. Kafka Architecture
o ZooKeeper: ZooKeeper is used to store information about the Kafka cluster and details of the consumer clients. It manages brokers by maintaining a list of them, and it is responsible for choosing a leader for each partition. If any change occurs, such as a broker dying or a new topic being created, ZooKeeper sends a notification to Apache Kafka. A ZooKeeper ensemble is designed to operate with an odd number of servers: ZooKeeper has a leader server that handles all writes, and the rest of the servers are followers that handle reads. A user does not interact with ZooKeeper directly, but via brokers. No Kafka server can run without a ZooKeeper server; it is mandatory to run the ZooKeeper server.
22. Kafka Architecture
In the figure, there are three ZooKeeper servers, where server 2 is the leader and the other two are its followers. Five brokers are connected to these servers. The Kafka cluster automatically comes to know when brokers are down, more topics are added, etc. Hence, combining all these necessities, a Kafka cluster architecture is designed.
24. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-2
SLO 2 – Messages and Batches, Schemas
25. Messages and Batches,
Schemas, Topics and Partitions
The unit of data within Kafka is called a message.
If you are approaching Kafka from a database background, you can think of this as
similar to a row or a record.
A message is simply an array of bytes as far as Kafka is concerned, so the data contained
within it does not have a specific format or meaning to Kafka.
A message can have an optional bit of metadata, which is referred to as a key.
The key is also a byte array and, as with the message, has no specific meaning to Kafka.
Keys are used when messages are to be written to partitions in a more controlled manner.
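A quick sketch of the difference keys make; the topic, key, and value strings are illustrative:

// unkeyed: the default partitioner spreads these records across partitions
ProducerRecord<String, String> unkeyed =
    new ProducerRecord<>("CustomerCountry", "France");
// keyed: every record with key "Precision Products" is hashed to the same partition,
// so those records are consumed in the order they were produced
ProducerRecord<String, String> keyed =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");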
26. Messages and Batches,
Schemas, Topics and Partitions
For efficiency, messages are written into Kafka in batches.
A batch is just a collection of messages, all of which are being produced to the same
topic and partition.
An individual roundtrip across the network for each message would result in excessive
overhead, and collecting messages together into a batch reduces this.
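Batching is controlled on the producer side. A minimal sketch of the two standard producer configs involved (the values are illustrative, not tuning advice):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("batch.size", "16384");  // upper bound, in bytes, on one batch per partition
props.put("linger.ms", "10");      // wait up to 10 ms for more records before shipping a batch
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

Larger batches trade a little latency for fewer network round trips and better compression.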
Schemas: While messages are opaque byte arrays to Kafka itself, it is recommended that additional structure, or schema, be imposed on the message content so that it can be easily understood. There are many options available for message schema, depending on your application’s individual needs. Simplistic systems, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML), are easy to use and human-readable. However, they lack features such as robust type handling and compatibility between schema versions. Many Kafka developers favor the use of Apache Avro, which is a serialization framework originally developed for Hadoop.
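As a sketch of what a schema gives you, here is an Avro schema (itself defined in JSON) being parsed with the Avro Java library; the Customer record and its fields are illustrative:

import org.apache.avro.Schema;

String schemaJson =
    "{ \"type\": \"record\", \"name\": \"Customer\", \"fields\": [" +
    "  {\"name\": \"id\", \"type\": \"int\"}," +
    "  {\"name\": \"country\", \"type\": \"string\"} ] }";
Schema schema = new Schema.Parser().parse(schemaJson);  // fails fast if the schema is malformed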
27. Messages and Batches,
Schemas, Topics and Partitions
Topics and Partitions
Messages in Kafka are categorized into topics. The closest analogies for a topic are a
database table or a folder in a filesystem. Topics are additionally broken down into a
number of partitions. Going back to the “commit log” description, a partition is a single
log. Messages are written to it in an append-only fashion, and are read in order from
beginning to end. Note that as a topic typically has multiple partitions, there is no
guarantee of message time-ordering across the entire topic, just within a single partition.
Figure 1-5 shows a topic with four partitions, with writes being appended to the end of
each one. Partitions are also the way that Kafka provides redundancy and scalability.
Each partition can be hosted on a different server, which means that a single topic can be
scaled horizontally across multiple servers to provide performance far beyond the ability
of a single server.
29. Messages and Batches,
Schemas, Topics and Partitions
The term stream is often used when discussing data within systems like Kafka. Most
often, a stream is considered to be a single topic of data, regardless of the number of
partitions. This represents a single stream of data moving from the producers to the
consumers. This way of referring to messages is most common when discussing stream
processing, which is when frameworks—some of which are Kafka Streams, Apache
Samza, and Storm—operate on the messages in real time. This method of operation can
be compared to the way offline frameworks, namely Hadoop, are designed to work on
bulk data at a later time.
30. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-3
SLO 1 – Topics and Partitions
31. Messages and Batches,
Schemas, Topics and Partitions
The unit of data within Kafka is called a message.
If you are approaching Kafka from a database background, you can think of this as
similar to a row or a record.
A message is simply an array of bytes as far as Kafka is concerned, so the data contained
within it does not have a specific format or meaning to Kafka.
A message can have an optional bit of metadata, which is referred to as a key.
The key is also a byte array and, as with the message, has no specific meaning to Kafka.
Keys are used when messages are to be written to partitions in a more controlled manner.
32. Messages and Batches,
Schemas, Topics and Partitions
For efficiency, messages are written into Kafka in batches.
A batch is just a collection of messages, all of which are being produced to the same
topic and partition.
An individual roundtrip across the network for each message would result in excessive
overhead, and collecting messages together into a batch reduces this.
Schemas: While messages are opaque byte arrays to Kafka itself, it is recommended that additional structure, or schema, be imposed on the message content so that it can be easily understood. There are many options available for message schema, depending on your application’s individual needs. Simplistic systems, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML), are easy to use and human-readable. However, they lack features such as robust type handling and compatibility between schema versions. Many Kafka developers favor the use of Apache Avro, which is a serialization framework originally developed for Hadoop.
33. Messages and Batches,
Schemas, Topics and Partitions
Topics and Partitions
Messages in Kafka are categorized into topics. The closest analogies for a topic are a
database table or a folder in a filesystem. Topics are additionally broken down into a
number of partitions. Going back to the “commit log” description, a partition is a single
log. Messages are written to it in an append-only fashion, and are read in order from
beginning to end. Note that as a topic typically has multiple partitions, there is no
guarantee of message time-ordering across the entire topic, just within a single partition.
Figure 1-5 shows a topic with four partitions, with writes being appended to the end of
each one. Partitions are also the way that Kafka provides redundancy and scalability.
Each partition can be hosted on a different server, which means that a single topic can be
scaled horizontally across multiple servers to provide performance far beyond the ability
of a single server.
35. Messages and Batches,
Schemas, Topics and Partitions
The term stream is often used when discussing data within systems like Kafka. Most
often, a stream is considered to be a single topic of data, regardless of the number of
partitions. This represents a single stream of data moving from the producers to the
consumers. This way of referring to messages is most common when discussing stream
processing, which is when frameworks—some of which are Kafka Streams, Apache
Samza, and Storm—operate on the messages in real time. This method of operation can
be compared to the way offline frameworks, namely Hadoop, are designed to work on
bulk data at a later time.
36. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-3
SLO 2 – Producers and Consumers
37. Producers and Consumers
Producers create new messages. In other publish/subscribe systems, these may be called
publishers or writers.
Consumers read messages. In other publish/subscribe systems, these clients may be
called subscribers or readers.
38. Brokers and Clusters
A single Kafka server is called a broker. The broker receives messages from producers,
assigns offsets to them, and commits the messages to storage on disk.
Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one
broker will also function as the cluster controller (elected automatically from the live
members of the cluster). The controller is responsible for administrative operations,
including assigning partitions to brokers and monitoring for broker failures. A partition is
owned by a single broker in the cluster, and that broker is called the leader of the
partition.
41. Use cases
Activity tracking
Messaging
Metrics and Logging
Commit log
Stream Processing
42. Sending Messages with
Producers Steps & Example
The simplest way to send a message is as follows (producer here is a KafkaProducer<String, String>, configured as shown earlier):

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record);  // fire-and-forget: send() returns a Future we choose to ignore
} catch (Exception e) {
    e.printStackTrace();    // errors raised before the send (e.g., serialization) surface here
}
43. Sending Messages with
Producers Steps & Example
Sending a Message Synchronously
The simplest way to send a message synchronously is as follows:

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record).get();  // get() blocks until the broker acknowledges, and throws on failure
} catch (Exception e) {
    e.printStackTrace();
}
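send() returns a Future<RecordMetadata>, so get() both blocks and hands back metadata about where the record landed. A small sketch:

RecordMetadata metadata = producer.send(record).get();  // throws if the send ultimately failed
System.out.printf("written to partition %d at offset %d%n",
    metadata.partition(), metadata.offset());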
44. Sending Messages with
Producers Steps & Example
Sending a Message Asynchronously

private class DemoProducerCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e != null) {
            e.printStackTrace();  // a non-null exception means the send failed
        }
    }
}

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");
producer.send(record, new DemoProducerCallback());  // returns immediately; the callback fires on completion
45. Receiving Messages with
Consumers Steps & Example
Creating a Kafka Consumer
The following code snippet shows how to create a KafkaConsumer:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
46. Receiving Messages with
Consumers Steps & Example
Subscribing to Topics
The subscribe() method takes a list of topics as a parameter, so it’s pretty simple to use:

consumer.subscribe(Collections.singletonList("customerCountries"));

To subscribe to all topics whose names match "test.*", we pass a regular expression using the Pattern overload of subscribe():

consumer.subscribe(Pattern.compile("test.*"));
47. Receiving Messages with
Consumers Steps & Example
The Poll Loop

Map<String, Integer> custCountryMap = new HashMap<>();
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);  // poll with a 100 ms timeout
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));  // pretty-print the running per-country counts
        }
    }
} finally {
    consumer.close();
}
48. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-4
SLO 1 – Brokers and Clusters
49. Brokers and Clusters
A single Kafka server is called a broker. The broker receives messages from producers,
assigns offsets to them, and commits the messages to storage on disk.
Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one
broker will also function as the cluster controller (elected automatically from the live
members of the cluster). The controller is responsible for administrative operations,
including assigning partitions to brokers and monitoring for broker failures. A partition is
owned by a single broker in the cluster, and that broker is called the leader of the
partition.
52. Use cases
Activity tracking
Messaging
Metrics and Logging
Commit log
Stream Processing
53. Sending Messages with
Producers Steps & Example
The simplest way to send a message is as follows (producer here is a KafkaProducer<String, String>, configured as shown earlier):

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record);  // fire-and-forget: send() returns a Future we choose to ignore
} catch (Exception e) {
    e.printStackTrace();    // errors raised before the send (e.g., serialization) surface here
}
54. Sending Messages with
Producers Steps & Example
Sending a Message Synchronously
The simplest way to send a message synchronously is as follows:

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record).get();  // get() blocks until the broker acknowledges, and throws on failure
} catch (Exception e) {
    e.printStackTrace();
}
55. Sending Messages with
Producers Steps & Example
Sending a Message Asynchronously

private class DemoProducerCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e != null) {
            e.printStackTrace();  // a non-null exception means the send failed
        }
    }
}

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");
producer.send(record, new DemoProducerCallback());  // returns immediately; the callback fires on completion
56. Receiving Messages with
Consumers Steps & Example
Creating a Kafka Consumer
The following code snippet shows how to create a KafkaConsumer:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
57. Receiving Messages with
Consumers Steps & Example
Subscribing to Topics
The subscribe() method takes a list of topics as a parameter, so it’s pretty simple to use:

consumer.subscribe(Collections.singletonList("customerCountries"));

To subscribe to all topics whose names match "test.*", we pass a regular expression using the Pattern overload of subscribe():

consumer.subscribe(Pattern.compile("test.*"));
58. Receiving Messages with
Consumers Steps & Example
The Poll Loop

Map<String, Integer> custCountryMap = new HashMap<>();
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);  // poll with a 100 ms timeout
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));  // pretty-print the running per-country counts
        }
    }
} finally {
    consumer.close();
}
59. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-4
SLO 2 – Multiple Clusters, Data Ecosystem
60. Brokers and Clusters
A single Kafka server is called a broker. The broker receives messages from producers,
assigns offsets to them, and commits the messages to storage on disk.
Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one
broker will also function as the cluster controller (elected automatically from the live
members of the cluster). The controller is responsible for administrative operations,
including assigning partitions to brokers and monitoring for broker failures. A partition is
owned by a single broker in the cluster, and that broker is called the leader of the
partition.
63. Use cases
Activity tracking
Messaging
Metrics and Logging
Commit log
Stream Processing
64. Sending Messages with
Producers Steps & Example
The simplest way to send a message is as follows (producer here is a KafkaProducer<String, String>, configured as shown earlier):

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record);  // fire-and-forget: send() returns a Future we choose to ignore
} catch (Exception e) {
    e.printStackTrace();    // errors raised before the send (e.g., serialization) surface here
}
65. Sending Messages with
Producers Steps & Example
Sending a Message Synchronously
The simplest way to send a message synchronously is as follows:

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record).get();  // get() blocks until the broker acknowledges, and throws on failure
} catch (Exception e) {
    e.printStackTrace();
}
66. Sending Messages with
Producers Steps & Example
Sending a Message Asynchronously

private class DemoProducerCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e != null) {
            e.printStackTrace();  // a non-null exception means the send failed
        }
    }
}

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");
producer.send(record, new DemoProducerCallback());  // returns immediately; the callback fires on completion
67. Receiving Messages with
Consumers Steps & Example
Creating a Kafka Consumer
The following code snippet shows how to create a KafkaConsumer:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
68. Receiving Messages with
Consumers Steps & Example
Subscribing to Topics
The subscribe() method takes a list of topics as a parameter, so it’s pretty simple to use:

consumer.subscribe(Collections.singletonList("customerCountries"));

To subscribe to all topics whose names match "test.*", we pass a regular expression using the Pattern overload of subscribe():

consumer.subscribe(Pattern.compile("test.*"));
69. Receiving Messages with
Consumers Steps & Example
The Poll Loop

Map<String, Integer> custCountryMap = new HashMap<>();
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);  // poll with a 100 ms timeout
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));  // pretty-print the running per-country counts
        }
    }
} finally {
    consumer.close();
}
70. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-5
SLO 1 – Sending Messages with
Producers
71. Sending Messages with
Producers Steps & Example
The simplest way to send a message is as follows (producer here is a KafkaProducer<String, String>, configured as shown earlier):

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record);  // fire-and-forget: send() returns a Future we choose to ignore
} catch (Exception e) {
    e.printStackTrace();    // errors raised before the send (e.g., serialization) surface here
}
72. Sending Messages with
Producers Steps & Example
Sending a Message Synchronously
The simplest way to send a message synchronously is as follows:

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record).get();  // get() blocks until the broker acknowledges, and throws on failure
} catch (Exception e) {
    e.printStackTrace();
}
73. Sending Messages with
Producers Steps & Example
Sending a Message Asynchronously

private class DemoProducerCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e != null) {
            e.printStackTrace();  // a non-null exception means the send failed
        }
    }
}

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");
producer.send(record, new DemoProducerCallback());  // returns immediately; the callback fires on completion
74. Receiving Messages with
Consumers Steps & Example
Creating a Kafka Consumer
The following code snippet shows how to create a KafkaConsumer:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
75. Receiving Messages with
Consumers Steps & Example
Subscribing to Topics
The subscribe() method takes a list of topics as a parameter, so it’s pretty simple to use:

consumer.subscribe(Collections.singletonList("customerCountries"));

To subscribe to all topics whose names match "test.*", we pass a regular expression using the Pattern overload of subscribe():

consumer.subscribe(Pattern.compile("test.*"));
76. Receiving Messages with
Consumers Steps & Example
The Poll Loop

Map<String, Integer> custCountryMap = new HashMap<>();
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);  // poll with a 100 ms timeout
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));  // pretty-print the running per-country counts
        }
    }
} finally {
    consumer.close();
}
77. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-5
SLO 2 – Steps and Example - Sending
Messages with Producers
78. Sending Messages with
Producers Steps & Example
The simplest way to send a message is as follows (producer here is a KafkaProducer<String, String>, configured as shown earlier):

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record);  // fire-and-forget: send() returns a Future we choose to ignore
} catch (Exception e) {
    e.printStackTrace();    // errors raised before the send (e.g., serialization) surface here
}
79. Sending Messages with
Producers Steps & Example
Sending a Message Synchronously
The simplest way to send a message synchronously is as follows:

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
    producer.send(record).get();  // get() blocks until the broker acknowledges, and throws on failure
} catch (Exception e) {
    e.printStackTrace();
}
80. Sending Messages with
Producers Steps & Example
Sending a Message Asynchronously

private class DemoProducerCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e != null) {
            e.printStackTrace();  // a non-null exception means the send failed
        }
    }
}

ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");
producer.send(record, new DemoProducerCallback());  // returns immediately; the callback fires on completion
81. Receiving Messages with
Consumers Steps & Example
Creating a Kafka Consumer
The following code snippet shows how to create a KafkaConsumer:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
82. Receiving Messages with
Consumers Steps & Example
Subscribing to Topics
The subscribe() method takes a list of topics as a parameter, so it’s pretty simple to use:

consumer.subscribe(Collections.singletonList("customerCountries"));

To subscribe to all topics whose names match "test.*", we pass a regular expression using the Pattern overload of subscribe():

consumer.subscribe(Pattern.compile("test.*"));
83. Receiving Messages with
Consumers Steps & Example
The Poll Loop

Map<String, Integer> custCountryMap = new HashMap<>();
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);  // poll with a 100 ms timeout
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));  // pretty-print the running per-country counts
        }
    }
} finally {
    consumer.close();
}
84. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-6
SLO 1 – Receiving Messages with
Consumers
85. Receiving Messages with
Consumers Steps & Example
Creating a Kafka Consumer
The following code snippet shows how to create a KafkaConsumer:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
86. Receiving Messages with
Consumers Steps & Example
Subscribing to Topics
The subscribe() method takes a list of topics as a parameter, so it’s pretty simple to use:

consumer.subscribe(Collections.singletonList("customerCountries"));

To subscribe to all topics whose names match "test.*", we pass a regular expression using the Pattern overload of subscribe():

consumer.subscribe(Pattern.compile("test.*"));
87. Receiving Messages with
Consumers Steps & Example
The Poll Loop

Map<String, Integer> custCountryMap = new HashMap<>();
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);  // poll with a 100 ms timeout
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));  // pretty-print the running per-country counts
        }
    }
} finally {
    consumer.close();
}
88. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-6
SLO 2 – Steps & Examples Receiving
Messages with Consumers
89. Receiving Messages with
Consumers Steps & Example
Creating a Kafka Consumer
The following code snippet shows how to create a KafkaConsumer:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
90. Receiving Messages with
Consumers Steps & Example
Subscribing to Topics
The subscribe() method takes a list of topics as a parameter, so it’s pretty simple to use:

consumer.subscribe(Collections.singletonList("customerCountries"));

To subscribe to all topics whose names match "test.*", we pass a regular expression using the Pattern overload of subscribe():

consumer.subscribe(Pattern.compile("test.*"));
91. Receiving Messages with
Consumers Steps & Example
The Poll Loop

Map<String, Integer> custCountryMap = new HashMap<>();
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);  // poll with a 100 ms timeout
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));  // pretty-print the running per-country counts
        }
    }
} finally {
    consumer.close();
}
92. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-7
SLO 1 – Developing Kafka Stream
Applications
93. Developing Kafka Stream
Applications
The Kafka Streams DSL is the high-level API that enables you to build Kafka
Streams applications quickly.
The high-level API is very well thought out, and there are methods to handle most
stream-processing needs out of the box, so you can create a sophisticated stream-
processing program without much effort.
At the heart of the high-level API is the KStream object, which represents the
streaming key/value pair records. Most of the methods in the Kafka Streams DSL return a
reference to a KStream object, allowing for a fluent interface style of programming.
Additionally, a good percentage of the KStream methods accept types consisting of
single-method interfaces allowing for the use of Java 8 lambda expressions. Taking these
factors into account, you can imagine the simplicity and ease with which you can build a
Kafka Streams program.
94. Phases in Kafka Stream
Applications Development
Your first program will be a toy application that takes incoming messages and
converts them to uppercase characters, effectively yelling at anyone who reads the
message.
95. Phases in Kafka Stream
Applications Development
This is a trivial example, but the code shown here is representative of what you’ll
see in other Kafka Streams programs. In most of the examples, you’ll see a similar
structure:
1. Define the configuration items.
2. Create Serde instances, either custom or predefined.
3. Build the processor topology.
4. Create and start the KStream.
When we get into the more advanced examples, the principal difference will be in the
complexity of the processor topology. With that in mind, it’s time to build your first
application.
96. Phases in Kafka Stream
Applications Development
Creating the topology for the Yelling App
The first step to creating any Kafka Streams application is to create a source node. The source node is responsible for consuming, from a topic, the records that will flow through the application.
97. Phases in Kafka Stream
Applications Development
The following line of code creates the source, or parent, node of the graph:

KStream<String, String> simpleFirstStream = builder.stream("src-topic",
    Consumed.with(stringSerde, stringSerde));

The simpleFirstStream KStream instance is set to consume messages written to the src-topic topic.
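Putting the four phases together, here is a minimal sketch of the complete Yelling App, assuming the same stringSerde and topic names as above (configuration values are illustrative):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "yelling-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

Serde<String> stringSerde = Serdes.String();      // predefined Serde for String keys and values
StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> simpleFirstStream =
    builder.stream("src-topic", Consumed.with(stringSerde, stringSerde));
simpleFirstStream.mapValues(String::toUpperCase)  // the "yelling": upper-case every value
    .to("out-topic", Produced.with(stringSerde, stringSerde));

KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();                             // run until shutdown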
98. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-7
SLO 2 – Phases in Kafka Stream
Application Development
99. Phases in Kafka Stream
Applications Development
Your first program will be a toy application that takes incoming messages and
converts them to uppercase characters, effectively yelling at anyone who reads the
message.
100. Phases in Kafka Stream
Applications Development
This is a trivial example, but the code shown here is representative of what you’ll
see in other Kafka Streams programs. In most of the examples, you’ll see a similar
structure:
1. Define the configuration items.
2. Create Serde instances, either custom or predefined.
3. Build the processor topology.
4. Create and start the KStream.
When we get into the more advanced examples, the principal difference will be in the
complexity of the processor topology. With that in mind, it’s time to build your first
application.
101. Phases in Kafka Stream
Applications Development
Creating the topology for the Yelling App
The first step to creating any Kafka Streams application is to create a source node. The source node is responsible for consuming, from a topic, the records that will flow through the application.
102. Phases in Kafka Stream
Applications Development
The following line of code creates the source, or parent, node of the graph:

KStream<String, String> simpleFirstStream = builder.stream("src-topic",
    Consumed.with(stringSerde, stringSerde));

The simpleFirstStream KStream instance is set to consume messages written to the src-topic topic.
103. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-8
SLO 1 – Constructing a Topology
104. Constructing a Topology
BUILDING THE SOURCE NODE
You’ll start by building the source node and first processor of the topology by chaining two calls to the KStream API together. It should be fairly obvious by now what the role of the origin node is. The first processor in the topology will be responsible for masking credit card numbers to protect customer privacy.
105. Constructing a Topology
KStream<String, Purchase> purchaseKStream =
    streamsBuilder.stream("transactions", Consumed.with(stringSerde, purchaseSerde))
        .mapValues(p -> Purchase.builder(p).maskCreditCard().build());

You create the source node with a call to the StreamsBuilder.stream method using a default String serde, a custom serde for Purchase objects, and the name of the topic that’s the source of the messages for the stream.
The next immediate call is to the KStream.mapValues method, taking a ValueMapper<V, V1> instance as a parameter. Value mappers take a single parameter of one type (a Purchase object, in this case) and map that object to a new value, possibly of another type. In this example, KStream.mapValues returns an object of the same type (Purchase), but with a masked credit card number.
Note that when using the KStream.mapValues method, the original key is unchanged and isn’t factored into mapping a new value. If you wanted to generate a new key/value pair or include the key in producing a new value, you’d use the KStream.map method, which takes a KeyValueMapper<K, V, KeyValue<K1, V1>> instance.
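For contrast, here is a sketch of KStream.map re-keying the stream; the getCustomerId() accessor on Purchase is hypothetical, for illustration only:

KStream<String, Purchase> rekeyedStream =
    purchaseKStream.map((key, purchase) ->
        KeyValue.pair(purchase.getCustomerId(), purchase));  // key is now derived from the value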
106. Constructing a Topology
BUILDING THE SECOND PROCESSOR
Now you’ll build the second processor, responsible for extracting pattern data from a topic, which ZMart can use to determine purchase patterns in regions of the country. You’ll also add a sink node responsible for writing the pattern data to a Kafka topic.
107. Constructing a Topology
This new KStream will start to receive PurchasePattern objects created as a result of the mapValues call.

KStream<String, PurchasePattern> patternKStream =
    purchaseKStream.mapValues(purchase -> PurchasePattern.builder(purchase).build());
patternKStream.to("patterns", Produced.with(stringSerde, purchasePatternSerde));

Here, you declare a variable to hold the reference of the new KStream instance, because you’ll use it to print the results of the stream to the console with a print call. This is very useful during development and for debugging. The purchase-patterns processor forwards the records it receives to a child node of its own, defined by the method call KStream.to, writing to the patterns topic.
108. Constructing a Topology
BUILDING THE THIRD PROCESSOR
The third processor in the topology is the customer rewards accumulator node, which will let ZMart track purchases made by members of their preferred customer club. The rewards accumulator sends data to a topic consumed by applications at ZMart HQ to determine rewards when customers complete purchases.
109. Constructing a Topology
KStream<String, RewardAccumulator> rewardsKStream =
    purchaseKStream.mapValues(purchase -> RewardAccumulator.builder(purchase).build());
rewardsKStream.to("rewards", Produced.with(stringSerde, rewardAccumulatorSerde));

You build the rewards accumulator processor using what should be by now a familiar pattern: creating a new KStream instance that maps the raw purchase data contained in the record to a new object type. You also attach a sink node to the rewards accumulator so the results of the rewards KStream can be written to a topic and used for determining customer reward levels.
110. Constructing a Topology
BUILDING THE LAST PROCESSOR
Finally, you’ll take the first KStream you created, purchaseKStream, and attach a sink node to write out the raw purchase records (with credit cards masked, of course) to a topic called purchases. The purchases topic will be used to feed into a NoSQL store such as Cassandra (http://cassandra.apache.org/), Presto (https://prestodb.io/), or Elasticsearch (www.elastic.co/webinars/getting-started-elasticsearch) to perform ad hoc analysis. Figure 3.9 shows the final processor.
111. Constructing a Topology
Specifically, you still performed the following steps:
Create a StreamsConfig instance.
Build one or more Serde instances.
Construct the processing topology.
Assemble all the components and start the Kafka Streams program.
In this application, I’ve mentioned using a Serde, but I haven’t explained why or how
you create them. Let’s take some time now to discuss the role of the Serde in a Kafka
Streams application.
112. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-8
SLO 2 – Streams and State – Applying
stateful operations
113. Streams and State
The preceding fictional scenario illustrates something that most of us already know
instinctively. Sometimes it’s easy to reason about what’s going on, but usually you
need some context to make good decisions. When it comes to stream processing, we
call that added context state.
At first glance, the notions of state and stream processing may seem to be at odds with
each other. Stream processing implies a constant flow of discrete events that don’t
have much to do with each other and need to be dealt with as they occur. The notion
of state might evoke images of a static resource, such as a database table.
114. Streams and State
In actuality, you can view these as one and the same. But the rate of change in a stream is potentially much faster and more frequent than in a database table. You don’t always need state to work with streaming data. In some cases, you may have discrete events or records that carry enough information to be valuable on their own. But more often than not, the incoming stream of data will need enrichment from some sort of store, either using information from events that arrived before, or joining related events with events from different streams.
115. Applying Stateful Operation
In this topology, you produced a stream of purchase-transaction events. One of the processing nodes in the topology calculated reward points for customers based on the amount of the sale. But in that processor, you just calculated the total number of points for the single transaction and forwarded the results.
If you added some state to the processor, you could keep track of the cumulative number of reward points. Then, the consuming application at ZMart would need to check the total and send out a reward if needed.
116. Applying Stateful Operation
Now that you have a basic idea of how state can be useful in Kafka Streams (or any
other streaming application), let’s look at some concrete examples.
You’ll start with transforming the stateless rewards processor into a stateful processor
using transformValues.
You’ll keep track of the total bonus points achieved so far and the amount of time
between purchases, to provide more information to downstream consumers.
117. Applying Stateful Operation
The transformValues processor
The most basic of the stateful functions is KStream.transformValues. Figure 4.4 illustrates how the KStream.transformValues() method operates. This method is semantically the same as KStream.mapValues(), with a few exceptions. One difference is that transformValues has access to a StateStore instance to accomplish its task. The other difference is its ability to schedule operations to occur at regular intervals via a punctuate() method.
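A sketch of how such a transformer is wired into the topology; PurchaseRewardTransformer and the store name are assumptions for illustration, and the state store must have been registered with the builder under that name:

String rewardsStateStoreName = "rewardsPointsStore";  // hypothetical store name

KStream<String, RewardAccumulator> statefulRewardAccumulator =
    purchaseKStream.transformValues(
        () -> new PurchaseRewardTransformer(rewardsStateStoreName),  // ValueTransformerSupplier
        rewardsStateStoreName);  // name(s) of the state store(s) the transformer uses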
118. Applying Stateful Operation
Stateful customer rewards
The rewards processor from the chapter 3 topology for ZMart extracts information for
customers belonging to ZMart’s rewards program. Initially, the rewards processor used
the KStream.mapValues() method to map the incoming Purchase object into a
RewardAccumulator object. The RewardAccumulator object originally consisted of just
two fields, the customer ID and the purchase total for the transaction. Now, the
requirements have changed some, and points are being associated with the ZMart
rewards program:
119. Applying Stateful Operation
Initializing the value transformer
The first step is to set up or create any instance variables in the transformer init() method. In the init() method, you retrieve the state store created when building the processing topology.
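A minimal sketch of that init() method, assuming the transformer received the store name through its constructor (field names are illustrative):

private final String storeName;   // set in the transformer's constructor
private KeyValueStore<String, Integer> stateStore;

@Override
@SuppressWarnings("unchecked")
public void init(ProcessorContext context) {
    // retrieve the state store registered with the topology under storeName
    stateStore = (KeyValueStore<String, Integer>) context.getStateStore(storeName);
}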
120. Applying Stateful Operation
Mapping the Purchase object to a RewardAccumulator using state
Now that you’ve initialized the processor, you can move on to transforming a Purchase object using state. A few simple steps for performing the transformation, sketched in code below, are as follows:
1. Check for points accumulated so far by customer ID.
2. Sum the points for the current transaction and present the total.
3. Set the reward points on the RewardAccumulator to the new total amount.
4. Save the new total points by customer ID in the local state store.
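A sketch of the transform step following those four points; the RewardAccumulator accessors are assumptions for illustration:

public RewardAccumulator transform(Purchase purchase) {
    RewardAccumulator rewardAccumulator = RewardAccumulator.builder(purchase).build();
    Integer accumulatedSoFar = stateStore.get(rewardAccumulator.getCustomerId());   // step 1

    if (accumulatedSoFar != null) {
        rewardAccumulator.addRewardPoints(accumulatedSoFar);                        // steps 2 and 3
    }
    stateStore.put(rewardAccumulator.getCustomerId(),
                   rewardAccumulator.getTotalRewardPoints());                       // step 4

    return rewardAccumulator;
}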
121. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-9
SLO 1 – Example Application
Development with Kafka Streams
122. Example Application
Development
Word Count
Let’s walk through an abbreviated word count example for Kafka Streams. You can
find the full example on GitHub.
The first thing you do when creating a stream-processing app is configure Kafka
Streams. Kafka Streams has a large number of possible configurations, which we
won’t discuss here, but you can find them in the documentation. In addition, you can
also configure the producer and consumer embedded in Kafka Streams by adding any
producer or consumer config to the Properties object:
123. Example Application
Development
public class WordCountExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // KEY_SERDE_CLASS_CONFIG/VALUE_SERDE_CLASS_CONFIG come from the older Streams API;
        // newer clients use DEFAULT_KEY_SERDE_CLASS_CONFIG and DEFAULT_VALUE_SERDE_CLASS_CONFIG
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
124. Example Application
Development
Every Kafka Streams application must have an application ID. This is used to
coordinate the instances of the application and also when naming the internal local
stores and the topics related to them. This name must be unique for each Kafka
Streams application working with the same Kafka cluster.
The Kafka Streams application always reads data from Kafka topics and writes its
output to Kafka topics. As we’ll discuss later, Kafka Streams applications also use
Kafka for coordination. So we had better tell our app where to find Kafka.
When reading and writing data, our app will need to serialize and deserialize, so we
provide default Serde classes. If needed, we can override these defaults later when
building the streams topology.
125. Example Application
Development
Now that we have the configuration, let’s build our streams topology:

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source = builder.stream("wordcount-input");
final Pattern pattern = Pattern.compile("\\W+");
KStream counts = source.flatMapValues(value ->
        Arrays.asList(pattern.split(value.toLowerCase())))
    .map((key, value) -> new KeyValue<Object, Object>(value, value))
    .filter((key, value) -> (!value.equals("the")))
    .groupByKey()
    .count("CountStore")
    .mapValues(value -> Long.toString(value))
    .toStream();
counts.to("wordcount-output");
126. Example Application
Development
We create a KStreamBuilder object and start defining a stream by pointing at the topic
we’ll use as our input.
Each event we read from the source topic is a line of words; we split it up using a
regular expression into a series of individual words. Then we take each word
(currently a value of the event record) and put it in the event record key so it can be
used in a group-by operation.
We filter out the word “the,” just to show how easy filtering is.
And we group by key, so we now have a collection of events for each unique word.
127. Example Application
Development
We count how many events we have in each collection. The result of counting is a Long data type. We convert it to a String so it will be easier for humans to read the results.
Only one thing is left: write the results back to Kafka.
Now that we have defined the flow of transformations that our application will run, we just need to run it:

KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
Thread.sleep(5000L);
streams.close();
    }
}
128. Example Application
Development
Define a KafkaStreams object based on our topology and the properties we defined.
Start Kafka Streams.
After a while, stop it.
That’s it! In just a few short lines, we demonstrated how easy it is to implement a single-event processing pattern (we applied a map and a filter on the events). We repartitioned the data by adding a group-by operator, and then maintained simple local state when we counted the number of times each word appeared.
At this point, we recommend running the full example. The README in the GitHub repository contains instructions on how to run the example.
129. Example Application
Development
One thing you’ll notice is that you can run the entire example on your machine without installing anything except Apache Kafka. This is similar to the experience you may have seen when using Spark in something like Local Mode. The main difference is that if your input topic contains multiple partitions, you can run multiple instances of the WordCount application (just run the app in several different terminal tabs), and you have your first Kafka Streams processing cluster. The instances of the WordCount application talk to each other and coordinate the work. One of the biggest barriers to entry with Spark is that local mode is very easy to use, but then to run a production cluster, you need to install YARN or Mesos and then install Spark on all those machines, and then learn how to submit your app to the cluster. With Kafka’s Streams API, you just start multiple instances of your app—and you have a cluster. The exact same app is running on your development machine and in production.
130. 18CSE489T - STREAMING ANALYTICS
UNIT-2
Session-9
SLO 2 – Demo – Kafka Streams
131. Demo – Kafka Streams
Stock Market Statistics
The next example is more involved—we will read a stream of stock market trading events that include the stock ticker, ask price, and ask size. In stock market trades, the ask price is what a seller is asking for, whereas the bid price is what the buyer is suggesting to pay. The ask size is the number of shares the seller is willing to sell at that price. For simplicity of the example, we’ll ignore bids completely. We also won’t include a timestamp in our data; instead, we’ll rely on event time populated by our Kafka producer.
We will then create output streams that contain a few windowed statistics:
• Best (i.e., minimum) ask price for every five-second window
• Number of trades for every five-second window
• Average ask price for every five-second window
All statistics will be updated every second.
132. Demo – Kafka Streams
For simplicity, we’ll assume our exchange only has 10 stock tickers trading in it. The setup and configuration are very similar to those we used in the “Word Count” example:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stockstat");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, Constants.BROKER);
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, TradeSerde.class.getName());
133. Demo – Kafka Streams
The main difference is the Serde classes used. In the “Word Count” example, we used strings for both key and value, and therefore used the Serdes.String() class as a serializer and deserializer for both. In this example, the key is still a string, but the value is a Trade object that contains the ticker symbol, ask price, and ask size.
In order to serialize and deserialize this object (and a few other objects we used in this small app), we used the Gson library from Google to generate a JSON serializer and deserializer from our Java object. We then created a small wrapper that created a Serde object from those. Here is how we created the Serde:

static public final class TradeSerde extends WrapperSerde<Trade> {
    public TradeSerde() {
        super(new JsonSerializer<Trade>(),
              new JsonDeserializer<Trade>(Trade.class));
    }
}
134. Demo – Kafka Streams
Nothing fancy, but you need to remember to provide a Serde object for every object you want to store in Kafka—input, output, and in some cases, also intermediate results. To make this easier, we recommend generating these Serdes through projects like Gson, Avro, Protobuf, or similar.
Now that we have everything configured, it’s time to build our topology:

KStream<TickerWindow, TradeStats> stats = source.groupByKey()
    .aggregate(TradeStats::new,
        (k, v, tradestats) -> tradestats.add(v),
        TimeWindows.of(5000).advanceBy(1000),
        new TradeStatsSerde(),
        "trade-stats-store")
    .toStream((key, value) -> new TickerWindow(key.key(), key.window().start()))
    .mapValues((trade) -> trade.computeAvgPrice());
stats.to(new TickerWindowSerde(), new TradeStatsSerde(), "stockstats-output");
135. Demo – Kafka Streams
We start by reading events from the input topic and performing a groupByKey() operation. Despite its name, this operation does not do any grouping. Rather, it ensures that the stream of events is partitioned based on the record key. Since we wrote the data into a topic with a key and didn’t modify the key before calling groupByKey(), the data is still partitioned by its key—so this method does nothing in this case.
After we ensure correct partitioning, we start the windowed aggregation. The aggregate method will split the stream into overlapping windows (a five-second window every second), and then apply an aggregation method on all the events in the window. The first parameter this method takes is a new object that will contain the results of the aggregation—TradeStats in our case. This is an object we created to contain all the statistics we are interested in for each time window—minimum price, average price, and number of trades.
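As a sketch of what such an accumulator could look like, here is a hypothetical TradeStats; the field and method names are assumptions (the real class lives in the book's GitHub example):

public class TradeStats {
    String ticker;
    int countTrades;     // number of trades in the window
    double sumPrice;     // running sum of ask prices, used for the average
    double minPrice;     // best (minimum) ask price seen in the window
    double avgPrice;

    public TradeStats add(Trade trade) {
        // fold one trade into the window's running statistics
        this.ticker = trade.getTicker();
        if (countTrades == 0 || trade.getAskPrice() < minPrice) {
            this.minPrice = trade.getAskPrice();
        }
        this.countTrades++;
        this.sumPrice += trade.getAskPrice();
        return this;
    }

    public TradeStats computeAvgPrice() {
        this.avgPrice = this.sumPrice / this.countTrades;
        return this;
    }
}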