Join Semantics in Kafka Streams
Himani Arora
Software Consultant
Knoldus Inc.
Agenda
● Introduction to Apache Kafka
● Introduction to Streams API
● How to use Streams API
● Join Operations supported in Kafka Streams
● Different types of Joins
Apache Kafka
Introduction
● Apache Kafka is a distributed streaming platform where producers send
messages—key-value pairs—to topics which in turn are polled and read by
consumers. Each topic is partitioned, and the partitions are distributed among
brokers.
● It has four core APIs:
○ Producer API
○ Consumer API
○ Streams API
○ Connector API
Streams API
● Kafka Streams is a client library for processing and analyzing data stored in Kafka.
● There are two main abstractions in the Streams API:
○ A KStream is a stream of key-value pairs. A KStream itself is stateless, but it
can be aggregated into the other core abstraction, a KTable.
○ A KTable is often described as a “changelog stream.” It holds the
latest value for a given message key and reacts automatically to newly incoming
messages.
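The difference between the two abstractions can be sketched in plain Java. This is only a model of the idea, not the Kafka Streams API: the class `ChangelogSketch` and the method `latestPerKey` are hypothetical names, and the records are represented as simple string pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java model (no Kafka dependency) of the stream view vs. the table view. */
final class ChangelogSketch {

    /** Replays key-value records in order and keeps only the latest value per key, as a KTable would. */
    static Map<String, String> latestPerKey(String[][] records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : records) {
            // A later record for the same key overwrites the earlier one.
            table.put(record[0], record[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] records = { {"alice", "1"}, {"bob", "2"}, {"alice", "3"} };
        // Read as a stream there are three independent records;
        // read as a table there are two keys, each with its latest value.
        System.out.println(latestPerKey(records)); // prints {alice=3, bob=2}
    }
}
```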
How to install the Streams API?
● There is no installation needed - Build Apps, Not Clusters!
● It is a library and can be added to your app like any other library.
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>1.1.0</version>
</dependency>
Joins
Kafka Streams supports three types of joins:
● Inner Joins
○ Gives an output when both input sources have records with the same key.
● Left Joins
○ Gives an output for each record in the left or primary input source. If the other source does not
have a value for a given key, it is set to null.
● Outer Joins
○ Gives an output for each record in either input source. If only one source contains a key, the
other is null.
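The three join types can be illustrated with a plain-Java sketch over two in-memory maps. This models only the key-matching rules described above, not windowing or the Kafka Streams API; `JoinTypesSketch`, `Type` and `join` are hypothetical names, and the maps are assumed to contain no null values.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of inner, left and outer join semantics on two keyed sources. */
final class JoinTypesSketch {

    enum Type { INNER, LEFT, OUTER }

    static Map<String, String> join(Map<String, String> left, Map<String, String> right, Type type) {
        // Inner and left joins are driven by the left keys; an outer join also adds the right keys.
        Set<String> keys = new LinkedHashSet<>(left.keySet());
        if (type == Type.OUTER) {
            keys.addAll(right.keySet());
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : keys) {
            String l = left.get(key);   // null if only the right side has this key
            String r = right.get(key);  // null if only the left side has this key
            if (type == Type.INNER && r == null) {
                continue;               // an inner join needs a match on both sides
            }
            out.put(key, "left=" + l + ", right=" + r);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> left = new LinkedHashMap<>();
        left.put("a", "1");
        left.put("b", "2");
        Map<String, String> right = new LinkedHashMap<>();
        right.put("b", "x");
        right.put("c", "y");
        for (Type type : Type.values()) {
            System.out.println(type + " -> " + join(left, right, type));
        }
    }
}
```

With these inputs, only key `b` appears in the inner join, keys `a` and `b` in the left join, and all three keys in the outer join, with `null` standing in for the missing side.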
Type 1     Type 2          Inner Join   Left Join   Outer Join
KStream    KStream         ✔            ✔           ✔
KStream    KTable          ✔            ✔           ✖
KStream    GlobalKTable    ✔            ✔           ✖
KTable     KTable          ✔            ✔           ✔
KStream-KStream Join
● This is a sliding window join: records with the same key are joined when their timestamps are
close to each other, that is, when they differ by at most the size of the window.
● These joins are always windowed joins because otherwise, the size of the internal state store used
to perform the join would grow indefinitely.
● Since a KStream-KStream join is always a windowed join, we must provide a join window.
// left is a KStream<String, Long>, right is a KStream<String, Double>
KStream<String, String> joined = left.join(right,
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue, /* ValueJoiner */
    JoinWindows.of(TimeUnit.MINUTES.toMillis(5)), /* join window of 5 minutes */
    Joined.with(
        Serdes.String(), /* key */
        Serdes.Long(),   /* left value */
        Serdes.Double()) /* right value */
);
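The window condition itself can be sketched without Kafka. The following plain-Java model assumes that two records join when they share a key and their timestamps differ by at most the window size; `WindowedJoinSketch`, `Record` and `innerJoin` are hypothetical names, and the nested loop stands in for the state stores a real implementation would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of a windowed inner join between two record streams. */
final class WindowedJoinSketch {

    static final class Record {
        final String key;
        final String value;
        final long timestampMs;

        Record(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Joins records with equal keys whose timestamps are at most windowMs apart. */
    static List<String> innerJoin(List<Record> left, List<Record> right, long windowMs) {
        List<String> out = new ArrayList<>();
        for (Record l : left) {
            for (Record r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean inWindow = Math.abs(l.timestampMs - r.timestampMs) <= windowMs;
                if (sameKey && inWindow) {
                    out.add("left=" + l.value + ", right=" + r.value);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        List<Record> left = Arrays.asList(new Record("k", "A", 0L), new Record("k", "B", 600_000L));
        List<Record> right = Arrays.asList(new Record("k", "x", 240_000L));
        // A and x are 4 minutes apart and join; B and x are 6 minutes apart and do not.
        System.out.println(innerJoin(left, right, fiveMinutes)); // prints [left=A, right=x]
    }
}
```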
KTable-KTable Join
● KTable-KTable joins are designed to be consistent with their counterparts in relational databases.
They are always non-windowed joins.
● The changelog streams of KTables are materialized into local state stores that represent the latest
snapshot of their tables. The join result is a new KTable representing the changelog stream of the join
operation.
KTable<String, String> joined = left.join(right,
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue /* ValueJoiner */
);
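As a rough illustration of a symmetric, non-windowed table-table inner join, the plain-Java sketch below keeps the latest value per key for each side and refreshes the result whenever either side changes. It is a simplified model, not the library's implementation; `TableTableJoinSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a KTable-KTable inner join: an update on either side refreshes the result. */
final class TableTableJoinSketch {

    private final Map<String, String> left = new HashMap<>();
    private final Map<String, String> right = new HashMap<>();
    private final Map<String, String> result = new TreeMap<>();

    void updateLeft(String key, String value) {
        left.put(key, value);
        refresh(key);
    }

    void updateRight(String key, String value) {
        right.put(key, value);
        refresh(key);
    }

    /** Recomputes the joined value for one key from the latest snapshot of both tables. */
    private void refresh(String key) {
        String l = left.get(key);
        String r = right.get(key);
        if (l != null && r != null) {
            result.put(key, "left=" + l + ", right=" + r);
        } else {
            result.remove(key); // an inner join has no entry until both sides have the key
        }
    }

    Map<String, String> result() {
        return result;
    }

    public static void main(String[] args) {
        TableTableJoinSketch join = new TableTableJoinSketch();
        join.updateLeft("k", "1");   // no result yet: the right side has no value for "k"
        join.updateRight("k", "x");  // result for "k" becomes left=1, right=x
        join.updateLeft("k", "2");   // result for "k" is replaced by left=2, right=x
        System.out.println(join.result()); // prints {k=left=2, right=x}
    }
}
```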
KStream-KTable Join
● KStream-KTable joins are non-windowed joins. They allow you to perform a table lookup
against a KTable every time a new record is received from the KStream.
● In contrast to stream-stream and table-table joins, which are both symmetric, a stream-table join is
asymmetric: only records arriving on the stream side trigger a join result, while updates to the table
only change the lookup state.
// left is a KStream<String, Long>, right is a KTable
KStream<String, String> joined = left.join(right,
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue, /* ValueJoiner */
    Joined.keySerde(Serdes.String())   /* key */
        .withValueSerde(Serdes.Long()) /* left value */
);
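The asymmetry can be sketched in plain Java: table updates only change the lookup state, and only stream records produce output. This is a simplified model of an inner stream-table join, with `StreamTableJoinSketch` and its methods as hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of a KStream-KTable inner join driven by the stream side only. */
final class StreamTableJoinSketch {

    private final Map<String, String> table = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    /** A table update changes the lookup state but emits nothing. */
    void onTableUpdate(String key, String value) {
        table.put(key, value);
    }

    /** A stream record looks up the current table value and emits a result on a match. */
    void onStreamRecord(String key, String value) {
        String tableValue = table.get(key);
        if (tableValue != null) {
            output.add("left=" + value + ", right=" + tableValue);
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        StreamTableJoinSketch join = new StreamTableJoinSketch();
        join.onStreamRecord("k", "1");  // no match yet, so nothing is emitted
        join.onTableUpdate("k", "x");   // updates the state, still nothing emitted
        join.onStreamRecord("k", "2");  // emits left=2, right=x
        System.out.println(join.output()); // prints [left=2, right=x]
    }
}
```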
KStream-GlobalKTable Join
● KStream-GlobalKTable joins are always non-windowed joins.
● They differ from KStream-KTable joins in the following manner:
○ They allow for efficient star joins, joining a large-scale fact stream with dimension tables.
○ They allow for joining against foreign keys.
○ They are often more efficient than their partitioned KTable counterpart.
KStream<String, String> joined = left.join(right,
    (leftKey, leftValue) -> leftKey.length(), /* derive a new key by which to look up against the table */
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue /* ValueJoiner */
);
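The foreign-key lookup can be modelled in plain Java as a function that first maps the stream record to a lookup key and then reads the table. This is a sketch under the assumption that the global table is a fully replicated map; `ForeignKeyJoinSketch` and `joinOne` are hypothetical names, and the key mapper mirrors the `leftKey.length()` example above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Plain-Java model of a KStream-GlobalKTable inner join with a derived lookup key. */
final class ForeignKeyJoinSketch {

    /** Returns the joined value for one stream record, or null if the table has no match. */
    static <K, V, GK, GV> String joinOne(K key, V value,
                                         Map<GK, GV> globalTable,
                                         BiFunction<K, V, GK> keyMapper) {
        // The lookup key need not be the stream record's own key.
        GK lookupKey = keyMapper.apply(key, value);
        GV tableValue = globalTable.get(lookupKey);
        return tableValue == null ? null : "left=" + value + ", right=" + tableValue;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(3, "three");
        // The stream key "abc" has length 3, so the lookup uses key 3.
        String joined = ForeignKeyJoinSketch.<String, String, Integer, String>joinOne(
            "abc", "v", table, (k, v) -> k.length());
        System.out.println(joined); // prints left=v, right=three
    }
}
```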
References
● https://kafka.apache.org/documentation/streams
● https://docs.confluent.io/current/streams/developer-guide/dsl-api.html
● https://www.confluent.io/blog/crossing-streams-joins-apache-kafka
Q&A
Please email your queries to himani.arora@knoldus.in

Editor's Notes

  • #8 The records in a KStream either come directly from a topic or have gone through some kind of transformation—for example there is a filter method that takes a predicate and returns another KStream that only contains those elements that satisfy the predicate.