Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent

Temporal Joins in
Kafka Streams and ksqlDB
Matthias J. Sax | Software Engineer
@MatthiasJSax

Ecosystem
2
@MatthiasJSax
ksqlDB: streaming database for Apache Kafka
• SQL interface to process data stored in Apache Kafka
• Declarative approach to stream processing
• Queries instead of “programming”
Kafka Streams: Java library for stream processing
• Part of Apache Kafka
• ”Functional” DSL but still programming
Both ksqlDB and Kafka Streams support joins.
Joins are powerful but streaming joins can be difficult to understand.

Joins: The Basics
3
@MatthiasJSax
https://www.confluent.io/kafka-summit-ny19/zen-and-the-art-of-streaming-joins/

Temporal Joins – Why should I give a Damn?
4
Static Data vs Streaming Data
• Data is constantly in motion
• Input tables are not static but updated all the time
• The result must be updated continuously and with deterministic semantics
Relational Joins are Defined over (static) Tables only:
• What about joining streams?
• What about joining a stream and a table?
Temporal Joins define deterministic (event-time) semantics
over continuously changing inputs.
@MatthiasJSax

Event-time vs Processing-time
5
Database Transactions are not predicable!
Database Txs offer ACID guarantees, that are defined over processing time:
• If you run a set of concurrent (read/write) transactions over a database multiple times, there is no guarantee that you get
the same result!
• You ”only” get a guarantee that each ”run” produces a consistent result
@MatthiasJSax

Example: Tx Processing
6
Tx1 w
Tx3 r (join)
Tx2 w
?
@MatthiasJSax

Streams, Records, Timestamps
7
Topic can be processed as:
• Event Stream (STREAM in ksqlDB / KStream in Kafka Streams)
• Changelog Stream (TABLE in ksqlDB / KTable in Kafka Streams)
• ”Tx Order” is determined upstream
Topic contains:
• Timestamped records
• Timestamps define “Tx Order”
• Need to obey pre-defined “Tx Order” when processing the data streams (ie, event-time semantics)
• Timestamps are data!
• Temporal joins are defined on event-time: provides deterministic processing semantics
@MatthiasJSax

* GlobalKTables in Kafka Streams are one exception (ie, non-deterministic stream-globalTable-join)
All* joins in Kafka Streams and ksqlDB
are temporal joins!
@MatthiasJSax

Versioned Tables
9
@MatthiasJSax
Tables evolve over time:
We can associate a different table version for each point in stream-time
Changelog Stream:
Table Versions:
14:01
a 14:03
b 14:05
c 14:08
b 14:11
a
14:01
a
14:03
b
14:05
c
14:05
14:01
a
14:08
b
14:05
c
14:08
14:11
a
14:08
b
14:05
c
14:11
14:01
a
14:03
b
14:03
14:01
a
14:01
stream-time

Temporal Table-Table Join
10
@MatthiasJSax
Join tables with the same version (ie, event-time)
Left Table
Right Table
Result Table
stream-time
14:01 14:03 14:05 14:08 14:11
14:02 14:04 14:06 14:07 14:09 14:10

Example: Table-Table Join
11
@MatthiasJSax

Data Enrichment: Stream-Table Join
12
@MatthiasJSax
Enrich events with table data: ”lookup join”
For each event-stream record, do a table lookup:
• Temporal table lookup: join a stream record with event-time T to table version T
Changelog Stream:
Input Table:
Input Stream:
Result Stream:
14:06
…
14:05
… 14:10
…
14:02
…
14:06
… 14:10
…
14:05
…
14:04
… 14:07
…

Example: Stream-Table Join
13
@MatthiasJSax

There is no concept of “bootstrapping” a table:
• Table versions will be evolved based on processing progress, ie,
stream-time.
• This ensure that the correct table version is loaded at each point
in stream-time.
@MatthiasJSax

Joining Event Streams – How to Handle Infinite Input
15
@MatthiasJSax
Event Streams are infinite and there is no concept of “versions”
Limit join “scope” with a temporal join condition, ie, a time-band-join.
-- mental model
SELECT * FROM stream1, stream2
WHERE
-- equi-join condition
stream1.key = stream2.key
AND
-- time condition
stream1.ts - windowSize <= stream2.ts
AND stream2.ts <= stream1.ts + windowSize

16
@MatthiasJSax
Example: join window size 5
Left Stream
Right Stream
Result Stream
14:04
1 14:16
3
14:01
1 14:16
3
SELECT *
FROM leftStream AS l JOIN rightStream AS r
WITHIN 5 minutes ON l.id = r.id;
14:04
1 14:11
2 14:12
3

Left/Outer Stream-Stream Join
17
@MatthiasJSax
Example: spurious left join result with window size 5
Left Stream
Right Stream
Result Stream
14:04
1 14:16
3
14:01
1 14:16
3
14:04
1 14:11
2 14:12
3
14:11
2 14:12
3

18
@MatthiasJSax
Example: delayed left join result with window size 5 (WIP)
Left Stream
Right Stream
Result Stream
14:04
1 14:16
3
14:01
1 14:16
3
14:04
1 14:11
2 14:12
3
14:11
2

Timestamping Result Records
19
@MatthiasJSax
Result determinism requires deterministic result record event-timestamps
Out-of-Order data processing need to be considered
Example: Stream-Stream join with window size 5
14:04
1 14:16
2 14:08
2
14:01
1 14:11
2 14:23
2
14:04
1 14:16
2 14:11
2
max(l.ts; r.ts)

The Outlier: GlobalKTables
20
@MatthiasJSax
GlobalKTables have no concept of stream-time
Designed for “static” (but still mutable) data
• In contrast to regular tables, a GlobalKTable is bootstrapped at startup
• GlobalKTable updates are applied unsynchronized
• Stream-GlobalKTable join is non-deterministic on GlobalKTable updates
Global Changelog:
Global Table:
Input Stream:
14:05
…
14:02
… 14:10
…
14:04
… 14:07
… 14:09
…

Broadcast vs Replication and Temporal Semantics
21
@MatthiasJSax
TABLE
KTable
n/a
GlobalKTable
TABLE*
KTable*
n/a
(*) with custom timestamp extractor
than ensures “preferred
processing”, e.g., always returns
timestamp zero

Wrapping Up
22
Temporal Join are a Key Concept in Data Stream Processing
• Generalization of SQL joins (for snapshots) to continuously changing data
• Ensure deterministic / reproducible results
• Types of Temporal Joins:
• Joining evolving tables
• Joining streams to evolving tables
• Stream-Stream join
• Outlier: GlobalKTables
• Sharding vs replication & time synchronized vs unsynchronized/non-determistic

Thanks! We are hiring!
@MatthiasJSax
matthias@confluent.io | mjsax@apache.org

Joins: The Basics
24
@MatthiasJSax

Joins: The Basics
25
Join Types
INNER LEFT (OUTER) RIGHT (OUTER) (FULL) OUTER
R left-join S <=> S right-join R
Join Conditions
Most distributed systems only support equi-joins (ie, left.attribute = right.otherAttribute) because they can
be computed efficiently.
@MatthiasJSax

26
@MatthiasJSax
Example: join window size 5
Left Stream
Right Stream
Result Stream
14:04
1 14:16
2
14:01
1 14:12
3 14:16
2
14:04
1 14:11
2 14:23
3
SELECT *
FROM leftStream AS l JOIN rightStream AS r
WITHIN 5 minutes ON l.id = r.id;

27
@MatthiasJSax
Example: spurious left join result with window size 5
Left Stream
Right Stream
Result Stream
14:04
1 14:16
2
14:01
1 14:12
3 14:16
2
14:04
1 14:11
2 14:23
3
14:11
2 14:23
3

28
@MatthiasJSax
Example: delayed left join result with window size 5 (WIP)
Left Stream
Right Stream
Result Stream
14:04
1 14:16
2
14:01
1 14:16
2
14:10
3
14:04
1 14:11
2
14:10
3

Bending Time: Timestamp Extractor
29
@MatthiasJSax

Fighting Time Jitter (max.task.idle.ms)
30
@MatthiasJSax

Future Work: outer-s-s-join / versioned tables / s-t join / time
synchronization / custom topic prioritization
31
@MatthiasJSax

Typography – Headings
32
H1 / Mark Pro Bold
H2 / Mark Pro Medium 16pt
24pt
FF Mark® is the new primary
typeface for Confluent. When
creating a new text box, the font
defaults to Calibri in
PowerPoint. Please make sure
to [download and] select one of
the outlined fonts listed here.
Download FF Mark Pro:
cnfl.io/ffmark-font

Typography – Body
33
Body / Mark Pro Book
Body / Mark Pro Light 12pt
Source Code / Source Code Pro 12pt
12pt
FF Mark® is the new primary
typeface for Confluent. When
creating a new text box, the font
defaults to Calibri in
PowerPoint. Please make sure
to [download and] select one of
the outlined fonts listed here.
Download FF Mark Pro:
cnfl.io/ffmark-font

Single Line Headline Mark Pro Bold 24 pt
34

Single column text slide
Subhead Mark Pro Bold 16 pt
Body copy would go here.
This would just be to show how longer copy would look. Mark Pro Book 12 pt.
• Bullet Level 1
• Bullet Level 2
35

Single column text slide – Denim
36
• Bullet Level 1
• Bullet Level 2

Single column text slide – Robins Egg
37
• Bullet Level 1
• Bullet Level 2

Single column text slide – Powder
38
• Bullet Level 1
• Bullet Level 2

39
Headline
Mark Pro Bold 24 pt
Body copy would go here. This would just be to
show how longer copy would look. Mark Pro
Light 14 pt.
Light 14 pt.

40
Headline
Mark Pro Bold 24 pt
Light 14 pt.
Light 14 pt.

41
Headline
Mark Pro Bold 24 pt
Light 14 pt.
Light 14 pt.

Breaker Page – Denim
Sample line 2

Breaker Page – Robins Egg
Sample line 2

Breaker Page – Powder
Sample line 2

Full Name, Title, Company
“Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Etsi ea quidem, quae adhuc
dixisti, quamvis ad aetatem recte isto modo
dicerentur.”

“Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Etsi ea quidem, quae adhuc
dixisti, quamvis ad aetatem recte isto modo
dicerentur.”
Full Name, Title, Company

Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent

Similar to Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent (20)

More from HostedbyConfluent

More from HostedbyConfluent (20)

Recently uploaded

Recently uploaded (20)

Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent