Joins in Kafka Streams and ksqlDB are a killer-feature for data processing and basic join semantics are well understood. However, in a streaming world records are associated with timestamps that impact the semantics of joins: welcome to the fabulous world of _temporal_ join semantics. For joins, timestamps are as important as the actual data and it is important to understand how they impact the join result.
In this talk we want to deep dive on the different types of joins, with a focus of their temporal aspect. Furthermore, we relate the individual join operators to the overall ""time engine"" of the Kafka Streams query runtime and explain its relationship to operator semantics. To allow developers to apply their knowledge on temporal join semantics, we provide best practices, tip and tricks to ""bend"" time, and configuration advice to get the desired join results. Last, we give an overview of recent, and an outlook to future, development that improves joins even further.
2. Ecosystem
2
@MatthiasJSax
ksqlDB: streaming database for Apache Kafka
• SQL interface to process data stored in Apache Kafka
• Declarative approach to stream processing
• Queries instead of “programming”
Kafka Streams: Java library for stream processing
• Part of Apache Kafka
• ”Functional” DSL but still programming
Both ksqlDB and Kafka Streams support joins.
Joins are powerful but streaming joins can be difficult to understand.
4. Temporal Joins – Why should I give a Damn?
4
Static Data vs Streaming Data
• Data is constantly in motion
• Input tables are not static but updated all the time
• The result must be updated continuously and with deterministic semantics
Relational Joins are Defined over (static) Tables only:
• What about joining streams?
• What about joining a stream and a table?
Temporal Joins define deterministic (event-time) semantics
over continuously changing inputs.
@MatthiasJSax
5. Event-time vs Processing-time
5
Database Transactions are not predictable!
Database Txs offer ACID guarantees, that are defined over processing time:
• If you run a set of concurrent (read/write) transactions over a database multiple times, there is no guarantee
that you get the same result!
• You ”only” get a guarantee that each ”run” produces a consistent result
@MatthiasJSax
7. Streams, Records, Timestamps
7
Topic can be processed as:
• Event Stream (STREAM in ksqlDB / KStream in Kafka Streams)
• Changelog Stream (TABLE in ksqlDB / KTable in Kafka Streams)
• ”Tx Order” is determined upstream
Topic contains:
• Timestamped records
• Timestamps define “Tx Order”
• Need to obey pre-defined “Tx Order” when processing the data streams (ie, event-time semantics)
• Timestamps are data!
• Temporal joins are defined on event-time: provides deterministic processing semantics
@MatthiasJSax
8. * GlobalKTables in Kafka Streams are one exception (ie, non-deterministic stream-globalTable-join)
All* joins in Kafka Streams and ksqlDB
are temporal joins!
@MatthiasJSax
9. Versioned Tables
9
@MatthiasJSax
Tables evolve over time:
We can associate a different table version for each point in stream-time
Changelog Stream:
Table Versions:
14:01
a 14:03
b 14:05
c 14:08
b 14:11
a
14:01
a
14:03
b
14:05
c
14:05
14:01
a
14:08
b
14:05
c
14:08
14:11
a
14:08
b
14:05
c
14:11
14:01
a
14:03
b
14:03
14:01
a
14:01
stream-time
10. Temporal Table-Table Join
10
@MatthiasJSax
Join tables with the same version (ie, event-time)
Left Table
Right Table
Result Table
stream-time
14:01 14:03 14:05 14:08 14:11
14:02 14:04 14:06 14:07 14:09 14:10
12. Data Enrichment: Stream-Table Join
12
@MatthiasJSax
Enrich events with table data: ”lookup join”
For each event-stream record, do a table lookup:
• Temporal table lookup: join a stream record with event-time T to table version T
Changelog Stream:
Input Table:
Input Stream:
Result Stream:
14:06
…
14:05
… 14:10
…
14:02
…
14:06
… 14:10
…
14:05
…
14:04
… 14:07
…
14. There is no concept of “bootstrapping” a table:
• Table versions will be evolved based on processing progress,
ie, stream-time.
• This ensure that the correct table version is loaded at each
point in stream-time.
@MatthiasJSax
15. Joining Event Streams – How to Handle Infinite Input
15
@MatthiasJSax
Event Streams are infinite and there is no concept of “versions”
Limit join “scope” with a temporal join condition, ie, a time-band-join.
-- mental model
SELECT * FROM stream1, stream2
WHERE
-- equi-join condition
stream1.key = stream2.key
AND
-- time condition
stream1.ts - windowSize <= stream2.ts
AND stream2.ts <= stream1.ts + windowSize
16. Joining Event Streams – How to Handle Infinite Input
16
@MatthiasJSax
Example: join window size 5
Left Stream
Right Stream
Result Stream
14:04
1 14:16
3
14:01
1 14:16
3
SELECT *
FROM leftStream AS l JOIN rightStream AS r
WITHIN 5 minutes ON l.id = r.id;
14:04
1 14:11
2 14:12
3
19. Timestamping Result Records
19
@MatthiasJSax
Result determinism requires deterministic result record event-timestamps
Out-of-Order data processing need to be considered
Example: Stream-Stream join with window size 5
14:04
1 14:16
2 14:08
2
14:01
1 14:11
2 14:23
2
14:04
1 14:16
2 14:11
2
max(l.ts; r.ts)
20. The Outlier: GlobalKTables
20
@MatthiasJSax
GlobalKTables have no concept of stream-time
Designed for “static” (but still mutable) data
• In contrast to regular tables, a GlobalKTable is bootstrapped at startup
• GlobalKTable updates are applied unsynchronized
• Stream-GlobalKTable join is non-deterministic on GlobalKTable updates
Global Changelog:
Global Table:
Input Stream:
14:05
…
14:02
… 14:10
…
14:04
… 14:07
… 14:09
…
21. Broadcast vs Replication and Temporal Semantics
21
@MatthiasJSax
time synchronized
unsynchronized
replicated
TABLE
KTable
n/a
GlobalKTable
TABLE*
KTable*
n/a
(*) with custom timestamp
extractor than ensures
“preferred processing”, e.g.,
always returns timestamp zero
sharded
22. Wrapping Up
22
Temporal Join are a Key Concept in Data Stream Processing
• Generalization of SQL joins (for snapshots) to continuously changing data
• Ensure deterministic / reproducible results
• Types of Temporal Joins:
• Joining evolving tables
• Joining streams to evolving tables
• Stream-Stream join
• Outlier: GlobalKTables
• Sharding vs replication & time synchronized vs unsynchronized/non-determistic
23. Thanks! We are hiring!
@MatthiasJSax
matthias@confluent.io | mjsax@apache.org