(Nick Dearden, Confluent) Kafka Summit SF 2018
One of the most powerful capabilities of both KSQL and the Kafka Streams library is that they allow us to easily express multiple “types” of join over continuous streams of data, and then have those joins be executed in distributed fashion by a self-organizing group of machines—but how many of us really understand the intrinsic qualities of a RIGHT OUTER STREAM-STREAM JOIN, SPAN(5 MINUTES, 2 MINUTES)? And what happens when that data can arrive late or out of order? In this talk we will explore the available streaming join options, covering common uses and examples for each, how to decide between them and illustrations of what’s really happening under the covers in the face of real-world data.
7. 7
TABLE                     | STREAM         | TABLE
alice 1                   | (“alice”, 1)   | alice 1
alice 1, charlie 1        | (“charlie”, 1) | alice 1, charlie 1
alice 2, charlie 1        | (“alice”, 2)   | alice 2, charlie 1
alice 2, charlie 1, bob 1 | (“bob”, 1)     | alice 2, charlie 1, bob 1
8. 8
Streams & Tables
● STREAM and TABLE as first-class citizens
● Interpretations of topic content
● STREAM - data in motion
● TABLE - collected state of a stream
• One record per key (per window)
• Current values (compacted topic)
• Changelog
● STREAM – TABLE Joins
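The "collected state of a stream" idea can be sketched in a few lines. This is a minimal illustration, not KSQL or Kafka Streams code: `materialize` is a hypothetical helper, and the records are the ones from the alice/charlie/bob slide.

```python
def materialize(stream):
    """Fold a stream of (key, value) records into a table: latest value per key."""
    table = {}
    snapshots = []  # table contents after each record, for illustration
    for key, value in stream:
        if value is None:
            table.pop(key, None)  # a NULL value is a tombstone: delete the key
        else:
            table[key] = value    # a newer record replaces the older one
        snapshots.append(dict(table))
    return table, snapshots


stream = [("alice", 1), ("charlie", 1), ("alice", 2), ("bob", 1)]
table, snapshots = materialize(stream)
for record, snapshot in zip(stream, snapshots):
    print(record, "->", snapshot)
```

Reading the list of snapshots top to bottom gives the table's changelog, which is the same information as the stream.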
9. 9
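Join          | Type         | INNER     | LEFT OUTER | FULL OUTER
Stream-Stream | Windowed     | Supported | Supported  | Supported
Table-Table   | Non-windowed | Supported | Supported  | Supported
Stream-Table  | Non-windowed | Supported | Supported  | Not supported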
12. 12
Why?
● Relate 2 streams of ongoing facts or events
● Ad impressions -> ad clicks
● Orders -> Shipments
Stream-Stream
13. 13
How?
● Equi-join on the key from each side
● Co-partitioning
● Stream-Stream joins are time-windowed
● Use asymmetric windowing to indicate happens-before or happens-after
● Each input record triggers an output for every match from the other side
● Input records with NULL key or value are ignored and don’t trigger an output
Stream-Stream
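A rough sketch of what asymmetric windowing means, using the Orders -> Shipments example. The helper `joinable` and the timestamps (in seconds) are assumptions made for illustration; it only models the window check, not a real KSQL or Kafka Streams API.

```python
def joinable(left_ts, right_ts, before, after):
    """True if the right record falls inside the window around the left record.

    before / after: how far the right record may precede / follow the left one.
    """
    return left_ts - before <= right_ts <= left_ts + after


ORDER_TS = 1_000  # timestamp of an order event on the left stream

# Asymmetric window: a shipment must follow its order within one hour.
print(joinable(ORDER_TS, 1_500, before=0, after=3_600))  # True: shipped after the order
print(joinable(ORDER_TS, 900, before=0, after=3_600))    # False: "shipped" before the order

# Symmetric window: either record may come first.
print(joinable(ORDER_TS, 900, before=3_600, after=3_600))  # True
```

Setting `before` to zero is what encodes "the order happens before the shipment".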
16. 16
Stream-Stream Join - What
Input values in uppercase arrive on the left stream, lowercase on the right stream.
Time | Input | INNER JOIN             | LEFT JOIN              | OUTER JOIN
1    | null  |                        |                        |
2    | A     |                        | [A, null]              | [A, null]
3    | a     | [A, a]                 | [A, a]                 | [A, a]
4    | B     | [B, a]                 | [B, a]                 | [B, a]
5    | b     | [A, b], [B, b]         | [A, b], [B, b]         | [A, b], [B, b]
6    | null  |                        |                        |
7    | C     | [C, a], [C, b]         | [C, a], [C, b]         | [C, a], [C, b]
8    | c     | [A, c], [B, c], [C, c] | [A, c], [B, c], [C, c] | [A, c], [B, c], [C, c]
Condition: all incoming records have the same key
Condition: all incoming records arrive within the join window
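The table can be reproduced with a small simulation. This is a sketch under the two stated conditions (one key, everything inside the window), so every record seen so far on the other side counts as a match. `stream_stream_join` is a hypothetical helper, and putting the two null records on the left and right side respectively is an assumption; they are ignored either way.

```python
def stream_stream_join(events, how="inner"):
    """events: (time, side, value) with side 'L' or 'R'. Returns {time: [pairs]}."""
    left_seen, right_seen = [], []
    outputs = {}
    for time, side, value in events:
        if value is None:
            continue  # NULL records are ignored and trigger no output
        if side == "L":
            left_seen.append(value)
            pairs = [[value, r] for r in right_seen]
            if not pairs and how in ("left", "outer"):
                pairs = [[value, None]]  # no match yet on the right side
        else:
            right_seen.append(value)
            pairs = [[l, value] for l in left_seen]
            if not pairs and how == "outer":
                pairs = [[None, value]]  # no match yet on the left side
        if pairs:
            outputs[time] = pairs
    return outputs


events = [(1, "L", None), (2, "L", "A"), (3, "R", "a"), (4, "L", "B"),
          (5, "R", "b"), (6, "R", None), (7, "L", "C"), (8, "R", "c")]
for how in ("inner", "left", "outer"):
    print(how, stream_stream_join(events, how))
```

Each output row is triggered by exactly one input record, which is why the fan-out grows as more records accumulate inside the window.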
17. 17
Why?
● Relate 2 evolving sets of state
● Hotel rooms -> Room Rates
Table-Table
18. 18
How?
● Equi-join on the key from each side
● Co-partitioning
● Table-Table joins are NOT time-windowed
● Each input record triggers an output for at most one match from the other side
● Input records with NULL key are ignored and don’t trigger an output
● Input records with NULL value are tombstones: they don’t trigger the join, and a tombstone (null) is output only if a result for that key already exists
Table-Table
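Co-partitioning means that a given key lands on the same partition number on both sides of the join, which in practice requires the same partition count and the same partitioning scheme for both inputs. A minimal sketch of why the partition count matters; `partition_for` is a hypothetical stand-in that uses CRC32, whereas Kafka's default partitioner uses a different hash (murmur2), and the hotel keys are invented for illustration.

```python
import zlib


def partition_for(key, num_partitions):
    """Stand-in partitioner: hash the key, then take it modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"hotel-{i}" for i in range(100)]

# Same partition count on both sides: every key meets its counterpart.
same_count = all(partition_for(k, 6) == partition_for(k, 6) for k in keys)

# Different partition counts: some keys end up on different partition numbers,
# so the task reading partition N of one input never sees the matching record.
mismatched = [k for k in keys if partition_for(k, 4) != partition_for(k, 6)]

print(same_count, len(mismatched))
```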
20. 20
CREATE TABLE priced_rooms AS
SELECT h.hotel_id, r.room_rate FROM hotels h
JOIN rates r ON h.hotel_id = r.hotel_id;
21. 21
Table-Table Join - What
Input values in uppercase arrive on the left table, lowercase on the right table.
Time | Input        | INNER JOIN | LEFT JOIN | OUTER JOIN
1    | null         |            |           |
2    | A            |            | [A, null] | [A, null]
3    | a            | [A, a]     | [A, a]    | [A, a]
4    | B            | [B, a]     | [B, a]    | [B, a]
5    | b            | [B, b]     | [B, b]    | [B, b]
6    | null (left)  | null       | null      | [null, b]
7    | null (right) |            |           | null
8    | C            |            | [C, null] | [C, null]
9    | c            | [C, c]     | [C, c]    | [C, c]
Condition: all incoming records have the same key
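A small simulation of the table-table rows, assuming a single key. `table_table_join` is a hypothetical helper: it keeps the latest value from each side, recomputes the join result on every update, and emits a tombstone (None) only when a previously emitted result disappears. Treating the null at time 1 as a right-side tombstone is an assumption; it produces no output on either side.

```python
def table_table_join(events, how="inner"):
    """events: (time, side, value) with side 'L' or 'R'. Returns {time: result}."""
    left = right = None   # latest value per side; None means "no row"
    previous = None       # join result currently held in the output table
    outputs = {}
    for time, side, value in events:
        if side == "L":
            left = value
        else:
            right = value
        if how == "inner":
            present = left is not None and right is not None
        elif how == "left":
            present = left is not None
        else:  # outer
            present = left is not None or right is not None
        current = [left, right] if present else None
        # Emit a new result, or a tombstone if an existing result went away.
        if current is not None or previous is not None:
            outputs[time] = current
        previous = current
    return outputs


events = [(1, "R", None), (2, "L", "A"), (3, "R", "a"), (4, "L", "B"),
          (5, "R", "b"), (6, "L", None), (7, "R", None), (8, "L", "C"),
          (9, "R", "c")]
for how in ("inner", "left", "outer"):
    print(how, table_table_join(events, how))
```

A time that is missing from the result means no output at all, while a time mapped to None is an output tombstone.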
23. 23
How?
● Equi-join on the key from each side
● Co-partitioning
● Stream-Table joins are NOT time-windowed
● Each input record from stream side triggers at most one output
● Input stream records with NULL key or value are ignored and don’t trigger an output
● Input table records with NULL value are tombstones and don’t trigger an output
Stream-Table
25. 25
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
26. 26
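Stream-Table Join - What
Input values in uppercase arrive on the left stream, lowercase on the right table.
Time | Input | INNER JOIN | LEFT JOIN
1    | null  |            |
2    | A     |            | [A, null]
3    | a     |            |
4    | B     | [B, a]     | [B, a]
5    | b     |            |
6    | null  |            |
7    | null  |            |
8    | C     |            | [C, null]
Condition: all incoming records have the same key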
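The stream-table rows can be sketched the same way, again for a single key. `stream_table_join` is a hypothetical helper: only stream records trigger output, and they look up whatever the table holds at that moment. Which of the nulls at times 1, 6 and 7 belongs to the stream and which to the table is an assumption; the [C, null] result at time 8 only requires that one of the nulls at 6 or 7 was a table tombstone.

```python
def stream_table_join(events, how="inner"):
    """events: (time, side, value) with side 'S' (stream) or 'T' (table)."""
    table_value = None  # current table row for the key; None means "no row"
    outputs = {}
    for time, side, value in events:
        if side == "T":
            table_value = value  # update or tombstone; table changes emit nothing
            continue
        if value is None:
            continue             # NULL stream records are ignored
        if table_value is not None:
            outputs[time] = [value, table_value]
        elif how == "left":
            outputs[time] = [value, None]
    return outputs


events = [(1, "S", None), (2, "S", "A"), (3, "T", "a"), (4, "S", "B"),
          (5, "T", "b"), (6, "S", None), (7, "T", None), (8, "S", "C")]
for how in ("inner", "left"):
    print(how, stream_table_join(events, how))
```

Note that the table update at time 5 never produces a [B, b] row: a stream record is enriched once, with the table state at the time it arrives.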