How do Kafka Streams and ksqlDB reason about time, how does it affect my application, and how do I take advantage of it? In this talk, we explore the "time engine" of Kafka Streams and ksqlDB and answer important questions about how you can work with time. What is the difference between sliding, time, and session windows, and how do they relate to time? What timestamps are computed for result records? What temporal semantics are offered in joins? And why does the suppress() operator not emit data? Besides answering those questions, we will share tips and tricks on how you can "bend" time to your needs and when mixing event-time and processing-time semantics makes sense. Six months ago, the question "What's the time? …and Why?" was asked and partly answered at Kafka Summit in San Francisco, focusing on writing data, data storage and retention, as well as consuming data. In this talk, we continue our journey and delve into data stream processing with Kafka Streams and ksqlDB, which both offer rich time semantics. At the end of the talk, you will be well prepared to process past, present, and future data with Kafka Streams and ksqlDB.
4. Recap: Time 101
4@MatthiasJSax
Event Time
• When an event happened (embedded in the message/record)
• Ensures deterministic processing
• Used to express processing semantics, i.e., impacts the result
Processing Time (aka Wall-clock Time)
• When an event/message/record is processed
• Used for non-functional properties
  • Timeouts
  • Data rate control
  • Periodic actions
• Should not impact the result: otherwise, non-deterministic
6. Tracking Time
Stream-time: the maximum observed input event timestamp (aka ROWTIME)
• Monotonically increasing
• Allows identifying out-of-order and late input
• Tracked per task / used instead of watermarks
[Figure: input records with event timestamps 14:01 (twice), 14:02, 14:03, 14:08, and 14:11; stream-time advances from 14:01 to 14:03, 14:08, and 14:11.]
7. Yeah, well, history is gonna change
Input records with descending event timestamp are considered out-of-order
• Out-of-order if event-time < stream-time
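[Figure: input records with event timestamps 14:01 (twice), 14:02, 14:03, 14:08, and 14:11; stream-time advances from 14:01 to 14:03, 14:08, and 14:11; two records whose timestamps are below the current stream-time are marked out-of-order.]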
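The two rules above (stream-time is the maximum observed timestamp; a record is out-of-order if event-time < stream-time) can be sketched in a few lines. This is an illustrative model only, not Kafka Streams code: the class name `StreamTimeTracker` is hypothetical, and the arrival order used in `main` is an assumption chosen for the example.

```java
import java.time.LocalTime;

/** Illustrative model of stream-time tracking; not Kafka Streams code. */
final class StreamTimeTracker {
    private LocalTime streamTime; // null until the first record arrives

    /** Observes one record; returns true if event-time < stream-time (out-of-order). */
    boolean observe(LocalTime eventTime) {
        boolean outOfOrder = streamTime != null && eventTime.isBefore(streamTime);
        if (streamTime == null || eventTime.isAfter(streamTime)) {
            streamTime = eventTime; // stream-time only ever moves forward
        }
        return outOfOrder;
    }

    LocalTime streamTime() {
        return streamTime;
    }

    public static void main(String[] args) {
        StreamTimeTracker tracker = new StreamTimeTracker();
        // Assumed arrival order, for illustration only.
        String[] arrivals = {"14:01", "14:03", "14:01", "14:08", "14:02", "14:11"};
        for (String ts : arrivals) {
            boolean outOfOrder = tracker.observe(LocalTime.parse(ts));
            System.out.println(ts + " -> stream-time " + tracker.streamTime()
                    + (outOfOrder ? " (out-of-order)" : ""));
        }
    }
}
```

Note that an out-of-order record never moves stream-time backwards; it is only flagged.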
9. You are not thinking fourth-dimensionally
Topic-A, Partition 0: 14:11…
Topic-B, Partition 0: empty
Pause processing and poll() for new data.
Unblock when timeout max.task.idle.ms hits.
[Figure: records shown with timestamps 14:01, 14:02, 14:04, 14:03, 14:05, and 14:08.]
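The pause/unblock behavior above can be modeled as a small decision function. This is a simplified sketch under stated assumptions, not Kafka Streams code: it assumes the task prefers the buffered record with the smallest timestamp and waits while any input partition is empty, until `max.task.idle.ms` has elapsed. The class `PartitionChooser` and its parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.OptionalInt;

/** Simplified model of choosing a task's next input partition; not Kafka Streams code. */
final class PartitionChooser {
    /**
     * @param buffers       per-partition queues of buffered record timestamps
     * @param idleMs        how long the task has already waited on empty partitions
     * @param maxTaskIdleMs the configured max.task.idle.ms
     * @return index of the partition to read next, or empty if the task should keep waiting
     */
    static OptionalInt next(List<Deque<Long>> buffers, long idleMs, long maxTaskIdleMs) {
        boolean anyEmpty = false;
        int best = -1;
        for (int i = 0; i < buffers.size(); i++) {
            Long head = buffers.get(i).peekFirst();
            if (head == null) {
                anyEmpty = true;
            } else if (best < 0 || head < buffers.get(best).peekFirst()) {
                best = i; // smallest head timestamp seen so far
            }
        }
        if (best < 0 || (anyEmpty && idleMs < maxTaskIdleMs)) {
            return OptionalInt.empty(); // pause and poll() for new data
        }
        return OptionalInt.of(best);
    }

    public static void main(String[] args) {
        Deque<Long> topicA = new ArrayDeque<>(List.of(11L));
        Deque<Long> topicB = new ArrayDeque<>(); // empty
        System.out.println(next(List.of(topicA, topicB), 0L, 100L));   // still waiting
        System.out.println(next(List.of(topicA, topicB), 100L, 100L)); // timeout hit
    }
}
```

Waiting trades latency for ordering: a record arriving on the empty partition during the wait may carry a smaller timestamp than the one already buffered.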
11. Time Windows
Tumbling Windows
• fixed size / non-overlapping / grouped (i.e., GROUP BY)
[Figure: tumbling windows with boundaries at 14:00, 14:05, 14:10, and 14:15.]
No variable size window support yet:
• Weeks, months, years
• No out-of-the-box time zone support
• https://github.com/confluentinc/kafka-streams-examples/blob/5.5.0-post/src/test/java/io/confluent/examples/streams/window/DailyTimeWindows.java
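To see why fixed-size windows carry no time zone notion, it helps to compute a window start by hand. The sketch below assumes windows are aligned to the epoch (multiples of the window size since 1970-01-01T00:00Z); the helper `TumblingWindowSketch.windowStart` is hypothetical and only models the arithmetic.

```java
import java.time.Duration;
import java.time.Instant;

/** Arithmetic model of epoch-aligned tumbling windows; not Kafka Streams code. */
final class TumblingWindowSketch {
    /** Start of the tumbling window (inclusive) that contains the given timestamp. */
    static Instant windowStart(Instant timestamp, Duration size) {
        long ts = timestamp.toEpochMilli();
        long sizeMs = size.toMillis();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2020-08-24T14:03:00Z");
        System.out.println(windowStart(ts, Duration.ofMinutes(5))); // 5-minute window
        System.out.println(windowStart(ts, Duration.ofDays(1)));    // "daily" window, aligned to UTC
    }
}
```

Under this assumption a one-day window always starts at 00:00 UTC, which is why calendar-based or time-zone-aware windows need a custom implementation such as the linked example.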
12. Time Windows
Hopping Windows
• fixed size / overlapping / grouped (i.e., GROUP BY)
• Different to a sliding window!
[Figure: five rows of window boundaries, each row shifted by one minute: 14:00/14:05/14:10/14:15, 14:01/14:06/14:11/14:16, 14:02/14:07/14:12/14:17, 14:03/14:08/14:13/14:18, 14:04/14:09/14:14/14:19.]
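Because hopping windows overlap, a single record belongs to several windows. The following sketch enumerates them, assuming epoch-aligned windows of the form [start, start + size) and a size of 5 minutes with a 1-minute advance (values read off the figure, so treat them as an assumption). `HoppingWindowSketch` is a hypothetical helper.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Arithmetic model of hopping windows; not Kafka Streams code. */
final class HoppingWindowSketch {
    /** Start times of all windows [start, start + size) that contain the timestamp. */
    static List<Instant> windowStartsFor(Instant timestamp, Duration size, Duration advance) {
        long ts = timestamp.toEpochMilli();
        long sizeMs = size.toMillis();
        long advanceMs = advance.toMillis();
        // Earliest aligned window start that still covers the timestamp.
        long start = Math.max(0L, ts - sizeMs + advanceMs) / advanceMs * advanceMs;
        List<Instant> starts = new ArrayList<>();
        for (; start <= ts; start += advanceMs) {
            starts.add(Instant.ofEpochMilli(start));
        }
        return starts;
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2020-08-24T14:03:00Z");
        System.out.println(windowStartsFor(ts, Duration.ofMinutes(5), Duration.ofMinutes(1)));
    }
}
```

With these values, every record updates size / advance = 5 windows, which is the cost of overlapping windows compared to tumbling ones.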
13. Sliding Windows
Different use case: aggregate the data of the last (e.g.) 10 minutes
• Window boundaries are data dependent and unknown upfront (cf. KIP-450)
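[Figure: input records at 14:03, 14:07, 14:12, 14:19, and 14:26, with the resulting 10-minute sliding windows: 13:53 to 14:03, 13:57 to 14:07, 14:02 to 14:12, 14:04 to 14:14, 14:08 to 14:18, 14:09 to 14:19, 14:13 to 14:23, 14:16 to 14:26, 14:20 to 14:30.]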
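The data-dependent boundaries can be reproduced with a small model: each record creates a window that ends at its timestamp, plus a window that starts right after it if that window contains another record. The sketch assumes one-minute granularity to match the figure (the actual granularity in Kafka Streams is finer) and timestamps within a single day; `SlidingWindowSketch` is a hypothetical helper.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Minute-granularity model of data-dependent sliding windows; not Kafka Streams code. */
final class SlidingWindowSketch {
    /** Returns all non-empty windows as "start-end" strings, ordered by start time. */
    static List<String> windows(List<LocalTime> records, long sizeMinutes) {
        TreeSet<LocalTime> starts = new TreeSet<>();
        for (LocalTime record : records) {
            // Window that ends at the record's timestamp.
            starts.add(record.minusMinutes(sizeMinutes));
            // Window that starts right after the record, kept only if another record falls into it.
            LocalTime next = record.plusMinutes(1);
            LocalTime nextEnd = next.plusMinutes(sizeMinutes);
            boolean nonEmpty = records.stream()
                    .anyMatch(r -> !r.isBefore(next) && !r.isAfter(nextEnd));
            if (nonEmpty) {
                starts.add(next);
            }
        }
        List<String> result = new ArrayList<>();
        for (LocalTime start : starts) {
            result.add(start + "-" + start.plusMinutes(sizeMinutes));
        }
        return result;
    }

    public static void main(String[] args) {
        List<LocalTime> records = List.of(
                LocalTime.parse("14:03"), LocalTime.parse("14:07"), LocalTime.parse("14:12"),
                LocalTime.parse("14:19"), LocalTime.parse("14:26"));
        windows(records, 10).forEach(System.out::println);
    }
}
```

For the five records of the figure this yields the same nine windows; no window starts after 14:26 because no later record exists yet.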
14. When we are processing, we don’t need watermarks
Grace period: defines a cut-off for out-of-order records that are (too) late
• Grace period is defined per operator
• Late if stream-time - event-time > grace period
• Late data is ignored and not processed by the operator
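[Figure: input records with event timestamps 14:01 (twice), 14:02, 14:03, 14:08, and 14:11; stream-time advances from 14:01 to 14:03, 14:08, and 14:11. With grace := 5min, an out-of-order record with a delay of 6min is late.]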
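The lateness rule (late if stream-time - event-time > grace period) translates directly into code. A minimal sketch with a hypothetical helper `GracePeriodSketch.isLate`; the concrete timestamps in `main` are assumptions picked to produce the 6-minute delay from the figure.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Direct translation of the lateness rule; not Kafka Streams code. */
final class GracePeriodSketch {
    /** A record is late if stream-time - event-time > grace period. */
    static boolean isLate(LocalTime streamTime, LocalTime eventTime, Duration grace) {
        Duration delay = Duration.between(eventTime, streamTime);
        return delay.compareTo(grace) > 0;
    }

    public static void main(String[] args) {
        Duration grace = Duration.ofMinutes(5);
        // Delay of 2 minutes: out-of-order, but still within the grace period.
        System.out.println(isLate(LocalTime.parse("14:03"), LocalTime.parse("14:01"), grace));
        // Delay of 6 minutes: late, so the operator would ignore the record.
        System.out.println(isLate(LocalTime.parse("14:08"), LocalTime.parse("14:02"), grace));
    }
}
```

Every late record is out-of-order, but not every out-of-order record is late: the grace period decides which of them are still processed.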
15. Retention Time
How long to store data in a (windowed) table.
TimeWindows.of(Duration.ofMinutes(5L)).grace(Duration.ofMinutes(1L))
Materialized.as(…).withRetention(Duration.ofHours(1L))
WINDOW TUMBLING(SIZE 5 MINUTES, GRACE PERIOD 1 MINUTE, RETENTION TIME 1 HOUR)
[Figure: timeline along stream-time for a window with SIZE 5 MINUTES and GRACE PERIOD 1 MINUTE: windowStart @14:00, windowEnd @14:05, window close @14:06; retention (1 hour) shown from 14:05 to 15:05.]
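The window close time in the figure follows from simple arithmetic: window close = windowStart + size + grace. The sketch below reproduces the 14:06 value. Treating the window as closed as soon as stream-time reaches that instant is an assumption of this sketch, and `WindowCloseSketch` is a hypothetical helper.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Arithmetic for window end and window close; not Kafka Streams code. */
final class WindowCloseSketch {
    /** Window close = windowStart + size + grace period. */
    static LocalTime closeTime(LocalTime windowStart, Duration size, Duration grace) {
        return windowStart.plus(size).plus(grace);
    }

    /** Assumption: the window counts as closed once stream-time reaches the close time. */
    static boolean isClosed(LocalTime streamTime, LocalTime windowStart, Duration size, Duration grace) {
        return !streamTime.isBefore(closeTime(windowStart, size, grace));
    }

    public static void main(String[] args) {
        LocalTime start = LocalTime.parse("14:00");
        Duration size = Duration.ofMinutes(5);
        Duration grace = Duration.ofMinutes(1);
        System.out.println("window close @" + closeTime(start, size, grace));
        System.out.println(isClosed(LocalTime.parse("14:05"), start, size, grace));
    }
}
```

Retention is a separate setting: it only controls how long the window stays in the store, so it should be at least as long as size plus grace period.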
16. If my calculations are correct…
Table is continuously updated, but when to emit data to the result stream?
• Non-deterministic via caching (default)
  • Output data rate reduction (non-functional)
• Deterministic rate control via suppress() | EMIT FINAL
  • Periodic or final (for window operations)
  • Stream-time based!
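[Figure: a table with rows Marty (14:01), Doc (14:26), Einstein (14:05), Biff (14:23), Elaine (14:15), and George (14:23); input records 14:25… and 14:32…; stream-time: 14:26; the emit point to the result stream is marked "?".]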
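Because final emission is stream-time based, a window result only appears once a newer record pushes stream-time past the window's close time. The following sketch models that for a windowed count. It is not the suppress() implementation: the class `FinalResultBuffer` is hypothetical, and timestamps, size, and grace share one arbitrary time unit (minutes in the example).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Model of "emit final" for a tumbling-window count; not the suppress() implementation. */
final class FinalResultBuffer {
    private final long size;
    private final long grace;
    private final TreeMap<Long, Long> counts = new TreeMap<>(); // windowStart -> count
    private long streamTime = Long.MIN_VALUE;

    FinalResultBuffer(long size, long grace) {
        this.size = size;
        this.grace = grace;
    }

    /** Processes one record; returns "windowStart=count" for every window that closed. */
    List<String> process(long timestamp) {
        streamTime = Math.max(streamTime, timestamp);
        long windowStart = timestamp - Math.floorMod(timestamp, size);
        if (windowStart + size + grace > streamTime) {
            counts.merge(windowStart, 1L, Long::sum); // window still open: update it
        } // otherwise the record is late and ignored
        List<String> emitted = new ArrayList<>();
        while (!counts.isEmpty() && counts.firstKey() + size + grace <= streamTime) {
            Map.Entry<Long, Long> closed = counts.pollFirstEntry();
            emitted.add(closed.getKey() + "=" + closed.getValue());
        }
        return emitted;
    }

    public static void main(String[] args) {
        FinalResultBuffer buffer = new FinalResultBuffer(5, 1);
        for (long ts : new long[] {1, 3, 5, 6, 2, 11}) {
            System.out.println("ts=" + ts + " emits " + buffer.process(ts));
        }
    }
}
```

In this model nothing is emitted while input stops, no matter how much wall-clock time passes, which is one way to read the question of why suppress() appears not to emit data.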
18. Stream-Stream Join
Streams are conceptually unbounded
• Limited join scope via a sliding time window
leftStream.join(rightStream, JoinWindows.of(Duration.ofMinutes(5L)));
SELECT * FROM leftStream AS l JOIN rightStream AS r WITHIN 5 MINUTES ON l.id = r.id;
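[Figure: stream-stream join example]
Left stream: 1 @14:04, 2 @14:16, 3 @14:08
Right stream: A @14:01, B @14:11, C @14:23
Result: 1⨝A @14:04, 2⨝B @14:16, 3⨝B @14:11
Result timestamp: max(l.ts; r.ts)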
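The join example can be recomputed with a brute-force model: two records join if their timestamps differ by at most the join window, and the result carries the larger of the two timestamps. The sketch assumes all records share the same join key and ignores arrival order; `WindowedJoinSketch` is a hypothetical helper, and result labels are written as plain concatenations such as "1A".

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Brute-force model of a windowed stream-stream join; not Kafka Streams code. */
final class WindowedJoinSketch {
    /** Joins every left/right pair whose timestamps differ by at most the window. */
    static List<String> join(Map<String, LocalTime> left, Map<String, LocalTime> right, Duration window) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, LocalTime> l : left.entrySet()) {
            for (Map.Entry<String, LocalTime> r : right.entrySet()) {
                Duration diff = Duration.between(l.getValue(), r.getValue()).abs();
                if (diff.compareTo(window) <= 0) {
                    // Result timestamp: max(l.ts; r.ts)
                    LocalTime ts = l.getValue().isAfter(r.getValue()) ? l.getValue() : r.getValue();
                    results.add(l.getKey() + r.getKey() + " @" + ts);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, LocalTime> left = new LinkedHashMap<>();
        left.put("1", LocalTime.parse("14:04"));
        left.put("2", LocalTime.parse("14:16"));
        left.put("3", LocalTime.parse("14:08"));
        Map<String, LocalTime> right = new LinkedHashMap<>();
        right.put("A", LocalTime.parse("14:01"));
        right.put("B", LocalTime.parse("14:11"));
        right.put("C", LocalTime.parse("14:23"));
        System.out.println(join(left, right, Duration.ofMinutes(5)));
    }
}
```

Record C finds no partner because its closest left record (2 @14:16) is 7 minutes away, outside the 5-minute window.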
28. Time Traveling is just too Dangerous
Is it? Well, mind compaction!
[Figure: stream records at 14:02, 14:04, 14:06, 14:07, and 14:10 alongside table updates c @14:05, b @14:08, and a @14:11, with older versions a @14:01 and b @14:03; the table contents are shown at 14:05, 14:08, and 14:11.]
29. You Need to Know your History
[Figure: Stream vs. Table Changelog. Labels: append new data (tail); truncation (retention time); compaction lag (preserves full history); compacted head (old data).]
30. You Need to Know your History
[Figure: Stream vs. Table Changelog. Labels: truncation (retention time); "Lost History"; fully compacted; append new data (tail).]
31. You are the doc, Doc
Wrapping up
• Event time vs processing time
• Stream-time, grace period, and retention time (no watermarks)
• Tumbling/hopping vs sliding windows
• Join:
  • Temporal semantics
  • Stream-stream and stream-table
• Tables and time traveling