Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
1
Streams and Tables: Two Sides of the Same Coin
Matthias J. Sax12, Guozhang Wang1, Matthias Weidlich2, Johann-Christoph F...
2
Count Clicks per Page
BIRTE
VLDB
VLDB
Distributed Data Source
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)...
3
Count Clicks per Page
BIRTE
VLDB
VLDB
Distributed Data Source
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)
VLDB(4)
VLDB(7)
BIRTE(3)...
4
Ordering: Common Approaches
Input Data Stream
BIRTE(3) VLDB(4)VLDB(7)
BIRTE(3)VLDB(4)VLDB(7)
buffer and re-order
SPS
SPS...
5
Cost
Correctness/
Completeness
Latency
Buffering and Reordering
- Ref: CQL1, Trill2
Punctuations/Watermarks
- Ref: Li et...
6
Problem Statement
How to design a model
• for the evaluation of expressive operators
• with low latency over potentially...
7
High-Level Proposal
• To reduce latency, we need to avoid any processing delays
• Process data in arrival order
• Emit c...
8
Data Model
• Offset: physical order (arrival/processing order)
• Timestamp: logical order (event-time)
• Key: optional f...
9
Stream Processing Operators
• Stateless, order agnostic
• filter, projection, flatMap
• No special handling necessary
• ...
10
Data Stream Aggregation
• Model output of (windowed) aggregations as table
• State is not internal but first-class citi...
11
Example: Count Clicks per Page
url count
record stream changelog stream
BIRTE
1
<BIRTE,1>
BIRTE 1
countTable = stream.g...
12
Example: Count Clicks per Page
url count
record stream changelog stream
1
<BIRTE,1>
BIRTE 1
VLDB
1VLDB
<VLDB,1>
countTa...
13
Example: Count Clicks per Page
url count
record stream changelog stream
1
<BIRTE,1>
BIRTE 1
VLDB
2VLDB
<VLDB,2>
countTa...
14
countTable2 = countTable.filter(url=‘VLDB’).toTable()
Example: Processing a Changelog Stream
url count
changelog stream...
15
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartT...
16
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartT...
17
Duality of Streams and Tables
18
Cost
Correctness/
Completeness
Latency
Buffering and Reordering
- Ref: CQL1, Trill2
Punctuations/Watermarks
- Ref: Li e...
19
Stream-Table Transformations
See the paper for details…
20
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
StreamsBuilder builder = new Streams...
21
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
CREA...
22
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
• Wi...
23
Summary
• Suggest the Dual-Streaming-Model
• Handles out-of-order data within the processing model
• Optimized for low ...
24
Thank You
We are hiring!
25
References
[1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries
over Strea...
26
Data Streams Types
27
Evolving Table
window ID count
1<BIRTE,0>
1<VLDB,0>
window ID count
1<BIRTE,0>
window ID count
1<BIRTE,0>
1<VLDB,0>
<VL...
28
Evolving Table
window ID count
<BIRTE,0>
1<VLDB,0>
window ID count
<BIRTE,0>
window ID count
<BIRTE,0>
1<VLDB,0>
<VLDB,...
29
Table-Table Join
30
Stream-Stream Join
• Sliding Window Join, i.e., band join
• Window size specifies additional timestamp based join predi...
31
Stream-Table Join
• Temporal table “lookup” join
• For each stream record, lookup for a matching table record
• Join co...
32
Stream-Table Join
Upcoming SlideShare
Loading in …5
×

Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

1,022 views

Published on

Presentation given at BIRTE 2018
Twelfth International Workshop on Real-Time Business Intelligence and Analytics
27 August, 2018, Rio de Janeiro

Speakers:
Matthias Sax, Confluent Inc.
Guozhang Wang, Confluent Inc.
Matthias Weidlich, Humboldt-Universität zu Berlin
Johann-Christoph Freytag, Humboldt-Universität zu Berlin

Published in: Technology

Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

  1. 1. 1 Streams and Tables: Two Sides of the Same Coin Matthias J. Sax12, Guozhang Wang1, Matthias Weidlich2, Johann-Christoph Freytag2 1Confluent Inc., Palo Alto (CA) matthias@confluent.io guozhang@confluent.io 2Humboldt-Universität zu Berlin mjsax@informatik.hu-berlin.de matthias.weidlich@hu-berlin.de freytag@informatik.hu-berlin.de @MatthiasJSax Twelfth International Workshop on Real-Time Business Intelligence and Analytics 27 August, 2018, Rio de Janeiro
  2. 2. 2 Count Clicks per Page BIRTE VLDB VLDB Distributed Data Source BIRTE(3) VLDB(4) VLDB(7) BIRTE(3) VLDB(4) VLDB(7) BIRTE(3)VLDB(4)VLDB(7) Input Data Stream
  3. 3. 3 Count Clicks per Page BIRTE VLDB VLDB Distributed Data Source BIRTE(3) VLDB(4) VLDB(7) BIRTE(3) VLDB(4) VLDB(7) BIRTE(3) VLDB(4)VLDB(7) Input Data Stream Arrival order non-deterministic Even-time semantics implies out-of-order data
  4. 4. 4 Ordering: Common Approaches Input Data Stream BIRTE(3) VLDB(4)VLDB(7) BIRTE(3)VLDB(4)VLDB(7) buffer and re-order SPS SPS punctuations/watermarks BIRTE(3) VLDB(4)VLDB(7) time=3
  5. 5. 5 Cost Correctness/ Completeness Latency Buffering and Reordering - Ref: CQL1, Trill2 Punctuations/Watermarks - Ref: Li et al.3, Krishnamurthy et al.4 Design Space
  6. 6. 6 Problem Statement How to design a model • for the evaluation of expressive operators • with low latency over potentially unordered data streams • that can be implemented by mean of distributed online algorithms?
  7. 7. 7 High-Level Proposal • To reduce latency, we need to avoid any processing delays • Process data in arrival order • Emit current result immediately • Law et al.5: cannot handle out-of-order data • To handle out-of-order data, we need to be able to update/refine previous results • Data streams must allow for update records • Update/delete records by Babu and Widom6: no operator semantics defined • Borealis:7 replays data stream after “updating/reordering”; very high cost
  8. 8. 8 Data Model • Offset: physical order (arrival/processing order) • Timestamp: logical order (event-time) • Key: optional for grouping • Value: payload
  9. 9. 9 Stream Processing Operators • Stateless, order agnostic • filter, projection, flatMap • No special handling necessary • Stateful, order sensitive • aggregation, joins, windowing • Need to handle out-of-order data
  10. 10. 10 Data Stream Aggregation • Model output of (windowed) aggregations as table • State is not internal but first-class citizen • Update stateful operator continuously • Emit changelog stream to downstream operators • Streams, Table, and Changelogs • Define operator semantics over changelogs and updating tables • Temporal operator semantics
  11. 11. 11 Example: Count Clicks per Page url count record stream changelog stream BIRTE 1 <BIRTE,1> BIRTE 1 countTable = stream.groupBy(r->url).count()
  12. 12. 12 Example: Count Clicks per Page url count record stream changelog stream 1 <BIRTE,1> BIRTE 1 VLDB 1VLDB <VLDB,1> countTable = stream.groupBy(r->url).count() VLDB
  13. 13. 13 Example: Count Clicks per Page url count record stream changelog stream 1 <BIRTE,1> BIRTE 1 VLDB 2VLDB <VLDB,2> countTable = stream.groupBy(r->url).count() VLDB <VLDB,1>
  14. 14. 14 countTable2 = countTable.filter(url=‘VLDB’).toTable() Example: Processing a Changelog Stream url count changelog stream 1 <BIRTE,1> BIRTE VLDB <VLDB,1> 2 <VLDB,2> url count 2VLDB
  15. 15. 15 countTable = stream.groupBy(r->url).windowedBy(5sec).count() windowID = <groupingKey,windowStartTimestamp> windowStartTimestamp = recordTimestamp / windowSize Example: Windowed Count window ID count record stream changelog stream BIRTE(3) 1 <<BIRTE,0>,1> <BIRTE,0> VLDB(7) 1<VLDB,0> <<VLDB,0>,1>VLDB(4) <VLDB,5> <<VLDB,5>,1>1
  16. 16. 16 countTable = stream.groupBy(r->url).windowedBy(5sec).count() windowID = <groupingKey,windowStartTimestamp> windowStartTimestamp = recordTimestamp / windowSize Example: Out-of-Order Data window ID count record stream changelog stream BIRTE(1) 2<BIRTE,0> 1<VLDB,0> <VLDB,5> 1 <<BIRTE,0>,2>
  17. 17. 17 Duality of Streams and Tables
  18. 18. 18 Cost Correctness/ Completeness Latency Buffering and Reordering - Ref: CQL1, Trill2 Punctuations/Watermarks - Ref: Li et al.3, Krishnamurthy et al.4 Design Space Dual Streaming Model - continuous updates / changelogs - decouple latency from correctness - trade-off latency and cost - trade-off cost and completeness (retention time)
  19. 19. 19 Stream-Table Transformations See the paper for details…
  20. 20. 20 Implementation • Implemented in Apache Kafka (v0.10) • Kafka Streams / Streams API StreamsBuilder builder = new StreamsBuilder(); KStream<String, String> textLines = builder.stream("TextLinesTopic"); KTable<String, Long> wordCounts = textLines .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("W+"))) .groupBy((key, word) -> word) .windowedBy(TimeWindows.of(5_000L)) .count(); wordCounts.toStream().to("WordsWithCountsTopic"); KafkaStreams streams = new KafkaStreams(builder.build(), props); streams.start();
  21. 21. 21 Implementation • Implemented in Apache Kafka (v0.10) • Kafka Streams / Streams API • Leveraged in Confluent’s KSQL CREATE TABLE click_count_per_url AS SELECT url, count(*) FROM click_stream WINDOW TUMBLING (SIZE 1 MINUTE) WHERE url LIKE = '%confluent%' OR url LIKE ‘%hu-berlin%’ GROUP BY url;
  22. 22. 22 Implementation • Implemented in Apache Kafka (v0.10) • Kafka Streams / Streams API • Leveraged in Confluent’s KSQL • Widely adopted in industry
  23. 23. 23 Summary • Suggest the Dual-Streaming-Model • Handles out-of-order data within the processing model • Optimized for low latency • Streams and Tables are Dual • Allows to trade-off processing cost, latency, completeness • Adopted in industry via Kafka Streams and KSQL
  24. 24. 24 Thank You We are hiring!
  25. 25. 25 References [1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries over Streams and Relations. Database Programming Languages, 9th Int. WS. 1–19. [2] Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401–412. [3] Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams. Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311–322. [4] Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. Proc. of the 2010 ACM SIGMOD Int. Conf. on Management of Data. 1081–1092. [5] Yan-Nei Law, HaixunWang, and Carlo Zaniolo. 2004. Query Languages and Data Models for Database Sequences and Data Streams. Proc. of the 13th Int. Conf. on Very Large Data Bases. 492-503. [6] Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Records 30, 3 (2001), 109–120. [7] Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. CIDR, 2nd Biennial Conf. on Innovative Data Systems Research. 277–289.
  26. 26. 26 Data Streams Types
  27. 27. 27 Evolving Table window ID count 1<BIRTE,0> 1<VLDB,0> window ID count 1<BIRTE,0> window ID count 1<BIRTE,0> 1<VLDB,0> <VLDB,5> 1 record stream BIRTE(3)VLDB(7) VLDB(4)BIRTE(1) table v3 table v4 table v7
  28. 28. 28 Evolving Table window ID count <BIRTE,0> 1<VLDB,0> window ID count <BIRTE,0> window ID count <BIRTE,0> 1<VLDB,0> <VLDB,5> 1 record stream BIRTE(3)VLDB(7) VLDB(4)BIRTE(1) table v3 table v4 table v7 window ID count 1<BIRTE,0> table v1 2 2 2
  29. 29. 29 Table-Table Join
  30. 30. 30 Stream-Stream Join • Sliding Window Join, i.e., band join • Window size specifies additional timestamp based join predicate SELECT * FROM stream1, stream2 WHERE stream1.key = stream2.key AND stream1.ts – windowSize <= stream2.ts AND stream2.ts <= stream1.ts + windowSize windowSize stream1 stream2
  31. 31. 31 Stream-Table Join • Temporal table “lookup” join • For each stream record, lookup for a matching table record • Join condition: streamRecord.key == tableRecord.key • The join is temporal is the sense, that the “correct” table version must be use • i.e., youngest table version that is before the stream records timestamp
  32. 32. 32 Stream-Table Join

×