Aggregate Sharing for User-Defined Data Stream Windows

Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through aggregate sharing techniques such as Panes and Pairs, little to no work has been put into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all.
In this paper we present a technique to perform efficient aggregate sharing for data stream windows that are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves on and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders-of-magnitude reductions in aggregation costs compared to the state of the art.

Published in: Data & Analytics

  1. @CIKM16 Cutty: Aggregate Sharing for User-Defined Windows. Paris Carbone <parisc@kth.se, senorcarbone@apache.org>, Jonas Traub <jonas.traub@tu-berlin.de>, Asterios Katsifodimos <asterios.katsifodimos@tu-berlin.de>, Seif Haridi <haridi@kth.se>, Volker Markl <volker.markl@tu-berlin.de>. Presentation: Paris Carbone, PhD Candidate @ KTH Sweden, Committer @ Apache Flink
  2. 4 Reasons Not to Check Your Email During This Talk: 1. Windows are the backbone of data stream analysis. 2. We generalise the concept of data stream windows. 3. Our technique makes aggregations on general stream windows more efficient than ever. 4. We can multiplex and share aggregations of diverse types of sliding windows that run simultaneously.
  3. Outline • Partial Sliding Window Aggregation • Fundamental Limitations of Existing Approaches • Introducing User-Defined Windows (UDWs) • Multi-Query Aggregation of UDWs with Cutty • Performance Comparison
  4. Window Aggregation: stream discretization (fd) splits the stream 1 2 3 4 5 6 7 8 9 10 11 12 13 14 … into the overlapping windows [1 2 3 4 5], [3 4 5 6 7], [5 6 7 8 9].
  5. Window Aggregation: the aggregation function (fa) maps the windows [1 2 3 4 5], [3 4 5 6 7], [5 6 7 8 9] to the aggregates A1, A2, A3.
  6. Partial Aggregation, in three steps: 1. lift: record → (val, count); 2. combine: (val1 + val2, count1 + count2); 3. lower: (val, count) → val / count. Types involved: record type, partial type, aggr type. Example - AVG over the window 1 2 3 4 5: the records are lifted to (1,1), (2,1), (3,1), (4,1), (5,1); combining them one by one gives (3,2), (6,3), (10,4), (15,5); lower yields A1 = 3.
  7. Partial Aggregation • The number of combine invocations determines the computational complexity. • Commutativity and associativity are typically assumed.
  8. Redundancy in Sliding Window Aggregation: for the overlapping windows [1 2 3 4 5], [3 4 5 6 7], [5 6 7 8 9], … overlapping means redundant combine calls that we need to optimise.
  9. Optimise… which windows? Periodic windows: slicing (pre-compute non-overlapping partials). [Figure labels, window types: tumbling single-type periodic Punctuation Snapshot FCF/CF Lower-Bound Session multi-type ADWIN Delta-based FCA]
  10. Slicing: If a sliding window can be defined in terms of a fixed range and slide (periodic windows), the system can pre-aggregate consecutive, non-overlapping slices. Example - Count Window, range: 10, slide: 3, over records 1 to 19. Panes [1]: slice size gcd(range, slide). Pairs [2]: two alternating slice sizes, p2 = range % slide and p1 = slide - p2. Cutty (preview). References: 1. No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams, SIGMOD 2005. 2. On-the-Fly Sharing for Streamed Aggregation, SIGMOD 2006.
  11. Slicing - Observations • Computational complexity: update O(1), merge O(#partials) • Space complexity (#stored sliced partials): Panes $\lceil \mathrm{range} / \gcd(\mathrm{range}, \mathrm{slide}) \rceil$, which gives 10 partials for the previous example; Pairs $\lceil 2 \times \mathrm{range} / \mathrm{slide} \rceil$, which gives 7 partials • Similar space requirements when range is a multiple of slice • Pairs has been extended for multi-query aggregation sharing
  12. Optimise… which windows? Periodic windows: slicing (pre-compute non-overlapping partials). Non-Periodic windows: eager pre-aggregation (pre-compute overlapping partials for arbitrary aggregation lookups). [Figure labels, window types: tumbling single-type periodic Punctuation Snapshot FCF/CF Lower-Bound Session multi-type ADWIN Delta-based FCA]
  13. Eager Pre-Aggregation: When windowing cannot be expressed simply by a range and slide: eagerly pre-compute partial aggregates and update a binary tree, bottom-up. Example (sums): leaves 1 2 3 4 5 6 7 8; first level 3 7 11 15; second level 10 26; root 36; the stream continues with records 9, 10, … Figure annotations: n leaves (~ records) plus n pre-computed partials, about 2n stored values; tree height log n; arbitrary window lookups.
  14. Eager Pre-Aggregation Observations • Implementations: FlatFAT [1], B-Int [2] • High space complexity (#raw records… twice) • Most pre-aggregates are never used • Update + aggregation complexity: log(leaves) • Generic and suitable for ad-hoc queries • Potential for multi-query window pre-aggregation. References: 1. General incremental sliding-window aggregation, VLDB 15. 2. Resource sharing in continuous sliding-window aggregates, VLDB 04.
  15. Periodic windows: efficient slicing. Non-Periodic windows: generic, high-cost pre-aggregation. [Figure labels, window types: tumbling single-type periodic Punctuation Snapshot FCF/CF Lower-Bound Session multi-type ADWIN Delta-based FCA]
  16. Deterministic windows: efficient slicing. Non-Deterministic windows: generic, high-cost pre-aggregation. [Figure labels, window types: tumbling single-type periodic Punctuation Snapshot FCF/CF Lower-Bound Session multi-type ADWIN Delta-based FCA]
  17. Deterministic Windows: Intuition [Figure: price [in USD] over time [in min.]; legend: Slices, Higher order partials, Window, Window Begin, Threshold, Pre-Aggregate]
  18. Deterministic Windows: Intuition: only need to determine when new windows start [Figure: price [in USD] over time [in min.]; legend: Slices, Higher order partials, Window, Window Begin, Threshold, Pre-Aggregate]
  19. Deterministic Windows: Intuition: only need to determine when new windows start [Figure: price [in USD] over time [in min.]; legend: Slices, Higher order partials, Window, Window Begin, Threshold, Pre-Aggregate]
  20. User-Defined Windows • Deterministic: expressed as a UDF that assigns each record to a number of new or complete windows. • Non-Deterministic: expressed as a UDF that assigns a record to complete windows and a reference to their beginning. • Trivial templating of existing window types
  21. Cutty Concept: Slicing, Eager Pre-Aggregation [Figure: price [in USD] over time [in min.]; legend: Slices, Higher order partials, Window, Window Begin, Threshold, Pre-Aggregate]
  22. Cutty Overview • Exploits Deterministic Windows for the most efficient aggregation slicing yet. • Utilises eager pre-aggregation at a low memory cost over optimally sliced partials. • Supports both single and multi-query multiplexed execution out-of-the-box for efficient operator sharing. • Non-Deterministic Windows can still utilise eager pre-aggregation.
  23. Cutty Architecture [Figure]
  24. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 - Active Partial - - Stored Partials - - - - - Records Windows]
  25. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 1 Active Partial - - Stored Partials - - - - - Records Windows]
  26. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 Active Partial - - Stored Partials - - - - - Records Windows 3]
  27. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 3 Active Partial 3 3 Stored Partials - 3 - - - Records Windows]
  28. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 Active Partial 3 3 Stored Partials - 3 - - - Records Windows 3]
  29. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 Active Partial 3 3 Stored Partials - 3 - - - Records 15 Windows 12]
  30. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 Active Partial 15 15 Stored Partials - 3 12 - - Records 15 Windows 21 6]
  31. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 Active Partial 15 15 Stored Partials - 3 12 - - Records 15 Windows 21 13]
  32. Cutty - Demo [Animation step, figure text: 1 2 3 4 5 6 7 8 9 10 Active Partial 15 28 Stored Partials 13 3 12 13 - Records 15 Windows 21 33 8]
  33. Implementation • Apache Flink: UDW API (contributed to Apache Flink 0.9); Shared Aggregation Operator (experimental); Optimiser collocates parallel windows in operators • Aggregate Store: adaptation of FlatFAT [1]; circular resizable buffer strategies; non-eager strategy supported for experiments. Reference: 1. General incremental sliding-window aggregation, VLDB 15.
  34. Performance Analysis: Periodic Window Aggregation (DEBS12 dataset) [Plots: Number of Partials (×10^5) vs. Number of Queries for Cutty and Pairs/Pairs+; COUNT-RANGES and COUNT-SLIDES (axis: Number of Records); Throughput (records/sec) vs. Number of Queries for Cutty, Pairs+, RA; Total Reduce Calls (log scale) vs. Number of Queries for Cutty (eager), Pairs+, Cutty (lazy), Pairs, RA, Naive]
  35. Performance Analysis: Session Window Aggregation (DEBS12 dataset) [Plots: SESSION LENGTHS (axis: Number of Records); Total Reduce Calls (log scale) vs. Number of Queries for Cutty (UPD), Cutty (MERGE), RA (UPD), RA (MERGE); Max Allocation (#partials, log scale) vs. Number of Queries]
  36. No limits in multiplexing [Figure: distance [in km] over time [in min.]; legend: Slice, Window, Window Begin, Record]
  37. Summary • UDWs extend the potential of pre-aggregation to window classes beyond fixed periodic windows. • Cutty takes slicing a step further in computational efficiency and combines seamlessly with eager aggregation. • First work that addresses multi-query aggregation across diverse window types.
  38. Thank you! @SenorCarbone https://flink.apache.org/ https://github.com/apache/flink
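
The lift / combine / lower decomposition from the Partial Aggregation slide (slide 6) can be sketched in a few lines of Python. This is only an illustration of the AVG example on that slide; the function names follow the slide, while the tuple representation and the use of `functools.reduce` are assumptions made for the sketch.

```python
from functools import reduce

def lift(record):
    """Turn one raw record into a partial aggregate (val, count)."""
    return (record, 1)

def combine(a, b):
    """Merge two partials; commutative and associative for AVG."""
    return (a[0] + b[0], a[1] + b[1])

def lower(partial):
    """Turn a partial aggregate into the final AVG value."""
    val, count = partial
    return val / count

# The window used in the slide's example.
window = [1, 2, 3, 4, 5]
partial = reduce(combine, map(lift, window))
average = lower(partial)
```

Because `combine` works on partials rather than raw records, a partial computed once for a run of records can be reused by every window that contains that run, which is what the slicing and pre-aggregation techniques on the later slides exploit.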
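
The slice sizes and partial counts quoted for Panes and Pairs (slides 10 and 11) can be checked with a short calculation. The sketch below only evaluates the formulas as they appear on the slides; the function names are hypothetical and it does not implement either technique.

```python
from math import gcd

def panes_slice_size(range_, slide):
    """Panes: every slice has size gcd(range, slide)."""
    return gcd(range_, slide)

def pairs_slice_sizes(range_, slide):
    """Pairs: two alternating slice sizes, as given on the slide."""
    p2 = range_ % slide
    p1 = slide - p2
    return p1, p2

def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)

def panes_partials(range_, slide):
    """Stored partials for Panes: ceil(range / gcd(range, slide))."""
    return ceil_div(range_, gcd(range_, slide))

def pairs_partials(range_, slide):
    """Stored partials for Pairs: ceil(2 * range / slide)."""
    return ceil_div(2 * range_, slide)

# Count window from the slide: range 10, slide 3.
panes = panes_partials(10, 3)
pairs = pairs_partials(10, 3)
```

For range 10 and slide 3 this reproduces the 10 and 7 partials stated on slide 11.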
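
Eager pre-aggregation (slide 13) keeps a binary tree of partials that is updated bottom-up and answers arbitrary range lookups in logarithmic time. The following is a minimal sketch of that idea as a generic array-backed aggregate tree; it is not FlatFAT or B-Int, and the class and method names are hypothetical. The number of leaves is assumed to be a power of two so that the layout matches the slide's figure.

```python
class AggregateTree:
    """Array-backed binary tree of partial aggregates (illustrative only)."""

    def __init__(self, n, combine, identity):
        self.n = n                      # number of leaves (power of two assumed)
        self.combine = combine
        self.identity = identity
        self.tree = [identity] * (2 * n)  # index 0 unused, leaves at n..2n-1

    def update(self, i, partial):
        """Set leaf i and recompute its ancestors bottom-up: O(log n)."""
        pos = i + self.n
        self.tree[pos] = partial
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.combine(self.tree[2 * pos], self.tree[2 * pos + 1])
            pos //= 2

    def query(self, lo, hi):
        """Combine leaves lo..hi-1 using O(log n) stored partials."""
        left, right = self.identity, self.identity
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                left = self.combine(left, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                right = self.combine(self.tree[hi], right)
            lo //= 2
            hi //= 2
        return self.combine(left, right)

# Records 1..8 with SUM, as in the slide's figure.
tree = AggregateTree(8, lambda a, b: a + b, 0)
for index, value in enumerate(range(1, 9)):
    tree.update(index, value)
inner_nodes = tree.tree[1:8]      # root first, then level by level
window_sum = tree.query(2, 7)     # records 3..7
```

The tree stores about twice as many values as there are records, which is the space cost noted on slide 14.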
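
The intuition on slides 18 and 19, that a slicer only needs to know when new windows start, can be illustrated for the count window of slide 10 (range 10, slide 3). The sketch below is a simplified illustration written for this transcript, not the Cutty implementation: it assumes SUM as the aggregate, a single query, slide no larger than range, and it merges stored slices linearly instead of using an eager aggregate tree. All names are hypothetical.

```python
def sliding_sums(records, range_, slide):
    """Sum over count windows, cutting a slice only where a window begins."""
    stored = []                       # closed slices as (begin_index, partial_sum)
    active_begin, active = 0, 0       # the slice currently being filled
    results = []
    for i, value in enumerate(records):
        if i > 0 and i % slide == 0:  # a new window begins at record i: cut here
            stored.append((active_begin, active))
            active_begin, active = i, 0
        active += value
        begin = i + 1 - range_        # begin index of a window ending at record i
        if begin >= 0 and begin % slide == 0:
            total = active + sum(p for b, p in stored if b >= begin)
            results.append(total)
            # Slices that start before the next open window are no longer needed.
            stored = [(b, p) for b, p in stored if b >= begin + slide]
    return results

sums = sliding_sums(list(range(1, 20)), range_=10, slide=3)
```

Each record updates only the active partial, and a window result is assembled from the few slices that start at or after the window's begin.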