DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"

Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce run-time overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance to allow lazily replaying a query network and reconstructing its provenance, as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS, implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state-of-the-art.


  1. Ariadne: Managing Fine-Grained Provenance on Data Streams
     Boris Glavic (1), Kyumars Sheykh Esmaili (2), Peter M. Fischer (3), Nesime Tatbul (4)
     (1) Illinois Institute of Technology, DBGroup
     (2) Nanyang Technological University, SANDS
     (3) University of Freiburg, Web Science
     (4) Intel Labs, Intel Science and Technology Center for Big Data
     DEBS 2013 - July 7, 2013 - Arlington, USA
  2. Outline
     1 Motivation
     2 Reduced Eager Operator Instrumentation
     3 Optimizations
     4 Experiments
     5 Conclusions
  3. Fine-grained Provenance for Data Stream Management
     Provenance
     • Information about the origin and creation process of data
     • Here: which inputs a given output depends on
     Fine-grained Provenance
     • At the granularity of tuples
     • Fix a tuple t in an output or intermediate stream
     • Which input tuples does it depend on?
     Example
     [Figure: query network with input streams S1 and S2 and output stream Sout]
     Slide 1 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  5. Why (Fine-Grained) Stream Provenance?
     Use cases
     • Ad hoc inspection
       • DSMS generates alarm event based on sensor readings
       • Understand why alarm was raised to act appropriately
     • Stream query debugging
       • Trace back and forward erroneous data items
     • Auditing
       • Proof of correct operation
       • No access policies violated
  6. Challenges and Opportunities
     • Online and infinite data arrival
       • Traditional methods that reconstruct provenance retroactively are not applicable
     • Ordered data model
       • Order can be exploited for compressing provenance
     • Window-based processing
       • Aggregation is a prevalent operation that has large provenance
     • Low-latency requirements
       • Provenance overhead should not violate latency requirements
     • Non-determinism
       • Sources: load shedding, windowing on system time
       • Provenance computation cannot assume reproducibility of results
  7. Related Work
     Database Provenance
     • Large body of work on provenance models
     • Query rewrite-based approaches
     Workflow Provenance
     • Many systems support provenance tracking
     • Mostly coarse-grained provenance
     Data Stream Provenance
     • Coarse-grained provenance
       • Not detailed enough for most use cases
     • Fine-grained provenance
       • Stream debugging: one-step (one operator at a time)
       • Approximation techniques (strong assumptions)
  8. How to generate provenance?
     Inversion
     • Infer provenance from query outputs
     • Usually tracing back one operator at a time
     • No storage overhead
     • Only works for trivial operators
  9. How to generate provenance?
     Annotation Propagation
     • Annotate data with provenance
     • Propagate and manipulate annotations during query processing
     Propagation - Query rewrite
     • Rewrite query network to propagate annotations
     • Use original DSMS operators
     • Easy to implement, applicable to all DSMS∗
     • Often results in complex and inefficient networks
     • Only applicable to deterministic networks
  10. How to generate provenance?
      Propagation - Operator instrumentation
      • Modify DSMS operators to propagate provenance information
      • More efficient than query rewrite
      • Tolerates certain types of non-determinism
      • Keeps structure of the original query network intact
      • Requires modification of the engine (DSMS source code needed)
  11. Plan of attack
      1) Provenance model
      • Annotated tuples
      • Annotated streams
      2) Provenance generation and querying
      • Operator instrumentation
      • Buffer inputs + reconstruction for querying
      • Implementation in Ariadne (extension of Borealis)
      3) Optimizations
      • Compression
      • Laziness
      • Decoupling provenance computation from query processing
  13. Overview
      Operator Instrumentation
      • For each operator of the system create new versions
        • Provenance generator (PG): generates provenance annotations
        • Provenance propagator (PP): propagates provenance annotations
      • Instrument query (or parts thereof) to generate provenance
        • Replace operators with annotating versions
        • Only in the part of the network where we want to track provenance
      Reduced Eager
      • Eager
        • Propagate provenance annotations during query processing
      • Reduced Eager
        • Propagate sets of tuple IDs (TIDs) as annotations
        • Temporarily store inputs of relevant streams for restoring full tuples in provenance
  14. Overview
      Querying Provenance
      • Translate provenance annotations into regular streaming data
      • For one output tuple t annotated with provenance {t1, ..., tn}
        • Output n tuples (t, t1), (t, t2), ..., (t, tn)
      • New operator (p-join) joins buffered input data with provenance annotations
      • Use operators of the DSMS for querying
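The translation of annotations into regular tuples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ariadne's Borealis operator: the function name `p_join`, the dictionary used as the input buffer, and the tuple layout are hypothetical; the readings and provenance sets are the ones from the temperature example in this deck.

```python
def p_join(annotated_output, buffered_input):
    """Pair each output tuple with every buffered input tuple in its provenance.

    annotated_output: iterable of (output_tuple, set_of_input_TIDs)
    buffered_input:   dict mapping input TID -> full input tuple (hypothetical buffer)
    """
    for out_tuple, tids in annotated_output:
        for tid in sorted(tids):
            # A TID that was already evicted from the buffer is skipped.
            if tid in buffered_input:
                yield out_tuple, buffered_input[tid]


# Input stream 1 as (time, temperature), keyed by TID.
buffered = {"1:1": (0, 102), "1:2": (5, 105), "1:3": (10, 399), "1:4": (15, 85)}
# Provenance annotated stream P(3, {1}).
pas = [(("3:1", 103.5), {"1:1", "1:2"}), (("3:2", 95.0), {"1:2", "1:4"})]

pairs = list(p_join(pas, buffered))
```

Each element of `pairs` is an ordinary (output, input) tuple, so regular operators such as filters can be applied to it.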
  17. Overview
      [Figure: S1 and S2 enter the instrumented network (operators PG, PG, PP), which outputs Sout; a "Temporary Input Storage" and a "Reconstruct Provenance" stage with a p-join are attached]
  18. Stream Provenance Model
      Provenance Model
      • Provenance is modelled as a set of contributing tuples
        • From input or intermediate streams
      • Provenance set P(t, I)
        • Tuple t in an intermediate or result stream
        • Set of upstream streams I
      • Which tuples belong to provenance?
        • Declarative definition that models provenance for single operators
        • Transitivity
  19. Stream Provenance Model
      Example
      • Query over stream of temperature readings
        • Filter out outliers (temperature above threshold)
        • Compute average temperature over every consecutive window of two readings
      • P(3:2, {2}) = {2:2, 2:3}
        • 3:2 generated by aggregating values from 2:2 and 2:3
      • P(3:2, {1}) = {1:2, 1:4}
        • 2:2 and 2:3 derived from 1:2 and 1:4
      [Figure: stream 1 -> filter -> stream 2 -> aggregation (α) -> stream 3]
  20. Stream Provenance Model
      Provenance Annotated Streams
      • Each tuple in a provenance annotated stream (PAS) is annotated with its provenance set
        • according to a set of input streams I
      • PAS P(O, I)
        • Provenance annotated stream for stream O according to set of upstream streams I
  21. Stream Provenance Model
      Example
      • Query over stream of temperature readings
        • Filter out outliers (temperature above threshold)
        • Compute average temperature over every consecutive window of two readings
      • Provenance annotated stream (PAS) P(3, {1})
      [Figure: stream 1 -> filter -> stream 2 -> aggregation (α) -> stream 3]
  22. Implementing Instrumented Operators
      Approach
      • Three instrumented versions for each operator
        • Provenance generator: initializes provenance annotations
        • Provenance propagator: propagates provenance annotations
        • Provenance dropper: drops provenance from input
  23. Implementing Instrumented Operators
      Provenance Generator
      • Computes one-step provenance for an operator's output based on its input
        • Attach input TIDs as provenance
        • For windowing operators merge TIDs from window
      • Before instrumentation: I → op → O
      • After instrumentation: I → op_PG → P(O, I)
  24. Implementing Instrumented Operators
      Provenance Propagator
      • Compute output provenance set by combining provenance sets from the input
      • Before instrumentation: P(I, I') → op → O
      • After instrumentation: P(I, I') → op_PP → P(O, I')
      Provenance Dropper
      • Removes provenance annotations
      • Instrumentation: P(I, I') → op_PD → O
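As a rough illustration of the generator and propagator roles, the following Python sketch instruments a filter as a PG and a sliding-window average as a PP. It is a toy model under stated assumptions, not Ariadne code: the function names, the tuple layout `(TID, values, provenance)`, and the outlier threshold of 350 are hypothetical. The input readings are those of the temperature example in this deck.

```python
from collections import deque


def filter_pg(stream, predicate, out_stream_id):
    """Provenance-generating filter: each passing tuple is annotated with its input TID."""
    out = []
    for tid, values in stream:
        if predicate(values):
            out_tid = f"{out_stream_id}:{len(out) + 1}"
            out.append((out_tid, values, {tid}))
    return out


def avg_pp(annotated, window, out_stream_id, field=1):
    """Provenance-propagating average over a count-based window with slide 1.

    The output provenance is the union of the provenance sets in the window.
    """
    out = []
    buf = deque(maxlen=window)
    for _tid, values, prov in annotated:
        buf.append((values, prov))
        if len(buf) == window:
            avg = sum(v[field] for v, _ in buf) / window
            merged = set().union(*(p for _, p in buf))
            out.append((f"{out_stream_id}:{len(out) + 1}", avg, merged))
    return out


# Stream 1: (TID, (time, temperature)).
stream1 = [("1:1", (0, 102)), ("1:2", (5, 105)), ("1:3", (10, 399)), ("1:4", (15, 85))]
# The threshold 350 is an assumption; the slides do not state it.
stream2 = filter_pg(stream1, lambda v: v[1] <= 350, "2")
stream3 = avg_pp(stream2, 2, "3")
```

With these inputs, `stream3` carries the averages 103.5 and 95.0 annotated with {1:1, 1:2} and {1:2, 1:4}. A provenance dropper would simply discard the third component of each tuple.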
  25. Instrumenting a Network
      • User provides
        • query network q
        • stream O
        • set of streams I upstream from O
      • Output is a network that computes P(O, I)
      InstrumentNetwork(q, O, I)
        for all op on paths between I and O
          if op is connected to I
            replace op with PG version
          else
            replace op with PP version
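The InstrumentNetwork procedure can be sketched in Python as follows. The representation of the query network (a dictionary from operator name to its input streams and output stream) and all names are assumptions made for illustration. The sketch only decides which version (PG or PP) each operator on a path gets; it does not place provenance droppers.

```python
def instrument_network(operators, target, sources):
    """Assign 'PG' or 'PP' to every operator on a path from `sources` to `target`.

    operators: dict name -> (list of input streams, output stream)
    target:    the stream O
    sources:   the set of streams I upstream from O
    """
    producer = {out: name for name, (_, out) in operators.items()}

    def derives_from_sources(stream, seen):
        # True if the stream is in I or is computed from a stream in I.
        if stream in sources:
            return True
        if stream in seen or stream not in producer:
            return False
        seen.add(stream)
        inputs, _ = operators[producer[stream]]
        return any(derives_from_sources(s, seen) for s in inputs)

    plan = {}
    stack, visited = [target], set()
    while stack:
        stream = stack.pop()
        if stream in visited or stream in sources or stream not in producer:
            continue
        visited.add(stream)
        name = producer[stream]
        inputs, _ = operators[name]
        if not any(derives_from_sources(s, set()) for s in inputs):
            continue  # operator is not on a path between I and O
        # Operators reading directly from a stream in I generate provenance.
        plan[name] = "PG" if any(s in sources for s in inputs) else "PP"
        stack.extend(inputs)
    return plan


# Temperature example: stream 1 -> filter -> stream 2 -> avg -> stream 3.
network = {"filter": (["1"], "2"), "avg": (["2"], "3")}
plan_3_from_1 = instrument_network(network, "3", {"1"})
plan_3_from_2 = instrument_network(network, "3", {"2"})
plan_2_from_1 = instrument_network(network, "2", {"1"})
```

For P(3, {1}) the filter becomes a PG and the aggregation a PP; for P(3, {2}) only the aggregation is instrumented, as a PG.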
  26. Instrumenting a Network
      Example
      • Network that computes P(3, {1})
      [Figure: stream 1 -> filter (PG) -> stream 2 -> aggregation α (PP) -> stream 3]
      Stream 1
      TID   time   temperature
      1:1   0      102
      1:2   5      105
      1:3   10     399
      1:4   15     85
      Stream 3
      TID   avg temp   provenance
      3:1   103.5      {1:1, 1:2}
      3:2   95         {1:2, 1:4}
      Example
      • Network that computes P(2, {1})
      [Figure: stream 1 -> filter (PG) -> stream 2 -> aggregation α (PD) -> stream 3]
      Stream 1
      TID   time   temperature
      1:1   0      102
      1:2   5      105
      1:3   10     399
      1:4   15     85
      Stream 2
      TID   time   temperature   provenance
      2:1   0      102           {1:1}
      2:2   5      105           {1:2}
      2:3   15     85            {1:4}
      Stream 3
      TID   avg temp
      3:1   103.5
      3:2   95
  27. Ariadne Implementation
      Based on Borealis
      • Set of standard DSMS operators
      • Operators connected through queues
      • Fixed-length tuples
      Changes to Borealis code
      • Implement annotations
      • Implement operator instrumentation
      • Implement p-join and input buffering
  29. Querying Provenance
      Approach
      1 Translate provenance annotations into regular stream data to query using the host language
      P-join
      • Joins a PAS P(O, I) with buffered input tuples from I
      • Combines output t with each tuple in its provenance P(t, I)
  30. Querying Provenance
      Example
      • Tuple 3:2 with P(3:2, {1}) = {1:2, 1:4}
      [Figure: stream 1 -> filter (PG) -> stream 2 -> aggregation α (PP) -> stream 3, with connection point C1 and a p-join]
      Stream 1
      TID   time   temperature
      1:1   0      102
      1:2   5      105
      1:3   10     399
      1:4   15     85
      Stream 2
      TID   time   temperature   provenance
      2:1   0      102           {1:1}
      2:2   5      105           {1:2}
      2:3   15     85            {1:4}
      Stream 3
      TID   avg temp   provenance
      3:1   103.5      {1:1, 1:2}
      3:2   95         {1:2, 1:4}
  32. Rationale
      Compression
      • Reduce load on queues by using concise provenance representation
      • Reduce memory footprint
      Lazy Retrieval
      • Avoid reconstructing unneeded provenance
      Replay-Lazy
      • Avoid generating unneeded provenance
      • Decouple provenance computation from regular stream processing
  34. Replay-Lazy
      Idea
      • Avoid provenance computation if provenance is not needed
      • Propagate concise representation of a super-set of the provenance
      • If provenance is requested, replay super-set through provenance generating network
      Covering Intervals
      • Interval:
        • Minimal TID in provenance
        • Maximal TID in provenance
      • Constant-size super-set of provenance (piggy-back!)
      • Provenance: {1, 2, 4, 5, 10, 16, 65}
      • Covering interval: [1, 65]
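A covering interval is cheap to compute and to combine, which the following Python sketch illustrates. The function names are hypothetical and integer TIDs are assumed; `merge_intervals` shows what a covering-interval propagator would plausibly do when several inputs contribute to one output, and `replay_candidates` selects the buffered TIDs that would have to be replayed.

```python
def covering_interval(tids):
    """Smallest interval [lo, hi] that contains every TID of the provenance set."""
    return (min(tids), max(tids))


def merge_intervals(intervals):
    """Covering interval of a union of provenance sets, given their covering intervals."""
    lows, highs = zip(*intervals)
    return (min(lows), max(highs))


def replay_candidates(interval, buffered_tids):
    """TIDs in the buffer that fall inside the interval (a super-set of the provenance)."""
    lo, hi = interval
    return [t for t in buffered_tids if lo <= t <= hi]


provenance = {1, 2, 4, 5, 10, 16, 65}
interval = covering_interval(provenance)
merged = merge_intervals([interval, (40, 70)])
candidates = replay_candidates((4, 10), range(1, 20))
```

The interval has constant size regardless of how many TIDs the provenance contains, at the price of including TIDs (such as 3 or 6 above) that are not part of the provenance.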
  35. Replay-Lazy
      Enabling Replay
      • Covering interval operators:
        • CG corresponds to PG
        • CP corresponds to PP
      • C-join operator ⊗
        • For each input get covering interval
        • Retrieve all tuples from covering interval from connection point
  36. Replay-Lazy
      Approach
      1 Instrument original network to produce covering interval
      2 Filter out results and c-join
      3 Feed result into provenance generating copy of the network
      [Figure: three stages between S1 and Sout: Covering Interval Generating Network (CG, CP), Filter Provenance and Fetch Tuples from Input (c-join ⊗), Provenance Generating Network (PG, PP)]
  38. Completion Time - Basic Network
      • Send large batch of 100,000 tuples
      • Measure completion time
        • Without Retrieval/With Retrieval
        • No Provenance/Instrumentation/Rewrite
      • Using the Basic Network
      • Vary Aggregation Window Size
        • Slide fixed to 1
        • Increase window size → increase amount of provenance
      [Figure: four query networks built from selections (σ) and an aggregation (α): Basic Query Network, Rewritten Version, Instrumented Version (PG, PP), Replay-Lazy Version (CG, CP, c-join ⊗, PG, PP, p-join)]
  39. Completion Time - Basic Network
      [Figure: bar chart of completion time (sec, axis 0 to 45) by window size (10, 20, 40, 60, 80, 100) for NoProvenance, Instrumentation, Replay-Lazy, and Rewrite; legend: No Provenance, Instrumentation (Generation), Instrumentation (Retrieval), Replay-Lazy (Covering Interval), Replay-Lazy (Retrieval), Rewrite]
  41. Conclusions
      Reduced-Eager Operator Instrumentation
      • Novel propagation provenance method for DSMS
      • Replace operators in a query network with provenance-aware (instrumented) versions
      • Keeps original network structure intact
      • Deals well with non-determinism
      • Flexible:
        • Single- and multi-hop provenance
        • Instrument parts of a network
  42. Conclusions
      Optimizations
      • Replay-Lazy
        • Propagate only covering intervals - replay through provenance generating copy of network
        • Reduce cost for low retrieval rates
        • Option to decouple provenance computation
      • Lazy-Retrieval
        • Push filters through p-joins
        • Avoids provenance reconstruction
      • Compression
        • Interval
        • Delta
        • Dictionary
  43. Future Work
      Optimizing Temporary Input Storage
      • Exploiting additional knowledge in purging temporary input storage
        • Static query network analysis
        • Self-optimizing purging strategies
      • Compression and Approximations
      Integration with Distributed Storage and Scaling-out
      • Using distributed write-optimized storage for provenance?
      Order-aware Provenance Model
      • Integrate order into the provenance model
      • “Tuple t is after u in the result, because t’s provenance is before u’s provenance in the input.”
  44. Questions?
      Boris Glavic, IIT, DBGroup, bglavic@iit.edu
      Kyumars Sheykh Esmaili, Nanyang University, SANDS, kyumarss@ntu.edu.sg
      Peter M. Fischer, University of Freiburg, Web Science, peter.fischer@cs.uni-freiburg.de
      Nesime Tatbul, Intel Labs, Intel Science and Technology Center for Big Data, tatbul@csail.mit.edu
      http://cs.iit.edu/~dbgroup/research/ariadne.php
  45. Outline (Appendix)
      6 Implementation
      7 Additional Experiments
      8 Optimizations
  47. How to Implement Annotations?
      Alternatives for sending annotations
      • Borealis models streams as queues of fixed-length tuples with header and payload
      1 Variable-length tuples: negates optimizations based on fixed-length tuples
      2 Split provenance sets into fixed-length tuples
        • Send provenance set after each output tuple
        • First tuple has small header to tell downstream operators how many provenance tuples will follow
        • Additional provenance tuples only store payload (TIDs)
      3 New information channels: complex interactions with normal query processing
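Alternative 2 can be pictured with a small Python sketch. Everything concrete here is an assumption for illustration: the dictionary standing in for a data tuple with its header field `follow`, the capacity of four TIDs per provenance tuple, and the use of `None` as padding. It is not the Borealis tuple layout.

```python
def split_provenance(payload, tids, tids_per_tuple=4):
    """Emit one data tuple followed by fixed-length tuples that hold only TIDs.

    The data tuple's header field `follow` tells downstream operators how many
    provenance tuples come next.
    """
    tids = sorted(tids)
    chunks = [tids[i:i + tids_per_tuple] for i in range(0, len(tids), tids_per_tuple)]
    head = {"follow": len(chunks), "payload": payload}
    # Pad the last chunk so that every provenance tuple has the same length.
    body = [tuple(c + [None] * (tids_per_tuple - len(c))) for c in chunks]
    return [head] + body


def read_provenance(queue):
    """Reassemble (payload, TIDs) pairs from a queue written by split_provenance."""
    it = iter(queue)
    for head in it:
        tids = []
        for _ in range(head["follow"]):
            tids.extend(t for t in next(it) if t is not None)
        yield head["payload"], tids


queue = (split_provenance(("3:1", 103.5), [11, 12, 13, 14, 15])
         + split_provenance(("3:2", 95.0), [12, 14]))
decoded = list(read_provenance(queue))
```

Because every element of the queue has a fixed size, the queue implementation itself does not need to change; only the reader has to interpret the `follow` count.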
  48. Implementing Instrumented Operators
      Approach
      • Factor out common functionality
        • Serializing/deserializing provenance to/from queues
        • Caching provenance of windows
        • Merging provenance sets
      • Implemented in Provenance Wrapper
        • Operators access queues through the wrapper
        • Reduces code changes to each operator
      • LOC
        • Provenance wrapper: ~8000 LOC
        • Instrumented operators: aggregation (largest) 200 LOC
  49. Temporary Input Storage
      Connection Points
      • Borealis feature for storing tuples from a stream
      • Time-based or count-based eviction strategies
      • Content can be joined with streams
      [Figure: S1 and S2 enter the instrumented network (operators PG, PG, PP), which outputs Sout; a "Temporary Input Storage" and a "Reconstruct Provenance" stage with a p-join are attached]
  51. Varying Window Size
      • Send large batch of 100,000 tuples
      • Measure completion time
        • No Retrieval
        • No Provenance/Single/Optimized/Covering Interval
      • Using the Basic Network
      • Vary Window Size
        • Slide fixed to 1
      [Figure: four query networks built from selections (σ) and an aggregation (α): Basic Query Network, Rewritten Version, Instrumented Version (PG, PP), Replay-Lazy Version (CG, CP, c-join ⊗, PG, PP, p-join)]
  52. Varying Window Size
      [Figure: completion time (sec, axis 0 to 100) vs. window size (50, 100, 200, 500, 1000, 2000) for No Provenance, Single, Optimized, Covering Interval]
  53. Vary Retrieval Frequency
      • Send large batch of 100,000 tuples
      • Measure completion time
        • Instrumentation with Retrieval
        • Replay-Lazy with Retrieval
      • Using a sequence of aggregations
      • Vary Retrieval Frequency - selectivity of filter before p-join
      • Slide fixed to 1, window size fixed to 100
      [Figure: Instrumented Version (PG, PP, PP, p-join) and Replay-Lazy version (CG, CP, c-join ⊗, PG, PP, p-join)]
  54. Vary Retrieval Frequency
      [Figure: completion time relative to completion time without provenance (axis 1 to 25) vs. retrieval frequency in % (0.05, 0.10, 0.25, 0.50, 0.75, 1, 2, 5, 10, 25, 50, 100) for Instrumentation with Retrieval (Optimized) and Replay-Lazy with Retrieval (Optimized)]
  55. Latency
      • In Ariadne inputs are sent in batches with a fixed number of tuples (parameter)
      • Measure latency
      • Using the Basic Network
      • Vary the load
        • Vary batch size and fix the delay between consecutive batches
  56. Latency
      [Figure: latency (ms, axis 0 to 7) vs. batch size (10, 25, 50, 75, 100) for No Provenance, Single, Optimized, Covering Intervals]
  57. Nested Aggregations
      • Sequence of aggregation operators
      • Measure completion time
      • Vary number of aggregation operators
      Completion time by number of aggregations
      Method                      1       2       3        4
      No Provenance               3.1     3.9     4.8      5.7
      Instr. (Generation)         3.9     7.4     14.7     48.6
      Instr. (Retrieval)          3.0     12.9    103.0    2047.0
      Replay-Lazy (Cov. Inter.)   3.1     4.4     5.2      6.3
      Replay-Lazy (Retrieval)     5.2     14.7    91.1     2224.0
      Rewrite                     7.2     625.0   crash    crash
  61. Compression
      Rationale
      • Reduce amount of data that is shipped between operators
      • Requirements
        • Fast compression and decompression
        • Operations such as merging sets on compressed data
      Interval Compression
      • Input: {1, 2, 4, 5, 6, 7, 9}
      • Output: {[1 − 2], [4 − 7], [9 − 9]}
      Dictionary Compression
      • Input: {1, 2, 4, 5}
      • Output: LZ77({1, 2, 4, 5})
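Interval compression amounts to run-length encoding of consecutive TIDs. The Python sketch below is a minimal illustration with hypothetical function names and integer TIDs, not Ariadne's implementation. It also shows that two compressed sets can be merged without decompressing them, which is one of the requirements listed above.

```python
def interval_compress(tids):
    """Encode a set of integer TIDs as a sorted list of inclusive runs (lo, hi)."""
    runs = []
    for t in sorted(tids):
        if runs and t == runs[-1][1] + 1:
            runs[-1][1] = t          # extend the current run
        else:
            runs.append([t, t])      # start a new run
    return [tuple(r) for r in runs]


def interval_decompress(runs):
    """Expand runs back into the sorted list of TIDs."""
    return [t for lo, hi in runs for t in range(lo, hi + 1)]


def merge_runs(a, b):
    """Union of two compressed sets, computed directly on the runs."""
    merged = []
    for lo, hi in sorted(a + b):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


runs = interval_compress({1, 2, 4, 5, 6, 7, 9})
restored = interval_decompress(runs)
union = merge_runs([(1, 2), (4, 7)], [(3, 3), (9, 9)])
```

For the input {1, 2, 4, 5, 6, 7, 9} this yields the runs 1-2, 4-7, and 9-9. The scheme pays off when provenance consists of long runs of consecutive TIDs, as is typical for window-based aggregation.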
  63. Compression
      Delta Compression
      • Express provenance as delta to down-stream provenance
      • Once in a while send full provenance
      • Express provenance as delta to previous full provenance
      • Input: {1, 2, 4, 7, 9} ← {4, 7, 9, 10} ← {7, 9, 10, 12, 15, 19}
      • Output: {1, 2, 4, 7, 9} ← 3 − {10} ← 2 − {10, 12, 15, 19}
      Heuristic Adaptive Compression
      • No one-size-fits-all method
      • Combine compression methods
      • Heuristic rules for when to apply which method
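One way to read the delta notation on this slide is that each delta consists of the number of trailing TIDs kept from the previous full provenance, followed by the newly added TIDs. That reading is an assumption; it is consistent with both deltas in the example, but the slide does not define the notation. The Python sketch below implements this reading with hypothetical function names.

```python
def delta_encode(reference, current):
    """Encode `current` relative to the last full provenance `reference`.

    Returns (keep, added): `keep` trailing TIDs of the reference are retained,
    and `added` lists the TIDs that follow them.
    """
    ref, cur = sorted(reference), sorted(current)
    keep = 0
    for k in range(min(len(ref), len(cur)), 0, -1):
        if ref[-k:] == cur[:k]:
            keep = k
            break
    return keep, cur[keep:]


def delta_decode(reference, delta):
    """Rebuild the provenance from the last full provenance and a delta."""
    keep, added = delta
    ref = sorted(reference)
    return (ref[-keep:] if keep else []) + added


full = {1, 2, 4, 7, 9}
d1 = delta_encode(full, {4, 7, 9, 10})
d2 = delta_encode(full, {7, 9, 10, 12, 15, 19})
```

Under this reading, `d1` corresponds to "3 − {10}" and `d2` to "2 − {10, 12, 15, 19}". Overlapping sliding windows produce exactly this kind of provenance, where most TIDs are shared with the preceding output.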
  65. Retrieval-Lazy
      Approach
      • User query that filters out unneeded provenance
      • Avoid expensive provenance reconstruction if provenance is not needed
      • If possible filter before reconstruction (p-join)
        • Push filters through p-joins
      [Figure: instrumented network (PG, PP, PP) with aggregation (α), followed by a p-join]
