SlideShare a Scribd company logo
1 of 65
Download to read offline
Ariadne: Managing Fine-Grained Provenance on Data
Streams
Boris Glavic1 Kyumars Sheykh Esmaili2 Peter M. Fischer3
Nesime Tatbul4
Illinois Institute of
Technology1
DBGroup
Nanyang
Technological
University2
SANDS
University of
Freiburg3
Web Science
Intel Labs4
Intel Science and
Technology Center
for Big Data
DEBS 2013 - July 7, 2013 - Arlington, USA
Outline
1 Motivation
2 Reduced Eager Operator Instrumentation
3 Optimizations
4 Experiments
5 Conclusions
Fine-grained Provenance for Data Stream Management
Provenance
ā€¢ Information about the origin and creation process of data
ā€¢ Here: On which inputs does a given output depend on
Fine-grained Provenance
ā€¢ At the granularity of tuples
ā€¢ Fix a tuple t in an output or intermediate stream
ā€¢ On which input tuples does it depend on?
Example
S1 Sout
S2
141 2 3
a b
Slide 1 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
Fine-grained Provenance for Data Stream Management
Provenance
ā€¢ Information about the origin and creation process of data
ā€¢ Here: On which inputs does a given output depend on
Fine-grained Provenance
ā€¢ At the granularity of tuples
ā€¢ Fix a tuple t in an output or intermediate stream
ā€¢ On which input tuples does it depend on?
Example
S1 Sout
S2
141 2 3
a b
Slide 1 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
Why (Fine-Grained) Stream Provenance?
Use cases
ā€¢ Ad hoc inspection
ā€¢ DSMS generates alarm event based on sensor readings
ā€¢ Understand why alarm was raised to act appropriately
ā€¢ Stream query debugging
ā€¢ Trace back and forward erroneous data items
ā€¢ Auditing
ā€¢ Proof of correct operation
ā€¢ No access policies violated
Slide 2 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
Challenges and Opportunities
ā€¢ Online and inļ¬nite data arrival
ā€¢ Traditional methods that reconstruct provenance retroactively not
applicable
ā€¢ Ordered data model
ā€¢ Order can be exploited for compressing provenance
ā€¢ Window-based processing
ā€¢ Aggregation prevalent operation that has large provenance
ā€¢ Low-latency requirements
ā€¢ Provenance overhead should not violate latency requirements
ā€¢ Non-determinism
ā€¢ Sources: Load-shedding, windowing on system-time
ā€¢ Provenance computation can not assume reproducibility of results
Slide 3 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
Related Work
Database Provenance
ā€¢ Large body of work on provenance models
ā€¢ Query rewrite-based approaches
Workļ¬‚ow Provenance
ā€¢ Many systems support provenance tracking
ā€¢ Mostly coarse-grained provenance
Data Stream Provenance
ā€¢ Coarse-grained provenance
ā€¢ Not detailed enough for most use-cases
ā€¢ Fine-grained provenance
ā€¢ Stream debugging: One-step (one operator at a time)
ā€¢ Approximation techniques (strong assumptions)
Slide 4 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
How to generate provenance?
Inversion
ā€¢ Infer provenance from query outputs
ā€¢ Usually tracing back one operator at a time
ā€¢ No storage overhead
ā€¢ Only works for trivial operators
Slide 5 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
How to generate provenance?
Annotation Propagation
ā€¢ Annotate data with provenance
ā€¢ Propagate and manipulate annotations during query processing
Propagation - Query rewrite
ā€¢ Rewrite query network to propagate annotations
ā€¢ Use original DSMS operators
ā€¢ Easy to implement, applicable to all DSMSāˆ—
ā€¢ Often results in complex and ineļ¬ƒcient networks
ā€¢ Only applicable to deterministic networks
Slide 5 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
How to generate provenance?
Propagation - Operator instrumentation
ā€¢ Modify DSMS operators to propagate provenance information
ā€¢ More eļ¬ƒcient than query rewrite
ā€¢ Tolerates certain types of non-determinism
ā€¢ Keeps structure of the original query network intact
ā€¢ Requires modiļ¬cation of the engine (DSMS source code needed)
Slide 5 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
Plan of attack
1) Provenance model
ā€¢ Annotated tuples
ā€¢ Annotated streams
2) Provenance generation and querying
ā€¢ Operator instrumentation
ā€¢ Buļ¬€er inputs + reconstruction for querying
ā€¢ Implementation in Ariadne (extension of Borealis)
3) Optimizations
ā€¢ Compression
ā€¢ Laziness
ā€¢ Decoupling provenance computation from query processing
Slide 6 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
Outline
1 Motivation
2 Reduced Eager Operator Instrumentation
3 Optimizations
4 Experiments
5 Conclusions
Overview
Operator Instrumentation
ā€¢ For each operator of the system create new versions
ā€¢ Provenance generator (PG): Generates provenance annotations
ā€¢ Provenance propagator (PP): Propagates provenance annotations
ā€¢ Instrument query (or parts thereof) to generate provenance
ā€¢ Replace operators with annotating versions
ā€¢ Only in the part of the network we want to track provenance
Reduced Eager
ā€¢ Eager
ā€¢ Propagate provenance annotations during query processing
ā€¢ Reduced Eager
ā€¢ Propagate sets of tuple-IDs (TIDs) annotations
ā€¢ Temporarily store inputs of relevant streams for restoring full tuples in
provenance
Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Overview
Querying Provenance
ā€¢ Translate provenance annotations into regular streaming data
ā€¢ For one output tuple t annotated with provenance {t1, . . . , tn}
ā€¢ output n tuples (t, t1), (t, t2), . . .
ā€¢ New operator (p-join) joins buļ¬€ered input data with provenance
annotations
ā€¢ Use operators of the DSMS for querying
Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Overview
S1 Sout
S2
Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Overview
S1 Sout
S2
Instrumented
Network
PG PP
PG
Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Overview
S1 Sout
S2 n P
Instrumented
Network
Reconstruct
Provenance
Temporary
Input Storage
PG
PG
PP
Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Stream Provenance Model
Provenance Model
ā€¢ Provenance is modelled as a set of contributing tuples
ā€¢ From input of intermediate streams
ā€¢ Provenance set P(t, I)
ā€¢ tuple t in intermediate or result stream
ā€¢ set of upstream streams I
ā€¢ Which tuples belong to provenance?
ā€¢ Declarative deļ¬nition that models provenance for single operators
ā€¢ Transitivity
Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Stream Provenance Model
Example
ā€¢ Query over stream of temperature readings
ā€¢ Filter out outliers (temperature above threshold)
ā€¢ Compute average temperature over every consecutive window of two
readings
ā€¢ P(3:2 , {2}) = {2:2 , 2:3 }
ā€¢ 3:2 generated by aggregating values from 2:2 and 2:3
ā€¢ P(3:2 , {1}) = {1:2 , 1:4 }
ā€¢ 2:2 and 2:3 derived from 1:2 and 1:4
ā†µ
1 3
2
Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Stream Provenance Model
Provenance Annotated Streams
ā€¢ Each tuple in a provenance annotated stream (PAS) is annotated
with its provenance set
ā€¢ according to a set of input streams I
ā€¢ PAS P(O, I)
ā€¢ Provenance annotated stream for stream O according to set of
upstream streams I
Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Stream Provenance Model
Example
ā€¢ Query over stream of temperature readings
ā€¢ Filter out outliers (temperature above threshold)
ā€¢ Compute average temperature over every consecutive window of two
readings
ā€¢ Provenance annotated stream (PAS) P(3, {1})
ā†µ
1 32
Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Implementing Instrumented Operators
Approach
ā€¢ Three instrumented versions for each operator
ā€¢ Provenance generators: Initialize provenance annotations
ā€¢ Provenance propagators: Propagate provenance annotations
ā€¢ Provenance dropper: Drop provenance from input
Slide 9 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Implementing Instrumented Operators
Provenance Generator
ā€¢ Computes one-step provenance for an operatorā€™s output based on its
input
ā€¢ Attach input TIDs as provenance
ā€¢ For windowing operators merge TIDs from window
ā€¢ Before instrumentation: I ā†’ op ā†’ O
ā€¢ After instrumentation: I ā†’ opPG ā†’ P(O, I)
Slide 9 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Implementing Instrumented Operators
Provenance Propagator
ā€¢ Compute output provenance set by combining provenance sets from
the input
ā€¢ Before instrumentation: P(I, I ) ā†’ op ā†’ O
ā€¢ After instrumentation: P(I, I ) ā†’ opPP ā†’ P(O, I )
Provenance Dropper
ā€¢ Removes provenance annotations
ā€¢ Instrumentation: P(I, I ) ā†’ opPD ā†’ O
Slide 9 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Instrumenting a Network
ā€¢ User provides
ā€¢ query network q
ā€¢ stream O
ā€¢ set of streams I upstream from O
ā€¢ Output is a network that computes P(O, I)
InstrumentNetwork(q,O,I)
for all op on paths between I and O
if op is connected to I
replace op with PG version
else
replace op with PP version
Slide 10 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Instrumenting a Network
Example
ā€¢ Network that computes P(3, {1})
ā†µ
1 32
TID time tempature
1:1 0 102
1:2 5 105
1:3 10 399
1:4 15 85
TID avg temp
3:1 103.5 {1:1, 1:2}
3:2 95 {1:2, 1:4}
PPPG
Example
ā€¢ Network that computes P(2, {1})
TID avg temp
3:1 103.5
3:2 95
ā†µ
1 32
TID time temperature
1:1 0 102
1:2 5 105
1:3 10 399
1:4 15 85
TID time temperature
2:1 0 102 {1:1}
2:2 5 105 {1:2}
2:3 15 85 {1:4}
PG PD
Slide 10 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Ariadne Implementation
Based on Borealis
ā€¢ Set of standard DSMS operators
ā€¢ Operators connected through queues
ā€¢ Fixed-length tuples
Changes to Borealis code
ā€¢ Implement annotations
ā€¢ Implement operator instrumentation
ā€¢ Implement p-join and input buļ¬€ering
Slide 11 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Querying Provenance
Approach
1 Translate provenance annotations into regular stream data to query
using the host language
Slide 12 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Querying Provenance
Approach
1 Translate provenance annotations into regular stream data to query
using the host language
P-join
ā€¢ Joins a PAS P(O, I) with buļ¬€ered input tuples from I
ā€¢ Combine output t with each tuple in its provenance P(t, I)
Slide 12 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Querying Provenance
Example
ā€¢ Tuple 3:2 with P(3:2, {1}) = {1:2, 1:4}
ā†µ
1 32
TID time temperature
1:1 0 102
1:2 5 105
1:3 10 399
1:4 15 85
TID avg temp
3:1 103.5 {1:1, 1:2}
3:2 95 {1:2, 1:4}
TID time temperature
2:1 0 102 {1:1}
2:2 5 105 {1:2}
2:3 15 85 {1:4}
PPPG
C1
n
Slide 12 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
Outline
1 Motivation
2 Reduced Eager Operator Instrumentation
3 Optimizations
4 Experiments
5 Conclusions
Rationale
Compression
ā€¢ Reduce load on queues by using concise provenance representation
ā€¢ Reduce memory imprint
Lazy Retrieval
ā€¢ Avoid reconstructing unneeded provenance
Replay-Lazy
ā€¢ Avoid generating unneeded provenance
ā€¢ Decouple provenance computation from regular stream processing
Slide 13 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
Replay-Lazy
Idea
ā€¢ Avoid provenance computation if provenance is not needed
ā€¢ Propagate concise representation of a super-set of the provenance
ā€¢ If provenance is requested replay super-set through provenance
generating network
Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
Replay-Lazy
Idea
ā€¢ Avoid provenance computation if provenance is not needed
ā€¢ Propagate concise representation of a super-set of the provenance
ā€¢ If provenance is requested replay super-set through provenance
generating network
Covering Intervals
ā€¢ Interval:
ā€¢ Minimal TID in provenance
ā€¢ Maximal TID in provenance
ā€¢ Constant size super-set of provenance (piggy-back!)
ā€¢ Provenance: {1, 2, 4, 5, 10, 16, 65}
ā€¢ Covering interval: [1, 65]
Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
Replay-Lazy
Enabling Replay
ā€¢ Covering interval operators:
ā€¢ CG corresponds to PG
ā€¢ CP corresponds to PP
ā€¢ C-join operator āŠ—
ā€¢ For each input get covering interval
ā€¢ Retrieve all tuples from covering interval from connection point
Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
Replay-Lazy
Approach
1 Instrument original network to produce covering interval
2 Filter out results and c-join
3 Feed result into provenance generating copy of the network
PG
ā†µ
PP
ā†µ
CPCG
S1 Sout
Provenance
Generating
Network
Filter Provenance
and Fetch Tuples
from Input
Covering
Interval
Generating
Network
āŒ¦
ā‡”
PD
P
Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
Outline
1 Motivation
2 Reduced Eager Operator Instrumentation
3 Optimizations
4 Experiments
5 Conclusions
Completion Time - Basic Network
ā€¢ Send large batch of 100,000 tuples
ā€¢ Measure completion time
ā€¢ Without Retrieval/With Retrieval
ā€¢ No Provenance/Instrumentation/Rewrite
ā€¢ Using the Basic Network
ā€¢ Vary Aggregation Window Size
ā€¢ Slide ļ¬xed to 1
ā€¢ Increase window size ā†’ increase amount of provenance
Basic Query Network
Ī±Ļƒ Ļƒ
Rewritten Version
Ļƒ
Ī±
Ļƒ
Instrumented Version
Ī±
PP
Ļƒ
PG PP
Ļƒ Ļƒ
Replay-Lazy Version
PPCP PG
ā†µ
PP
ā†µ
CPCG
āŒ¦ n
Slide 15 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Experiments
Completion Time - Basic Network
0
5
10
15
20
25
30
35
40
45
NoProvenance
Instrumentation
Replay-Lazy
Rewrite
NoProvenance
Instrumentation
Replay-Lazy
Rewrite
NoProvenance
Instrumentation
Replay-Lazy
Rewrite
NoProvenance
Instrumentation
Replay-Lazy
Rewrite
NoProvenance
Instrumentation
Replay-Lazy
Rewrite
NoProvenance
Instrumentation
Replay-Lazy
Rewrite
CompletionTime(sec)
Window Size
No Provenance
Instrumentation (Generation)
Instrumentation (Retrieval)
Replay-Lazy (Covering Interval)
Replay-Lazy (Retrieval)
Rewrite
1008060402010
Slide 15 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Experiments
Outline
1 Motivation
2 Reduced Eager Operator Instrumentation
3 Optimizations
4 Experiments
5 Conclusions
Conclusions
Reduced-Eager Operator Instrumentation
ā€¢ Novel propagation provenance method for DSMS
ā€¢ Replace operators in a query network with provenance-aware
(instrumented) versions
ā€¢ Keeps original network structure intact
ā€¢ Deals well with non-determinism
ā€¢ Flexible:
ā€¢ Single- and multi-hop provenance
ā€¢ Instrument parts of a network
Slide 16 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
Conclusions
Optimizations
ā€¢ Replay-Lazy
ā€¢ Propagate only covering intervals - replay through provenance
generating copy of network
ā€¢ Reduce cost for low retrieval rates
ā€¢ Option to decouple provenance computation
ā€¢ Lazy-Retrieval
ā€¢ Push ļ¬lters through p-joins
ā€¢ Avoids provenance reconstruction
ā€¢ Compression
ā€¢ Interval
ā€¢ Delta
ā€¢ Dictionary
Slide 16 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
Future Work
Optimizing Temporary Input Storage
ā€¢ Exploiting additional knowledge in purging temporary input storage
ā€¢ Static query network analysis
ā€¢ Self-optimizing purging strategies
ā€¢ Compression and Approximations
Integration with Distributed Storage and Scaling-out
ā€¢ Using distributed write-optimized storage for provenance?
Order-aware Provenance Model
ā€¢ Integrate order into the provenance model
ā€¢ ā€œTuple t is after u in the result, because tā€™s provenance is before uā€™s
provenance in the input.ā€
Slide 17 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
Questions?
Boris Glavic
IIT
DBGroup
bglavic@iit.edu
Kyumars Sheykh Esmaili
Nanyang University
SANDS
kyumarss@ntu.edu.sg
Peter M. Fischer
University of Freiburg
Web Science
peter.ļ¬scher@
cs.uni-freiburg.de
Nesime Tatbul
Intel Labs
Intel Science and
Technology Center
for Big Data
tatbul@csail.mit.edu
http://cs.iit.edu/~dbgroup/research/ariadne.php
Slide 18 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
Outline
6 Implementation
7 Additional Experiments
8 Optimizations
How to Implement Annotations?
Alternatives for sending annotations
ā€¢ Borealis models streams as queues of ļ¬xed-length tuples with header
and payload
1 Variable-length tuples: Negates optimizations based on ļ¬xed-length
2 Split provenance sets into ļ¬x-length tuples
3 New information channels: Complex interactions with normal query
processing
Slide 1 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
How to Implement Annotations?
Alternatives for sending annotations
ā€¢ Borealis models streams as queues of ļ¬xed-length tuples with header
and payload
1 Variable-length tuples: Negates optimizations based on ļ¬xed-length
2 Split provenance sets into ļ¬x-length tuples
ā€¢ Send provenance set after each output tuple
ā€¢ First tuple has small header to tell downstream operators how many
provenance tuples will follow
ā€¢ Additional provenance tuples only store payload (TIDs)
3 New information channels: Complex interactions with normal query
processing
Slide 1 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
Implementing Instrumented Operators
Approach
ā€¢ Factor out common functionality
ā€¢ Serializing/Deserializing provenance to/from queues
ā€¢ Caching provenance of windows
ā€¢ Merging provenance sets
ā€¢ Implemented in Provenance Wrapper
ā€¢ Operators access queues through the wrapper
ā€¢ Reduces code changes to each operator
ā€¢ LOC
ā€¢ Provenance wrapper: Ėœ8000 LOC
ā€¢ Instrumented operators: aggregation (largest) 200 LOC
Slide 2 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
Temporary Input Storage
Connection Points
ā€¢ Borealis feature for storing tuples from a stream
ā€¢ Time-based or count based eviction strategies
ā€¢ Content can be joined with streams
S1 Sout
S2 n P
Instrumented
Network
Reconstruct
Provenance
Temporary
Input Storage
PG
PG
PP
Slide 3 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
Outline
6 Implementation
7 Additional Experiments
8 Optimizations
Varying Window Size
ā€¢ Send large batch of 100,000 tuples
ā€¢ Measure completion time
ā€¢ No Retrieval
ā€¢ No Provenance/Single/Optimized/Covering Interval
ā€¢ Using the Basic Network
ā€¢ Vary Window Size
ā€¢ Slide ļ¬xed to 1
Basic Query Network
Ī±Ļƒ Ļƒ
Rewritten Version
Ļƒ
Ī±
Ļƒ
Instrumented Version
Ī±
PP
Ļƒ
PG PP
Ļƒ Ļƒ
Replay-Lazy Version
PPCP PG
ā†µ
PP
ā†µ
CPCG
āŒ¦ n
Slide 4 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
Varying Window Size
0
10
20
30
40
50
60
70
80
90
100
50 100 200 500 1000 2000
CompletionTime(sec)
Window Size
No Provenance
Single
Optimized
Covering Interval
Slide 4 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
Vary Retrieval Frequency
ā€¢ Send large batch of 100,000 tuples
ā€¢ Measure completion time
ā€¢ Instrumentation with Retrieval
ā€¢ Replay-Lazy with Retrieval
ā€¢ Using a sequence of aggregations
ā€¢ Vary Retrieval Frequency - selectivity of ļ¬lter before p-join
ā€¢ Slide ļ¬xed to 1, Window size ļ¬xed to 100
Instrumented Version
ā†µ
PPPG PP
n
Replay Lazy
PPCP PG
ā†µ
PP
ā†µ
CPCG
āŒ¦ n
Slide 5 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
Vary Retrieval Frequency
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
0.05 0.10 0.25 0.50 0.75 1 2 5 10 25 50 100
CompletionTime/
CompletionTimewithoutProvenance
Retrieval Frequency (%)
Instrumentation with Retrieval (Optimized)
Replay-Lazy with Retrieval (Optimized)
Slide 5 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
Latency
ā€¢ In Ariadne inputs are send in batches with a ļ¬xed number of tuples
(parameter)
ā€¢ Measure latency
ā€¢ Using the Basic Network
ā€¢ Vary the load
ā€¢ Vary batch size and ļ¬x the delay between consecutive batches
Slide 6 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
Latency
0
1
2
3
4
5
6
7
10 25 50 75 100
Latency(ms)
Batch Size
No Provenance
Single
Optimized
Covering Intervals
Slide 6 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
Nested Aggregations
ā€¢ Sequence of aggregation operators
ā€¢ Measure Completion time
ā€¢ Vary Number of aggregation operators
Method Number of Aggregations
1 2 3 4
No Provenance 3.1 3.9 4.8 5.7
Instr.
Generation 3.9 7.4 14.7 48.6
Retrieval 3.0 12.9 103.0 2047.0
Replay-Lazy
Cov. Inter. 3.1 4.4 5.2 6.3
Retrieval 5.2 14.7 91.1 2224.0
Rewrite 7.2 625.0 crash crash
Slide 7 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
Outline
6 Implementation
7 Additional Experiments
8 Optimizations
Compression
Rationale
ā€¢ Reduce amount of data that is shipped between operators
ā€¢ Requirements
ā€¢ Fast compression and decompression
ā€¢ Operations such as merging sets on compressed data
Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
Compression
Rationale
ā€¢ Reduce amount of data that is shipped between operators
ā€¢ Requirements
ā€¢ Fast compression and decompression
ā€¢ Operations such as merging sets on compressed data
Interval Compression
ā€¢ Input: {1, 2, 4, 5, 6, 7, 9}
ā€¢ Output: {[1 āˆ’ 2], [4 āˆ’ 6], [9 āˆ’ 9]}
Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
Compression
Rationale
ā€¢ Reduce amount of data that is shipped between operators
ā€¢ Requirements
ā€¢ Fast compression and decompression
ā€¢ Operations such as merging sets on compressed data
Interval Compression
ā€¢ Input: {1, 2, 4, 5, 6, 7, 9}
ā€¢ Output: {[1 āˆ’ 2], [4 āˆ’ 6], [9 āˆ’ 9]}
Dictionary Compression
ā€¢ Input: {1, 2, 4, 5}
ā€¢ Output: LZ77({1, 2, 4, 5})
Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
Compression
Delta Compression
ā€¢ Express provenance as delta to down-stream provenance
ā€¢ Once in a while send full provenance
ā€¢ Express provenance as delta to previous full provenance
ā€¢ Input: {1, 2, 4, 7, 9} ā† {4, 7, 9, 10} ā† {7, 9, 10, 12, 15, 19}
ā€¢ Output: {1, 2, 4, 7, 9} ā† 3 āˆ’ {10} ā† 2 āˆ’ {10, 12, 15, 19}
Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
Compression
Delta Compression
ā€¢ Express provenance as delta to down-stream provenance
ā€¢ Once in a while send full provenance
ā€¢ Express provenance as delta to previous full provenance
ā€¢ Input: {1, 2, 4, 7, 9} ā† {4, 7, 9, 10} ā† {7, 9, 10, 12, 15, 19}
ā€¢ Output: {1, 2, 4, 7, 9} ā† 3 āˆ’ {10} ā† 2 āˆ’ {10, 12, 15, 19}
Heuristic Adaptive Compression
ā€¢ No one ļ¬ts all
ā€¢ Combine compression methods
ā€¢ Heuristic rules for when to apply which method
Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
Retrieval-Lazy
Approach
ā€¢ User query that ļ¬lters out unneeded provenance
ā†µ
PPPG PP
n
Slide 9 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
Retrieval-Lazy
Approach
ā€¢ User query that ļ¬lters out unneeded provenance
ā€¢ Avoid expensive provenance reconstruction if provenance is not
needed
ā€¢ If possible ļ¬lter before reconstruction (p-join)
ā€¢ Push ļ¬lters through p-joins
ā†µ
PPPG PP
n
Slide 9 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations

More Related Content

Similar to DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"

Solving 800-90 Entropy Requirements in Software
Solving 800-90 Entropy Requirements in SoftwareSolving 800-90 Entropy Requirements in Software
Solving 800-90 Entropy Requirements in SoftwareRay Potter
Ā 
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data StreamingTutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data StreamingVincenzo Gulisano
Ā 
Application of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisApplication of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisChrisCoughlin9
Ā 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging EnvironmentsPaul Groth
Ā 
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsMichael Kehoe
Ā 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with PipeliningAneesh Raveendran
Ā 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
Ā 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Comsysto Reply GmbH
Ā 
An Introduction to Distributed Data Streaming
An Introduction to Distributed Data StreamingAn Introduction to Distributed Data Streaming
An Introduction to Distributed Data StreamingParis Carbone
Ā 
REAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTION
REAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTIONREAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTION
REAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTIONiQHub
Ā 
Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...
Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...
Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...EPAM_Systems_Bulgaria
Ā 
AOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device DriversAOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device DriversZubair Nabi
Ā 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillCharles Givre
Ā 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
Ā 
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxData
Ā 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
Ā 
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Mingliang Liu
Ā 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxData
Ā 

Similar to DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams" (20)

Solving 800-90 Entropy Requirements in Software
Solving 800-90 Entropy Requirements in SoftwareSolving 800-90 Entropy Requirements in Software
Solving 800-90 Entropy Requirements in Software
Ā 
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data StreamingTutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Ā 
Application of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data AnalysisApplication of the Actor Model to Large Scale NDE Data Analysis
Application of the Actor Model to Large Scale NDE Data Analysis
Ā 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
Ā 
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Ā 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with Pipelining
Ā 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
Ā 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
Ā 
An Introduction to Distributed Data Streaming
An Introduction to Distributed Data StreamingAn Introduction to Distributed Data Streaming
An Introduction to Distributed Data Streaming
Ā 
REAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTION
REAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTIONREAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTION
REAL-TIME PIPELINE BATCH INTERFACE DETECTION & TRANSMIX REDUCTION
Ā 
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Ā 
Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...
Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...
Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zook...
Ā 
AOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device DriversAOS Lab 8: Interrupts and Device Drivers
AOS Lab 8: Interrupts and Device Drivers
Ā 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache Drill
Ā 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Ā 
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
Ā 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
Ā 
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Combining Phase Identification and Statistic Modeling for Automated Parallel ...
Ā 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
Ā 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
Ā 

More from Boris Glavic

2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...Boris Glavic
Ā 
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...Boris Glavic
Ā 
2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata GeneratorBoris Glavic
Ā 
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...Boris Glavic
Ā 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...Boris Glavic
Ā 
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-AnswersBoris Glavic
Ā 
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSONBoris Glavic
Ā 
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersTaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersBoris Glavic
Ā 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationBoris Glavic
Ā 
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of ProvenanceTaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of ProvenanceBoris Glavic
Ā 
EDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested SubqueriesEDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested SubqueriesBoris Glavic
Ā 
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...Boris Glavic
Ā 
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...Boris Glavic
Ā 
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"Boris Glavic
Ā 
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"Boris Glavic
Ā 
TaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data MiningTaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data MiningBoris Glavic
Ā 
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...Boris Glavic
Ā 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanBoris Glavic
Ā 

More from Boris Glavic (18)

2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
Ā 
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
Ā 
2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator
Ā 
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
Ā 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
Ā 
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
Ā 
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
Ā 
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersTaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
Ā 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
Ā 
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of ProvenanceTaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
Ā 
EDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested SubqueriesEDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested Subqueries
Ā 
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
Ā 
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
Ā 
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
Ā 
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
Ā 
TaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data MiningTaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data Mining
Ā 
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
Ā 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
Ā 

Recently uploaded

High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘Damini Dixit
Ā 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSĆ©rgio Sacani
Ā 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
Ā 
ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….
ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….
ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….Nitya salvi
Ā 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
Ā 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
Ā 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
Ā 
Hire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls AgencyHire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
Ā 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
Ā 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
Ā 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
Ā 
Nanoparticles synthesis and characterizationā€‹ ā€‹
Nanoparticles synthesis and characterizationā€‹  ā€‹Nanoparticles synthesis and characterizationā€‹  ā€‹
Nanoparticles synthesis and characterizationā€‹ ā€‹kaibalyasahoo82800
Ā 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
Ā 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
Ā 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .Poonam Aher Patil
Ā 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
Ā 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
Ā 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
Ā 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSĆ©rgio Sacani
Ā 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
Ā 

Recently uploaded (20)

High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
Ā 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Ā 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
Ā 
ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….
ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….
ā¤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number šŸ’¦āœ….
Ā 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
Ā 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Ā 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
Ā 
Hire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls AgencyHire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire šŸ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Ā 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
Ā 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
Ā 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
Ā 
Nanoparticles synthesis and characterizationā€‹ ā€‹
Nanoparticles synthesis and characterizationā€‹  ā€‹Nanoparticles synthesis and characterizationā€‹  ā€‹
Nanoparticles synthesis and characterizationā€‹ ā€‹
Ā 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
Ā 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
Ā 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
Ā 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
Ā 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
Ā 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
Ā 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Ā 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
Ā 

DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"

  • 1. Ariadne: Managing Fine-Grained Provenance on Data Streams Boris Glavic1 Kyumars Sheykh Esmaili2 Peter M. Fischer3 Nesime Tatbul4 Illinois Institute of Technology1 DBGroup Nanyang Technological University2 SANDS University of Freiburg3 Web Science Intel Labs4 Intel Science and Technology Center for Big Data DEBS 2013 - July 7, 2013 - Arlington, USA
  • 2. Outline 1 Motivation 2 Reduced Eager Operator Instrumentation 3 Optimizations 4 Experiments 5 Conclusions
  • 3. Fine-grained Provenance for Data Stream Management Provenance ā€¢ Information about the origin and creation process of data ā€¢ Here: On which inputs does a given output depend on Fine-grained Provenance ā€¢ At the granularity of tuples ā€¢ Fix a tuple t in an output or intermediate stream ā€¢ On which input tuples does it depend on? Example S1 Sout S2 141 2 3 a b Slide 1 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 4. Fine-grained Provenance for Data Stream Management Provenance ā€¢ Information about the origin and creation process of data ā€¢ Here: On which inputs does a given output depend on Fine-grained Provenance ā€¢ At the granularity of tuples ā€¢ Fix a tuple t in an output or intermediate stream ā€¢ On which input tuples does it depend on? Example S1 Sout S2 141 2 3 a b Slide 1 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 5. Why (Fine-Grained) Stream Provenance? Use cases ā€¢ Ad hoc inspection ā€¢ DSMS generates alarm event based on sensor readings ā€¢ Understand why alarm was raised to act appropriately ā€¢ Stream query debugging ā€¢ Trace back and forward erroneous data items ā€¢ Auditing ā€¢ Proof of correct operation ā€¢ No access policies violated Slide 2 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 6. Challenges and Opportunities ā€¢ Online and inļ¬nite data arrival ā€¢ Traditional methods that reconstruct provenance retroactively not applicable ā€¢ Ordered data model ā€¢ Order can be exploited for compressing provenance ā€¢ Window-based processing ā€¢ Aggregation prevalent operation that has large provenance ā€¢ Low-latency requirements ā€¢ Provenance overhead should not violate latency requirements ā€¢ Non-determinism ā€¢ Sources: Load-shedding, windowing on system-time ā€¢ Provenance computation can not assume reproducibility of results Slide 3 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 7. Related Work Database Provenance ā€¢ Large body of work on provenance models ā€¢ Query rewrite-based approaches Workļ¬‚ow Provenance ā€¢ Many systems support provenance tracking ā€¢ Mostly coarse-grained provenance Data Stream Provenance ā€¢ Coarse-grained provenance ā€¢ Not detailed enough for most use-cases ā€¢ Fine-grained provenance ā€¢ Stream debugging: One-step (one operator at a time) ā€¢ Approximation techniques (strong assumptions) Slide 4 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 8. How to generate provenance? Inversion ā€¢ Infer provenance from query outputs ā€¢ Usually tracing back one operator at a time ā€¢ No storage overhead ā€¢ Only works for trivial operators Slide 5 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 9. How to generate provenance? Annotation Propagation ā€¢ Annotate data with provenance ā€¢ Propagate and manipulate annotations during query processing Propagation - Query rewrite ā€¢ Rewrite query network to propagate annotations ā€¢ Use original DSMS operators ā€¢ Easy to implement, applicable to all DSMSāˆ— ā€¢ Often results in complex and ineļ¬ƒcient networks ā€¢ Only applicable to deterministic networks Slide 5 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 10. How to generate provenance? Propagation - Operator instrumentation ā€¢ Modify DSMS operators to propagate provenance information ā€¢ More eļ¬ƒcient than query rewrite ā€¢ Tolerates certain types of non-determinism ā€¢ Keeps structure of the original query network intact ā€¢ Requires modiļ¬cation of the engine (DSMS source code needed) Slide 5 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 11. Plan of attack 1) Provenance model ā€¢ Annotated tuples ā€¢ Annotated streams 2) Provenance generation and querying ā€¢ Operator instrumentation ā€¢ Buļ¬€er inputs + reconstruction for querying ā€¢ Implementation in Ariadne (extension of Borealis) 3) Optimizations ā€¢ Compression ā€¢ Laziness ā€¢ Decoupling provenance computation from query processing Slide 6 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation
  • 12. Outline 1 Motivation 2 Reduced Eager Operator Instrumentation 3 Optimizations 4 Experiments 5 Conclusions
  • 13. Overview Operator Instrumentation ā€¢ For each operator of the system create new versions ā€¢ Provenance generator (PG): Generates provenance annotations ā€¢ Provenance propagator (PP): Propagates provenance annotations ā€¢ Instrument query (or parts thereof) to generate provenance ā€¢ Replace operators with annotating versions ā€¢ Only in the part of the network we want to track provenance Reduced Eager ā€¢ Eager ā€¢ Propagate provenance annotations during query processing ā€¢ Reduced Eager ā€¢ Propagate sets of tuple-IDs (TIDs) annotations ā€¢ Temporarily store inputs of relevant streams for restoring full tuples in provenance Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 14. Overview Querying Provenance ā€¢ Translate provenance annotations into regular streaming data ā€¢ For one output tuple t annotated with provenance {t1, . . . , tn} ā€¢ output n tuples (t, t1), (t, t2), . . . ā€¢ New operator (p-join) joins buļ¬€ered input data with provenance annotations ā€¢ Use operators of the DSMS for querying Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 15. Overview S1 Sout S2 Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 16. Overview S1 Sout S2 Instrumented Network PG PP PG Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 17. Overview S1 Sout S2 n P Instrumented Network Reconstruct Provenance Temporary Input Storage PG PG PP Slide 7 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 18. Stream Provenance Model Provenance Model ā€¢ Provenance is modelled as a set of contributing tuples ā€¢ From input of intermediate streams ā€¢ Provenance set P(t, I) ā€¢ tuple t in intermediate or result stream ā€¢ set of upstream streams I ā€¢ Which tuples belong to provenance? ā€¢ Declarative deļ¬nition that models provenance for single operators ā€¢ Transitivity Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 19. Stream Provenance Model Example ā€¢ Query over stream of temperature readings ā€¢ Filter out outliers (temperature above threshold) ā€¢ Compute average temperature over every consecutive window of two readings ā€¢ P(3:2 , {2}) = {2:2 , 2:3 } ā€¢ 3:2 generated by aggregating values from 2:2 and 2:3 ā€¢ P(3:2 , {1}) = {1:2 , 1:4 } ā€¢ 2:2 and 2:3 derived from 1:2 and 1:4 ā†µ 1 3 2 Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 20. Stream Provenance Model Provenance Annotated Streams ā€¢ Each tuple in a provenance annotated stream (PAS) is annotated with its provenance set ā€¢ according to a set of input streams I ā€¢ PAS P(O, I) ā€¢ Provenance annotated stream for stream O according to set of upstream streams I Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 21. Stream Provenance Model Example ā€¢ Query over stream of temperature readings ā€¢ Filter out outliers (temperature above threshold) ā€¢ Compute average temperature over every consecutive window of two readings ā€¢ Provenance annotated stream (PAS) P(3, {1}) ā†µ 1 32 Slide 8 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 22. Implementing Instrumented Operators Approach ā€¢ Three instrumented versions for each operator ā€¢ Provenance generators: Initialize provenance annotations ā€¢ Provenance propagators: Propagate provenance annotations ā€¢ Provenance dropper: Drop provenance from input Slide 9 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 23. Implementing Instrumented Operators Provenance Generator ā€¢ Computes one-step provenance for an operatorā€™s output based on its input ā€¢ Attach input TIDs as provenance ā€¢ For windowing operators merge TIDs from window ā€¢ Before instrumentation: I ā†’ op ā†’ O ā€¢ After instrumentation: I ā†’ opPG ā†’ P(O, I) Slide 9 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 24. Implementing Instrumented Operators Provenance Propagator ā€¢ Compute output provenance set by combining provenance sets from the input ā€¢ Before instrumentation: P(I, I ) ā†’ op ā†’ O ā€¢ After instrumentation: P(I, I ) ā†’ opPP ā†’ P(O, I ) Provenance Dropper ā€¢ Removes provenance annotations ā€¢ Instrumentation: P(I, I ) ā†’ opPD ā†’ O Slide 9 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 25. Instrumenting a Network ā€¢ User provides ā€¢ query network q ā€¢ stream O ā€¢ set of streams I upstream from O ā€¢ Output is a network that computes P(O, I) InstrumentNetwork(q,O,I) for all op on paths between I and O if op is connected to I replace op with PG version else replace op with PP version Slide 10 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 26. Instrumenting a Network Example ā€¢ Network that computes P(3, {1}) ā†µ 1 32 TID time tempature 1:1 0 102 1:2 5 105 1:3 10 399 1:4 15 85 TID avg temp 3:1 103.5 {1:1, 1:2} 3:2 95 {1:2, 1:4} PPPG Example ā€¢ Network that computes P(2, {1}) TID avg temp 3:1 103.5 3:2 95 ā†µ 1 32 TID time temperature 1:1 0 102 1:2 5 105 1:3 10 399 1:4 15 85 TID time temperature 2:1 0 102 {1:1} 2:2 5 105 {1:2} 2:3 15 85 {1:4} PG PD Slide 10 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 27. Ariadne Implementation Based on Borealis ā€¢ Set of standard DSMS operators ā€¢ Operators connected through queues ā€¢ Fixed-length tuples Changes to Borealis code ā€¢ Implement annotations ā€¢ Implement operator instrumentation ā€¢ Implement p-join and input buļ¬€ering Slide 11 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 28. Querying Provenance Approach 1 Translate provenance annotations into regular stream data to query using the host language Slide 12 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 29. Querying Provenance Approach 1 Translate provenance annotations into regular stream data to query using the host language P-join ā€¢ Joins a PAS P(O, I) with buļ¬€ered input tuples from I ā€¢ Combine output t with each tuple in its provenance P(t, I) Slide 12 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 30. Querying Provenance Example ā€¢ Tuple 3:2 with P(3:2, {1}) = {1:2, 1:4} ā†µ 1 32 TID time temperature 1:1 0 102 1:2 5 105 1:3 10 399 1:4 15 85 TID avg temp 3:1 103.5 {1:1, 1:2} 3:2 95 {1:2, 1:4} TID time temperature 2:1 0 102 {1:1} 2:2 5 105 {1:2} 2:3 15 85 {1:4} PPPG C1 n Slide 12 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Reduced Eager Operator Instrumentation
  • 31. Outline 1 Motivation 2 Reduced Eager Operator Instrumentation 3 Optimizations 4 Experiments 5 Conclusions
  • 32. Rationale Compression ā€¢ Reduce load on queues by using concise provenance representation ā€¢ Reduce memory imprint Lazy Retrieval ā€¢ Avoid reconstructing unneeded provenance Replay-Lazy ā€¢ Avoid generating unneeded provenance ā€¢ Decouple provenance computation from regular stream processing Slide 13 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
  • 33. Replay-Lazy Idea ā€¢ Avoid provenance computation if provenance is not needed ā€¢ Propagate concise representation of a super-set of the provenance ā€¢ If provenance is requested replay super-set through provenance generating network Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
  • 34. Replay-Lazy Idea ā€¢ Avoid provenance computation if provenance is not needed ā€¢ Propagate concise representation of a super-set of the provenance ā€¢ If provenance is requested replay super-set through provenance generating network Covering Intervals ā€¢ Interval: ā€¢ Minimal TID in provenance ā€¢ Maximal TID in provenance ā€¢ Constant size super-set of provenance (piggy-back!) ā€¢ Provenance: {1, 2, 4, 5, 10, 16, 65} ā€¢ Covering interval: [1, 65] Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
  • 35. Replay-Lazy Enabling Replay ā€¢ Covering interval operators: ā€¢ CG corresponds to PG ā€¢ CP corresponds to PP ā€¢ C-join operator āŠ— ā€¢ For each input get covering interval ā€¢ Retrieve all tuples from covering interval from connection point Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
  • 36. Replay-Lazy Approach 1 Instrument original network to produce covering interval 2 Filter out results and c-join 3 Feed result into provenance generating copy of the network PG ā†µ PP ā†µ CPCG S1 Sout Provenance Generating Network Filter Provenance and Fetch Tuples from Input Covering Interval Generating Network āŒ¦ ā‡” PD P Slide 14 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Optimizations
  • 37. Outline 1 Motivation 2 Reduced Eager Operator Instrumentation 3 Optimizations 4 Experiments 5 Conclusions
  • 38. Completion Time - Basic Network ā€¢ Send large batch of 100,000 tuples ā€¢ Measure completion time ā€¢ Without Retrieval/With Retrieval ā€¢ No Provenance/Instrumentation/Rewrite ā€¢ Using the Basic Network ā€¢ Vary Aggregation Window Size ā€¢ Slide ļ¬xed to 1 ā€¢ Increase window size ā†’ increase amount of provenance Basic Query Network Ī±Ļƒ Ļƒ Rewritten Version Ļƒ Ī± Ļƒ Instrumented Version Ī± PP Ļƒ PG PP Ļƒ Ļƒ Replay-Lazy Version PPCP PG ā†µ PP ā†µ CPCG āŒ¦ n Slide 15 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Experiments
  • 39. Completion Time - Basic Network 0 5 10 15 20 25 30 35 40 45 NoProvenance Instrumentation Replay-Lazy Rewrite NoProvenance Instrumentation Replay-Lazy Rewrite NoProvenance Instrumentation Replay-Lazy Rewrite NoProvenance Instrumentation Replay-Lazy Rewrite NoProvenance Instrumentation Replay-Lazy Rewrite NoProvenance Instrumentation Replay-Lazy Rewrite CompletionTime(sec) Window Size No Provenance Instrumentation (Generation) Instrumentation (Retrieval) Replay-Lazy (Covering Interval) Replay-Lazy (Retrieval) Rewrite 1008060402010 Slide 15 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Experiments
  • 40. Outline 1 Motivation 2 Reduced Eager Operator Instrumentation 3 Optimizations 4 Experiments 5 Conclusions
  • 41. Conclusions Reduced-Eager Operator Instrumentation ā€¢ Novel propagation provenance method for DSMS ā€¢ Replace operators in a query network with provenance-aware (instrumented) versions ā€¢ Keeps original network structure intact ā€¢ Deals well with non-determinism ā€¢ Flexible: ā€¢ Single- and multi-hop provenance ā€¢ Instrument parts of a network Slide 16 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
  • 42. Conclusions Optimizations ā€¢ Replay-Lazy ā€¢ Propagate only covering intervals - replay through provenance generating copy of network ā€¢ Reduce cost for low retrieval rates ā€¢ Option to decouple provenance computation ā€¢ Lazy-Retrieval ā€¢ Push ļ¬lters through p-joins ā€¢ Avoids provenance reconstruction ā€¢ Compression ā€¢ Interval ā€¢ Delta ā€¢ Dictionary Slide 16 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
  • 43. Future Work Optimizing Temporary Input Storage ā€¢ Exploiting additional knowledge in purging temporary input storage ā€¢ Static query network analysis ā€¢ Self-optimizing purging strategies ā€¢ Compression and Approximations Integration with Distributed Storage and Scaling-out ā€¢ Using distributed write-optimized storage for provenance? Order-aware Provenance Model ā€¢ Integrate order into the provenance model ā€¢ ā€œTuple t is after u in the result, because tā€™s provenance is before uā€™s provenance in the input.ā€ Slide 17 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
  • 44. Questions? Boris Glavic IIT DBGroup bglavic@iit.edu Kyumars Sheykh Esmaili Nanyang University SANDS kyumarss@ntu.edu.sg Peter M. Fischer University of Freiburg Web Science peter.ļ¬scher@ cs.uni-freiburg.de Nesime Tatbul Intel Labs Intel Science and Technology Center for Big Data tatbul@csail.mit.edu http://cs.iit.edu/~dbgroup/research/ariadne.php Slide 18 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Conclusions
  • 45. Outline 6 Implementation 7 Additional Experiments 8 Optimizations
  • 46. How to Implement Annotations? Alternatives for sending annotations ā€¢ Borealis models streams as queues of ļ¬xed-length tuples with header and payload 1 Variable-length tuples: Negates optimizations based on ļ¬xed-length 2 Split provenance sets into ļ¬x-length tuples 3 New information channels: Complex interactions with normal query processing Slide 1 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
  • 47. How to Implement Annotations? Alternatives for sending annotations ā€¢ Borealis models streams as queues of ļ¬xed-length tuples with header and payload 1 Variable-length tuples: Negates optimizations based on ļ¬xed-length 2 Split provenance sets into ļ¬x-length tuples ā€¢ Send provenance set after each output tuple ā€¢ First tuple has small header to tell downstream operators how many provenance tuples will follow ā€¢ Additional provenance tuples only store payload (TIDs) 3 New information channels: Complex interactions with normal query processing Slide 1 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
  • 48. Implementing Instrumented Operators Approach ā€¢ Factor out common functionality ā€¢ Serializing/Deserializing provenance to/from queues ā€¢ Caching provenance of windows ā€¢ Merging provenance sets ā€¢ Implemented in Provenance Wrapper ā€¢ Operators access queues through the wrapper ā€¢ Reduces code changes to each operator ā€¢ LOC ā€¢ Provenance wrapper: Ėœ8000 LOC ā€¢ Instrumented operators: aggregation (largest) 200 LOC Slide 2 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
  • 49. Temporary Input Storage Connection Points ā€¢ Borealis feature for storing tuples from a stream ā€¢ Time-based or count based eviction strategies ā€¢ Content can be joined with streams S1 Sout S2 n P Instrumented Network Reconstruct Provenance Temporary Input Storage PG PG PP Slide 3 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Implementation
  • 50. Outline 6 Implementation 7 Additional Experiments 8 Optimizations
  • 51. Varying Window Size ā€¢ Send large batch of 100,000 tuples ā€¢ Measure completion time ā€¢ No Retrieval ā€¢ No Provenance/Single/Optimized/Covering Interval ā€¢ Using the Basic Network ā€¢ Vary Window Size ā€¢ Slide ļ¬xed to 1 Basic Query Network Ī±Ļƒ Ļƒ Rewritten Version Ļƒ Ī± Ļƒ Instrumented Version Ī± PP Ļƒ PG PP Ļƒ Ļƒ Replay-Lazy Version PPCP PG ā†µ PP ā†µ CPCG āŒ¦ n Slide 4 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
  • 52. Varying Window Size 0 10 20 30 40 50 60 70 80 90 100 50 100 200 500 1000 2000 CompletionTime(sec) Window Size No Provenance Single Optimized Covering Interval Slide 4 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
  • 53. Vary Retrieval Frequency ā€¢ Send large batch of 100,000 tuples ā€¢ Measure completion time ā€¢ Instrumentation with Retrieval ā€¢ Replay-Lazy with Retrieval ā€¢ Using a sequence of aggregations ā€¢ Vary Retrieval Frequency - selectivity of ļ¬lter before p-join ā€¢ Slide ļ¬xed to 1, Window size ļ¬xed to 100 Instrumented Version ā†µ PPPG PP n Replay Lazy PPCP PG ā†µ PP ā†µ CPCG āŒ¦ n Slide 5 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
  • 54. Vary Retrieval Frequency 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 0.05 0.10 0.25 0.50 0.75 1 2 5 10 25 50 100 CompletionTime/ CompletionTimewithoutProvenance Retrieval Frequency (%) Instrumentation with Retrieval (Optimized) Replay-Lazy with Retrieval (Optimized) Slide 5 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
  • 55. Latency ā€¢ In Ariadne inputs are send in batches with a ļ¬xed number of tuples (parameter) ā€¢ Measure latency ā€¢ Using the Basic Network ā€¢ Vary the load ā€¢ Vary batch size and ļ¬x the delay between consecutive batches Slide 6 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
  • 56. Latency 0 1 2 3 4 5 6 7 10 25 50 75 100 Latency(ms) Batch Size No Provenance Single Optimized Covering Intervals Slide 6 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
  • 57. Nested Aggregations ā€¢ Sequence of aggregation operators ā€¢ Measure Completion time ā€¢ Vary Number of aggregation operators Method Number of Aggregations 1 2 3 4 No Provenance 3.1 3.9 4.8 5.7 Instr. Generation 3.9 7.4 14.7 48.6 Retrieval 3.0 12.9 103.0 2047.0 Replay-Lazy Cov. Inter. 3.1 4.4 5.2 6.3 Retrieval 5.2 14.7 91.1 2224.0 Rewrite 7.2 625.0 crash crash Slide 7 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Additional Experiments
  • 58. Outline 6 Implementation 7 Additional Experiments 8 Optimizations
  • 59. Compression Rationale ā€¢ Reduce amount of data that is shipped between operators ā€¢ Requirements ā€¢ Fast compression and decompression ā€¢ Operations such as merging sets on compressed data Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
  • 60. Compression Rationale ā€¢ Reduce amount of data that is shipped between operators ā€¢ Requirements ā€¢ Fast compression and decompression ā€¢ Operations such as merging sets on compressed data Interval Compression ā€¢ Input: {1, 2, 4, 5, 6, 7, 9} ā€¢ Output: {[1 āˆ’ 2], [4 āˆ’ 6], [9 āˆ’ 9]} Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
  • 61. Compression Rationale ā€¢ Reduce amount of data that is shipped between operators ā€¢ Requirements ā€¢ Fast compression and decompression ā€¢ Operations such as merging sets on compressed data Interval Compression ā€¢ Input: {1, 2, 4, 5, 6, 7, 9} ā€¢ Output: {[1 āˆ’ 2], [4 āˆ’ 6], [9 āˆ’ 9]} Dictionary Compression ā€¢ Input: {1, 2, 4, 5} ā€¢ Output: LZ77({1, 2, 4, 5}) Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
  • 62. Compression Delta Compression ā€¢ Express provenance as delta to down-stream provenance ā€¢ Once in a while send full provenance ā€¢ Express provenance as delta to previous full provenance ā€¢ Input: {1, 2, 4, 7, 9} ā† {4, 7, 9, 10} ā† {7, 9, 10, 12, 15, 19} ā€¢ Output: {1, 2, 4, 7, 9} ā† 3 āˆ’ {10} ā† 2 āˆ’ {10, 12, 15, 19} Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
  • 63. Compression Delta Compression ā€¢ Express provenance as delta to down-stream provenance ā€¢ Once in a while send full provenance ā€¢ Express provenance as delta to previous full provenance ā€¢ Input: {1, 2, 4, 7, 9} ā† {4, 7, 9, 10} ā† {7, 9, 10, 12, 15, 19} ā€¢ Output: {1, 2, 4, 7, 9} ā† 3 āˆ’ {10} ā† 2 āˆ’ {10, 12, 15, 19} Heuristic Adaptive Compression ā€¢ No one ļ¬ts all ā€¢ Combine compression methods ā€¢ Heuristic rules for when to apply which method Slide 8 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
  • 64. Retrieval-Lazy Approach ā€¢ User query that ļ¬lters out unneeded provenance ā†µ PPPG PP n Slide 9 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations
  • 65. Retrieval-Lazy Approach ā€¢ User query that ļ¬lters out unneeded provenance ā€¢ Avoid expensive provenance reconstruction if provenance is not needed ā€¢ If possible ļ¬lter before reconstruction (p-join) ā€¢ Push ļ¬lters through p-joins ā†µ PPPG PP n Slide 9 of 9 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne - Appendix: Optimizations