Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation, and/or user-defined functions. Without support for these constructs, a provenance management system is of limited use.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
1. Provenance for Nested Subqueries
Boris Glavic
Database Technology Group
Department of Informatics
University of Zurich
glavic@ifi.uzh.ch
Gustavo Alonso
Systems Group
Department of Computer Science
ETH Zurich
alonso@inf.ethz.ch
2. 2
Overview
1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion
3. 3
1. Introduction
Query
Which input data item(s) influenced which output data item(s)?
Granularity
Tuple
Attribute Value
...
Contribution semantics
Influence (Lineage / Why)
Copy (Where)
...
4. 4
1. Introduction
Most application domains that benefit from provenance use complex queries
Subqueries
Correlated
Nested
Not supported by existing systems
Semantics not clear
Complex computation
5. 5
1. Introduction
Steps to solve this problem
1. Establish sound semantics for provenance of subqueries
2. Algorithms for subquery provenance computation
3. Integrate algorithms into a Provenance Management system (Perm)
7. 7
1. Introduction
Definition of contribution semantics
Why/Influence-provenance
Introduced in [Cui, Widom ICDE '00]
Provenance represented as a list of subsets of the input relations
Defined for a single algebra operator and a single result tuple
8. 8
1. Introduction
Definition 1: For a single algebra operator $op$ with input relations $T_1, \ldots, T_n$, a list $(T_1^*, \ldots, T_n^*)$ of maximal subsets of the input relations is the provenance of a tuple $t$ from the result of $op$ iff:
1. $op(T_1^*, \ldots, T_n^*) = \{t\}$
2. For all $i$ and all $t^* \in T_i^*$: $op(T_1^*, \ldots, T_{i-1}^*, \{t^*\}, T_{i+1}^*, \ldots, T_n^*) \neq \emptyset$
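The two conditions of Definition 1 can be checked mechanically on small inputs. The following is a minimal sketch, assuming relations are modelled as Python sets of tuples and an operator as a plain function; the names `satisfies_definition_1` and `select_revenue_over_20` are hypothetical and not part of Perm. It checks only the two conditions, not maximality.

```python
def satisfies_definition_1(op, candidate, t):
    """Check conditions 1 and 2 of Definition 1 (maximality is NOT checked)."""
    # Condition 1: the candidate subsets produce exactly the tuple t.
    if op(*candidate) != {t}:
        return False
    # Condition 2: every single tuple of every subset contributes to the result.
    for i, subset in enumerate(candidate):
        for t_star in subset:
            trial = list(candidate)
            trial[i] = {t_star}
            if not op(*trial):
                return False
    return True


def select_revenue_over_20(sales):
    """Hypothetical selection operator: keep rows whose revenue exceeds 20."""
    return {row for row in sales if row[2] > 20}


t = ("Coop", "Jan", 25)
low = ("Migros", "Feb", 10)  # does not fulfill the selection condition

exact = satisfies_definition_1(select_revenue_over_20, [{t}], t)
padded = satisfies_definition_1(select_revenue_over_20, [{t, low}], t)
print(exact, padded)
```

The second candidate still produces exactly `t`, so condition 1 alone would accept it; condition 2 is what rejects the non-contributing tuple.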
9. 9
1. Introduction
Perm
Provenance Extension of the Relational Model
Provenance Management System (PMS)
“Pure” relational representation of provenance
Provenance computation through algebraic query rewrite
Implemented as an extension of PostgreSQL
10. 10
1. Introduction
Provenance representation
[Figure: a query over input relations 1, 2, ..., n produces the original result; the provenance result schema is: Original Attributes, Relation 1 Attributes, ..., Relation n Attributes]
11. 11
1. Introduction
Provenance representation
[Figure: a query over relation R (tuples r1, r2) and relation S (tuple s1) produces the original result tuple t1]
Original Attributes | Relation R Attributes | Relation S Attributes
t1 | r1 | s1
t1 | r2 | s1
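The representation on this slide, where each original result tuple is repeated once per combination of contributing input tuples, can be sketched as follows. This is an illustration only: `provenance_rows` is a hypothetical helper, and padding a relation without contributing tuples with `None` is an assumption, not something stated on the slides.

```python
from itertools import product


def provenance_rows(result_tuple, contributing_per_relation):
    """Attach every combination of contributing tuples to the result tuple."""
    # Assumption: a relation with no contributing tuple is padded with None.
    padded = [tuples or [None] for tuples in contributing_per_relation]
    return [(result_tuple, *combo) for combo in product(*padded)]


# t1 has two contributing tuples in R and one in S.
rows = provenance_rows("t1", [["r1", "r2"], ["s1"]])
for row in rows:
    print(row)
```

The result tuple is duplicated because R contributes more than one tuple.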
12. 12
1. Introduction
Provenance computation through query rewrite:
Given a query q, generate a query q+ that computes the provenance of q
Representation as defined before
Rewrites operate on the algebraic representation of a query
Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance
13. 13
1. Introduction
Rewrite rules example:
SELECT agg, G
FROM T
GROUP BY G
is rewritten into
SELECT agg, G, prov(T)
FROM
(SELECT agg, G FROM T GROUP BY G) AS agg
LEFT OUTER JOIN
(SELECT G AS G', prov(T) FROM T+) AS prov
ON G = G'
14. 14
1. Introduction
Rewrite rules example:
SELECT sum(revenue) AS sum, shop
FROM sales
GROUP BY shop
sales
shop month revenue
Migros Jan 100
Migros Feb 10
Migros Mar 10
Coop Jan 25
Coop Feb 25
result
sum shop
120 Migros
50 Coop
15. 15
1. Introduction
SELECT sum, shop, pShop, pMonth, pRevenue
FROM
(SELECT sum(revenue) AS sum, shop
FROM sales GROUP BY shop) AS agg
LEFT OUTER JOIN
(SELECT shop AS shop', pShop, pMonth, pRevenue
FROM sales+) AS prov
ON shop = shop'
sum shop pShop pMonth pRevenue
120 Migros Migros Jan 100
120 Migros Migros Feb 10
120 Migros Migros Mar 10
50 Coop Coop Jan 25
50 Coop Coop Feb 25
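The rewritten aggregation query can be tried out on the example data. The sketch below is an assumption-laden stand-in: it uses SQLite through Python's `sqlite3` module rather than Perm on PostgreSQL, spells out the provenance attribute renaming by hand (Perm derives it from `sales+`), and uses the hypothetical names `sum_revenue` and `shop_key` for the aggregate and the join key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Migros", "Jan", 100),
        ("Migros", "Feb", 10),
        ("Migros", "Mar", 10),
        ("Coop", "Jan", 25),
        ("Coop", "Feb", 25),
    ],
)

# Original aggregation joined with the renamed input tuples on the group key.
rewritten = """
SELECT agg.sum_revenue, agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM (SELECT sum(revenue) AS sum_revenue, shop
      FROM sales GROUP BY shop) AS agg
LEFT OUTER JOIN
     (SELECT shop AS shop_key, shop AS pShop,
             month AS pMonth, revenue AS pRevenue
      FROM sales) AS prov
ON agg.shop = prov.shop_key
"""
rows = conn.execute(rewritten).fetchall()
for row in sorted(rows):
    print(row)
```

Each group's aggregate value is repeated once for every input tuple that belongs to the group, which is the duplication the provenance representation calls for.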
17. 17
2. The Provenance of Subqueries
Sublinks
Subqueries in e.g. SELECT-clause
Correlated
References outside attributes
Nested
Sublink that contains sublinks
$\sigma_{a \text{ IN } \sigma_{b=3}(S)}(R)$
$\sigma_{a \text{ IN } \sigma_{b=a}(S)}(R)$
$\sigma_{a \text{ IN } \sigma_{b = \text{ANY}(T)}(S)}(R)$
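The three algebra expressions above correspond to an uncorrelated, a correlated, and a nested sublink. A minimal sketch in SQL follows, run through Python's `sqlite3` module. The table contents and the attribute name `c` for T are made up for illustration, and `= ANY` is written as `IN` because SQLite has no `ANY` operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER);
CREATE TABLE T (c INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (2), (3), (4);
INSERT INTO T VALUES (3), (4);
""")

queries = {
    # Sublink query does not depend on the outer tuple.
    "uncorrelated": "SELECT a FROM R WHERE a IN "
                    "(SELECT b FROM S WHERE b = 3) ORDER BY a",
    # Sublink query references the outer attribute R.a.
    "correlated": "SELECT a FROM R WHERE a IN "
                  "(SELECT b FROM S WHERE b = R.a) ORDER BY a",
    # Sublink query itself contains a sublink.
    "nested": "SELECT a FROM R WHERE a IN "
              "(SELECT b FROM S WHERE b IN (SELECT c FROM T)) ORDER BY a",
}

results = {kind: [row[0] for row in conn.execute(sql)]
           for kind, sql in queries.items()}
print(results)
```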
18. 18
2. The Provenance of Subqueries
What is the provenance of a sublink according to Definition 1?
Sublinks can be used in different contexts
Selection
Projection
...
Sublink either
Produces exactly one value
Or produces a boolean value
19. 19
2. The Provenance of Subqueries
Single uncorrelated ANY-sublinks in selection conditions
For other
Types of sublinks
Correlated sublinks
Nested sublinks
20. 20
2. The Provenance of Subqueries
For other
Types of sublinks
Correlated sublinks
Nested sublinks
READ THE PAPER!
21. 21
2. The Provenance of Subqueries
Single uncorrelated ANY-sublinks in selection conditions
The result of the sublink query is fixed
For a given input tuple $t$ the sublink condition is either true or false
$\sigma_{a = \text{ANY } \sigma_{b=3}(S)}(R)$
22. 22
2. The Provenance of Subqueries
Some terminology
$T_{sub}$: the query of a sublink
$C_{sub}$: the conditional expression of a sublink
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
$T_{sub} = \Pi_b(S)$
$C_{sub} = (a = \text{ANY } \Pi_b(S))$
23. 23
2. The Provenance of Subqueries
The sublink condition $C_{sub}$ can play different roles in a condition $C$ of a selection (for one input tuple $t$):
Reqtrue: the selection condition is true iff $C_{sub}$ is true
Reqfalse: the selection condition is true iff $C_{sub}$ is false
Ind: the selection condition is true independent of the result of $C_{sub}$
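One way to make the three roles concrete is to evaluate the selection condition twice for the same input tuple, once with the sublink condition forced to true and once forced to false. The sketch below assumes the selection condition is given as a Python function of the tuple and the sublink result; `sublink_role` and the sample conditions are hypothetical.

```python
def sublink_role(condition, t):
    """Classify the role of a sublink condition for input tuple t.

    `condition(t, csub)` evaluates the selection condition C when the
    sublink condition evaluates to the boolean `csub`.
    """
    with_true = condition(t, True)
    with_false = condition(t, False)
    if with_true and with_false:
        return "ind"
    if with_true:
        return "reqtrue"
    if with_false:
        return "reqfalse"
    return None  # t is filtered out whatever the sublink returns


roles = {
    "C = Csub": sublink_role(lambda t, csub: csub, {"a": 5}),
    "C = NOT Csub": sublink_role(lambda t, csub: not csub, {"a": 5}),
    "C = (a > 0) OR Csub": sublink_role(
        lambda t, csub: t["a"] > 0 or csub, {"a": 5}),
}
print(roles)
```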
24. 24
2. The Provenance of Subqueries
Some more terminology
$T_{sub}^{true}(t)$: all tuples from the sublink query that fulfill the “unquantified” sublink condition
$T_{sub}^{false}(t)$: all tuples from the sublink query that do not fulfill the “unquantified” sublink condition
$C_{sub} = (a = \text{ANY } \sigma_{b=3}(S))$, $C_{sub}^{\circ} = (a = b)$
25. 25
2. The Provenance of Subqueries
Back to ANY-sublinks in selections
Proposition:
$$T_{sub}^{*}(t) = \begin{cases} T_{sub}^{true}(t) & \text{if } C_{sub} \text{ is reqtrue} \\ T_{sub} & \text{if } C_{sub} \text{ is reqfalse or ind} \end{cases}$$
26. 26
2. The Provenance of Subqueries
Example:
Compute provenance for $t = (1)$
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
R
a
1
2
3
S
b c
1 100
2 10
4 24
Result
a
1
2
27. 27
2. The Provenance of Subqueries
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
$T_{sub} = \Pi_b(S)$
$C_{sub}^{\circ} = (a = b)$
$T_{sub}^{true}(t) = \{(1)\}$
$C_{sub}$ is reqtrue
$T_{sub}^{*} = T_{sub}^{true}$
28. 28
2. The Provenance of Subqueries
Compute provenance for $t = (1)$
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
$C_{sub}^{\circ} = (a = b)$
$T_{sub}^{true}(t) = \{(1)\}$
R
a
1
2
3
$T_{sub}$
b
1
2
4
29. 29
2. The Provenance of Subqueries
Compute provenance for $t = (1)$
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
R
a
1
2
3
S
b c
1 100
2 10
4 24
$T_{sub}$
b
1
2
4
Result
a
1
2
$R^{*}$
a
1
$T_{sub}^{*}$
b
1
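The proposition for ANY-sublinks can be replayed on this example. The sketch below is an illustration only: it hard-codes the unquantified condition `a = b`, takes the role of the sublink condition as an argument, and uses the hypothetical function name `any_sublink_provenance`.

```python
R = [1, 2, 3]                        # attribute a
S = [(1, 100), (2, 10), (4, 24)]     # attributes b, c


def any_sublink_provenance(a_value, t_sub, role):
    """Provenance of the sublink query per the proposition.

    reqtrue         -> tuples fulfilling the unquantified condition a = b
    reqfalse or ind -> the whole sublink query result
    """
    t_sub_true = {b for b in t_sub if a_value == b}
    return t_sub_true if role == "reqtrue" else set(t_sub)


t_sub = {b for (b, _c) in S}         # T_sub = projection of S on b
t = 1                                # result tuple t = (1)

# The sublink is the entire selection condition, so its role is reqtrue.
t_sub_star = any_sublink_provenance(t, t_sub, "reqtrue")
r_star = {a for a in R if a == t}    # a selection passes t through unchanged
print(r_star, t_sub_star)
```

For an `ind` sublink the same call would return all of `t_sub`, here `{1, 2, 4}`.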
30. 30
2. The Provenance of Subqueries
Definition 1 is ambiguous for queries with more than one sublink!
$q = \sigma_{C_1 \lor C_2}(U)$
$C_1 = (a = \text{ANY } R)$
$C_2 = (a > \text{ALL } S)$
$t = (5)$
R
b
1
2
100
S
c
1
5
U
a
5
Result
a
5
31. 31
2. The Provenance of Subqueries
Definition 1 is ambiguous for queries with more than one sublink!
$q = \sigma_{C_1 \lor C_2}(U)$
$C_1 = (a = \text{ANY } R)$: true
$C_2 = (a > \text{ALL } S)$: false
$t = (5)$
32. 32
2. The Provenance of Subqueries
$q = \sigma_{C_1 \lor C_2}(U)$
$C_1 = (a = \text{ANY } R)$
$C_2 = (a > \text{ALL } S)$
$t = (5)$
Solution 1
R*
b
5
S*
c
1
5
U*
a
5
Solution 2
R*
b
1
100
S*
b
1
U*
a
5
33. 33
2. The Provenance of Subqueries
$q = \sigma_{C_1 \lor C_2}(U)$, $t = (5)$
Solution 1: $C_1$ is true, $C_2$ is false
34. 34
2. The Provenance of Subqueries
$q = \sigma_{C_1 \lor C_2}(U)$, $t = (5)$
Solution 2: $C_1$ is false, $C_2$ is true
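The ambiguity can be replayed by testing both candidate solutions against conditions 1 and 2 of Definition 1. The sketch below takes the subsets from the two solutions on the slides (R* = {5}, S* = {1, 5} and R* = {1, 100}, S* = {1}, both with U* = {5}); the function names are hypothetical, relations are modelled as sets of plain values, and maximality is not checked.

```python
def c1(a, r_rel):
    """C1: a = ANY R"""
    return any(a == b for b in r_rel)


def c2(a, s_rel):
    """C2: a > ALL S"""
    return all(a > c for c in s_rel)


def q(u_rel, r_rel, s_rel):
    """Selection on U with condition C1 OR C2."""
    return {a for a in u_rel if c1(a, r_rel) or c2(a, s_rel)}


def meets_conditions_1_and_2(u_star, r_star, s_star, t):
    if q(u_star, r_star, s_star) != {t}:                      # condition 1
        return False
    return (all(bool(q({u}, r_star, s_star)) for u in u_star)  # condition 2
            and all(bool(q(u_star, {r}, s_star)) for r in r_star)
            and all(bool(q(u_star, r_star, {s})) for s in s_star))


solutions = {
    "Solution 1": ({5}, {5}, {1, 5}),
    "Solution 2": ({5}, {1, 100}, {1}),
}
# Per solution: (conditions 1 and 2 hold, result of C1, result of C2)
report = {
    name: (meets_conditions_1_and_2(u, r, s, 5), c1(5, r), c2(5, s))
    for name, (u, r, s) in solutions.items()
}
print(report)
```

Both candidates pass the two conditions, yet they disagree on which sublink is true.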
35. 35
2. The Provenance of Subqueries
Reasons for this ambiguity:
The definition requires the provenance to produce the same result
But not to produce the same results for the sublinks
-> Definition 1 produces false positives
36. 36
2. The Provenance of Subqueries
Solution: Extend Definition 1
Add a third condition:
For each sublink: if it is computed for
one result tuple t and
one tuple from the provenance of the sublink,
it produces the same sublink result as in the original query
37. 37
2. The Provenance of Subqueries
$q = \sigma_{C_1 \lor C_2}(U)$
$C_1 = (a = \text{ANY } R)$
$C_2 = (a > \text{ALL } S)$
$t = (5)$
Solution 1
R*
b
5
S*
c
5
U*
a
5
Solution 2
R*
b
1
100
S*
b
1
U*
a
5
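The effect of the third condition can be sketched under one reading of it: for every sublink, evaluating the sublink on a single tuple of its provenance must give the same result the sublink had in the original query. The original results used below (C1 true, C2 false for t = (5)) are the ones annotated on the earlier slide; `meets_condition_3` is a hypothetical name, and the paper's formal condition may differ in detail.

```python
# Sublink results in the original query for t = (5), as annotated on the slides.
ORIGINAL_RESULT = {"C1": True, "C2": False}


def meets_condition_3(a, r_star, s_star):
    """Each single provenance tuple must reproduce the original sublink result."""
    c1_ok = all((a == r) == ORIGINAL_RESULT["C1"] for r in r_star)  # a = ANY {r}
    c2_ok = all((a > s) == ORIGINAL_RESULT["C2"] for s in s_star)   # a > ALL {s}
    return c1_ok and c2_ok


candidates = {
    "Solution 1 before extension": ({5}, {1, 5}),
    "Solution 1 after extension": ({5}, {5}),
    "Solution 2": ({1, 100}, {1}),
}
verdict = {name: meets_condition_3(5, r, s)
           for name, (r, s) in candidates.items()}
print(verdict)
```

Under this reading only the revised Solution 1 survives: the tuple 1 in S* would make C2 true, and the tuples in Solution 2's R* would make C1 false.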
38. 38
2. The Provenance of Subqueries
How to compute the provenance according to the extended definition?
Use query rewrite
Generic strategy (Gen)
Specialized strategies
Use un-nesting
Check: does not change the provenance
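As an illustration of the un-nesting idea, the ANY-sublink from the earlier example can be turned into a join whose right-hand side supplies the provenance attributes. This is a sketch only, not Perm's actual rewrite rule: it uses SQLite via Python's `sqlite3`, writes `= ANY` as `IN`, and the `prov_*` column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER, c INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (1, 100), (2, 10), (4, 24);
""")

# Original query with the sublink.
original = conn.execute(
    "SELECT a FROM R WHERE a IN (SELECT b FROM S) ORDER BY a"
).fetchall()

# Un-nested form: a join that also carries the contributing tuples.
unnested = conn.execute("""
SELECT R.a, R.a AS prov_R_a, S.b AS prov_S_b, S.c AS prov_S_c
FROM R JOIN S ON R.a = S.b
ORDER BY R.a
""").fetchall()

print(original)
print(unnested)
```

On this data the join returns the same result tuples as the original query, each extended with the R tuple and the S tuple that made the sublink condition true.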
39. 39
2. The Provenance of Subqueries
Gen-strategy
For queries we cannot un-nest
1. Join original query with all possible provenance tuples (base relations)
2. Rewrite the sublink query
3. Introduce additional correlation to simulate a join between 1) and 2)
41. 41
3. Experimental Results
TPC-H benchmark (10 MB size)
42. 42
3. Experimental Results
TPC-H benchmark (1 GB size)
44. 44
4. Conclusion
Definition 1 fails in the presence of sublinks
Can be extended to deal with sublinks
Provenance computation for sublinks
By using query rewrites
Implemented in Perm
Future Work
Physical provenance-aware operators
45. 45
Questions
? ? ?
Editor's Notes
If we have to shorten:
-remove query rewrite example in the introduction (-3 pages)
Welcome to my presentation... My from ..., together with Gustavo from ... And it's about ....
The talk will be organised as follows: first a short introduction to our PMS Perm, then I'll show what the provenance of a subquery looks like and how it can be computed, and, as usual, a conclusion at the end
In the context of relational databases, the main problem can be stated as:
Which input...
This problem can be solved for
-different levels of granularity of data items: tuples, attribute values and so on. (We are looking at tuple-level granularity.)
-different definitions of what "influences" means (we call this contribution semantics),
for example only tuples that have been copied literally from the source to the result.
(We are looking at influence contribution semantics, which has also been called Why-provenance.)
Most application domains where provenance would be important use complex queries with features like aggregation, user-defined functions, and subqueries (in selections, aggregations, and so on) that are possibly correlated or nested
Oooh, these are not supported by existing systems
-add Perm intro before this one, talk about ICDE paper
-reasons why it is not supported
Let's have a look at the contribution semantics we use for Perm. It was introduced by Cui et al. in 2000. Provenance was represented as a tuple of subsets ... and the definition defines the provenance of single operators (but is assumed to be transitive)
Condition 1 means that the provenance, if used as input of op, produces exactly t and nothing else
Condition 2 means that each tuple from each part of the provenance contributes something to the result (counterexample: for a selection, all tuples that do not fulfill the selection condition obviously do not contribute to the result, but they would be in the provenance if we left out condition 2)
Change slides: the disadvantage is the non relational representation of provenance, so we decided to use same semantics with another representation:
Perm means: provenance extension of the relational model
Uses “pure” ...
Computes prov. By ....
Uses influence contribution semantics with tuple level granularity
-move before intro introduction
A single result table contains all the original result attributes and the attributes from the input relations of the query. Each original result tuple is extended by attaching contributing tuples from the base relations (and thus has to be duplicated if there is more than one contributing tuple from one of the input relations)
Perm uses query rewrite techniques to compute the provenance of a query, ...
-intro introd.
Let's have a short example: (the + operator transforms an operator or algebra statement into a provenance computation) / P is a list of provenance attributes of an algebra expression. Here we have the rewrite rule for the aggregation operator
-SQL! Fast too to intro
So now look what happens if we introduce subqueries
We call them sublinks to distinguish them from “normal” subqueries used in the FROM clause. We use the algebraic representation (we will see later why)
Sublinks are called correlated if...
Sublinks are called nested if...
If we want to know, What is.... Then we are facing some problems
sublinks can be used in different contexts:
And we can make some observations:
(for a given input tuple) a sublink produces a constant value (which is either a boolean (e.g. IN, ANY, ...) or a data type constant (subqueries without a special sublink operator))
Definition can be ambiguous
We have proven this, but I will only explain the main idea behind it
Let's walk through an example: we have the following query and relations and are searching for the provenance of result tuple t = (1)
The following tuples from Tsub and R fulfill condition one of the definition. (Note that 2 is not included because it would cause an additional result tuple 2)
Definition can be ambiguous
Now we know what the provenance of sublinks looks like, but how can we compute it using query rewrite rules?
Exchange with results like in ICDE (whole of TPCH, focus on sublinks)