EDBT 2009 - Provenance for Nested Subqueries

Provenance for Nested Subqueries
Boris Glavic
Database Technology Group
Department of Informatics
University of Zurich
glavic@ifi.uzh.ch
Gustavo Alonso
Systems Group
Department of Computer Science
ETH Zurich
alonso@inf.ethz.ch

Overview
1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion

1. Introduction
- Which input data item(s) of a query influenced which output data item(s)?
- Granularity
  - Tuple
  - Attribute value
  - ...
- Contribution semantics
  - Influence (Lineage / Why)
  - Copy (Where)
  - ...

- Most application domains that benefit from provenance use complex queries
- Subqueries
  - Correlated
  - Nested
- Not supported by existing systems
  - Semantics not clear
  - Complex computation

- Steps to solve this problem
  1. Establish sound semantics for provenance of subqueries
  2. Algorithms for subquery provenance computation
  3. Integrate algorithms into a Provenance Management System (Perm)

- Definition of contribution semantics
  - Why/Influence-provenance
  - Introduced in [Cui, Widom ICDE '00]
  - Provenance represented as a list of subsets of the input relations
  - Defined for a single algebra operator and a single result tuple

- Definition 1: For a single algebra operator op with input relations T1, ..., Tn, a list (T1*, ..., Tn*) of maximal subsets of the input relations is the provenance of a tuple t from the result of op iff:
  - op(T1*, ..., Tn*) = {t}
  - For all i and all t* in Ti*: op(T1*, ..., Ti-1*, {t*}, Ti+1*, ..., Tn*) ≠ ∅
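Definition 1 can be checked by brute force on small inputs. The sketch below is only an illustration, not how Perm computes provenance: it enumerates all lists of subsets, keeps those that satisfy both conditions, and returns the maximal ones. The helper names (`subsets`, `satisfies_definition_1`, `provenance`, `sum_by_shop`) are hypothetical; the data is the sales relation from the aggregation example later in this talk.

```python
from itertools import combinations, product


def subsets(relation):
    """All subsets of a relation (a set of tuples), as frozensets."""
    rows = sorted(relation)
    return [frozenset(c) for k in range(len(rows) + 1)
            for c in combinations(rows, k)]


def satisfies_definition_1(op, candidate, t):
    """Check both conditions of Definition 1 for one list of subsets."""
    if op(*candidate) != {t}:                  # condition 1
        return False
    for i, subset in enumerate(candidate):     # condition 2
        for t_star in subset:
            args = list(candidate)
            args[i] = frozenset({t_star})
            if not op(*args):
                return False
    return True


def provenance(op, inputs, t):
    """Return every maximal list of subsets that satisfies Definition 1."""
    valid = [c for c in product(*(subsets(r) for r in inputs))
             if satisfies_definition_1(op, c, t)]

    def dominated(c):
        return any(c != d and all(a <= b for a, b in zip(c, d)) for d in valid)

    return [c for c in valid if not dominated(c)]


def sum_by_shop(sales):
    """Aggregation operator: sum(revenue) grouped by shop."""
    totals = {}
    for shop, _month, revenue in sales:
        totals[shop] = totals.get(shop, 0) + revenue
    return {(total, shop) for shop, total in totals.items()}


SALES = {("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
         ("Coop", "Jan", 25), ("Coop", "Feb", 25)}

# Provenance of the result tuple (120, Migros): the three Migros tuples.
migros_provenance = provenance(sum_by_shop, [SALES], (120, "Migros"))
```

The enumeration is exponential in the input size, so this is useful only for sanity-checking the definition on toy relations.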

- Perm
  - Provenance Extension of the Relational Model
  - Provenance Management System (PMS)
  - "Pure" relational representation of provenance
  - Provenance computation through algebraic query rewrite
  - Implemented as an extension of PostgreSQL

- Provenance representation
[Figure: a query over input relations 1, 2, ..., n. The provenance result has the columns: Original Attributes | Relation 1 Attributes | ... | Relation n Attributes.]

- Provenance representation
[Figure: a query over relations R (tuples r1, r2) and S (tuple s1) with original result tuple t1.]

Original Attributes | Relation R Attributes | Relation S Attributes
t1                  | r1                    | s1
t1                  | r2                    | s1

- Provenance computation through query rewrite:
  - Given a query q, generate a query q+ that computes the provenance of q
  - Representation as defined before
  - Rewrites operate on the algebraic representation of a query
  - Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance

- Rewrite rules example:

Original query:
SELECT agg, G
FROM T
GROUP BY G

Rewritten query:
SELECT agg, G, prov(T)
FROM
  (SELECT agg, G FROM T GROUP BY G) AS agg
  LEFT OUTER JOIN
  (SELECT G AS G', prov(T) FROM T+) AS prov
  ON G = G'

- Rewrite rules example:

SELECT sum(revenue) AS sum, shop
FROM sales
GROUP BY shop

sales
shop    month  revenue
Migros  Jan    100
Migros  Feb    10
Migros  Mar    10
Coop    Jan    25
Coop    Feb    25

result
sum  shop
120  Migros
50   Coop

SELECT sum, shop, pShop, pMonth, pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop
   FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop_p, shop AS pShop, month AS pMonth, revenue AS pRevenue
   FROM sales) AS prov
  ON shop = shop_p

Result of the rewritten query:
sum  shop    pShop   pMonth  pRevenue
120  Migros  Migros  Jan     100
120  Migros  Migros  Feb     10
120  Migros  Migros  Mar     10
50   Coop    Coop    Jan     25
50   Coop    Coop    Feb     25
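The rewritten aggregation query can be tried out without a Perm installation. The following is a minimal sketch that runs it against an in-memory SQLite database through Python's `sqlite3` module; SQLite stands in for PostgreSQL here, and the join alias `shop_p` and the explicit `p...` column aliases are assumptions made so that the statement is valid SQL.

```python
import sqlite3

SALES = [("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
         ("Coop", "Jan", 25), ("Coop", "Feb", 25)]

# Aggregation joined with its own input: one output row per contributing tuple.
REWRITTEN_QUERY = """
SELECT agg."sum", agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM (SELECT sum(revenue) AS "sum", shop
      FROM sales GROUP BY shop) AS agg
LEFT OUTER JOIN
     (SELECT shop AS shop_p, shop AS pShop,
             month AS pMonth, revenue AS pRevenue
      FROM sales) AS prov
ON agg.shop = prov.shop_p
"""


def run_rewritten_query(rows):
    """Load `rows` into a fresh sales table and run the rewritten query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(REWRITTEN_QUERY).fetchall()
    conn.close()
    return result


provenance_rows = run_rewritten_query(SALES)
for row in sorted(provenance_rows):
    print(row)
```

Each original result tuple is repeated once per contributing sales tuple, which matches the relational provenance representation described earlier.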

2. The Provenance of Subqueries

- Sublinks
  - Subqueries in e.g. the SELECT clause
- Correlated
  - References outside attributes
- Nested
  - Sublink that contains sublinks

Sublink:    σ_{a IN σ_{b=3}(S)}(R)
Correlated: σ_{a IN σ_{b=a}(S)}(R)
Nested:     σ_{a IN σ_{b = ANY (T)}(S)}(R)
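The three kinds of sublinks can be written as SQL and evaluated on toy data. This is a sketch under several assumptions: SQLite (via Python's `sqlite3`) is used instead of PostgreSQL, `IN` replaces `= ANY` because SQLite has no `ANY`, the sublink queries project on `b`, the contents of R and S are borrowed from the worked example later in the talk, and the relation T with column `d` is hypothetical.

```python
import sqlite3

SETUP = """
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER, c INTEGER);
CREATE TABLE T (d INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (1, 100), (2, 10), (4, 24);
INSERT INTO T VALUES (2), (4);
"""


def run(sql):
    """Evaluate a single-column query on the toy database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    values = sorted(row[0] for row in conn.execute(sql))
    conn.close()
    return values


# Uncorrelated sublink: the inner query does not depend on the outer tuple.
uncorrelated = run(
    "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = 3)")

# Correlated sublink: the inner query references the outer attribute R.a.
correlated = run(
    "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = R.a)")

# Nested sublink: the inner query itself contains a sublink over T.
nested = run(
    "SELECT a FROM R WHERE a IN "
    "(SELECT b FROM S WHERE b IN (SELECT d FROM T))")

print(uncorrelated, correlated, nested)
```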

- What is the provenance of a sublink according to Definition 1?
- Sublinks can be used in different contexts
  - Selection
  - Projection
  - ...
- A sublink either
  - produces exactly one value
  - or produces a boolean value

- Single uncorrelated ANY-sublinks in selection conditions
- For other types of sublinks, correlated sublinks, and nested sublinks: read the paper!

- Single uncorrelated ANY-sublinks in selection conditions
  - The result of the sublink query is fixed
  - For a given input tuple t the sublink condition is either true or false

σ_{a = ANY σ_{b=3}(S)}(R)

- Some terminology
  - Tsub: the query of a sublink
  - Csub: the conditional expression of a sublink

q = σ_{a = ANY Π_b(S)}(R)
Tsub = Π_b(S)
Csub = (a = ANY Π_b(S))

- The sublink condition Csub can play different roles in a condition C of a selection (for one input tuple t):
  - reqtrue: the selection condition is true iff Csub is true
  - reqfalse: the selection condition is true iff Csub is false
  - ind: the selection condition is true independent of the result of Csub
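One way to read the three roles is as the outcome of evaluating the selection condition C twice, once with Csub forced to true and once with Csub forced to false. The sketch below follows that reading; the function name `sublink_role` and the sample conditions are hypothetical, and a condition is modelled as a Python function of the input tuple and the boolean result of Csub.

```python
def sublink_role(condition, t):
    """Classify the role of Csub in `condition` for the input tuple t.

    `condition(t, csub)` evaluates the selection condition C with the
    sublink result replaced by the boolean `csub`.
    """
    holds_if_true = condition(t, True)
    holds_if_false = condition(t, False)
    if holds_if_true and holds_if_false:
        return "ind"
    if holds_if_true:
        return "reqtrue"
    if holds_if_false:
        return "reqfalse"
    return None  # t fails the selection whatever Csub evaluates to


# C = Csub
plain = sublink_role(lambda t, csub: csub, (1,))
# C = NOT Csub
negated = sublink_role(lambda t, csub: not csub, (1,))
# C = Csub OR a > 2, for a tuple that satisfies a > 2
independent = sublink_role(lambda t, csub: csub or t[0] > 2, (3,))
# C = Csub OR a > 2, for a tuple that does not satisfy a > 2
needed = sublink_role(lambda t, csub: csub or t[0] > 2, (1,))

print(plain, negated, independent, needed)
```

The role depends on the tuple as well as on the condition: the same disjunction is `ind` for one tuple and `reqtrue` for another.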

- Some more terminology
  - Tsub^true(t): all tuples from the sublink query that fulfill the "unquantified" sublink condition Csub°
  - Tsub^false(t): all tuples from the sublink query that do not fulfill the "unquantified" sublink condition Csub°

Csub = (a = ANY σ_{b=3}(S))
Csub° = (a = b)

- Back to ANY-sublinks in selections
- Proposition:

Tsub*(t) = Tsub^true(t)   if Csub is reqtrue
Tsub*(t) = Tsub           if Csub is reqfalse or ind

- Example: compute the provenance for t = (1)

R (a): 1, 2, 3
S (b, c): (1, 100), (2, 10), (4, 24)
Result (a): 1, 2

Tsub = Π_b(S), i.e. Tsub (b): 1, 2, 4
Csub° = (a = b)
Tsub^true(t) = {(1)}
Csub is reqtrue, so Tsub* = Tsub^true

R* (a): 1
Tsub* (b): 1
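The proposition for ANY-sublinks and the worked example can be replayed in a few lines. This is a sketch, not Perm's implementation: the function names are hypothetical, relations are modelled as Python sets of tuples, and the role of Csub is passed in as a string rather than derived from the selection condition.

```python
def tsub_true(t, tsub, unquantified):
    """Tuples of the sublink query that fulfill the unquantified condition."""
    return {s for s in tsub if unquantified(t, s)}


def any_sublink_provenance(t, tsub, unquantified, role):
    """Tsub*(t) for an ANY-sublink in a selection, per the proposition."""
    if role == "reqtrue":
        return tsub_true(t, tsub, unquantified)
    if role in ("reqfalse", "ind"):
        return set(tsub)
    raise ValueError(f"unknown role: {role!r}")


S = {(1, 100), (2, 10), (4, 24)}
TSUB = {(b,) for b, _c in S}              # Tsub = Π_b(S)


def a_equals_b(t, s):
    """Unquantified condition Csub° = (a = b)."""
    return t[0] == s[0]


t = (1,)
provenance_reqtrue = any_sublink_provenance(t, TSUB, a_equals_b, "reqtrue")
provenance_ind = any_sublink_provenance(t, TSUB, a_equals_b, "ind")

print(provenance_reqtrue)   # {(1,)}
print(sorted(provenance_ind))
```

For t = (1) and role reqtrue only the matching tuple (1) is kept; when the sublink result does not decide the selection (ind), the whole of Tsub is returned.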

- Definition 1 is ambiguous for queries with more than one sublink!

q = σ_{C1 ∨ C2}(U)
C1 = (a = ANY R)
C2 = (a > ALL S)
t = (5)

U (a): 5
R (b): 1, 2, 100
S (c): 1, 5
Result (a): 5

[Slide annotation added in a second build: "true", "false".]

q = σ_{C1 ∨ C2}(U)
C1 = (a = ANY R)
C2 = (a > ALL S)
t = (5)

Solution 1: U* (a): 5; R* (b): 5; S* (c): 1, 5
Solution 2: U* (a): 5; R* (b): 1, 100; S* (b): 1

[Slide annotations added in later builds: "true", "false", then "false", "true".]

- Reasons for this ambiguity:
  - The definition requires the provenance to produce the same result
  - But not to produce the same results for the sublinks
-> Definition 1 produces false positives

- Solution: extend Definition 1
  - Add a third condition:
    - For each sublink: if computed for one result tuple t and one tuple from the provenance of the sublink, the sublink produces the same result as in the original query

q = σ_{C1 ∨ C2}(U)
C1 = (a = ANY R)
C2 = (a > ALL S)
t = (5)

[Figure: provenance tables after applying the extended definition. First group: U* (a): 5; R* (b): 5; S* (c): 5. Second group: U* (a): 5; R* (b): 1, 100; S* (b): 1.]
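The ambiguity and its repair can be reproduced by brute force on a small instance. The sketch below makes several assumptions: the contents of R are hypothetical and chosen so that C1 is true for t = (5), relations are sets of plain integers, and `same_sublink_results` is one reading of the third condition (each single provenance tuple of a sublink must reproduce the sublink result of the original query). It is not the algorithm used by Perm.

```python
from itertools import combinations, product

U = {5}
R = {1, 5, 100}      # hypothetical contents, chosen so that C1 holds for 5
S = {1, 5}
T = 5                # the result tuple whose provenance is computed


def c1(a, r):
    return a in r                        # a = ANY R


def c2(a, s):
    return all(a > x for x in s)         # a > ALL S


def q(u, r, s):
    return {a for a in u if c1(a, r) or c2(a, s)}


def subsets(relation):
    rows = sorted(relation)
    return [frozenset(c) for k in range(len(rows) + 1)
            for c in combinations(rows, k)]


def definition_1(u, r, s):
    """Conditions 1 and 2 of Definition 1 for the candidate (u, r, s)."""
    if q(u, r, s) != {T}:
        return False
    return (all(q({x}, r, s) for x in u)
            and all(q(u, {x}, s) for x in r)
            and all(q(u, r, {x}) for x in s))


def same_sublink_results(u, r, s):
    """Third condition: single provenance tuples keep each sublink result."""
    return (all(c1(T, {x}) == c1(T, R) for x in r)
            and all(c2(T, {x}) == c2(T, S) for x in s))


def maximal(candidates):
    return [c for c in candidates
            if not any(c != d and all(a <= b for a, b in zip(c, d))
                       for d in candidates)]


candidates = list(product(subsets(U), subsets(R), subsets(S)))
by_definition_1 = maximal([c for c in candidates if definition_1(*c)])
by_extended_definition = maximal(
    [c for c in candidates if definition_1(*c) and same_sublink_results(*c)])

print(len(by_definition_1), len(by_extended_definition))
```

On this instance Definition 1 alone admits two maximal solutions, one of which makes C2 true even though C2 is false in the original query. Adding the third condition leaves a single solution in which R* and S* both contain only the tuple 5.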

- How to compute the provenance according to the extended definition?
  - Use query rewrite
    - Generic strategy (Gen)
    - Specialized strategies
  - Use un-nesting
    - Check: does not change the provenance

- Gen strategy
  - For queries we cannot un-nest
  1. Join the original query with all possible provenance tuples (base relations)
  2. Rewrite the sublink query
  3. Introduce additional correlation to simulate a join between 1) and 2)

3. Experimental Results

- TPC-H benchmark (10 MB size)

- TPC-H benchmark (1 GB size)

4. Conclusion
- Definition 1 fails in the presence of sublinks
  - Can be extended to deal with sublinks
- Provenance computation for sublinks
  - By using query rewrites
  - Implemented in Perm
- Future work
  - Physical provenance-aware operators

Questions?
