Provenance for Nested Subqueries
Boris Glavic
Database Technology Group
Department of Informatics
University of Zurich
glavic@ifi.uzh.ch

Gustavo Alonso
Systems Group
Department of Computer Science
ETH Zurich
alonso@inf.ethz.ch
Overview
1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion

1. Introduction
• Which input data item(s) of a query influenced which output data item(s)?
• Granularity
  • Tuple
  • Attribute value
  • ...
• Contribution semantics
  • Influence (Lineage / Why)
  • Copy (Where)
  • ...

1. Introduction
• Most application domains that benefit from provenance use complex queries
• Subqueries
  • Correlated
  • Nested
• Not supported by existing systems
  • Semantics not clear
  • Complex computation

1. Introduction
• Steps to solve this problem
  1. Establish sound semantics for the provenance of subqueries
  2. Algorithms for subquery provenance computation
  3. Integrate the algorithms into a Provenance Management System (Perm)

1. Introduction
• Definition of contribution semantics
  • Why/Influence-provenance
  • Introduced in [Cui, Widom ICDE '00]
  • Provenance represented as a list of subsets of the input relations
  • Defined for a single algebra operator and a single result tuple

1. Introduction
• Definition 1: For a single algebra operator op with input relations T1, ..., Tn, a list (T1*, ..., Tn*) of maximal subsets of the input relations is the provenance of a tuple t from the result of op iff:
  1. op(T1*, ..., Tn*) = {t}
  2. For all i and all t* in Ti*: op(T1*, ..., Ti-1*, {t*}, Ti+1*, ..., Tn*) ≠ ∅
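
The two conditions of Definition 1 can be checked by brute force on a tiny input. The sketch below is only an illustration under simplifying assumptions: it handles a single input relation, and the operator `select_even` and the relation `T` are hypothetical and do not come from the talk.

```python
from itertools import combinations


def select_even(rel):
    """Hypothetical operator: a selection that keeps even values."""
    return {x for x in rel if x % 2 == 0}


def satisfies_definition(op, subset, t):
    """Check conditions 1 and 2 of Definition 1 for a single-input operator."""
    # Condition 1: the candidate subset reproduces exactly {t}.
    if op(subset) != {t}:
        return False
    # Condition 2: every tuple of the subset produces a non-empty result alone.
    return all(op({t_star}) for t_star in subset)


def provenance(op, relation, t):
    """Return the maximal subsets of `relation` that satisfy both conditions."""
    candidates = [
        set(c)
        for size in range(len(relation) + 1)
        for c in combinations(sorted(relation), size)
        if satisfies_definition(op, set(c), t)
    ]
    # Keep only candidates that are not strictly contained in another one.
    return [c for c in candidates if not any(c < other for other in candidates)]


T = {1, 2, 3, 4}
print(provenance(select_even, T, 2))
```

Without condition 2, subsets such as {1, 2, 3} would also reproduce {2}; condition 2 is what removes the tuples that do not pass the selection.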

1. Introduction
• Perm
  • Provenance Extension of the Relational Model
  • Provenance Management System (PMS)
  • "Pure" relational representation of provenance
  • Provenance computation through algebraic query rewrite
  • Implemented as an extension of PostgreSQL

1. Introduction
• Provenance representation
  • Result schema: original attributes | relation 1 attributes | ... | relation n attributes
  [Figure: a query over input relations 1, 2, ..., n and its original result]

1. Introduction
• Provenance representation
  • Example: a query over R = {r1, r2} and S = {s1} with original result {t1}

  Original attributes | Relation R attributes | Relation S attributes
  t1                  | r1                    | s1
  t1                  | r2                    | s1

1. Introduction
• Provenance computation through query rewrite:
  • Given a query q, generate a query q+ that computes the provenance of q
  • Representation as defined before
  • Rewrites operate on the algebraic representation of a query
    • Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance

1. Introduction
• Rewrite rules example:

Original query:
SELECT agg, G
FROM T
GROUP BY G

Rewritten query:
SELECT agg, G, prov(T)
FROM
  (SELECT agg, G FROM T GROUP BY G) AS agg
  LEFT OUTER JOIN
  (SELECT G AS G', prov(T) FROM T+) AS prov
  ON G = G'

1. Introduction
• Rewrite rules example:

SELECT sum(revenue) AS sum, shop
FROM sales
GROUP BY shop

sales
shop    month  revenue
Migros  Jan    100
Migros  Feb    10
Migros  Mar    10
Coop    Jan    25
Coop    Feb    25

result
sum  shop
120  Migros
50   Coop

1. Introduction
SELECT sum, shop, pShop, pMonth, pRevenue
FROM
  (SELECT sum(revenue) AS sum, shop
   FROM sales GROUP BY shop) AS agg
  LEFT OUTER JOIN
  (SELECT shop AS shop2, shop AS pShop, month AS pMonth, revenue AS pRevenue
   FROM sales) AS prov
  ON shop = shop2

Result of q+
sum  shop    pShop   pMonth  pRevenue
120  Migros  Migros  Jan     100
120  Migros  Migros  Feb     10
120  Migros  Migros  Mar     10
50   Coop    Coop    Jan     25
50   Coop    Coop    Feb     25
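
The rewritten aggregation query can be tried out on the sales table from the slides. The following is a minimal sketch that uses SQLite through Python's `sqlite3` module instead of Perm/PostgreSQL; the alias `total` and the `ORDER BY` clause are assumptions added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Migros", "Jan", 100),
        ("Migros", "Feb", 10),
        ("Migros", "Mar", 10),
        ("Coop", "Jan", 25),
        ("Coop", "Feb", 25),
    ],
)

# Hand-written equivalent of the rewritten query: the aggregation is joined
# with the renamed base tuples on the group-by attribute.
REWRITTEN = """
SELECT agg.total, agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM (SELECT sum(revenue) AS total, shop FROM sales GROUP BY shop) AS agg
LEFT OUTER JOIN
     (SELECT shop AS pShop, month AS pMonth, revenue AS pRevenue
      FROM sales) AS prov
ON agg.shop = prov.pShop
ORDER BY agg.shop, prov.pRevenue DESC, prov.pMonth
"""

rows = conn.execute(REWRITTEN).fetchall()
for row in rows:
    print(row)
```

Each original result tuple appears once per contributing sales tuple, which is the duplication described for the relational provenance representation.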

2. The Provenance of Subqueries
• Sublinks
  • Subqueries in e.g. the SELECT clause
• Correlated
  • References outside attributes
• Nested
  • Sublink that contains sublinks

σ_{a IN σ_{b=3}(S)}(R)            (sublink)
σ_{a IN σ_{b=a}(S)}(R)            (correlated sublink)
σ_{a IN σ_{b = ANY(T)}(S)}(R)     (nested sublink)
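
For readers more familiar with SQL than with the algebra notation, the three kinds of sublinks can be written as SQL subqueries. This is a sketch under assumptions: the single-column tables R, S, T and their contents are hypothetical, and SQLite's `IN` stands in for `= ANY`, which SQLite does not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE R (a INTEGER);
    CREATE TABLE S (b INTEGER);
    CREATE TABLE T (c INTEGER);
    INSERT INTO R VALUES (1), (3), (4);
    INSERT INTO S VALUES (3), (4);
    INSERT INTO T VALUES (4);
    """
)

# Uncorrelated sublink: the subquery does not depend on the outer tuple.
uncorrelated = "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = 3) ORDER BY a"

# Correlated sublink: the subquery references the outer attribute R.a.
correlated = "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = R.a) ORDER BY a"

# Nested sublink: the sublink query itself contains another sublink.
nested = (
    "SELECT a FROM R WHERE a IN "
    "(SELECT b FROM S WHERE b IN (SELECT c FROM T)) ORDER BY a"
)

results = {
    name: conn.execute(sql).fetchall()
    for name, sql in [
        ("uncorrelated", uncorrelated),
        ("correlated", correlated),
        ("nested", nested),
    ]
}
print(results)
```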

2. The Provenance of Subqueries
• What is the provenance of a sublink according to Definition 1?
• Sublinks can be used in different contexts
  • Selection
  • Projection
  • ...
• A sublink either
  • produces exactly one value
  • or produces a boolean value

2. The Provenance of Subqueries
• Single uncorrelated ANY-sublinks in selection conditions
• For other
  • Types of sublinks
  • Correlated sublinks
  • Nested sublinks
  READ THE PAPER!

2. The Provenance of Subqueries
• Single uncorrelated ANY-sublinks in selection conditions
  • The result of the sublink query is fixed
  • For a given input tuple t the sublink condition is either true or false

σ_{a = ANY σ_{b=3}(S)}(R)

2. The Provenance of Subqueries
• Some terminology
  • Tsub: the query of a sublink
  • Csub: the conditional expression of a sublink

Example: q = σ_{a = ANY Π_b(S)}(R)
  Tsub = Π_b(S)
  Csub = (a = ANY Π_b(S))

2. The Provenance of Subqueries
• A sublink condition Csub can play different roles in a condition C of a selection (for one input tuple t):
  • Reqtrue: the selection condition is true iff Csub is true
  • Reqfalse: the selection condition is true iff Csub is false
  • Ind: the selection condition is true independent of the result of Csub
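
One way to make the three roles concrete is to evaluate the selection condition twice, once with the sublink condition forced to true and once forced to false. The sketch below assumes a hypothetical callable `condition(t, csub)` that stands for C with Csub replaced by a fixed boolean; this interface is an illustration, not Perm's implementation.

```python
def sublink_role(condition, t):
    """Classify the role of a sublink condition for input tuple t.

    `condition(t, csub)` is a hypothetical callable that evaluates the whole
    selection condition C for t when Csub is assumed to have value `csub`.
    """
    when_true = condition(t, True)
    when_false = condition(t, False)
    if when_true and when_false:
        return "ind"        # C holds whatever Csub evaluates to
    if when_true:
        return "reqtrue"    # C holds only if Csub is true
    if when_false:
        return "reqfalse"   # C holds only if Csub is false
    return None             # t fails the selection for either sublink result


# C = Csub OR (a > 10)
def disjunction(t, csub):
    return csub or t["a"] > 10


# C = NOT Csub
def negation(t, csub):
    return not csub


print(sublink_role(disjunction, {"a": 1}))
print(sublink_role(disjunction, {"a": 20}))
print(sublink_role(negation, {"a": 1}))
```

The role depends on the input tuple: in the disjunction the same sublink is required to be true for a = 1 but irrelevant for a = 20.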

2. The Provenance of Subqueries
• Some more terminology
  • Tsub^true(t): all tuples from the sublink query that fulfill the "unquantified" sublink condition
  • Tsub^false(t): all tuples from the sublink query that do not fulfill the "unquantified" sublink condition

Example: Csub = (a = ANY σ_{b=3}(S)), Csub° = (a = b)

2. The Provenance of Subqueries
• Back to ANY-sublinks in selections
• Proposition:

  Tsub*(t) = Tsub^true(t)   if Csub is reqtrue
  Tsub*(t) = Tsub           if Csub is reqfalse or ind

2. The Provenance of Subqueries
• Example: compute the provenance for t = (1)

q = σ_{a = ANY Π_b(S)}(R)

R(a): 1, 2, 3
S(b, c): (1, 100), (2, 10), (4, 24)
Result(a): 1, 2

2. The Provenance of Subqueries
q = σ_{a = ANY Π_b(S)}(R)
Tsub = Π_b(S)
Csub° = (a = b)
Tsub^true(t) = {(1)}
Csub is reqtrue
Tsub* = Tsub^true

2. The Provenance of Subqueries
Compute provenance for t = (1)

q = σ_{a = ANY Π_b(S)}(R)
Csub° = (a = b)
Tsub^true(t) = {(1)}

Tsub(b): 1, 2, 4
R(a): 1, 2, 3

2. The Provenance of Subqueries
Compute provenance for t = (1)

q = σ_{a = ANY Π_b(S)}(R)

R(a): 1, 2, 3
S(b, c): (1, 100), (2, 10), (4, 24)
Tsub(b): 1, 2, 4
Result(a): 1, 2

R*(a): 1
Tsub*(b): 1
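
The example can be replayed in a few lines of Python. This is a sketch only: relations are modelled as lists of tuples, the helper names are hypothetical, and the role of the sublink is passed in by hand rather than derived from the selection condition.

```python
R = [(1,), (2,), (3,)]
S = [(1, 100), (2, 10), (4, 24)]

# Tsub = projection of S on b (duplicates removed, sorted for stable output).
t_sub = sorted({(b,) for (b, _c) in S})


def csub_unquantified(t, s):
    """Csub° = (a = b): the sublink condition without the ANY quantifier."""
    return t[0] == s[0]


def t_sub_true(t):
    """Tsub^true(t): tuples of Tsub that fulfill the unquantified condition."""
    return [s for s in t_sub if csub_unquantified(t, s)]


def t_sub_star(t, role):
    """Proposition: Tsub^true(t) for reqtrue, all of Tsub for reqfalse or ind."""
    return t_sub_true(t) if role == "reqtrue" else list(t_sub)


# q = selection on R with condition a = ANY Tsub
result = [t for t in R if any(csub_unquantified(t, s) for s in t_sub)]

print(result)
print(t_sub_star((1,), "reqtrue"))
```

For t = (1) the sublink is the whole selection condition, so its role is reqtrue and only the Tsub tuple (1) ends up in the provenance, as on the slide.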

2. The Provenance of Subqueries
• Definition 1 is ambiguous for queries with more than one sublink!

q = σ_{C1 ∨ C2}(U)
C1 = (a = ANY R)    true for t
C2 = (a > ALL S)    false for t
t = (5)

R(b): 1, 5, 100
S(c): 1, 5
U(a): 5
Result(a): 5

2. The Provenance of Subqueries
q = σ_{C1 ∨ C2}(U)
C1 = (a = ANY R)
C2 = (a > ALL S)
t = (5)

Solution 1 (C1 true, C2 false):   U*(a): 5    R*(b): 5         S*(c): 1, 5
Solution 2 (C1 false, C2 true):   U*(a): 5    R*(b): 1, 100    S*(c): 1

2. The Provenance of Subqueries
• Reasons for this ambiguity:
  • The definition requires the provenance to produce the same result
  • But not to produce the same results for the sublinks
-> Definition 1 produces false positives

2. The Provenance of Subqueries
• Solution: extend Definition 1
• Add a third condition:
  • For each sublink:
    • If computed for
      • one result tuple t
      • one tuple from the provenance of the sublink
    • it produces the same sublink result as in the original query
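
The effect of the third condition can be checked on the two-sublink example. The sketch below is an illustration under assumptions: the relation contents R = {1, 5, 100} and S = {1, 5} are one reading of the slide tables, and the helper `keeps_sublink_results` is a hypothetical name for the per-tuple check.

```python
# Assumed relation contents for the example with q = selection on U
# with condition C1 OR C2 and result tuple t = (5).
R = [1, 5, 100]
S = [1, 5]


def c1(a, r):
    """C1 = (a = ANY R)."""
    return any(a == b for b in r)


def c2(a, s):
    """C2 = (a > ALL S)."""
    return all(a > c for c in s)


def keeps_sublink_results(a, r_star, s_star):
    """Third condition: each provenance tuple of a sublink, taken alone,
    must give the same sublink result as the original query."""
    original_c1 = c1(a, R)
    original_c2 = c2(a, S)
    return all(c1(a, [b]) == original_c1 for b in r_star) and all(
        c2(a, [c]) == original_c2 for c in s_star
    )


print(keeps_sublink_results(5, [5], [1, 5]))    # Solution 1 under Definition 1
print(keeps_sublink_results(5, [1, 100], [1]))  # Solution 2 under Definition 1
print(keeps_sublink_results(5, [5], [5]))       # Solution 1 after the extension
```

Under this reading, Solution 2 is rejected because it flips C1 to false, and Solution 1 shrinks to S* = {5} because the S tuple 1 alone would make C2 true.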

2. The Provenance of Subqueries
q = σ_{C1 ∨ C2}(U)
C1 = (a = ANY R)
C2 = (a > ALL S)
t = (5)

Solution 1:   U*(a): 5    R*(b): 5         S*(c): 5
Solution 2:   U*(a): 5    R*(b): 1, 100    S*(c): 1    (violates the third condition: C1 evaluates to false)

2. The Provenance of Subqueries
• How to compute the provenance according to the extended definition?
  • Use query rewrite
    • Generic strategy (Gen)
    • Specialized strategies
      • Use un-nesting
      • Check: does not change the provenance

2. The Provenance of Subqueries
• Gen-strategy
  • For queries we cannot un-nest
  1. Join the original query with all possible provenance tuples (base relations)
  2. Rewrite the sublink query
  3. Introduce additional correlation to simulate a join between 1) and 2)

3. Experimental Results
• TPC-H benchmark (10 MB size)
• TPC-H benchmark (1 GB size)

4. Conclusion
• Definition 1 fails in the presence of sublinks
  • Can be extended to deal with sublinks
• Provenance computation for sublinks
  • By using query rewrites
  • Implemented in Perm
• Future work
  • Physical provenance-aware operators

Questions?

EDBT 2009 - Provenance for Nested Subqueries


Editor's Notes

  1. If we have to shorten: remove the query rewrite example in the introduction (-3 pages). Welcome to my presentation... I am ... from ..., together with Gustavo from ... And it's about ...
  2. The talk will be organised as follows: first a short introduction to our PMS Perm, then I'll show what the provenance of a subquery looks like and how it can be computed, and, as usual, a conclusion at the end.
  3. In the context of relational databases, the main problem can be stated as: Which input... This problem can be solved for different levels of granularity of data items: tuples, attribute values and so on. (We are looking at tuple-level granularity.) There are different definitions of what "influences" means (we call this contribution semantics), for example only tuples that have been copied literally from the source to the result. (We are looking at influence contribution semantics, which has also been called Why-provenance.)
  4. Most application domains where provenance would be important use complex queries with features like aggregation, user-defined functions, and subqueries in selections and aggregations that are possibly correlated or nested. These are not supported by existing systems. To do: add the Perm introduction before this slide, talk about the ICDE paper, give reasons why it is not supported.
  5. Let's have a look at the contribution semantics we use for Perm. It was introduced by Cui et al. in 2000. Provenance is represented as a tuple of subsets ... The definition defines the provenance of single operators (but is assumed to be transitive).
  6. Condition 1 means that the provenance, if used as input of op, produces exactly t and nothing else. Condition 2 means that each tuple from each part of the provenance contributes something to the result (counterexample: for a selection, all tuples that do not fulfill the selection condition obviously do not contribute to the result, but would be in the provenance if we left out condition 2). Change slides: the disadvantage is the non-relational representation of provenance, so we decided to use the same semantics with another representation:
  7. Perm means: Provenance Extension of the Relational Model. Uses "pure" ... Computes provenance by ... Uses influence contribution semantics with tuple-level granularity. To do: move before the introduction.
  8. A single result table contains all the original result attributes and the attributes from the input relations of the query. Each original result tuple is extended by attaching contributing tuples from the base relations (and thus has to be duplicated if there is more than one contributing tuple from one of the input relations).
  9. A single result table contains all the original result attributes and the attributes from the input relations of the query. Each original result tuple is extended by attaching contributing tuples from the base relations (and thus has to be duplicated if there is more than one contributing tuple from one of the input relations).
  10. Perm uses query rewrite techniques to compute the provenance of a query, ... -intro introd.
  11. Let's have a short example (the + operator transforms an operator or algebra statement into a provenance computation; P is a list of provenance attributes of an algebra expression). Here we have the rewrite rule for the aggregation operator. -SQL! Fast too to intro
  12. Let's have a short example (the + operator transforms an operator or algebra statement into a provenance computation; P is a list of provenance attributes of an algebra expression). Here we have the rewrite rule for the aggregation operator. -SQL! Fast too to intro
  13. Let's have a short example (the + operator transforms an operator or algebra statement into a provenance computation; P is a list of provenance attributes of an algebra expression). Here we have the rewrite rule for the aggregation operator. -SQL! Fast too to intro
  14. So now let's look at what happens if we introduce subqueries.
  15. We call them sublinks to distinguish them from "normal" subqueries used in the FROM clause. We use the algebraic representation (we will see later why). Sublinks are called correlated if... Sublinks are called nested if...
  16. If we want to know what is..., then we face some problems: sublinks can be used in different contexts. And we can make some observations: for a given input tuple, a sublink produces a constant value, which is either a boolean (e.g. IN, ANY, ...) or a data type constant (subqueries without a special sublink operator).
  17. The definition can be ambiguous.
  18. This has been proven, but I will only explain the main idea behind it.
  19. Let's work through an example: we have the following query and relations and are searching for the provenance of result tuple t = (1).
  20. The following tuples from Tsub and R fulfill condition one of the definition. (Note that 2 is not included because it would cause an additional result tuple 2.)
  21. The definition can be ambiguous.
  22. The definition can be ambiguous.
  23. The definition can be ambiguous.
  24. The definition can be ambiguous.
  25. The definition can be ambiguous.
  26. The definition can be ambiguous.
  27. The definition can be ambiguous.
  28. The definition can be ambiguous.
  29. Now we know what the provenance of sublinks looks like, but how can we compute it using query rewrite rules?
  30. Now we know what the provenance of sublinks looks like, but how can we compute it using query rewrite rules?
  31. Exchange with results like in ICDE (whole of TPC-H, focus on sublinks).