
EDBT 2009 - Provenance for Nested Subqueries


Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use.

In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.

Published in: Science, Technology


  1. Provenance for Nested Subqueries
      • Boris Glavic, Database Technology Group, Department of Informatics, University of Zurich, glavic@ifi.uzh.ch
      • Gustavo Alonso, Systems Group, Department of Computer Science, ETH Zurich, alonso@inf.ethz.ch
  2. Overview: 1. Introduction; 2. The Provenance of Subqueries; 3. Experimental Results; 4. Conclusion
  3. 1. Introduction
      • Query: which input data item(s) influenced which output data item(s)?
      • Granularity: tuple, attribute value, ...
      • Contribution semantics: influence (Lineage / Why), copy (Where), ...
  4. 1. Introduction
      • Most application domains that benefit from provenance use complex queries
      • Subqueries: correlated, nested
      • Not supported by existing systems: semantics not clear, complex computation
  5. 1. Introduction
      • Steps to solve this problem:
        1. Establish sound semantics for provenance of subqueries
        2. Algorithms for subquery provenance computation
        3. Integrate algorithms into a provenance management system (Perm)
  7. 1. Introduction
      • Definition of contribution semantics: Why/Influence-provenance
      • Introduced in [Cui, Widom ICDE '00]
      • Provenance represented as a list of subsets of the input relations
      • Defined for a single algebra operator and a single result tuple
  8. 1. Introduction
      • Definition 1: For a single algebra operator $op$ with input relations $T_1, \dots, T_n$, a list $(T_1^*, \dots, T_n^*)$ of maximal subsets of the input relations is the provenance of a tuple $t$ from the result of $op$ iff:
        • $op(T_1^*, \dots, T_n^*) = t$
        • For all $i$ and $t^*$ with $t^* \in T_i^*$: $op(T_1^*, \dots, T_{i-1}^*, t^*, T_{i+1}^*, \dots, T_n^*) \neq \emptyset$
  9. 1. Introduction
      • Perm: Provenance Extension of the Relational Model
      • Provenance Management System (PMS)
      • "Pure" relational representation of provenance
      • Provenance computation through algebraic query rewrite
      • Implemented as an extension of PostgreSQL
  10. 1. Introduction
      • Provenance representation (figure): a query over input relations 1, 2, ..., n and its original result
      • Columns of the provenance result: Original Attributes | Relation 1 Attributes | ... | Relation n Attributes
  11. 1. Introduction
      • Provenance representation (figure): a query over R (tuples r1, r2) and S (tuple s1) with original result tuple t1
      • Provenance result:
        Original Attributes | Relation R Attributes | Relation S Attributes
        t1 | r1 | s1
        t1 | r2 | s1
  12. 1. Introduction
      • Provenance computation through query rewrite:
        • Given a query q, generate a query q+ that computes the provenance of q
        • Representation as defined before
        • Rewrites operate on the algebraic representation of a query
        • Rewrite rules for each operator op transform op into an algebra statement that propagates the provenance
  13. 1. Introduction
      • Rewrite rules example (aggregation)
      • Original query:
        SELECT agg, G FROM T GROUP BY G
      • Rewritten query:
        SELECT agg, G, prov(T)
        FROM (SELECT agg, G FROM T GROUP BY G) AS agg
        LEFT OUTER JOIN (SELECT G AS G', prov(T) FROM T+) AS prov
        ON G = G'
  14. 1. Introduction
      • Rewrite rules example:
        SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop
      • Input relation sales:
        shop | month | revenue
        Migros | Jan | 100
        Migros | Feb | 10
        Migros | Mar | 10
        Coop | Jan | 25
        Coop | Feb | 25
      • Result:
        sum | shop
        120 | Migros
        50 | Coop
  15. 1. Introduction
      • Rewritten query:
        SELECT sum, shop, pShop, pMonth, pRevenue
        FROM (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
        LEFT OUTER JOIN (SELECT shop AS shop', pShop, pMonth, pRevenue FROM sales+) AS prov
        ON shop = shop'
      • Result:
        sum | shop | pShop | pMonth | pRevenue
        120 | Migros | Migros | Jan | 100
        120 | Migros | Migros | Feb | 10
        120 | Migros | Migros | Mar | 10
        50 | Coop | Coop | Jan | 25
        50 | Coop | Coop | Feb | 25
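
The aggregation rewrite from slides 13 to 15 can be tried out on the sales example. The following is a minimal sketch, not Perm itself: it assumes SQLite (through Python's standard `sqlite3` module) as a stand-in engine, renames the `sum` alias to `revenue_sum` and the join attribute `shop'` to `join_shop` for portability, and spells out the provenance attributes `pShop`, `pMonth`, `pRevenue` by hand as renamed copies of the `sales` columns.

```python
import sqlite3

# In-memory database holding the sales relation from slide 14.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER);
    INSERT INTO sales VALUES
        ('Migros', 'Jan', 100), ('Migros', 'Feb', 10), ('Migros', 'Mar', 10),
        ('Coop', 'Jan', 25), ('Coop', 'Feb', 25);
""")

# Original aggregation query (alias renamed; an assumption of this sketch).
original_query = """
    SELECT sum(revenue) AS revenue_sum, shop FROM sales GROUP BY shop
"""

# Hand-written counterpart of the rewritten query: the aggregation is joined
# with the renamed input tuples on the group-by attribute.
rewritten_query = """
    SELECT agg.revenue_sum, agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
    FROM (SELECT sum(revenue) AS revenue_sum, shop
          FROM sales GROUP BY shop) AS agg
    LEFT OUTER JOIN
         (SELECT shop AS join_shop, shop AS pShop,
                 month AS pMonth, revenue AS pRevenue
          FROM sales) AS prov
      ON agg.shop = prov.join_shop
"""

result_rows = sorted(conn.execute(original_query).fetchall())
provenance_rows = sorted(conn.execute(rewritten_query).fetchall())

for row in provenance_rows:
    print(row)
```

Each original result tuple is repeated once per contributing `sales` tuple, which is the "pure" relational representation the slides describe.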
  16. Overview: 1. Introduction; 2. The Provenance of Subqueries; 3. Experimental Results; 4. Conclusion
  17. 2. The Provenance of Subqueries
      • Sublinks: subqueries in, e.g., the SELECT clause. Example: $\sigma_{a \text{ IN } \sigma_{b=3}(S)}(R)$
      • Correlated: references outside attributes. Example: $\sigma_{a \text{ IN } \sigma_{b=a}(S)}(R)$
      • Nested: a sublink that contains sublinks. Example: $\sigma_{a \text{ IN } \sigma_{b = \text{ANY}(T)}(S)}(R)$
  18. 2. The Provenance of Subqueries
      • What is the provenance of a sublink according to Definition 1?
      • Sublinks can be used in different contexts: selection, projection, ...
      • A sublink either produces exactly one value or produces a boolean value
  19. 2. The Provenance of Subqueries
      • Single uncorrelated ANY-sublinks in selection conditions
      • For other:
        • Types of sublinks
        • Correlated sublinks
        • Nested sublinks
  20. 2. The Provenance of Subqueries
      • For other types of sublinks, correlated sublinks, and nested sublinks: READ THE PAPER!
  21. 2. The Provenance of Subqueries
      • Single uncorrelated ANY-sublinks in selection conditions
      • The result of the sublink query is fixed
      • For a given input tuple t, the sublink condition is either true or false
      • Example: $\sigma_{a = \text{ANY}\, \sigma_{b=3}(S)}(R)$
  22. 2. The Provenance of Subqueries
      • Some terminology:
        • $T_{sub}$: the query of a sublink
        • $C_{sub}$: the conditional expression of a sublink
      • Example: $q = \sigma_{a = \text{ANY}\, \Pi_b(S)}(R)$ with $T_{sub} = \Pi_b(S)$ and $C_{sub} = (a = \text{ANY}\, \Pi_b(S))$
  23. 2. The Provenance of Subqueries
      • A sublink condition $C_{sub}$ can play different roles in a condition C of a selection (for one input tuple t):
        • Reqtrue: the selection condition is true iff $C_{sub}$ is true
        • Reqfalse: the selection condition is true iff $C_{sub}$ is false
        • Ind: the selection condition is true independent of the result of $C_{sub}$
  24. 2. The Provenance of Subqueries
      • Some more terminology:
        • $T_{sub}^{true}(t)$: all tuples from the sublink query that fulfill the "unquantified" sublink condition
        • $T_{sub}^{false}(t)$: all tuples from the sublink query that do not fulfill the "unquantified" sublink condition
      • Example: for $C_{sub} = (a = \text{ANY}\, \sigma_{b=3}(S))$ the unquantified condition is $C_{sub}^{\circ} = (a = b)$
  25. 2. The Provenance of Subqueries
      • Back to ANY-sublinks in selections
      • Proposition:
        $$T_{sub}^{*}(t) = \begin{cases} T_{sub}^{true}(t) & \text{if } C_{sub} \text{ is reqtrue} \\ T_{sub} & \text{if } C_{sub} \text{ is reqfalse or ind} \end{cases}$$
  26. 2. The Provenance of Subqueries
      • Example: $q = \sigma_{a = \text{ANY}\, \Pi_b(S)}(R)$
      • R (a): 1, 2, 3
      • S (b, c): (1, 100), (2, 10), (4, 24)
      • Result (a): 1, 2
      • Compute provenance for t = (1)
  27. 2. The Provenance of Subqueries
      • $q = \sigma_{a = \text{ANY}\, \Pi_b(S)}(R)$
      • $T_{sub} = \Pi_b(S)$
      • $C_{sub}^{\circ} = (a = b)$
      • $T_{sub}^{true}(t) = \{(1)\}$
      • $C_{sub}$ is reqtrue, so $T_{sub}^{*} = T_{sub}^{true}$
  28. 2. The Provenance of Subqueries
      • $q = \sigma_{a = \text{ANY}\, \Pi_b(S)}(R)$, $C_{sub}^{\circ} = (a = b)$
      • R (a): 1, 2, 3
      • $T_{sub}$ (b): 1, 2, 4
      • Compute provenance for t = (1): $T_{sub}^{true}(t) = \{(1)\}$
  29. 2. The Provenance of Subqueries
      • $q = \sigma_{a = \text{ANY}\, \Pi_b(S)}(R)$
      • R (a): 1, 2, 3
      • S (b, c): (1, 100), (2, 10), (4, 24)
      • $T_{sub}$ (b): 1, 2, 4
      • Result (a): 1, 2
      • Provenance for t = (1): R* (a): 1 and $T_{sub}^{*}$ (b): 1
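
The ANY-sublink example of slides 26 to 29 can be replayed in a few lines. This is a minimal sketch under stated assumptions: relations are plain Python lists of tuples, the selection condition consists of the sublink condition alone (so the sublink is reqtrue for every result tuple), and the function names `project_b` and `any_sublink_provenance` are hypothetical helpers, not part of Perm.

```python
# Relations from the example: R(a) and S(b, c).
R = [(1,), (2,), (3,)]
S = [(1, 100), (2, 10), (4, 24)]


def project_b(relation):
    """Tsub: duplicate-free projection of S on attribute b."""
    return sorted({(b,) for (b, _c) in relation})


def any_sublink_provenance(outer, tsub):
    """Provenance of each result tuple of the selection a = ANY Tsub over R.

    The selection condition is the sublink condition itself, so the sublink
    is reqtrue and, by the proposition, Tsub* equals Tsub_true(t).
    """
    provenance = {}
    for (a,) in outer:
        # Tsub_true(t): sublink tuples fulfilling the unquantified condition a = b.
        tsub_true = [(b,) for (b,) in tsub if a == b]
        if tsub_true:  # the sublink condition holds, so t is in the result
            provenance[(a,)] = {"R*": [(a,)], "Tsub*": tsub_true}
    return provenance


tsub = project_b(S)
provenance = any_sublink_provenance(R, tsub)
print(provenance[(1,)])
```

For t = (1) this yields R* = {(1)} and Tsub* = {(1)}, matching slide 29; the tuple (3) has no entry because it is not in the result.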
  30. 2. The Provenance of Subqueries
      • Definition 1 is ambiguous for queries with more than one sublink!
      • $q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \text{ANY}\, R)$ and $C_2 = (a > \text{ALL}\, S)$
      • R (b): 1, 2, 100
      • S (c): 1, 5
      • U (a): 5
      • Result (a): 5, so t = (5)
  31. 2. The Provenance of Subqueries
      • Definition 1 is ambiguous for queries with more than one sublink!
      • $q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \text{ANY}\, R)$ and $C_2 = (a > \text{ALL}\, S)$
      • R (b): 1, 2, 100
      • S (c): 1, 5
      • U (a): 5
      • Result (a): 5, so t = (5)
      • Sublink results: $C_1$ = true, $C_2$ = false
  32. 2. The Provenance of Subqueries
      • $q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \text{ANY}\, R)$ and $C_2 = (a > \text{ALL}\, S)$, t = (5)
      • Solution 1: U* = {5}, R* = {5}, S* = {1, 5}
      • Solution 2: U* = {5}, R* = {1, 100}, S* = {1}
  33. 2. The Provenance of Subqueries
      • $q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \text{ANY}\, R)$ and $C_2 = (a > \text{ALL}\, S)$, t = (5)
      • Solution 1: U* = {5}, R* = {5}, S* = {1, 5}
      • Solution 2: U* = {5}, R* = {1, 100}, S* = {1}
      • Sublink results: $C_1$ = true, $C_2$ = false
  34. 2. The Provenance of Subqueries
      • $q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \text{ANY}\, R)$ and $C_2 = (a > \text{ALL}\, S)$, t = (5)
      • Solution 1: U* = {5}, R* = {5}, S* = {1, 5}
      • Solution 2: U* = {5}, R* = {1, 100}, S* = {1}
      • Sublink results: $C_1$ = false, $C_2$ = true
  35. 2. The Provenance of Subqueries
      • Reasons for this ambiguity:
        • The definition requires the provenance to produce the same result
        • But not to produce the same results for the sublinks
      • Consequence: Definition 1 produces false positives
  36. 2. The Provenance of Subqueries
      • Solution: extend Definition 1
      • Add a third condition. For each sublink: if computed for one result tuple t and one tuple from the provenance of the sublink, it produces the same sublink result as in the original query
  37. 2. The Provenance of Subqueries
      • $q = \sigma_{C_1 \vee C_2}(U)$ with $C_1 = (a = \text{ANY}\, R)$ and $C_2 = (a > \text{ALL}\, S)$, t = (5)
      • Solution 1: U* = {5}, R* = {5}, S* = {5}
      • Solution 2: U* = {5}, R* = {1, 100}, S* = {1}
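
The effect of the third condition on the two candidate solutions can be checked mechanically. The sketch below is an illustration under stated assumptions, not the paper's algorithm: R is assumed to contain the values 1, 5 and 100 (so that C1 is true in the original query, as the solutions on slides 32 to 34 require), the maximality requirement of Definition 1 is not checked, and the helper names `c1`, `c2`, `selection_holds` and `satisfies_third_condition` are hypothetical.

```python
def c1(a, r):
    """Sublink condition C1: a = ANY R."""
    return any(a == b for b in r)


def c2(a, s):
    """Sublink condition C2: a > ALL S."""
    return all(a > c for c in s)


a = 5              # the single tuple of U, which is also the result tuple t
R = [1, 5, 100]    # assumption of this sketch
S = [1, 5]

# Sublink results in the original query.
original = (c1(a, R), c2(a, S))


def selection_holds(r_star, s_star):
    """Definition 1, first condition: the candidate still produces t."""
    return c1(a, r_star) or c2(a, s_star)


def satisfies_third_condition(r_star, s_star):
    """Each single provenance tuple must reproduce the original sublink result."""
    same_c1 = all(c1(a, [b]) == original[0] for b in r_star)
    same_c2 = all(c2(a, [c]) == original[1] for c in s_star)
    return same_c1 and same_c2


# Both candidates from slide 32 still produce t = (5) ...
print(selection_holds([5], [1, 5]), selection_holds([1, 100], [1]))
# ... but only the refined Solution 1 from slide 37 passes the third condition.
print(satisfies_third_condition([5], [5]))
print(satisfies_third_condition([5], [1, 5]))
print(satisfies_third_condition([1, 100], [1]))
```

Solution 2 is rejected because it flips C1 from true to false, and the S* of the unrefined Solution 1 shrinks to {5} because the tuple 1 alone would make C2 true.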
  38. 2. The Provenance of Subqueries
      • How to compute the provenance according to the extended definition?
      • Use query rewrite:
        • Generic strategy (Gen)
        • Specialized strategies: use un-nesting
      • Check: un-nesting does not change the provenance
  39. 2. The Provenance of Subqueries
      • Gen-strategy, for queries we cannot un-nest:
        1. Join the original query with all possible provenance tuples (base relations)
        2. Rewrite the sublink query
        3. Introduce additional correlation to simulate a join between 1) and 2)
  40. Overview: 1. Introduction; 2. The Provenance of Subqueries; 3. Experimental Results; 4. Conclusion
  41. 3. Experimental Results
      • TPC-H benchmark (10 MB size)
  42. 3. Experimental Results
      • TPC-H benchmark (1 GB size)
  43. Overview: 1. Introduction; 2. The Provenance of Subqueries; 3. Experimental Results; 4. Conclusion
  44. 4. Conclusion
      • Definition 1 fails in the presence of sublinks
      • It can be extended to deal with sublinks
      • Provenance computation for sublinks by using query rewrites
      • Implemented in Perm
      • Future work: physical provenance-aware operators
  45. Questions?
