This document presents a vision for a generic provenance middleware called GProM that can compute provenance for database queries, updates, and transactions. Some key points:
- GProM uses query rewriting and annotation propagation techniques to compute provenance in a non-invasive way.
- It introduces the concept of "reenactment queries" to compute provenance for past transactions by simulating their effects, using time travel to access past database states.
- The reenactment queries are then rewritten to propagate provenance annotations, yielding the provenance of the entire transaction (both steps are sketched below).
- GProM aims to support multiple provenance types and storage policies in a database-independent way through an extensible, modular architecture.
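As a hedged illustration of those two steps, the Python sketch below builds (1) a reenactment query that simulates a past UPDATE over a time-travel snapshot, assuming Oracle-style AS OF TIMESTAMP syntax and a toy accounts(id, value) schema, and (2) a provenance rewrite that attaches to each result row the id of the input row it derives from. GProM itself performs such rewrites on relational algebra internally; all names here are illustrative.

```python
# A minimal sketch, not GProM's actual implementation. The AS OF TIMESTAMP
# clause assumes Oracle-style time travel; table and column names are toy
# assumptions.

def reenact_update(table: str, set_expr: str, where: str, ts: str) -> str:
    """Simulate a past UPDATE as a query over the pre-update database state.

    Rows matching WHERE appear with updated values; all other rows are
    returned unchanged, mimicking the update without re-executing it.
    """
    return (
        f"SELECT CASE WHEN {where} THEN {set_expr} ELSE value END AS value, id "
        f"FROM {table} AS OF TIMESTAMP {ts}"
    )

def add_provenance(query: str, table: str) -> str:
    """Rewrite a query so each result row also carries the id of the
    input row it was derived from (a simple provenance annotation)."""
    return f"SELECT q.*, q.id AS prov_{table}_id FROM ({query}) q"

reenacted = reenact_update("accounts", "value * 1.05", "value > 1000",
                           "TO_TIMESTAMP('2014-06-01 12:00:00')")
print(add_provenance(reenacted, "accounts"))
```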
TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics with R - Debraj GuhaThakurta
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-DB analytics using R in Spark and SQL Server
How to design a Disaster Recovery Plan for HDP (Hortonworks Data Platform) Clusters?
Mohamed Mehdi BEN AISSA, Big Data Practice Manager at FINAXYS and Big Data ITO at CACIB
For HDP clusters, we first suggest different Disaster Recovery Plan solutions depending on the SLA (Service-Level Agreement) requirements: RPO (Recovery Point Objective) and RTO (Recovery Time Objective). In a second phase, we focus on the stretch-cluster solution: its advantages, its drawbacks, and the impact of this choice on the overall architecture. Finally, we explain in detail how to configure and deploy this solution and how to integrate each layer (storage layer, processing layer, etc.) into the architecture.
Alibaba has built its data infrastructure on Apache Hadoop YARN since 2013, and it now manages more than 10,000 nodes. At Alibaba, Hadoop YARN serves various systems such as search, advertising, and recommendation. It runs not just batch jobs but also streaming, machine learning, OLAP, and even online services that directly impact Alibaba's user experience. To extend YARN's ability to support such complex scenarios, we have contributed to and leveraged many YARN 3.x improvements. In this talk, you will learn what these improvements are and how they helped solve difficult problems in large production clusters.
These include:
1. Significantly improved performance with the Capacity Scheduler's asynchronous scheduling framework
2. Better placement decisions with node attributes and placement constraints
3. Better resource utilization with opportunistic containers
4. A load balancer that evens out resource utilization across nodes
5. Scheduling and isolation of generic resource types to manage new resources such as GPUs and FPGAs
In the presentation, we will further introduce how we build the entire ecosystem on top of YARN and how we keep evolving YARN's ability to tackle the challenges brought by Alibaba's continuously growing data and business.
Speakers
Weiwei Yang, Alibaba, Staff Software Engineer
Ren Chunde, Alibaba Group, Senior Engineer
A closer look at the fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. We'll show how to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
Speakers:
Karan Desai - Solutions Architect, AWS
Neel Mitra - Solutions Architect, AWS
This document summarizes an intern's report on an internship at Altisource working with big data on the Hadoop platform. The intern researched how data is stored in an RDBMS versus Hadoop, learned query languages such as HiveQL and MySQL, and gained knowledge of MapReduce, Spark, and Sqoop. As part of a project, the intern analyzed Altisource data using Apache Hadoop and Spark. The intern also set up a 3-node cluster on Google Cloud and used it to analyze NASA clickstream and other data, demonstrating concepts such as the linear increase in processing time with data size. The intern concluded that the internship provided exposure to querying languages and distributed computing concepts.
The document discusses the MapR Big Data platform and Apache Drill. It provides an overview of MapR's M7 which makes HBase enterprise-grade by eliminating compactions and enabling a unified namespace. It also describes Apache Drill, an interactive query engine inspired by Google's Dremel that supports ad-hoc queries across different data sources at scale through its logical and physical query planning. The document demonstrates simple queries and provides details on contributing to and using Apache Drill.
This document provides an introduction to project management. It defines a project, compares projects and operations, and outlines what makes a project successful or fail. It then defines project management and its key areas including scope, issue, cost, quality, communications, risk, and change management. The five phases of project management are also outlined. Finally, it discusses common project management tools and the role of the project manager.
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange" - Boris Glavic
The document discusses value invention in data exchange and schema mappings. It introduces the data exchange problem involving mapping source and target schemas using a specification. Value invention involves creating values to represent incomplete information when materializing the target schema. The goal is to understand when schema mappings specified by second-order tuple-generating dependencies (SO tgds) can be rewritten as nested global-as-view mappings, which have more desirable computational properties. The paper presents an algorithm called Linearize that rewrites SO tgds as nested GLAV mappings if they are linear and consistent. It also discusses exploiting source constraints like functional dependencies to find an equivalent linear mapping.
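To make value invention concrete, here is a canonical textbook-style example (not taken from the talk): the SO tgd invents a target value via the function term f(x), while the corresponding GLAV constraint merely asserts that such a value exists.

```latex
% SO tgd: a second-order function term f invents the target value.
\exists f\, \forall x\, \big(\mathrm{Emp}(x) \rightarrow \mathrm{Mgr}(x, f(x))\big)
% Equivalent GLAV constraint: the invented value is only asserted to exist.
\forall x\, \big(\mathrm{Emp}(x) \rightarrow \exists y\, \mathrm{Mgr}(x, y)\big)
```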
On the need for applications aware adaptive middleware in real-time RDF data analytics - Zia Ush Shamszaman
The document proposes an adaptive middleware approach for real-time RDF data analytics. It discusses how different RSP engines have different query languages, data models, execution strategies, and output models. It hypothesizes that an adaptive approach could improve efficiency and correctness by adapting to dynamic application requirements and data stream properties at runtime. It provides examples of events with different notification timing requirements to illustrate the need for the adaptive approach.
This document provides an overview of data science work at Zillow. It discusses Zillow's use of machine learning models like the Zestimate and Rent Zestimate to analyze housing data. It describes Zillow's technology stack, which heavily leverages Python, R, and SQL. Specific examples are provided on automated waterfront determination using GIS data and discovering home street features. The document also discusses how tools like Dato and Scikit-Learn are used for tasks like fraud detection, property matching, and data modeling. In closing, current job openings at Zillow are listed.
This document discusses benchmarking Apache Druid using the Star Schema Benchmark (SSB). It describes ingesting SSB data into Druid, optimizing queries and segments, running queries using JMeter, and the results. Key aspects covered include partitioning data, controlling segment size, explaining query plans, and configuring JMeter. The document encourages readers to try benchmarking Druid themselves to better learn how to optimize it for their own use cases and data.
This document discusses benchmarking Apache Druid using the Star Schema Benchmark (SSB). It describes ingesting the SSB dataset into Druid, optimizing the data and queries, and running performance tests on the 13 SSB queries using JMeter. The results showed Druid can answer the analytic queries in sub-second latency. Instructions are provided on how others can set up their own Druid benchmark tests to evaluate performance.
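As a hedged illustration of the measurement loop both summaries describe, the sketch below times a single SSB-style query against Druid's SQL HTTP endpoint (POST /druid/v2/sql); the endpoint URL, datasource name, and query text are assumptions, and the decks themselves drive all 13 queries through JMeter.

```python
# A minimal sketch of timing one SSB-style query against Druid's SQL API.
# The broker URL, datasource name, and query are assumptions.
import time
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql"  # assumed router endpoint
QUERY = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM ssb_lineorder
WHERE lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25
"""

def time_query(sql: str, runs: int = 5) -> float:
    """Return the average wall-clock latency of a Druid SQL query in seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(DRUID_SQL, json={"query": sql})
        resp.raise_for_status()
        total += time.perf_counter() - start
    return total / runs

print(f"avg latency: {time_query(QUERY):.3f}s")
```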
Exploring Neo4j Graph Database as a Fast Data Access Layer - Sambit Banerjee
This article describes the findings of an extensive investigative work conducted to explore the feasibility of using a Neo4j Graph Database to build a Fast Data Access Layer with near-real time data ingestion from the underlying source systems.
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil - Databricks
ExxonMobil leveraged machine learning at scale using Databricks to extract insights from equipment maintenance logs and improve operations. The logs contained both structured and unstructured text data across a global fleet maintained in legacy systems, limiting traditional analysis. By ingesting and enriching over 60 million records using natural language processing, the system identified outliers, enabled capacity planning, and prioritized maintenance tasks, projected to save millions annually through more effective reliability and maintenance guidance.
Auto-Pilot for Apache Spark Using Machine Learning - Databricks
At Qubole, users run Spark at scale on cloud (900+ concurrent nodes). At such scale, for efficiently running SLA critical jobs, tuning Spark configurations is essential. But it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we will address the problem of auto-tuning SQL workloads on Spark. The same technique can also be adapted for non-SQL Spark workloads. In our earlier work[1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run queries. However, with respect to auto tuning Spark configurations we saw scope of improvement. On exploration, we found previous works addressing auto-tuning using Machine learning techniques. One major drawback of the simple model[1] is that it cannot use multiple runs of query for improving recommendation, whereas the major drawback with Machine Learning techniques is that it lacks domain specific knowledge. Hence, we decided to combine both techniques. Our auto-tuner interacts with both models to arrive at good configurations. Once user selects a query to auto tune, the next configuration is computed from models and the query is run with it. Metrics from event log of the run is fed back to models to obtain next configuration. Auto-tuner will continue exploring good configurations until it meets the fixed budget specified by the user. We found that in practice, this method gives much better configurations compared to configurations chosen even by experts on real workload and converges soon to optimal configuration. In this talk, we will present a novel ML model technique and the way it was combined with our earlier approach. Results on real workload will be presented along with limitations and challenges in productionizing them. [1] Margoor et al,'Automatic Tuning of SQL-on-Hadoop Engines' 2018,IEEE CLOUD
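The core loop described above (propose a configuration, run the query, feed the observed metrics back, stop at the budget) can be sketched as follows; the propose and run_query interfaces are hypothetical stand-ins, not Qubole's actual API.

```python
# A minimal sketch of the budgeted auto-tuning loop; all interfaces here
# are assumptions for illustration.
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]

def auto_tune(propose: Callable[[List[Tuple[Config, float]]], Config],
              run_query: Callable[[Config], float],
              budget: int) -> Config:
    """Repeatedly ask the model for a config, run the query with it, and
    feed the observed runtime back until the trial budget is exhausted."""
    history: List[Tuple[Config, float]] = []  # (config, runtime) pairs
    for _ in range(budget):
        cfg = propose(history)       # model combines rules + ML here
        runtime = run_query(cfg)     # metrics come from the event log
        history.append((cfg, runtime))
    return min(history, key=lambda h: h[1])[0]  # best config observed
```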
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
If your business is heavily dependent on the Internet, you may be facing an unprecedented level of network traffic analytics data. How to make the most of that data is the challenge. This presentation from Kentik VP Product and former EMA analyst Jim Frey explores the evolving need, the architecture and key use cases for BGP and NetFlow analysis based on scale-out cloud computing and Big Data technologies.
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns - Allen Day, PhD
This document discusses design patterns for big data applications. It begins by defining what a design pattern is, then provides examples of patterns for different types of data volumes and query speeds. Common patterns like percolation and recommendation systems are explained. The document also discusses how to analyze big data applications to determine which patterns may apply. Specific examples like personalized search, medicine, and market segmentation are used to illustrate how patterns can be implemented. The key lessons are to take a high-level view of recurring problems and design reusable pattern-based solutions.
An empirical evaluation of cost-based federated SPARQL query Processing Engines - Umair Qudus
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected, and number of requests sent. Albeit informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. To thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate five cost-based federated SPARQL query engines using existing as well as novel evaluation metrics on LargeRDFBench queries. Our results provide a detailed analysis of the experimental outcomes and reveal novel insights useful for the development of future cost-based federated SPARQL query processing engines.
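The paper defines its own fine-grained metrics; purely as a point of reference, the sketch below computes the widely used q-error, a baseline measure of cardinality estimation accuracy (not the paper's metric).

```python
# The q-error is a standard baseline measure of cardinality estimation
# accuracy; the paper's own metrics are more fine-grained than this.
def q_error(estimated: float, actual: float) -> float:
    """Multiplicative error of a cardinality estimate; 1.0 is perfect.

    Symmetric in over- and under-estimation:
    q_error(10, 100) == q_error(1000, 100) == 10.0
    """
    if estimated <= 0 or actual <= 0:
        raise ValueError("cardinalities must be positive")
    return max(estimated / actual, actual / estimated)
```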
EnterpriseDB's Best Practices for Postgres DBAs - EDB
This document provides an agenda and overview for a presentation on best practices for PostgreSQL database administrators (DBAs). The presentation covers EnterpriseDB's expertise in PostgreSQL, the key responsibilities of a PostgreSQL DBA including monitoring, maintenance, capacity planning and configuration tuning. It also discusses deployment planning, professional development resources, and takes questions. Examples from architectural health checks and remote DBA services illustrate common issues found like index bloat and lack of backups. The document recommends performance monitoring and security tools and techniques for PostgreSQL.
Present & Future of Greenplum Database A massively parallel Postgres Database... - VMware Tanzu
Greenplum Database is Pivotal's massively parallel Postgres database. Version 5 has proven features for mission critical use cases. Version 6 adds improvements like row-level locking, foreign data wrappers, and online expansion to make Greenplum a superset of Postgres. It also provides up to 50x faster OLTP performance. Version 7 will focus on capabilities beyond the cluster like streaming replication and using Greenplum as a source for data integration tools.
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale - Saurabh Verma
This document summarizes a company's transition from a SQL database to a native graph database to power their identity resolution product. It describes the requirements of high read and write throughput and complex queries over billions of identities and linkages. It then outlines the evaluation of several graph databases, with JanusGraph on ScyllaDB performing the best. Key findings from prototyping include handling high query volume, managing supernodes, and tuning compaction strategies. The production implementation and architecture is also summarized.
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale - ScyllaDB
Zeotap’s Connect product addresses the challenges of identity resolution and linking for AdTech and MarTech. Zeotap manages roughly 20 billion IDs, and growing. In their presentation, Zeotap engineers will delve into data access patterns and processing and storage requirements to make the case for a graph-based store. They will share the results of PoCs made on technologies such as Dgraph, OrientDB, Aerospike, and Scylla, present the reasoning for selecting JanusGraph backed by Scylla, and take a deep dive into their data model architecture from the point of ingestion. Learn what is required in terms of production setup, configuration, and performance tuning to manage data at this scale.
This document discusses Wipro's experience helping a customer transition from their existing SIEM platform to Splunk for security monitoring and analytics. It describes how Wipro guided the customer through a two-phase implementation: first standing up a hybrid on-premise/cloud Splunk deployment to address immediate needs, and now expanding that deployment to 500GB/day in Splunk Cloud and 200GB/day on-premise to accommodate growing data and use cases. The transition yielded significant improvements in search performance, data ingestion and parsing flexibility, and enhanced security visualization and analytics capabilities.
The document summarizes a performance evaluation of the Geo2Tag location-based services platform. It describes modeling client-server interactions to identify the most frequent requests, measuring those requests' performance, and optimizing the platform. Specifically, it identified database interaction as the bottleneck, optimized database synchronization, and saw average request processing times decrease by 47.5% after the optimizations. The evaluation provided insights into maximizing performance and informed future work on supporting NoSQL databases and lock-free algorithms.
Using Perforce Data in Development at Tableau - Perforce
Data plays a big role at Tableau—not just for our customers, but also throughout our company. Using our own products is not only one of our fundamental company values, but the analysis and discoveries we make are important to track as they shape our development processes and influence our day-to-day decisions. In this talk, we present and analyze a variety of data visualizations based on Perforce data from our development organization and share how it has influenced our infrastructure and development practices.
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ... - Boris Glavic
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
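As a deliberately simplified toy (assumed, not the paper's formalism), the sketch below marks each tuple as certain (in the under-approximation) or merely possible (in the over-approximation only) and shows how a selection propagates these marks unchanged.

```python
# A toy illustration of the under/over-approximation idea; the UA-DB paper
# develops this over incomplete K-relations, which this sketch does not model.
def select(pred, relation):
    """Filter an annotated relation; certain inputs stay certain,
    merely-possible inputs stay marked as uncertain in the output."""
    return [(row, certain) for row, certain in relation if pred(row)]

ua_db = [(("Alice", 30), True),   # certain tuple (under-approximation)
         (("Bob", 41), False)]    # possible tuple (over-approximation only)

print(select(lambda r: r[1] > 25, ua_db))
# -> both rows qualify, but only Alice's row is a certain answer
```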
Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
2016 VLDB - The iBench Integration Metadata Generator - Boris Glavic
Given the maturity of the data integration field it is surprising that rigorous empirical evaluations of research ideas are so scarce. We identify a major roadblock for empirical work - the lack of comprehensive metadata generators that can be used to create benchmarks for different integration tasks. This makes it difficult to compare integration solutions, understand their generality, and understand their performance. We present iBench, the first metadata generator that can be used to evaluate a wide-range of integration tasks (data exchange, mapping creation, mapping composition, schema evolution, among many others). iBench permits control over the size and characteristics of the metadata it generates (schemas, constraints, and mappings). Our evaluation demonstrates that iBench can efficiently generate very large, complex, yet realistic scenarios with different characteristics. We also present an evaluation of three mapping creation systems using iBench and show that the intricate control that iBench provides over metadata scenarios can reveal new and important empirical insights. iBench is an open-source, extensible tool that we are providing to the community. We believe it will raise the bar for empirical evaluation and comparison of data integration systems.
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleaning Algorithms - Boris Glavic
We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers - Boris Glavic
This document introduces a unified framework for generalizing explanations for answers and non-answers to why/why-not questions over unions of conjunctive queries (UCQs). It utilizes an available ontology, expressed as inclusion dependencies, to map concepts to instances and generate generalized explanations. Generalized explanations describe subsets of an explanation using concepts from the ontology; the most general explanation is the one that is not dominated by any other explanation. The approach is implemented using Datalog rules that model subsumption checking, successful and failed rule derivations, and the computation of explanations, their generalizations, and the most general explanations.
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON - Boris Glavic
Since its inception, the PROV standard has been widely adopted as a standardized exchange format for provenance information. Surprisingly, this standard is currently not supported by provenance-aware database systems, limiting their interoperability with other provenance-aware systems. In this work we introduce techniques for exporting database provenance as PROV documents, importing PROV graphs alongside data, and linking outputs of an SQL operation to the imported provenance for its inputs. Our implementation in the GProM system offloads generation of PROV documents to the backend database. This implementation enables provenance tracking for applications that use a relational database for managing (part of) their data, but also execute some non-database operations.
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers - Boris Glavic
Explaining why an answer is present (traditional provenance) or absent (why-not provenance) from a query result is important for many use cases. Most existing approaches for positive queries use the existence (or absence) of input data to explain a (missing) answer. However, for realistically-sized databases, these explanations can be very large and, thus, may not be very helpful to a user. In this paper, we argue that logical constraints as a concise description of large (or even infinite) sets of existing or missing inputs can provide a natural way of answering a why- or why-not provenance question. For instance, consider a query that returns the names of all cities which can be reached with at most one transfer via train from Lyon in France. The provenance of a city in the result of this query, say Dijon, will contain a large number of train connections between Lyon and Dijon which each justify the existence of Dijon in the result. If we are aware that Lyon and Dijon are cities in France (e.g., an ontology of geographical locations is available), then we can use this information to generalize the query output and its provenance to provide a more concise explanation of why Dijon is in the result. For instance, we may conclude that all cities in France can be reached from each other through Paris. We demonstrate how an ontology expressed as inclusion dependencies can provide meaningful justifications for answers and non-answers, and we outline how to find a most general such explanation for a given UCQ query result using Datalog. Furthermore, we sketch several variations of this framework derived by considering other types of constraints as well as alternative definitions of explanation and generalization.
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance - Boris Glavic
We reconsider some of the explicit and implicit properties that underlie well-established definitions of data provenance semantics. Previous work on comparing provenance semantics has mostly focused on expressive power (does the provenance generated by a certain semantics subsume the provenance generated by other semantics) and on understanding whether a semantics is insensitive to query rewriting (i.e., do equivalent queries have the same provenance). In contrast, we try to investigate why certain semantics possess specific properties (like insensitivity) and whether these properties are always desirable. We present a new property, stability with respect to query language extension, that, to the best of our knowledge, has not been isolated and studied on its own.
EDBT 2009 - Provenance for Nested Subqueries - Boris Glavic
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation, and/or user-defined functions. Without support for these constructs, a provenance management system is of limited use.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model through Query Rewriting - Boris Glavic
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of a relational database, the source and intermediate data items are relations, tuples, and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored, and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information, inducing only a small overhead on normal operations.
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Provenance - Boris Glavic
Though partially automated, developing schema mappings remains a complex and potentially error-prone task. In this paper, we present TRAMP (TRAnsformation Mapping Provenance), an extensive suite of tools supporting the debugging and tracing of schema mappings and transformation queries. TRAMP combines and extends data provenance with two novel notions, transformation provenance and mapping provenance, to explain the relationship between transformed data and those transformations and mappings that produced that data. In addition we provide query support for transformations, data, and all forms of provenance. We formally define transformation and mapping provenance, present an efficient implementation of both forms of provenance, and evaluate the resulting system through extensive experiments.
WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking" - Boris Glavic
This document discusses big data provenance and its implications for benchmarking. It begins by outlining provenance, describing challenges of big data provenance, and providing examples of approaches taken. It then discusses how provenance could be used for benchmarking by serving as data and workloads. Provenance-based metrics and using provenance for profiling and monitoring systems are proposed. Generating large datasets and workloads from provenance data is suggested to address issues with big data benchmarking.
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams" - Boris Glavic
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce run-time overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance to allow lazily replaying a query network and reconstruct its provenance as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state-of-the-art.
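As a hedged sketch of the operator-instrumentation idea (the tuple format and operators here are assumptions; Ariadne instruments Borealis operators and adds stream-specific compression), each tuple carries the set of input ids it depends on:

```python
# A minimal sketch: stream operators modified to propagate provenance ids.
from typing import Callable, Iterable, Tuple

Tagged = Tuple[object, frozenset]  # (value, set of contributing input ids)

def instrumented_map(fn: Callable, stream: Iterable[Tagged]) -> Iterable[Tagged]:
    """Apply fn to each tuple's value while carrying its provenance along."""
    for value, prov in stream:
        yield fn(value), prov

def instrumented_agg(window: Iterable[Tagged]) -> Tagged:
    """A windowed sum whose provenance is the union of all input ids."""
    vals = list(window)
    total = sum(v for v, _ in vals)
    prov = frozenset().union(*(p for _, p in vals)) if vals else frozenset()
    return total, prov

window = [(3, frozenset({"t1"})), (4, frozenset({"t2"}))]
print(instrumented_agg(instrumented_map(lambda x: x * 2, window)))
# -> (14, frozenset({'t1', 't2'}))
```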
TaPP 2013 - Provenance for Data Mining - Boris Glavic
Data mining aims at extracting useful information from large datasets. Most data mining approaches reduce the input data to produce a smaller output summarizing the mining result. While the purpose of data mining (extracting information) necessitates this reduction in size, the loss of information it entails can be problematic. Specifically, the results of data mining may be more confusing than insightful, if the user is not able to understand on which input data they are based and how they were created. In this paper, we argue that the user needs access to the provenance of mining results. Provenance, while extensively studied by the database, workflow, and distributed systems communities, has not yet been considered for data mining. We analyze the differences between database, workflow, and data mining provenance, suggest new types of provenance, and identify new use-cases for provenance in data mining. To illustrate our ideas, we present a more detailed discussion of these concepts for two typical data mining algorithms: frequent itemset mining and multi-dimensional scaling.
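As a toy example of the kind of provenance the paper argues for (assumed for illustration, not taken from the paper), the provenance of a frequent itemset can be taken to be the set of transactions that support it:

```python
# A toy sketch: each mined itemset is traced back to the transactions
# (its provenance) that support it. All data here is illustrative.
from itertools import combinations

transactions = {"t1": {"milk", "bread"},
                "t2": {"milk", "bread", "eggs"},
                "t3": {"bread"}}

def frequent_itemsets_with_provenance(txs, min_support=2, size=2):
    """Return {itemset: supporting transaction ids} for frequent itemsets."""
    items = sorted(set().union(*txs.values()))
    result = {}
    for combo in combinations(items, size):
        support = {tid for tid, t in txs.items() if set(combo) <= t}
        if len(support) >= min_support:
            result[frozenset(combo)] = support
    return result

print(frequent_itemsets_with_provenance(transactions))
# -> {frozenset({'bread', 'milk'}): {'t1', 't2'}}
```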
This document discusses auditing and maintaining provenance in software packages. It presents CDE-SP, an enhancement to the CDE system that captures additional details about software dependencies to enable attribution of authorship as software packages are combined and merged into pipelines. CDE-SP uses a lightweight LevelDB storage to encode process and file provenance within software packages. It provides queries to retrieve dependency information and validate authorship by matching provenance graphs. Experiments show CDE-SP introduces negligible overhead compared to the original CDE system.
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxMAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poorquality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for the cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx (RASHMI M G)
This document discusses abnormal (anomalous) secondary growth in plants. Secondary growth is defined as an increase in plant girth due to the activity of the vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems (University of Maribor)
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz), I decided not to walk through the details of the many methodologies in order of use. Instead, I chose to employ a long-standing, and ongoing, scientific development as an exemplar. And so I chose the ever-evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of more than 200 years, Thermodynamics R&D, and its application, benefited from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at both micro and macro levels.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science, engineering, and technology, spanning micro-tech to aerospace and cosmology. I can think of no better story to illustrate the breadth of scientific methodologies and applications at their best.
The Thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. The test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
The debris of the ‘last major merger’ is dynamically young (Sérgio Sacani)
The Milky Way's (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different from the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, Updates, and Transactions
1. A Generic Provenance Middleware for Database Queries, Updates, and Transactions
Bahareh Sadat Arab (1), Dieter Gawlick (2), Venkatesh Radhakrishnan (2), Hao Guo (1), Boris Glavic (1)
(1) IIT DBGroup, (2) Oracle
2. Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
3. Introduction
• Data Provenance
– Information about the origin and creation process of data
• Provenance tracking for database operations
– Considerable interest from the database community in the last decade
• The de-facto standard for database provenance [1,2,3,5,7]
– model provenance as annotations on data (e.g., tuples)
– compute the provenance by propagating annotations (query rewrite)
SELECT DISTINCT Owner FROM CanAcc;
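As a minimal sketch (our illustration, following the attribute-duplication scheme shown in the example rewrite on slide 12), the rewritten version of this fragment returns each Owner value together with the input tuple it stems from, encoded as extra attributes; DISTINCT is dropped because duplicates now differ in their provenance attributes:

SELECT Owner,
ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
FROM CanAcc;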
4. Use Cases
• Debugging data and transformations (queries) [1]
• Probabilistic databases (queries) [5]
• Auditing and compliance (transactions and update statements) [6]
• Understanding data integration transformations (queries and transactions)
• Assessing data quality and trust (queries and transactions) [7]
Computing provenance for updates and transactions is essential for many use cases.
5. Shortcomings of the State of the Art
• No practical implementation for updates
• No system or model supports transactions
• Inflexible provenance storage
– Always on [2,3]
– On-demand only [1]
• Query rewrites use atypical access patterns and operator sequences
– leads to poor execution plans
• Most systems support only one type of provenance
6. Objectives
1. Vision: a Generic Provenance Database Middleware (GProM)
– Provenance for queries, updates, and transactions
– User decides when to compute and store provenance
– Supports multiple provenance models
– Database-independent
2. Tracking provenance of concurrent transactions
– Reenactment queries
7. Contributions
1. First solution for provenance of transactions
2. Retroactive on-demand provenance computation
– Using read-only reenactment
3. Only requires an audit log + time travel
– Supported by most DBMS
– No additional storage or runtime overhead
4. Non-invasive provenance computation
– query rewrite + annotation propagation
8. Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
9. System Architecture
• Database-independent middleware
– Pluggable parser and SQL code generator
• Internal query representation
– Relational Algebra Graph Model (AGM)
• Core driver: query rewrites
– Provenance computation
– Flexible storage policies for provenance
– Provenance import/export
– AGM optimizer (for rewritten queries)
– Extensibility: Rewrite Specification Language (RSL)
• Initial prototype built on top of Oracle
11. Provenance Computation
• Query rewrite
– Take the original query q and rewrite it into q+, which computes the original results + provenance
– Propagate provenance through operations
[Diagram: query Q over DB produces Result; the rewritten query Q+ produces Result + Provenance]
12. Example Rewrite
• Input:
SELECT DISTINCT u.Owner FROM USacc u, CanAcc c WHERE u.ID = c.ID;
• Rewrite parts (original fragment → rewritten fragment):
– USacc → SELECT ID, Owner, Balance, Type, ID AS P1, Owner AS P2, Balance AS P3, Type AS P4 FROM USacc
– CanAcc → SELECT ID, Owner, Balance, Type, ID AS P5, Owner AS P6, Balance AS P7, Type AS P8 FROM CanAcc
– WHERE u.ID = c.ID → WHERE u.ID = c.ID
– SELECT DISTINCT Owner → SELECT Owner, P1, P2, P3, P4, P5, P6, P7, P8
• Output:
SELECT u.Owner, P1, P2, P3, P4, P5, P6, P7, P8
FROM
(SELECT ID, Owner, Balance, Type,
ID AS P1, Owner AS P2, Balance AS P3, Type AS P4
FROM USacc) u,
(SELECT ID, Owner, Balance, Type,
ID AS P5, Owner AS P6, Balance AS P7, Type AS P8
FROM CanAcc) c
WHERE u.ID = c.ID;
13. Provenance Computation
• Operates on the relational algebra representation of queries
– Fixed set of rewrite rules per provenance type:
• One per type of algebra operator
• Recursive top-down rewrite
– For each relation access: duplicate attributes as provenance
– For each operator: replace it with an algebra graph that propagates provenance annotations
• Composable
14. Supporting Past Queries, Updates, and Transactions
• Only needs an audit log and time travel
– supported by most DBMS
• Sufficient for provenance of past queries [4]
• Our contribution
– sufficient for updates and transactions
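To make the two ingredients concrete, here is a minimal sketch in Oracle syntax; the audit-trail view and its columns are an assumption that depends on how auditing is configured, and the SCN and transaction id are illustrative:

-- Time travel: read the version of USacc as of a past system change number
SELECT * FROM USacc AS OF SCN 3652;

-- Audit log: retrieve the statements executed by one past transaction
SELECT scn, sql_text
FROM dba_fga_audit_trail
WHERE transactionid = :xid
ORDER BY scn;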
15. Provenance Generation and Storage Policies
• GProM default
– Only compute provenance if explicitly requested
• Users can register storage policies
– When to store which type of provenance
POLICY storeOnR {
  FIRE ON Query, Insert q
  WHEN Root(q) +=> Table(R)
  COMPUTE PI-CS
  STORE AS NEW TABLE
  NAMING SCHEME Hash
}
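Under the default, a user asks for provenance explicitly; the request below uses GProM's SQL language extension (PROVENANCE OF), though the exact surface syntax shown here should be read as illustrative:

PROVENANCE OF (SELECT DISTINCT u.Owner FROM USacc u, CanAcc c WHERE u.ID = c.ID);

-- The middleware parses this, applies the rewrite from slide 12, and sends
-- plain SQL to the backend database.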
16. Optimizing Rewritten Queries
• Query rewrites use atypical access patterns and operator sequences
– leads to poor execution plans
• Optimizations for rewritten queries
– Heuristic
– Cost-based
• Example: two equivalent forms of the same projection; the CASE form computes the conditional inline, while the UNION ALL form splits it into two filtered branches that scan u1 separately

CASE form:
SELECT ID, Owner, Balance,
CASE
WHEN Balance > 1000000
THEN 'Premium'
ELSE Type
END AS Type,
prov_CanAcc_ID,
prov_CanAcc_Owner,
prov_CanAcc_Balance,
prov_CanAcc_Type,
prov_USacc_ID,
prov_USacc_Owner,
prov_USacc_Balance,
prov_USacc_Type
FROM u1
...

UNION ALL form:
SELECT ID, Owner, Balance, 'Premium' AS Type,
prov_CanAcc_ID,
prov_CanAcc_Owner,
prov_CanAcc_Balance,
prov_CanAcc_Type,
prov_USacc_ID,
prov_USacc_Owner,
prov_USacc_Balance,
prov_USacc_Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE
17. Rewrite Extensibility
• Extensible using the Rewrite Specification Language (RSL)
– Concise specification of rewrite rules
RULE mergeSelections {
  FOR q => c => g
  WHERE q->type = selection AND c->type = selection
  REWRITE INTO
  selection [pred = q->pred AND c->pred] => g
}
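In SQL terms, the effect of mergeSelections is to collapse two stacked selections into a single one; a minimal illustration over a hypothetical table R:

SELECT * FROM (SELECT * FROM R WHERE a > 5) t WHERE b < 10;
-- after mergeSelections, the plan corresponds to:
SELECT * FROM R WHERE a > 5 AND b < 10;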
[Diagram: RSL workflow connecting the User, the RSL Manager, the RSL Interpreter, and the Provenance Rewriter via registered policies]
18. Outline
❶ Motivation and Overview
❷ GProM Vision
❸ Provenance for Transactions
20. Provenance of Transactions
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');

UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;

COMMIT;
21. Provenance of Transactions
• The same transaction again, with its statements labeled: the INSERT is u1 and the UPDATE is u2.
22. Provenance of Transactions
• Our Approach: Reenactment + Provenance Propagation
• Currently supports
– Snapshot Isolation
– Statement-level Snapshot Isolation
Pipeline:
1. Gather transaction information
2. Construct update reenactment queries
3. Construct transaction reenactment query
4. Rewrite for provenance computation
5. Execute query
23. 1. Gather Transaction Information
• Retrieve the SQL statements of the transaction from the audit log
• Update u1:
INSERT INTO USacc
(SELECT ID,
Owner,
Balance,
'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
• Update u2:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
24. 2. Translate Updates: Reenactment
• An update reads a table version and outputs an updated table version
• Multiple versions of the database
– Each modification of a tuple t causes a new version to be created
– Old tuple versions are kept (SI)
– Add a version annotation τ to the provenance of each updated row
• Uses the semiring model
Example update:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
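As a worked sketch of this bookkeeping (the notation here is illustrative, not necessarily the talk's): if a tuple t carries semiring annotation x in the version that u2 reads, then in the output version

\[
\mathrm{ann}(t) =
\begin{cases}
\tau_{u_2} \cdot x & \text{if } t \text{ satisfies } \mathit{Balance} > 1000000,\\
x & \text{otherwise,}
\end{cases}
\]

so modified rows are tagged with the version annotation while unmodified rows keep their annotation unchanged.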
25. 2. Translate Updates
• Construct an update reenactment query
– Simulates the effect of the update
– Reads the DB version seen by the update using time travel
– Query result = updated table (annotation-equivalent)
• Insert u1:
INSERT INTO USacc
(SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc
WHERE Type = 'US_dollar');
is reenacted as:
SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT * FROM USacc AS OF SCN 3652;
• Update u2:
UPDATE USacc
SET Type = 'Premium'
WHERE Balance > 1000000;
is reenacted as:
SELECT ID, Owner, Balance, 'Premium' AS Type
FROM USacc AS OF SCN 3652
WHERE Balance > 1000000
UNION ALL
SELECT *
FROM USacc AS OF SCN 3652
WHERE (Balance > 1000000) IS NOT TRUE;
26. 3. Construct Reenactment Query
• Simulates the whole transaction
– Annotation-equivalent to the original transaction
• Merge the reenactment queries based on the concurrency control protocol
– Each concurrency control protocol requires a different merge process
– SERIALIZABLE (snapshot isolation): each statement sees modifications committed before the transaction started + previous updates of the same transaction
– READ COMMITTED (statement-level snapshot isolation): each statement also sees committed changes of concurrent transactions
WITH u1 AS
(SELECT ID, Owner, Balance, 'Standard' AS Type
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT * FROM USacc AS OF SCN 3652)
SELECT ID, Owner, Balance, 'Premium' AS Type
FROM u1
WHERE Balance > 1000000
UNION ALL
SELECT * FROM u1
WHERE (Balance > 1000000) IS NOT TRUE;
27. 4. Rewrite for Provenance Computation
• Rewrite the reenactment query to compute provenance using annotation propagation
WITH
u1 AS
(SELECT ID, Owner, Balance, 'Standard' AS Type,
ID AS prov_CanAcc_ID,
. . .
NULL AS prov_USacc_ID,
. . .
1 AS updated
FROM CanAcc AS OF SCN 3652
WHERE Type = 'US_dollar'
UNION ALL
SELECT ID, Owner, Balance, Type,
NULL AS prov_CanAcc_ID,
. . .
ID AS prov_USacc_ID,
. . .
0 AS updated
FROM USacc AS OF SCN 3652),
. . .
u2 AS
(SELECT . . .
28. 5. Execute Query
• Execute the query to retrieve the provenance
Updated USacc tuples (ID, Owner, Balance, Type) | Provenance from CanAcc (P1, P2, P3) | Provenance from USacc (P4, P5, P6)
(3, Alice Bright, 1,500,000, Premium) | (3, Alice Bright, 1,500,000) | (NULL, NULL, NULL)
(5, Mark Smith, 50, Standard) | (5, Mark Smith, 50) | (NULL, NULL, NULL)
29. Conclusions
• We present our vision for GProM
– a database-independent middleware for computing the provenance of queries, updates, and transactions
• First solution for provenance of transactions
• Query rewrite techniques on steroids:
– Provenance computation
– Transaction reenactment
– Provenance translation
– Provenance storage
– Optimization
• Extensible through the RSL language
30. Future Work
• Implementing additional provenance types
• Comprehensive study of heuristic and cost-based optimizations
• Design and implementation of RSL
• Implementing additional provenance formats
• Studying reenactment for other concurrency control mechanisms
– Locking protocols (2PL)
• Investigating additional use cases for reenactment
– Transaction backout
– Retroactive what-if analysis
32. References
[1] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.
[2] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005.
[3] G. Karvounarakis, T. J. Green, Z. G. Ives, and V. Tannen. Collaborative data sharing via update exchange and provenance. TODS, 38(3):19, 2013.
[4] J. Zhang and H. Jagadish. Lost source provenance. In EDBT, pages 311–322, 2010.
[5] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. In VLDB, pages 1151–1154, 2006.
[6] D. Gawlick and V. Radhakrishnan. Fine grain provenance using temporal databases. In TaPP, 2011.
[7] G. Karvounarakis and T. Green. Semiring-annotated data: Queries and provenance. SIGMOD Record, 41(3):5–14, 2012.
33. Q-Bomb
• One pattern that arises from reenactment is long chains of SELECT clauses using CASE
– Each level references attributes from the next level multiple times
– Subquery pull-up creates expressions of size exponential in the number of SELECT clauses
– In practice: optimization never finishes
• Minimal example using a one-row table R:
SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
…
FROM (SELECT CASE WHEN b < 100 THEN a ELSE a + 2 END AS a, b
FROM R) … );
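To see why pull-up explodes, substitute a single inner level into the outer CASE; this sketch shows what the optimizer materializes:

SELECT CASE WHEN b < 100
            THEN CASE WHEN b < 100 THEN a ELSE a + 2 END
            ELSE (CASE WHEN b < 100 THEN a ELSE a + 2 END) + 2
       END AS a, b
FROM R;
-- Each pulled-up level doubles the references to a, so k levels yield an
-- expression with 2^k occurrences of a.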
39. Types of Update Operations - Insert
• Insert executed at time t
• The updated version of R contains:
1. All tuples from the previous version
2. All newly inserted tuples
– a fixed tuple defined in a VALUES clause, or
– the results of a query over the database version at t
• Union these two sets (see the templates below)
INSERT INTO R VALUES (v1, ... , vn);
becomes:
(SELECT * FROM R AS OF t)
UNION ALL
(SELECT v1 AS a1, ... , vn AS an);

INSERT INTO R (q);
becomes:
(SELECT * FROM R AS OF t)
UNION ALL
(q(t));
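A concrete instance of the VALUES form, using hypothetical values over the running USacc schema:

-- INSERT INTO USacc VALUES (7, 'Jane Doe', 100, 'Standard');
-- reenacts as:
(SELECT * FROM USacc AS OF t)
UNION ALL
(SELECT 7 AS ID, 'Jane Doe' AS Owner, 100 AS Balance, 'Standard' AS Type);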
40. Types of Update Operations - Delete
• Delete executed at time t
• Tuples in the updated version of R:
– All tuples from the previous version for which the condition is not fulfilled
DELETE FROM R WHERE C;
becomes:
SELECT * FROM R AS OF t
WHERE (C) IS NOT TRUE;
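The template uses (C) IS NOT TRUE rather than NOT C on purpose: under SQL's three-valued logic, rows for which C evaluates to NULL are not deleted, so the reenactment must keep them as well. A hypothetical instance:

-- DELETE FROM USacc WHERE Balance > 1000000;
-- reenacts as:
SELECT * FROM USacc AS OF t
WHERE (Balance > 1000000) IS NOT TRUE;
-- keeps rows with Balance <= 1000000 and rows where Balance IS NULL.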
41. Types of Update Operations - Update
• Update executed at time t
• Find tuples where the condition holds and update their attribute values
• Find tuples where the condition does not hold
• Union these two sets (A' projects the SET expressions)
UPDATE R SET A WHERE C;
becomes:
(SELECT A' FROM R AS OF t WHERE C)
UNION ALL
(SELECT * FROM R AS OF t WHERE (C) IS NOT TRUE);
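Instantiated with the running example from slide 25, where A' becomes the projection that applies SET Type = 'Premium':

-- UPDATE USacc SET Type = 'Premium' WHERE Balance > 1000000;
-- becomes:
(SELECT ID, Owner, Balance, 'Premium' AS Type
 FROM USacc AS OF t WHERE Balance > 1000000)
UNION ALL
(SELECT * FROM USacc AS OF t
 WHERE (Balance > 1000000) IS NOT TRUE);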
42. READ COMMITTED
• A statement of transaction T sees committed changes of concurrent transactions
• For a given update we need to combine
– tuples produced by previous statements of the same transaction
– tuples produced by transactions that committed before the update
• Observations
– Once a transaction T modifies a tuple t, no other transaction can access t until T commits
– Let ui be the update of T, executed at time x, that first modifies t
– ui reads the latest version of t committed before x
– If we know ui, then updates of T before x do not have to look at t
• Consider the database version one time unit before the commit of T (version C-1)
– It contains all the tuple versions seen by the first update of T that modified each individual tuple
– Let t be a tuple version in this database version with start time y
– Updates of T executed before y cannot have updated t
– We can use version C-1 as input for reenactment as long as we hide tuple version t (with start time y) from the reenactment of an update executed at time x with x < y
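A hypothetical timeline makes the argument concrete (all times are illustrative):

-- t=5   transaction T'' commits version t1@5 of tuple t1
-- t=10  T executes u1 (u1 does not modify t1)
-- t=12  concurrent T' commits version t1@12 of tuple t1
-- t=15  T executes u2, the first statement of T to modify t1 -> u2 reads t1@12
-- t=20  T commits (C = 20)
-- The externally visible version of t1 at C-1 = 19 is t1@12 (T's own write is
-- still uncommitted), exactly the version u2 read. Reenacting u1 against
-- version C-1 must hide t1@12, since its start time (12) lies after u1's
-- execution time (10).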
43. READ COMMITTED
WITH u1 AS
(SELECT
  CASE WHEN Balance <= 1000000 AND version <= 0 THEN 'Standard' ELSE Type END AS Type,
  ID, Owner, Balance,
  CASE WHEN Balance <= 1000000 AND version <= 0 THEN -1 ELSE version END AS version
FROM USacc AS OF SCN 3652),
u2 AS
(SELECT
  CASE WHEN Balance > 1000000 AND version <= 1 THEN 'Premium' ELSE Type END AS Type,
  ID, Owner, Balance,
  CASE WHEN Balance > 1000000 AND version <= 1 THEN -1 ELSE version END AS version
FROM u1)
SELECT ID, Owner, Balance, Type FROM u2 WHERE version = -1;
44. Database Independence
• Encapsulate database-specific functionality in pluggable modules
• What needs to be adapted:
1) Parser
2) SQL code generator
3) Metadata access
4) Audit log access
5) Time travel activation
45. Accessing Several Tables
• Transactions accessing several tables
– We require the user to specify which table they are interested in
– Replace the access to a table with the query for the last update that modified that table
[Diagram: a chain of updates U1–U4 reading and writing tables R1, R2, and R3]