Provenance in Databases and Scientific Workflows: Part II (Databases)

Provenance in Databases
and Scientific Workflows
Bertram Ludäscher
ludaesch@illinois.edu
31st Brazilian Symposium on Databases
October 4-7, 2016, Salvador, Bahia

Outline of the Tutorial
A “Tour de Provenance”
• Part I: Provenance in Scientific Workflows
– Alta Vista: Provenance everywhere!
– Provenance & Scientific Workflows
– Provenance Models and Standards (not so much)
– Provenance Tools
• Example & Demo: YesWorkflow
• Part II: Provenance in Databases
– Foundations of provenance in databases
– Why-, How-, and Why-Not provenance
Provenance @ SBBD'16

Types of Data Provenance
• Black-box
– know (next to) nothing at compile-time
– at runtime: keep some data lineage
– most provenance work in the workflow (WF) sense uses this
• White-box
– statically (compile-time) analyzable
– q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
– most provenance work in the database (DB) sense uses this
• Grey-box
– can “look inside” (some black boxes)
– … e.g. b/c they have subworkflows
– … or FP signatures: A :: t1, t2 → t3, t4
– … or semantic annotations (sem. types)
[Figure: black box f; grey box A with inputs t1, t2 and outputs t3, t4; white box q with inputs X1, X2 and outputs Y1, Y2]

6th Stop: Provenance in Databases
• Some key questions:
– Why is tuple t in answer to query q(D)?
– Which set of tuples L in D does t depend on?
i.e., what is the lineage L of t ?
– How was t derived from its lineage L ?
• Also:
– Where in D do the values in t come from?
– Why is t’ not in q(D)?
• .. fasten your seatbelts …
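The "why" and "lineage" questions above can be made concrete with a small sketch. It evaluates the white-box rule q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2) from the earlier slide and records, for each answer tuple, the sets of input tuples that jointly derive it. The toy relations and the helper names `q_with_witnesses` and `lineage` are assumptions made for illustration; they do not come from the tutorial.

```python
from collections import defaultdict

# Hypothetical toy database D (illustrative values, not from the slides).
p = {(1, 2), (3, 2)}
r = {(1, "u"), (3, "u")}
s = {(2, "v")}

def q_with_witnesses(p, r, s):
    """Evaluate q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2).

    Returns a dict mapping each answer tuple to its set of witnesses,
    where a witness is the set of input tuples used by one derivation.
    """
    witnesses = defaultdict(set)
    for (x1, x2) in p:
        for (rx, y1) in r:
            if rx != x1:
                continue
            for (sx, y2) in s:
                if sx != x2:
                    continue
                witness = frozenset({("p", (x1, x2)), ("r", (rx, y1)), ("s", (sx, y2))})
                witnesses[(y1, y2)].add(witness)
    return dict(witnesses)

def lineage(witnesses, t):
    """Lineage of t: all input tuples that occur in some witness of t."""
    return set().union(*witnesses[t]) if t in witnesses else set()

answers = q_with_witnesses(p, r, s)
for t, ws in answers.items():
    print(t, "has", len(ws), "derivations; lineage size", len(lineage(answers, t)))
```

In this reading, the witnesses answer "how was t derived", their union answers "which tuples does t depend on", and an empty lineage signals a tuple that is not in q(D).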

Land of many different provenance species:
Why? How? Where?
Later: Why-Not? How many? How long?

(fine-grained, white-box)

Compare with:
Provenance in Scientific Workflows
• Some key questions:
– What is the lineage/trace T of data product (output) yi:
(y1 …, yn ) = execute(W, x, p) ?
• … given workflow/script W with inputs x and parameters p ?
• … i.e., find subset of x, p, and (program slices of) W on which a specific yi
depends!
– How can we store, query the provenance (trace) graph
effectively, efficiently?
• Regular Path Queries (RPQs), Lowest Common Ancestor (LCA)
• Temporal Query Languages (e.g. Past-Temporal Logic)
• other graph queries
– What is the difference between traces T1, T2?
– Does the trace (retrospective provenance) match the workflow
(prospective provenance)?

Provenance in (Scientific) Workflows
(“Coarse-grained”, “Black-box”)

What people do with “provenance”
• Which one is “workflows” vs “databases” ?
– Result validation
– Result debugging (science vs wf logic)
– Reproducibility and Repeatability
– Explanation (derivations, traces, proof trees)
– Runtime monitoring
• Profiling, benchmarking
– Performance Optimization (“smart re-run”)
– Fault-tolerance, crash-recovery
– Database view maintenance (e.g. data warehousing)
– Workflow design

Database Provenance: Some Pioneers …
Cui (PhD 2001), Widom:
TODS’00, VLDB’03

Database Provenance: Some Pioneers
Buneman et al.
ICDT 2001
(citations: 1000+)

Provenance Semirings: The Great Database Provenance Unification*!
TJ Green et al: PODS’07, SIGMOD Record’12
*Restrictions apply: positive queries only …

7th Stop: Provenance Polynomials
One Semiring to Rule them all!
(Theory strikes!)
Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007

Example: Go from X to Y in 3 hops!
(a = CS b = NCSA c = iSchool)
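• Database: hop(X,Y) :=
hop(a,a, p).
hop(a,b, q).
hop(b,a, r).
hop(b,c, s).
• Query: 3hop(X,Y) :-
hop(X, Z1), hop(Z1, Z2), hop(Z2,Y).
• Result, annotated with provenance polynomials:
3hop(a,a, p^3 + 2pqr).
3hop(a,b, p^2 q + q^2 r).
…
3hop(a,c, pqs).
Note: Cannot go from c to a in 3 hops!
[Figure: input graph with edges a→a (p), a→b (q), b→a (r), b→c (s); output graph with 3hop edges a→a (ppp+pqr+qrp), a→b (ppq+qrq), a→c (pqs), b→a (ppr+qrr), b→b (rpq), b→c (rqs)]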
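The annotated 3hop results can be recomputed mechanically. The sketch below enumerates all 3-hop paths over the four annotated edges from the slide and collects one monomial per path; a polynomial is stored as a `Counter` from monomials to coefficients. The representation and the function name `three_hop` are choices made for this illustration, not something the tutorial prescribes.

```python
from collections import Counter
from itertools import product

# Annotated edges from the slide: hop(a,a,p), hop(a,b,q), hop(b,a,r), hop(b,c,s).
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}

def three_hop(hop):
    """Return {(x, y): Counter({monomial: coefficient})} for all 3-hop paths.

    A monomial is the sorted string of edge labels along one path,
    so the paths p.q.r and q.r.p both contribute to the monomial "pqr".
    """
    nodes = sorted({n for edge in hop for n in edge})
    result = {}
    for x, z1, z2, y in product(nodes, repeat=4):
        path = [(x, z1), (z1, z2), (z2, y)]
        if all(edge in hop for edge in path):
            monomial = "".join(sorted(hop[edge] for edge in path))
            result.setdefault((x, y), Counter())[monomial] += 1
    return result

polys = three_hop(HOP)
for (x, y), poly in sorted(polys.items()):
    terms = " + ".join(f"{c if c > 1 else ''}{m}" for m, c in sorted(poly.items()))
    print(f"3hop({x},{y}): {terms}")
```

The pair (c, a) never shows up in the result, which matches the note that one cannot go from c to a in three hops.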

Input: hop(S,T)
hop(a,a).
hop(a,b).
hop(b,a).
hop(b,c).
Three-hop query:
thop(S,T) :-
hop(S,U), hop(U,V), hop(V,T).
Output: thop(S,T)
thop(a,a).
thop(a,b).
thop(a,c).
thop(b,a).
thop(b,b).
thop(b,c).

Annotated input: hop(S,T)
hop(a,a, p).
hop(a,b, q).
hop(b,a, r).
hop(b,c, s).
Rewritten three-hop query:
thop(S,T, P1*P2*P3) :-
hop(S,U, P1), hop(U,V, P2), hop(V,T, P3).
Annotated output: thop(S,T)
thop(a,a, p^3 + 2pqr).
thop(a,b, p^2 q + q^2 r).
thop(a,c, pqs).
thop(b,a, p^2 r + r^2 q).
thop(b,b, rpq).
thop(b,c, rqs).
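The rewritten query multiplies annotations along a derivation (P1*P2*P3) and implicitly adds them across alternative derivations. One way to see why this is called a semiring framework is to plug in different "plus" and "times" operations. The sketch below does this for the counting semiring and for a min-plus (cost) semiring; the edge costs 1, 2, 3, 4 are made-up values, and `thop` is a hypothetical helper written for this illustration.

```python
import operator
from functools import reduce
from itertools import product

EDGES = [("a", "a"), ("a", "b"), ("b", "a"), ("b", "c")]  # labeled p, q, r, s on the slides

def thop(annot, plus, times, zero):
    """Evaluate thop(S,T,P) over edge annotations `annot`.

    `times` combines the three annotations along one path (joint use),
    `plus` combines alternative paths for the same (S,T) pair.
    """
    nodes = sorted({n for edge in annot for n in edge})
    out = {}
    for s, u, v, t in product(nodes, repeat=4):
        path = [(s, u), (u, v), (v, t)]
        if all(edge in annot for edge in path):
            value = reduce(times, (annot[edge] for edge in path))
            out[(s, t)] = plus(out.get((s, t), zero), value)
    return out

# Counting semiring (N, +, *): every edge counts once, the result is the number of paths.
counts = thop({e: 1 for e in EDGES}, operator.add, operator.mul, 0)

# Min-plus semiring: annotations are (assumed) edge costs, the result is the cheapest path.
costs = dict(zip(EDGES, [1, 2, 3, 4]))
cheapest = thop(costs, min, operator.add, float("inf"))

print(counts[("a", "a")], cheapest[("a", "a")])
```

Setting p = q = r = 1 in p^3 + 2pqr gives 3, which is the path count computed for thop(a,a) here.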

Input: hop
Three-Hop Query:
thop(S,T) :-
hop(S,U), hop(U,V), hop(V,T).
Output: thop

Annotated Input: hop
Rewritten Three-Hop Query:
thop(S,T, P) :-
hop(S,U, P1), hop(U,V, P2), hop(V,T, P3),
P = P1*P2*P3 .
Annotated Output: thop
[Figure: annotated input graph (edges p, q, r, s) and annotated output graph (edges labeled ppp+pqr+qrp, ppq+qrq, pqs, ppr+qrr, rpq, rqs)]

Provenance Polynomials
“My precious!”
[Figure: the polynomial p^3 + 2pqr for 3hop(a,a) shown together with coarser forms p^3 + pqr, p + 2pqr, p + pqr, pqr, and p; annotated 3hop graph as before]

8th Stop: The Negation & Why-Not Problem
• Provenance Semirings work well for:
– Positive Queries (e.g., RA+ )
• Challenges: Handling of
– set difference (~ negation)
– Why-Not provenance
– Missing Answer provenance
• A fresh look at provenance!
• … using an old idea: Game semantics!
– for query evaluation

Query evaluation game
EDB: e(a,b), e(b,b)
tc(X,Y) :- e(X,Y)   # (1)--e(X,Y)-->(2)
tc(X,Y) :-          # (1)--exists:Z-->(3)
e(X,Z),             # (3)->(4)-e(X,Z)->(5)
tc(Z,Y).            # (3)--X:=Z-->(1)
[Figure: game diagram with positions 1 to 5 and moves e(X,Y), exists:Z, e(X,Z), X := Z; instantiated move graph with positions such as 1:(a,b), 2:(a,b), 3:(a,b,b), 4:(a,b), 5:(a,b)]
Provenance’12 @Dagstuhl
with JanVdB, TJ Green
Flum, Kubierschky, Ludäscher, Total and partial well-founded
Datalog coincide, ICDT-The-Bag-1997, Delphi, Greece
Eureka!

EDB: e(a,b), e(b,b)
[Figure: game diagram (positions 1 to 5) and instantiated move graph for the tc query]
Eureka moment:
1. query evaluation = evaluation game (argument about truth in a database)
2. provenance = winning strategies (justified/winning arguments)

9th Stop: A Game
[Figure: game graph with positions a, b, c, d, e, f, g, h, k, l, m, n]

Solving the Game
• All successors won => position lost
• Some successor lost => position won
• All leaves (dead-ends) are immediately lost!
• X is won if there exists a move to a lost Y
• X is lost if all moves lead to a won Y
• Repeat until no change => drawn positions remain
[Figure: game graph with positions a, b, c, d, e, f, g, h, k, l, m, n, colored step by step]
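The solving procedure on these slides can be sketched as a simple fixpoint loop. The move graph below is a small made-up example, since the edges of the slide's graph are not recoverable from the transcript, and `solve_game` is a hypothetical helper name. The loop is a naive version that rescans all positions in each round; it is not the linear-time algorithm.

```python
def solve_game(moves):
    """Label every position as 'won', 'lost', or 'drawn' for the player to move.

    `moves` maps a position to the list of positions it can move to.
    """
    positions = set(moves) | {y for ys in moves.values() for y in ys}
    succ = {x: set(moves.get(x, ())) for x in positions}
    status = {}
    changed = True
    while changed:  # repeat until no change
        changed = False
        for x in positions:
            if x in status:
                continue
            if any(status.get(y) == "lost" for y in succ[x]):
                status[x] = "won"    # some successor is lost
                changed = True
            elif all(status.get(y) == "won" for y in succ[x]):
                status[x] = "lost"   # all successors won (also true for dead ends)
                changed = True
    # Whatever is still unlabeled at the fixpoint is drawn.
    return {x: status.get(x, "drawn") for x in positions}

# Hypothetical move graph: a chain a -> b -> c, a 2-cycle d <-> e, and f with two options.
MOVES = {"a": ["b"], "b": ["c"], "c": [], "d": ["e"], "e": ["d"], "f": ["a", "d"]}
solved = solve_game(MOVES)
print(sorted(solved.items()))
```

Here c is a dead end and therefore lost, b wins by moving to c, a loses because its only move leads to a won position, and d and e stay drawn because neither can ever be forced.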

10th Stop: Game Provenance
[Figure: solved game graph over positions a, b, c, d, e, f, g, h, k, l, m, n, with edge labels 1, 2, 3, and oo]
• Game can be solved in time linear in |Move|
• One rule to rule them all!
win(X) :- move(X,Y), not win(Y)
• node color => edge color
– good vs bad moves
• good moves = natural, new notion of provenance!
Aside: Games ~ Argumentation Frameworks
win(X) :- move(X,Y), not win(Y)
def(X) :- attacks(Y,X), not def(Y)
Eureka!

Game Provenance
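Move types (row: value of the position moved from; column: value of the position moved to):
from \ to | W | D | L
W | bad | bad | winning
D | bad | drawing | n/a
L | delaying | n/a | n/a
Extracting Provenance (G = green/winning move, R = red/delaying move, Y = drawing move):
✓ Why/how win(x)?
• [x] –G.(R.G)*–> [y]
✓ Why-not win(x)?
• [x] –(R.G)*–> [y]
• [x] –(Y+)–> [y]
[Figure: solved game graph over positions a, b, c, d, e, f, g, h, k, l, m, n, with edge labels 1, 2, 3, and oo]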
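The move-type table translates directly into a lookup on the values of the two endpoints of a move. The sketch below assumes that a solved game is already available as a `status` dictionary (here a made-up example) and extracts the subgraph reachable from a position through good moves only. The names `classify` and `provenance` are hypothetical helpers for this illustration.

```python
# Good move types, keyed by (value of source position, value of target position).
GOOD = {
    ("won", "lost"): "winning",
    ("lost", "won"): "delaying",
    ("drawn", "drawn"): "drawing",
}

def classify(moves, status):
    """Label each move as winning, delaying, drawing, or bad.

    The three n/a combinations cannot occur in a correctly solved game;
    if they did, this sketch would also report them as 'bad'.
    """
    return {
        (x, y): GOOD.get((status[x], status[y]), "bad")
        for x, ys in moves.items()
        for y in ys
    }

def provenance(x, moves, status):
    """Edges reachable from x when only good (non-bad) moves are followed."""
    labels = classify(moves, status)
    seen, stack, edges = {x}, [x], set()
    while stack:
        u = stack.pop()
        for v in moves.get(u, ()):
            if labels[(u, v)] != "bad":
                edges.add((u, v))
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return edges

# Hypothetical solved game (same shape as a chain, a 2-cycle, and a choice node).
MOVES = {"a": ["b"], "b": ["c"], "c": [], "d": ["e"], "e": ["d"], "f": ["a", "d"]}
STATUS = {"a": "lost", "b": "won", "c": "lost", "d": "drawn", "e": "drawn", "f": "won"}

labels = classify(MOVES, STATUS)
print(labels[("f", "a")], labels[("f", "d")], provenance("f", MOVES, STATUS))
```

For the won position f, the move to the drawn position d is bad and is dropped, so the provenance of f is the alternating winning/delaying/winning path f, a, b, c.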

Game Provenance
• Next: play a query evaluation game
• => new why-(not) provenance via games!

11th Stop: Provenance (or Query
Evaluation) Games Construction
“SLD-resolution game”
Next (Example):
A(X) :– B(X,Y,Z) … not C(X,Y) …
Eureka!

Translation: Q(I) => G_Q(I)
[Figure (a): Game template for Q_ABC: A(X) :- B(X,Y), ¬C(Y).]
[Figure (b): Instantiated Q_ABC game on I = {B(a,b), B(b,a), C(a)}.]

Solve G_Q(I) => Provenance!
[Figure 3: Provenance game for Q_ABC. The well-founded model of win(X) :- M(X,Y), ¬win(Y), applied to move graph M, solves the game. (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not).]
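To make the translation tangible, the following sketch builds an instantiated game for Q_ABC: A(X) :- B(X,Y), ¬C(Y) over I = {B(a,b), B(b,a), C(a)} and solves it with the single win rule. The node layout (relation, rule, goal, and negated-relation positions) is a simplified reading of the figure captions, so the exact edge set is an assumption; the position names such as "r2" and "notB" are labels chosen for this illustration. Because this particular game graph is acyclic, a memoized recursion is enough to evaluate win(X) :- move(X,Y), not win(Y).

```python
from functools import lru_cache

# Database instance from the slide: I = {B(a,b), B(b,a), C(a)}.
I = {"B": {("a", "b"), ("b", "a")}, "C": {("a",)}}
DOM = ["a", "b"]  # active domain

def build_game(I, dom):
    """Build the move graph for A(X) :- B(X,Y), not C(Y) (assumed node layout)."""
    moves = {}

    def add(u, v):
        moves.setdefault(u, []).append(v)
        moves.setdefault(v, [])

    for x in dom:
        for y in dom:
            add(("A", x), ("r2", x, y))           # pick a rule instance as evidence
            add(("r2", x, y), ("g1", x, y))       # opponent doubts goal B(x,y)
            add(("r2", x, y), ("g2", y))          # ... or doubts goal not C(y)
            add(("g1", x, y), ("notB", x, y))
            add(("notB", x, y), ("B", x, y))
            if (x, y) in I["B"]:
                add(("B", x, y), ("rB", x, y))    # fact node: a dead end
    for y in dom:
        add(("g2", y), ("C", y))
        if (y,) in I["C"]:
            add(("C", y), ("rC", y))
    return moves

def solve(moves):
    """win(X) :- move(X,Y), not win(Y), evaluated recursively (acyclic graphs only)."""
    @lru_cache(maxsize=None)
    def win(x):
        return any(not win(y) for y in moves[x])
    return {x: win(x) for x in moves}

won = solve(build_game(I, DOM))
print("A(a) won:", won[("A", "a")], "| A(b) won:", won[("A", "b")])
```

Under these assumptions the position A(a) is won and A(b) is lost, in line with the caption's statement that A(a) is true and A(b) is false.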

Happy End (1 of 3)
Provenance Game on G_Q(I)
= Provenance Polynomials
… for positive queries!
Yes!
(a) input I ... [Figure: graph with edges a→a (p), a→b (q), b→a (r), b→c (s)]
(b) ... annotated.
hop
a a p
a b q
b a r
b c s
(c) 3hop with provenance.
3hop
a a p^3 + 2pqr
a b p^2 q + q^2 r
a c pqs
b a p^2 r + qr^2
b b pqr
b c qrs
(d) The game provenance of 3hop(a, a) ... [Figure: solved game graph rooted at 3hop(a, a)]
(e) ... is p^3 + 2pqr. [Figure: operator tree with + and × nodes over leaves p, q, r]
Figure 1: Each edge hop(x, y) in the input graph I in (a) is annotated …

Happy End (2 of 3)
… but also works for Why-Not provenance & non-monotonic
queries (i.e., Q can have negation) !!
Here: not 3hop(c,a): can’t go back from iSchool (c) to CS (a)
[Figure 2: Why-not provenance for 3hop(c, a) using provenance games.]
… g_1^i in the body of r1, thus claiming that g_1^i is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that g_1^i is true by selecting a rule instance or fact as evidence for g_1^i. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it …

Happy End (2 of 3)
5 leaf nodes ~ 5 missing
(“hypothetical”) edges
Insert those
=> 3hop(c,a) will be true!
=> What-If provenance!
[Figure 2 (repeated): why-not provenance game for 3hop(c, a).]
[Figure 7: Input graph I with five additional, hypothetical edges (dashed), labeled t, u, v, w, x.]
… cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a regular path query over a solved game graph like the one in Fig. 1d: e.g., p^3 + 2pqr for 3hop(a, a) is represented by a solved game as shown in Fig. 1e: for positive queries, solved games represent semiring provenance by noting that won (green) and lost (red) positions correspond to “+” and “×” operations, respectively (leaves represent input annotations, here: p, q, r, s) [KLZ13].
… with labels t, u, v, w, and x. These missing edges correspond to the failed leaf nodes in Fig. 2. The table in Fig. 6 contains the why-not provenance, with different combinations of missing edges as preconditions for a derivation of 3hop(c, a).
… domain, also captures the rule non-satisfaction of an infinite set of possible variable bindings to elements possibly outside the active domain. Any constraint that has a variable that is only disequality constrained represents an infinite set of firings. Consider the … node: R1: X ≠ a, X ≠ b, Z1 = a, Z2 = a, Y = a. This corresponds to the (hypothetical) 3hop path c –t→ a –p→ a –p→ a and the situation in which the edge t exists (see first row of Fig. 6). However, it … explains why the rule firing d → a → a → a is not successful. The explanation is the failure of the first goal of the rule. In the case of X = c, it represents that there are no outgoing edges from c; in the case of X = d or any other invented value this is trivially true.
This shows that constraint provenance games do not suffer from the same problems as their fully-grounded counterparts. Provenance can be queried for any imaginable tuple, including one not in the active domain, and the provenance presented is still correct in the presence of a growing active domain.
r1(X, Y, Z1, Z2) [Fig. 2] | X → Z1 → Z2 → Y [Fig. 7] | Why-Not Provenance
r1(c, a, a, a) | c –t→ a –p→ a –p→ a | t ⇒ t·p·p
r1(c, a, a, b) | c –t→ a –q→ b –r→ a | t ⇒ t·q·r
r1(c, a, a, c) | c –t→ a –u→ c –t→ a | t, u ⇒ t·u·t
r1(c, a, c, a) | c –v→ c –t→ a –p→ a | t, v ⇒ v·t·p
r1(c, a, b, c) | c –w→ b –s→ c –t→ a | t, w ⇒ w·s·t
r1(c, a, c, c) | c –v→ c –v→ c –t→ a | t, v ⇒ v·v·t
r1(c, a, c, b) | c –v→ c –w→ b –r→ a | v, w ⇒ v·w·r
r1(c, a, b, a) | c –w→ b –r→ a –p→ a | w ⇒ w·r·p
r1(c, a, b, b) | c –w→ b –x→ b –r→ a | w, x ⇒ w·x·r
Figure 6: The nine r1-instances in the first column correspond to … in Fig. 2 from left to right. The 3hop-path is shown in the second col…
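The what-if reading of the Fig. 6 table can be reproduced with a few lines of code. The sketch below takes the four existing edges (p, q, r, s) and the five hypothetical edges (t, u, v, w, x), with their endpoints read off the paths in the table, and lists for every rule instance r1(c, a, Z1, Z2) which missing edges would have to be inserted and which monomial the resulting path would carry. The function name `why_not_3hop` is a hypothetical helper for this illustration.

```python
from itertools import product

# Existing annotated edges of the input graph I.
EXISTING = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}
# Hypothetical (missing) edges; endpoints are read off the paths in the Fig. 6 table.
HYPOTHETICAL = {("c", "a"): "t", ("a", "c"): "u", ("c", "c"): "v", ("c", "b"): "w", ("b", "b"): "x"}

def why_not_3hop(src, dst, nodes=("a", "b", "c")):
    """For each (Z1, Z2), return (missing edges needed, path monomial)."""
    labels = {**EXISTING, **HYPOTHETICAL}
    rows = {}
    for z1, z2 in product(nodes, repeat=2):
        path = [(src, z1), (z1, z2), (z2, dst)]
        if all(edge in labels for edge in path):
            needed = frozenset(labels[e] for e in path if e in HYPOTHETICAL)
            monomial = "·".join(labels[e] for e in path)
            rows[(z1, z2)] = (needed, monomial)
    return rows

rows = why_not_3hop("c", "a")
for (z1, z2), (needed, monomial) in sorted(rows.items()):
    print(f"r1(c, a, {z1}, {z2}): insert {sorted(needed)} => {monomial}")
```

Each printed row corresponds to one line of the table: inserting the listed hypothetical edges would make that rule instance fire and hence make 3hop(c, a) true.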

Are there more ways to fail?
Two branches that explain
Why-not A(b)
Adding a new constant c to the
domain => new why-not answer!
[Figure 3: Provenance game for Q_ABC. The well-founded model of win(X) :- M(X,Y), ¬win(Y), applied to move graph M, solves the game.]
[Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.]

Happy End (3 of 3) … sort of …
Why-not provenance
complete only for
adom(I) = { a, b } !
Constraint why-not provenance
also captures new constants, i.e.,
for an unlimited domain
D = { a, b, c, … }
=> Constraint Provenance answer is
domain independent! (sort of)
[Figure 3: Provenance game for Q_ABC: (a) game template for A(X) :- B(X,Y), ¬C(Y); (b) instantiated game on I = {B(a,b), B(b,a), C(a)}; (c) solved game.]
[Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.]
[Figure 5: Constraint provenance game for Q_ABC. Unlike in Figure 3, nodes may represent finite or infinite sets here.]
… the new binding for X; a condition “B(X, Y)” means that a move is possible only if B(X, Y) is true in I for the current X, Y values. Given database I, a template can be instantiated yielding a game graph G_Q(I) as in Fig. 3b. Note how template variables (e.g., Y) have been replaced by domain values (a or b), and that conditional edges (e.g., labeled “C(X)”) became unconditional edges (e.g., C(a) → rC(a)) or no edge at all (e.g., from C(b)), depending on whether or not the condition holds in I. To extract why(-not) provenance from a game graph G_Q(I) as in Fig. 3b, we need to solve the game first, i.e., determine which positions are won (light …
… G_Q(I) thus consists only of edges that are matched by the regular path queries (g.r)+ and r.(g.r)*, i.e., alternating sequences of green (winning) and red (delaying) moves [KLZ13].
3. Constraint Provenance Games
Consider the solved game graph of Fig. 3c. If the value c were added to the active domain, the provenance would be incomplete: e.g., to explain why-not A(b) there are two ∃a, ∃b branches emanating from A(b). However, with c in the active domain there is a third ∃c branch via r2(b, c): see Fig. 4. We show that a modified …

Why-Not: The Full Story Emerges…
(sort of…)
5 missing edges
9 minimal combinations
+ … ?
Constraints imply
15 disjoint relations over
key variables X, Z1, Z2, Y
[Figure 9: The why-not provenance of 3hop(c, a). The provenance is represented in the failure of the claim that 3hop(c, a) is in the answer. This is argued over the Boolean expression defining 3hop(x, y). A move from the source node to a child represents the choice of a Boolean expression that is sufficient to …]
[Figure 2 (repeated): why-not provenance game for 3hop(c, a).]
[Figure 7: Input graph I with five additional, hypothetical edges …]
Why-Not Provenance and the Many Ways to Fail. Since games are inherently symmetric (one player’s win is the opponent’s loss and vice versa), the approach yields an elegant provenance model that unifies why and why-not provenance. Consider the (dark, red) node 3hop(c, a) in Fig. 2. The color coding indicates that the posi…
Constraint Provenance Games. We propose to solve the problem of domain dependence by modifying provenance games … that they can handle certain infinite relations that can be … represented. For example, in addition to the finitely many … why 3hop(c, a) fails over the active domain adom(I), there are infinitely many others, if we consider new constants d, e, . . . … of adom(I). For example, let relation R = {a, b} have two … R(a) and R(b). If we want to know why-not R(c), we just … c ∉ R. But we could also return a more general answer for why-not R(x) and say that ¬R(x) is true for all x with x ≠ a ∧ x ≠ b (not just for x = c). This approach is inspired by Chan’s Constructive Negation [Cha88], a form of constraint logic programming […]. The key idea is to represent (potentially infinite) relations … constraints, i.e., Boolean combinations of equalities x = c and disequalities x ≠ c.
Overview and Contributions. Section 2 briefly explains how first-order queries are translated into games and how provenance is extracted from solved games. In Section 3 we describe the construction of constraint provenance games; additional details and examples are contained in the appendix. Our main contributions …: (i) game provenance provides a uniform treatment of why and why-not provenance for first-order logic (= relational algebra with set-difference); (ii) for positive queries, the approach captures the … informative semiring provenance [GKT07, KG12]; (iii) we … a constraint provenance framework which yields domain independent provenance expressions, extending prior results [KLZ13]; (iv) we implemented a prototype of constraint provenance games …
A. Why-Not 3hop(c, a) Dissected
Consider the input graph in Fig. 1a and its why-not … for 3hop(c, a) in Fig. 2. The graph encodes the re… 3hop(c, a) is not in the answer. Moving from the lost 3hop(c, a) … Fig. 2, there are nine possible rule instantiations r1(c, a, …) … of which represent a reason why there is no 3hop(c, a) … intermediate nodes z1, z2 ∈ {a, b, c}. To better understand th… explanations, consider the input graph in Fig. 7. It contains the original database instance I plus a number of hypothetical edges (dotted), with labels t, u, v, w, and x. These missing edges correspond to the failed leaf nodes in Fig. 2. The table in Fig. 6 contains the why-not provenance, with different combinations of missing edges as preconditions for a derivation of 3hop(c, a).
B. Constraint Game Construction
Consider the query Q_ABC. To build the game, each ground tuple in the program such as B(a, b) is replaced by … B: x1 = a, x2 = b (a conjunction).
First, the subgraph for EDB predicates is created. T… of the game is constructed iteratively similar to quer… For rules whose subgoals are all on EDB predicat… nodes/edges are generated. For IDB predicates that … the head of EDB-only rules, tuple nodes are generate…
Provenance Games: Summary
• (1) Game Provenance
– The win-move game has a natural why and why-not provenance “built-in”
• “good” and “bad moves”
• => discard bad moves => game provenance
• (2) Provenance Games
– Query evaluation also is a game!
– Game provenance can be applied to query evaluation game
=> uniform why + why-not provenance
• (3) Constraint Provenance
– Domain independent (some infinite domains OK)
– Prototypically implemented
• (4) Future Work
– Make theory practical!
• e.g. implement in Boris Glavic’s Perm or GProM system
– Theoretical properties
– Relation to Argumentation Frameworks
– Clarify relationship to monus semirings (Floris Geerts et al)
– Higher-order reasons!

Why-Not: so many answers, so little time
• The crux of current why-not approaches:
– Enumerate all ways that could/might have worked, but failed…
• Idea => abstract those many, many explanations!
TaPP’15

Conclusions
• Provenance is an important, active, and broad area of
research in databases and scientific workflows
– Both in specialized (TaPP, IPAW) and mainstream venues (SBBD,
VLDB, SIGMOD, EDBT, ICDE, PODS, ICDT, ..)
• There are (still) many deeply technical and practical challenges:
– Efficient capture, management, use of provenance
– Models, semantics, query languages
– Provenance .. for others? Or provenance for self!
– Interdisciplinary work; cross-fertilization: databases, workflows,
programming languages, security, …, various scientific communities
(bioinformatics, ...)
• … oh, and it’s also a lot of fun!
– Interested to join?
– ludaesch@illinois.edu

Why-Not Provenance References
• [KLZ13] Köhler, Sven, Bertram Ludäscher, and Daniel Zinn. “First-order provenance games.” In Search of Elegance in the Theory and Practice of Computation. Peter Buneman Festschrift, LNCS 8000. Springer Berlin Heidelberg, 2013.
• Riddle, Sean, Sven Köhler, and Bertram Ludäscher. “Towards constraint provenance games.” 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2014).
• Glavic, Boris, Sven Köhler, Sean Riddle, and Bertram Ludäscher. “Towards constraint-based explanations for answers and non-answers.” 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2015).

Other References (Part I, II)
• (coming soon)
