DAIS Seminar: The Many Faces of Provenance in Databases and Workflows


DAIS Seminar, Feb 9, 2016.
(updated slides)

Published in: Data & Analytics

DAIS Seminar: The Many Faces of Provenance in Databases and Workflows

  1. Bertram Ludäscher, ludaesch@illinois.edu
     Director, Center for Informatics Research in Science & Scholarship (CIRSS)
     Graduate School of Library and Information Science (GSLIS)
     National Center for Supercomputing Applications (NCSA)
     Department of Computer Science (CS@UIUC)
     The Many Faces of Provenance in Databases & Workflows
  2. Outline
     •  Provenance - Alta Vista
     •  Provenance in Databases
        –  Brief literature overview/walk-through (disclaimer..)
     •  Provenance in Scientific Workflows
        –  … and Computational Reproducibility
        –  … Reproducibility PDbF Demo (Jens Dittrich)
     •  Contact: ludaesch@illinois.edu
        –  Office: 4032 NCSA or 319 GSLIS
     Bertram Ludäscher, DAIS Seminar 2016-02-09, CS @ UIUC
  3. Introductions should come first! What is “provenance”?
     •  Oxford English Dictionary:
        –  The place of origin or earliest known history of something:
           •  an orange rug of Iranian provenance
        –  The beginning of something’s existence; its origin:
           •  they try to understand the whole universe, its provenance and fate
        –  A record of ownership of a work of art or an antique, used as a guide to authenticity or quality:
           •  the manuscript has a distinguished provenance
     (Image: Kicking Bird by Shahin Gholizadeh)
  4. The Many Faces of Provenance
     •  Cosmology
     •  Geology, Stratigraphy
     •  Phylogeny
        –  the Tree of Life
     •  Genealogy
        –  your family: literally
     •  Academic Pedigree
        –  “Doktorvater”
     •  Etymology
     •  Chain of custody
        –  of art(ifacts)
     •  … origins, history …
  5. Provenance everywhere …
  6. Computational Provenance
     •  Origin and processing history of an artifact
        –  usually: data (products), figures, ...
        –  sometimes: workflow (and script) evolution …
     •  Different sub-communities:
        –  Provenance in databases
        –  Provenance in (scientific) workflows
        –  ... programming languages, systems/security, …
  7. Provenance in Databases
     •  Some key questions:
        –  Why is t in q(D)?
        –  Which set of tuples L in D does t depend on? I.e., what is the lineage of t?
        –  How was t derived from its lineage L?
     •  Also:
        –  Where in D do the values in t come from?
        –  Why is t’ not in q(D)?
  8. Provenance in Databases (“fine-grained”, “white-box”)
  9. Provenance in Databases (fine-grained, white-box)
  10. Provenance in Scientific Workflows
      •  Some key questions:
         –  What is the lineage/trace T of data product (output) yi, where (y1, …, yn) = execute(W, x, p)?
            •  … given workflow/script W with inputs x and parameters p
            •  … i.e., find the subset of x, p, and (program slices of) W on which a specific yi depends!
         –  How can we store and query the provenance (trace) graph effectively and efficiently?
            •  Regular Path Queries (RPQs), Lowest Common Ancestor (LCA)
            •  Temporal Query Languages (e.g. Past-Temporal Logic)
            •  other graph queries
         –  What is the difference between traces T1, T2?
         –  Does the trace (retrospective provenance) match the workflow (prospective provenance)?
  11. Provenance in (Scientific) Workflows (“Coarse-grained”, “Black-box”)
  12. What people do with “provenance”
      •  Result validation
      •  Result debugging (science vs wf logic)
      •  Reproducibility and Repeatability
      •  Explanation (derivations, traces, proof trees)
      •  Runtime monitoring
         –  Profiling, benchmarking
      •  Performance Optimization (“smart re-run”)
      •  Fault-tolerance, crash-recovery
      •  Database view maintenance (e.g. data warehousing)
      •  Workflow design
  13. Database Provenance: Some Pioneers …
      Cui (PhD 2001), Widom: TODS’00, VLDB’03
  14. Database Provenance: Some Pioneers
      Buneman et al., ICDT 2001 (citations: 1000+)
  15. Provenance Semirings: The Great Unification!
      TJ Green et al.: PODS’07, SIGMOD Record’12
  16. Provenance Polynomials: One Semiring to Rule them all! (Theory strikes!)
      Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007
  17. Example: Go from X to Y in 3 hops!
      (a = CS, b = NCSA, c = GSLIS)
      •  Database: hop(X,Y), annotated with p, q, r, s:
            hop(a,a, p).
            hop(a,b, q).
            hop(b,a, r).
            hop(b,c, s).
      •  Query:
            3hop(X,Y) :- hop(X, Z1), hop(Z1, Z2), hop(Z2,Y).
      •  Result:
            3hop(a,a, p^3 + 2pqr).
            3hop(a,b, p^2 q + q^2 r).
            …
            3hop(a,c, pqs).
      Note: Cannot go from c to a in 3 hops!
      (Figure: input graph with edges a->a (p), a->b (q), b->a (r), b->c (s); result graph with 3hop edges a->a: ppp+pqr+qrp, a->b: ppq+qrq, a->c: pqs, b->a: ppr+qrr, b->b: rpq, b->c: rqs.)
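The 3-hop example can be checked mechanically. The following is a minimal Python sketch, not the method used in the talk: it represents a provenance polynomial as a `Counter` from monomials to coefficients, multiplies annotations along each 3-hop path, and adds up alternative paths. The names `HOP`, `three_hop`, and `show` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter
from itertools import product

# Annotated edges from the slide: hop(X, Y, annotation).
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}
NODES = ["a", "b", "c"]


def three_hop(x, y):
    """Provenance polynomial of 3hop(x, y) as a Counter: monomial -> coefficient.

    A monomial is a sorted tuple of edge annotations, e.g. ('p', 'q', 'r').
    Joining edges multiplies annotations; alternative paths are added up.
    """
    poly = Counter()
    for z1, z2 in product(NODES, repeat=2):
        path = [(x, z1), (z1, z2), (z2, y)]
        if all(edge in HOP for edge in path):
            monomial = tuple(sorted(HOP[edge] for edge in path))
            poly[monomial] += 1
    return poly


def show(poly):
    """Render a polynomial such as {('p','p','p'): 1, ('p','q','r'): 2} as text."""
    terms = []
    for monomial, coeff in sorted(poly.items()):
        powers = Counter(monomial)
        body = "".join(v if n == 1 else f"{v}^{n}" for v, n in sorted(powers.items()))
        terms.append(body if coeff == 1 else f"{coeff}{body}")
    return " + ".join(terms) if terms else "0"


if __name__ == "__main__":
    for x, y in [("a", "a"), ("a", "b"), ("a", "c"), ("c", "a")]:
        print(f"3hop({x},{y}):", show(three_hop(x, y)))
```

An empty polynomial (rendered as `0`) for `3hop(c,a)` corresponds to the slide's note that c cannot reach a in three hops.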
  18. Provenance Polynomials
      „Mein Schatz!”
      (Figure: the annotation of 3hop(a,a) in successively simplified forms: p^3 + 2pqr; p^3 + pqr; p + 2pqr; p + pqr; pqr; p + pqr; p. Result graph as on the previous slide: a->a: ppp+pqr+qrp, a->b: ppq+qrq, a->c: pqs, b->a: ppr+qrr, b->b: rpq, b->c: rqs.)
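Most of the simplified expressions on this slide can be reproduced by forgetting parts of the polynomial p^3 + 2pqr. The Python sketch below shows one reading of those steps: dropping coefficients, dropping exponents, absorbing non-minimal monomials, and flattening to the set of contributing tuples. Which named semiring each step corresponds to is not recoverable from the slide text, so the function names here are hypothetical and purely descriptive.

```python
from collections import Counter

# p^3 + 2pqr as monomial -> coefficient; a monomial is a sorted tuple of variables.
POLY = Counter({("p", "p", "p"): 1, ("p", "q", "r"): 2})


def drop_coefficients(poly):
    """Forget how often a monomial was derived: p^3 + 2pqr becomes p^3 + pqr."""
    return Counter({m: 1 for m in poly})


def drop_exponents(poly):
    """Forget how often a tuple is used in one derivation: p^3 + 2pqr becomes p + 2pqr."""
    out = Counter()
    for monomial, coeff in poly.items():
        out[tuple(sorted(set(monomial)))] += coeff
    return out


def absorb(poly):
    """Keep only minimal monomials (as variable sets): p + pqr collapses to p."""
    sets = {frozenset(m) for m in poly}
    minimal = [s for s in sets if not any(t < s for t in sets)]
    return Counter({tuple(sorted(s)): 1 for s in minimal})


def lineage(poly):
    """Flatten to the set of all contributing input tuples: {p, q, r}."""
    return sorted({v for m in poly for v in m})


if __name__ == "__main__":
    print(drop_coefficients(POLY))                  # p^3 + pqr
    print(drop_exponents(POLY))                     # p + 2pqr
    print(drop_exponents(drop_coefficients(POLY)))  # p + pqr
    print(absorb(POLY))                             # p
    print(lineage(POLY))                            # ['p', 'q', 'r']
```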
  19. Problem: Negation & Why-Not
      •  Provenance Semirings work well for:
         –  Positive Queries (e.g., RA+)
      •  Challenges: Handling of
         –  set difference (~ negation)
         –  Why-not provenance
      •  A fresh look at provenance!
      •  … using an old idea: Game semantics!
  20. Query evaluation game
      EDB: e(a,b), e(b,b)
      tc(X,Y) :- e(X,Y).   # (1)--e(X,Y)-->(2)
      tc(X,Y) :-           # (1)--exists:Z-->(3)
          e(X,Z),          # (3)->(4)-e(X,Z)->(5)
          tc(Z,Y).         # (3)--X:=Z-->(1)
      (Figure: input graph with edges a->b and b->b; game diagram with positions 1 to 5 and moves labeled e(X,Y), exists:Z, e(X,Z), and X := Z; instantiated move graph with positions such as 1:(a,b), 2:(a,b), 3:(a,b,b), 4:(a,b), and 5:(a,b).)
      Provenance’12 @Dagstuhl, with JanVdB and TJ Green
      Flum, Kubierschky, Ludäscher, Total and partial well-founded Datalog coincide, ICDT-The-Bag-1997, Delphi, Greece
      Eureka!
  21. Query evaluation game (same rules and EDB: e(a,b), e(b,b)), with the two figures labeled:
      •  Game diagram
      •  Instantiated move graph
      Flum, Kubierschky, Ludäscher, Total and partial well-founded Datalog coincide, ICDT-The-Bag-1997, Delphi, Greece
  22. A Game
      (Figure: game graph with positions a, b, c, d, e, f, g, h, k, l, m, n.)
  23. Solving the Game
      All successors won => position lost
      Some successor lost => position won
  24. Solving the Game
      All leaves (dead-ends) are immediately lost!
  25. Solving the Game
      X is won if there exists a move to a lost Y
  26. Solving the Game
      X is lost if all moves lead to a won Y
  27. Solving the Game
      Repeat until no change => drawn positions remain
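The solving procedure on slides 23 to 27 can be sketched as a simple fixpoint loop in Python. This is an illustration under assumptions, not the implementation behind the talk: the edges of the slide's game graph are not recoverable from the transcript, so `EXAMPLE_MOVES` is a made-up graph, and `solve_game` is a hypothetical name. The loop is also the naive version; it re-scans all positions per round rather than running in time linear in the number of moves.

```python
def solve_game(moves):
    """Label each position 'won', 'lost', or 'drawn' for the player to move.

    moves: dict mapping a position to the list of positions it can move to.
    """
    positions = set(moves) | {y for ys in moves.values() for y in ys}
    label = {}
    changed = True
    while changed:
        changed = False
        for x in positions:
            if x in label:
                continue
            successors = moves.get(x, [])
            if any(label.get(y) == "lost" for y in successors):
                # X is won if there exists a move to a lost Y.
                label[x] = "won"
                changed = True
            elif all(label.get(y) == "won" for y in successors):
                # X is lost if all moves lead to a won Y.
                # Dead ends have no moves, so they are lost immediately.
                label[x] = "lost"
                changed = True
    # Positions never labeled by the fixpoint are drawn.
    return {x: label.get(x, "drawn") for x in positions}


# Hypothetical example graph (NOT the one on the slides).
EXAMPLE_MOVES = {
    "a": ["b"],
    "b": ["c"],
    "c": [],          # dead end
    "d": ["e", "a"],
    "e": ["d"],
    "m": ["n"],       # m and n only move to each other
    "n": ["m"],
}

if __name__ == "__main__":
    for position, status in sorted(solve_game(EXAMPLE_MOVES).items()):
        print(position, status)
```

In the example, `m` and `n` remain drawn because neither ever gets a lost successor or only won successors.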
  28. Game Provenance
      (Figure: solved game graph over positions a to n; positions are annotated with 1, 2, 3, or oo.)
      •  Game can be solved in time linear in |Move|
      •  One rule to rule them all!
            win(X) :- move(X,Y), not win(Y)
      •  node color => edge color
         –  good vs bad moves
      •  good moves = natural, new notion of provenance!
      Aside: Games ~ Argumentation Frameworks
            win(X) :- move(X,Y), not win(Y)
            def(X) :- attacks(Y,X), not def(Y)
      Eureka!
  29. Game Provenance
      (Figure: solved game graph over positions a to n; positions are annotated with 1, 2, 3, or oo.)
      Move types (move from x to y, by status W = won, D = drawn, L = lost):
                    y is W      y is D     y is L
         x is W     bad         bad        winning
         x is D     bad         drawing    n/a
         x is L     delaying    n/a        n/a
      Extracting Provenance:
      ✓  Why/how win(x)?
         •  [x] –G.(R.G)*–> [y]
      ✓  Why-not win(x)?
         •  [x] –(R.G)*–> [y]
         •  [x] –(Y+)–> [y]
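The move-type table can be turned into a small Python lookup that colors edges once positions are labeled. The sketch below assumes the labels are already known and uses a made-up graph. `MOVES`, `LABELS`, `classify_moves`, and `good_subgraph` are hypothetical names. Following every non-bad move from a won position alternates winning and delaying moves automatically, which is one way to read the path pattern G.(R.G)* on the slide.

```python
# Move type by (status of source, status of target); other combinations
# are "n/a" because they cannot occur in a correctly solved game.
MOVE_TYPE = {
    ("won", "lost"): "winning",
    ("won", "won"): "bad",
    ("won", "drawn"): "bad",
    ("drawn", "won"): "bad",
    ("drawn", "drawn"): "drawing",
    ("lost", "won"): "delaying",
}

# Hypothetical solved game (NOT the graph on the slides).
MOVES = {
    "a": ["b"],
    "b": ["c"],
    "d": ["e", "a", "b"],
    "e": ["d"],
    "m": ["n"],
    "n": ["m"],
}
LABELS = {"a": "lost", "b": "won", "c": "lost", "d": "won",
          "e": "lost", "m": "drawn", "n": "drawn"}


def classify_moves(moves, labels):
    """Map each edge (x, y) to 'winning', 'delaying', 'drawing', or 'bad'."""
    return {(x, y): MOVE_TYPE[(labels[x], labels[y])]
            for x, targets in moves.items() for y in targets}


def good_subgraph(start, moves, labels):
    """Edges reachable from start without ever taking a bad move."""
    typed = classify_moves(moves, labels)
    seen, stack, edges = {start}, [start], set()
    while stack:
        x = stack.pop()
        for y in moves.get(x, []):
            if typed[(x, y)] == "bad":
                continue
            edges.add((x, y))
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return edges


if __name__ == "__main__":
    print(classify_moves(MOVES, LABELS))
    print(sorted(good_subgraph("d", MOVES, LABELS)))
```

Here the move d -> b goes from a won position to another won position, so it is bad and does not appear in the subgraph that explains why d is won.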
  30. Game Provenance
      (Figure: solved game graph over positions a to n; positions are annotated with 1, 2, 3, or oo.)
      Extracting Provenance:
      ✓  Why/how win(x)?
         •  [x] –G.(R.G)*–> [y]
      ✓  Why-not win(x)?
         •  [x] –(R.G)*–> [y]
         •  [x] –(Y+)–> [y]
      •  Next: play a query evaluation game
      •  => new why-(not) provenance via games!
  31. Provenance (or Query Evaluation) Games
      “SLD-resolution game”
      Next (Example): A(X) :– B(X,Y,Z) … not C(X,Y) …
      Eureka!
  32. Translation: Q(I) => G_Q(I)
      (Figure, from the paper:
       (a) Game template for Q_ABC: A(X) :- B(X, Y), ¬C(Y). Positions: A(X), ¬A(X), r2(X, Y), g^1_2(X, Y), g^2_2(Y), B(X, Y), ¬B(X, Y), rB(X, Y), C(X), ¬C(X), rC(X); move labels: ∃Y, X:=Y, B(X, Y), C(X).
       (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}.)
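The translation from a query and database to a game graph can be sketched in Python for the slide's example query A(X) :- B(X, Y), ¬C(Y) on I = {B(a, b), B(b, a), C(a)}. The edge structure below is my reading of the game template in the figure (in particular, that the negated goal moves directly to C(Y) while the positive goal moves to ¬B(X, Y) first); treat it as an assumption rather than the paper's exact construction. Position names such as `g1_2(a,b)` and the helpers `build_game` and `solve` are hypothetical.

```python
from itertools import product

DOMAIN = ["a", "b"]
B = {("a", "b"), ("b", "a")}   # facts B(a,b), B(b,a)
C = {"a"}                      # fact C(a)


def build_game():
    """Instantiate the move graph for A(X) :- B(X,Y), not C(Y) over DOMAIN."""
    moves = {}

    def add(src, dst=None):
        moves.setdefault(src, [])
        if dst is not None:
            moves[src].append(dst)
            moves.setdefault(dst, [])

    for x in DOMAIN:
        add(f"notA({x})", f"A({x})")
        add(f"notC({x})", f"C({x})")
        add(f"g2_2({x})", f"C({x})")           # negated goal: move straight to C
        if x in C:
            add(f"C({x})", f"rC({x})")          # conditional edge, present only for facts
    for x, y in product(DOMAIN, repeat=2):
        add(f"A({x})", f"r2({x},{y})")          # pick a rule instance (exists Y)
        add(f"r2({x},{y})", f"g1_2({x},{y})")   # opponent attacks goal 1 ...
        add(f"r2({x},{y})", f"g2_2({y})")       # ... or goal 2
        add(f"g1_2({x},{y})", f"notB({x},{y})")
        add(f"notB({x},{y})", f"B({x},{y})")
        if (x, y) in B:
            add(f"B({x},{y})", f"rB({x},{y})")
    return moves


def solve(moves):
    """Naive fixpoint: won if some move reaches a lost position, lost if all reach won."""
    label = {}
    changed = True
    while changed:
        changed = False
        for x, successors in moves.items():
            if x in label:
                continue
            if any(label.get(y) == "lost" for y in successors):
                label[x] = "won"
                changed = True
            elif all(label.get(y) == "won" for y in successors):
                label[x] = "lost"
                changed = True
    return {x: label.get(x, "drawn") for x in moves}


if __name__ == "__main__":
    solved = solve(build_game())
    print("A(a):", solved["A(a)"], "| A(b):", solved["A(b)"])
```

Under these assumptions the solved game labels A(a) as won and A(b) as lost, which agrees with the query answer: A(a) holds via B(a, b) and ¬C(b), while A(b) fails.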
  33. Solve G_Q(I) => Provenance!
      (Figure 3, from the paper: Provenance game for Q_ABC.
       (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}.
       (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not).
       The well-founded model of win(X) :- M(X, Y), ¬win(Y), applied to move graph M, solves the game.)
  34. Happy End (1 of 3)
      Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries! Yes!
      (Figure 1, from the paper:
       (a) input I ... (b) ... annotated:
             hop(a, a): p
             hop(a, b): q
             hop(b, a): r
             hop(b, c): s
       (c) 3hop with provenance:
             3hop(a, a): p^3 + 2pqr
             3hop(a, b): p^2 q + q^2 r
             3hop(a, c): pqs
             3hop(b, a): p^2 r + q r^2
             3hop(b, b): pqr
             3hop(b, c): qrs
       (d) The game provenance of 3hop(a, a) ... (e) ... is p^3 + 2pqr.
       Caption: Each edge hop(x, y) in the input graph I in (a) is annotated (p, q, r, ...) in (b). The answer to Q_3hop is shown in (c) with provenance polynomials [KG12]. The game provenance [KLZ13], e.g., of 3hop(a, a) …)
  35. Happy End (2 of 3)
      … but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!!
      Here: not 3hop(c,a) – can’t go back from GSLIS (c) to CS (a)
      (Figure 2, from the paper: Why-not provenance for 3hop(c, a) using provenance games.)
      Paper excerpt shown on the slide (clipped at both ends):
      “… g^i_1 in the body of r1, thus claiming that g^i_1 is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that g^i_1 is true by selecting a rule instance or fact as evidence for g^i_1. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a …”
      Yes!
  36. Happy End (2 of 3)
      5 leaf nodes ~ 5 missing (“hypothetical”) edges
      Insert those => 3hop(c,a) will be true!
      => What-If provenance! Yes!
      (Figure 2, from the paper: Why-not provenance for 3hop(c, a) using provenance games.)
      (Figure 7, from the paper: input graph I with five additional, hypothetical edges (dashed); existing edges p, q, r, s and hypothetical edges t, u, v, w, x.)
      Paper excerpts shown on the slide (clipped passages are marked […]):
      “[…] g^i_1 in the body of r1, thus claiming that g^i_1 is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that g^i_1 is true by selecting a rule instance or fact as evidence for g^i_1. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a regular path query over a solved game graph like the one in Fig. 1d: e.g., p^3 + 2pqr for 3hop(a, a) is represented by a solved game as shown in Fig. 1e: for positive queries, solved games represent semiring provenance by noting that won (green) and lost (red) positions correspond to “+” and “×” operations, respectively (leaves represent input annotations, here: p, q, r, s) [KLZ13].”
      “Why-Not Provenance and the Many Ways to Fail. […] Consider the input graph in Fig. 7. It contains the original instance I plus a number of hypothetical (or missing) […] with labels t, u, v, w, and x. These missing edges […] the failed leaf nodes in Fig. 2. The table in Fig. 6 […] why-not provenance, with different combinations of […] as preconditions for a derivation of 3hop(c, a).”
      Figure 6 (table, from the paper):
         r1(X, Y, Z1, Z2)    X -> Z1 -> Z2 -> Y    Why-Not Provenance    R1 Node
         [Fig. 2]            [Fig. 7]                                    [Fig. 9]
         r1(c, a, a, a)      c -t-> a -p-> a -p-> a    t ⇒ t·p·p          2
         r1(c, a, a, b)      c -t-> a -q-> b -r-> a    t ⇒ t·q·r          3
         r1(c, a, a, c)      c -t-> a -u-> c -t-> a    t, u ⇒ t·u·t       7
         r1(c, a, c, a)      c -v-> c -t-> a -p-> a    t, v ⇒ v·t·p       14
         r1(c, a, b, c)      c -w-> b -s-> c -t-> a    t, w ⇒ w·s·t       6
         r1(c, a, c, c)      c -v-> c -v-> c -t-> a    t, v ⇒ v·v·t       12
         r1(c, a, c, b)      c -v-> c -w-> b -r-> a    v, w ⇒ v·w·r       15
         r1(c, a, b, a)      c -w-> b -r-> a -p-> a    w ⇒ w·r·p          4
         r1(c, a, b, b)      c -w-> b -x-> b -r-> a    w, x ⇒ w·x·r       1
      Caption: “The nine r1-instances in the first column correspond to those in Fig. 2 from left to right. The 3hop-path is shown in the second column, with missing/hypothetical edges (dashed) t, u, v, w, x and existing edges p, q, r, s; see Fig. 7. The third column shows the why-not provenance of […]”
      “[…] Each of the rule nodes referenced in the table, which explain the negative provenance of a rule firing grounded in the active domain, also captures the rule non-satisfaction of an infinite set of possible variable bindings to elements possibly outside the active domain. Any constraint that has a variable that is only disequality-constrained represents an infinite set of firings. Consider the rule node R1: X≠a, X≠b, Z1=a, Z2=a, Y=a. This corresponds to the (hypothetical) 3hop path c -t-> a -p-> a -p-> a and the situation in which the edge t exists (see first row of Fig. 6). However, it also explains why the rule firing d -> a -> a -> a is not successful. The explanation is the failure of the first goal of the rule. In the case of X=c, it represents that there are no outgoing edges from c. In the case of X=d or any other invented value this is trivially true. This shows that constraint provenance games do not suffer from the same problems as their fully-grounded counterparts. Provenance can be queried for any imaginable tuple, including one not in the active domain, and the provenance presented is still correct in the presence of a growing active domain.”
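The why-not table for 3hop(c, a) can be re-derived by enumerating all intermediate nodes and recording which hypothetical edges each candidate path would need. The Python sketch below assumes the edge labels implied by the paths in the table (t = c->a, u = a->c, v = c->c, w = c->b, x = b->b); the names `EXISTING`, `MISSING`, and `why_not_3hop` are hypothetical and only serve this illustration.

```python
from itertools import product

# Existing edges (annotated p, q, r, s) and hypothetical edges (t, u, v, w, x),
# with the edge-to-label mapping read off the paths in the table.
EXISTING = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}
MISSING = {("c", "a"): "t", ("a", "c"): "u", ("c", "c"): "v",
           ("c", "b"): "w", ("b", "b"): "x"}
LABEL = {**EXISTING, **MISSING}


def why_not_3hop(x, y, nodes=("a", "b", "c")):
    """For each choice of (Z1, Z2), list the missing edges a 3hop(x, y) derivation needs.

    Returns {(z1, z2): (sorted missing edge labels, monomial along the path)}.
    """
    rows = {}
    for z1, z2 in product(nodes, repeat=2):
        path = [(x, z1), (z1, z2), (z2, y)]
        needed = sorted({MISSING[edge] for edge in path if edge in MISSING})
        monomial = "·".join(LABEL[edge] for edge in path)
        rows[(z1, z2)] = (needed, monomial)
    return rows


if __name__ == "__main__":
    for (z1, z2), (needed, monomial) in why_not_3hop("c", "a").items():
        print(f"r1(c, a, {z1}, {z2}): {', '.join(needed)} => {monomial}")
```

Every one of the nine rule instances needs at least one hypothetical edge, which is why 3hop(c, a) fails on the original graph, and together they use exactly the five missing edges mentioned on the slide.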
  37. Are there more ways to fail?
      Two branches that explain Why-not A(b)
      Adding a new constant c to the domain => new why-not answer!
      Oh no … :-(
      (Figure 4, from the paper: Altered subgraph of Fig. 3c after adding c to the active domain; A(b) now also has an ∃c branch via r2(b, c).)
      (Figure 3, from the paper: Provenance game for Q_ABC. (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}. (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not). The well-founded model of win(X) :- M(X, Y), ¬win(Y), applied to move graph M, solves the game.)
  38. 38. Happy End (3 of 3) … sort of …
[Figure: game template for QABC: A(X) :- B(X, Y), ¬C(Y).]
[Figure 3: Provenance game for QABC. (b) Instantiated QABC game on I = {B(a, b), B(b, a), C(a)}. (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not). The well-founded model of win(X) :- M(X, Y), ¬win(Y), applied to move graph M, solves the game.]
[Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.]
[Figure 5: Constraint provenance game for QABC. Unlike in Figure 3, nodes may represent finite or infinite sets here.]
Paper excerpt shown on the slide:
[…] the new binding for X; a condition “B(X, Y)” means that a move is possible only if B(X, Y) is true in I for the current X, Y values. Given database I, a template can be instantiated yielding a game graph GQ(I) as in Fig. 3b. Note how template variables (e.g., Y) have been replaced by domain values (a or b), and that conditional edges (e.g., labeled “C(X)”) became unconditional edges (e.g., C(a) → rC(a)) or no edge at all (e.g., from C(b)), depending on whether or not the condition holds in I. To extract why(-not) provenance from a game graph GQ(I) as in Fig. 3b, we need to solve the game first, i.e., determine which positions are won (light green) or lost (dark red); see Fig. 3c. There is a surprisingly simple and elegant solution: the (unstratified) Datalog¬ rule Qwm := […] GQ(I) thus consists only of edges that are matched by the regular path queries (g.r)+ and r.(g.r)*, i.e., alternating sequences of green (winning) and red (delaying) moves [KLZ13].
3. Constraint Provenance Games. Consider the solved game graph of Fig. 3c. If the value c were added to the active domain, the provenance would be incomplete: e.g., to explain why-not A(b) there are two ∃a, ∃b branches emanating from A(b). However, with c in the active domain there is a third ∃c branch via r2(b, c): see Fig. 4. We show that a modified game construction (Fig. 5) based on constraints can be used to automatically include such extensions of the active domain, thereby […]
•  Why-not provenance complete only for adom(I) = { a, b } !
•  Constraint why-not provenance also captures new constants, i.e., for an unlimited domain D = { a, b, c, … }
=> Constraint Provenance answer is domain independent! (sort of)
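Solving the game, as required before provenance can be extracted, can be sketched as a simple fixpoint computation. The code below is an assumption-laden illustration rather than the prototype mentioned in the talk: `solve_game` is a made-up name, and the loop is meant to mirror the rule win(X) :- M(X, Y), ¬win(Y) quoted in the Figure 3 caption, labelling positions that are neither provably won nor lost as drawn.

```python
def solve_game(moves):
    """Label every position of a win-move game as 'won', 'lost', or 'drawn'.

    `moves` is a set of (x, y) pairs: a player at x may move to y.
    A position is won if some move leads to a lost position, and lost
    if every move leads to a won position (or there is no move at all).
    """
    positions = {p for edge in moves for p in edge}
    succ = {p: {y for (x, y) in moves if x == p} for p in positions}
    status = {}
    changed = True
    while changed:
        changed = False
        for p in positions:
            if p in status:
                continue
            if any(status.get(y) == "lost" for y in succ[p]):
                status[p] = "won"
                changed = True
            elif all(status.get(y) == "won" for y in succ[p]):
                status[p] = "lost"  # also covers positions without moves
                changed = True
    # Whatever is still unlabelled at the fixpoint is drawn.
    return {p: status.get(p, "drawn") for p in positions}


print(solve_game({("a", "b"), ("b", "c")}))
```

In the chain a → b → c, position c has no move and is lost, so b is won and a is lost; a two-position cycle would stay drawn.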
  39. 39. Why-Not: The Full Story Emerges… (sort of…)
[Figure 9: The why-not provenance of 3hop(c, a). The provenance is represented in the failure of the claim that 3hop(c, a) is in the answer. This is argued over the Boolean expression defining 3hop(x, y). A move from the source node to a child represents the choice of a Boolean expression that is sufficient to capture a rule deriving 3hop(c, a). The opponent counters with a subset of this conjunction that is claimed not to be true. The game continues until it reaches the EDB. There exists no equivalent grounded provenance game.]
[Figure 2: Why-not provenance for 3hop(c, a) using provenance games.]
[Figure 7: Input graph I with five additional, hypothetical edges (dashed).]
Paper excerpts shown on the slide:
[…] g_1^i in the body of r1, thus claiming that g_1^i is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that g_1^i is true by selecting a rule instance or fact as evidence for g_1^i. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a regular path query over a solved game graph like the one in Fig. 1d: e.g., p^3 + 2pqr for 3hop(a, a) is represented by a solved game as shown in Fig. 1e: for positive queries, solved games represent semiring provenance by noting that won (green) and lost (red) positions correspond to “+” and “×” operations, respectively (leaves represent input annotations, here: p, q, r, s) [KLZ13].
Why-Not Provenance and the Many Ways to Fail. Since games are inherently symmetric (one player’s win is the opponent’s loss and vice versa), the approach yields an elegant provenance model that unifies why and why-not provenance. Consider the (dark, red) node 3hop(c, a) in Fig. 2. The color coding indicates that the position 3hop(c, a) is lost (the atom is false), i.e., all outgoing moves to a node r1(x, y, z1, z2) lead to a position that is won for the opponent […]
Constraint Provenance Games. We propose to solve the problem of domain dependence by modifying provenance games so that they can handle certain infinite relations that can be finitely represented. For example, in addition to the finitely many reasons why 3hop(c, a) fails over the active domain adom(I), there are infinitely many others, if we consider new constants d, e, . . . outside of adom(I). For example, let relation R = {a, b} have two tuples R(a) and R(b). If we want to know why-not R(c), we just point to c ∉ R. But we could also return a more general answer for why-not R(x) and say that ¬R(x) is true for all x with x ≠ a ∧ x ≠ b (not just for x = c). This approach is inspired by Chan’s Constructive Negation [Cha88], a form of constraint logic programming [Stu95]. The key idea is to represent (potentially infinite) relations through constraints, i.e., Boolean combinations of equalities x = c and disequalities x ≠ c.
Overview and Contributions. Section 2 briefly explains how first-order queries are translated into games and how provenance is extracted from solved games. In Section 3 we describe the construction of constraint provenance games; additional details and examples are contained in the appendix. Our main contributions are (i) game provenance provides a uniform treatment of why and why-not provenance for first-order logic (= relational algebra with set difference); (ii) for positive queries, the approach captures the most informative semiring provenance [GKT07, KG12]; (iii) we develop a constraint provenance framework which yields domain independent provenance expressions, extending prior results [KLZ13]; and (iv) we implemented a prototype of constraint provenance games.
A. Why-Not 3hop(c, a) Dissected. Consider the input graph in Fig. 1a and its why-not provenance for 3hop(c, a) in Fig. 2. The graph encodes the reasons why 3hop(c, a) is not in the answer. Moving from the lost 3hop(c, a) in Fig. 2, there are nine possible rule instantiations r1(c, a, z1, z2), each of which represents a reason why there is no 3hop(c, a) via intermediate nodes z1, z2 ∈ {a, b, c}. To better understand these why-not explanations, consider the input graph in Fig. 7. It contains the original database instance I plus a number of hypothetical (or missing) edges (dotted), with labels t, u, v, w, and x. These missing edges correspond to the failed leaf nodes in Fig. 2. The table in [Fig. 6] contains the why-not provenance, with different combinations of missing edges as preconditions for a derivation of 3hop(c, a).
B. Constraint Game Construction. Consider the query QABC. To build the game, each ground tuple in the program such as B(a, b) is replaced by a constraint B: x1=a, x2=b (a conjunction). First, the subgraph for EDB predicates is created. The remainder of the game is constructed iteratively similar to query execution. For rules whose subgoals are all on EDB predicates, goal […] nodes/edges are generated. For IDB predicates that were the head of EDB-only rules, tuple nodes are generated. G[…] rule nodes/edges are added for rules when the subgraph for all subgoals has been generated, and for predicates when the su[…]
•  5 missing edges
•  9 minimal combinations
•  + … ?
•  Constraints imply 15 disjoint relations over key variables X, Z1, Z2, Y
Oh Boy!
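The R = {a, b} example from the excerpt can be made concrete with a small sketch. It assumes a deliberately simple representation, a conjunction of disequalities stored as a list of triples, and the helper names `why_not_constraint` and `satisfies` are invented for illustration; the actual constraint game construction is more general.

```python
R = {"a", "b"}  # relation from the excerpt: tuples R(a) and R(b)


def why_not_constraint(relation):
    """Return the disequalities that together describe all x with not R(x)."""
    return [("x", "!=", c) for c in sorted(relation)]


def satisfies(value, constraint):
    """Check a candidate value against a conjunction of disequalities."""
    return all(value != c for (_, _, c) in constraint)


constraint = why_not_constraint(R)
print(" and ".join(f"{var} {op} {c}" for var, op, c in constraint))
for candidate in ["a", "c", "d"]:
    print(candidate, satisfies(candidate, constraint))
```

The single constraint answers why-not R(c), why-not R(d), and so on at once, without enumerating the constants outside the active domain.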
  40. 40. Provenance Games: Summary
•  (1) Game Provenance
   –  The win-move game has a natural why and why-not provenance “built-in”
      •  “good” and “bad” moves
      •  ⇒ discard bad moves ⇒ game provenance
•  (2) Provenance Games
   –  Query evaluation also is a game!
   –  Game provenance can be applied to the query evaluation game => uniform why + why-not provenance
•  (3) Constraint Provenance
   –  Domain independent (some infinite domains OK)
   –  Prototypically implemented
•  (4) Future Work
   –  Make theory practical!
      •  e.g. implement in Boris Glavic’s Perm or GPROM system
   –  Theoretical properties
   –  Relation to Argumentation Frameworks
   –  Clarify relationship to monus semirings (Floris Geerts et al)
   –  Higher-order reasons!
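Point (1) above, discarding bad moves, can be sketched as a filter over an already solved game. This is a hedged illustration: it assumes the labels 'won' and 'lost' are given for every position, treats won-to-lost (winning) and lost-to-won (delaying) moves as good, and uses the made-up name `game_provenance`.

```python
def game_provenance(moves, status):
    """Keep only good moves: won->lost (winning) and lost->won (delaying)."""
    good = {("won", "lost"), ("lost", "won")}
    return {(x, y) for (x, y) in moves if (status[x], status[y]) in good}


# Tiny hypothetical game: x can win by moving to y; moving to z is a bad move.
moves = {("u", "x"), ("x", "y"), ("x", "z"), ("z", "w")}
status = {"u": "lost", "x": "won", "y": "lost", "z": "won", "w": "lost"}
provenance = game_provenance(moves, status)
print(sorted(provenance))
```

The move x → z leads from a won position to another won position, so it is dropped; the remaining edges form the game provenance of this toy graph.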
  41. 41. Why-Not: so many answers, so little time
•  The crux of current why-not approaches:
   –  Enumerate all ways that could/might have worked, but failed…
•  Idea ⇒ abstract those many, many explanations!
TaPP’15
  42. 42. Why-Not Provenance References
•  Köhler, Sven, Bertram Ludäscher, and Daniel Zinn. "First-order provenance games." In Search of Elegance in the Theory and Practice of Computation. Peter Buneman Festschrift, LNCS 8000. Springer Berlin Heidelberg, 2013.
•  Riddle, Sean, Sven Köhler, and Bertram Ludäscher. "Towards constraint provenance games." 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2014).
•  Glavic, Boris, Sven Köhler, Sean Riddle, and Bertram Ludäscher. "Towards constraint-based explanations for answers and non-answers." 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2015).
  43. 43. From “Climate Gate” to Reproducible Science
  44. 44. YesWorkflow: Yes! Scripts can be Workflows, too!
Bertram Ludäscher
Graduate School of Library and Information Science (GSLIS)
National Center for Supercomputing Applications (NCSA)
Department of Computer Science (CS@UIUC)
24–29 January 2016, Dagstuhl Seminar 16041: Reproducibility of Data-Oriented Experiments in e-Science
  45. 45. YesWorkflow ~ noWorkflow
•  “not only Workflow”
   –  Automatically capture runtime/retrospective provenance of (Python) scripts
•  YesWorkflow (“YW 1.0”)
   –  Yes, (some) scripts are workflows, too!
   –  Expose prospective provenance (= workflow) hidden in the script via simple user annotation
•  Combining NW + YW => more provenance mileage (TaPP’15) - Shared vision:
   –  focus on provenance from/for scripts
•  “YW 2.0”
   –  Focus on queries and views that combine many sources of provenance (recorded, logged: prov engineer; reconstructed: prov sleuth; ...)
=> follow-up study for NW + YW!
  46. 46. Reproducibility: (yesterday’s discussion cont’d)
- What questions should we ask?
- What queries should we enable?
•  Cui bono? (others, publishers, … ?)
•  Provenance-for-Self vs Provenance-for-others
•  Reproducibility-for-Self vs Reproducibility-for-others
•  For key terms, e.g., Carole’s
   –  … rerun, repeat, replicate, reproduce, reuse, …
•  .. ask “what information/insight do I gain from reproducing, repeating, replicating… ?”
•  What is fixed and what does the study vary?
=> Research Objective, Method/Algorithm, Implementation, Platform/Environment, Actors/People, input Data (params, raw data)
  47. 47. YesWorkflow = Scripts + Comments
•  Scripts can be hard to digest, communicate
•  Idea:
   –  Add structured comments (cf. JavaDoc)
   => reveal workflow structure and dataflow
   => obtain some scientific workflow benefits
•  … ASAP …
  48. 48. User Comments: YW @Annotations
@begin GO_Analysis
@in hgCutoff
@in …
@out BP_Summl_file
@out …
@end GO_Analysis
...
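To show how such comments can reveal workflow structure, here is a small sketch. It is not the YesWorkflow tool: `extract_blocks` is a hypothetical helper, the annotated snippet is embedded as a string, the call to `run_go_analysis` inside it is invented, and only the four tags shown on the slide are recognized.

```python
import re

# Hypothetical annotated script fragment; only the tag lines come from the slide.
SCRIPT = '''
# @begin GO_Analysis
# @in hgCutoff
# @out BP_Summl_file
BP_Summl_file = run_go_analysis(hgCutoff)
# @end GO_Analysis
'''

TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_blocks(source):
    """Collect the @in/@out ports declared between each @begin/@end pair."""
    blocks, stack = {}, []
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            blocks[name] = {"in": [], "out": []}
            stack.append(name)
        elif tag == "end":
            stack.pop()
        elif stack:
            blocks[stack[-1]][tag].append(name)
    return blocks


print(extract_blocks(SCRIPT))
```

Because the annotations live in comments, the script itself never has to run for the block and its inputs and outputs to be recovered.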
  49. 49. Get 3 views for the price of 1!
Process view, Data view, Combined view
  50. 50. Paleoclimate Reconstruction …
•  … explained using YesWorkflow
[Figure: YesWorkflow graph with nodes GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
Kyle B., (computational) archeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
  51. 51. Gene Expression Microarray Data Analysis
•  [Normalize]
   –  Normalization of data across microarray datasets
•  [SelectDEGs]
   –  Selection of differentially expressed genes between conditions
•  [GO Analysis]
   –  Determination of gene ontology statistics for the resulting datasets
•  [MakeHeatmap]
   –  Creation of a heatmap of the differentially expressed genes
[Figure 4: Process workflow view of an Affymetrix analysis script (in R).]
Paper excerpt shown on the slide: 4 YesWorkflow Examples. In the following we show YesWorkflow views extracted from real-world scientific use cases. The scripts were annotated with YW tags by scientists and script authors, using a very modest training and mark-up effort. Due to lack of space, the actual MATLAB and R scripts with their YW markup are not included here. However, they are all available from the yw-idcc-15 repository on the YW GitHub site [Yes15]. 4.1 Analysis of Gene Expression Microarray Data
Tyler Kolisnik, Mark Bieda
  52. 52. Multi-Scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP)
[Figure: YesWorkflow graph with nodes fetch_drought_variable, drought_variable_1, fetch_effect_variable, effect_variable_1, convert_effect_variable_units, effect_variable_2, create_land_water_mask, land_water_mask, init_data_variables, predrought_effect_variable_1, drought_value_variable_1, recovery_time_variable_1, drought_number_variable_1, define_droughts, sigma_dv_event, month_dv_length, detrend_deseasonalize_effect_variable, effect_variable_3, calculate_data_variables, recovery_time_variable_2, drought_value_variable_2, predrought_effect_variable_2, drought_number_variable_2, export_recovery_time_figure, output_recovery_time_figure, export_drought_value_variable_figure, output_drought_value_variable_figure, export_predrought_effect_variable_figure, output_predrought_effect_variable_figure, export_drought_number_variable_figure, output_drought_number_figure, input_drough_variable, input_effect_variable]
Christopher Schwalm, Yaxing Wei
  53. 53. Q3: Where is the raw image of the corrected image DRT322_11000ev_030.img?
[Figure: YesWorkflow graph with nodes initialize_run, load_screening_results, sample_name, sample_quality, calculate_strategy, rejected_sample, accepted_sample, num_images, energies, log_rejected_sample, collect_data_set, sample_id, energy, frame_number, transform_images, total_intensity, pixel_count, corrected_image_path, log_average_image_intensity, cassette_id, sample_score_cutoff, and the following data nodes with file URI templates]
run_log file:run/run_log.txt
rejection_log file:/run/rejected_samples.txt
raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
collection_log file:run/collected_images.csv
sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv
calibration_image file:calibration.img
run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt
  54. 54. Q5: What cassette-id had the sample leading to DRT240_10000ev_001.img?
[Same workflow figure, file URI templates, and run/ directory listing as on slide 53.]
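Queries such as Q3 and Q5 can be answered by matching file names against the URI templates attached to raw_image and corrected_image. The sketch below is one possible way to do this and is not taken from YesWorkflow itself: the helper names are invented, the `file:` prefix is dropped, template variables are assumed to contain neither `/` nor `_`, and the list of raw files is a hand-picked sample from the directory listing.

```python
import re

RAW = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"

# A few raw files taken from the directory listing on the slide.
RAW_FILES = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]


def template_to_regex(template):
    """Turn a {name}-style path template into a regex with named groups."""
    pattern, seen = "", set()
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            name = part[1:-1]
            # A repeated variable must match the same text again (backreference).
            pattern += f"(?P={name})" if name in seen else f"(?P<{name}>[^/_]+)"
            seen.add(name)
        else:
            pattern += re.escape(part)
    return re.compile(pattern, re.IGNORECASE)


def bindings(template, path):
    """Return the variable bindings of `path` under `template`, or None."""
    match = template_to_regex(template).fullmatch(path)
    return match.groupdict() if match else None


def raw_images_for(corrected_path, raw_files=RAW_FILES):
    """Find raw files whose template variables agree with the corrected image."""
    wanted = bindings(CORRECTED, corrected_path)
    if wanted is None:
        return []
    hits = []
    for raw in raw_files:
        found = bindings(RAW, raw)
        if found and all(found[key] == value for key, value in wanted.items()):
            hits.append((raw, found["cassette_id"]))
    return hits


print(raw_images_for("data/DRT322/DRT322_11000eV_030.img"))  # Q3
print(raw_images_for("data/DRT240/DRT240_10000eV_001.img"))  # Q5
```

The join on the shared variables sample_id, energy, and frame_number is what links a corrected image back to its raw image, and the extra cassette_id binding on the raw side answers Q5.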
  55. 55. YW Demo!
  56. 56. YesWorkflow References
•  http://yesworkflow.org
•  T. McPhillips, S. Bowers, K. Belhajjame, B. Ludäscher (2015). Retrospective Provenance Without a Runtime Provenance Recorder. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
•  T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R.K. Bocinsky, Y. Cao, J. Cheney, F. Chirigati, S. Dey, J. Freire, C. Jones, J. Hanken, K.W. Kintigh, T.A. Kohler, D. Koop, J.A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei, M. Bieda, B. Ludäscher (2015). YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. International Journal of Digital Curation 10, 298-313.
  57. 57. Janiform Demo (by Jens Dittrich)
  58. 58. Janiform Demo (Jens Dittrich)
  59. 59. https://youtu.be/f4iKwdERXhI
  60. 60. Conclusions
•  Provenance is an active and broad area of research
   –  … in databases
   –  … in scientific workflows
   –  Both in specialized (TaPP, IPAW) and mainstream venues (VLDB, SIGMOD, EDBT, ICDE, PODS, ICDT, ..)
•  Great topics in theory, practice/engineering or both!
