YesWorkflow: More Provenance
Mileage from Scientific Workflows
and Scripts!
Bertram Ludäscher
Director, Center for Informatics Research in Science and Scholarship (CIRSS)
Professor, Graduate School of Library and Information Science (GSLIS)
Faculty affiliate, NCSA & Department of Computer Science
Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
– Why-, How-, …, Why-Not Provenance
•  … vs Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)
More Provenance Mileage from Workflows and Scripts
Provenance Palooza
•  Provenance
–  … or provenience?
•  Chain of custody
•  Lineage
•  Pedigree
•  Genealogy
•  Phylogeny
•  History
•  Origin
Provenance Research everywhere …
… and here:
Provenance as we all know it
•  Oxford English Dictionary:
–  coming from some particular source or quarter; origin,
derivation
–  the history or pedigree of a work of art, manuscript, rare
book, etc.
–  concretely, a record of the passage of an item through
its various owners (“chain of custody”)
•  Merriam-Webster:
–  prov·e·nance noun ˈpräv-nən(t)s, ˈprä-və-ˌnän(t)s
–  the origin or source of something
•  Origin:
–  French, from provenir to come forth, originate, from Latin
provenire, from pro- forth + venire to come
Provenance
Is this a real Leonardo? Lack of reliable provenance casts doubt on this …
Pedigree
Natural History:
Understanding what happened…
Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on drawing by Truth-seeker2004)
Provenience vs Provenance
Archivists
Society of American Archivists
http://www2.archivists.org/glossary/terms/p/provenance
•  Principle of provenance (respect des fonds)
•  Keep records of different origins separate to preserve context
So what is “provenance” (sensu W3C)?
•  Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
•  Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
•  Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world
Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
– Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)
Provenance => Transparency
•  = “Externally-facing” provenance
– “Them-Provenance”
•  Later: “Internally-facing” provenance
– “Me-Provenance”
Climate Change: Whodunnit?
Tracing the sources (data, code)
From “Climate Gate” to Reproducible Science
Data & Provenance Management: Single Model
Data & Provenance Management: Model Chains
Some things people do with “provenance”
•  Result validation
•  Result debugging (science vs wf logic)
•  Reproducibility and Repeatability
•  Explanation (derivations, traces, proof trees)
•  Runtime monitoring
–  Profiling, benchmarking
•  Performance Optimization (“smart re-run”)
•  Fault-tolerance, crash-recovery
•  Database view maintenance (e.g. data warehousing)
•  …
Provenance for Virtual Joint Experiments
•  How do we ensure that Charlie gets a complete account of the history of W_C's outputs?
•  How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob's data v?
→ traces T_A and T_B will be critical
→ need to compose them to obtain T_C
We can view the composition W_C as a new, virtual workflow
[Figure: (1) Alice develops W_A and (2) runs it (run R_A, trace T_A); (3) Bob develops W_B; (4) data sharing; (5) Bob runs W_B (run R_B, trace T_B); Charlie (6) inspects provenance and (7) understands and generates W_C := W_A W_S W_B. Data items shown: x, z, u, v.]
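One way to read "compose T_A and T_B to obtain T_C" is to treat each trace as a set of derivation edges and take their union, so that lineage queries can cross from Bob's run into Alice's. The sketch below assumes exactly that; the edge contents (x, z, u, v) only borrow labels from the figure, and the `sharing` link is a hypothetical stand-in for the data-sharing step.

```python
from collections import defaultdict

def compose(*traces):
    """Union several traces; each trace is a set of (source, derived) edges."""
    composed = set()
    for trace in traces:
        composed |= set(trace)
    return composed

def lineage(trace, item):
    """Return every item that `item` transitively depends on."""
    parents = defaultdict(set)
    for src, dst in trace:
        parents[dst].add(src)
    seen, stack = set(), [item]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical traces: Alice's run derives z from x, Bob's run derives v from u,
# and the sharing step records that Bob's input u came from Alice's output z.
trace_a = {("x", "z")}
sharing = {("z", "u")}
trace_b = {("u", "v")}
trace_c = compose(trace_a, sharing, trace_b)

print(sorted(lineage(trace_b, "v")))  # Bob's trace alone stops at u
print(sorted(lineage(trace_c, "v")))  # the composed trace reaches Alice's x
```

Without the sharing link the two traces stay disconnected, which is why Charlie cannot give Alice credit from T_B alone.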
  
Open Provenance Model => W3C Prov
W3C Prov: One size fits all?
Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
– Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)
Types of Data Provenance
•  Black-box
–  know (next to) nothing at compile-time
–  at runtime: keep some data lineage
–  most prov sensu WF work use this
•  White-box
–  statically (compile-time) analyzable
–  q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
–  Most prov sensu DB work use this
•  Grey-box
–  can “look inside” (some black boxes)
–  … e.g. b/c they have subworkflows
–  … or FP signatures: A :: t1, t2 → t3, t4
–  … or semantic annotations (sem.types)
[Figure: black box f; grey box A with ports t1, t2, t3, t4; white box q with variables X1, X2, Y1, Y2]
Provenance in Databases
Source: Val Tannen
Abstracting the structure of querying
In database provenance, tuples are either combined conjunctively (*) or disjunctively (+) → That’s the core model!
Source: Val Tannen
Provenance Polynomials
One Semiring to Rule them all!
(DB theory strikes!)
Unifying most prior work in a simple model!
Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007
Example: Go from X to Y in 3 hops!
(e.g., a = CS, b = NCSA, c = GSLIS)
•  Database: hop(X,Y) :=
hop(a,a, p).
hop(a,b, q).
hop(b,a, r).
hop(b,c, s).
•  Query: 3hop(X,Y) :- hop(X, Z1), hop(Z1, Z2), hop(Z2,Y).
[Figure: input graph with edges a→a (p), a→b (q), b→a (r), b→c (s); result graph with edges a→a (ppp+pqr+qrp), a→b (ppq+qrq), a→c (pqs), b→a (ppr+qrr), b→b (rpq), b→c (rqs)]
Note: Cannot go from c to a in 3 hops!
3hop(a,a, p^3 + 2pqr).
3hop(a,b, p^2 q + q^2 r).
…
3hop(a,c, pqs).
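The annotated 3hop answers above can be recomputed mechanically: each join multiplies the tokens of the facts it uses, and alternative derivations of the same answer are added. The sketch below is a minimal illustration of that idea, not code from the talk; the encoding of monomials as sorted tuples and the helper names `three_hop` and `show` are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Annotated hop facts from the slide: (from, to) -> provenance token.
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}

def three_hop(hop):
    """Return {(x, y): Counter({monomial: coefficient})} for 3hop(x, y).

    A monomial is a sorted tuple of tokens: ('p', 'q', 'r') stands for pqr
    and ('p', 'p', 'p') for p^3.
    """
    result = {}
    for (e1, t1), (e2, t2), (e3, t3) in product(hop.items(), repeat=3):
        # hop(X, Z1), hop(Z1, Z2), hop(Z2, Y): consecutive edges must connect.
        if e1[1] == e2[0] and e2[1] == e3[0]:
            monomial = tuple(sorted((t1, t2, t3)))
            result.setdefault((e1[0], e3[1]), Counter())[monomial] += 1
    return result

def show(poly):
    """Pretty-print a polynomial, e.g. 'p^3 + 2pqr'."""
    terms = []
    for mono, coeff in sorted(poly.items()):
        powers = sorted(Counter(mono).items())
        body = "".join(v if k == 1 else f"{v}^{k}" for v, k in powers)
        terms.append(body if coeff == 1 else f"{coeff}{body}")
    return " + ".join(terms)

polys = three_hop(HOP)
for (x, y), poly in sorted(polys.items()):
    print(f"3hop({x},{y}, {show(poly)}).")
```

The pair (c, a) never appears in the result, matching the note that c cannot reach a in 3 hops.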
  
Provenance Polynomials
“My precious!”
[Figure: the polynomial for 3hop(a,a) and its successively coarser forms]
p^3 + 2pqr
p^3 + pqr        p + 2pqr
p + pqr
pqr
p + pqr
p
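The coarser forms listed on this slide can be derived from p^3 + 2pqr by forgetting information step by step. The sketch below shows one reading of those steps (dropping coefficients, dropping exponents, absorbing non-minimal monomials, collecting all tokens) plus a plain derivation count; the function names and the tuple encoding are assumptions for illustration, and the slide itself does not name the individual operations.

```python
from collections import Counter

# p^3 + 2pqr: each monomial is a sorted tuple of tokens mapped to its coefficient.
POLY = Counter({("p", "p", "p"): 1, ("p", "q", "r"): 2})

def drop_coefficients(poly):
    """p^3 + 2pqr -> p^3 + pqr"""
    return Counter({mono: 1 for mono in poly})

def drop_exponents(poly):
    """p^3 + 2pqr -> p + 2pqr"""
    out = Counter()
    for mono, coeff in poly.items():
        out[tuple(sorted(set(mono)))] += coeff
    return out

def absorb(poly):
    """Keep only minimal monomials as sets: p + pqr collapses to p."""
    sets = {frozenset(mono) for mono in poly}
    return {s for s in sets if not any(t < s for t in sets)}

def all_tokens(poly):
    """The set of every input fact that contributes at all: {p, q, r}."""
    return set().union(*map(set, poly))

def count(poly, multiplicity):
    """Evaluate the polynomial over the natural numbers."""
    total = 0
    for mono, coeff in poly.items():
        term = coeff
        for token in mono:
            term *= multiplicity[token]
        total += term
    return total

print(drop_coefficients(POLY))
print(drop_exponents(POLY))
print(drop_coefficients(drop_exponents(POLY)))
print(absorb(POLY), all_tokens(POLY))
print(count(POLY, {"p": 1, "q": 1, "r": 1}))  # 3 derivations when every fact occurs once
```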
  
Provenance in Databases
Negation & Why-Not Provenance
•  Provenance Semirings work well for:
– Positive Queries (e.g., RA+)
•  Challenges: Handling of
– set difference (~ negation)
– Why-not provenance
– Missing Answer provenance
•  A fresh look at provenance!
•  … using an old idea: Game semantics!
Provenance (or Query Evaluation) Games
“SLD-resolution game”
A(X) :- B(X,Y,Z) … not C(X,Y) …
Eureka!
[KLZ13] Köhler, S., Ludäscher, B., & Zinn, D. (2013). First-order provenance games. In Search of Elegance in the Theory and Practice of Computation. Springer.
Translation: Q(I) => G_Q(I)
[Figure: (a) Game template for Q_ABC: A(X) :- B(X, Y), ¬C(Y). (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}.]
Source: [KLZ13]
Solve G_Q(I) => Provenance!
[Figure: (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}. (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not).]
Figure 3: Provenance game for Q. The well-founded model of …
Source: [KLZ13]
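The won/lost labeling in the solved game can be reproduced with a small fixpoint computation: a position with no moves is lost, a position is won if some move reaches a lost position, and lost if every move reaches a won one. The sketch below applies this to Q_ABC on I = {B(a, b), B(b, a), C(a)}. The position names and move structure are a simplified reading of the figure (the ¬A positions are omitted, ¬B is written `notB`), so treat it as an illustration rather than the construction from [KLZ13].

```python
def solve(moves):
    """Label each position 'won' or 'lost' for the player who must move.

    moves maps a position to the list of positions reachable in one move.
    Positions that never get a label would be drawn; that cannot happen in
    the acyclic game built below.
    """
    positions = set(moves) | {t for targets in moves.values() for t in targets}
    status = {}
    changed = True
    while changed:
        changed = False
        for pos in positions:
            if pos in status:
                continue
            successors = moves.get(pos, [])
            if any(status.get(s) == "lost" for s in successors):
                status[pos] = "won"
            elif all(status.get(s) == "won" for s in successors):
                status[pos] = "lost"  # also covers positions with no moves
            else:
                continue
            changed = True
    return status

DOMAIN = ["a", "b"]
B_FACTS = {("a", "b"), ("b", "a")}
C_FACTS = {"a"}

# A(X) :- B(X, Y), ¬C(Y): rule node r(x,y) offers one goal per body literal.
moves = {}
for x in DOMAIN:
    moves[f"A({x})"] = [f"r({x},{y})" for y in DOMAIN]
    moves[f"g2({x})"] = [f"C({x})"]  # goal for the negated literal ¬C
    moves[f"C({x})"] = [f"rC({x})"] if x in C_FACTS else []
    for y in DOMAIN:
        moves[f"r({x},{y})"] = [f"g1({x},{y})", f"g2({y})"]
        moves[f"g1({x},{y})"] = [f"notB({x},{y})"]
        moves[f"notB({x},{y})"] = [f"B({x},{y})"]
        moves[f"B({x},{y})"] = [f"rB({x},{y})"] if (x, y) in B_FACTS else []

status = solve(moves)
print(status["A(a)"], status["A(b)"])  # won lost
```

A(a) is won via the rule instance r(a,b), because B(a,b) holds and C(b) does not; A(b) is lost because its only candidate B(b,a) leads to C(a), which is a fact.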
  
Provenance ~ Query Evaluation Game
(a) Input I: graph with edges a→a (p), a→b (q), b→a (r), b→c (s)
(b) … annotated:
hop
a  a  p
a  b  q
b  a  r
b  c  s
(c) 3hop with provenance:
3hop
a  a  p^3 + 2pqr
a  b  p^2 q + q^2 r
a  c  pqs
b  a  p^2 r + q r^2
b  b  pqr
b  c  qrs
(d) The game provenance of 3hop(a, a) … [figure: game graph]
(e) … is p^3 + 2pqr. [figure: operator tree over p, q, r]
Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries!
Source: [KLZ13]
Provenance ~ Query Evaluation Game
… but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!!
Here: not 3hop(c,a) – can’t go back from GSLIS (c) to CS (a)
[Figure 2: Why-not provenance for 3hop(c, a) using provenance games.]
… g_1^i in the body of r_1, thus claiming that g_1^i is false and hence that the r_1 instance doesn’t derive t. The first player can counter and demonstrate that g_1^i is true by selecting a rule instance or fact as evidence for g_1^i. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it …
Source: [KLZ13]
Database Provenance: Summary
•  Fine-grained “white-box” provenance
•  Solved (pretty much) for positive queries
•  … not so much for negation and “Why-Not”
– Active area of research!
•  Some research prototypes …
•  … and some real-world implementations!
•  Note: Those in need of provenance often already “do it”!!
– Crash recovery, auditing, concurrency control, …
Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
– Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)
Scientific Workflows: ASAP!
•  Automation
–  wfs to automate computational aspects of science
•  Scaling (exploit and optimize machine cycles)
–  wfs should make use of parallel compute resources
–  wfs should be able to handle large data
•  Abstraction, Evolution, Reuse (human cycles)
–  wfs should be easy to (re-)use, evolve, share
•  Provenance
–  wfs should capture processing history, data lineage
→ traceable data- and wf-evolution
→ Reproducible Science
[Figure: screenshots of Trident Workbench and VisTrails]
Once upon a time …
Phylogenetics workflow in Kepler (2005)
Graphical interface
§  Canvas for assembling
and displaying the
workflow.
§  Library of workflow
blocks (‘actors’) that can
be dragged onto the
canvas and connected.
§  Arrows that represent
control dependencies or
paths of data flow.
§  A run button.
These features are
not essential to
managing actual
scientific workflows.
What some of us think of when we hear the term ‘scientific workflows’
Source: Tim McPhillips
10 Key Functions of a Sci-WFS
1.  Automate programs and services scientists already use.
2.  Schedule invocations of programs and services correctly and efficiently
– in parallel where possible.
3.  Manage dataflow to, from, and between programs and services.
4.  Enable scientists (not just developers) to author or modify workflows
easily.
5.  Predict what a workflow will do when executed: prospective provenance.
6.  Record what actually happens during workflow execution.
7.  Reveal retrospective provenance – how workflow products were
derived from inputs via programs and services.
8.  Organize intermediate and final data products as desired by users.
9.  Enable scientists to version, share and publish their workflows.
10.  Empower scientists who wish to automate additional programs and
services themselves.
These functions–not actors—distinguish scientific workflow
automation from general scientific software development.
Tim McPhillips et al.
Yes, scripts are (can be) workflows too!
Interactive Visualization
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
–  Four Corners; AD 600–1500. Climate change influenced Mesa Verde migrations; late 13th century AD. Uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates joint information in tree rings and a climate signal to identify the “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
… HPCBio Workflows @ Illinois
National Petascale Computing Facility
Broad Institute: Recommended workflow for variant analysis
Liudmila Mainzer, Victor Jongeneel
HPCBio @ Illinois
Quickly, say: #!/bin/bash
It’s time to shift control …
•  … back from being consumers of someone else’s (= our) tools ..
–  “Just click here!”
•  ... to tool makers!
–  Scientists who author workflows as scripts!
•  Go where the wild things (users!) are …
–  Yes, develop for “end users” …
–  … but don’t forget the tool makers!
•  Can we do this together?
Example: AutoDrug Workflow
[Figure: AutoDrug workflow. Steps shown: Mount Sample, Screen Sample, Align Sample, Expose Sample, Analyze Images, Check Criteria, Calculate Strategy, Collect Data Set, Calculate Maps, List Peaks, Run Search, Refine Structure, Integrate Images, Scale Reflections, Merge Reflections, Calc Amplitudes. Stages shown: Collect Data, Process Data, Solve Structure, Analyze Density. Tools shown: Blu-Ice, LABELIT, molrep, refmac, ipmosflm, xds, pointless, scala, xtriage, truncate, rfree.]
Tsai, Y., McPhillips, S. E., González, A., McPhillips, T. M., Zinn, D., Cohen, A. E., ... & Soltis, S. (2013). AutoDrug: fully automated macromolecular crystallography workflows for fragment-based drug discovery. Acta Crystallographica Section D: Biological Crystallography, 69(5), 796-803.
3D Protein Structure Determination by X-ray Crystallography
[Figure panels: diffraction images; experimental electron density and protein model; full protein structure]
Source: Tim McPhillips
Automated Sample Handling
[Figure labels: crystal in loop; sample mounting robot; cassette shipping dewar; crystal mounting pin; sample cassette]
Alice, the high-throughput crystallographer: When the first shift of her beam time begins, technicians at the beam line load the three cassettes into a liquid nitrogen dewar within reach of the sample-mounting robot and close the radiation door. From this point Alice is able to control beam line operations remotely.
Source: Tim McPhillips
Remote beam line operation
Source: Tim McPhillips
Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
– Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more with Provenance!
– … sometimes using less (e.g., no provenance recorder)
[Figure: workflow graph (nodes GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output, …), overlaid with a question mark]
YesWorkflow:
Yes, scripts are workflows, too!
•  Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
Enter: YesWorkflow! (yesworkflow.org)
•  YesWorkflow (YW)
–  Grass-roots effort
–  … meeting the scientists/users where they R!
•  R, Matlab, (i)Python, Jupyter, …
–  Scripts + simple user annotations
•  => Reveal the workflow model/abstraction
 … that underlies the (script) implementation
•  => YW can give us more of ASAP!
–  First YW: ASAP (Abstraction)...
–  Then YW-recon: ASAP (reconstructing runtime Provenance)
Related Work, other Approaches
… to bring workflow/provenance benefits to scripts:
•  Runtime Provenance Recorders:
– use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables
•  file read/write, function calls, program variables & state, …
– noWorkflow system
•  [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]
•  exploit Python profiling library to capture runtime provenance
=> helps with "S" and "P"
YW (prospective) and YW-Recon (retrospective) Provenance
•  1. YW: Annotate Script => YW Model
–  Annotate @BEGIN..@END, @IN, @OUT
–  Visualize, share, be happy :-)
•  2. Run script
–  Files are read and written
–  Folder- & Filenames have metadata
•  3. YW-Recon
–  Use @URI tags that link YW Model ⇔ Persisted Data
–  Run URI-template queries
•  cf. “ls -R” & RegEx matching
•  4. YW-Query
–  Answer the user’s provenance queries
YW annotations: Model your Workflow!
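To make the annotation idea concrete, here is a hypothetical annotated fragment and a toy extractor that turns the comments into a block model. The keywords @BEGIN, @END, @IN, @OUT and @URI come from the slides, and the data names and URI templates are taken from the workflow graph shown in this deck; the exact comment layout, the choice of inputs for `transform_images`, and the `extract_model` function are assumptions for illustration. This is not the YesWorkflow tool itself.

```python
import re

# Hypothetical annotated script, held as text; it is parsed, not executed.
SCRIPT = '''
# @BEGIN transform_images
# @IN raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
# @IN calibration_image @URI file:calibration.img
# @OUT corrected_image @URI file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
corrected = raw - calibration   # placeholder for the real computation
# @END transform_images
'''

# One annotation per comment line: keyword, name, optional @URI template.
TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\S+)(?:\s+@URI\s+(\S+))?")

def extract_model(source):
    """Collect blocks with their inputs and outputs from comment annotations."""
    blocks, stack = {}, []
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name, uri = match.groups()
        if tag == "BEGIN":
            blocks[name] = {"in": {}, "out": {}}
            stack.append(name)
        elif tag == "END":
            stack.pop()
        else:
            # @IN / @OUT attach to the innermost open block.
            blocks[stack[-1]][tag.lower()][name] = uri
    return blocks

model = extract_model(SCRIPT)
for block, ports in model.items():
    print(block, "reads", sorted(ports["in"]), "writes", sorted(ports["out"]))
```

Because the model lives entirely in comments, the script runs unchanged with or without the annotations.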
  
YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
•  YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
YW!
Voila! The Workflow revealed!
[Figure: YW workflow graph of the data collection script]
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity
Data with URI templates:
  sample_spreadsheet   file:cassette_{cassette_id}_spreadsheet.csv
  calibration_image    file:calibration.img
  run_log              file:run/run_log.txt
  rejection_log        file:/run/rejected_samples.txt
  raw_image            file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
  corrected_image      file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
  collection_log       file:run/collected_images.csv
Other data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path
Get 3 views for the price of 1!
[Figure: YW workflow graph of the data collection script, rendered as three views]
Process view
Data view
Combined view
Paleoclimate Reconstruction (EnviRecon.org)
•  … explained using YesWorkflow!
[Figure: YW workflow graph. Node labels: GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
Kyle B., (computational) archaeologist:
"It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
Provenance Lands
Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
cassette_id
sample_score_cutoff
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_namesample_quality
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_id energy frame_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
•  URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
•  … facilitating provenance reconstruction
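A minimal sketch of how a URI template could be matched against a file path, assuming each {variable} stands for one path segment or part of one. The function names template_to_regex and match_template are hypothetical and are not part of the YesWorkflow tools; only the two templates come from the slide.

```python
import re

def template_to_regex(template):
    """Compile a URI template into a regex with one named group per {variable}."""
    if template.startswith("file:"):
        template = template[len("file:"):]
    # re.split with a capturing group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    seen = set()
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal path text
        elif part in seen:
            pattern += f"(?P={part})"         # repeated variable must match the same text
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+?)"  # a variable never spans a path separator
    return re.compile(pattern)

def match_template(template, path):
    """Return the variable bindings for path, or None if it does not fit the template."""
    m = template_to_regex(template).fullmatch(path)
    return m.groupdict() if m else None

RAW = "file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED = "file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"

bindings = match_template(RAW, "run/raw/q55/DRT240/e10000/image_001.raw")
```

The bindings recovered from a path (cassette, sample, energy, frame) are the runtime metadata that the folder and file names carry.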
  
YW (prospective) and YW-Recon (retrospective) Provenance
•  1. YW: Annotate Script => YW Model
–  Annotate @BEGIN..@END, @IN, @OUT
–  Visualize, share, be happy :-)
•  2. Run script
–  Files are read and written
–  Folder- & Filenames have metadata
•  3. YW-Recon
–  Use @URI tags that link YW Model <=> Persisted Data
–  Run URI-template queries
    •  cf. “ls -R” & RegEx matching
•  4. YW-Query
–  Answer the user’s provenance queries
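Step 1 can be illustrated with a small sketch: a script fragment annotated in comments, plus a toy extractor that turns the annotations into a model. The tag names (@BEGIN, @END, @IN, @OUT, @URI) follow the slide; the script fragment, the choice of inputs for transform_images, and the extract_model function are assumptions made for illustration and do not reproduce the YesWorkflow implementation.

```python
import re

# Hypothetical annotated script fragment; the annotations live in ordinary comments.
SCRIPT = '''
# @BEGIN transform_images
# @IN raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
# @IN calibration_image @URI file:calibration.img
# @OUT corrected_image @URI file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
def transform_images():
    pass  # image correction would happen here
# @END transform_images
'''

TAG = re.compile(r"@(BEGIN|END|IN|OUT)\s+(\w+)(?:\s+@URI\s+(\S+))?", re.IGNORECASE)

def extract_model(source):
    """Collect, per annotated step, its inputs and outputs with optional URI templates."""
    steps = {}
    current = None
    for line in source.splitlines():
        m = TAG.search(line)
        if not m:
            continue
        tag, name, uri = m.group(1).upper(), m.group(2), m.group(3)
        if tag == "BEGIN":
            current = steps.setdefault(name, {"in": {}, "out": {}})
        elif tag == "END":
            current = None
        elif current is not None:
            current[tag.lower()][name] = uri  # "in" or "out"
    return steps

model = extract_model(SCRIPT)
```

The resulting dictionary is the prospective part; the @URI templates it carries are what step 3 matches against the files a run leaves behind.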
  	
  
  
Data collection workflow (X-ray diffraction)
[Figure: YW model graph of the data collection workflow, repeated from the earlier slide.]
  
Data collection workflow: runtime data
[Figure: YW model graph of the data collection workflow, repeated from the earlier slide.]
  
run/  
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
│  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
  
1.  YW annotations => YW model
2.  Files & Folders left by a run => runtime (meta-)data
  
Q1: What samples did the script run collect images from?
[Figures: YW model graph and run/ directory listing, repeated from the earlier slides.]
  
Q2: What energies were used for image collection from sample DRT322?
[Figures: YW model graph and run/ directory listing, repeated from the earlier slides.]
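Q1 and Q2 can be answered from file names alone. The sketch below assumes a hand-written regular expression for the raw_image URI template and a short, illustrative list of paths taken from the directory listing; the helper names samples and energies_for are hypothetical.

```python
import re

# Regex written by hand from the raw_image template
# run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw"
)

# A few paths from the listing; a real run would supply the full set.
paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",  # does not fit the template and is skipped
]

records = [m.groupdict() for p in paths if (m := RAW.fullmatch(p))]

def samples(records):
    """Q1: samples for which raw images exist."""
    return sorted({r["sample_id"] for r in records})

def energies_for(records, sample_id):
    """Q2: energies at which images of one sample were collected."""
    return sorted({int(r["energy"]) for r in records if r["sample_id"] == sample_id})
```

On a real run directory the path list could come from os.walk or pathlib.Path.rglob, which plays the role of “ls -R”.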
  
  
Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
[Figures: YW model graph and run/ directory listing, repeated from the earlier slides.]
  
Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
[Figures: YW model graph and run/ directory listing, repeated from the earlier slides.]
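Q3 and Q5 both go from a corrected image back to its raw image. A possible sketch: read sample, energy, and frame number from the corrected file name, then look for the raw path that shares those values; the cassette id is whatever fills the remaining slot. The function raw_candidates and the path list are hypothetical, and the link between the two files is inferred only from the shared template variables.

```python
import re

# File-name part of the corrected_image template: {sample_id}_{energy}eV_{frame_number}.img
CORRECTED = re.compile(r"(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img")

def raw_candidates(corrected_name, raw_paths):
    """Return (raw_path, cassette_id) pairs whose variables agree with the corrected image."""
    m = CORRECTED.fullmatch(corrected_name)
    if m is None:
        return []
    b = m.groupdict()
    # Only cassette_id is left open; the other variables are fixed by the file name.
    pattern = re.compile(
        r"run/raw/(?P<cassette_id>[^/]+)/"
        + re.escape(f"{b['sample_id']}/e{b['energy']}/image_{b['frame_number']}.raw")
    )
    return [(p, hit.group("cassette_id")) for p in raw_paths if (hit := pattern.fullmatch(p))]

raw_paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

q3 = raw_candidates("DRT322_11000eV_030.img", raw_paths)
q5 = raw_candidates("DRT240_10000eV_001.img", raw_paths)
```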
  
  
Querying Provenance
  
Taking YW for a spin …
•  “To document on-the-fly, specifically for a given workflow configuration invoked:
–  do not insert annotations into code,
–  but rather have code print annotations into a special log during execution,
–  then parse that log!”  – Liudmila Mainzer
  
Source: L Mainzer, V Jongeneel (IGB & NCSA)
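The suggestion above could look roughly like this: the script writes annotation lines into a log while it runs, with concrete values filled in, and a separate pass parses the log. The log line format, the function names, and the example step are all assumptions for illustration; the slide does not specify any of them.

```python
import io

def log_annotation(log, tag, name, uri=None):
    """Write one annotation line, e.g. '@OUT raw_image @URI file:...'."""
    line = f"@{tag} {name}" + (f" @URI {uri}" if uri else "")
    log.write(line + "\n")

def collect(log, cassette_id, sample_id, energies):
    """Stand-in for a collection step that reports what it actually wrote."""
    log_annotation(log, "BEGIN", "collect_data_set")
    for energy in energies:
        path = f"file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_001.raw"
        log_annotation(log, "OUT", "raw_image", path)
    log_annotation(log, "END", "collect_data_set")

def parse_log(text):
    """Turn the annotation log back into (tag, name, uri) events."""
    events = []
    for line in text.splitlines():
        if not line.startswith("@"):
            continue  # ignore ordinary log output
        head, _, uri = line.partition(" @URI ")
        tag, name = head[1:].split(" ", 1)
        events.append((tag, name, uri or None))
    return events

log = io.StringIO()  # a real script would open a log file instead
collect(log, "q55", "DRT322", [10000, 11000])
events = parse_log(log.getvalue())
```

Because the annotations are printed at run time, they describe the workflow configuration that was actually invoked rather than every branch the code could take.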
  	
  
Conclusions
•  Provenance
–  … in databases
–  … in scientific workflows
•  Scripts are (often) workflows too!
•  => Need to support provenance management for scripts and scientific workflows!
•  One size might not fit all …
–  Use prospective, retrospective (recorded, reconstructed provenance)
•  Facilitate “insider” (or “deep”) provenance
–  … the stuff scientists need to get their job done!
  
Deep Provenance to get the science done!
•  When reconstructing the past climate, need to know which tree-ring source was used!
  
[Figure: map of Arizona, Colorado, New Mexico, and Utah. Map labels: CRTZ, MVNP, ESPN, LANL. Legend: Douglas fir; Pinyon and juniper; Spruce, pine, and true fir; GHCN stations.]
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
  	
  
Conclusions (Cont’d)
•  YesWorkflow: Go where the users are!
–  … they already capture provenance through metadata!
•  Beware your level of provenance abstraction
–  Let the user provide a workflow model easily!
•  YW-Recon:
–  … finishing support for retrospective provenance without using a runtime provenance recorder!
–  Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
•  Future Work:
–  Build systems that work with the existing workflow of scientists!
–  There are many research questions & opportunities out there!
    •  e.g.: Why-Not provenance for scientific workflows anyone?
  
  

More Related Content

Viewers also liked

Provenance in Databases and Scientific Workflows: Part II (Databases)
Provenance in Databases and Scientific Workflows: Part II (Databases)Provenance in Databases and Scientific Workflows: Part II (Databases)
Provenance in Databases and Scientific Workflows: Part II (Databases)Bertram Ludäscher
 
DAIS Seminar: The Many Faces of Provenance in Databases and Workflows
DAIS Seminar: The Many Faces of Provenance in Databases and WorkflowsDAIS Seminar: The Many Faces of Provenance in Databases and Workflows
DAIS Seminar: The Many Faces of Provenance in Databases and WorkflowsBertram Ludäscher
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceBertram Ludäscher
 
A Sightseeing Tour of Provenance in Databases & Workflows
A Sightseeing Tour of Provenance in Databases & WorkflowsA Sightseeing Tour of Provenance in Databases & Workflows
A Sightseeing Tour of Provenance in Databases & WorkflowsBertram Ludäscher
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineBertram Ludäscher
 
Querying Provenance Information: Basic Notions and an Example from Paleoclima...
Querying Provenance Information: Basic Notions and an Example from Paleoclima...Querying Provenance Information: Basic Notions and an Example from Paleoclima...
Querying Provenance Information: Basic Notions and an Example from Paleoclima...Bertram Ludäscher
 

Viewers also liked (6)

Provenance in Databases and Scientific Workflows: Part II (Databases)
Provenance in Databases and Scientific Workflows: Part II (Databases)Provenance in Databases and Scientific Workflows: Part II (Databases)
Provenance in Databases and Scientific Workflows: Part II (Databases)
 
DAIS Seminar: The Many Faces of Provenance in Databases and Workflows
DAIS Seminar: The Many Faces of Provenance in Databases and WorkflowsDAIS Seminar: The Many Faces of Provenance in Databases and Workflows
DAIS Seminar: The Many Faces of Provenance in Databases and Workflows
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & Provenance
 
A Sightseeing Tour of Provenance in Databases & Workflows
A Sightseeing Tour of Provenance in Databases & WorkflowsA Sightseeing Tour of Provenance in Databases & Workflows
A Sightseeing Tour of Provenance in Databases & Workflows
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefine
 
Querying Provenance Information: Basic Notions and an Example from Paleoclima...
Querying Provenance Information: Basic Notions and an Example from Paleoclima...Querying Provenance Information: Basic Notions and an Example from Paleoclima...
Querying Provenance Information: Basic Notions and an Example from Paleoclima...
 

Similar to Works 2015-provenance-mileage

Why-Not Provenance through Game Semantics
Why-Not Provenance through Game SemanticsWhy-Not Provenance through Game Semantics
Why-Not Provenance through Game SemanticsBertram Ludäscher
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesBertram Ludäscher
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
Provenance and DataONE: Facilitating Reproducible Science
Provenance and DataONE: Facilitating Reproducible ScienceProvenance and DataONE: Facilitating Reproducible Science
Provenance and DataONE: Facilitating Reproducible ScienceBertram Ludäscher
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...Paolo Missier
 
A Brief Provenance Tour … via DataONE
A Brief Provenance Tour  … via DataONEA Brief Provenance Tour  … via DataONE
A Brief Provenance Tour … via DataONEBertram Ludäscher
 
Interpretation, Context, and Metadata: Examples from Open Context
Interpretation, Context, and Metadata: Examples from Open ContextInterpretation, Context, and Metadata: Examples from Open Context
Interpretation, Context, and Metadata: Examples from Open ContextEric Kansa
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceBertram Ludäscher
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...Oscar Corcho
 
Kurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere MortalsKurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere MortalsBertram Ludäscher
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingRayhan Ferdous
 
GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)Bertram Ludäscher
 
DMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationDMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationPier Luca Lanzi
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curationblalbritton
 
The Power of Probabilistic Thinking (keynote talk at ASE 2016)
The Power of Probabilistic Thinking (keynote talk at ASE 2016)The Power of Probabilistic Thinking (keynote talk at ASE 2016)
The Power of Probabilistic Thinking (keynote talk at ASE 2016)David Rosenblum
 
Exploring Word2Vec in Scala
Exploring Word2Vec in ScalaExploring Word2Vec in Scala
Exploring Word2Vec in ScalaGary Sieling
 

Similar to Works 2015-provenance-mileage (20)

Why-Not Provenance through Game Semantics
Why-Not Provenance through Game SemanticsWhy-Not Provenance through Game Semantics
Why-Not Provenance through Game Semantics
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science Tales
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Provenance and DataONE: Facilitating Reproducible Science
Provenance and DataONE: Facilitating Reproducible ScienceProvenance and DataONE: Facilitating Reproducible Science
Provenance and DataONE: Facilitating Reproducible Science
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...
 
A Brief Provenance Tour … via DataONE
A Brief Provenance Tour  … via DataONEA Brief Provenance Tour  … via DataONE
A Brief Provenance Tour … via DataONE
 
Interpretation, Context, and Metadata: Examples from Open Context
Interpretation, Context, and Metadata: Examples from Open ContextInterpretation, Context, and Metadata: Examples from Open Context
Interpretation, Context, and Metadata: Examples from Open Context
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & Provenance
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
 
Kurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere MortalsKurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere Mortals
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to Reporting
 
GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)
 
DMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data ExplorationDMTM 2015 - 04 Data Exploration
DMTM 2015 - 04 Data Exploration
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curation
 
The Power of Probabilistic Thinking (keynote talk at ASE 2016)
The Power of Probabilistic Thinking (keynote talk at ASE 2016)The Power of Probabilistic Thinking (keynote talk at ASE 2016)
The Power of Probabilistic Thinking (keynote talk at ASE 2016)
 
Exploring Word2Vec in Scala
Exploring Word2Vec in ScalaExploring Word2Vec in Scala
Exploring Word2Vec in Scala
 

More from Bertram Ludäscher

Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Bertram Ludäscher
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database RulesBertram Ludäscher
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database RulesBertram Ludäscher
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsBertram Ludäscher
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Bertram Ludäscher
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueBertram Ludäscher
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseBertram Ludäscher
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
Works 2015-provenance-mileage

  • 1. YesWorkflow: More Provenance Mileage from Scientific Workflows and Scripts! Bertram Ludäscher. Director, Center for Informatics Research in Science and Scholarship (CIRSS); Professor, Graduate School of Library and Information Science (GSLIS); Faculty affiliate, NCSA & Department of Computer Science
  • 2. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • … vs Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less)
  • 3. Provenance Palooza • Provenance – … or provenience? • Chain of custody • Lineage • Pedigree • Genealogy • Phylogeny • History • Origin
  • 4. Provenance Research everywhere … … and here:
  • 5. Provenance as we all know it • Oxford English Dictionary: – coming from some particular source or quarter; origin, derivation – the history or pedigree of a work of art, manuscript, rare book, etc. – concretely, a record of the passage of an item through its various owners (“chain of custody”) • Merriam-Webster: – prov·e·nance noun ˈpräv-nən(t)s, ˈprä-və-ˌnän(t)s – the origin or source of something • Origin: – French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come
  • 6. Provenance. Is this a real Leonardo? Lack of reliable provenance casts a doubt on this …
  • 7. Pedigree
  • 8. Natural History: Understanding what happened… Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009. Author: Jkwchui (Based on drawing by Truth-seeker2004)
  • 9. Provenience vs Provenance
  • 10. Archivists • Principle of provenance (respect des fonds) • Keep records of different origins separate to preserve context. Society of American Archivists, http://www2.archivists.org/glossary/terms/p/provenance
  • 11. So what is “provenance” (sensu W3C)? • Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*) • Provenance is a description of how things came to be, and how they came to be in the state they are in today (*) • Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world
  • 12. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less)
  • 13. Provenance => Transparency • = “Externally-facing” provenance – “Them-Provenance” • Later: “Internally-facing” provenance – “Me-Provenance”
  • 14. Climate Change: Whodunnit?
  • 15. Tracing the sources (data, code)
  • 16. From “Climate Gate” to Reproducible Science
  • 17. Data & Provenance Management: Single Model
  • 18. Data & Provenance Management: Model Chains
  • 19. Some things people do with “provenance” • Result validation • Result debugging (science vs wf logic) • Reproducibility and Repeatability • Explanation (derivations, traces, proof trees) • Runtime monitoring – Profiling, benchmarking • Performance Optimization (“smart re-run”) • Fault-tolerance, crash-recovery • Database view maintenance (e.g. data warehousing) • …
  • 20. Provenance for Virtual Joint Experiments • How do we ensure that Charlie gets a complete account of the history of WC's outputs? • How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob's data v? => traces TA and TB will be critical => need to compose them to obtain TC. We can view the composition WC as a new, virtual workflow. [Figure: Alice (1) develops WA and (2) runs RA; Bob (3) develops WB and (5) runs RB after (4) data sharing; Charlie (6) inspects provenance and (7) understands and generates WC from traces TA and TB.]
  • 21. Open Provenance Model => W3C Prov
  • 22. W3C Prov: One size fits all?
  • 23. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less)
  • 24. Types of Data Provenance • Black-box – know (next to) nothing at compile-time – at runtime: keep some data lineage – most prov sensu WF work use this • White-box – statically (compile-time) analyzable – q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2) – most prov sensu DB work use this • Grey-box – can “look inside” (some black boxes) – … e.g. b/c they have subworkflows – … or FP signatures: A :: t1, t2 → t3, t4 – … or semantic annotations (sem. types)
  • 25. Provenance in Databases. Source: Val Tannen
  • 26. Provenance in Databases. Source: Val Tannen
  • 27. Provenance in Databases. Source: Val Tannen
  • 28. Abstracting the structure of querying. In database provenance, tuples are either combined conjunctively (*) or disjunctively (+) => That’s the core model! Source: Val Tannen
  • 29. Provenance Polynomials: One Semiring to Rule them all! (DB theory strikes!) Unifying most prior work in a simple model! Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007
  • 30. Example: Go from X to Y in 3 hops! (e.g., a = CS, b = NCSA, c = GSLIS) • Database: hop(X,Y) := hop(a,a, p). hop(a,b, q). hop(b,a, r). hop(b,c, s). • Query: 3hop(X,Y) :- hop(X, Z1), hop(Z1, Z2), hop(Z2,Y). • Annotated answers: 3hop(a,a, p^3 + 2pqr). 3hop(a,b, p^2q + q^2r). … 3hop(a,c, pqs). Note: Cannot go from c to a in 3 hops! [Figure: edge-labeled graph over a, b, c, with each derivation listed as a product of edge labels, e.g. ppp + pqr + qrp for 3hop(a,a).]
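The 3hop example can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the dictionary `HOP` and the helpers `three_hop` and `show` are hypothetical names introduced here, and a polynomial is represented naively as a counter of monomials. Tokens multiply along one derivation (joint use of facts) and alternative derivations add up.

```python
from collections import Counter

# Annotated facts hop(X, Y, token) taken from the slide.
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}


def three_hop(hop):
    """Return {(x, y): Counter({monomial: coefficient})} for 3hop(X, Y).

    A monomial is a sorted tuple of tokens, so ('p', 'q', 'r') means p*q*r.
    """
    result = {}
    for (x, z1), t1 in hop.items():
        for (z1_again, z2), t2 in hop.items():
            if z1_again != z1:
                continue
            for (z2_again, y), t3 in hop.items():
                if z2_again != z2:
                    continue
                monomial = tuple(sorted((t1, t2, t3)))  # product of three tokens
                result.setdefault((x, y), Counter())[monomial] += 1  # sum of derivations
    return result


def show(poly):
    """Render a polynomial such as p^3 + 2pqr."""
    terms = []
    for monomial in sorted(poly):
        coeff = poly[monomial]
        powers = Counter(monomial)
        body = "".join(t if n == 1 else f"{t}^{n}" for t, n in sorted(powers.items()))
        terms.append((str(coeff) if coeff > 1 else "") + body)
    return " + ".join(terms)


if __name__ == "__main__":
    for pair, poly in sorted(three_hop(HOP).items()):
        print(pair, show(poly))
```

Pairs with no derivation, such as (c, a), simply do not appear in the result, which is exactly the case that positive provenance polynomials cannot explain.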
  • 31. Provenance Polynomials: “My precious!” p^3 + 2pqr; p^3 + pqr; p + 2pqr; p + pqr; pqr; p + pqr; p
  • 32. Provenance in Databases
  • 33. Negation & Why-Not Provenance • Provenance Semirings work well for: – Positive Queries (e.g., RA+) • Challenges: Handling of – set difference (~ negation) – Why-not provenance – Missing Answer provenance • A fresh look at provenance! • … using an old idea: Game semantics!
  • 34. Provenance (or Query Evaluation) Games: “SLD-resolution game” A(X) :- B(X,Y,Z) … not C(X,Y) … Eureka! [KLZ13] Köhler, S., Ludäscher, B., & Zinn, D. (2013). First-order provenance games. In Search of Elegance in the Theory and Practice of Computation. Springer
  • 35. Translation: Q(I) => G_Q(I). (a) Game template for Q_ABC: A(X) :- B(X, Y), ¬C(Y). (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}. [Figure: game graphs with positions for atoms, negated atoms, rule instances r, goals g, and facts rB, rC.] Source [KLZ13]
  • 36. Solve G_Q(I) => Provenance! (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}. (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not). Figure 3: Provenance game for Q. The well-founded model of … Source [KLZ13]
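Solving such a game can be sketched as a backward-induction labeling: a position is won if some move leads to a lost position, and lost if every move (possibly none) leads to a won position. The code below is a hedged illustration only. The edge structure produced by the hypothetical `build_game` is one reading of the game template for A(X) :- B(X, Y), ¬C(Y), not a transcription of the figure in [KLZ13], and keeping exactly the won-to-lost and lost-to-won moves as "provenance edges" is an assumption about what the slide calls good moves.

```python
def build_game(b_facts, c_facts, domain):
    """Build a move graph for A(X) :- B(X, Y), not C(Y) (assumed structure)."""
    moves = {}
    for x in domain:
        moves[f"A({x})"] = [f"r({x},{y})" for y in domain]       # pick a rule instance
        for y in domain:
            moves[f"r({x},{y})"] = [f"g1({x},{y})", f"g2({y})"]  # attack a goal
            moves[f"g1({x},{y})"] = [f"notB({x},{y})"]           # positive goal B(X, Y)
            moves[f"notB({x},{y})"] = [f"B({x},{y})"]
            # A fact position has no outgoing moves, so whoever must move there loses.
            moves[f"B({x},{y})"] = [f"fact_B({x},{y})"] if (x, y) in b_facts else []
        moves[f"g2({x})"] = [f"C({x})"]                          # negated goal not C(Y)
        moves[f"C({x})"] = [f"fact_C({x})"] if x in c_facts else []
    return moves


def solve(moves):
    """Label every position 'won', 'lost', or 'drawn' by fixpoint iteration."""
    positions = set(moves) | {t for targets in moves.values() for t in targets}
    label = {}
    changed = True
    while changed:
        changed = False
        for pos in positions:
            if pos in label:
                continue
            successors = moves.get(pos, ())
            if any(label.get(s) == "lost" for s in successors):
                label[pos] = "won"
                changed = True
            elif all(label.get(s) == "won" for s in successors):
                label[pos] = "lost"  # also covers positions with no moves
                changed = True
    return {pos: label.get(pos, "drawn") for pos in positions}


def provenance_edges(moves, label):
    """Keep moves between a won and a lost position (assumed 'good moves')."""
    return [(u, v) for u, targets in moves.items() for v in targets
            if {label[u], label[v]} == {"won", "lost"}]


if __name__ == "__main__":
    game = build_game({("a", "b"), ("b", "a")}, {"a"}, ["a", "b"])
    labels = solve(game)
    print(labels["A(a)"], labels["A(b)"])
```

On the instance from the slide, A(a) comes out won and A(b) lost, matching the statement that A(a) is true and A(b) is false.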
  • 37. Provenance ~ Query Evaluation Game. (a) input I … (b) … annotated: hop(a,a): p; hop(a,b): q; hop(b,a): r; hop(b,c): s. (c) 3hop with provenance: (a,a): p^3 + 2pqr; (a,b): p^2q + q^2r; (a,c): pqs; (b,a): p^2r + qr^2; (b,b): pqr; (b,c): qrs. (d) The game provenance of 3hop(a, a) … (e) … is p^3 + 2pqr. Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries! Source [KLZ13]
  • 38. Provenance ~ Query Evaluation Game … but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!! Here: not 3hop(c,a) – can’t go back from GSLIS (c) to CS (a). Figure 2: Why-not provenance for 3hop(c, a) using provenance games. … g^i_1 in the body of r1, thus claiming that g^i_1 is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that g^i_1 is true by selecting a rule instance or fact as evidence for g^i_1. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it … Source [KLZ13]
  • 39. Database Provenance: Summary • Fine-grained “white-box” provenance • Solved (pretty much) for positive queries • … not so much for negation and “Why-Not” – Active area of research! • Some research prototypes … • … and some real-world implementations! • Note: Those in need of provenance often already “do it”!! – Crash recovery, auditing, concurrency control, …
  • 40. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less)
  • 41. Scientific Workflows: ASAP! • Automation – wfs to automate computational aspects of science • Scaling (exploit and optimize machine cycles) – wfs should make use of parallel compute resources – wfs should be able to handle large data • Abstraction, Evolution, Reuse (human cycles) – wfs should be easy to (re-)use, evolve, share • Provenance – wfs should capture processing history, data lineage => traceable data- and wf-evolution => Reproducible Science. Once upon a time …: Trident Workbench, VisTrails
  • 42. Phylogenetics workflow in Kepler (2005). What some of us think of when we hear the term ‘scientific workflows’: Graphical interface § Canvas for assembling and displaying the workflow. § Library of workflow blocks (‘actors’) that can be dragged onto the canvas and connected. § Arrows that represent control dependencies or paths of data flow. § A run button. These features are not essential to managing actual scientific workflows. Source: Tim McPhillips
  • 43. 10 Key Functions of a Sci-WFS 1. Automate programs and services scientists already use. 2. Schedule invocations of programs and services correctly and efficiently – in parallel where possible. 3. Manage dataflow to, from, and between programs and services. 4. Enable scientists (not just developers) to author or modify workflows easily. 5. Predict what a workflow will do when executed: prospective provenance. 6. Record what actually happens during workflow execution. 7. Reveal retrospective provenance – how workflow products were derived from inputs via programs and services. 8. Organize intermediate and final data products as desired by users. 9. Enable scientists to version, share and publish their workflows. 10. Empower scientists who wish to automate additional programs and services themselves. These functions – not actors – distinguish scientific workflow automation from general scientific software development. Tim McPhillips et al.
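Functions 6 and 7 (record what happens, then reveal how a product was derived) can be illustrated with a small recorder for script steps. This is a sketch under stated assumptions and not how any particular workflow system captures provenance: `TRACE`, `traced`, `lineage`, and the two example steps are hypothetical names, and outputs are matched to later inputs by object identity, which only works inside a single Python process.

```python
import functools

TRACE = []  # retrospective provenance: one record per executed step


def traced(step):
    """Wrap a step so each invocation is recorded with its inputs and output."""
    @functools.wraps(step)
    def wrapper(*args):
        result = step(*args)
        TRACE.append({"step": step.__name__, "inputs": args, "output": result})
        return result
    return wrapper


@traced
def scale(values, factor):
    return [v * factor for v in values]


@traced
def total(values):
    return sum(values)


def lineage(value):
    """Names of the recorded steps that `value` was derived from, latest first."""
    for record in reversed(TRACE):
        if record["output"] is value:  # same object, not merely an equal value
            names = [record["step"]]
            for arg in record["inputs"]:
                names += lineage(arg)
            return names
    return []  # a raw input: no recorded step produced it


if __name__ == "__main__":
    result = total(scale([1, 2, 3], 2))
    print(result, lineage(result))
```

The function definitions alone already say which steps exist and how they could connect (a crude form of prospective provenance); the `TRACE` list says what actually ran.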
  • 44. Yes, scripts are (can be) workflows too! Interactive Visualization More  Provenance  Mileage  from  Workflows  and  Scripts   44  
  • 45. SKOPE:  Synthesized  Knowledge  Of  Past  Environments   More  Provenance  Mileage  from  Workflows  and  Scripts   45   Bocinsky,  Kohler  et  al.  study  rain-­‐fed  maize  of  Anasazi     –  Four  Corners;  AD  600–1500.  Climate  change  influenced  Mesa  Verde  MigraQons;  late   13th  century  AD.  Uses  network  of  tree-­‐ring  chronologies  to  reconstruct  a  spaQo-­‐ temporal  climate  field  at  a  fairly  high  resoluCon  (~800  m)  from  AD  1–2000.  Algorithm   esCmates  joint  informaCon  in  tree-­‐rings  and  a  climate  signal  to  idenCfy  “best”    tree-­‐ring   chronologies  for  climate  reconstrucCng.   K.  Bocinsky,  T.  Kohler,  A  2000-­‐year  reconstrucCon  of  the  rain-­‐fed   maize  agricultural  niche  in  the  US  Southwest.  Nature   Communica1ons.  doi:10.1038/ncomms6618     … implemented as an R Script …
  • 46. … HPCBio Workflows @ Illinois
      •  National Petascale Computing Facility
      •  Broad Institute: recommended workflow for variant analysis
      •  Liudmila Mainzer, Victor Jongeneel (HPCBio @ Illinois)
      •  Quickly, say: #!/bin/bash
  • 47. It’s time to shift control …
      •  … back from being consumers of someone else’s (= our) tools …
          –  “Just click here!”
      •  … to tool makers!
          –  Scientists who author workflows as scripts!
      •  Go where the wild things (users!) are …
          –  Yes, develop for “end users” …
          –  … but don’t forget the tool makers!
      •  Can we do this together?
  • 48. Example: AutoDrug Workflow
      [Figure: AutoDrug workflow diagram]
      •  Stages: Collect Data, Process Data, Solve Structure, Analyze Density
      •  Steps: Mount Sample, Screen Sample, Align Sample, Expose Sample, Analyze Images, Check Criteria, Calculate Strategy, Collect Data Set, Calculate Maps, List Peaks, Run Search, Refine Structure, Integrate Images, Scale Reflections, Merge Reflections, Calc Amplitudes
      •  Tools: Blu-Ice, LABELIT, molrep, refmac, z, ipmosflm, xds, pointless, scala, xtriage, truncate, rfree
      Tsai, Y., McPhillips, S. E., González, A., McPhillips, T. M., Zinn, D., Cohen, A. E., ... & Soltis, S. (2013). AutoDrug: fully automated macromolecular crystallography workflows for fragment-based drug discovery. Acta Crystallographica Section D: Biological Crystallography, 69(5), 796-803.
  • 49. 3D Protein Structure Determination by X-ray Crystallography
      [Figure: diffraction images; experimental electron density and protein model; full protein structure]
      Source: Tim McPhillips
  • 50. Automated Sample Handling
      [Figure: crystal in loop; crystal mounting pin; sample cassette; cassette shipping dewar; sample mounting robot]
      Alice, the high-throughput crystallographer: When the first shift of her beam time begins, technicians at the beam line load the three cassettes into a liquid nitrogen dewar within reach of the sample-mounting robot and close the radiation door. From this point Alice is able to control beam line operations remotely.
      Source: Tim McPhillips
  • 51. Remote beam line operation
      Source: Tim McPhillips
  • 52. Outline
      •  All things “Provenance” …
      •  Provenance: Why should you care?
      •  Provenance in Databases
          –  Why-, How-, …, Why-Not Provenance
      •  Provenance in Scientific Workflows
      •  YesWorkflow: Doing more with Provenance!
          –  … sometimes using less (e.g., no provenance recorder)
  • 53. YesWorkflow: Yes, scripts are workflows, too!
      [Figure: YW graph of the paleoclimate reconstruction script. Labels: GetModernClimate; PRISM_annual_growing_season_precipitation; SubsetAllData; dendro_series_for_calibration; dendro_series_for_reconstruction; CAR_Analysis_unique; cellwise_unique_selected_linear_models; CAR_Analysis_union; cellwise_union_selected_linear_models; CAR_Reconstruction_union; raster_brick_spatial_reconstruction; raster_brick_spatial_reconstruction_errors; CAR_Reconstruction_union_output; ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif; ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif; master_data_directory; prism_directory; tree_ring_data; calibration_years; retrodiction_years]
      •  Script vs Workflows/ASAP:
          –  Automation:  *****
          –  Scaling:     **
          –  Abstraction: *
          –  Provenance:  **
  • 54. Enter: YesWorkflow! (yesworkflow.org)
      •  YesWorkflow (YW)
          –  Grass-roots effort
          –  … meeting the scientists/users where they R!
              •  R, Matlab, (i)Python, Jupyter, …
          –  Scripts + simple user annotations
              •  => Reveal the workflow model/abstraction … that underlies the (script) implementation
              •  => YW can give us more of ASAP!
          –  First YW: ASAP (Abstraction) ...
          –  Then YW-recon: ASAP (reconstructing runtime Provenance)
  • 55. Related Work, other Approaches
      … to bring workflow/provenance benefits to scripts:
      •  Runtime Provenance Recorders:
          –  use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables
              •  file read/write, function calls, program variables & state, …
          –  noWorkflow system
              •  [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]
              •  exploits a Python profiling library to capture runtime provenance
      => helps with "S" and "P"
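A runtime recorder of the kind listed above can be pictured with Python's standard profiling hook. This is only a minimal sketch, assuming `sys.setprofile` as the hook; it is not how noWorkflow is implemented, and `record_calls`, `load`, and `analyze` are hypothetical names introduced for illustration.

```python
import sys


def record_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions entered)."""
    calls = []

    def profiler(frame, event, arg):
        # The "call" event fires whenever a Python-level function is entered.
        if event == "call":
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always detach the hook again
    return result, calls


# Hypothetical two-step "script" to observe.
def load(values):
    return tuple(values)


def analyze(values):
    return sum(load(values))


result, calls = record_calls(analyze, [1, 2, 3])
```

The recorded call list is a small piece of retrospective provenance: it says what ran, but nothing about what the steps mean, which is the gap the annotation-based approach addresses.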
  • 56. YW (prospective) and YW-Recon (retrospective) Provenance
      •  1. YW: Annotate Script => YW Model
          –  Annotate @BEGIN..@END, @IN, @OUT
          –  Visualize, share, be happy :-)
      •  2. Run script
          –  Files are read and written
          –  Folder- & Filenames have metadata
      •  3. YW-Recon
          –  Use @URI tags that link YW Model <=> Persisted Data
          –  Run URI-template queries
              •  cf. “ls -R” & RegEx matching
      •  4. YW-Query
          –  Answer the user’s provenance queries
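Step 3 can be pictured as turning each URI template into a regular expression and matching it against the files a run left behind. The following is a hedged sketch, not YesWorkflow's actual implementation; `template_to_regex` and `match_files` are hypothetical helpers, and the file names are modeled on the data-collection example used in this talk.

```python
import re


def template_to_regex(template):
    """Compile 'a/{x}/b_{y}.raw' into a regex with one named group per variable."""
    if template.startswith("file:"):
        template = template[len("file:"):]
    seen = set()
    pattern = ""
    # re.split with a capturing group alternates literal text and variable names.
    for i, part in enumerate(re.split(r"\{(\w+)\}", template)):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal text
        elif part in seen:
            pattern += f"(?P={part})"         # repeated variable: must match again
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+)"   # variable: one path segment (or part of it)
    return re.compile(pattern)


def match_files(template, paths):
    """Return the variable bindings for every path that fits the template."""
    regex = template_to_regex(template)
    bindings = []
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            bindings.append(match.groupdict())
    return bindings


RAW_TEMPLATE = "file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",
]
bindings = match_files(RAW_TEMPLATE, paths)
```

Each binding is a dictionary of template variables recovered from one file name, which is the metadata that folder and file names carry in step 2.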
  • 57. YW annotations: Model your Workflow!
  • 58. YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
      •  YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
      [Figure: the annotated script passes through YW to produce the workflow graph. Graph labels, with URI templates in parentheses: cassette_id; sample_score_cutoff; sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv); calibration_image (file:calibration.img); initialize_run; run_log (file:run/run_log.txt); load_screening_results; sample_name; sample_quality; calculate_strategy; rejected_sample; accepted_sample; num_images; energies; log_rejected_sample; rejection_log (file:/run/rejected_samples.txt); collect_data_set; sample_id; energy; frame_number; raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw); transform_images; corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img); total_intensity; pixel_count; corrected_image_path; log_average_image_intensity; collection_log (file:run/collected_images.csv)]
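To make the idea concrete, here is a hedged sketch of a YW-style annotated script fragment and a toy extractor that recovers the steps and their inputs and outputs from the comments. The tags @BEGIN, @END, @IN, @OUT, and @URI are the ones named in this talk; `extract_blocks` is a hypothetical stand-in for the YesWorkflow tool rather than its implementation, and the annotated fragment is illustrative, not the author's script.

```python
import re

# Illustrative annotated fragment; only the comments matter to the extractor.
ANNOTATED_SCRIPT = '''
# @BEGIN collect_data_set
# @IN accepted_sample
# @IN num_images
# @OUT raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
# @END collect_data_set

# @BEGIN transform_images
# @IN raw_image
# @IN calibration_image @URI file:calibration.img
# @OUT corrected_image @URI file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
# @END transform_images
'''

TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\S+)(?:\s+@URI\s+(\S+))?", re.IGNORECASE)


def extract_blocks(source):
    """Return {step: {"in": [...], "out": [...], "uri": {name: template}}}."""
    blocks, current = {}, None
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue  # ordinary code or comment: ignored
        tag, name, uri = match.group(1).upper(), match.group(2), match.group(3)
        if tag == "BEGIN":
            current = blocks.setdefault(name, {"in": [], "out": [], "uri": {}})
        elif tag == "END":
            current = None
        elif current is not None:
            current[tag.lower()].append(name)
            if uri:
                current["uri"][name] = uri
    return blocks


model = extract_blocks(ANNOTATED_SCRIPT)
```

Because the model is read from comments, the script itself never has to run; that is what makes this prospective provenance.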
  • 59. Voila! The Workflow revealed!
      [Figure: YW workflow graph of the data-collection script, with the same labels as on slide 58]
  • 60. Get 3 views for the price of 1!
      •  Process view
      •  Data view
      •  Combined view
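One way to see why a single annotated model can yield three views: the combined view is a graph that alternates between steps and data, and the other two are projections of it. The sketch below assumes a hypothetical edge-list encoding; the step and data names come from the X-ray example in this talk, while the specific edges, `process_view`, and `data_view` are simplifying assumptions rather than YesWorkflow output.

```python
# Combined view: edges run data -> step (input) or step -> data (output).
COMBINED = [
    ("accepted_sample", "collect_data_set"),
    ("collect_data_set", "raw_image"),
    ("raw_image", "transform_images"),
    ("transform_images", "corrected_image"),
    ("transform_images", "total_intensity"),
    ("total_intensity", "log_average_image_intensity"),
    ("log_average_image_intensity", "collection_log"),
]
STEPS = {"collect_data_set", "transform_images", "log_average_image_intensity"}


def process_view(edges, steps):
    """Step -> step links, labeled with the data item passed between them."""
    producers = {data: step for step, data in edges if step in steps}
    return sorted(
        (producers[data], data, step)
        for data, step in edges
        if step in steps and data in producers
    )


def data_view(edges, steps):
    """Data -> data links, labeled with the step that derives one from the other."""
    inputs = [(data, step) for data, step in edges if step in steps]
    outputs = [(step, data) for step, data in edges if step in steps]
    return sorted(
        (src, step, dst)
        for src, step in inputs
        for producer, dst in outputs
        if producer == step
    )


process_links = process_view(COMBINED, STEPS)
data_links = data_view(COMBINED, STEPS)
```

The process view hides data nodes, the data view hides step nodes, and both are derived from the same annotations.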
  • 61. Paleoclimate Reconstruction (EnviRecon.org)
      •  … explained using YesWorkflow!
      [Figure: YW graph of the paleoclimate reconstruction script, with the same labels as on slide 53]
      Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
  • 62. Provenance Lands
      •  Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
      •  Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)
  • 63. YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
      [Figure: YW workflow graph, with the same labels as on slide 58, next to the directory tree left by a run:]
      run/
      ├── raw
      │   └── q55
      │       ├── DRT240
      │       │   ├── e10000
      │       │   │   ├── image_001.raw
      │       │   │   ...
      │       │   │   └── image_037.raw
      │       │   └── e11000
      │       │       ├── image_001.raw
      │       │       ...
      │       │       └── image_037.raw
      │       └── DRT322
      │           ├── e10000
      │           │   ├── image_001.raw
      │           │   ...
      │           │   └── image_030.raw
      │           └── e11000
      │               ├── image_001.raw
      │               ...
      │               └── image_030.raw
      ├── data
      │   ├── DRT240
      │   │   ├── DRT240_10000eV_001.img
      │   │   ...
      │   │   └── DRT240_11000eV_037.img
      │   └── DRT322
      │       ├── DRT322_10000eV_001.img
      │       ...
      │       └── DRT322_11000eV_030.img
      ├── collected_images.csv
      ├── rejected_samples.txt
      └── run_log.txt
      •  URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
      •  … facilitating provenance reconstruction
  • 64. YW (prospective) and YW-Recon (retrospective) Provenance
      [Recap: repeats the four steps of slide 56]
  • 65. Data collection workflow (X-ray diffraction)
      [Figure: YW workflow graph, with the same labels as on slide 58]
  • 66. Data collection workflow: runtime data
      [Figure: YW workflow graph (labels as on slide 58) next to the run/ directory tree (as on slide 63)]
      1.  YW annotations => YW model
      2.  Files & Folders left by a run => runtime (meta-)data
  • 67. Q1: What samples did the script run collect images from?
      [Figure: YW workflow graph (labels as on slide 58) next to the run/ directory tree (as on slide 63)]
  • 68. Q2: What energies were used for image collection from sample DRT322?
      [Figure: YW workflow graph (labels as on slide 58) next to the run/ directory tree (as on slide 63)]
  • 69. Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
      [Figure: YW workflow graph (labels as on slide 58) next to the run/ directory tree (as on slide 63)]
  • 70. Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
      [Figure: YW workflow graph (labels as on slide 58) next to the run/ directory tree (as on slide 63)]
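Questions Q1, Q2, Q3, and Q5 can all be answered from file names alone once the two image URI templates are matched against the files a run left behind. The sketch below is an illustration under stated assumptions, not YW-Recon itself: `bind` and `raw_for` are hypothetical helpers, the regular expressions are hand-written from the raw_image and corrected_image templates, and the listing contains only a few of the files visible in the directory tree.

```python
import re

# Hand-written equivalents of the raw_image and corrected_image URI templates.
RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw"
)
CORRECTED = re.compile(
    r"data/(?P<sample_id>[^/]+)/(?P=sample_id)"
    r"_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img"
)

FILES = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "data/DRT240/DRT240_10000eV_001.img",
    "data/DRT322/DRT322_11000eV_030.img",
]


def bind(regex, paths):
    """Return (path, variable bindings) for each path matching the regex."""
    return [(p, m.groupdict()) for p in paths for m in [regex.fullmatch(p)] if m]


raw = bind(RAW, FILES)

# Q1: which samples were images collected from?
q1 = sorted({b["sample_id"] for _, b in raw})

# Q2: which energies were used for sample DRT322?
q2 = sorted({b["energy"] for _, b in raw if b["sample_id"] == "DRT322"})


def raw_for(corrected_path):
    """Join a corrected image to raw images on (sample_id, energy, frame_number)."""
    key = CORRECTED.fullmatch(corrected_path).groupdict()
    return [(p, b) for p, b in raw if all(b[k] == v for k, v in key.items())]


# Q3: where is the raw image behind DRT322_11000eV_030.img?
q3 = [p for p, _ in raw_for("data/DRT322/DRT322_11000eV_030.img")]

# Q5: which cassette held the sample behind DRT240_10000eV_001.img?
q5 = [b["cassette_id"] for _, b in raw_for("data/DRT240/DRT240_10000eV_001.img")]
```

Q3 and Q5 need a join because the cassette id appears only in the raw-image path, while the corrected-image name carries the sample, energy, and frame number that identify the matching raw file.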
  • 71. Querying Provenance
  • 72. Taking YW for a spin …
      •  “To document on-the-fly, specifically for a given workflow configuration invoked:
          –  do not insert annotations into code,
          –  but rather have code print annotations into a special log during execution,
          –  then parse that log!”
      – Liudmila Mainzer
      Source: L Mainzer, V Jongeneel (IGB & NCSA)
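Mainzer's suggestion can be sketched as follows: the script emits annotation lines into a dedicated log while it runs, and that log is parsed afterwards. This is a hypothetical illustration; `annotate`, `run_step`, `parse_log`, the log format, and the step and file names are assumptions, not part of YesWorkflow or the HPCBio pipelines.

```python
import io


def annotate(log, tag, name, value=None):
    """Write one YW-style annotation line, optionally with the actual runtime file."""
    suffix = f" @URI file:{value}" if value is not None else ""
    log.write(f"@{tag} {name}{suffix}\n")


def run_step(log, step, inputs, outputs):
    """Pretend to run one workflow step, logging annotations as it executes."""
    annotate(log, "BEGIN", step)
    for name, path in inputs.items():
        annotate(log, "IN", name, path)
    # ... the real work of the step would happen here ...
    for name, path in outputs.items():
        annotate(log, "OUT", name, path)
    annotate(log, "END", step)


def parse_log(text):
    """Recover {step: {"IN": {name: uri}, "OUT": {name: uri}}} from the log."""
    steps, current = {}, None
    for line in text.splitlines():
        fields = line.split()
        tag, name = fields[0].lstrip("@"), fields[1]
        if tag == "BEGIN":
            current = steps.setdefault(name, {"IN": {}, "OUT": {}})
        elif tag == "END":
            current = None
        elif current is not None:
            current[tag][name] = fields[3] if len(fields) > 3 else None
    return steps


log = io.StringIO()  # stands in for the "special log" file
run_step(log, "align_reads", {"reads": "sample1.fastq"}, {"alignment": "sample1.bam"})
run_step(log, "call_variants", {"alignment": "sample1.bam"}, {"variants": "sample1.vcf"})
model = parse_log(log.getvalue())
```

Since the annotations are printed at run time, the recovered model reflects the workflow configuration that was actually invoked, including the concrete file names.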
  • 73. Conclusions
      •  Provenance
          –  … in databases
          –  … in scientific workflows
      •  Scripts are (often) workflows too!
      •  => Need to support provenance management for scripts and scientific workflows!
      •  One size might not fit all …
          –  Use prospective, retrospective (recorded, reconstructed) provenance
      •  Facilitate “insider” (or “deep”) provenance
          –  … the stuff scientists need to get their job done!
  • 74. Deep Provenance to get the science done!
      •  When reconstructing the past climate, need to know which tree-ring source was used!
      [Figure: map of Arizona, Colorado, New Mexico, and Utah with sites CRTZ, MVNP, ESPN, LANL. Legend: Douglas fir; Pinyon and juniper; Spruce, pine, and true fir; GHCN stations]
      K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
  • 75. Conclusions (Cont’d)
      •  YesWorkflow: Go where the users are!
          –  … they already capture provenance through metadata!
      •  Beware your level of provenance abstraction
          –  Let the user provide a workflow model easily!
      •  YW-Recon:
          –  … finishing support for retrospective provenance without using a runtime provenance recorder!
          –  Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
      •  Future Work:
          –  Build systems that work with the existing workflow of scientists!
          –  There are many research questions & opportunities out there!
              •  e.g.: Why-Not provenance for scientific workflows anyone?
  • 76. References …
  • 77. References (cont’d)