YesWorkflow: More Provenance Mileage from Scientific Workflows and Scripts. Keynote at WORKS 2015: Workshop on
Workflows in Support of Large-Scale Science. Sunday, Nov. 15, 2015, Austin, Texas.
1. YesWorkflow: More Provenance Mileage from Scientific Workflows and Scripts!
Bertram Ludäscher
Director, Center for Informatics Research in Science and Scholarship (CIRSS)
Professor, Graduate School of Library and Information Science (GSLIS)
Faculty affiliate, NCSA & Department of Computer Science
2. Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• … vs Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
3. Provenance Palooza
• Provenance
– … or provenience?
• Chain of custody
• Lineage
• Pedigree
• Genealogy
• Phylogeny
• History
• Origin
5. Provenance as we all know it
• Oxford English Dictionary:
– coming from some particular source or quarter; origin,
derivation
– the history or pedigree of a work of art, manuscript, rare
book, etc.
– concretely, a record of the passage of an item through
its various owners (“chain of custody”)
• Merriam-Webster:
– prov·e·nance noun ˈpräv-nən(t)s, ˈprä-və-ˌnän(t)s
– the origin or source of something
• Origin:
– French, from provenir to come forth, originate, from Latin
provenire, from pro- forth + venire to come
8. Natural History: Understanding what happened…
Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on drawing by Truth-seeker2004)
10. Archivists
Society of American Archivists
http://www2.archivists.org/glossary/terms/p/provenance
• Principle of provenance (respect des fonds)
• Keep records of different origins separate to preserve context
11. So what is “provenance” (sensu W3C)?
• Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
• Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
• Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world
12. Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
13. Provenance => Transparency
• = “Externally-facing” provenance
– “Them-Provenance”
• Later: “Internally-facing” provenance
– “Me-Provenance”
15. Tracing the sources (data, code)
16. From “Climate Gate” to Reproducible Science
17. Data & Provenance Management: Single Model
18. Data & Provenance Management: Model Chains
19. Some things people do with “provenance”
• Result validation
• Result debugging (science vs wf logic)
• Reproducibility and Repeatability
• Explanation (derivations, traces, proof trees)
• Runtime monitoring
– Profiling, benchmarking
• Performance Optimization (“smart re-run”)
• Fault-tolerance, crash-recovery
• Database view maintenance (e.g. data warehousing)
• …
20. Provenance for Virtual Joint Experiments
• How do we ensure that Charlie gets a complete account of the history of WC's outputs?
• How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob's data v?
=> traces TA and TB will be critical
=> need to compose them to obtain TC
We can view the composition WC as a new, virtual workflow
[Figure: Alice (1) develops WA and (2) runs it (RA); Bob (3) develops WB and (5) runs it (RB); (4) data sharing connects the two; Charlie (6) inspects provenance and (7) understands and generates WC := WA, WS, WB, with traces TA and TB.]
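A minimal sketch of what "compose TA and TB to obtain TC" could look like, assuming each trace is just a set of "derived from" edges. The item names (x, z, u, v) and the explicit sharing edge are hypothetical stand-ins, not the exact data shown in the figure.

```python
# Each trace maps a data item to the set of items it was directly derived from.
# Item names are hypothetical stand-ins for the data in the figure.
TA = {"z": {"x"}}        # Alice's run RA: z derived from x
SHARED = {"u": {"z"}}    # data sharing: Bob's input u is Alice's output z
TB = {"v": {"u"}}        # Bob's run RB: v derived from u


def compose(*traces):
    """Union the dependency edges of several traces into one trace."""
    combined = {}
    for trace in traces:
        for item, sources in trace.items():
            combined.setdefault(item, set()).update(sources)
    return combined


def lineage(trace, item):
    """Return everything `item` transitively depends on."""
    seen, todo = set(), [item]
    while todo:
        for source in trace.get(todo.pop(), ()):
            if source not in seen:
                seen.add(source)
                todo.append(source)
    return seen


TC = compose(TA, SHARED, TB)
print(sorted(lineage(TC, "v")))  # Charlie's view of v reaches back to Alice's x
```

With TB alone, the lineage of v stops at u; only the composed trace lets Charlie see (and credit) Alice's contribution.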
21. Open Provenance Model => W3C Prov
22. W3C Prov: One size fits all?
23. Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
24. Types of Data Provenance
• Black-box
– know (next to) nothing at compile-time
– at runtime: keep some data lineage
– most prov sensu WF work use this
• White-box
– statically (compile-time) analyzable
– q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
– Most prov sensu DB work use this
• Grey-box
– can “look inside” (some black boxes)
– … e.g. b/c they have subworkflows
– … or FP signatures: A :: t1, t2 → t3, t4
– … or semantic annotations (sem. types)
[Figure: boxes f (black-box), A (grey-box, with inputs t1, t2 and outputs t3, t4), and q (white-box, with X1, X2 and Y1, Y2).]
25. Provenance in Databases
Source: Val Tannen
26. Provenance in Databases
Source: Val Tannen
27. Provenance in Databases
Source: Val Tannen
28. Abstracting the structure of querying
Source: Val Tannen
In database provenance, tuples are either combined conjunctively (*) or disjunctively (+)
=> That’s the core model!
29. Provenance Polynomials
One Semiring to Rule them all! (DB theory strikes!)
Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007
Unifying most prior work in a simple model!
30. Example: Go from X to Y in 3 hops! (e.g., a = CS, b = NCSA, c = GSLIS)
• Database: hop(X,Y) :=
hop(a,a, p). hop(a,b, q). hop(b,a, r). hop(b,c, s).
• Query: 3hop(X,Y) :- hop(X, Z1), hop(Z1, Z2), hop(Z2,Y).
Note: Cannot go from c to a in 3 hops!
[Figure: input graph with edges a→a (p), a→b (q), b→a (r), b→c (s); 3hop result graph with edges a→a: ppp+pqr+qrp, a→b: ppq+qrq, a→c: pqs, b→a: ppr+qrr, b→b: rpq, b→c: rqs]
3hop(a,a, p^3+2pqr). 3hop(a,b, p^2q+q^2r). … 3hop(a,c, pqs).
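The annotations of the 3hop example can be recomputed mechanically. The sketch below assumes the four annotated hop facts from the slide and treats a polynomial as a bag of monomials: joining tuples multiplies their tokens, and alternative derivations add up. The representation and the function names are illustrative choices, not part of any particular system.

```python
from collections import Counter
from itertools import product

# Annotated facts from the slide: hop(source, target) with a provenance token.
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}


def three_hop(hop):
    """Evaluate 3hop(X,Y) :- hop(X,Z1), hop(Z1,Z2), hop(Z2,Y) with provenance.

    Returns {(x, y): polynomial}; a polynomial is a Counter that maps a
    monomial (sorted tuple of tokens) to its integer coefficient.
    """
    result = {}
    for e1, e2, e3 in product(hop.items(), repeat=3):
        (x, z1), t1 = e1
        (z1b, z2), t2 = e2
        (z2b, y), t3 = e3
        if z1 == z1b and z2 == z2b:                  # join condition of the rule body
            monomial = tuple(sorted((t1, t2, t3)))   # "*": tuples used jointly
            result.setdefault((x, y), Counter())[monomial] += 1  # "+": alternatives
    return result


def show(poly):
    """Render a polynomial such as p^3 + 2pqr."""
    terms = []
    for monomial, coeff in sorted(poly.items()):
        powers = sorted(Counter(monomial).items())
        body = "".join(t if n == 1 else f"{t}^{n}" for t, n in powers)
        terms.append(body if coeff == 1 else f"{coeff}{body}")
    return " + ".join(terms)


POLY = three_hop(HOP)
for pair in sorted(POLY):
    print(pair, show(POLY[pair]))
```

No entry is produced for (c, a), which matches the note that c cannot reach a in three hops.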
31. Provenance Polynomials
“My precious!”
[Figure: variants of the polynomial for 3hop(a,a): p^3 + 2pqr; p^3 + pqr; p + 2pqr; p + pqr; pqr; p + pqr; p]
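One way to read the variants of p^3 + 2pqr in the figure is as successively coarser views of the same polynomial. The sketch below is an interpretation, not taken from the slide: it assumes the variants arise from dropping coefficients, dropping exponents, dropping both, keeping only minimal witnesses, or keeping only the set of contributing tokens. The function names are illustrative.

```python
from collections import Counter

# p^3 + 2pqr as {monomial: coefficient}; a monomial is a sorted tuple of tokens.
POLY = Counter({("p", "p", "p"): 1, ("p", "q", "r"): 2})


def drop_coefficients(poly):
    """p^3 + 2pqr -> p^3 + pqr: forget how many derivations use a monomial."""
    return Counter({monomial: 1 for monomial in poly})


def drop_exponents(poly):
    """p^3 + 2pqr -> p + 2pqr: forget how often a token is used in a derivation."""
    out = Counter()
    for monomial, coeff in poly.items():
        out[tuple(sorted(set(monomial)))] += coeff
    return out


def witnesses(poly):
    """p^3 + 2pqr -> p + pqr: one set of tokens per distinct derivation."""
    return {frozenset(monomial) for monomial in poly}


def minimal_witnesses(poly):
    """p + pqr -> p: absorb any witness that strictly contains a smaller one."""
    all_witnesses = witnesses(poly)
    return {w for w in all_witnesses if not any(v < w for v in all_witnesses)}


def lineage(poly):
    """p^3 + 2pqr -> {p, q, r}: every token that contributes at all."""
    return frozenset(token for monomial in poly for token in monomial)
```

Each function only discards information, so every coarser view can be computed from the full polynomial but not the other way around.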
33. Negation & Why-Not Provenance
• Provenance Semirings work well for:
– Positive Queries (e.g., RA+)
• Challenges: Handling of
– set difference (~ negation)
– Why-not provenance
– Missing Answer provenance
• A fresh look at provenance!
• … using an old idea: Game semantics!
34. Provenance (or Query Evaluation) Games
“SLD-resolution game”
A(X) :– B(X,Y,Z) … not C(X,Y) …
Eureka!
[KLZ13] Köhler, S., Ludäscher, B., & Zinn, D. (2013). First-order provenance games. In Search of Elegance in the Theory and Practice of Computation. Springer
35. Translation: Q(I) => G_Q(I)
[Figure: (a) Game template for Q_ABC: A(X) :- B(X, Y), ¬C(Y). (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}.]
Source [KLZ13]
36. Solve G_Q(I) => Provenance!
[Figure: (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}.]
(c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not).
Figure 3: Provenance game for Q. The well-founded model of …
Source [KLZ13]
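Solving such a game can be sketched as a fixpoint computation: a position with no moves is lost for the player who must move, a position with a move to a lost position is won, and a position whose moves all lead to won positions is lost. The move structure and position names below are a simplified reading of the figure for A(X) :- B(X, Y), ¬C(Y) on I = {B(a, b), B(b, a), C(a)}. They are assumptions for illustration, not the exact construction of [KLZ13].

```python
from itertools import product

DOMAIN = ["a", "b"]
B_FACTS = {("a", "b"), ("b", "a")}   # B(a,b), B(b,a)
C_FACTS = {"a"}                      # C(a)


def build_game():
    """Return {position: [successors]} for A(X) :- B(X,Y), not C(Y)."""
    moves = {}

    def add(src, dst=None):
        moves.setdefault(src, [])
        if dst is not None:
            moves[src].append(dst)
            moves.setdefault(dst, [])

    for x, y in product(DOMAIN, repeat=2):
        add(f"A({x})", f"r({x},{y})")          # claim A(x) via a rule instance
        add(f"r({x},{y})", f"g1({x},{y})")     # opponent doubts goal B(x,y)
        add(f"r({x},{y})", f"g2({y})")         # opponent doubts goal not C(y)
        add(f"g1({x},{y})", f"notB({x},{y})")
        add(f"notB({x},{y})", f"B({x},{y})")
        if (x, y) in B_FACTS:
            add(f"B({x},{y})", f"rB({x},{y})")  # the fact itself is the evidence
    for y in DOMAIN:
        add(f"g2({y})", f"C({y})")
        if y in C_FACTS:
            add(f"C({y})", f"rC({y})")
    return moves


def solve(moves):
    """Label every position won, lost, or drawn by iterating to a fixpoint."""
    status, changed = {}, True
    while changed:
        changed = False
        for position, successors in moves.items():
            if position in status:
                continue
            if any(status.get(s) == "lost" for s in successors):
                status[position] = "won"
                changed = True
            elif all(status.get(s) == "won" for s in successors):
                status[position] = "lost"   # also covers positions without moves
                changed = True
    return {position: status.get(position, "drawn") for position in moves}


def good_moves(moves, status):
    """Keep every move except won -> won ones (no drawn positions occur here)."""
    return {(u, v) for u, successors in moves.items() for v in successors
            if not (status[u] == "won" and status[v] == "won")}


GAME = build_game()
STATUS = solve(GAME)
PROVENANCE = good_moves(GAME, STATUS)
print(STATUS["A(a)"], STATUS["A(b)"])
```

In this sketch A(a) comes out won and A(b) lost, in line with the caption: the good moves below A(a) explain why it holds, and those below A(b) explain why not.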
37. Provenance ~ Query Evaluation Game
(a) input I ... [Figure: graph with edges a→a (p), a→b (q), b→a (r), b→c (s)]
(b) ... annotated.
hop
a a p
a b q
b a r
b c s
(c) 3hop with provenance.
3hop
a a p^3 + 2pqr
a b p^2q + q^2r
a c pqs
b a p^2r + qr^2
b b pqr
b c qrs
(d) The game provenance of 3hop(a, a) ... [Figure: game graph]
(e) ... is p^3 + 2pqr. [Figure: expression tree over +, ×, p, q, r]
Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries!
Source [KLZ13]
38. Provenance ~ Query Evaluation Game
… but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!!
Here: not 3hop(c,a) – can’t go back from GSLIS to CS
Figure 2: Why-not provenance for 3hop(c, a) using provenance games.
… g_1^i in the body of r_1, thus claiming that g_1^i is false and hence that the r_1 instance doesn’t derive t. The first player can counter and demonstrate that g_1^i is true by selecting a rule instance or fact as evidence for g_1^i. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it …
Source [KLZ13]
39. Database Provenance: Summary
• Fine-grained “white-box” provenance
• Solved (pretty much) for positive queries
• … not so much for negation and “Why-Not”
– Active area of research!
• Some research prototypes …
• … and some real-world implementations!
• Note: Those in need of provenance often already “do it”!!
– Crash recovery, auditing, concurrency control, …
40. Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
41. Scientific Workflows: ASAP!
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
=> traceable data- and wf-evolution
=> Reproducible Science
Trident Workbench, VisTrails
Once upon a time …
42. Phylogenetics workflow in Kepler (2005)
Graphical interface
§ Canvas for assembling
and displaying the
workflow.
§ Library of workflow
blocks (‘actors’) that can
be dragged onto the
canvas and connected.
§ Arrows that represent
control dependencies or
paths of data flow.
§ A run button.
These features are not essential to managing actual scientific workflows.
What some of us think of when we hear the term ‘scientific workflows’
Source: Tim McPhillips
43. 10 Key Functions of a Sci-WFS
1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently
– in parallel where possible.
3. Manage dataflow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows
easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what actually happens during workflow execution.
7. Reveal retrospective provenance – how workflow products were
derived from inputs via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and
services themselves.
These functions, not actors, distinguish scientific workflow
automation from general scientific software development.
Tim McPhillips et al.
44. Yes, scripts are (can be) workflows too!
Interactive Visualization
45. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
46. … HPCBio Workflows @ Illinois
National Petascale Computing Facility
Broad Institute: Recommended workflow for variant analysis
Liudmila Mainzer, Victor Jongeneel
HPC Bio @ Illinois
Quickly, say: #!/bin/bash
47. It’s time to shift control …
• … back from being consumers of someone else’s (= our) tools ..
– “Just click here!”
• ... to tool makers!
– Scientists who author workflows as scripts!
• Go where the wild things (users!) are …
– Yes, develop for “end users” …
– … but don’t forget the tool makers!
• Can we do this together?
48. Example: AutoDrug Workflow
[Figure: workflow steps Mount Sample, Screen Sample, Align Sample, Expose Sample, Analyze Images, Check Criteria, Calculate Strategy, Collect Data Set, Calculate Maps, List Peaks, Run Search, Refine Structure, Integrate Images, Scale Reflections, Merge Reflections, Calc Amplitudes; stages Collect Data, Process Data, Solve Structure, Analyze Density; tools Blu-Ice, LABELIT, molrep, refmac, ipmosflm, xds, pointless, scala, xtriage, truncate, rfree]
Tsai, Y., McPhillips, S. E., González, A., McPhillips, T. M., Zinn, D., Cohen, A. E., ... & Soltis, S. (2013). AutoDrug: fully automated macromolecular crystallography workflows for fragment-based drug discovery. Acta Crystallographica Section D: Biological Crystallography, 69(5), 796-803.
49. 3D Protein Structure Determination by X-ray Crystallography
Diffraction images
Experimental electron density and protein model
Full protein structure
Source: Tim McPhillips
50. Automated Sample Handling
[Figure labels: Crystal in loop; Sample mounting robot; Cassette shipping dewar; Crystal mounting pin; Sample cassette]
Alice, the high-throughput crystallographer: When the first shift of her beam time begins, technicians at the beam line load the three cassettes into a liquid nitrogen dewar within reach of the sample-mounting robot and close the radiation door. From this point Alice is able to control beam line operations remotely.
Source: Tim McPhillips
51. Remote beam line operation
Source: Tim McPhillips
52. Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more with Provenance!
– … sometimes using less (e.g., no provenance recorder)
54. Enter: YesWorkflow! (yesworkflow.org)
• YesWorkflow (YW)
– Grass-roots effort
– … meeting the scientists/users where they R!
• R, Matlab, (i)Python, Jupyter, …
– Scripts + simple user annotations
• => Reveal the workflow model/abstraction … that underlies the (script) implementation
• => YW can give us more of ASAP!
– First YW: ASAP (Abstraction)...
– Then YW-recon: ASAP (reconstructing runtime Provenance)
55. Related Work, other Approaches
… to bring workflow/provenance benefits to scripts:
• Runtime Provenance Recorders:
– use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables
• file read/write, function calls, program variables & state, …
– noWorkflow system
• [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]
• exploit Python profiling library to capture runtime provenance
=> helps with "S" and "P"
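To make "capture runtime observables" concrete, here is a minimal sketch of a recorder built on Python's standard profiling hook, `sys.setprofile`. It only logs the names of Python functions as they are called. It is an illustration of the general idea, not how noWorkflow is implemented, and the three pipeline functions are hypothetical.

```python
import sys


def record_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions called)."""
    calls = []

    def profiler(frame, event, arg):
        if event == "call":              # ignore returns and C-level calls
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)             # always switch the recorder off again
    return result, calls


# Hypothetical three-step script used only to exercise the recorder.
def load():
    return [3, 1, 2]


def clean(values):
    return sorted(values)


def pipeline():
    return clean(load())


RESULT, CALLS = record_calls(pipeline)
print(RESULT, CALLS)
```

A real recorder would also capture arguments, return values, and file accesses; the point here is only that the script itself needs no changes.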
56. YW (prospective) and YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy :-)
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model <=> Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
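Step 1 can be illustrated with a toy extractor for the annotation tags named above. This is a sketch under stated assumptions, not the YesWorkflow tool or its actual grammar: it assumes each tag is followed by a single name, that @END repeats the block name, and that a @URI tag refers to the port declared just before it on the same line. The annotated script is hypothetical and only reuses block and data names from the talk's example.

```python
import re

# Hypothetical annotated script; only the comments matter to the extractor.
SCRIPT = """
# @BEGIN collect_data_set
# @IN accepted_sample
# @OUT raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
# @END collect_data_set
# @BEGIN transform_images
# @IN raw_image
# @IN calibration_image @URI file:calibration.img
# @OUT corrected_image
# @END transform_images
"""

TAG = re.compile(r"@(BEGIN|END|IN|OUT|URI)\s+(\S+)", re.IGNORECASE)


def parse_annotations(text):
    """Return {block: {"in": [...], "out": [...], "uri": {port: template}}}."""
    blocks, stack = {}, []
    for line in text.splitlines():
        last_port = None
        for tag, value in TAG.findall(line):
            tag = tag.upper()
            if tag == "BEGIN":
                blocks[value] = {"in": [], "out": [], "uri": {}}
                stack.append(value)
            elif tag == "END" and stack:
                stack.pop()
            elif tag in ("IN", "OUT") and stack:
                blocks[stack[-1]][tag.lower()].append(value)
                last_port = value
            elif tag == "URI" and stack and last_port:
                blocks[stack[-1]]["uri"][last_port] = value
    return blocks


def dataflow_edges(blocks):
    """List (producer, data, consumer) triples: the prospective workflow graph."""
    return sorted((producer, data, consumer)
                  for producer, p in blocks.items() for data in p["out"]
                  for consumer, c in blocks.items() if data in c["in"])


MODEL = parse_annotations(SCRIPT)
print(dataflow_edges(MODEL))
```

The resulting triples are enough to draw a process view, a data view, or a combined view of the script without running it.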
57. YW annotations: Model your Workflow!
58. YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
cassette_id
sample_score_cutoff
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_id energy frame_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensity pixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
YW!
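The file: templates in the graph above carry metadata in folder and file names, which is what makes reconstruction by "ls -R" plus regex matching possible. The sketch below turns a URI template into a regular expression with one named group per variable and matches it against a list of paths. The paths, identifiers such as q55 and DRT240, and the function names are hypothetical; this is not the YW-Recon implementation.

```python
import re


def template_to_regex(template):
    """Compile 'run/{x}/image_{y}.raw' into a regex with named groups."""
    if template.startswith("file:"):
        template = template[len("file:"):]
    pieces = re.split(r"\{(\w+)\}", template)   # literal, name, literal, ...
    pattern, seen = "", set()
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            pattern += re.escape(piece)          # literal text of the template
        elif piece in seen:
            pattern += f"(?P={piece})"           # repeated variable: same value
        else:
            seen.add(piece)
            pattern += f"(?P<{piece}>[^/]+?)"    # a variable never spans a '/'
    return re.compile(pattern)


def match_paths(template, paths):
    """Return the variable bindings of every path that fits the template."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in map(regex.fullmatch, paths) if m]


RAW = "file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED = "file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"

# Hypothetical listing, as "ls -R" might produce after a run.
PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e10000/image_002.raw",
    "run/raw/q55/DRT322/e11000/image_001.raw",
    "run/run_log.txt",
    "data/DRT240/DRT240_10000eV_001.img",
]

RAW_IMAGES = match_paths(RAW, PATHS)
CORRECTED_IMAGES = match_paths(CORRECTED, PATHS)
print(RAW_IMAGES)
```

Joining the two result lists on sample_id, energy, and frame_number links each corrected image to the raw image it came from, without any runtime recorder.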
59. Voila! The Workflow revealed!
[Figure: the workflow graph recovered from the annotated script, with the same node and file labels as on slide 58]
60. Get 3 views for the price of 1!
Process view
Data view
Combined view
64. YW (prospective) and YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy :-)
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model <=> Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
72. Taking YW for a spin …
• “To document on-the-fly, specifically for a given workflow configuration invoked:
– do not insert annotations into code,
– but rather have code print annotations into a special log during execution,
– then parse that log!” – Liudmila Mainzer
Source: L Mainzer, V Jongeneel (IGB & NCSA)
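The suggestion quoted above can be sketched as follows: each step writes its own annotation lines to a log when it actually runs, so the log documents the configuration that was invoked rather than every branch in the code. The step names, the `recalibrate` option, and the log format are hypothetical.

```python
import io


def run_workflow(config, log):
    """Hypothetical script: a step prints its annotations only if it executes."""
    log.write("@BEGIN align\n@IN reads\n@OUT alignments\n@END align\n")
    if config.get("recalibrate"):
        log.write("@BEGIN recalibrate\n@IN alignments\n@OUT alignments\n@END recalibrate\n")
    log.write("@BEGIN call_variants\n@IN alignments\n@OUT variants\n@END call_variants\n")


def executed_blocks(log_text):
    """Parse the log and list the blocks that ran, in execution order."""
    return [line.split()[1] for line in log_text.splitlines()
            if line.startswith("@BEGIN")]


LOG = io.StringIO()
run_workflow({"recalibrate": False}, LOG)
print(executed_blocks(LOG.getvalue()))
```

Running the same script with `recalibrate` switched on yields a different log, and therefore a different workflow model, from unchanged code.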
73. Conclusions
• Provenance
– … in databases
– … in scientific workflows
• Scripts are (often) workflows too!
• => Need to support provenance management for scripts and scientific workflows!
• One size might not fit all …
– Use prospective, retrospective (recorded, reconstructed provenance)
• Facilitate “insider” (or “deep”) provenance
– … the stuff scientists need to get their job done!
74. Deep Provenance to get the science done!
• When reconstructing the past climate, need to know which tree-ring source was used!
[Figure: map with labels CRTZ, MVNP, ESPN, LANL; Arizona, Colorado, New Mexico, Utah; legend: Douglas fir; Pinyon and juniper; Spruce, pine, and true fir; GHCN stations]
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
75. Conclusions (Cont’d)
• YesWorkflow: Go where the users are!
– … they already capture provenance through metadata!
• Beware your level of provenance abstraction
– Let the user provide a workflow model easily!
• YW-Recon:
– … finishing support for retrospective provenance without using a runtime provenance recorder!
– Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
• Future Work:
– Build systems that work with the existing workflow of scientists!
– There are many research questions & opportunities out there!
• e.g.: Why-Not provenance for scientific workflows anyone?
76. References
…