The document discusses various aspects of ensuring reproducibility in scientific research through provenance. It begins by providing an overview of the data lifecycle and challenges to reproducibility as experiments and components evolve. It then discusses different levels of reproducibility (rerun, repeat, replicate, reproduce) and approaches to analyzing differences in workflow provenance traces to understand how changes impact results. The remainder of the document describes specific systems and tools developed by the author and collaborators that use provenance to improve reproducibility, including data packaging with Research Objects, provenance recording and analysis workflows with YesWorkflow, process virtualization using TOSCA, and provenance differencing with Pdiff.
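As a small illustration of the PROV-style provenance that underpins the tools summarised above, here is a minimal sketch using the Python `prov` package; the entity and activity names (raw_data, analysis_run, result) are hypothetical placeholders, not identifiers from the systems described in the deck.

```python
from prov.model import ProvDocument

# A tiny W3C PROV document describing a single analysis step.
doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

doc.entity('ex:raw_data')                           # input dataset
doc.activity('ex:analysis_run')                     # the processing step
doc.entity('ex:result')                             # derived output

doc.used('ex:analysis_run', 'ex:raw_data')          # the run consumed the input
doc.wasGeneratedBy('ex:result', 'ex:analysis_run')  # the output came from the run
doc.wasDerivedFrom('ex:result', 'ex:raw_data')      # direct data lineage

print(doc.get_provn())                              # human-readable PROV-N serialization
```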
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud (Paolo Missier)
Another Cloud-e-Genome dissemination opportunity: porting an existing WES/WGS pipeline from HPC to a (public) cloud, while achieving more flexibility and better abstraction, and with better performance than the equivalent HPC deployment.
Data Trajectories: tracking the reuse of published data for transitive credit... (Paolo Missier)
This document discusses tracking the reuse of published research data through transformations in order to attribute credit. It presents a hypothetical scenario of data being reused by multiple researchers. The reuse events can be modeled as a provenance graph compliant with the W3C PROV standard. Rules for inductively assigning and propagating credit through the graph are defined. Challenges in building the provenance graph in practice are discussed, as autonomous systems may incompletely or inconsistently report reuse events. Addressing these challenges is framed as an important research agenda.
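To make the idea of rule-based credit propagation concrete, here is a minimal, hypothetical sketch (not the paper's actual rules): each derivation edge in a PROV-style reuse graph passes a fixed fraction of credit back towards the original dataset.

```python
import networkx as nx

# Hypothetical reuse graph: an edge (a, b) means "a was derived from b".
G = nx.DiGraph()
G.add_edges_from([
    ("paper_X_results", "cleaned_data"),
    ("cleaned_data", "original_dataset"),
    ("paper_Y_results", "original_dataset"),
])

def propagate_credit(graph, seeds, decay=0.5):
    """Assign credit to the seed artefacts, then push a decayed share upstream
    along derivation edges (an illustrative rule, not the paper's)."""
    credit = {n: 0.0 for n in graph.nodes}
    for node, amount in seeds.items():
        credit[node] += amount
    # Topological order ensures every derived artefact is handled before its sources.
    for node in nx.topological_sort(graph):
        for source in graph.successors(node):
            credit[source] += decay * credit[node]
    return credit

print(propagate_credit(G, {"paper_X_results": 1.0, "paper_Y_results": 1.0}))
```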
Your data won’t stay smart forever: exploring the temporal dimension of (big) data... (Paolo Missier)
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership" of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but it is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and to assess the cost and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by giving a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, where we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
This document summarizes a project kickoff meeting for the ReComp project. The objectives of the ReComp project are to 1) investigate analytics techniques for supporting re-computation decisions, 2) research methods for assessing when re-computing an analytical process is feasible, and 3) create a decision support system to selectively recompute complex analytics processes. The expected outcomes are algorithms and a software framework to help determine when and how to recompute analyses when data or models change over time. The document outlines several challenges for the project, including estimating the impact of changes, managing different types of metadata, assessing reproducibility, and making the solutions reusable across different application cases.
Introduction to Data streaming - 05/12/2014 (Raja Chiky)
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
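One of the basic approximate techniques typically covered in such an introduction is reservoir sampling, which keeps a fixed-size uniform sample of an unbounded stream. A minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k over a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each item seen so far survives in the sample with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```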
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink (Flink Forward)
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries and an execution framework.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection and recommendations.
3) The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations in accuracy on some datasets and in speed on others (the Hoeffding bound that drives its split decisions is sketched below).
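Split decisions in Hoeffding-tree learners, including the Vertical Hoeffding Tree, rest on the Hoeffding bound. A small sketch of that test, independent of any SAMOA API:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, value_range=1.0, delta=1e-7, n=1000):
    """Split when the best attribute's advantage exceeds the bound."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)

print(should_split(gain_best=0.25, gain_second=0.10))   # True: 0.15 > ~0.09 for n=1000
```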
Artificial intelligence and data stream mining (Albert Bifet)
Big Data and Artificial Intelligence have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from data streams has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of artificial intelligence research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, industrial applications, open source tools, and current challenges of data stream mining.
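As a concrete illustration of handling concept drift, here is a minimal windowed drift check, an illustrative heuristic rather than a specific published detector such as ADWIN: it flags drift when the recent error rate departs from a reference error rate by more than a Hoeffding-style margin.

```python
import math
from collections import deque

class SimpleDriftDetector:
    """Compare the error rate of the most recent window against a reference window."""
    def __init__(self, window=200, delta=0.002):
        self.ref = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.delta = delta

    def add(self, error):            # error is 0 (correct) or 1 (mistake)
        if len(self.ref) < self.ref.maxlen:
            self.ref.append(error)
        else:
            self.recent.append(error)

    def drift(self):
        if len(self.recent) < self.recent.maxlen:
            return False
        n = len(self.recent)
        eps = math.sqrt(math.log(2.0 / self.delta) / (2.0 * n))
        return abs(sum(self.recent) / n - sum(self.ref) / n) > eps

detector = SimpleDriftDetector()
for e in [0] * 400 + [1] * 200:      # the concept changes and the model starts to err
    detector.add(e)
print(detector.drift())              # True once the recent window fills with errors
```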
Mining big data streams with APACHE SAMOA by Albert Bifet (J On The Beach)
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also with several other distributed stream processing engines such as Storm and Samza.
This document discusses multi-dimensional database modeling and big data research challenges. It begins with an overview of business intelligence and data warehousing systems. It then discusses OLAP cube design, query languages, and decision support system benchmarks. Recent experiences with adapting benchmarks like TPC-H and TPC-DS to the multi-dimensional model are summarized. Finally, several challenging research problems are outlined, including big data integration, flexible schema modeling, and scaling systems for real-time OLAP and advanced visualization.
Fast Perceptron Decision Tree Learning from Evolving Data Streams (Albert Bifet)
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
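The RAM-Hours metric mentioned above is simple to compute: one GB of RAM deployed for one hour equals one RAM-Hour. A quick sketch (the figures are made up for illustration):

```python
def ram_hours(memory_gb, runtime_hours):
    """One GB of RAM used for one hour = 1 RAM-Hour."""
    return memory_gb * runtime_hours

# Hypothetical comparison of two stream classifiers over the same benchmark run.
plain_hoeffding_tree = ram_hours(memory_gb=0.5, runtime_hours=2.0)   # 1.0 RAM-Hours
hybrid_nb_perceptron = ram_hours(memory_gb=0.8, runtime_hours=1.5)   # 1.2 RAM-Hours
print(plain_hoeffding_tree, hybrid_nb_perceptron)
```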
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
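Among the synopsis structures listed above, the count-min sketch is one of the simplest to show in code: it answers approximate frequency queries over a stream using a small, fixed amount of memory. A minimal sketch:

```python
import hashlib

class CountMinSketch:
    """Approximate item counts for a stream; may overestimate, never underestimates."""
    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for reading in ["sensor_a"] * 42 + ["sensor_b"] * 7:
    cms.add(reading)
print(cms.estimate("sensor_a"), cms.estimate("sensor_b"))   # ~42, ~7
```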
Mining Big Data Streams with APACHE SAMOA (Albert Bifet)
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also with several other distributed stream processing engines such as Storm and Samza.
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture we overview the mining of data streams.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Analytics of analytics pipelines: from optimising re-execution to general Dat... (Paolo Missier)
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
Within this tutorial we present the results of recent research about the cloud enablement of data streaming systems. We illustrate, based on both industrial as well as academic prototypes, new emerging use cases and research trends. Specifically, we focus on novel approaches for (1) fault tolerance and (2) scalability in large-scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust and at the same time introduce less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Photon Source (Ian Foster)
The Advanced Photon Source (APS) at Argonne National Laboratory produces intense beams of x-rays for scientific research. Experimental data from the APS is growing dramatically due to improved detectors and a planned upgrade. This is creating data and computation challenges across the entire experimental process. Efforts are underway to accelerate the experimental feedback loop through automated data analysis, optimized data streaming, and computer-steered experiments to minimize data collection. The goal is to enable real-time insights and knowledge-driven experiments.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
This document discusses indexing techniques for scalable record linkage and deduplication. It introduces the problems of record linkage on large datasets that do not fit in memory and addresses corrupted data. Blocking is presented as a common approach, where similar records are grouped into blocks to reduce the number of record pairs that must be compared. The document also discusses research on developing machine learning techniques to automatically learn optimal blocking keys and blocking functions. Evaluation frameworks for record linkage are introduced. The sorted neighborhood method is described in detail, including how it creates keys, sorts data, and merges records to link them.
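A minimal sketch of the sorted neighbourhood method described here, with a made-up blocking key (first letters of surname plus postcode prefix) and a trivial stand-in comparison function:

```python
def blocking_key(record):
    # Hypothetical key: first 3 letters of surname + first 2 characters of postcode.
    return (record["surname"][:3] + record["postcode"][:2]).lower()

def similar(a, b):
    # Trivial stand-in for a real pairwise comparison function.
    return a["surname"] == b["surname"] and a["postcode"] == b["postcode"]

def sorted_neighbourhood(records, window=3):
    """Sort by blocking key, then compare only records within a sliding window."""
    ordered = sorted(records, key=blocking_key)
    matches = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similar(rec, ordered[j]):
                matches.append((rec["id"], ordered[j]["id"]))
    return matches

people = [
    {"id": 1, "surname": "Smith", "postcode": "NE1 7RU"},
    {"id": 2, "surname": "Smyth", "postcode": "NE1 7RU"},
    {"id": 3, "surname": "Smith", "postcode": "NE1 7RU"},
]
print(sorted_neighbourhood(people))   # finds the (1, 3) duplicate pair
```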
This document discusses mining data streams. It describes stream data as continuous, ordered, and fast changing. Traditional databases store finite data sets while stream data may be infinite. The document outlines challenges in mining stream data including processing queries and patterns continuously and with limited memory. It proposes using synopses to approximate answers within a small error range.
This document provides an overview of the Apache Hadoop ecosystem. It discusses key components like HDFS, MapReduce, YARN, Pig Latin, and performance tuning for MapReduce jobs. HDFS is introduced as the distributed file system that provides high throughput and scalability. MapReduce is described as the framework for distributed processing of large datasets across clusters. YARN is presented as an improvement over the static resource allocation in Hadoop 1.x. Pig Latin is demonstrated as a high-level language for expressing data analysis jobs. The document concludes by discussing extensions beyond MapReduce, like iterative processing and indexing approaches.
Efficient Online Evaluation of Big Data Stream Classifiers (Albert Bifet)
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
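For context, the prequential ("test-then-train") setting referred to here can be expressed in a few lines. This is a generic illustration with a trivial majority-class learner, not the evaluation methodology proposed in the paper.

```python
from collections import Counter

class MajorityClass:
    """Trivial incremental learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def learn(self, x, y):
        self.counts[y] += 1

def prequential(stream, model):
    """Test-then-train: predict each instance first, then use it for training."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / total

stream = [({"f": i}, "spam" if i % 3 == 0 else "ham") for i in range(3000)]
print(prequential(stream, MajorityClass()))   # prequential accuracy of the baseline
```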
Sharing massive data analysis: from provenance to linked experiment reports (Alban Gaignard)
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod... (Shalin Hai-Jew)
This document summarizes a presentation on using NVivo 10 software to code and analyze qualitative and mixed methods research data. It introduces NVivo 10 as a data management and analysis tool, demonstrates how to import and code data from various sources, and shows how to visualize and analyze coded data through matrices, models, and queries. The goals are to introduce NVivo 10's capabilities and to demonstrate the process of setting up a project for qualitative or mixed methods research.
Workflow Provenance: From Modelling to Reporting (Rayhan Ferdous)
This document provides an overview of workflow provenance and proposes a programming model and system architecture for collecting and querying workflow provenance data at scale. It begins by defining provenance and its importance for big data analytics. It then classifies different types of provenance queries and proposes a taxonomy. The document outlines a programming model using object-oriented programming and domain-specific languages to automate provenance logging. It proposes parsing logs into a graph database to support fundamental provenance queries and data visualization. Finally, it discusses scaling the system and conducting further research through user studies and query optimization.
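A fundamental provenance query of the kind classified here is lineage retrieval: given a result, find every upstream artefact and task it transitively depends on. A minimal sketch over an in-memory graph; a graph database such as the one proposed would answer the same query at scale, and the node names below are hypothetical.

```python
import networkx as nx

# Toy workflow provenance: edge (a, b) means "b depends on / was derived from a".
P = nx.DiGraph()
P.add_edges_from([
    ("raw_reads", "align_task"), ("reference_genome", "align_task"),
    ("align_task", "aligned_bam"), ("aligned_bam", "call_variants_task"),
    ("call_variants_task", "vcf_output"),
])

def lineage(graph, artefact):
    """All upstream nodes (data and tasks) that the artefact transitively depends on."""
    return nx.ancestors(graph, artefact)

print(lineage(P, "vcf_output"))
```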
The document discusses the increasing scale and complexity of knowledge generation in science domains like astronomy and medicine over recent centuries. It argues that knowledge generation can be viewed as a systems problem involving many actors and processes. The document proposes a service-oriented approach using web services as an integrating framework to address challenges of scale, complexity, and distributed collaboration in e-Science. Key challenges discussed include semantics, documentation, scaling issues, and sociological factors like incentives.
Using Neo4j for exploring the research graph connections made by RD-Switchboard (amiraryani)
In this talk, Jingbo Wang (NCI) and Amir Aryani (ANDS) present the Neo4j queries that can help data managers explore the connections between datasets, researchers, grants, and publications using the graph model and the Research Data Switchboard. In addition, they discuss a paper, "Graph connections made by RD-Switchboard using NCI's metadata", presented at the Reproducible Open Science workshop in Hannover, September 2016.
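For flavour, here is a query in that spirit run through the official Neo4j Python driver; the node labels and relationship types (Dataset, Grant, Publication, RELATED_TO, FUNDED_BY) and the connection details are hypothetical placeholders, not the actual RD-Switchboard schema.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CONNECTIONS = """
MATCH (d:Dataset)-[:RELATED_TO]->(g:Grant)<-[:FUNDED_BY]-(p:Publication)
RETURN d.title AS dataset, g.title AS grant, p.title AS publication
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(CONNECTIONS):
        print(record["dataset"], "--", record["grant"], "--", record["publication"])

driver.close()
```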
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
2016 07 12_purdue_bigdatainomics_seandavis (Sean Davis)
Newer, faster, cheaper molecular assays are driving biomedical research. I discuss the history of biomedical data, including concepts of data sharing, hypothesis-driven vs. hypothesis-generating research, and the potential to expand our thinking on biomedical research to be much more integrated through smart, creative, and open use of technologies and more flexible, longitudinal studies.
The Role of Metadata in Reproducible Computational Research (Jeremy Leipzig)
Reproducible computational research (RCR) provides the keystone to the scientific method, packaging the transformation of raw data to published results in a manner that can be communicated to others. Developing RCR standards has been a growing concern of statisticians, data scientists, and informatics professionals. Metadata provides context and provenance to raw data, and is essential to both discovery and validation in RCR. This presentation will give an overview of emerging metadata standards in data, analysis, pipelines, tools, and publications.
This document describes Jean-Paul Calbimonte's doctoral research on enabling semantic integration of streaming data sources. The research aims to provide semantic query interfaces for streaming data, expose streaming data for the semantic web, and integrate streaming sources through ontology mappings. The approach involves ontology-based data access to streams, a semantic streaming query language, and semantic integration of distributed streams. Work done so far includes defining a language (SPARQLSTR) for querying RDF streams and enabling an engine to support streaming data sources through ontology mappings. Future work involves query optimization and quantitative evaluation.
Keynote speech - Carole Goble - Jisc Digital Festival 2015 (Jisc)
Carole Goble is a professor in the school of computer science at the University of Manchester.
In this keynote, Carole offered her insights into research data management and data centres.
RARE and FAIR Science: Reproducibility and Research Objects (Carole Goble)
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
RDA Fourth Plenary Keynote - Prof. Christine L. Borgman, Professor Presidential Chair in Information Studies at UCLA: "Data, Data, Everywhere, Nor Any Drop to Drink." Tuesday 23rd Sept 2014, Amsterdam, the Netherlands
https://rd-alliance.org/plenary-meetings/fourth-plenary/plenary4-programme.html
We've all heard about how on-demand computing and storage will transform scientific practice. But by focusing on resources alone, we're missing the real benefit of the large-scale outsourcing and consequent economies of scale that cloud is about. The biggest IT challenge facing science today is not volume but complexity. Sure, terabytes demand new storage and computing solutions. But they're cheap. It is establishing and operating the processes required to collect, manage, analyze, share, archive, etc., that data that is taking all of our time and killing creativity. And that's where outsourcing can be transformative. An entrepreneur can run a small business from a coffee shop, outsourcing essentially every business function to a software-as-a-service provider--accounting, payroll, customer relationship management, the works. Why can't a young researcher run a research lab from a coffee shop? For that to happen, we need to make it easy for providers to develop "apps" that encapsulate useful capabilities and for researchers to discover, customize, and apply these "apps" in their work. The effect, I will argue, will be a dramatic acceleration of discovery.
Tools for the Management of Research Data (Heinz Pampel)
Workshop "Wege in die Köpfe" of the DFG project "EWIG - Development of Workflow Components for the Long-Term Archiving of Research Data in the Geosciences" | Berlin, 03.07.2014
Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Streams (Niki Pavlopoulou)
The document proposes a dynamic diverse summarization system for heterogeneous graph streams using embeddings. It aims to provide expressive, non-redundant summaries with high usability while using limited resources in dynamic smart environments. The approach uses word embeddings to create vector representations of triples, DBSCAN clustering to group similar triples, and ranking and selection to choose the top-k diverse triples for the summary in response to a diversity-aware query. The system is evaluated on a real-world dataset against baselines, measuring correctness of the summaries.
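A small sketch of the clustering-then-select step described here, using random vectors as stand-ins for the triple embeddings (the real system derives them from word embeddings) and a simple one-representative-per-cluster rule for the diverse top-k:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 16))          # stand-ins for embedded triples

# Group similar triples so redundant ones fall into the same cluster.
labels = DBSCAN(eps=2.5, min_samples=2).fit(embeddings).labels_

def diverse_top_k(vectors, labels, k=5):
    """Pick at most one representative per cluster, preferring distinct clusters."""
    chosen, seen = [], set()
    for idx in range(len(vectors)):
        if labels[idx] not in seen or labels[idx] == -1:   # -1 = noise, always distinct
            chosen.append(idx)
            seen.add(labels[idx])
        if len(chosen) == k:
            break
    return chosen

print(diverse_top_k(embeddings, labels))
```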
This document discusses research objects (ROs) and their role in reproducible science. It makes three key points:
1. Publications should convince readers of validity through reproducible results, but current systems do not fully facilitate reproducibility. ROs can address this by explicitly representing methods used.
2. Reproducibility reinforces results and is a key factor in scientific discovery. ROs provide a reproducible representation of methods.
3. ROs bundle together essential resources from a computational study, such as data, results, methods, people involved, and annotations for understanding, interpretation, and reuse. They support the full experimental lifecycle from problem definition to publication.
Similar to The lifecycle of reproducible science data and what provenance has got to do with it
Design and Development of a Provenance Capture Platform for Data Science (Paolo Missier)
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records (Paolo Missier)
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Health Records (Paolo Missier)
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
please see paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunities... (Paolo Missier)
A keynote talk given to the IDEAL 2023 conference (Evora, Portugal Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance are in fact in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started exploring the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well-known to the data engineering community, including incremental data cleaning, multi-source integration, or data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... (Paolo Missier)
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) (Paolo Missier)
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview (Paolo Missier)
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on language-based modelling for Electronic Health Records (EHR).
probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (Paolo Missier)
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
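An illustrative, much simplified version of the input/output comparison idea, for a single dataframe transformation; this is a sketch of the principle, not the DPDS implementation.

```python
import pandas as pd

def capture_provenance(df_in, transform, name):
    """Run a dataframe transformation and record which rows and columns were
    dropped or added, as coarse-grained provenance for that step."""
    df_out = transform(df_in)
    return df_out, {
        "activity": name,
        "rows_in": len(df_in), "rows_out": len(df_out),
        "rows_removed": sorted(set(df_in.index) - set(df_out.index)),
        "columns_added": sorted(set(df_out.columns) - set(df_in.columns)),
        "columns_removed": sorted(set(df_in.columns) - set(df_out.columns)),
    }

df = pd.DataFrame({"age": [34, None, 51], "city": ["Leeds", "York", None]})
cleaned, prov = capture_provenance(df, lambda d: d.dropna(), "drop_missing_values")
print(prov)
```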
Tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations (Paolo Missier)
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcare (Paolo Missier)
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (Paolo Missier)
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? Our own trajectory: provenance of data... (Paolo Missier)
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
ReComp: optimising the re-execution of analytics pipelines in response to changes... (Paolo Missier)
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff University (Paolo Missier)
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes (Paolo Missier)
This document discusses efficient re-computation of big data analytics processes when changes occur. It presents the ReComp framework which uses process execution history and provenance to selectively re-execute only the relevant parts of a process that are impacted by changes, rather than fully re-executing the entire process from scratch. This approach estimates the impact of changes using type-specific difference functions and impact estimation functions. It then identifies the minimal subset of process fragments that need to be re-executed based on change impact analysis and provenance traces. The framework is able to efficiently re-compute complex processes like genomics analytics workflows in response to changes in reference databases or other dependencies.
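The overall ReComp loop can be pictured with a small sketch; the fragments, input names, versions, and the difference and impact functions below are deliberately trivial placeholders for the type-specific functions described in the work.

```python
# Hypothetical process: fragments consume named inputs and can be re-run independently.
fragments = {
    "align":    {"inputs": {"reference_genome"}},
    "annotate": {"inputs": {"variant_db"}},
    "report":   {"inputs": {"phenotype_ontology"}},
}

def diff(old, new):
    """Type-specific difference function (placeholder): which inputs changed?"""
    return {name for name in new if old.get(name) != new[name]}

def impacted(changed_inputs):
    """Impact estimation (placeholder): fragments touching any changed input."""
    return {f for f, spec in fragments.items() if spec["inputs"] & changed_inputs}

old_versions = {"reference_genome": "GRCh37", "variant_db": "v104", "phenotype_ontology": "2019-01"}
new_versions = {"reference_genome": "GRCh37", "variant_db": "v110", "phenotype_ontology": "2019-01"}

changed = diff(old_versions, new_versions)
print("re-execute only:", impacted(changed))   # {'annotate'} rather than the whole pipeline
```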
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf (Techgropse Pvt. Ltd.)
In this blog post, we'll delve into the intersection of AI and app development in Saudi Arabia, focusing on the food delivery sector. We'll explore how AI is revolutionizing the way Saudi consumers order food, how restaurants manage their operations, and how delivery partners navigate the bustling streets of cities like Riyadh, Jeddah, and Dammam. Through real-world case studies, we'll showcase how leading Saudi food delivery apps are leveraging AI to redefine convenience, personalization, and efficiency.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx (SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
CAKE: Sharing Slices of Confidential Data on Blockchain (Claudio Di Ciccio)
Presented at the CAiSE 2024 Forum, Intelligent Information Systems, June 6th, Limassol, Cyprus.
Synopsis: Cooperative information systems typically involve various entities in a collaborative process within a distributed environment. Blockchain technology offers a mechanism for automating such processes, even when only partial trust exists among participants. The data stored on the blockchain is replicated across all nodes in the network, ensuring accessibility to all participants. While this aspect facilitates traceability, integrity, and persistence, it poses challenges for adopting public blockchains in enterprise settings due to confidentiality issues. In this paper, we present a software tool named Control Access via Key Encryption (CAKE), designed to ensure data confidentiality in scenarios involving public blockchains. After outlining its core components and functionalities, we showcase the application of CAKE in the context of a real-world cyber-security project within the logistics domain.
Paper: https://doi.org/10.1007/978-3-031-61000-4_16
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
OpenID AuthZEN Interop Read Out - AuthorizationDavid Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
The lifecycle of reproducible science data and what provenance has got to do with it
1. The lifecycle of reproducible science data and what provenance has got to do with it
Paolo Missier
School of Computing Science
Newcastle University, UK
Alan Turing Institute
Symposium On Reproducibility for Data-Intensive Research
Oxford, April 6, 2016
With material contributed by:
Yang Cao, Bertram Ludäscher, Tim McPhillips, Dave Vieglais, Matt Jones and
the DataONE CyberInfrastructure group
Rawaa Qasha at Newcastle University
Carole Goble at the University of Manchester
5. Mapping the reproducibility space
Goal: to help scientists understand the effect of workflow / data / dependency evolution on workflow execution results
Approach: compare the provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. “Provenance and data differencing for workflow reproducibility analysis.” Concurrency and Computation: Practice and Experience, 2013.
8. (Yet another) Data Lifecycle picture
[Diagram: a process specification spec(P) is packaged and published with data D and provenance prov(D); it is then searched, discovered, and deployed into an environment Env, possibly as an evolved version P' with dependencies dep'; re-computation produces D' with prov(D'); finally (P, P', D, D') are compared.]
Tools annotated on the diagram: Research Objects (packaging); DataONE federated research data repositories; Matlab provenance recorder (DataONE); TOSCA-based virtualisation; YesWorkflow; workflow provenance; NoWorkflow; ReproZip; Pdiff (provenance differencing).
9. You are here
Data packaging: Research Objects
DataONE: data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process virtualisation using TOSCA
Provenance recorders
• Workflow provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails, …
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences
12. Manifest metadata
Manifest construction:
• Identification – id, title, creator, status, …
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources
Manifest description:
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
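To make the manifest structure concrete, here is a minimal sketch that assembles the construction blocks above (identification, aggregates, annotations) into a Python dictionary and serialises it to JSON. The field names only loosely echo the Research Object Bundle manifest conventions and are illustrative, not normative; the title, creator, and file paths are hypothetical.

```python
import json
from datetime import datetime, timezone

# Illustrative manifest: identification fields, aggregated resources, and
# annotations about those resources. Field names are indicative only.
manifest = {
    "id": "ro-example-001",                              # hypothetical identifier
    "title": "Soil map processing study",                # hypothetical title
    "creator": "Alice Researcher",                       # hypothetical creator
    "status": "draft",
    "createdOn": datetime.now(timezone.utc).isoformat(),
    "aggregates": [
        {"uri": "data/input_grid.nc",  "mediatype": "application/x-netcdf"},
        {"uri": "scripts/process.m",   "mediatype": "text/plain"},
        {"uri": "provenance/trace.provn",
         "mediatype": "text/provenance-notation"},
    ],
    "annotations": [
        {"about": "scripts/process.m",
         "content": "annotations/process-description.ttl"},
    ],
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```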
13. You are here (agenda slide, same outline as slide 9)
14. Components for a flexible, scalable, sustainable network
Cyberinfrastructure component 2: Member Nodes (www.dataone.org/member-nodes)
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
15. Cyberinfrastructure
[Diagram of coordinating-node services: data services (extraction, sub-setting, etc.), provenance, and semantics-enabled discovery via ontology annotation; science data, science metadata, and system metadata are replicated, indexed, and exposed through a search API.]
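As an illustration of the search API sketched above, the snippet below queries a DataONE coordinating node's Solr index for metadata records mentioning "grass" (the example search that reappears in the Editor's Notes). The endpoint path and the index field names are assumptions based on the DataONE v2 query API; check the current API documentation before relying on them.

```python
import requests

# Assumed endpoint of a DataONE coordinating node's Solr query service (v2 API).
SOLR_URL = "https://cn.dataone.org/cn/v2/query/solr/"

params = {
    "q": "abstract:grass AND formatType:METADATA",  # field names are assumptions
    "fl": "identifier,title,author",
    "rows": 10,
    "wt": "json",
}

resp = requests.get(SOLR_URL, params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("identifier"), "|", doc.get("title"))
```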
17. Use provenance for transparency and reproducibility
What input data went into this study? What methods were used? … with what parameter settings, calibrations, …? Can we trust the data and methods?
Provenance (lineage): tracking the origin and processing history of data supports trust and data quality assessment, and acts as an audit trail for attribution and credit.
Discovery of data, methodologies, experiments
18. W3C has published the ‘PROV’ standard
[PROV core diagram: the classes Entity, Activity, and Agent, connected by the relations used, wasGeneratedBy, wasAttributedTo, and wasAssociatedWith.]
See w3.org/TR/prov-o/
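The following sketch records exactly these kinds of statements using the third-party Python prov package, which implements the W3C PROV data model. The library choice and the example identifiers are assumptions of this write-up; the slides do not prescribe a particular implementation.

```python
from prov.model import ProvDocument

# Build a tiny PROV document: one activity reads a raw image and generates a
# corrected image; both the activity and its output are linked to an agent.
doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")   # hypothetical namespace

doc.entity("ex:raw_image")
doc.entity("ex:corrected_image")
doc.activity("ex:image_correction")
doc.agent("ex:alice")

doc.used("ex:image_correction", "ex:raw_image")                 # activity used input
doc.wasGeneratedBy("ex:corrected_image", "ex:image_correction")
doc.wasAssociatedWith("ex:image_correction", "ex:alice")        # who ran it
doc.wasAttributedTo("ex:corrected_image", "ex:alice")           # who gets credit

print(doc.get_provn())          # PROV-N text serialisation
doc.serialize("trace.json")     # JSON serialisation, e.g. for packaging
```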
24. DataONE data packages: provenance inside!
[Diagram of a data package: an OAI-ORE resource map carrying a ProvONE trace aggregates science metadata, science data, figures, and software, each with its own system metadata.]
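A minimal sketch of what a resource map like the one pictured might look like, built with rdflib: an OAI-ORE aggregation that lists the package members, plus one PROV statement linking a derived product to its source. All identifiers are hypothetical placeholders, and a real DataONE package would use ProvONE workflow terms and full system metadata rather than this toy graph.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE  = Namespace("http://www.openarchives.org/ore/terms/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("ore", ORE)
g.bind("prov", PROV)
g.bind("dcterms", DCTERMS)

# All identifiers below are hypothetical placeholders.
resource_map = URIRef("urn:uuid:resource-map-001")
aggregation  = URIRef("urn:uuid:aggregation-001")
science_meta = URIRef("urn:uuid:science-metadata-001")
science_data = URIRef("urn:uuid:science-data-001")
figure       = URIRef("urn:uuid:figure-001")
software     = URIRef("urn:uuid:script-001")

# The resource map describes an aggregation that lists the package members.
g.add((resource_map, RDF.type, ORE.ResourceMap))
g.add((resource_map, ORE.describes, aggregation))
g.add((aggregation, RDF.type, ORE.Aggregation))
for member in (science_meta, science_data, figure, software):
    g.add((aggregation, ORE.aggregates, member))

g.add((science_meta, DCTERMS.title, Literal("Science metadata record")))

# One provenance statement carried inside the package: the figure was derived
# from the science data.
g.add((figure, PROV.wasDerivedFrom, science_data))

print(g.serialize(format="turtle"))
```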
28. MATLAB, R, Python … scripts
YesWorkflow (YW): scripts as prospective provenance
Script + @YW annotations: bridging workflow-land and trace-land
Combine provenance:
• Prospective (workflow)
• Retrospective (runtime trace)
• Reconstructed (logs, files, …)
Users can query their own data and provenance prior to sharing
Incentive: accelerate work! “Provenance for Self”
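For illustration, here is a hypothetical Python script carrying YesWorkflow-style comment annotations: a @BEGIN/@END block with @IN, @OUT, and @PARAM declarations and URI templates. Only the annotation keywords follow the YW convention; the script body, file names, and block name are invented for the example (the image name reappears in the query discussed in the Editor's Notes).

```python
# @BEGIN correct_image  @DESC hypothetical image-correction step
# @PARAM energy_level
# @IN raw_image        @URI file:raw/{sample}_{energy_level}ev_{run}.img
# @OUT corrected_image @URI file:corrected/{sample}_{energy_level}ev_{run}.img

import shutil

def correct(raw_path, corrected_path, energy_level):
    # Placeholder for the real correction: simply copy the file through.
    shutil.copy(raw_path, corrected_path)

if __name__ == "__main__":
    correct("raw/DRT322_11000ev_028.img",
            "corrected/DRT322_11000ev_028.img",
            energy_level=11000)

# @END correct_image
```

YesWorkflow parses these comments to build the prospective dataflow model, which can then be combined with the retrospective trace recorded at run time.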
29. Transitive credit
When a user cites a publication, we know:
• which data produced it
• what software produced it
• what was derived from it
• who to credit down the attribution stack
Katz & Smith. “Implementing Transitive Credit with JSON-LD.” arXiv:1407.5117, 2014.
Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Digital Curation Conference (IDCC), Amsterdam, 2016. (Best Paper Award)
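The idea of credit flowing down the attribution stack can be sketched as follows: each product declares weighted contributors and weighted upstream products, and the credit assigned to a cited product is propagated backwards by multiplying the weights along derivation edges. This is an illustration in the spirit of Katz & Smith's transitive credit, not their published algorithm; the graph, names, and weights are invented.

```python
from collections import defaultdict

# Hypothetical credit map: each product names weighted contributors and the
# weighted upstream products it was derived from.
credit_map = {
    "paper":      {"contributors": {"bob": 0.6},
                   "derived_from": {"dataset_v2": 0.5}},
    "dataset_v2": {"contributors": {"alice": 0.8},
                   "derived_from": {"raw_data": 0.25}},
    "raw_data":   {"contributors": {"survey_team": 1.0},
                   "derived_from": {}},
}

def propagate_credit(product, weight=1.0, totals=None):
    """Recursively push `weight` units of credit down the attribution stack."""
    if totals is None:
        totals = defaultdict(float)
    entry = credit_map[product]
    for person, share in entry["contributors"].items():
        totals[person] += weight * share
    for upstream, share in entry["derived_from"].items():
        propagate_credit(upstream, weight * share, totals)
    return totals

print(dict(propagate_credit("paper")))
# -> {'bob': 0.6, 'alice': 0.4, 'survey_team': 0.125}
```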
30. Provenance today: important but hard
Climate Change Impacts in the United States: the U.S. National Climate Assessment, U.S. Global Change Research Program
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
32. Provenance in action
Yaxing’s script with its inputs and output products (YesWorkflow model)
Christopher uses Yaxing’s outputs as inputs for his script
Christopher’s results can be traced back all the way to Yaxing’s input
33. You are here (agenda slide, same outline as slide 9)
38. You are here (agenda slide, same outline as slide 9)
39. Data divergence analysis using provenance
All work done with reference to the e-Science Central WFMS.
Assumption: workflow WFj (the new version) runs to completion, so it produces a new provenance trace; however, its results may be dysfunctional (divergent) relative to WFi (the original).
Example: only the input data changes: d != d’, WFj == WFi.
Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
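A minimal sketch of the divergence-analysis idea: given two traces that record, for each data item, the activity that generated it and the inputs that activity used, walk backwards from a pair of divergent outputs to the earliest point where the traces disagree (diverging inputs, diverging activities, or a non-deterministic step). This illustrates the general approach rather than the PDIFF algorithm itself; the trace encoding and function names are assumptions.

```python
# Toy provenance traces: each maps a data item to the activity that generated
# it and the inputs that activity used (a simplified stand-in for PROV traces).

def earliest_divergence(trace_a, trace_b, item_a, item_b):
    """Walk both traces backwards in lock-step and report the first mismatch."""
    if item_a == item_b:
        return None                      # identical data items on this branch
    gen_a, gen_b = trace_a.get(item_a), trace_b.get(item_b)
    if gen_a is None or gen_b is None:
        return ("diverging inputs", item_a, item_b)   # reached source data
    if gen_a["activity"] != gen_b["activity"]:
        return ("diverging activities", gen_a["activity"], gen_b["activity"])
    # Same activity but different outputs: look further back at its inputs.
    for in_a, in_b in zip(gen_a["inputs"], gen_b["inputs"]):
        cause = earliest_divergence(trace_a, trace_b, in_a, in_b)
        if cause is not None:
            return cause
    return ("non-deterministic or stateful activity", gen_a["activity"])

# Two runs of the same workflow where only the input data changed (d vs d').
run_i = {"out":  {"activity": "analyse", "inputs": ["d"]}}
run_j = {"out'": {"activity": "analyse", "inputs": ["d'"]}}
print(earliest_divergence(run_i, run_j, "out", "out'"))
# -> ('diverging inputs', 'd', "d'")
```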
45. References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332.
Process virtualisation using TOSCA:
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146.
NoWorkflow, provenance recording for Python:
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014.
Pdiff, provenance differencing for understanding workflow differences:
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013). doi:10.1002/cpe.3035.
Editor's Notes
Packaging – physical and logical containers
Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources
http://www.openarchives.org/ore/
Uses a Resource Map to describe the aggregated resources
Proxies allow for statements about the resources within the aggregation
Capturing context and viewpoints
Several concrete serialisations
RDF/XML, Atom, RDFa
Open Annotation specification is a community developed data model for annotation of web resources
http://www.openannotation.org/spec/core/
Developed by the W3C Open Annotation Community Group
Allows for “stand-off” annotations
Annotation as a first class citizen
Developed to fit with Web Architecture
How do you make a research object? Well, gather your resources, describe them in the manifest.
Different types of Containers can be used to transfer and package the Research Object;
The Research Object Bundle is a structured ZIP file format… but more specific and more general formats are also used, such as
Docker images (a bit low-level, capturing the whole execution environment)
BagIt (a digital archiving format that is commonly used by libraries), or
Simply existing Web resources (which may be subject to change).
You can register and archive research objects in domain-specific repositories like FAIRDOM’s SEEK (systems biology models), FARR Commons CKAN (public health medical data), technology-specific repositories (myExperiment for workflow-centric research objects), or generic data repositories you have probably already heard of, like Zenodo and Figshare.
Linked Resource Model very relevant
Dublin Core Application Profile
Pericles Linked Resource Model
Identification includes properties for identifying the “mime type” annotation profile of the RO
DataONE provenance products and tools (removed from the deck as redundant with the next slide):
• New ProvONE model, extending the W3C PROV standard for workflows
• New Matlab provenance recorder (ITK also includes R, Python recorders)
• DataONE Web UI integration: the UI is “provenance-aware”
These statements are the low-level pieces of information that we keep track of.
We want to enhance analysis software that scientists are already familiar with, so for our first round we are working on a Matlab toolbox and an R library. In conjunction with Bertram, Paolo, and other colleagues, we are incorporating the YesWorkflow Java library into our Matlab toolbox to capture ‘prospective’ provenance.
Use tools, concepts scientists are already familiar with
Query 3: Where is the raw image corresponding to corrected image DRT322_11000ev_028.img
Scientist: Look at the image files nested within the raw directory. Find the image file that contains the values DRT322, 11000, and 028 in the file access path.
YW: Extract the URI template variable names and values from the path to DRT322_11000ev_028.img output by the port named corrected_image, look at the paths for all files output by the raw_image port, and return the file whose path includes template variables with names and values matching those for DRT322_11000ev_028.img
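The port-and-template matching described in this note can be sketched in a few lines of Python: turn the @URI templates of the corrected_image and raw_image ports into regular expressions with named groups, extract the variable values from the corrected image's path, and return the raw file whose values match. The templates and helper names are hypothetical.

```python
import re

# Assumed URI templates for the two ports, in the {variable} style used by
# YesWorkflow @URI annotations.
CORRECTED_TEMPLATE = "corrected/{sample}_{energy}ev_{run}.img"
RAW_TEMPLATE       = "raw/{sample}_{energy}ev_{run}.img"

def template_to_regex(template):
    """Turn a {variable}-style URI template into a regex with named groups."""
    regex = ""
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>[^_/]+)"
        else:
            regex += re.escape(part)
    return re.compile(regex + "$")

def find_raw_for_corrected(corrected_path, raw_paths):
    """Return the raw image whose template variables match the corrected image's."""
    wanted = template_to_regex(CORRECTED_TEMPLATE).match(corrected_path).groupdict()
    raw_re = template_to_regex(RAW_TEMPLATE)
    for path in raw_paths:
        m = raw_re.match(path)
        if m and m.groupdict() == wanted:
            return path
    return None

raw_files = ["raw/DRT322_11000ev_027.img", "raw/DRT322_11000ev_028.img"]
print(find_raw_for_corrected("corrected/DRT322_11000ev_028.img", raw_files))
# -> raw/DRT322_11000ev_028.img
```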
In the DataONE Search, we can search for ‘grass’, and two data packages show up. The Yaxing Wei (Alice) soil map processing workflow and the Christopher Schwalm (Bob) analysis workflow both show that they have provenance information associated with their data packages (via the icon in the search record). We next choose Wei’s data package to see the details. This can be seen at https://search-sandbox-2.test.dataone.org.
Viewing the Wei soil processing workflow we see on the left that the Matlab script (C3_C4_map_present_with_comments.m) has 25 inputs. It also has 6 outputs on the right. The top three outputs are the YesWorkflow diagrams (dataflow, processflow, combined). The bottom three are the NetCDF data files that represent three different world map grids of percentage of grass types (C3 grass fraction, C4 grass fraction, and total grass fraction). The script can be downloaded with the Download button in the center. This can be accessed at https://search-sandbox-2.test.dataone.org/#view/metadata_e859d2dd-c5e6-4ec6-892f-1b00bb6f8f65.xml. Bertram, if you want to show the YesWorkflow diagram (combined) for this run showing how monthly air and precipitation values are used as the inputs, the combined diagram can be accessed from this page, or directly from https://cn-sandbox-2.test.dataone.org/cn/v2/resolve/d87e1a6a-1a78-4f96-bba8-cb74ac2b1efb