The lifecycle of reproducible science data and what provenance has got to do with it
1. The lifecycle of reproducible science data
and what provenance has got to do with it
Paolo Missier
School of Computing Science
Newcastle University, UK
Alan Turing Institute
Symposium On Reproducibility for Data-Intensive Research
Oxford, April 6, 2016
With material contributed by:
Yang Cao, Bertram Ludäscher, Tim McPhillips, Dave Vieglais, Matt Jones and
the DataONE CyberInfrastructure group
Rawaa Qasha at Newcastle University
Carole Goble at the University of Manchester
5. Mapping the reproducibility space
P. Missier, ATI Symposium on Reproducibility, Oxford, April 6th, 2016
Goal: to help scientists understand the effect of workflow / data / dependencies
evolution on workflow execution results
Approach: compare provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow
reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013.
8. (Yet another) Data Lifecycle picture
[Figure: data lifecycle loop with the stages search / discover, package / publish, deploy, compute, compare. A process P with data D, dependencies dep and environment Env, described by spec(P) and prov(D), evolves into P’ with dep’ and Env’, producing D’ with spec(P’) and prov(D’); the compare stage takes (P, P’, D, D’).]
Tools placed on the picture:
• Research Objects
• DataONE Federated Research Data Repositories
• TOSCA-based virtualisation
• ReproZip
• Workflow Provenance
• YesWorkflow
• NoWorkflow
• Matlab provenance recorder (DataONE)
• Pdiff: differencing provenance
9. You are here
Data packaging: Research Objects
DataONE: Data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process Virtualisation using TOSCA
Provenance recorders
• Workflow Provenance
• Taverna, eScience Central, Kepler, Pegasus, VisTrails…
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences
12. Manifest Metadata
Manifest Construction
• Identification – id, title, creator, status, …
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources
Manifest Description
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
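The manifest structure listed on this slide can be sketched as a plain Python dictionary. This is only an illustration: the field names (`id`, `createdBy`, `aggregates`, `annotations`, ...) and the example URIs are assumptions made for the sketch, loosely inspired by Research Object conventions, and not the normative manifest format.

```python
import json


def build_manifest(ro_id, title, creator, resources, annotations):
    """Assemble a manifest: identification, aggregated resources, annotations.

    All field names are hypothetical and chosen for readability.
    """
    aggregated = {r["uri"] for r in resources}
    for ann in annotations:
        # An annotation should only talk about a resource the object aggregates.
        if ann["about"] not in aggregated:
            raise ValueError(f"annotation targets unknown resource: {ann['about']}")
    return {
        "id": ro_id,                 # Identification
        "title": title,
        "createdBy": creator,
        "status": "draft",
        "aggregates": resources,     # Aggregates: ids/links to resources
        "annotations": annotations,  # Annotations about those resources
    }


manifest = build_manifest(
    ro_id="urn:example:ro-1",
    title="Grass fraction maps",
    creator="A. Scientist",
    resources=[
        {"uri": "data/input.csv", "mediatype": "text/csv"},
        {"uri": "scripts/analysis.py", "mediatype": "text/x-python"},
    ],
    annotations=[
        {"about": "data/input.csv", "content": "annotations/input-provenance.ttl"},
    ],
)
print(json.dumps(manifest, indent=2))
```

The "Manifest Description" items (checklists, provenance, versioning, dependencies) would typically be carried as further annotations; they are left out here to keep the sketch short.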
14. Components for a flexible, scalable, sustainable network
Cyberinfrastructure Component 2
Member Nodes: www.dataone.org/member-nodes
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
15. Cyberinfrastructure
Data Services: extraction, sub-setting, etc.
Provenance Semantics-enabled Discovery
[Diagram labels: ontology, annotation, system metadata, science data, search API, science metadata, provenance, replicate, metadata index]
17. What input data went into this study?
What methods were used?
… with what parameter settings, calibrations, …?
Can we trust the data and methods?
Provenance (lineage): track origin and processing history of data
trust, data quality ~ audit trail for attribution, credit
Discovery of data, methodologies, experiments
Use Provenance for Transparency, Reproducibility
18. W3C has published the ‘PROV’ standard
W3C PROV model (see w3.org/TR/prov-o/)
• Types: Entity, Activity, Agent
• used: Activity → Entity
• wasGeneratedBy: Entity → Activity
• wasAssociatedWith: Activity → Agent
• wasAttributedTo: Entity → Agent
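As a rough illustration of the PROV core model named on this slide, the sketch below stores provenance statements as plain triples and checks each one against the domain and range that PROV assigns to the four relations. It does not use any PROV library; the identifiers (`raw.csv`, `cleaning-run`, `alice`) are made up for the example.

```python
# Domain and range of the four PROV relations shown on the slide.
RELATIONS = {
    "used": ("Activity", "Entity"),
    "wasGeneratedBy": ("Entity", "Activity"),
    "wasAssociatedWith": ("Activity", "Agent"),
    "wasAttributedTo": ("Entity", "Agent"),
}


def check(statements, types):
    """Return the statements whose subject/object types do not fit the model."""
    errors = []
    for subj, rel, obj in statements:
        domain, range_ = RELATIONS[rel]
        if types[subj] != domain or types[obj] != range_:
            errors.append((subj, rel, obj))
    return errors


# Hypothetical example: one cleaning step run by one person.
types = {
    "raw.csv": "Entity",
    "clean.csv": "Entity",
    "cleaning-run": "Activity",
    "alice": "Agent",
}
statements = [
    ("cleaning-run", "used", "raw.csv"),
    ("clean.csv", "wasGeneratedBy", "cleaning-run"),
    ("cleaning-run", "wasAssociatedWith", "alice"),
    ("clean.csv", "wasAttributedTo", "alice"),
]

print(check(statements, types))
```

A statement written the wrong way round, for example an Entity that "used" an Activity, is reported by `check` rather than silently stored.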
24. DataONE data packages: Provenance inside!
[Diagram: a resource map (OAI-ORE with ProvONE trace) aggregates science metadata, science data, figures and software; each item carries its own system metadata.]
28. MATLAB, R, Python, … scripts
YesWorkflow (YW): scripts as prospective provenance
• Script + @YW-annotation
• workflow-land & trace-land
Combine provenance:
• Prospective (workflow)
• Retrospective (runtime trace)
• Reconstructed (logs, files, …)
User can query own data & provenance prior to sharing
Incentive: accelerate work! “Provenance for Self”
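To make the "Script + @YW-annotation" idea concrete, here is a toy sketch that reads YesWorkflow-style comments (`@begin`, `@in`, `@out`, `@end`) from a script and derives the dataflow between blocks. It is not the YesWorkflow tool: the parser handles only flat, non-nested blocks, and the script, block names and port names are invented for the example.

```python
import re

# A hypothetical annotated script, held as a string so the sketch is self-contained.
SCRIPT = '''
# @begin clean_data
# @in raw_file
# @out clean_file
def clean_data(raw_file): ...
# @end clean_data

# @begin plot_results
# @in clean_file
# @out figure
def plot_results(clean_file): ...
# @end plot_results
'''

TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def prospective_provenance(source):
    """Collect the input and output ports declared for each annotated block."""
    blocks, current = {}, None
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks


def dataflow_edges(blocks):
    """Link each block to every block that consumes one of its outputs."""
    return sorted(
        (producer, data, consumer)
        for producer, ports in blocks.items()
        for data in ports["out"]
        for consumer, other in blocks.items()
        if data in other["in"]
    )


blocks = prospective_provenance(SCRIPT)
print(dataflow_edges(blocks))
```

The point of the design is that the workflow view is recovered from comments alone, so the script itself does not have to be rewritten for a workflow system.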
29. When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack
Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Data Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award)
Transitive Credit
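The idea of crediting contributors "down the attribution stack" can be sketched as follows. Each product declares fractional credit for the people and products it builds on, and credit reaching a product is passed on to that product's own contributors by multiplying the fractions. The credit maps and weights below are invented for illustration, and the sketch assumes the dependency graph has no cycles.

```python
from collections import defaultdict

# Hypothetical credit maps: each product splits credit (summing to 1.0)
# between people and the products it depends on.
CREDIT_MAPS = {
    "paper": {"alice": 0.6, "dataset": 0.3, "software": 0.1},
    "dataset": {"bob": 1.0},
    "software": {"carol": 0.5, "dataset": 0.5},
}


def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Accumulate the credit each person receives for `product`.

    A contributor that is itself a product (a key of `credit_maps`) passes
    its share on to its own contributors. Assumes an acyclic graph.
    """
    if totals is None:
        totals = defaultdict(float)
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] += weight * share
    return totals


totals = transitive_credit("paper", CREDIT_MAPS)
for person, credit in sorted(totals.items()):
    print(f"{person}: {credit:.2f}")
```

In this example the dataset author receives credit twice, once directly through the paper and once through the software that also used the dataset.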
30. Provenance today: Important but hard
Climate Change Impacts in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
32. Yaxing’s script with inputs & output products
YesWorkflow model
Christopher using Yaxing’s outputs as inputs for his script
Christopher’s results can be traced back all the way to Yaxing’s input
Provenance in action
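Tracing a result back across two scripts, as described on this slide, amounts to walking provenance links upstream. The sketch below does this over two small tables, one recording which run generated each file and one recording which files each run used. The run names and file names are hypothetical placeholders, not the actual files in the data packages.

```python
def upstream(item, generated_by, used):
    """Return every data item that `item` transitively depends on."""
    lineage, stack = set(), [item]
    while stack:
        current = stack.pop()
        run = generated_by.get(current)   # None for raw inputs
        for source in used.get(run, []):
            if source not in lineage:
                lineage.add(source)
                stack.append(source)
    return lineage


# Hypothetical traces from the two runs.
generated_by = {
    "grass_fraction_map.nc": "yaxing_run",
    "regional_summary.csv": "christopher_run",
}
used = {
    "yaxing_run": ["monthly_air_temperature.nc", "monthly_precipitation.nc"],
    "christopher_run": ["grass_fraction_map.nc"],
}

print(sorted(upstream("regional_summary.csv", generated_by, used)))
```

Because the second run's input is the first run's output, the walk crosses the boundary between the two scripts without any extra bookkeeping.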
39. Data divergence analysis using provenance
All work done with reference to the e-Science Central WFMS
Assumption: workflow WFj (new version) runs to completion,
thus it produces a new provenance trace;
however, it may be dysfunctional relative to WFi (the original)
Example: only input data changes: d != d’, WFj == WFi
Note: results may diverge even when the input datasets are identical, for example when one or
more of the services exhibits non-deterministic behaviour, or depends on external state that has
changed between executions.
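The kind of comparison described here can be sketched as a walk over two provenance traces, starting from an output that differs and moving upstream until the origin of the difference is found. This is a simplified illustration, not the published PDIFF algorithm: it assumes both traces use the same item names, and the `Node` structure, service names and digests are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    producer: Optional[str]   # service that generated the item; None for workflow inputs
    inputs: Tuple[str, ...]   # names of the items the producer consumed
    digest: str               # checksum of the item's content


def divergence_points(trace_a, trace_b, output):
    """Return the items where a difference in `output` originates."""
    causes, seen, stack = set(), set(), [output]
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        a, b = trace_a[item], trace_b[item]
        if a.digest == b.digest:
            continue                      # identical content: not a cause
        if a.producer != b.producer or a.inputs != b.inputs:
            causes.add(item)              # the workflow itself changed here
            continue
        differing = [i for i in a.inputs if trace_a[i].digest != trace_b[i].digest]
        if differing:
            stack.extend(differing)       # keep looking upstream
        else:
            causes.add(item)              # changed input data, or same inputs but
                                          # a different result (non-determinism)
    return causes


run_1 = {
    "d": Node(None, (), "d1"),
    "x": Node("S1", ("d",), "x1"),
    "y": Node("S2", ("x",), "y1"),
}
# Same workflow, different input data: d != d'.
run_2 = {
    "d": Node(None, (), "d2"),
    "x": Node("S1", ("d",), "x2"),
    "y": Node("S2", ("x",), "y2"),
}
# Same workflow and same input, but service S2 returned something else.
run_3 = {
    "d": Node(None, (), "d1"),
    "x": Node("S1", ("d",), "x1"),
    "y": Node("S2", ("x",), "y9"),
}

print(divergence_points(run_1, run_2, "y"))
print(divergence_points(run_1, run_3, "y"))
```

The first comparison points at the input `d`, matching the slide's example; the second points at `y` itself, which is the non-deterministic case mentioned in the note.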
45. References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et
al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011).
doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati,
Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.”
In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014.
doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using
TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015.
doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire.
“noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne,
Germany: Springer, 2014.
Pdiff: provenance differencing for understanding workflow differences
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing
for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience
(2013). doi:10.1002/cpe.3035.
Editor's Notes
Packaging – physical and logical containers
Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources
http://www.openarchives.org/ore/
Uses a Resource Map to describe the aggregated resources
Proxies allow for statements about the resources within the aggregation
Capturing context and viewpoints
Several concrete serialisations
RDF/XML, Atom, RDFa
Open Annotation specification is a community developed data model for annotation of web resources
http://www.openannotation.org/spec/core/
Developed by the W3C Open Annotation Community Group
Allows for “stand-off” annotations
Annotation as a first class citizen
Developed to fit with Web Architecture
How do you make a research object? Well, gather your resources, describe them in the manifest.
Different types of Containers can be used to transfer and package the Research Object;
The Research Object Bundle is a structured ZIP file format… but more specific and more general formats are also used, such as
Docker images (a bit low-level, capturing the whole execution environment)
BagIt (a digital archiving format that is commonly used by libraries), or
Simply existing Web resources (which may be subject to change).
You can register and archive research objects in domain-specific repositories like FAIRDOM’s SEEK (systems biology models) or FARR Commons CKAN (public health medical data), in technology-specific repositories (myExperiment for workflow-centric research objects), or in generic data repositories you have probably already heard of, like Zenodo and Figshare.
Linked Resource Model very relevant
Dublin Core Application Profile
Pericles Linked Resource Model
Identification includes properties for identifying the “mime type” annotation profile of the RO
Need to update with new / upcoming MN locations and logos
Amber notes: Retain CN, MN logo? Required if used elsewhere, if not cut? Not all MN logos will fit – select representative or cut? Cross reference with google MN
Rebecca:
Need updated logos for KNB, AOOS (FIXED) – I would select a different set of MNs to highlight since all won’t fit
Rebecca:
Can we do a better job than the quad chart? If not, are all the logos in
1st quadrant appropriate?
Update before RSV
Figure shows from 2020 – edit?
Rebecca: the green axis and legend on the right is difficult to read – another color would be better.
Bertram: Agreed. But this isn’t our chart. Maybe we can “patch” it? Also: should credit source!
Still missing; EYE CANDY
Also removed (redundant with next slide!):
DataONE Provenance Products & Tools:
New ProvONE model
extends W3C PROV standard for workflows
New Matlab provenance recorder
ITK also includes R, Python recorders
DataONE Web UI integration
UI is “provenance-aware”
These statements are the low-level pieces of information that we keep track of.
We want to enhance analysis software that scientists are already familiar with. So for our first round, we are working on a Matlab Toolbox and an R library. In conjunction with Bertram, Paolo, and other colleagues, we are incorporating the YesWorkflow Java library into our Matlab Toolbox to capture ‘prospective’ provenance.
Is the logo supposed to be R or ONE R?
Use tools, concepts scientists are already familiar with
Query 3: Where is the raw image corresponding to corrected image DRT322_11000ev_028.img?
Scientist: Look at the image files nested within the raw directory. Find the image file that contains the values DRT322, 11000, and 028 in the file access path.
YW: Extract the URI template variable names and values from the path to DRT322_11000ev_028.img output by the port named corrected_image, look at the paths for all files output by the raw_image port, and return the file whose path includes template variables with names and values matching those for DRT322_11000ev_028.img.
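The matching procedure described for Query 3 can be sketched with ordinary regular expressions. This is an illustration only, not how YesWorkflow implements it: the two URI templates, the variable names (`sample_id`, `energy`, `frame_number`) and the directory layout are assumptions, since the notes give only the file name DRT322_11000ev_028.img.

```python
import re


def template_to_regex(template):
    """Turn a template like 'raw/{sample_id}/e{energy}.raw' into a compiled regex."""
    parts = re.split(r"\{(\w+)\}", template)   # literal, name, literal, name, ...
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)          # literal text
        else:
            pattern += f"(?P<{part}>[^/_]+)"    # template variable
    return re.compile(pattern)


def find_matching(target_path, target_template, candidates, candidate_template):
    """Return candidates whose template variables equal those of the target."""
    target = template_to_regex(target_template).fullmatch(target_path)
    if target is None:
        return []
    wanted = target.groupdict()
    candidate_re = template_to_regex(candidate_template)
    return [
        path
        for path in candidates
        if (m := candidate_re.fullmatch(path)) and m.groupdict() == wanted
    ]


# Hypothetical templates for the corrected_image and raw_image ports.
CORRECTED = "corrected/{sample_id}_{energy}ev_{frame_number}.img"
RAW = "raw/{sample_id}/e{energy}/image_{frame_number}.raw"

raw_files = [
    "raw/DRT322/e11000/image_027.raw",
    "raw/DRT322/e11000/image_028.raw",
    "raw/DRT240/e11000/image_028.raw",
]

print(find_matching("corrected/DRT322_11000ev_028.img", CORRECTED, raw_files, RAW))
```

The sketch does not support a variable that appears twice in one template; that case would need a backreference instead of a second named group.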
In the DataONE Search, we can search for ‘grass’, and two data packages show up. The Yaxing Wei (Alice) soil map processing workflow and the Christopher Schwalm (Bob) analysis workflow both show that they have provenance information associated with their Data Packages (via the icon in the search record). Next we choose Wei’s Data Package to see the details. This can be seen at https://search-sandbox-2.test.dataone.org.
Viewing the Wei soil processing workflow we see on the left that the Matlab script (C3_C4_map_present_with_comments.m) has 25 inputs. It also has 6 outputs on the right. The top three outputs are the YesWorkflow diagrams (dataflow, processflow, combined). The bottom three are the NetCDF data files that represent three different world map grids of percentage of grass types (C3 grass fraction, C4 grass fraction, and total grass fraction). The script can be downloaded with the Download button in the center. This can be accessed at https://search-sandbox-2.test.dataone.org/#view/metadata_e859d2dd-c5e6-4ec6-892f-1b00bb6f8f65.xml. Bertram, if you want to show the YesWorkflow diagram (combined) for this run showing how monthly air and precipitation values are used as the inputs, the combined diagram can be accessed from this page, or directly from https://cn-sandbox-2.test.dataone.org/cn/v2/resolve/d87e1a6a-1a78-4f96-bba8-cb74ac2b1efb