The lifecycle of reproducible science data
and what provenance has got to do with it
Paolo Missier
School of Computing Science
Newcastle University, UK
Alan Turing Institute
Symposium On Reproducibility for Data-Intensive Research
Oxford, April 6, 2016
With material contributed by:
Yang Cao, Bertram Ludäscher, Tim McPhillips, Dave Vieglais, Matt Jones and
the DataONE CyberInfrastructure group
Rawaa Qasha at Newcastle University
Carole Goble at the University of Manchester
P. Missier
ATI Symposium on Reproducibility
Oxford, April 6th, 2016
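(Yet another) Data Lifecycle picture
[Figure: data lifecycle loop. Stages: search / discover, package / publish, deploy, compute, compare]
Published package: <D, P, dep, spec(P), prov(D)>
Changes: D → D1, P → P’, dep → dep’
Deploy: spec(P’), P’ → Env(dep’)
Compute: Env, D1, D’, prov(D’)
Compare: (P, P’, D, D’), with spec(P), prov(D)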
Reproducibility: working. reporting
submit article and move on…
publish article
Research Environment
Publication Environment
Peer Review
Re-what?
Re-*
ReRun: vary experiment and setup, same lab
P → P’, D → D’, dep → dep’
Repeat: same experiment, setup, lab
P, D, dep, env(dep)
Replicate: same experiment, setup, different lab
P, D, dep, env’(dep)
Reproduce: vary experiment and setup, different lab
P → P’, D → D’, dep → dep’, env(dep) → env’(dep’)
Reuse: different experiment
D, P → Q
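The first four Re-* cases can be read as a two-by-two grid: did the experiment and setup change, and did the lab change? The sketch below is one way to express that grid. The `Run` record and the `classify` function are illustrative names rather than anything from the talk, and "vary" is assumed to mean that at least one of P, D or dep differs. Reuse (a different experiment Q) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One execution: program P, data D, dependencies dep, and the lab."""
    program: str
    data: str
    deps: str
    lab: str


def classify(original: Run, new: Run) -> str:
    """Place a pair of runs on the Repeat / Replicate / ReRun / Reproduce grid."""
    same_setup = (original.program, original.data, original.deps) == (
        new.program, new.data, new.deps)
    same_lab = original.lab == new.lab
    if same_setup:
        return "Repeat" if same_lab else "Replicate"
    return "ReRun" if same_lab else "Reproduce"


base = Run(program="P", data="D", deps="dep", lab="lab-A")
print(classify(base, base))                                # Repeat
print(classify(base, Run("P", "D", "dep", "lab-B")))       # Replicate
print(classify(base, Run("P'", "D'", "dep'", "lab-A")))    # ReRun
print(classify(base, Run("P'", "D'", "dep'", "lab-B")))    # Reproduce
```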
Mapping the reproducibility space
Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results
Approach: compare provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013.
Workflow evolution
Each of the elements in an execution may evolve (semi-)independently from the others.
Repeatability: can tr_t be computed again at some time t’ > t?
This requires saving ED_t, which may be impractical (e.g. a large DB state).
Reproducibility
Can a new version tr_t’ of tr_t be computed at some later time t’ > t, after one or more of the elements have changed?
Potential issues:
• W_i may not run with the new ED_j’
• W_i may not run with wfms_k’
• W_i’ may not run with d_h’
• ...
(Yet another) Data Lifecycle picture
[Figure: the data lifecycle loop, annotated with the tools that support each stage]
Lifecycle labels: search / discover, package / publish, spec(P’), Deploy, P’ → Env, D → D1, P → P’, dep → dep’, compute, Env, D’, prov(D’), Compare (P, P’, D, D’), spec(P), prov(D)
Tool annotations:
• Research Objects
• DataONE Federated Research Data Repositories
• TOSCA-based virtualisation
• Pdiff: differencing provenance
• YesWorkflow
• Workflow Provenance
• NoWorkflow
• Matlab provenance recorder (DataONE)
• ReproZip
You are here
Data packaging: Research Objects
DataONE: Data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process Virtualisation using TOSCA
Provenance recorders
• Workflow Provenance
• Taverna, eScience Central, Kepler, Pegasus, VisTrails…
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences
Computational Workflow Runs
Aggregating in Research Object
[Figure: contents of a workflow-run Research Object]
workflowrun.prov.ttl (RDF)
outputA.txt
outputC.jpg
outputB/
intermediates/
1.txt
2.txt
3.txt
de/def2e58b-50e2-4949-9980-fd310166621a.txt
inputA.txt
workflow
attribution
execution environment
ZIP folder structure (RO Bundle)
mimetype: application/vnd.wf4ever.robundle+zip
.ro/manifest.json
URI references
Exchange
Reproducibility
Same data
Same code
Systematic and extensible metadata collection
Workflow Annotation Profile
Wf4Ever Project
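To make the folder structure concrete, here is a minimal sketch that writes a ZIP containing a `mimetype` entry, a few resources and a `.ro/manifest.json`. The function name `write_ro_bundle`, the example file contents and the two manifest fields (`id`, `aggregates`) are assumptions for illustration. The RO Bundle specification defines the full manifest vocabulary and should be consulted for real bundles.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_ro_bundle(target, resources):
    """Write a bundle-like ZIP to `target` (a path or a file-like object).

    `resources` maps archive paths (e.g. "outputA.txt") to bytes.
    """
    manifest = {
        # Assumed minimal manifest: just the list of aggregated resources.
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(resources)],
    }
    with zipfile.ZipFile(target, "w") as bundle:
        # The mimetype entry is written first and stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name in sorted(resources):
            bundle.writestr(name, resources[name],
                            compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))


# Usage: build a bundle in memory and read the manifest back.
buffer = io.BytesIO()
write_ro_bundle(buffer, {"inputA.txt": b"in", "outputA.txt": b"out"})
with zipfile.ZipFile(buffer) as bundle:
    names = bundle.namelist()
    manifest = json.loads(bundle.read(".ro/manifest.json"))
print(names)
print(manifest["aggregates"])
```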
Manifests and Containers
Container
Packaging: Zip files, Docker images, BagIt, …
Catalogues & Commons Platforms: FAIRDOM SEEK, Farr Commons CKAN, STELAR eLab, myExperiment
Manifest
Metadata: describes the aggregated resources, their annotations and their provenance
Manifest Metadata
Manifest Construction
• Identification – id, title, creator, status, …
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources
Manifest Description
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
Components for a flexible, scalable, sustainable network
Cyberinfrastructure Component 2
Member Nodes
www.dataone.org/member-nodes
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Cyberinfrastructure
[Figure: DataONE cyberinfrastructure. Labels: Data Services (extraction, sub-setting etc.); Provenance; Semantics-enabled Discovery; ontology; annotation; System Metadata; Science Data; Science Metadata; Search API; Replicate; Metadata Index]
Data Holdings
Use Provenance for Transparency, Reproducibility
What input data went into this study?
What methods were used?
… with what parameter settings, calibrations, …?
Can we trust the data and methods?
• Provenance (lineage): track origin and processing history of data → trust, data quality ~ audit trail for attribution, credit
• Discovery of data, methodologies, experiments
W3C PROV model
• W3C has published the ‘PROV’ standard
[Figure: PROV core model. Nodes: Entity, Activity, Agent. Edges: used, wasGeneratedBy, wasAssociatedWith, wasAttributedTo]
See w3.org/TR/prov-o/
Using a common model
• Example: Scientific workflow
[Figure: PROV graph. Nodes: map image, R script, Execution, Scientist, CSV data. Edges: used, wasGeneratedBy, wasAssociatedWith, wasAttributedTo, wasDerivedFrom]
< “map image” wasDerivedFrom “CSV data” >
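The workflow example can be written down as plain (subject, relation, object) triples. In the sketch below the edge directions are assumptions read off the node and edge labels of the figure, and `candidate_derivations` is an illustrative helper. Note that PROV does not infer `wasDerivedFrom` from `used` plus `wasGeneratedBy`, so the helper only lists candidates that someone, or some tool, would still need to assert.

```python
# Assumed edges for the scientific workflow example.
trace = {
    ("map image", "wasGeneratedBy", "Execution"),
    ("Execution", "used", "CSV data"),
    ("Execution", "used", "R script"),
    ("Execution", "wasAssociatedWith", "Scientist"),
    ("map image", "wasAttributedTo", "Scientist"),
}


def candidate_derivations(statements):
    """Pair each generated entity with everything its activity used."""
    generated = [(s, o) for s, r, o in statements if r == "wasGeneratedBy"]
    used = [(s, o) for s, r, o in statements if r == "used"]
    return {
        (entity, "wasDerivedFrom", source)
        for entity, activity in generated
        for user, source in used
        if user == activity
    }


for statement in sorted(candidate_derivations(trace)):
    print(statement)
```

The candidate set contains the statement from the slide, < “map image” wasDerivedFrom “CSV data” >, alongside a second candidate linking the map image to the R script.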
ProvONE Motivation: Different Kinds of Provenance
• Prospective Provenance
  • method/workflow description (“workflow-land”)
• Retrospective Provenance
  • runtime provenance tracking (“trace-land”)
• Better together!
ProvONE extends PROV for science!
“Trace-Land”
“Workflow-Land”
“Data-Land”
http://purl.dataone.org/provone-v1-dev
DataONE data packages: Provenance inside!
[Figure: a DataONE data package. Labels: resource map; OAI-ORE with ProvONE trace; science metadata; science data; figures; software; system metadata (repeated for each item)]
Provenance … of Figures
Provenance … of Data
YesWorkflow (YW): Scripts as prospective provenance
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
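As a rough illustration of how such annotations become a workflow description, the sketch below pulls the @begin / @in / @out tags out of comment lines with a regular expression. It is not the YesWorkflow tool's parser: `parse_yw` is a hypothetical helper that handles a single, non-nested block and only the tags shown above.

```python
import re

# Matches "# @tag name" with an optional "@as alias" suffix.
YW_TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)(?:\s+@as\s+(\S+))?")


def parse_yw(script: str) -> dict:
    """Collect the block name, inputs and outputs declared in YW comments."""
    block = {"name": None, "inputs": {}, "outputs": {}}
    for line in script.splitlines():
        match = YW_TAG.search(line)
        if not match:
            continue
        tag, name, alias = match.groups()
        if tag == "begin":
            block["name"] = name
        elif tag == "in":
            block["inputs"][name] = alias or name
        elif tag == "out":
            block["outputs"][name] = alias or name
    return block


script = """\
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
# @end CreateGulfOfAlaskaMaps
"""
workflow = parse_yw(script)
print(workflow["name"])
print(workflow["inputs"])
print(workflow["outputs"])
```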
YesWorkflow (YW): Scripts as prospective provenance
MATLAB, R, Python … scripts
• Script + @YW-annotation
  workflow-land & trace-land
• Combine provenance:
  • Prospective (workflow)
  • Retrospective (runtime trace)
  • Reconstructed (logs, files, …)
• User can query own data & provenance prior to sharing
• Incentive: accelerate work!
• “Provenance for Self”
Transitive Credit
When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack
• Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
• Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Data Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award)
Provenance today: Important but hard
Climate Change Impacts in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
Provenance in action
• data and “code” / method linked
• alt formats
• Yaxing’s script with inputs & output products
• YesWorkflow model
• Christopher using Yaxing’s outputs as inputs for his script
• Christopher’s results can be traced back all the way to Yaxing’s input
TOSCA
• Topology and Orchestration Specification for Cloud Applications
Use Case: e-Science Central Workflow
http://www.esciencecentral.co.uk
TOSCA-based mapping of an e-SC Workflow
• Workflow components as Node Types
• Block dependencies as Relationship Types
e-SC Workflow Service Template
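A sketch of what such a mapping might look like, written as a Python dictionary that loosely follows the layout of a TOSCA YAML template (`node_templates`, each with a `type` and a list of `requirements`). The block names and the type names under `example.` are invented for illustration and are not taken from the e-SC templates. The helper derives one valid deployment order from the block dependencies.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical two-block workflow: plot_map consumes the output of import_csv.
service_template = {
    "node_templates": {
        "import_csv": {
            "type": "example.nodes.WorkflowBlock",
            "requirements": [],
        },
        "plot_map": {
            "type": "example.nodes.WorkflowBlock",
            "requirements": [
                {"input": {"node": "import_csv",
                           "relationship": "example.relationships.BlockLink"}},
            ],
        },
    }
}


def deployment_order(template):
    """Return node names so that every block follows the blocks it depends on."""
    graph = {
        name: {target["node"]
               for requirement in node["requirements"]
               for target in requirement.values()}
        for name, node in template["node_templates"].items()
    }
    return list(TopologicalSorter(graph).static_order())


print(deployment_order(service_template))  # ['import_csv', 'plot_map']
```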
Data divergence analysis using provenance
All work done with reference to the e-Science Central WFMS
Assumption: workflow WF_j (new version) runs to completion, thus it produces a new provenance trace; however, it may be dysfunctional relative to WF_i (the original)
Example: only input data changes: d != d’, WF_j == WF_i
Note: results may diverge even when the input datasets are identical, for example when one or
more of the services exhibits non-deterministic behaviour, or depends on external state that has
changed between executions.
Provenance traces for two runs
[Figure: the provenance traces of the two runs, with “used” and “genBy” edges]
Delta graphs
A delta graph is obtained as the result of a “diff” between two traces. It can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.
This is the simplest possible delta “graph”!
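For the simple case where only the input data changes, the idea can be sketched as a lockstep walk backwards through the two traces, starting from the outputs. This is a simplified illustration and not the PDIFF algorithm from the paper. The trace encoding (each data item mapped to its value, the service that generated it, and the items that service used) and the function name `delta` are assumptions.

```python
def delta(trace_a, trace_b, node_a, node_b):
    """Walk both traces backwards in lockstep and collect the differing pairs."""
    value_a, service_a, inputs_a = trace_a[node_a]
    value_b, service_b, inputs_b = trace_b[node_b]
    if value_a == value_b:
        return []  # identical data: nothing upstream needs explaining
    pairs = [(node_a, node_b)]
    # Only descend while the two executions are still structurally aligned.
    if service_a == service_b and len(inputs_a) == len(inputs_b):
        for in_a, in_b in zip(inputs_a, inputs_b):
            pairs += delta(trace_a, trace_b, in_a, in_b)
    return pairs


# Same workflow (services S1, S2), different input data d.
run_1 = {
    "d": ("v1", None, []),
    "x": ("f(v1)", "S1", ["d"]),
    "out": ("g(f(v1))", "S2", ["x"]),
}
run_2 = {
    "d": ("v2", None, []),
    "x": ("f(v2)", "S1", ["d"]),
    "out": ("g(f(v2))", "S2", ["x"]),
}
print(delta(run_1, run_2, "out", "out"))
# [('out', 'out'), ('x', 'x'), ('d', 'd')]
```

A pair whose upstream inputs are all identical would point at a service that behaved differently between the runs, which is the non-deterministic case mentioned in the note on data divergence.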
More involved workflow differences
[Figure: two workflow versions, WA and WB; label: sv2]
The corresponding traces
Delta graph computed by PDIFF
References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati,
Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.”
In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014.
doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using
TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015.
doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014.
Pdiff: provenance differencing for understanding workflow differences
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing
for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience
(2013). doi:10.1002/cpe.3035.

More Related Content

What's hot

Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Flink Forward
 

What's hot (20)

Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
ISNCC 2017
ISNCC 2017ISNCC 2017
ISNCC 2017
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data Streams
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
parallel OLAP
parallel OLAPparallel OLAP
parallel OLAP
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Indexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and DeduplicationIndexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and Deduplication
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 

Similar to The lifecycle of reproducible science data and what provenance has got to do with it

Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
Ian Foster
 

Similar to The lifecycle of reproducible science data and what provenance has got to do with it (20)

Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
 
Identifying semantics characteristics of user’s interactions datasets through...
Identifying semantics characteristics of user’s interactions datasets through...Identifying semantics characteristics of user’s interactions datasets through...
Identifying semantics characteristics of user’s interactions datasets through...
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentation
 
Workflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to ReportingWorkflow Provenance: From Modelling to Reporting
Workflow Provenance: From Modelling to Reporting
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
Using Neo4j for exploring the research graph connections made by RD-Switchboard
Using Neo4j for exploring the research graph connections made by RD-SwitchboardUsing Neo4j for exploring the research graph connections made by RD-Switchboard
Using Neo4j for exploring the research graph connections made by RD-Switchboard
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis
 
The Role of Metadata in Reproducible Computational Research
The Role of Metadata in Reproducible Computational ResearchThe Role of Metadata in Reproducible Computational Research
The Role of Metadata in Reproducible Computational Research
 
Enabling semantic integration
Enabling semantic integration Enabling semantic integration
Enabling semantic integration
 
Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research Objects
 
Christine borgman keynote
Christine borgman keynoteChristine borgman keynote
Christine borgman keynote
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
UKON 2014
UKON 2014UKON 2014
UKON 2014
 
Tools für das Management von Forschungsdaten
Tools für das Management von ForschungsdatenTools für das Management von Forschungsdaten
Tools für das Management von Forschungsdaten
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphs
 
Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Str...
Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Str...Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Str...
Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Str...
 
RO Advisory Kickoff Slides
RO Advisory Kickoff SlidesRO Advisory Kickoff Slides
RO Advisory Kickoff Slides
 

More from Paolo Missier

Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

More from Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...

The lifecycle of reproducible science data and what provenance has got to do with it

  • 1. The lifecycle of reproducible science data and what provenance has got to do with it Paolo Missier School of Computing Science Newcastle University, UK Alan Turing Institute Symposium On Reproducibility for Data-Intensive Research Oxford, April 6, 2016 With material contributed by: Yang Cao, Bertram Ludascher, Tim McPhillips, Dave Vieglais, Matt Jones and the DataONE CyberInfrastructure group Rawaa Qasha at Newcastle University Carole Goble at the University of Manchester
  • 2. (Yet another) Data Lifecycle picture [diagram labels: search, discover; package, publish; spec(P’); deploy P’ → Env(dep’); prov(D’); compare (P, P’, D, D’); spec(P); prov(D); D → D1; P → P’; dep → dep’; <D, P, dep, spec(P), prov(D)>; compute; Env; D’; D1]
  • 3. Reproducibility: working. reporting [diagram labels: submit article and move on…; publish article; Research Environment; Publication Environment; Peer Review]
  • 4. Re-what? Re-*. ReRun: vary experiment and setup, same lab (P → P’, D → D’, dep → dep’). Repeat: same experiment, setup, lab (P, D, dep, env(dep)). Replicate: same experiment, setup, different lab (P, D, dep, env’(dep)). Reproduce: vary experiment and setup, different lab (P → P’, D → D’, dep → dep’, env(dep) → env’(dep’)). Reuse: different experiment (D, P → Q).
  • 5. Mapping the reproducibility space. Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results. Approach: compare provenance traces generated during the runs: PDIFF. P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013.
  • 6. Workflow evolution. Each of the elements in an execution may evolve (semi-)independently from the others. Repeatability: can tr_t be computed again at some time t’ > t? This requires saving ED_t, which may be impractical (e.g. large DB state).
  • 7. Reproducibility. Can a new version tr_t’ of tr_t be computed at some later time t’ > t, after one or more of the elements has changed? Potential issues: W_i may not run with new ED_j’; W_i may not run with wfms_k’; W_i’ may not run with d_h’; ...
  • 8. (Yet another) Data Lifecycle picture [diagram labels: search, discover; package, publish; spec(P’); deploy P’ → Env; D → D1; P → P’; dep → dep’; compute; Env; D’; prov(D’); compare (P, P’, D, D’); spec(P); prov(D)] [tools mapped onto the diagram: Research Objects; DataONE Federated Research Data Repositories; Matlab provenance recorder; TOSCA-based virtualisation; Pdiff: differencing provenance; YesWorkflow; Workflow Provenance; NoWorkflow; Matlab provenance recorder (DataONE); ReproZip]
  • 9. You are here. Data packaging: Research Objects. DataONE: data packaging, publication, search and discovery, hosting; R provenance recorder; process-as-a-dataflow: YesWorkflow. Process virtualisation using TOSCA. Provenance recorders: workflow provenance (Taverna, eScience Central, Kepler, Pegasus, VisTrails…); NoWorkflow: provenance recording for Python; Pdiff: provenance differencing for understanding workflow differences.
  • 10. Computational Workflow Runs: workflowrun.prov.ttl (RDF), outputA.txt, outputC.jpg, outputB/, intermediates/, 1.txt, 2.txt, 3.txt, de/def2e58b-50e2-4949-9980-fd310166621a.txt, inputA.txt, workflow; attribution, execution environment. Aggregating in Research Object: ZIP folder structure (RO Bundle), mimetype application/vnd.wf4ever.robundle+zip, .ro/manifest.json, URI references. Exchange, Reproducibility: same data, same code; systematic and extensible metadata collection. Workflow Annotation Profile. Wf4Ever Project.
  • 11. Manifests and Containers. Container packaging: Zip files, Docker images, BagIt, … Catalogues & Commons platforms: FAIRDOM SEEK, Farr Commons CKAN, STELAR eLab, myExperiment. Manifest metadata: describes the aggregated resources, their annotations and their provenance.
  • 12. Manifest Metadata. Manifest construction: Identification (id, title, creator, status, …); Aggregates (list of ids/links to resources); Annotations (list of annotations about resources). Manifest description: Checklists (what should be there); Provenance (where it came from); Versioning (its evolution); Dependencies (what else is needed).
  • 13. You are here. Data packaging: Research Objects. DataONE: data packaging, publication, search and discovery, hosting; R provenance recorder; process-as-a-dataflow: YesWorkflow. Process virtualisation using TOSCA. Provenance recorders: workflow provenance (Taverna, eScience Central, Kepler, Pegasus, VisTrails…); NoWorkflow: provenance recording for Python; Pdiff: provenance differencing for understanding workflow differences.
  • 14. Components for a flexible, scalable, sustainable network. Cyberinfrastructure Component 2. Member Nodes: www.dataone.org/member-nodes. Coordinating Nodes: retain complete metadata catalog; indexing for search; network-wide services; ensure content availability (preservation); replication services. Member Nodes: diverse institutions; serve local community; provide resources for managing their data; retain copies of data.
  • 15. Cyberinfrastructure [diagram labels: Data Services: extraction, sub-setting etc.; Provenance; Semantics-enabled Discovery; ontology; annotation; System Metadata; Science Data; Search API; Science Metadata; Provenance; Replicate; Metadata Index]
  • 17. Use Provenance for Transparency, Reproducibility. What input data went into this study? What methods were used? … with what parameter settings, calibrations, …? Can we trust the data and methods? Provenance (lineage): track origin and processing history of data; trust, data quality ~ audit trail for attribution, credit. Discovery of data, methodologies, experiments.
  • 18. W3C PROV model. W3C has published the ‘PROV’ standard. Entity, Activity, Agent; relations: wasAssociatedWith, wasAttributedTo, wasGeneratedBy, used. See w3.org/TR/prov-o/
  • 20. Using a common model. Example: scientific workflow [diagram labels: map image, R script, Execution, Scientist, CSV data; relations: wasAssociatedWith, wasAttributedTo, wasGeneratedBy, used, wasDerivedFrom]
  • 21. Using a common model. Example: scientific workflow [diagram labels: map image, R script, Execution, Scientist, CSV data; relations: wasAssociatedWith, wasAttributedTo, wasGeneratedBy, used, wasDerivedFrom] < “map image” wasDerivedFrom “CSV data” >
  • 22. ProvONE Motivation: Different Kinds of Provenance. Prospective provenance: method/workflow description (“workflow-land”). Retrospective provenance: runtime provenance tracking (“trace-land”). Better together!
  • 23. ProvONE extends PROV for science! “Trace-Land”, “Workflow-Land”, “Data-Land”. http://purl.dataone.org/provone-v1-dev
  • 24. DataONE data packages: Provenance inside! [diagram labels: resource map; science metadata; science data; OAI-ORE with ProvONE trace; figures; software; system metadata attached to each item]
  • 27. YesWorkflow (YW): Scripts as prospective provenance
      # @begin CreateGulfOfAlaskaMaps
      # @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
      # @in world @as RWorldMap
      # @out map @as Map_Of_Sampling_Locations.png
      # @out detailMap @as Detailed_Map_Of_SamplingLocations.png
      ... mapping code is here ...
      # @end CreateGulfOfAlaskaMaps
  • 28. YesWorkflow (YW): Scripts as prospective provenance. MATLAB, R, Python … scripts. Script + @YW-annotation: workflow-land & trace-land. Combine provenance: prospective (workflow); retrospective (runtime trace); reconstructed (logs, files, …). User can query own data & provenance prior to sharing. Incentive: accelerate work! “Provenance for Self”.
  • 29. Transitive Credit. When a user cites a pub, we know: which data produced it; what software produced it; what was derived from it; who to credit down the attribution stack. Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117. Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Data Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award)
  • 30. Provenance today: Important but hard. Climate Change Impacts in the United States: U.S. National Climate Assessment, U.S. Global Change Research Program. “This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
  • 31. Provenance today: Important but hard [labels: data and “code” / method linked; alt formats]
  • 32. Provenance in action. Yaxing’s script with inputs & output products; YesWorkflow model; Christopher using Yaxing’s outputs as inputs for his script; Christopher’s results can be traced back all the way to Yaxing’s input.
  • 33. You are here. Data packaging: Research Objects. DataONE: data packaging, publication, search and discovery, hosting; R provenance recorder; process-as-a-dataflow: YesWorkflow. Process virtualisation using TOSCA. Provenance recorders: workflow provenance (Taverna, eScience Central, Kepler, Pegasus, VisTrails…); NoWorkflow: provenance recording for Python; Pdiff: provenance differencing for understanding workflow differences.
  • 34. TOSCA: Topology and Orchestration Specification for Cloud Applications
  • 35. Use Case: e-Science Central Workflow. http://www.esciencecentral.co.uk
  • 36. TOSCA-based mapping of an e-SC Workflow. Workflow components as Node Types; block dependencies as Relationship Types.
  • 37. e-SC Workflow Service Template
  • 38. You are here. Data packaging: Research Objects. DataONE: data packaging, publication, search and discovery, hosting; R provenance recorder; process-as-a-dataflow: YesWorkflow. Process virtualisation using TOSCA. Provenance recorders: workflow provenance (Taverna, eScience Central, Kepler, Pegasus, VisTrails…); NoWorkflow: provenance recording for Python; Pdiff: provenance differencing for understanding workflow differences.
  • 39. Data divergence analysis using provenance. All work done with reference to the e-Science Central WFMS. Assumption: workflow WFj (new version) runs to completion, thus it produces a new provenance trace; however, it may be dysfunctional relative to WFi (the original). Example: only input data changes: d != d’, WFj == WFi. Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
  • 41. Delta graphs. A graph obtained as a result of traces “diff”, which can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions. This is the simplest possible delta “graph”!
  • 45. References. Research Objects (www.researchobject.org): Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004. DataONE (dataone.org): Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332. Process Virtualisation using TOSCA: Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146. NoWorkflow (provenance recording for Python): Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014. Pdiff (provenance differencing for understanding workflow differences): Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013). doi:10.1002/cpe.3035.
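
The trace comparison described on slides 39 and 41 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Pdiff algorithm from the cited paper: it assumes each provenance trace is a set of (subject, relation, object) statements using the PROV relation names from slide 18, and it computes the delta as plain set differences. The trace contents, the `@v1`/`@v2` version suffixes, and the names `Statement` and `trace_delta` are all hypothetical.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One provenance statement: subject --relation--> obj."""
    subject: str
    relation: str  # a PROV relation name, e.g. "used" or "wasGeneratedBy"
    obj: str


def trace_delta(old, new):
    """Split two traces (sets of Statement) into removed, added and common parts."""
    return {
        "removed": old - new,
        "added": new - old,
        "common": old & new,
    }


# Hypothetical traces for the slide 39 example: same workflow, different input.
# Data items carry a made-up version suffix so that changed content is visible.
trace_i = {
    Statement("run", "wasAssociatedWith", "WF"),
    Statement("run", "used", "d@v1"),
    Statement("out@v1", "wasGeneratedBy", "run"),
}
trace_j = {
    Statement("run", "wasAssociatedWith", "WF"),
    Statement("run", "used", "d@v2"),
    Statement("out@v2", "wasGeneratedBy", "run"),
}

delta = trace_delta(trace_i, trace_j)
for part in ("removed", "added"):
    for stmt in sorted(delta[part]):
        print(part, stmt.subject, stmt.relation, stmt.obj)
```

In this toy case the statements shared by both runs drop out, and what remains links the changed output back to the changed input. A real delta graph would also have to align nodes across the two traces rather than rely on matching identifiers.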

Editor's Notes

  1. Packaging: physical and logical containers. Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources (http://www.openarchives.org/ore/). It uses a Resource Map to describe the aggregated resources; proxies allow for statements about the resources within the aggregation, capturing context and viewpoints. There are several concrete serialisations: RDF/XML, Atom, RDFa. The Open Annotation specification is a community-developed data model for annotation of web resources (http://www.openannotation.org/spec/core/), developed by the W3C Open Annotation Community Group. It allows for “stand-off” annotations, treats annotation as a first-class citizen, and was developed to fit with Web Architecture. How do you make a research object? Well, gather your resources and describe them in the manifest. Different types of containers can be used to transfer and package the Research Object. The Research Object Bundle is a structured ZIP file format, but more specific and more general formats are also used, such as Docker images (a bit low-level, capturing the whole execution environment), BagIt (a digital archiving format that is commonly used by libraries), or simply existing Web resources (which may be subject to change). You can register and archive research objects in domain-specific repositories like FAIRDOM’s SEEK (systems biology models) and FARR Commons CKAN (public health medical data), technology-specific repositories (myExperiment for workflow-centric workflows), or generic data repositories you have probably already heard of, like Zenodo and Figshare.
  2. Linked Resource Model very relevant Dublin Core Application Profile Pericles Linked Resource Model Identification includes properties for identifying the “mime type” annotation profile of the RO
  3. Need to update with new / upcoming MN locations and logos Amber notes: Retain CN, MN logo? Required if used elsewhere, if not cut? Not all MN logos will fit – select representative or cut? Cross reference with google MN Rebecca: Need updated logos for KNB, AOOS (FIXED) – I would select a different set of MNs to highlight since all won’t fit
  4. Rebecca: Can we do a better job than the quad chart? If not, are all the logos in 1st quadrant appropriate?
  5. Update before RSV Figure shows from 2020 – edit?
  6. Rebecca: the green axis and legend on the right is difficult to read – another color would be better. Bertram: Agreed. But this isn’t our chart. Maybe we can “patch” it? Also: should credit source!
  7. Still missing; EYE CANDY Also removed (redundant with next slide!): DataONE Provenance Products & Tools: New ProvONE model extends W3C PROV standard for workflows New Matlab provenance recorder ITK also includes R, Python recorders DataONE Web UI integration UI is “provenance-aware”
  8. These statements are the low-level pieces of information that we keep track of.
  9. These statements are the low-level pieces of information that we keep track of.
  10. These statements are the low-level pieces of information that we keep track of.
  11. These statements are the low-level pieces of information that we keep track of.
  12. We want to enhance analysis software that scientists are already familiar with. So for our first round, we are working on a Matlab Toolbox and an R library. In conjunction with Bertram, Paolo, and other colleagues, we are incorporating the YesWorkflow Java library into our Matlab Toolbox to capture ‘prospective’ provenance.
  13. Is the logo supposed to be R or ONE R?
  14. Use tools, concepts scientists are already familiar with
  15. Query 3: Where is the raw image corresponding to corrected image DRT322_11000ev_028.img? Scientist: Look at the image files nested within the raw directory. Find the image file that contains the values DRT322, 11000, and 028 in the file access path. YW: Extract the URI template variable names and values from the path to DRT322_11000ev_028.img output by the port named corrected_image, look at the paths for all files output by the raw_image port, and return the file whose path includes template variables with names and values matching those for DRT322_11000ev_028.img.
  16. In the DataONE Search, we can search for ‘grass’, and two data packages show up. The Yaxing Wei (Alice) soil map processing workflow and the Christopher Schwalm (Bob) analysis workflows both show that they have provenance information associated with the Data Packages (via the icon in the search record). We next will choose the Wei’s Data Package to see the details. This can be seen at https://search-sandbox-2.test.dataone.org.
  17. Viewing the Wei soil processing workflow we see on the left that the Matlab script (C3_C4_map_present_with_comments.m) has 25 inputs. It also has 6 outputs on the right. The top three outputs are the YesWorkflow diagrams (dataflow, processflow, combined). The bottom three are the NetCDF data files that represent three different world map grids of percentage of grass types (C3 grass fraction, C4 grass fraction, and total grass fraction). The script can be downloaded with the Download button in the center. This can be accessed at https://search-sandbox-2.test.dataone.org/#view/metadata_e859d2dd-c5e6-4ec6-892f-1b00bb6f8f65.xml. Bertram, if you want to show the YesWorkflow diagram (combined) for this run showing how monthly air and precipitation values are used as the inputs, the combined diagram can be accessed from this page, or directly from https://cn-sandbox-2.test.dataone.org/cn/v2/resolve/d87e1a6a-1a78-4f96-bba8-cb74ac2b1efb