SlideShare a Scribd company logo
1 of 34
Download to read offline
Documenting data, methods,
software, provenance and context
of publications:
Towards the Scientific Paper of the
Future
Daniel Garijo, Yolanda Gil
Jan, 2019
prov:derivedFrom
http://dx.doi.org/10.5281/zenodo.159206
http://www.scientificpaperofthefuture.org
ICER-1440323
ICER-1343800
EarthCube!
Reproducibility
Financial
Human lives
Reliability
Scientific
integrity
Financial
Trust
5/ 29/ 15, 1:49 AMRetracted Scientific Studies: A Growing List - NYTimes.com
Sections Home Search Skip to content
Advertisement
Email
Share
Tweet
More
Search
Subscribe
Log In 0 Settings
Close search
SUBSCRIBE NOW
5/ 29/ 15, 1:49 AM
a study of changing attitudes about gay marriage is
wal of research results from scientific literature.
e last. A 2011 study in Nature found a 10-fold
during the preceding decade.
ster outside of the scientific field. But in some
ere clawed back made major waves in societal
dealt with. This list recounts some prominent
d since 1980.
Photo
h medical journal,
rew Wakefield
children was
ine for measles,
The Lancet
a review of Dr.
ds and financial
dy, Dr.
rong effect on
d
Methodology
Scientists Are Changing
Open publications
Open data
Open access
Open source
Impact and credit
Publishers Are Changing
Funders Are Changing
Modern Scientific Articles
Text:
Narrative of method,
the data is in tables, figures/plots,
the software used is mentioned
Software:
scripted codes + manual steps +
documentation in notes/emails
Data:
Supplementary materials,
pointers to data repositories
Modern Published Articles
NOT published,
loosely recorded:
Text:
Narrative of method,
the data is in tables, figures/plots,
the software used is mentioned
Traditional Published Articles
Reproducible Articles
Provenance and Workflow:
Workflow/scripts describing
dataflow, codes, and parameters
Text:
Narrative of method,
the data is in tables, figures/plots,
the software used is mentioned
Reproducible Publications
Data:
Supplementary materials,
pointers to data repositories
Software:
Data preparation,
data analysis, and visualization
Text:
Narrative of method,
the data is in tables, figures/plots,
the software used is mentioned
Software:
scripted codes + manual steps +
documentation in notes/emails
Data:
Supplementary materials,
pointers to data repositories
Modern Published Articles
NOT published,
loosely recorded:
Provenance and methods:
Workflow/scripts describing
dataflow, codes, and parameters
Text:
Narrative of method,
the data is in tables, figures/plots,
the software used is mentioned
Reproducible Publications
Data:
Supplementary materials,
pointers to data repositories
Software:
Data preparation,
data analysis, and visualization
Beyond Reproducible
Publications
Is this sufficient?
The Scientific Paper
of the Future has
further requirements
Scientific Paper of the Future
What is a
Scientific Paper of the Future
 Data: Available in a public repository, including
documentation (metadata), a clear license specifying
conditions of use, and citable using a unique and persistent
identifier.
 Software: Available in a public repository, with
documentation (metadata), a license for reuse, and citable
using a unique persistent identifier.
 Not only major software used, but also other ancillary software for
data reformatting, data conversions, data filtering, and data
visualization.
 Provenance: Documented for all results by explicitly
describing the series of computations and their outcome with a
provenance record of the execution traces and a workflow
sketch (or formal workflow)
 Possibly in a shared repository and with a unique and persistent
identifier.
Outline
Data
Software & Software metadata
Methods
Provenance
Problems with Current Practice
 Data is often not made
available in publications
 Lack of reproducibility
 Data made available through
investigator’s URL
 URL does not resolve (i.e.,
‘’rotten’’)
We analyze a vast collection of articles from three corpora that span
publication years 1997 to 2012. For over one million references to
web resources extracted from over 3.5 million articles, we observe
that the fraction of articles containing references to web resources is
growing steadily over time. We find one out of five STM articles
suffering from reference rot, meaning it is impossible to revisit the
web context that surrounds them some time after their publication.
When only considering STM articles that contain references to web
resources, this fraction increases to seven out of ten.
Making Data Accessible:
Overview of Best Practices
1
2
3
4
5
1. Uniform Resource Locator
(URL)
2. Persistent URL (PURL)
3. Digital Object Identifier
Outline
Data
Software & Software metadata
Methods
Provenance
Data Preparation Software
Dominates but is Least Shared
 “Scientists and engineers spend more than 60% of their time just
preparing the data for model input or data-model comparison”
(NASA A40)
Result
“Common Motifs in Scientific Workflows: An Empirical Analysis.” Garijo, D.; Alper, P.;
Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.
“Dark Software”
 Models that are not
published
 Eg from a PhD thesis
 Data preparation
software
 Visualization software
“Dark Software” is the counterpart of “Dark Data” [Heidorn 2008]
Making Software Accessible
from a Public Location
Options:
 Publish in your web site
 Very easy and simple
 Get a PURL for the version you use in the
paper
 Use a data repository (eg zenodo), treating code
like data
 Very easy and simple
 It allows you to get a DOI
 Use a code repository (eg GitHub, BitBucket)
 Beneficial if you have other users or want to
track new versions
 Some will give you a DOI (eg GitHub)
 Create a formal community project (eg in
Apache)
Choosing an Open Source
License
 Copyright: automatically applied to software when it is created to
grant the creator exclusive rights as an intellectual property
 Open source license: reduce constraints and enable software
developers to make their source code available to public
1. “Copyleft” license (ex: GNU General Public License (GPL))
2. “Permissive” license (ex: Apache 2 or MIT licenses)
 Open Source Initiative
 Choose a license from: http://opensource.org/licenses
 Recommend that you choose a permissive license
 Apache v2
Software Citation Format
 Similar to data citation format, but includes software version
Garijo, Daniel;Xie, Lei; Zhang, Yinliang; Gil, Yolanda;
Xie, Li (2013) Tool for computing anomalies, GitHub.
V.1
http://dx.doi.org/10.5281/zenodo.18765
Retrieved 11:05, Feb, 15, 2015 (GMT)
Time of
retrieval
Authors
Date of
publication
Persistent
unique identifier
Version
RepositoryName
Software
Metadata
 Describe characteristics of the
software that others can
understand, discover (find),
and compare software
 Six major categories of
software metadata
 Developed as part of the
OntoSoft project
 http://www.ontosoft.org/software
Describing Software with OntoSoft
Automatic crawlers
import metadata from
code repositories (eg
GitHub)
Questions for 6 top
categories, some
“important” and
some “optional”
http://www.ontosoft.org/portal
Currently >600 entries, many
imported from CSDMS, C4P, …
Publishing Software Metadata
with OntoSoft
Publish metadata as HTML
from OntoSoft and add pointer
from software repository
http://www.ontosoft.org/portal
Outline
Data
Software & Software metadata
Methods
Provenance
Methods Described in Text Are
Incomplete and Ambigous
 Analysis of 18 quantitative papers published in Nature Genetics in the
past two years found that reproducibility was not achievable even in
principle in 10 cases, even when datasets are published [Ioannidis et al
09]
 “Data processing, however, is often not described well enough to allow
for exact reproduction of the results, leading to exercises in ‘forensic
bioinformatics’ where aspects of raw data and reported results are
used to infer what methods must have been employed.” [Baggerly and
Coombes 09]
 “Ambiguity in program descriptions leads to the possibility, if not the certainty,
that a given natural language description can be converted into computer code
in various ways, each of which may lead to different numerical outcomes.”
[Ince et al 2012]
Workflows as Representations
of Computational Methods
 Computational workflow
 Eg, water metabolism
 Workflows can include
manual steps
 Eg, creating a figure,
cleaning data
 Workflows may access web
services
 Eg, access databases in
biology
Workflow Systems
 Capture method as a workflow
 Workflow can be easily shared
and reused
 Other benefits
 Workflow validation
 Scalable computations
 Comprehensive software
libraries
 Many workflow systems
 Each has different
capabilities
Electronic Notebooks
http://ipython.org/notebook.html
Outline
Data
Software & Software metadata
Methods
Provenance
A Working Definition of
Provenance
 Provenance can be seen as metadata, but not all metadata is provenance
Provenance of a resource is a record that describes entities
and processes involved in producing and delivering or
otherwise influencing that resource.
Provenance provides a critical foundation for assessing
authenticity, enabling trust, and allowing reproducibility.
http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
 Provenance results from past actions
Describing Execution (Provenance)
vs General Method (Workflow)
Workflow generation
and execution
Specification, modeling
and generation
Publication Exploitation
Workflow system 1
Workflow system 2
Users
Adapted methodology
Workflow
Template
Workflow
Execution log
OPMW-PROV
export
Workflow
Template
Workflow
Execution log
OPMW-PROV
export
RDF
Triple Store
Permanent
web-accessible
file store
File upload
interface
SPARQLendpoint
RDF upload
interface
Pubbywebservice
Applications
(Wexp)
Uploadmanager
URI naming
convention
Capturing Scientific Workflows
and their provenance as web
resources
Scientific Paper of the Future
Acknowledgments
 The Scientific Paper of the Future training materials were developed and edited by
Yolanda Gil (USC), based on the OntoSoft Geoscience Paper of the Future (GPF)
training materials with contributions from the OntoSoft team including Chris Duffy
(PSU), Chris Mattmann (JPL), Scott Peckham (CU), Ji-Hyun Oh (USC), Varun
Ratnakar (USC), Erin Robinson (ESIP)
 The OntoSoft training materials were significantly improved through input from GPF
pioneers Cedric David (JPL), Ibrahim Demir (UI), Bakinam Essawy (UV), Robinson
W. Fulweiler (BU), Jon Goodall (UV), Leif Karlstrom (UO), Kyo Lee (JPL), Heath
Mills (UH), Suzanne Pierce (UT), Allen Pope (CU), Mimi Tzeng (DISL), Karan
Venayagamoorthy (CSU), Sandra Villamizar (UC), and Xuan Yu (UD)
 Thank you to Ruth Duerr (NSIDC), James Howison (UT), Matt Jones (UCSB), Lisa
Kempler (Matworks), Kerstin Lehnert (LDEO), Matt Meyernick (NCAR), and Greg
Wilson (Software Carpentry) for feedback on best practices
 Thank you also to the many scientists and colleagues that have taken the training
and asked hard questions
 We are grateful for the support of the National Science Foundation and the
EarthCube program
EarthCube! ICER-1440323
ICER-1343800
For More Information
http://www.scientificpaperofthefuture.org
Rainfall
Snowmelt
Snowfall
Rechar ge
Groundwater discharge
Runof
Transpiration
Canopy
Evap.
Soil Evap.
UnsaturatedZone
SaturatedZone
EarthCube!
ICER-1440323
ICER-1343800
http://dx.doi.org/10.5281/zenodo.15920

More Related Content

More from dgarijo

WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular DataWDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular Datadgarijo
 
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...dgarijo
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019dgarijo
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Sciencedgarijo
 
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...dgarijo
 
WIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting OntologiesWIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting Ontologiesdgarijo
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narrativesdgarijo
 
Automated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific WorkflowsAutomated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific Workflowsdgarijo
 
OntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific SoftwareOntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific Softwaredgarijo
 
OEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology EngineeringOEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology Engineeringdgarijo
 
Software Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesSoftware Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesdgarijo
 
Reproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An OverviewReproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An Overviewdgarijo
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsdgarijo
 
Publicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigaciónPublicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigacióndgarijo
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overviewdgarijo
 
Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)dgarijo
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsdgarijo
 
Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods dgarijo
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015dgarijo
 
Towards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard RepresentationsTowards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard Representationsdgarijo
 

More from dgarijo (20)

WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular DataWDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
 
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
 
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
 
WIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting OntologiesWIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting Ontologies
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narratives
 
Automated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific WorkflowsAutomated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific Workflows
 
OntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific SoftwareOntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific Software
 
OEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology EngineeringOEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology Engineering
 
Software Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesSoftware Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciences
 
Reproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An OverviewReproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An Overview
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflows
 
Publicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigaciónPublicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigación
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overview
 
Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
 
Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015
 
Towards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard RepresentationsTowards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard Representations
 

Recently uploaded

Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 
3.26.24 Race, the Draft, and the Vietnam War.pptx
3.26.24 Race, the Draft, and the Vietnam War.pptx3.26.24 Race, the Draft, and the Vietnam War.pptx
3.26.24 Race, the Draft, and the Vietnam War.pptxmary850239
 
Department of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfDepartment of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfMohonDas
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17Celine George
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...raviapr7
 
Riddhi Kevadiya. WILLIAM SHAKESPEARE....
Riddhi Kevadiya. WILLIAM SHAKESPEARE....Riddhi Kevadiya. WILLIAM SHAKESPEARE....
Riddhi Kevadiya. WILLIAM SHAKESPEARE....Riddhi Kevadiya
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICESayali Powar
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxAditiChauhan701637
 
How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17Celine George
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRATanmoy Mishra
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptxSandy Millin
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17Celine George
 
Over the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxOver the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxraviapr7
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.EnglishCEIPdeSigeiro
 
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptxSlides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptxCapitolTechU
 

Recently uploaded (20)

Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
3.26.24 Race, the Draft, and the Vietnam War.pptx
3.26.24 Race, the Draft, and the Vietnam War.pptx3.26.24 Race, the Draft, and the Vietnam War.pptx
3.26.24 Race, the Draft, and the Vietnam War.pptx
 
Department of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfDepartment of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdf
 
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdfPersonal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...
 
Riddhi Kevadiya. WILLIAM SHAKESPEARE....
Riddhi Kevadiya. WILLIAM SHAKESPEARE....Riddhi Kevadiya. WILLIAM SHAKESPEARE....
Riddhi Kevadiya. WILLIAM SHAKESPEARE....
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17
 
March 2024 Directors Meeting, Division of Student Affairs and Academic Support
March 2024 Directors Meeting, Division of Student Affairs and Academic SupportMarch 2024 Directors Meeting, Division of Student Affairs and Academic Support
March 2024 Directors Meeting, Division of Student Affairs and Academic Support
 
Over the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptxOver the counter (OTC)- Sale, rational use.pptx
Over the counter (OTC)- Sale, rational use.pptx
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptxSlides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
Slides CapTechTalks Webinar March 2024 Joshua Sinai.pptx
 

Documenting data, methods, software, provenance and context of publications: Towards the Scientific Paper of the Future

  • 1. Documenting data, methods, software, provenance and context of publications: Towards the Scientific Paper of the Future Daniel Garijo, Yolanda Gil Jan, 2019 prov:derivedFrom http://dx.doi.org/10.5281/zenodo.159206 http://www.scientificpaperofthefuture.org ICER-1440323 ICER-1343800 EarthCube!
  • 2. Reproducibility Financial Human lives Reliability Scientific integrity Financial Trust 5/ 29/ 15, 1:49 AMRetracted Scientific Studies: A Growing List - NYTimes.com Sections Home Search Skip to content Advertisement Email Share Tweet More Search Subscribe Log In 0 Settings Close search SUBSCRIBE NOW 5/ 29/ 15, 1:49 AM a study of changing attitudes about gay marriage is wal of research results from scientific literature. e last. A 2011 study in Nature found a 10-fold during the preceding decade. ster outside of the scientific field. But in some ere clawed back made major waves in societal dealt with. This list recounts some prominent d since 1980. Photo h medical journal, rew Wakefield children was ine for measles, The Lancet a review of Dr. ds and financial dy, Dr. rong effect on d Methodology
  • 3. Scientists Are Changing Open publications Open data Open access Open source Impact and credit
  • 6. Modern Scientific Articles Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned Software: scripted codes + manual steps + documentation in notes/emails Data: Supplementary materials, pointers to data repositories Modern Published Articles NOT published, loosely recorded: Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned Traditional Published Articles
  • 7. Reproducible Articles Provenance and Workflow: Workflow/scripts describing dataflow, codes, and parameters Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned Reproducible Publications Data: Supplementary materials, pointers to data repositories Software: Data preparation, data analysis, and visualization Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned Software: scripted codes + manual steps + documentation in notes/emails Data: Supplementary materials, pointers to data repositories Modern Published Articles NOT published, loosely recorded:
  • 8. Provenance and methods: Workflow/scripts describing dataflow, codes, and parameters Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned Reproducible Publications Data: Supplementary materials, pointers to data repositories Software: Data preparation, data analysis, and visualization Beyond Reproducible Publications Is this sufficient? The Scientific Paper of the Future has further requirements
  • 9. Scientific Paper of the Future
  • 10. What is a Scientific Paper of the Future  Data: Available in a public repository, including documentation (metadata), a clear license specifying conditions of use, and citable using a unique and persistent identifier.  Software: Available in a public repository, with documentation (metadata), a license for reuse, and citable using a unique persistent identifier.  Not only major software used, but also other ancillary software for data reformatting, data conversions, data filtering, and data visualization.  Provenance: Documented for all results by explicitly describing the series of computations and their outcome with a provenance record of the execution traces and a workflow sketch (or formal workflow)  Possibly in a shared repository and with a unique and persistent identifier.
  • 11. Outline Data Software & Software metadata Methods Provenance
  • 12. Problems with Current Practice  Data is often not made available in publications  Lack of reproducibility  Data made available through investigator’s URL  URL does not resolve (i.e., ‘’rotten’’) We analyze a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.
  • 13. Making Data Accessible: Overview of Best Practices 1 2 3 4 5 1. Uniform Resource Locator (URL) 2. Persistent URL (PURL) 3. Digital Object Identifier
  • 14. Outline Data Software & Software metadata Methods Provenance
  • 15. Data Preparation Software Dominates but is Least Shared  “Scientists and engineers spend more than 60% of their time just preparing the data for model input or data-model comparison” (NASA A40) Result “Common Motifs in Scientific Workflows: An Empirical Analysis.” Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.
  • 16. “Dark Software”  Models that are not published  Eg from a PhD thesis  Data preparation software  Visualization software “Dark Software” is the counterpart of “Dark Data” [Heidorn 2008]
  • 17. Making Software Accessible from a Public Location Options:  Publish in your web site  Very easy and simple  Get a PURL for the version you use in the paper  Use a data repository (eg zenodo), treating code like data  Very easy and simple  It allows you to get a DOI  Use a code repository (eg GitHub, BitBucket)  Beneficial if you have other users or want to track new versions  Some will give you a DOI (eg GitHub)  Create a formal community project (eg in Apache)
  • 18. Choosing an Open Source License  Copyright: automatically applied to software when it is created to grant the creator exclusive rights as an intellectual property  Open source license: reduce constraints and enable software developers to make their source code available to public 1. “Copyleft” license (ex: GNU General Public License (GPL)) 2. “Permissive” license (ex: Apache 2 or MIT licenses)  Open Source Initiative  Choose a license from: http://opensource.org/licenses  Recommend that you choose a permissive license  Apache v2
  • 19. Software Citation Format  Similar to data citation format, but includes software version Garijo, Daniel;Xie, Lei; Zhang, Yinliang; Gil, Yolanda; Xie, Li (2013) Tool for computing anomalies, GitHub. V.1 http://dx.doi.org/10.5281/zenodo.18765 Retrieved 11:05, Feb, 15, 2015 (GMT) Time of retrieval Authors Date of publication Persistent unique identifier Version RepositoryName
  • 20. Software Metadata  Describe characteristics of the software that others can understand, discover (find), and compare software  Six major categories of software metadata  Developed as part of the OntoSoft project  http://www.ontosoft.org/software
  • 21. Describing Software with OntoSoft Automatic crawlers import metadata from code repositories (eg GitHub) Questions for 6 top categories, some “important” and some “optional” http://www.ontosoft.org/portal Currently >600 entries, many imported from CSDMS, C4P, …
  • 22. Publishing Software Metadata with OntoSoft Publish metadata as HTML from OntoSoft and add pointer from software repository http://www.ontosoft.org/portal
  • 23. Outline Data Software & Software metadata Methods Provenance
  • 24. Methods Described in Text Are Incomplete and Ambigous  Analysis of 18 quantitative papers published in Nature Genetics in the past two years found that reproducibility was not achievable even in principle in 10 cases, even when datasets are published [Ioannidis et al 09]  “Data processing, however, is often not described well enough to allow for exact reproduction of the results, leading to exercises in ‘forensic bioinformatics’ where aspects of raw data and reported results are used to infer what methods must have been employed.” [Baggerly and Coombes 09]  “Ambiguity in program descriptions leads to the possibility, if not the certainty, that a given natural language description can be converted into computer code in various ways, each of which may lead to different numerical outcomes.” [Ince et al 2012]
  • 25. Workflows as Representations of Computational Methods  Computational workflow  Eg, water metabolism  Workflows can include manual steps  Eg, creating a figure, cleaning data  Workflows may access web services  Eg, access databases in biology
  • 26. Workflow Systems  Capture method as a workflow  Workflow can be easily shared and reused  Other benefits  Workflow validation  Scalable computations  Comprehensive software libraries  Many workflow systems  Each has different capabilities
  • 28. Outline Data Software & Software metadata Methods Provenance
  • 29. A Working Definition of Provenance  Provenance can be seen as metadata, but not all metadata is provenance Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance  Provenance results from past actions
  • 30. Describing Execution (Provenance) vs General Method (Workflow)
  • 31. Workflow generation and execution Specification, modeling and generation Publication Exploitation Workflow system 1 Workflow system 2 Users Adapted methodology Workflow Template Workflow Execution log OPMW-PROV export Workflow Template Workflow Execution log OPMW-PROV export RDF Triple Store Permanent web-accessible file store File upload interface SPARQLendpoint RDF upload interface Pubbywebservice Applications (Wexp) Uploadmanager URI naming convention Capturing Scientific Workflows and their provenance as web resources
  • 32. Scientific Paper of the Future
  • 33. Acknowledgments  The Scientific Paper of the Future training materials were developed and edited by Yolanda Gil (USC), based on the OntoSoft Geoscience Paper of the Future (GPF) training materials with contributions from the OntoSoft team including Chris Duffy (PSU), Chris Mattmann (JPL), Scott Peckham (CU), Ji-Hyun Oh (USC), Varun Ratnakar (USC), Erin Robinson (ESIP)  The OntoSoft training materials were significantly improved through input from GPF pioneers Cedric David (JPL), Ibrahim Demir (UI), Bakinam Essawy (UV), Robinson W. Fulweiler (BU), Jon Goodall (UV), Leif Karlstrom (UO), Kyo Lee (JPL), Heath Mills (UH), Suzanne Pierce (UT), Allen Pope (CU), Mimi Tzeng (DISL), Karan Venayagamoorthy (CSU), Sandra Villamizar (UC), and Xuan Yu (UD)  Thank you to Ruth Duerr (NSIDC), James Howison (UT), Matt Jones (UCSB), Lisa Kempler (Matworks), Kerstin Lehnert (LDEO), Matt Meyernick (NCAR), and Greg Wilson (Software Carpentry) for feedback on best practices  Thank you also to the many scientists and colleagues that have taken the training and asked hard questions  We are grateful for the support of the National Science Foundation and the EarthCube program EarthCube! ICER-1440323 ICER-1343800
  • 34. For More Information http://www.scientificpaperofthefuture.org Rainfall Snowmelt Snowfall Rechar ge Groundwater discharge Runof Transpiration Canopy Evap. Soil Evap. UnsaturatedZone SaturatedZone EarthCube! ICER-1440323 ICER-1343800 http://dx.doi.org/10.5281/zenodo.15920

Editor's Notes

  1. URI: Resource Identifier / URL: Uniform Resource Locator (a type of URI)