Documenting data, methods, software, provenance and context of publications: Towards the Scientific Paper of the Future

A summary description of the best practices scientists may follow to make their data, software, methods and experiment context reusable.

Published in: Education

  1. Documenting data, methods, software, provenance and context of publications: Towards the Scientific Paper of the Future. Daniel Garijo, Yolanda Gil. Jan, 2019. prov:derivedFrom http://dx.doi.org/10.5281/zenodo.159206 See http://www.scientificpaperofthefuture.org (EarthCube, ICER-1440323, ICER-1343800)
  2. Reproducibility. Diagram labels: Financial, Human lives, Reliability, Scientific integrity, Trust, Methodology. [Screenshot: "Retracted Scientific Studies: A Growing List", NYTimes.com, 5/29/15]
  3. Scientists Are Changing: Open publications, Open data, Open access, Open source, Impact and credit
  4. Publishers Are Changing
  5. Funders Are Changing
  6. Modern Scientific Articles
      • Traditional Published Articles
        • Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned
      • Modern Published Articles
        • Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned
        • Data: Supplementary materials, pointers to data repositories
      • NOT published, loosely recorded
        • Software: scripted codes + manual steps + documentation in notes/emails
  7. Reproducible Articles
      • Reproducible Publications
        • Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned
        • Data: Supplementary materials, pointers to data repositories
        • Software: Data preparation, data analysis, and visualization
        • Provenance and Workflow: Workflow/scripts describing dataflow, codes, and parameters
      • Modern Published Articles
        • Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned
        • Data: Supplementary materials, pointers to data repositories
      • NOT published, loosely recorded
        • Software: scripted codes + manual steps + documentation in notes/emails
  8. Beyond Reproducible Publications
      • Reproducible Publications
        • Text: Narrative of method, the data is in tables, figures/plots, the software used is mentioned
        • Data: Supplementary materials, pointers to data repositories
        • Software: Data preparation, data analysis, and visualization
        • Provenance and methods: Workflow/scripts describing dataflow, codes, and parameters
      • Is this sufficient? The Scientific Paper of the Future has further requirements
  9. Scientific Paper of the Future
  10. What is a Scientific Paper of the Future
      • Data: Available in a public repository, including documentation (metadata), a clear license specifying conditions of use, and citable using a unique and persistent identifier.
      • Software: Available in a public repository, with documentation (metadata), a license for reuse, and citable using a unique persistent identifier.
        • Not only major software used, but also other ancillary software for data reformatting, data conversions, data filtering, and data visualization.
      • Provenance: Documented for all results by explicitly describing the series of computations and their outcome with a provenance record of the execution traces and a workflow sketch (or formal workflow).
        • Possibly in a shared repository and with a unique and persistent identifier.
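The requirements on this slide can be read as a checklist. The following is only a minimal sketch of that reading, not part of the original deck: the `Artifact` class, its field names, and the example URL are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Artifact:
    """One dataset or piece of software used in a paper (hypothetical record)."""
    name: str
    repository_url: str = ""    # location in a public repository
    has_metadata: bool = False  # documentation (metadata) is provided
    license: str = ""           # conditions of use or reuse
    persistent_id: str = ""     # unique persistent identifier, such as a DOI


def missing_requirements(artifact: Artifact) -> List[str]:
    """List the slide's requirements that the artifact does not yet meet."""
    missing = []
    if not artifact.repository_url:
        missing.append("public repository")
    if not artifact.has_metadata:
        missing.append("documentation (metadata)")
    if not artifact.license:
        missing.append("license")
    if not artifact.persistent_id:
        missing.append("unique persistent identifier")
    return missing


# An ancillary script that has been uploaded but not yet documented or licensed.
script = Artifact(name="data-cleaning script",
                  repository_url="https://example.org/repo")
print(missing_requirements(script))
```

Running the same check over ancillary scripts as well as the main software matches the slide's point that data reformatting, conversion, filtering, and visualization code count too.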
  11. Outline: Data, Software & Software metadata, Methods, Provenance
  12. Problems with Current Practice
      • Data is often not made available in publications
        • Lack of reproducibility
      • Data made available through investigator's URL
        • URL does not resolve (i.e., "rotten")
      From the study quoted on the slide: "We analyze a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten."
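A first check for the "URL does not resolve" problem can be scripted. This is a rough sketch under stated assumptions: the function name `resolves` and the injectable `opener` parameter are illustrative, and a successful response only shows that something answers at the URL, not that the content is still what the paper cited.

```python
import urllib.error
import urllib.request


def resolves(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the URL answers with an HTTP status below 400.

    `opener` defaults to urllib's urlopen; it is a parameter so the check
    can be exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, timeouts and connection errors;
        # ValueError covers malformed URLs.
        return False
```

Calling `resolves(...)` on each data URL in a reference list before submission gives an early warning, though it says nothing about whether the link will survive after publication, which is the case for persistent identifiers.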
  13. Making Data Accessible: Overview of Best Practices
      1. Uniform Resource Locator (URL)
      2. Persistent URL (PURL)
      3. Digital Object Identifier
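To illustrate how a Digital Object Identifier differs from a plain URL, the sketch below turns a bare DOI, or an older dx.doi.org link like the ones in this deck, into a link through the doi.org resolver. The helper name `doi_to_url` is an assumption made for this example.

```python
def doi_to_url(doi: str) -> str:
    """Turn a bare DOI or a doi.org / dx.doi.org link into an https://doi.org/ URL."""
    doi = doi.strip()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + doi


print(doi_to_url("http://dx.doi.org/10.5281/zenodo.159206"))
# prints https://doi.org/10.5281/zenodo.159206
```

The identifier itself stays fixed while the resolver redirects to wherever the repository currently hosts the record, which is what makes it more durable than an investigator's own URL.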
  14. Outline: Data, Software & Software metadata, Methods, Provenance
  15. Data Preparation Software Dominates but is Least Shared
      • "Scientists and engineers spend more than 60% of their time just preparing the data for model input or data-model comparison" (NASA A40)
      • Result: "Common Motifs in Scientific Workflows: An Empirical Analysis." Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.
  16. "Dark Software"
      • Models that are not published
        • E.g., from a PhD thesis
      • Data preparation software
      • Visualization software
      "Dark Software" is the counterpart of "Dark Data" [Heidorn 2008]
  17. Making Software Accessible from a Public Location
      Options:
      • Publish on your web site
        • Very easy and simple
        • Get a PURL for the version you use in the paper
      • Use a data repository (e.g., Zenodo), treating code like data
        • Very easy and simple
        • It allows you to get a DOI
      • Use a code repository (e.g., GitHub, BitBucket)
        • Beneficial if you have other users or want to track new versions
        • Some will give you a DOI (e.g., GitHub)
      • Create a formal community project (e.g., in Apache)
  18. Choosing an Open Source License
      • Copyright: automatically applied to software when it is created, granting the creator exclusive rights to it as intellectual property
      • Open source license: reduces constraints and enables software developers to make their source code available to the public
        1. "Copyleft" license (e.g., GNU General Public License (GPL))
        2. "Permissive" license (e.g., Apache 2 or MIT licenses)
      • Open Source Initiative
        • Choose a license from: http://opensource.org/licenses
      • Recommend that you choose a permissive license
        • Apache v2
  19. Software Citation Format
      • Similar to data citation format, but includes software version
      Garijo, Daniel; Xie, Lei; Zhang, Yinliang; Gil, Yolanda; Xie, Li (2013) Tool for computing anomalies, GitHub. V.1 http://dx.doi.org/10.5281/zenodo.18765 Retrieved 11:05, Feb, 15, 2015 (GMT)
      Labeled parts: Authors, Date of publication, Repository name, Version, Persistent unique identifier, Time of retrieval
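The labeled parts of the example citation can be assembled programmatically. The function below is an illustrative sketch that reproduces the layout of the slide's example; its name and parameter names are assumptions, and other venues may require a different layout.

```python
def format_software_citation(authors, year, title, repository,
                             version, identifier, retrieved):
    """Assemble a software citation in the layout of the slide's example."""
    author_list = "; ".join(authors)
    return (f"{author_list} ({year}) {title}, {repository}. "
            f"V.{version} {identifier} Retrieved {retrieved}")


citation = format_software_citation(
    authors=["Garijo, Daniel", "Xie, Lei", "Zhang, Yinliang",
             "Gil, Yolanda", "Xie, Li"],
    year=2013,
    title="Tool for computing anomalies",
    repository="GitHub",
    version="1",
    identifier="http://dx.doi.org/10.5281/zenodo.18765",
    retrieved="11:05, Feb, 15, 2015 (GMT)",
)
print(citation)
```

Keeping the version and the persistent identifier as separate required arguments makes it harder to omit the two parts that distinguish a software citation from a generic link.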
  20. Software Metadata
      • Describes characteristics of the software so that others can understand, discover (find), and compare software
      • Six major categories of software metadata
      • Developed as part of the OntoSoft project
        • http://www.ontosoft.org/software
  21. Describing Software with OntoSoft
      • Automatic crawlers import metadata from code repositories (e.g., GitHub)
      • Questions for 6 top categories, some "important" and some "optional"
      • http://www.ontosoft.org/portal
      • Currently >600 entries, many imported from CSDMS, C4P, …
  22. Publishing Software Metadata with OntoSoft
      • Publish metadata as HTML from OntoSoft and add pointer from software repository
      • http://www.ontosoft.org/portal
  23. Outline: Data, Software & Software metadata, Methods, Provenance
  24. Methods Described in Text Are Incomplete and Ambiguous
      • Analysis of 18 quantitative papers published in Nature Genetics in the past two years found that reproducibility was not achievable even in principle in 10 cases, even when datasets are published [Ioannidis et al 09]
      • "Data processing, however, is often not described well enough to allow for exact reproduction of the results, leading to exercises in 'forensic bioinformatics' where aspects of raw data and reported results are used to infer what methods must have been employed." [Baggerly and Coombes 09]
      • "Ambiguity in program descriptions leads to the possibility, if not the certainty, that a given natural language description can be converted into computer code in various ways, each of which may lead to different numerical outcomes." [Ince et al 2012]
  25. Workflows as Representations of Computational Methods
      • Computational workflow
        • E.g., water metabolism
      • Workflows can include manual steps
        • E.g., creating a figure, cleaning data
      • Workflows may access web services
        • E.g., access databases in biology
  26. Workflow Systems
      • Capture method as a workflow
        • Workflow can be easily shared and reused
      • Other benefits
        • Workflow validation
        • Scalable computations
        • Comprehensive software libraries
      • Many workflow systems
        • Each has different capabilities
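To make "capture method as a workflow" concrete, here is a toy sketch of a workflow as a dependency graph of named steps, run in dependency order with Python's standard `graphlib` module (Python 3.9 or later). It is not how any particular workflow system works; the step names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies, inputs):
    """Run each step after the steps it depends on.

    steps:        step name -> function taking a dict of upstream results
    dependencies: step name -> set of names the step depends on
    inputs:       name -> value for data that is given rather than computed
    """
    results = dict(inputs)
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which acts as a very basic form of workflow validation.
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # provided as an input, nothing to compute
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](upstream)
    return results


steps = {
    "clean": lambda up: [x for x in up["raw"] if x is not None],
    "mean": lambda up: sum(up["clean"]) / len(up["clean"]),
}
dependencies = {"clean": {"raw"}, "mean": {"clean"}}

results = run_workflow(steps, dependencies, {"raw": [1.0, None, 3.0]})
print(results["mean"])  # prints 2.0
```

Because the data preparation step ("clean") is an explicit node rather than an unrecorded manual action, it is shared along with the analysis step whenever the workflow is shared.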
  27. Electronic Notebooks: http://ipython.org/notebook.html
  28. Outline: Data, Software & Software metadata, Methods, Provenance
  29. A Working Definition of Provenance
      • Provenance can be seen as metadata, but not all metadata is provenance
      "Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility." http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
      • Provenance results from past actions
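The definition above suggests a simple record: for each process that ran, what it used and what it produced. The sketch below captures such a record at execution time. The record layout and the name `run_with_provenance` are illustrative assumptions, not a standard format.

```python
from datetime import datetime, timezone


def run_with_provenance(activity, func, inputs, log):
    """Run func(**inputs) and append a record of what was used and produced."""
    started = datetime.now(timezone.utc).isoformat()
    output = func(**inputs)
    log.append({
        "activity": activity,     # the process that ran
        "used": dict(inputs),     # entities consumed by the process
        "generated": output,      # entity produced by the process
        "started_at": started,
        "ended_at": datetime.now(timezone.utc).isoformat(),
    })
    return output


log = []
total = run_with_provenance(
    "sum_rainfall", lambda values: sum(values), {"values": [2, 3, 5]}, log
)
print(total)           # prints 10
print(log[0]["used"])  # prints {'values': [2, 3, 5]}
```

The record is written after the step has run, which reflects the slide's point that provenance results from past actions rather than describing a plan.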
  30. Describing Execution (Provenance) vs General Method (Workflow)
  31. Capturing Scientific Workflows and their provenance as web resources. [Diagram: "Adapted methodology". Phase labels: Workflow generation and execution; Specification, modeling and generation; Publication; Exploitation. Component labels: Workflow system 1 and Workflow system 2, each with Workflow Template, Workflow Execution log, and OPMW-PROV export; Upload manager; File upload interface; RDF upload interface; URI naming convention; RDF Triple Store; Permanent web-accessible file store; SPARQL endpoint; Pubby web service; Applications (Wexp); Users]
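As a rough illustration of what exporting an execution log as RDF can look like, the sketch below writes one execution as N-Triples using terms from the W3C PROV-O vocabulary (`prov:Activity`, `prov:Entity`, `prov:used`, `prov:wasGeneratedBy`). It does not reproduce the OPMW-PROV export named on the slide; the base URI, the resource names, and the `export_execution` helper are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(subject, predicate, obj):
    """Format one N-Triples statement whose three terms are all IRIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def export_execution(base, activity, used, generated):
    """Describe one executed step and its input and output files as N-Triples."""
    activity_uri = base + activity
    lines = [triple(activity_uri, RDF_TYPE, PROV + "Activity")]
    for name in used:
        lines.append(triple(base + name, RDF_TYPE, PROV + "Entity"))
        lines.append(triple(activity_uri, PROV + "used", base + name))
    for name in generated:
        lines.append(triple(base + name, RDF_TYPE, PROV + "Entity"))
        lines.append(triple(base + name, PROV + "wasGeneratedBy", activity_uri))
    return "\n".join(lines)


# Hypothetical run: one step reads input.csv and writes anomalies.csv.
print(export_execution("http://example.org/run1/", "compute_anomalies",
                       ["input.csv"], ["anomalies.csv"]))
```

Building every IRI from a single base is a small-scale stand-in for the URI naming convention in the diagram: each step and file gets a predictable web address that a triple store can index.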
  32. Scientific Paper of the Future
  33. Acknowledgments
      • The Scientific Paper of the Future training materials were developed and edited by Yolanda Gil (USC), based on the OntoSoft Geoscience Paper of the Future (GPF) training materials with contributions from the OntoSoft team including Chris Duffy (PSU), Chris Mattmann (JPL), Scott Peckham (CU), Ji-Hyun Oh (USC), Varun Ratnakar (USC), Erin Robinson (ESIP)
      • The OntoSoft training materials were significantly improved through input from GPF pioneers Cedric David (JPL), Ibrahim Demir (UI), Bakinam Essawy (UV), Robinson W. Fulweiler (BU), Jon Goodall (UV), Leif Karlstrom (UO), Kyo Lee (JPL), Heath Mills (UH), Suzanne Pierce (UT), Allen Pope (CU), Mimi Tzeng (DISL), Karan Venayagamoorthy (CSU), Sandra Villamizar (UC), and Xuan Yu (UD)
      • Thank you to Ruth Duerr (NSIDC), James Howison (UT), Matt Jones (UCSB), Lisa Kempler (MathWorks), Kerstin Lehnert (LDEO), Matt Meyernick (NCAR), and Greg Wilson (Software Carpentry) for feedback on best practices
      • Thank you also to the many scientists and colleagues who have taken the training and asked hard questions
      • We are grateful for the support of the National Science Foundation and the EarthCube program (ICER-1440323, ICER-1343800)
  34. For More Information: http://www.scientificpaperofthefuture.org [Figure: hydrology diagram with labels Rainfall, Snowfall, Snowmelt, Runoff, Recharge, Groundwater discharge, Transpiration, Canopy Evap., Soil Evap., Unsaturated Zone, Saturated Zone] http://dx.doi.org/10.5281/zenodo.15920