Scott Edmunds at #GAMe2017: GigaGalaxy & publishing workflows for publishing workflows

  1. Publishing workflows for publishing workflows. Scott Edmunds, scott@gigasciencejournal.com, ORCID: 0000-0001-6444-1436
  2. Science & publishing pipelines, 1665-2016 (diagram: Idea, Study, Methods, Data, Analysis/pipelines, software, Metadata, Answer, Narrative, Review, Publisher, Impact?)
  3. unFAIR things about publishing • "Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods which support the scholarship, remain largely inaccessible" (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995) • Focus only on subjective "impact" rather than reuse • Lack of transparency, and lack of credit for anything other than dead trees
  4. The consequences: a growing replication gap. Out of 18 microarray papers, results from 10 could not be reproduced. 1. Ioannidis et al. (2009) Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155. 2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124.
  5. On top of availability, data (& Research Objects) need to be FAIR. http://www.nature.com/articles/sdata201618
  6. Science & publishing pipelines >2017? (diagram: Idea, Study, Methods, Data, Analysis/pipelines, software, Metadata, Answer, each rewarded with a DOI and its own publication)
  7. GigaSolution: deconstructing the paper. Combines and integrates (with DOIs): an open-access journal (www.gigasciencejournal.com), a data publishing platform (gigadb.org), a data analysis platform, and an open review platform. Utilizes big-data infrastructure and expertise from the partners shown on the slide.
  8. gigadb.org
  9. Reproducibility spectrum: publication only (not reproducible) → publication + data → publication + code and data → publication + linked and executable code and data (full replication, the gold standard). Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.
  10. The same spectrum recast as a reproducibility (FAIR) spectrum. Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.
  11. gigagalaxy.net: reward sharing of workflows
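Since GigaGalaxy is a Galaxy server, workflows shared on it can in principle be driven programmatically. Below is a minimal sketch using BioBlend, the standard Python client for the Galaxy API; the server URL is taken from the slide, while the API key and the workflow, history, and dataset IDs are hypothetical placeholders.

```python
from bioblend.galaxy import GalaxyInstance

# Connect to the Galaxy server named on the slide (API key is a placeholder).
gi = GalaxyInstance(url="https://gigagalaxy.net", key="YOUR_API_KEY")

# List the workflows visible to this account.
for wf in gi.workflows.get_workflows():
    print(wf["id"], wf["name"])

# Invoke one workflow on an existing history, mapping a dataset to its first
# input step (workflow, history, and dataset IDs are placeholders).
invocation = gi.workflows.invoke_workflow(
    workflow_id="WORKFLOW_ID",
    inputs={"0": {"src": "hda", "id": "DATASET_ID"}},
    history_id="HISTORY_ID",
)
print(invocation["state"])
```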
  12. Reward sharing of tools: the GigaScience Galaxy ToolShed. http://gigatoolshed.net/
  13. GigaScience Galaxy series on data-intensive and reproducible research: https://academic.oup.com/gigascience/pages/galaxy_series_data_intensive_reproducible_research
  14. Visualisations & DOIs for workflows. https://academic.oup.com/gigascience/pages/galaxy_series_data_intensive_reproducible_research
  15. Virtual machines/containers • Downloadable as a virtual hard disk / available as an Amazon Machine Image • Now publishing container (Docker) submissions. https://dx.doi.org/10.1186/2047-217X-3-23, https://dx.doi.org/10.1186/s13742-015-0060-y
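To show what re-running a published container submission looks like in practice, here is a minimal sketch using the Docker SDK for Python; the image name and the archived data path are hypothetical placeholders, not an actual GigaScience submission.

```python
import docker

client = docker.from_env()

# Pull a published analysis container (image name is a placeholder).
image = client.images.pull("example/published-analysis:1.0")

# Run its default entrypoint with the archived input data mounted read-only,
# so the analysis is re-executed against exactly the data that was deposited.
logs = client.containers.run(
    image,
    volumes={"/path/to/archived/data": {"bind": "/data", "mode": "ro"}},
    remove=True,
)
print(logs.decode())
```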
  16. Not just genomics: Galaxy-M (metabolomics). https://gigascience.biomedcentral.com/articles/10.1186/s13742-016-0115-8
  17. Now including deep integration with protocols.io, to capture "wet" workflows (protocols) • Create, share, and modify forkable protocols in a repository • Download & run them in the smartphone app • Get discoverability, credit, and DOIs for sharing methods • Create your own, or let us set them up & you claim them. https://www.protocols.io/groups/gigascience-journal
  18. Taking a microscope to the publication process: how FAIR/reproducible are GigaScience papers?
  19. Pilot project: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612
  20. How FAIR can we get? The SOAPdenovo2 example (data sets & analyses) • Open paper: DOI:10.1186/2047-217X-1-18, >50,000 accesses & >1,000 citations • Open review: 7 reviewers tested the data on an FTP server & their named reports were published • Open data: 78 GB of CC0 data, DOI:10.5524/100044 • Open pipelines & open workflows: DOI:10.5524/100038 • Open code: SourceForge & GitHub under GPLv3 (http://soapdenovo2.sourceforge.net/ & https://github.com/aquaskyline/SOAPdenovo2), >40,000 downloads, which enabled the code to be picked apart by bloggers in a wiki (http://homolog.us/wiki/index.php?title=SOAPdenovo2)
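The Findable and Accessible parts of FAIR hinge on those DOIs resolving to machine-readable metadata. As a small illustration (not part of the slide), DataCite-registered DOIs such as the GigaDB ones above support content negotiation at doi.org; only the requests library is assumed here.

```python
import requests

doi = "10.5524/100038"  # one of the GigaDB DOIs cited on the slide

# Ask the DOI resolver for CSL-JSON metadata instead of the landing page.
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
resp.raise_for_status()
meta = resp.json()

# Title and publisher are the kind of machine-readable metadata FAIR asks for.
print(meta.get("title"))
print(meta.get("publisher"))
```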
  21. The SOAPdenovo2 case study, subjected to and tested with 3 models. Types of resources in a Research Object, and the model used to describe each resource type: Data → ISA-TAB/ISA2OWL; Method/experimental protocol → Wfdesc/ISA-TAB/ISA2OWL; Findings → Nanopublication.
  22. Integration of SOAPdenovo2 into GigaGalaxy
  23. SOAPdenovo2 S. aureus pipeline
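As background for the pipeline slide: a SOAPdenovo2 run is driven by a small library configuration file plus a single command, which GigaGalaxy wraps as a Galaxy tool. The sketch below shows the general shape from Python; the read paths, insert size, and k-mer value are placeholders, not the actual GAGE S. aureus settings.

```python
import subprocess
from pathlib import Path

# Minimal SOAPdenovo2 library configuration (all values are placeholders).
config = """\
max_rd_len=100
[LIB]
avg_ins=180
reverse_seq=0
asm_flags=3
rank=1
q1=/path/to/reads_1.fastq
q2=/path/to/reads_2.fastq
"""
Path("soap.config").write_text(config)

# "all" runs the pregraph, contig, map and scaff steps in one go.
subprocess.run(
    ["SOAPdenovo-63mer", "all", "-s", "soap.config",
     "-K", "63", "-o", "saureus_asm", "-p", "4"],
    check=True,
)
```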
  24. Published and Galaxy-reproduced statistics of genome assemblies of S. aureus and R. sphaeroides.

Published:

| Species | Tool | Contig number | Contig N50 (kb) | Contig errors | Corrected contig N50 (kb) | Scaffold number | Scaffold N50 (kb) | Scaffold errors | Corrected scaffold N50 (kb) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. aureus | SOAPdenovo1 | 79 | 148.6 | 156 | 23 | 49 | 342 | 0 | 342 |
| S. aureus | SOAPdenovo2 | 80 | 98.6 | 25 | 71.5 | 38 | 1086 | 2 | 1078 |
| S. aureus | ALL-PATHS-LG | 37 | 149.7 | 13 | 119.0 | 11 | 1477 | 1 | 1093 |
| R. sphaeroides | SOAPdenovo1 | 2241 | 3.5 | 400 | 2.8 | 956 | 106 | 24 | 68 |
| R. sphaeroides | SOAPdenovo2 | 721 | 18 | 106 | 14.1 | 333 | 2549 | 4 | 2540 |
| R. sphaeroides | ALL-PATHS-LG | 190 | 41.9 | 30 | 36.7 | 32 | 3191 | 0 | 0 |

Reproduced in Galaxy:

| Species | Tool | Contig number | Contig N50 (kb) | Contig errors | Corrected contig N50 (kb) | Scaffold number | Scaffold N50 (kb) | Scaffold errors | Corrected scaffold N50 (kb) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. aureus | SOAPdenovo1 | 79 | 148.6 | 156 | 23 | 49 | 342 | 0 | 342 |
| S. aureus | SOAPdenovo2 | 80 | 98.6 | 25 | 71.5 | 38 | 1086 | 2 | 1078 |
| S. aureus | ALL-PATHS-LG | 37 | 149.7 | 13 | 117.6 | 10 | 1477 | 1 | 1093 |
| R. sphaeroides | SOAPdenovo1 | 2242 | 3.5 | 392 | 2.8 | 956 | 105 | 18 | 70 |
| R. sphaeroides | SOAPdenovo2 | 721 | 18 | 106 | 14.1 | 333 | 2549 | 4 | 2540 |
| R. sphaeroides | ALL-PATHS-LG | 190 | 41.9 | 31 | 36.7 | 32 | 3191 | 0 | 3310 |
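For readers unfamiliar with the N50 metric used throughout the table (and in the errata on the next slide), here is a small illustrative Python sketch of how contig or scaffold N50 is conventionally computed; it is the generic definition, not the specific assembly-statistics tooling used in the evaluation.

```python
def n50(lengths):
    """N50: the largest length L such that contigs/scaffolds of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example (not real assembly data): total size is 340 kb, and the two
# longest contigs already cover more than half of it.
contigs = [150_000, 90_000, 60_000, 30_000, 10_000]
print(n50(contigs))  # -> 90000
```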
  25. Errata from the reproduction (http://dx.doi.org/10.1186/s13742-015-0069-2): 1. While there are huge improvements in the quality of the resulting assemblies, it was not stressed in the text (other than in the tables) that SOAPdenovo2 can be slightly slower than SOAPdenovo v1. 2. In the testing and assessment section (page 3), where we say the scaffold N50 from SOAPdenovo2 is an order of magnitude longer than from SOAPdenovo1, based on the correct results in table 2 it was actually 45 times longer. 3. Also in the testing and assessment section, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, based on the correct results in table 2 this should be 2.18 times longer. 4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
  26. Lessons learned from this • With enough effort it is possible to recreate a result from a paper • Most published research findings are false, or at least have errors • Complete scientific reproduction is difficult: being FAIR can be COSTLY. How much are you willing to spend? • It is much easier to make things FAIR before rather than after publication • We are finally seeing benefits (re-use/citations) from our "review on reproducibility, not impact" approach
  27. 21st Century I4As • Think beyond the narrative to re-use • Bake in reproducibility • Embrace new FAIR tools & models • Disseminate ALL research objects (ROs) • It is worth investing in moving up the reproducibility spectrum (toolsheds, VMs/Docker) • Remember the FAIR mantra: "The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: is it FAIR?" http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html
  28. Give us your FAIR data, workflows & papers. Help GigaPanda make it happen! Contact us: scott@gigasciencejournal.com, editorial@gigasciencejournal.com, database@gigasciencejournal.com. www.gigasciencejournal.com
  29. Thanks to the team: Peter Li, Chris Hunter, Jesse Si Zhe Xiao, Nicole Nogoy, Hans Zauner, Laurie Goodman. Case study: Ruibang Luo (HKU/JH), Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Oxford), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford). Funding from the funders shown on the slide. Follow us: @gigascience, facebook.com/GigaScience, http://gigasciencejournal.com/blog, www.gigadb.org, gigagalaxy.net, www.gigasciencejournal.com
