3. unFAIR things about publishing
• Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods which support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995)
• Focus only on subjective “impact” rather than reuse.
• Lack of transparency, lack of credit for anything other than dead trees.
4. The consequences: growing replication gap
Out of 18 microarray papers, results from 10 could not be reproduced
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
5. On top of availability, data (& ROs) need to be FAIR
http://www.nature.com/articles/sdata201618
8. GigaSolution: deconstructing the paper
gigadb.org
www.gigasciencejournal.com
Utilizes big-data infrastructure and expertise from:
Combines and integrates (with DOIs):
• Open-access journal
• Data Publishing Platform
• Data Analysis Platform
• Open Review Platform
10. Reproducibility spectrum
Publication only (not reproducible)
Publication + data
Publication + code and data
Publication + linked and executable code and data
Full replication (gold standard)
Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 122
11. Reproducibility (FAIR) spectrum
Publication only (not reproducible)
Publication + data
Publication + code and data
Publication + linked and executable code and data
Full replication (gold standard)
Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.
17. Not just Genomics: Galaxy-M (Metabolomics)
https://gigascience.biomedcentral.com/articles/10.1186/s13742-016-0115-8
18. Now including deep integration with protocols.io
Need to capture “wet” workflows (protocols)
• Create, share, modify forkable protocols in repo.
• Download & run on smartphone app.
• Get discoverability, credit, DOIs for sharing methods.
• Create your own, or let us set up & you claim.
https://www.protocols.io/groups/gigascience-journal
19. Taking a microscope to the publication process
How FAIR/reproducible are GigaScience papers?
21. How FAIR can we get?
• Open-Paper: DOI:10.1186/2047-217X-1-18, >50,000 accesses & >1,000 citations
• Open-Review: 7 reviewers tested data in ftp server & named reports published
• Open-Data, Open-Pipelines, Open-Workflows (data sets and analyses): DOI:10.5524/100044, DOI:10.5524/100038; 78GB CC0 data
• Open-Code:
Code on SourceForge & GitHub under GPLv3: http://soapdenovo2.sourceforge.net/ & https://github.com/aquaskyline/SOAPdenovo2
>40,000 downloads
Enabled code to be picked apart by bloggers in wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
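The DOIs on this slide are what make each research object findable. As a minimal sketch, the helper below turns the DOI strings as written on the slide into resolvable links through the public doi.org resolver. The function name `doi_to_url` is hypothetical, and the sketch assumes simple DOIs that need no URL encoding.

```python
def doi_to_url(doi: str) -> str:
    """Turn a DOI string such as 'DOI:10.5524/100044' into a doi.org URL."""
    cleaned = doi.strip()
    # Drop an optional "DOI:" / "doi:" prefix, as used on the slide.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # Every DOI starts with the "10." directory indicator.
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


# DOIs copied from the slide.
slide_dois = [
    "DOI:10.1186/2047-217X-1-18",
    "DOI:10.5524/100044",
    "DOI:10.5524/100038",
]

for doi in slide_dois:
    print(doi_to_url(doi))
```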
22. The SOAPdenovo2 Case study
Subjected to, and tested with, 3 models.
Types of resources in an RO, and the models used to describe each resource type:
• Data: ISA-TAB/ISA2OWL
• Method/Experimental protocol: Wfdesc/ISA-TAB/ISA2OWL
• Findings: Nanopublication
25. Published and Galaxy-reproduced statistics of genome assemblies of S. aureus and R. sphaeroides
Published:
Species         Tool          Contigs                                       Scaffolds
                              Number  N50 (kb)  Errors  N50 corrected (kb)  Number  N50 (kb)  Errors  N50 corrected (kb)
S. aureus       SOAPdenovo1   79      148.6     156     23                  49      342       0       342
                SOAPdenovo2   80      98.6      25      71.5                38      1086      2       1078
                ALL-PATHS-LG  37      149.7     13      119.0               11      1477      1       1093
R. sphaeroides  SOAPdenovo1   2241    3.5       400     2.8                 956     106       24      68
                SOAPdenovo2   721     18        106     14.1                333     2549      4       2540
                ALL-PATHS-LG  190     41.9      30      36.7                32      3191      0       0
Reproduced:
Species         Tool          Contigs                                       Scaffolds
                              Number  N50 (kb)  Errors  N50 corrected (kb)  Number  N50 (kb)  Errors  N50 corrected (kb)
S. aureus       SOAPdenovo1   79      148.6     156     23                  49      342       0       342
                SOAPdenovo2   80      98.6      25      71.5                38      1086      2       1078
                ALL-PATHS-LG  37      149.7     13      117.6               10      1477      1       1093
R. sphaeroides  SOAPdenovo1   2242    3.5       392     2.8                 956     105       18      70
                SOAPdenovo2   721     18        106     14.1                333     2549      4       2540
                ALL-PATHS-LG  190     41.9      31      36.7                32      3191      0       3310
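The two tables on this slide can be compared cell by cell to see where they disagree. The sketch below is illustrative only: the values are typed in from the slide in the order the tables appear, and the names `COLUMNS`, `TABLE_A`, `TABLE_B` and `find_discrepancies` are hypothetical, not part of any GigaScience tooling.

```python
# Column order follows the slide: four contig metrics, then four scaffold metrics.
COLUMNS = [
    "contig_number", "contig_n50_kb", "contig_errors", "contig_n50_corrected_kb",
    "scaffold_number", "scaffold_n50_kb", "scaffold_errors", "scaffold_n50_corrected_kb",
]

# First table on the slide, keyed by (species, tool).
TABLE_A = {
    ("S. aureus", "SOAPdenovo1"): [79, 148.6, 156, 23, 49, 342, 0, 342],
    ("S. aureus", "SOAPdenovo2"): [80, 98.6, 25, 71.5, 38, 1086, 2, 1078],
    ("S. aureus", "ALL-PATHS-LG"): [37, 149.7, 13, 119.0, 11, 1477, 1, 1093],
    ("R. sphaeroides", "SOAPdenovo1"): [2241, 3.5, 400, 2.8, 956, 106, 24, 68],
    ("R. sphaeroides", "SOAPdenovo2"): [721, 18, 106, 14.1, 333, 2549, 4, 2540],
    ("R. sphaeroides", "ALL-PATHS-LG"): [190, 41.9, 30, 36.7, 32, 3191, 0, 0],
}

# Second table on the slide, same layout.
TABLE_B = {
    ("S. aureus", "SOAPdenovo1"): [79, 148.6, 156, 23, 49, 342, 0, 342],
    ("S. aureus", "SOAPdenovo2"): [80, 98.6, 25, 71.5, 38, 1086, 2, 1078],
    ("S. aureus", "ALL-PATHS-LG"): [37, 149.7, 13, 117.6, 10, 1477, 1, 1093],
    ("R. sphaeroides", "SOAPdenovo1"): [2242, 3.5, 392, 2.8, 956, 105, 18, 70],
    ("R. sphaeroides", "SOAPdenovo2"): [721, 18, 106, 14.1, 333, 2549, 4, 2540],
    ("R. sphaeroides", "ALL-PATHS-LG"): [190, 41.9, 31, 36.7, 32, 3191, 0, 3310],
}


def find_discrepancies(table_a, table_b):
    """Return (species, tool, column, value_a, value_b) for every differing cell."""
    diffs = []
    for (species, tool), row_a in table_a.items():
        row_b = table_b[(species, tool)]
        for column, value_a, value_b in zip(COLUMNS, row_a, row_b):
            if value_a != value_b:
                diffs.append((species, tool, column, value_a, value_b))
    return diffs


for species, tool, column, value_a, value_b in find_discrepancies(TABLE_A, TABLE_B):
    print(f"{species} / {tool} / {column}: {value_a} vs {value_b}")
```

An exact comparison is used because the slide reports rounded values; a real reproducibility check would more likely allow a tolerance per metric.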
27.
1. While there are huge improvements to the quality of the resulting assemblies, other than in the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
http://dx.doi.org/10.1186/s13742-015-0069-2
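Each of these corrections comes down to re-deriving a fold change from table values and comparing it with what the text states. A minimal sketch of that check follows. The helper names `fold_change` and `matches_claim` and the 5% tolerance are assumptions. The example numbers are the R. sphaeroides scaffold N50 values from the first table on slide 25, used purely for illustration; they are not the table 2 values that the corrections refer to.

```python
import math


def fold_change(new_value: float, old_value: float) -> float:
    """Ratio of the new value to the baseline value."""
    if old_value == 0:
        raise ValueError("fold change is undefined when the baseline is zero")
    return new_value / old_value


def matches_claim(stated_fold: float, new_value: float, old_value: float,
                  rel_tol: float = 0.05) -> bool:
    """True if the fold change stated in the text agrees with the table values."""
    return math.isclose(fold_change(new_value, old_value), stated_fold, rel_tol=rel_tol)


# Illustrative values only (scaffold N50 in kb, R. sphaeroides, slide 25).
soapdenovo2_n50 = 2549
soapdenovo1_n50 = 106

print(round(fold_change(soapdenovo2_n50, soapdenovo1_n50), 1))
print(matches_claim(10, soapdenovo2_n50, soapdenovo1_n50))  # "an order of magnitude"
```

Running this kind of check over every ratio quoted in a manuscript before submission is far cheaper than issuing a correction afterwards.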
28. Lessons learned from this
• With enough effort it is possible to recreate a result from a paper
• Most published research findings are false. Or at least have errors
• Complete scientific reproduction is difficult
– Being FAIR can be COSTLY. How much are you willing to spend?
• Much easier to make things FAIR before rather than after publication.
• Finally seeing benefits (re-use/citations) from our “review on reproducibility not impact” approach
29. 21st Century I4As
• Think beyond narrative to re-use
• Bake in reproducibility
• Embrace new FAIR tools & models
• Disseminate ALL ROs
• Worth investment in moving up reproducibility spectrum
– toolshedVMs/Docker
• Remember FAIR mantra:
“The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: “is it FAIR”?”
http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html
30. www.gigasciencejournal.com
Give us your FAIR data, workflows & papers
Help GigaPanda make it happen!
Contact us:
scott@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com
31. Thanks to:
Team: Peter Li, Chris Hunter, Jesse Si Zhe Xiao, Nicole Nogoy, Hans Zauner, Laurie Goodman
Case study: Ruibang Luo (HKU/JH), Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Oxford), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)
@gigascience
facebook.com/GigaScience
http://gigasciencejournal.com/blog
www.gigadb.org
gigagalaxy.net
www.gigasciencejournal.com
Funding from: