Scott Edmunds talk at G3 (Great GigaScience & Galaxy) workshop: Open Data: the reproducibility crisis, and the need for transparency.

Scott Edmunds
G3 workshop
19th September 2014
ORCID: 0000-0001-6444-1436
Open Data: the reproducibility crisis,
and the need for transparency.

Being able to read things is only the first step
Dead trees not fit for purpose
1665 1812 1869

The problems with publishing
• Scholarly articles are merely advertisement of scholarship.
The actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible. (Jon B. Buckheit and David L. Donoho, WaveLab
and Reproducible Research, 1995)
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication.
• If there is interest in data, only to monetise & re-silo
• Traditional publishing policies and practices a hindrance

Growing problem…
…loss of confidence in research

The Cost of Scientific Retractions

The consequences: growing replication gap
Out of 18 microarray papers, results
from 10 could not be reproduced
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Consequences: increasing number of retractions
>15X increase in last decade
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Consequences: increasing number of retractions
At the current rate of increase, by 2045 as many papers would be retracted as published
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Consequences: growing replication gap
Insufficient methods
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

STAP paper demonstrates problems:
Need:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies

Anatomy of a Dead Tree Publication
[Diagram labels: Data, Idea, Study, Analysis, Answer, Metadata]

Anatomy of an (Open) Data Publication
[Diagram labels: Data, Idea, Study, Analysis, Answer, Metadata]

What is Open (Science) Data?
• Free & open access to data about the world around us:
o Searchable, findable
o Machine-readable, app-makeable, Excel-usable
o Without restrictions/limitations
http://science.okfn.org/

Panton Principles
http://pantonprinciples.org/

Sharing aids individuals…
Sharing Detailed Research Data Is Associated with Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Every 10 datasets collected contribute to at least 4 papers in the following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a

Sharing aids specific communities…
Rice v Wheat: consequences of publicly available
genome data.
[Chart: Papers]

Credit where credit is overdue:
“One option would be to provide researchers who release data to public
repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data
set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
New incentives/credit
• Data
• Software
• Review
• Re-use…
= Credit

Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.
Cloud solutions?

Lowering barriers: data-athons
DTL/ELIXIR-NL
“Bring Your Own Data Party”
GigaScience/BGI HK
Metabolomics ISA-TAB athon

Beneficiaries/users of our work
Rice 3K project: 3,000 rice genomes, 13.4TB public data
IRRI GALAXY

Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public
domain under a CC0 license. Until the publication of research papers on
the assembly and whole-genome analysis of this isolate we would ask you
to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang,
J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J;
Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao,
X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the
Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium
(2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
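
A dataset DOI such as the one above can also be turned into a formatted citation programmatically. The following is a minimal sketch, not part of the original talk: the helper names `normalize_doi` and `citation_request` are hypothetical, and it assumes the DOI's registration agency supports content negotiation for the `text/x-bibliography` media type through the public doi.org resolver.

```python
from urllib.request import Request

# Public DOI resolver (assumption: the DOI is resolvable through doi.org).
DOI_RESOLVER = "https://doi.org/"

# Prefixes commonly seen in front of a bare DOI.
_PREFIXES = (
    "doi:",
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI (e.g. '10.5524/100001') from a prefixed form."""
    doi = raw.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def citation_request(raw_doi: str, style: str = "apa") -> Request:
    """Build (but do not send) a request asking for a formatted citation."""
    doi = normalize_doi(raw_doi)
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return Request(DOI_RESOLVER + doi, headers=headers)


if __name__ == "__main__":
    req = citation_request("doi:10.5524/100001")
    print(req.full_url)
    print(req.get_header("Accept"))
```

Passing the returned object to `urllib.request.urlopen` would perform the lookup; that step needs network access and is left out here so the sketch stays self-contained.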

Downstream consequences:
1. Citations (~240)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the
Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he
knew it might take days for the lawyers at his company — Pacific Biosciences — to parse the
agreements governing how his team could use data collected on the strain. Luckily, one team had
released its data under a Creative Commons licence that allowed free use of the data, allowing
Kasarskis and his colleagues to join the international research effort and publish their work without
wasting time on legal wrangling.”

1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastro-intestinal
infection in Hamburg in Germany in May 2011. This
spread through several European countries and the US,
affecting about 4000 people and resulting in over 50 deaths.
All tested positive for an unusual and little-known Shiga-toxin–
producing E. coli bacterium. The strain was initially analysed
by scientists at BGI-Shenzhen in China, working together with
those in Hamburg, and three days later a draft genome was
released under an open data licence. This generated interest
from bioinformaticians on four continents. 24 hours after the
release of the genome it had been assembled. Within a week
two dozen reports had been filed on an open-source site
dedicated to the analysis of the strain. These analyses
provided crucial information about the strain’s virulence and
resistance genes – how it spreads and which antibiotics are
effective against it. They produced results in time to help
contain the outbreak. By July 2011, scientists published papers
based on this work. By opening up their early sequencing
results to international collaboration, researchers in Hamburg
produced results that were quickly tested by a wide range of
experts, used to produce new knowledge and ultimately to
control a public health emergency.

http://dx.doi.org/10.5524/100102
The first public Nanopore dataset, released 10-Sep-2014
Curated with sample details and converted to ISA-tab
second

The other challenge: transparency, accountability, credit

More transparency:
open peer review
BMC Series
Medical Journals

Reward open & transparent review
End reviewer 3 Downfall parody videos, now!

More transparency (and speed):
pre-prints
1. http://www.nature.com/news/preprints-come-to-life-1.14140

GigaScience + Publons = further credit for reviewers' efforts
http://publons.com/

Reward faster review
GigaScience + AcademicKarma = even more credit
http://academickarma.org/

Real-time open-review = paper in arXiv + blogged reviews
www.gigasciencejournal.com/content/2/1/10 http://tmblr.co/ZzXdssfOMJfy

(Assemblathon ‘publish for free’ contest: publishforfree@assemblathon.org)

In Summary
Make your data open (CC0)
Metadata, metadata, metadata
Get credit for your reviewing
Use pre-prints
Publish your data with us
scott@gigasciencejournal.com
@gigascience
facebook.com/GigaScience
www.gigasciencejournal.com
