Ferric Fang of the University of Washington and his colleagues quantified just how much fraud costs the government. It turns out that every paper retracted because of research misconduct costs about $400,000 in funds from the US National Institutes of Health (NIH), totaling $58 million for papers retracted between 1992 and 2012. Scientific fraud incurs additional costs.
That just leaves me to thank the GigaScience team (Laurie, Scott, Alexandra, Peter and Jesse); BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools (Tin-Lap, Qiong, Senhong, Yan); the Cogini web design team; DataCite for providing the DOI service; and the isacommons team for their support and advocacy for best-practice use of metadata reporting and sharing.
Thank you for listening.
Scott Edmunds, ReCon 2015: Beyond Dead Trees, Publishing Digital Research Objects
Need to move beyond 350-year-old incentive systems
Buckheit & Donoho: Scholarly articles are
merely advertisement of scholarship. The
actual scholarly artifacts, i.e. the data and
computational methods, which support
the scholarship, remain largely
Arsenic Life forms, will
they take over the planet?
By Melba Ketchum, PhD
Which Overhyped, Unreproducible
Experiment Are You?
Want rapid citations for 2 years only? Carry out this quiz.
You got: STAP Cells
Of course dipping cells in
coffee will make them
pluripotent. Even if the
research gets discredited,
it’ll still get 100’s of
citations in two years.
The end result….
Attempts to “game the peer-review system on an industrial
Companies offering authorship of papers made to order by
“paper mills”. Meta-analyses, network analysis & more.
Guaranteed publication in JIF journal, often using fake referees,
ID theft, etc.
Consequences: increasing number of retractions
>15X increase in last decade
At the current rate of increase, by 2045 as many
papers would be retracted as published
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
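The extrapolation above is simple compound growth, and a minimal sketch shows how such a projection is computed. The function name and the starting retraction share below are hypothetical (the slides give only the ">15X" per-decade increase, not a baseline), so the printed number is an illustration of the method rather than a reproduction of the 2045 figure.

```python
import math


def years_until_parity(start_share: float, growth_per_decade: float) -> float:
    """Years until a share that grows by a constant factor per decade reaches 1.0.

    start_share: current fraction of published papers that are retracted (0..1).
    growth_per_decade: multiplicative increase over ten years (e.g. 15.0).
    """
    if not 0.0 < start_share < 1.0:
        raise ValueError("start_share must be between 0 and 1")
    if growth_per_decade <= 1.0:
        raise ValueError("growth_per_decade must exceed 1")
    # Solve start_share * growth_per_decade ** decades == 1 for decades.
    decades = math.log(1.0 / start_share) / math.log(growth_per_decade)
    return 10.0 * decades


# ASSUMPTION, not from the slides: 2 retractions per 10,000 published papers.
ASSUMED_START_SHARE = 0.0002

# 15.0 is the per-decade increase quoted on the slide.
print(round(years_until_parity(ASSUMED_START_SHARE, 15.0), 1))
```

The result is very sensitive to the assumed baseline and to whether the growth rate holds, which is why the projection is best read as a warning rather than a forecast.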
STAP paper demonstrates problems:
Nature Editorial, 2nd July 2014:
“We have concluded that we and the referees could
not have detected the problems that fatally
undermined the papers. The referees’ rigorous
reports quite rightly took on trust what was
presented in the papers.”
STAP paper demonstrates problems:
…to publish protocols BEFORE analysis
…better access to supporting data
…more transparent & accountable review
…to publish replication studies
Credit where credit is overdue:
“One option would be to provide researchers who release data to public
repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data
set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
GigaSolution: deconstructing the paper
Utilizes big-data infrastructure and expertise from:
Combines and integrates (with DOIs):
Data Publishing Platform
Data Analysis Platform
Open Review Platform
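Since the three platforms are tied together with DOIs, a minimal sketch of how a DOI turns into a citeable, resolvable link may help. The function name, the loose syntax check and the example DOI are all assumptions made for illustration; the only fixed point is that DOIs resolve through the public https://doi.org/ proxy.

```python
import re
from urllib.parse import quote

# Loose syntax check (an assumption, not the full DOI specification):
# "10." + registrant code + "/" + non-empty suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI, rejecting obviously malformed input."""
    doi = doi.strip()
    if not DOI_PATTERN.fullmatch(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical DOI, used only to show the shape of the output.
print(doi_to_url("10.1234/example-dataset"))
```

The same identifier scheme can then point at a dataset, a workflow or a review report, which is what lets each piece be cited on its own.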
Data Publishing: nothing new…
Data & Metadata Collection/Experiments
+ Area of Interest/Question
Data Publishing: Can be Life or Death
Climate change, global hunger, pollution, cancer,
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang,
J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J;
Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X;
Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the
Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Our first DOI:
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the
Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew
it might take days for the lawyers at his company – Pacific Biosciences – to parse the agreements
governing how his team could use data collected on the strain. Luckily, one team had released its data
under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his
colleagues to join the international research effort and publish their work without wasting time on
1. Citations (~300)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science
1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastro-
intestinal infection in Hamburg in Germany in May 2011. This
spread through several European countries and the US,
affecting about 4000 people and resulting in over 50 deaths. All
tested positive for an unusual and little-known Shiga-toxin–
producing E. coli bacterium. The strain was initially analysed by
scientists at BGI-Shenzhen in China, working together with
those in Hamburg, and three days later a draft genome was
released under an open data licence. This generated interest
from bioinformaticians on four continents. 24 hours after the
release of the genome it had been assembled. Within a week
two dozen reports had been filed on an open-source site
dedicated to the analysis of the strain. These analyses
provided crucial information about the strain’s virulence and
resistance genes – how it spreads and which antibiotics are
effective against it. They produced results in time to help
contain the outbreak. By July 2011, scientists published papers
based on this work. By opening up their early sequencing
results to international collaboration, researchers in Hamburg
produced results that were quickly tested by a wide range of
experts, used to produce new knowledge and ultimately to
control a public health emergency.
1st Nanopore MinION E. coli
genome released via GigaDB
10th September 2014 (>125GB)
Data Note peer reviewed &
published 20th October
Immediately used for teaching
materials & real-time tools
Real time sequencing era needs real time publication!
• First nanopore clinical
amplicon sequencing paper (&
data) published March 2015
• Can determine virus/bacteria
strains in hours
• Already in use tackling Ebola
in West Africa
• “Living internet of things”
Rice 3K project: 3,000 rice genomes, 13.4TB public data
Feed The World With (Big) Data
OMERO: providing access
to imaging data
Already used by JCB.
View, filter, measure raw
images with direct links
from journal article.
See all image data, not just
cherry picked examples.
Download and reprocess.
Need for better handling of imaging data
...look but don't touch
Need for better handling of imaging data
Open & able to build upon
Taking citeable snapshots
Reward Sharing of Workflows
& DOIs for workflows
Facilitate reproducibility, reuse & sharing & publish outputs of:
Knitr, Sweave, Jupyter/iPython Notebook, etc.
Reward Open/Dynamic Workbooks
Reviewer (Christophe Pouzat):
“It took me a couple of hours to get the data, the few
custom developed routines, the “vignette” and to
REPRODUCE EXACTLY the analysis presented in the
manuscript. With few more hours, I was able to modify
the authors’ code to change their Fig. 4. In addition to
making the presented research trustworthy, the
reproducible research paradigm definitely makes the
reviewer’s job much more fun!”
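What "REPRODUCE EXACTLY" can mean in practice is sketched below: rerun the analysis from a recorded seed and compare a digest of the output with the one published alongside the paper. Everything here (the toy `run_analysis`, the `digest` and `reproduces` helpers) is hypothetical and stands in for a real notebook or workflow; it is not the reviewer's or the authors' actual code.

```python
import hashlib
import json
import random


def run_analysis(seed: int) -> dict:
    """Toy stand-in for an analysis: summarise a seeded random sample."""
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    mean = sum(sample) / len(sample)
    # Round so that the recorded result is a stable, printable value.
    return {"n": len(sample), "mean": round(mean, 12)}


def digest(result: dict) -> str:
    """SHA-256 of a canonical JSON rendering of the result."""
    canonical = json.dumps(result, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reproduces(seed: int, published_digest: str) -> bool:
    """True if rerunning the analysis gives the published digest."""
    return digest(run_analysis(seed)) == published_digest


# The "author" records a digest; the "reviewer" reruns and compares.
published = digest(run_analysis(42))
print(reproduces(42, published))
```

A digest match only shows that the computation is repeatable in the same environment; it says nothing about whether the analysis itself is sound, which is still the reviewer's job.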
• Is possible to push button(s) & recreate a result from
• Most published research findings are false. Or at
least have errors
• Reproducibility is COSTLY. How much are you willing
• Much easier to do this before rather than after
The cost of staying with the status quo?
• Ioannidis estimates that 85% of research resources are wasted.
• ~US$28B per year unnecessarily spent on preclinical research in US.
• Each retraction estimated to cost $400,000.
Death to the Publication. Long live the Research Object!
Manifesto for a reproducible publisher:
The era of the 1665-style publication is over
Reward replication not advertising
Credit FAIR data, not JIF-bait narrative
Granularity ≠ salami slicing. Ingelfinger is the enemy
We need a recognizable mark/badge/score(s) for replication
Separate category in ORCID for actually usable things
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Jesse Si Zhe
Amye Kenall (BMC)
Marco Roos (LUMC)
Mark Thompson (LUMC)
Jun Zhao (Lancaster)
Susanna Sansone (Oxford)
Philippe Rocca-Serra (Oxford)
Alejandra Gonzalez-Beltran (Oxford)
Our collaborators:team: (Case study)