Keynote address: "Ways and Needs to Promote Rapid Data Sharing" by Laurie Goodman of GigaScience.
Data is the base upon which all scientific discoveries are built, and data availability speeds the rate at which discoveries are made. Given that the overall goal of research is to improve human health and our environment, waiting to release data until after the first publication (sometimes a matter of years) is unacceptable. Myriad issues impede researchers from sharing data openly and, most importantly, rapidly, including a lack of incentives (no credit, limited funding benefits, and little impact on career advancement) and cultural issues (the fear of being scooped). However, scientific publishers, as the communicators of science and a key mechanism by which a researcher's productivity is measured, can and should play a central role in promoting data sharing. Data citation and data publication are just some of the ways we can support and encourage researchers who share data. Here, I will provide examples that make clear the need for publishers to play an active role in this process, along with potential ways to facilitate our ability to promote open and rapid data sharing. This is not easy, but it is essential.
This document discusses the opportunities of open data sharing in the big data era, including quicker responses to problems, more collaboration, and harnessing crowd-sourced efforts. It provides examples of open data enabling scientific progress, such as genome analysis that helped control an E. coli outbreak. Open data can provide credit to data sharers and incentivize open science. The document advocates for removing barriers to open data like paywalls and silos through initiatives like GigaDB and GigaScience that integrate publishing and data platforms to maximize data utility.
Scott Edmunds talk at G3 (Great GigaScience & Galaxy) workshop: Open Data: the reproducibility crisis, and the need for transparency (GigaScience, BGI Hong Kong)
Scott Edmunds talk at the G3 (Great GigaScience & Galaxy) workshop: "Open Data: the reproducibility crisis, and the need for transparency". Melbourne University, 19th September 2014
Scott Edmunds from GigaScience on "Publishing in the Open Data Era", at the "Open, Crowdsource and Blockchain Science!" hangout at Hackerspace.sg, 23rd March 2015
Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the "Big-Data" Era (GigaScience, BGI Hong Kong)
Scott Edmunds talk at the 7th International Conference on Genomics: "Channeling the Deluge: Reproducibility & Data Dissemination in the "Big-Data" Era". ICG7, Hong Kong, 1st December 2012
Scott Edmunds, ReCon 2015: Beyond Dead Trees, Publishing Digital Research Obj... (GigaScience, BGI Hong Kong)
The document discusses problems with the current scholarly publishing and incentive systems, including a lack of access to supporting data and computational methods, "gaming" of the peer review system through fake journals and referees, and increasing retractions over time. It proposes that new incentives are needed to reward open data sharing, transparent methods, reproducible research objects, and other practices that improve verification and reuse of findings. The GigaScience journal and associated platforms aim to address these issues through data publishing, open review, and integrated sharing of datasets, software, workflows and results.
1. GigaScience is a new open access journal and database focused on publishing and hosting large-scale genomic and other "big data" sets to promote sharing, reproducibility, and reuse.
2. The journal aims to address incentives for data sharing by providing data producers credit through DOIs for datasets and enabling attribution and impact tracking when data is cited.
3. As an example, genomic data from the 2011 E. coli outbreak in Germany was rapidly shared on the journal's website under an open license and assigned a DOI to allow analysis and citation by researchers worldwide working to understand the epidemic.
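The DOI-based credit mechanism in point 2 can be sketched as a small citation formatter. This is an illustrative sketch, not GigaDB's actual tooling: the function name and the metadata record below are simplified assumptions (10.5524 is GigaDB's DOI prefix, but the author list and title here are placeholders).

```python
# Illustrative sketch (not GigaDB's actual tooling): formatting a
# DataCite-style citation so a dataset can be cited like a paper.
def format_data_citation(authors, year, title, publisher, doi):
    """Return a citation string for a dataset identified by a DOI."""
    author_str = "; ".join(authors)
    return f"{author_str} ({year}): {title}. {publisher}. https://doi.org/{doi}"

# Hypothetical, simplified metadata for a genome dataset release:
citation = format_data_citation(
    authors=["Li, R.", "et al."],
    year=2011,
    title="Genomic data from the 2011 E. coli outbreak strain",
    publisher="GigaScience Database",
    doi="10.5524/100001",
)
print(citation)
```

Because the citation carries a resolvable DOI, downstream tools can count it the same way article citations are counted, which is the attribution-tracking point made above.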
This document discusses the need for open science due to a reproducibility crisis in many scientific disciplines. It notes that many published findings cannot be replicated and estimates that at least two-thirds of published results in psychology and biomedicine may be incorrect. This represents a credibility crisis that undermines public trust in science. The document argues that adopting practices of open science such as preregistration, open data, and detailed documentation can help address this crisis by reducing biases, enabling replication, and increasing transparency and reproducibility. Open science is presented as a means of improving research quality and accelerating discovery for the benefit of both science and society.
This document summarizes a presentation by Nicole Nogoy from GigaScience about their journal, data platform, and database for large-scale data. GigaScience aims to enable more open access, collaboration and data sharing across disciplines by deconstructing research papers and providing credit for data, software and other digital outputs. It utilizes a big data infrastructure to integrate open access publishing with data and software publishing platforms. Examples are provided of data sets and analyses that have been published through GigaScience to maximize reuse and reproducibility.
This document provides an overview and introduction to the concepts and challenges of e-research. It begins by examining competing terms used to describe the transformation in research due to widespread digital technologies and networks. Key terms discussed include e-science, cyberinfrastructure, and e-research. The document then outlines the conceptual framework of the book, which is divided into sections on conceptualization, development, collaboration, visualization, data preservation and reuse, access and intellectual property, and case studies. Each chapter is briefly introduced. The concluding section notes areas for further research around chronicling transformations in scholarship and contextualizing changes within disciplinary cultures.
Jonathan Tedds Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The ..." (Jonathan Tedds)
This document discusses open access to research data and peer review of data publications. It notes that as a first step, data underpinning journal articles should be made concurrently available in accessible databases. The Royal Society report in 2012 advocated for all science literature and data to be online and interoperable. Key issues in linking data to the scientific record are data persistence, quality, attribution, and credit. The document provides examples from astronomy of data reuse leading to new publications and cites a study finding poor reproducibility of ecological data sets over time as data availability declines. It outlines different levels of research data from raw to processed to published and discusses initiatives for open data publication and peer review.
Reproducibility, argument and data in translational medicine (Tim Clark)
Failures in reproducibility and robustness of scientific findings are explored from statistical, historical, and argumentation theory perspectives. The impact of false positives in the literature is connected to failures in T1 and T2 biomedical translation, and is shown to have a significant impact on the costs of therapeutic development and availability of needed treatments to the public. Technological and social approaches to resolve these issues are presented. "Reproducibility" initiatives are critiqued as unsustainable and non-authoritative; improved requirements and methods for scientific communication of findings including data, methods and material are supported as the best approaches for improved reproducibility.
Scott Edmunds talk at ODHK.meet.26: Open Science Data = Open Data (a rant in ...)
The document discusses the benefits of open science data and argues that open data is important for addressing issues like climate change, disease outbreaks, and environmental problems. It provides an example where open genomic data from an E. coli outbreak in Germany was released under an open license and analyzed by researchers around the world, leading to important findings that helped control the outbreak. The document advocates for more open access and open data policies in Hong Kong to maximize the benefits of research and address issues like a lack of transparency in China.
Scientists used over 1 million home computers linked together to test potential drug compounds for treating anthrax. They were able to test 3.5 billion molecules in just 24 days, far more than any pharmaceutical company could test alone. The researchers identified 12,000 potential drug candidates, including antitoxins that could counter the lethal effects of the anthrax bacteria. This worldwide distributed computing project showed that harnessing vast computing resources in this way could lead to medical breakthroughs much faster than traditional methods.
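The throughput figures above are easy to sanity-check with a back-of-envelope calculation (the inputs are taken from the summary; the per-day and per-machine averages are our own arithmetic):

```python
# Back-of-envelope check of the distributed anthrax screening throughput.
molecules = 3.5e9          # compounds screened (from the summary)
days = 24                  # elapsed time (from the summary)
computers = 1_000_000      # volunteer home computers (from the summary)

per_day = molecules / days             # overall screening rate
per_machine = molecules / computers    # average workload per volunteer PC

print(f"{per_day:,.0f} molecules/day, {per_machine:,.0f} molecules per machine")
```

Roughly 146 million molecules per day overall, yet only a few thousand per individual machine, which is why volunteer computing made the scale feasible.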
This presentation was provided by Alberto Pepe of Authorea, during the NISO hot topic event "Preprints." The virtual conference was held on April 21, 2021.
The document describes the development of the Open Drug Discovery Teams (ODDT) mobile app, which aims to facilitate collaboration in drug discovery. The app aggregates open science data from sources like Twitter on topics related to rare and neglected diseases. It provides a magazine-style interface for browsing recent posts. The app and its backend were developed iteratively, with input from researchers during testing. The app harvests tweets with specific hashtags and allows users to endorse or reject posts. It can visualize chemical structures and tables linked from tweets. The goal is to connect researchers and data to help accelerate open drug discovery.
BGI training lecture: Scott Edmunds - Science 2.0, why new developments on th... (Scott Edmunds)
The document discusses how new developments on the web can help scientists share information more openly and collaboratively. It describes how tools like blogs, wikis, and social networks allow researchers to openly discuss data, findings, and ideas. As an example, it highlights how Chinese scientists rapidly shared genomic data on the 2011 E. coli outbreak online, enabling global crowdsourcing efforts that helped analyze and understand the outbreak more quickly. The document advocates for open science through open access, open source, and open data practices to accelerate discovery and make science fairer and more impactful.
This presentation was provided by Leslie McIntosh of Ripeta, during the NISO hot topic event "Preprints." The virtual conference was held on April 21, 2021.
Thesis Proposal, as presented for dissertation proposal defense (Heather Piwowar)
The slides I presented for my PhD proposal defense for my project, "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." Dept of Biomedical Informatics, University of Pittsburgh.
On Dec. 20th 2016, the HRB published their "Health Research In Action" booklet that detailed a small selection of recent success stories from their research funding portfolio which "...really show health research in action".
The corneal-limbal stem cell research work carried out at NICB (by Finbarr O’Sullivan and Prof. Martin Clynes) and which led to the first corneal-limbal stem cell transplant in Ireland (carried out by Mr. William Power of the RVEEH) on June 7th, 2016 got an honorable mention (Page 17)
1. The document discusses three areas of change in scholarly communication: public access to papers, treating papers as data, and dataset archiving. Attendees of iEvoBio are well-positioned to understand and guide these changes.
2. Preliminary results from a study on researcher attitudes towards data archiving show that some researchers are worried about others using their data without proper recognition or collecting their own data.
3. The key messages are that the world of scholarly communication is changing, and that attendees can help shape the future by raising their expectations, their voices, and their glasses to change the status quo.
Public Sharing of Research Datasets: A Pilot Study of Associations (Heather Piwowar)
Presented at the ASIST & ISSI Pre-Conference Symposium on Informetrics and Scientometrics on Nov 7, 2009
http://www.sois.uwm.edu/MetricsPreCon/program.html
This document discusses analyzing data about research data and datasets to better understand their impact. It notes that impact goes beyond just citations and includes many types of engagement like views, saves, discussions, recommendations by different groups. More metrics from different sources need to be exposed about datasets to analyze diverse impacts. The data and metrics also need to be more open through text mining and aggregators. This will help drive more awareness of different types of research products and changes in how they are valued.
No more waiting! Tools that work Today to reveal dataset use (Heather Piwowar)
This document discusses the need to better understand the impact of datasets beyond just citations. It notes that datasets can be engaged with in many ways, such as through views, saves, discussions, and recommendations, by various groups like researchers, teachers, students, and policymakers. It calls for exposing more metrics of engagement, supporting more tools for interacting with datasets at all stages, and making metrics and data more openly available to help reveal how datasets are being used.
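The many kinds of engagement listed above can be sketched as a simple aggregation across sources. The source and event names below are hypothetical illustrations, not any particular tracker's API:

```python
from collections import Counter

# Illustrative sketch: pooling engagement events for one dataset from several
# (hypothetical) sources, so impact is not reduced to citation counts alone.
def summarize_engagement(events):
    """events: iterable of (source, kind) pairs, e.g. ("figshare", "view")."""
    by_kind = Counter(kind for _, kind in events)
    by_source = Counter(source for source, _ in events)
    return {"by_kind": dict(by_kind), "by_source": dict(by_source)}

events = [
    ("figshare", "view"), ("figshare", "view"), ("figshare", "download"),
    ("mendeley", "save"), ("twitter", "discussion"), ("crossref", "citation"),
]
summary = summarize_engagement(events)
print(summary["by_kind"])
```

Exposing per-kind and per-source tallies like this is one way tools could reveal the views, saves, and discussions the summary argues are currently invisible.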
Text Mining Rights from Three Perspectives: Researcher (Heather Piwowar)
Presentation by Heather Piwowar at the Charleston Conference 2012 as part of the "Text Mining Rights from Three Perspectives" session with Teresa Lee and Judson Dunham
http://2012charlestonconference.sched.org/event/fefb0c29aa6bbf91521e35efc2dd151c
See Jud's slides at http://www.slideshare.net/judsondunham/three-perspectives-on-text-mining-publisher
Libraries empowering scholars (and scholarly communication) through #altmetrics (Heather Piwowar)
This document discusses how libraries can empower scholars and scholarly communication through altmetrics. It notes that traditional research evaluation focuses too much on impact factor and that altmetrics provide additional ways to measure impact, including social media mentions, citations in policy documents or Wikipedia. The document recommends that libraries can help by raising expectations of diverse metrics, advocating for their use in evaluation, and supporting altmetrics tools. This would help move evaluations away from a single-dimensional system and capture different types of research impact.
submission summary for #WSSSPE Policy session on Credit, Citation, and Impact
presentation by Heather Piwowar
November 2013
agenda: http://wssspe.researchcomputing.org.uk/
Heather Piwowar and Jason Priem presented on Depsy, a software tool for measuring the impact of research software using metrics like downloads, citations, and authorship. Depsy launched in November 2016 and gathered feedback, with the biggest request being to pull in new data from GitHub. Future plans for Depsy include adding more data sources, improving text mining, and integrating more tightly with libraries.io and Impactstory.
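The idea of scoring research-software impact from metrics like downloads and citations can be sketched as a weighted combination of signals. This is only a sketch of the general idea: Depsy's real scoring is more involved, and the weights and log scaling below are made-up assumptions.

```python
import math

# Illustrative sketch only (not Depsy's actual algorithm): fold several
# usage signals into one comparable number via a weighted log-scale sum.
def impact_score(downloads, citations, dependent_packages,
                 weights=(0.2, 0.5, 0.3)):
    """Weighted sum of log-scaled signals (the weights here are made up)."""
    signals = (downloads, citations, dependent_packages)
    return sum(w * math.log1p(s) for w, s in zip(weights, signals))

print(round(impact_score(downloads=10_000, citations=50, dependent_packages=12), 2))
```

Log scaling keeps a package with millions of downloads from drowning out citation and reuse signals, which is one common design choice when combining heavy-tailed counts.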
Reproducible method and benchmarking publishing for the data (and evidence) driven era (GigaScience, BGI Hong Kong)
Scott Edmunds presentation on "Reproducible method and benchmarking publishing for the data (and evidence) driven era". The Silk Road Forensics Conference, Yantai, 18th September 2018
From Deadly E. coli to Endangered Polar Bear: GigaScience Provides First Cita... (GigaScience, BGI Hong Kong)
Slides from the GigaScience press conference at BGI's Bio-IT APAC meeting on the GigaScience website launch and the release of the first unpublished animal genomes from the database. Genomes include polar bear, penguin, pigeon and macaque. 6th July 2011
Participant-centered research design and “equal access” data sharing practice... (Jason Bobe)
Topics include:
What is "equal access" to data?
How have the roles of human subjects expanded over time?
Where has equal access to data been a success?
What are the barriers to equal access in research?
CCI32 - Citizen Participation in the Biological Sciences: A Literature Review... (Todd Suomela)
This document summarizes a literature review on citizen science projects in biological sciences. It finds that citizen science has grown significantly in the last decade, especially in environmental biology fields where large data collection is needed. Most projects aim to involve the public in research to increase science understanding while collecting reliable data. However, the quality of citizen-collected data is a primary concern that many papers addressed. Further research is needed to understand differences in citizen participation across biological subfields.
- The document discusses how biomedical research is entering a period of disruption due to factors like big data, digitization, and open science.
- Key points discussed include the history and changing nature of computational biomedicine, implications of large initiatives like the Precision Medicine Initiative, and how funders should respond by encouraging global open science and sharing infrastructure and policies.
- The author advocates for creating a "commons" environment to enable finding and reusing shared digital research objects according to FAIR principles in order to advance open collaborative science.
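A minimal sketch of what finding and reusing shared digital research objects "according to FAIR principles" can mean at the metadata level. The field names and the required set below are assumptions for illustration, not the formal FAIR metrics:

```python
# Minimal sketch (assumed field names, not the official FAIR metrics):
# check that a digital research object's record covers the basics of F, A, I, R.
REQUIRED = {
    "identifier": "F: a globally unique, persistent ID (e.g. a DOI)",
    "access_url": "A: a resolvable location for the object",
    "format": "I: a standard, open format others can parse",
    "license": "R: explicit terms so the object can be reused",
}

def missing_fair_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return sorted(k for k in REQUIRED if not record.get(k))

record = {"identifier": "10.1234/example",      # hypothetical DOI
          "access_url": "https://example.org/d1",
          "format": "text/csv",
          "license": ""}                         # missing reuse terms
print(missing_fair_fields(record))
```

A commons environment could run checks like this at deposit time, flagging objects (here, one lacking a license) before they enter the shared pool.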
Stat 1040, Recitation packet 1 (dessiechisomjj4)
This recitation packet poses exercises on study design: recognizing observational studies, proposing confounding factors that could explain an observed association (e.g. nighttime light exposure and childhood myopia), assessing whether studies were randomized or blind, and interpreting a randomized, double-blind study of the placebo effect.
Talk presented at CONFOA 2013 (Universidade de São Paulo, São Paulo, Brazil, 6-8 October 2013) in Panel III - Open science and research data management - by Prof. Dr. Peter Elias (UNITED KINGDOM, The Royal Society of UK).
Museum collections as research data - October 2019 (Dag Endresen)
This document discusses how natural history museums can embrace open science principles by making their collections openly available as research data. It provides context on initiatives like GBIF and DiSSCo that aim to publish biodiversity data according to common standards. While only around 5-10% of specimen records are currently digitized globally, the push for open access to publicly funded research means that museums need to develop new approaches to remain relevant providers of scientific resources. Open science practices like data sharing, citation and reuse can help address reproducibility issues and enable new discovery.
CINECA webinar slides: Ethics/ELSI considerations - From FAIR to fair data sh...CINECAProject
The FAIR principles – standing for Findability, Accessibility, Interoperability, and Reusability – have become the guiding principles for the wider sharing of research data in the life sciences. While FAIR provides guidance for the management of data as well as tools and workflows, the institutional conditions and organizational challenges associated with data sharing need to be taken into account to ensure responsible and fair data practices. This requires considering the context of legal requirements, for instance the principle of fairness and transparency in GDPR, expectations of research participants/data subjects, societal aspects and the “ethics work” that is an integral part of data flows, as well as fairness, equity and benefit sharing within transnational collaborations, which is of utmost importance. This webinar will, from the perspective of ethical, legal and societal implications (ELSI), discuss this broader context of responsible and fair data sharing associated with FAIR.
The “How FAIR are you” webinar series and hackathon aim at increasing and facilitating the uptake of FAIR approaches into software, training materials and cohort data, to facilitate responsible and ethical data and resource sharing and implementation of federated applications for data analysis.
The CINECA webinar series aims to discuss ways to address common challenges and share best practices in the field of cohort data analysis, as well as distribute CINECA project results. All CINECA webinars include an audience Q&A session during which attendees can ask questions and make suggestions. Please note that all webinars are recorded and available for later viewing.
This webinar took place on 15th April 2021 and is part of the CINECA webinar series.
For previous and upcoming CINECA webinars see:
https://www.cineca-project.eu/webinars
Doing more with fewer resources used to be a challenge mainly for academic scientists. This is unfortunately still true for academics, but we are now seeing others face many of the same challenges. With the squeeze on budgets and the cost cutting that followed recent worldwide economic challenges, the failure of many drugs to make it through the pipeline to market, and the increasing costs of the drug development process, the pharmaceutical industry is now, perhaps belatedly, having to accommodate the same challenge of doing more with less.
Opening up to Diversity talk by @phylogenomics at #UCDPHSAJonathan Eisen
This document summarizes the key points of an article on the diversity and composition of bacteria in indoor environments. It finds that the bacterial communities found indoors are less diverse than outdoors, and that mechanically ventilated rooms contain less diverse communities than window ventilated rooms. Certain building attributes like ventilation source, airflow rates, humidity and temperature are correlated with the diversity and types of bacteria present. Rooms with lower airflow and humidity have higher abundances of potential human pathogens. The study suggests that building design and operation can manage the indoor microbiome and species that may colonize the human microbiome.
Published on Jul 10, 2015 by PMR
Scholarly Publishing wastes huge amounts of valuable science. This presentation to the Public Library of Science suggests how we can work together to put this right
Similar to 2014 CrossRef Annual Meeting Keynote: Ways and Needs to Promote Rapid Data Sharing
Crossref LIVE: The Benefits of Open Infrastructure (APAC time zones) - 29th O...Crossref
In November 2020, Crossref formally adopted the “Principles of Open Scholarly Infrastructure” (POSI). POSI is a list of sixteen commitments that will now guide the board, staff, and Crossref’s development as an organisation into the future.
This webinar took place on the 29th October at 03:00 PM AEST (UTC+10) and covered:
- What are the Principles of Open Scholarly Infrastructure (POSI) and why are they needed?
- Why POSI is important for Crossref and how it will help realise the Research Nexus
- Open metadata and infrastructure services from Crossref
Presented in English by Cameron Neylon, Professor of Research Communications, Centre for Culture and Technology, at Curtin University, Amanda Bartell, Head of Member Experience at Crossref, and Vanessa Fairhurst, Community Engagement Manager at Crossref.
Crossref LIVE Chinese webinar: An Introduction to Crossref – 14 Oct 2021 Crossref
Crossref makes research outputs easy to find, cite, link, assess, and reuse. We are a not-for-profit membership organization that exists to make scholarly communications better.
Presented on the 14th October 2021, Ran Dang, Editorial Director of Atlantis Press Books, Springer Nature and Crossref Ambassador, together with Guo Xiaofeng of WanFang Data, provide an overview of Crossref including:
A brief history of Crossref
Our membership
Persistent identifiers (DOI) and the importance of metadata
The benefits of joining Crossref
How to join and get started
This webinar is relevant for new members, publishers, researchers, librarians, editors, and anyone who would like to know more about how to work with Crossref.
The webinar is presented in Chinese and lasts 60 minutes including time for questions.
In this webinar we give an overview of our Crossmark service, including:
What Crossmark is
How to use the service
The importance of keeping content up to date
How to find further help and support
Working with ROR as a Crossref member: what you need to knowCrossref
Webinar focusing on the importance of ROR and how to implement that as a Crossref member.
Covers:
What is ROR?
Why is Crossref supporting ROR?
Publisher use cases for ROR (from Hindawi)
How to become a ROR adopter
Discussion/Q&A
A recording of the presentation is available on the Crossref YouTube channel: https://www.youtube.com/watch?v=D9Mtqb64OEk
Преимущества и варианты использования метаданных в Crossref / The Value and ...Crossref
The webinar was held on September 17, 2021 at 10.00 (Moscow time UTC+3).
This online event was organized in collaboration with NEICON and takes place within the framework of the wider conference “Scientific information and scientific resources in the conditions of the lockdown 2020-2021”.
During the webinar we cover:
- Content Registration at Crossref
- The importance of Crossref metadata: Quality and Quantity
- How to improve your metadata
- Where to find further help and support
The webinar lasts approximately 60 minutes including time for questions. Presented in Russian.
'Similarity Check' webinar, in Spanish Crossref
Similarity Check is a Turnitin tool that helps editors detect plagiarism by comparing documents against a large database of more than 70 billion web pages and 135 million articles. Editors can upload documents to iThenticate to obtain a similarity report that analyses the matches and helps them determine whether plagiarism exists. Any publisher can participate by paying an administrative fee plus document-checking fees.
Crossref LIVE Indonesia: One Search Platform (Drs. Muhammad Syarif Bando pres...Crossref
Indonesia One Search is a single-entry portal for searching the public collections held by libraries and other providing institutions across Indonesia. More than 10 million titles have been gathered so far, including books, theses, journals, videos, images, and full-text documents from the participating libraries. Indonesia One Search continues to be developed with new features such as full-text extraction, content analysis, de…
Crossref LIVE Indonesia: The Future of Indonesian Journal Policy (with Dr. Lu...Crossref
Dr. Lukman provides an overview of journal publishing in Indonesia. Presented in Indonesian.
This webinar was presented as part of the Crossref LIVE Indonesia webinar series from the 13th - 15th July 2021.
Crossref LIVE Indonesia: The Value and Use of Crossref Metadata, CRLIVE-ID 15...Crossref
This webinar was presented in English by Crossref staff Vanessa Fairhurst and Ginny Hendricks on the 15th July 2021 as part of a series of Crossref LIVE Indonesia webinars.
This webinar covers:
- A quick re-cap of content registration
- What metadata you can send to Crossref
- How your metadata is used in Crossref tools and services and in the wider academic community
- How you can use our Participation Reports tool to assess and improve your metadata records at Crossref
The content is relevant for Crossref members, particularly new members, and anyone who would like to know more about how to work with Crossref and how we fit into the wider scholarly community.
Crossref LIVE Indonesia: Content Registration at Crossref, CRLIVE-ID 14 July ...Crossref
This webinar was presented in English by Crossref staff Vanessa Fairhurst and Amanda Bartell on the 14th July 2021 as part of a series of Crossref LIVE Indonesia webinars.
This webinar covers:
- What is a DOI
- What do we mean by metadata
- Different content types you can register at Crossref
- Different ways for you to register your content at Crossref (including a demo of the web deposit form and OJS Crossref plug-in)
- How to make corrections or additions to your metadata
- What happens if content moves to a different publisher
The content is relevant for Crossref members, particularly new members, and anyone who would like to know more about how to work with Crossref and how we fit into the wider scholarly community.
Crossref LIVE Indonesia: An Introduction to Crossref, CRLIVE-ID 13 July 2021Crossref
This webinar was presented by Crossref staff Vanessa Fairhurst and Rachael Lammey on the 13th July 2021 as part of a series of Crossref LIVE Indonesia webinars.
This webinar covers:
- A brief history of Crossref
- Who are our members
- How to join Crossref
- Persistent identifiers (DOI) and related metadata
- What are the benefits of joining Crossref?
- Why publishers (and other organizations) around the world join Crossref
The content is relevant for Crossref members, particularly new members, and anyone who would like to know more about how to work with Crossref and how we fit into the wider scholarly community.
Crossref İçerik Kaydı Webinarı, Türkçe | Content Registration at Crossref, ...Crossref
Content Registration at Crossref. Webinar held on Tuesday, June 8th at 14:00 Turkey (UTC+3).
Presented by Crossref Turkish Ambassador Haydar Oruç, the webinar included an overview of how to register content at Crossref and the importance and use of scholarly metadata.
Agenda:
- Content registration tools
- Importance of accurate, comprehensive and up-to-date metadata
- How to update and fix metadata records
- Ways to get further help and support
Webinar held on 8 June 2021
The webinar is open to anyone who wants to learn how to work with Crossref and to share Crossref content with the wider academic community; it is particularly relevant for Crossref members, especially new members.
Metadata for the Research Community Crossref
Members of the Crossref community team present a workshop to discuss:
• Introduction to Crossref
• DOIs and content registration
• Metadata for the research community
Content Registration with Crossref – a webinar in Arabic | Content Registr...Crossref
This webinar was held on Wednesday 17 March 2021 at 14.00 UAE (UTC+4).
Mohamad Mostafa, Publishing Editor at Knowledge E and Crossref Ambassador, provided an overview of how to register content with Crossref including:
- Tools for registering content
- The importance of accurate, comprehensive and up-to-date metadata
- How to make updates and corrections to metadata records
- The importance of conflict and resolution reports
- Ways to get further help and support
This webinar content is relevant for Crossref members, publishing service providers, researchers, librarians, editors, and anyone who would like to know more about how to work with Crossref.
Presented by Vanessa Fairhurst, Paul Davis and Rachael Lammey on March 3rd 2021.
The webinar covers how to create and correctly display a DOI, the importance of metadata and the various tools for content registration including the web deposit form, Metadata Manager and OJS plug-ins.
This document provides an overview of CrossMark, a CrossRef initiative to help readers determine if a scholarly work has been updated. CrossMark uses a logo to identify publisher-maintained versions of content. Clicking the logo tells the reader if there have been any updates and directs them to the publisher's version. It can also display additional publication record information like funding sources, conflicts of interest, or peer review details. The CrossMark pilot launched in summer 2011 and is being implemented more widely, with marketing support and training webinars for publishers.
Participation reports webinar December 2020Crossref
During this webinar we’ll take you on a tour of our Participation Reports, which give Crossref members and the wider scholarly community a clear, visual snapshot of the metadata that each one of our members is registering with Crossref.
Registering richer metadata makes your content more useful and more discoverable to researchers and the wider scholarly community. This webinar was held on 8th December 2020.
Participation reports webinar November 2020Crossref
During this webinar we’ll take you on a tour of our Participation Reports, which give Crossref members and the wider scholarly community a clear, visual snapshot of the metadata that each one of our members is registering with Crossref.
Registering richer metadata makes your content more useful and more discoverable to researchers and the wider scholarly community. This webinar was held on 18 November 2020.
Introduction to Crossmark/Crossmark: O que é e como usarCrossref
"Crossmark: What it is and how to use it" - The webinar will be presented in Brazilian Portuguese - October 14, 2020.
Crossmark provides publishers with a standardized way of communicating important updates to content and ensuring that the information in the published article is current and secure.
The presentation will show what is needed to implement the Crossmark, technical requirements and the opportunity to answer questions.
The content is interesting for those who are members of Crossref, publishing services companies, researchers, librarians, funding agencies and members of editorial committees of scientific journals.
The presentation is provided by the Crossref ambassadors in Brazil, Bruna Erlandsson and Edilson Damasio.
Webinar held 6 October 2020.
The webinar is relevant for new and existing Crossref members, publishers, editors, researchers, service providers, hosting platforms, funders, librarians; really anyone interested in finding out a bit more about what Crossref is and does.
This webinar covers:
• How to register content with Crossref
• How to make updates to your metadata in order to make changes, corrections, or to add more detail
• Participation reports
• Additional services and where to find help.
Sessions presented in English by Crossref staff.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect their personal devices and information.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
2014 CrossRef Annual Meeting Keynote: Ways and Needs to Promote Rapid Data Sharing
1. Ways and Needs to Promote
Rapid Data Sharing
Laurie Goodman, PhD
Editor-in-Chief GigaScience
ORCID ID: 0000-0001-9724-5976
2. Scientific Communication
Via Publication
• Scholarly articles are merely advertisements of scholarship. The actual scholarly artefacts, i.e. the data and computational methods that support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)
• Core scientific statements or assertions are intertwined and hidden in conventional scholarly narratives
• Lack of transparency; lack of credit for anything other than "regular" dead-tree publication
3. A Tale of Two Bacteria
1. On May 2, 2011, German doctors reported the first case of an E. coli infection accompanied by hemolytic-uremic syndrome
2. On May 21, 2011, the first death occurred from this bacterium (denoted E. coli O104:H4)
3. On June 3, 2011, BGI completed a draft sequence of E. coli O104:H4 from a sample provided by doctors at the University Medical Centre Hamburg-Eppendorf
4. At this point, the leaders at BGI discussed whether to release the sequence data immediately, and what the potential repercussions of doing so might be
The question arose:
If the data were released now, would it affect their ability to publish later?
4. A Tale of Two Bacteria
• In one world, the researchers, concerned about their ability to publish because publication is the way to obtain recognition and grants (which are essential for them to work), waited.
The first publication appeared on July 29th.
• In another world, the researchers, who decided public health was more important than obtaining a publication, released the data immediately.
The first publication appeared on July 29th, but it was not from the group that released the data (though information on that data was included).
5. Whether the concern about the ability to publish if data are released early is real or imagined,
researchers act on that concern
6. Whether the concern about the ability to publish if data are released early is real or imagined,
researchers act on that concern
7. These data were put on an FTP server under a CC0 waiver and also given a DOI to make access 'permanent'
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
8.
9.
10. Downstream consequences:
1. Citations (~180)
2. Therapeutics (primers, antimicrobials)
3. Platform comparisons
4. An example for faster & more open science
"Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling."
11. 1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastro-intestinal
infection in Hamburg in Germany in May 2011. This
spread through several European countries and the US,
affecting about 4000 people and resulting in over 50 deaths. All
tested positive for an unusual and little-known Shiga-toxin–
producing E. coli bacterium. The strain was initially analysed by
scientists at BGI-Shenzhen in China, working together with
those in Hamburg, and three days later a draft genome was
released under an open data licence. This generated interest
from bioinformaticians on four continents. Within 24 hours of the
release of the genome, it had been assembled. Within a week
two dozen reports had been filed on an open-source site
dedicated to the analysis of the strain. These analyses
provided crucial information about the strain’s virulence and
resistance genes – how it spreads and which antibiotics are
effective against it. They produced results in time to help
contain the outbreak. By July 2011, scientists published papers
based on this work. By opening up their early sequencing
results to international collaboration, researchers in Hamburg
produced results that were quickly tested by a wide range of
experts, used to produce new knowledge and ultimately to
control a public health emergency.
12. All that aside
Can we all agree that releasing the E. coli data
ahead of publication was ‘good’?
At least from a public health perspective.
Here are the numbers for the E. coli 2011 outbreak:
In total, ~4,000 people were infected and 53 died.
13. From a Public Health perspective…Deaths
Worldwide*
Infectious Disease
Measles: 122,000 per year
Hepatitis C-related liver disease: 350,000-500,000 per year
Malaria: 627,000 per year
HIV/AIDS: 1.4-1.7 million per year
Non-communicable, with genetic predisposition
Prostate cancer: 307,000 per year
Breast cancer: 522,000 per year
Suicide: 800,000 per year
Diabetes: 1.5 million per year
Cancer: 8.2 million per year
Cardiovascular Disease: 17.5 million per year
Non-genetic/Non-infectious
Pesticide Poisoning: 250,000 per year
Malnutrition: 2.8 million children (under 5) per year
*World Health Organization Fact Sheets http://www.who.int/en/
15. Sharing aids fields…
Rice v Wheat: consequences of publicly available genome data
[Bar chart comparing rice and wheat publication counts; y-axis 0–700]
Every 10 datasets collected contribute to at least 4 papers in the
following 3 years.
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature 473(7347), 285. DOI: 10.1038/473285a
16. Sharing aids authors…
Sharing Detailed Research
Data Is Associated with
Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007)
PLoS ONE 2(3): e308.
doi:10.1371/journal.pone.0000308
17. Lack of Sharing Impacts Reproducibility
Out of 18 microarray papers, results
from 10 could not be reproduced
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
18. Sharing can reduce retractions
>15× increase in retractions in the last decade
Strong correlation of “retraction index” with
higher impact factor
At the current rate of increase, by 2045 as
many papers will be retracted as
are published!
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
19. Data Sharing Hurdles
?
If only it were easy…
There are numerous reasons why researchers
do not share data,
the majority of which are good reasons.
20. Wiley Researcher Data Insights Survey
Our objective was to establish a baseline view of data sharing
practices, attitudes, and motivations globally, with participation
from researchers in every scholarly field.
In March 2014, more than 90,000 researchers around the world
were invited to participate in Wiley’s Researcher Data Insights
Survey. Participants were researchers who had published at least
one journal article in the past year with any publisher.
We received an overwhelming 2,886 responses from around the
world.
Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
21. Wiley Researcher Data Insights Survey
Key Findings
• Most researchers are sharing their data.
• Those not sharing have a variety of reasons.
• Data that’s being shared typically is <10 GB.
• The most common type of data that is being
shared is flat, tabular data (.csv, .txt, .xl)
• Data is usually saved on hard drives.
Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
22. Wiley Researcher Data Insights Survey
Why Researchers Do Not Share
• Intellectual property or confidentiality issues (59%)
• Concerned research might be “scooped” (39%)
• Concerns about misinterpretation or misuse (32%)
• Concerns about attribution/citation credit (31%)
• Ethical concerns (24%)
• Insufficient time/resources (19%)
• Funder/institution does not require sharing (13%)
• Lack of funding (13%)
• Not sure where to share (5%)
• Not sure how to share (3%)
Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
See also:
http://exchanges.wiley.com/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont/
http://scholarlykitchen.sspnet.org/2014/11/11/to-share-or-not-to-share-that-is-the-research-data-question/
23. How Can Publishers Promote Data Sharing
Researchers are never so captive as when they are publishing
But we need to help — not just harass.
Carrots and Sticks
And- why us?
– Create Journal Data Release Policies
– Check Data Release Policy is followed
– Find Ways to Aid Researchers in Releasing Data
– Consider ways to support/protect researchers
who do share ahead of publications
– Promote Data Citation
25. Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to
public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a
particular data set would enable appropriate attribution for those
who share. “
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing
(Toronto International Data Release Workshop)
“Data producers benefit from creating
a citable reference, as it can
later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
26. Genomics Data Sharing Policies…
Bermuda Accords 1996/1997/1998:
1. Automatic release of sequence assemblies within 24 hours.
2. Immediate publication of finished annotated sequences.
3. Aim to make the entire sequence freely available in the public domain for
both research and development in order to maximise benefits to society.
Fort Lauderdale Agreement, 2003:
1. Sequence traces from whole genome shotgun projects are to be
deposited in a trace archive within one week of production.
2. Whole genome assemblies are to be deposited in a public nucleotide
sequence database as soon as possible after the assembled sequence
has met a set of quality evaluation criteria.
Toronto International data release workshop, 2009:
The goal was to reaffirm and refine, where needed, the policies related to
the early release of genomic data, and to extend, if possible, similar data
release policies to other types of large biological datasets – whether from
proteomics, biobanking or metabolite research.
27. Sharing Data from Large-scale Biological Research Projects: A System of
Tripartite Responsibility (From the Fort Lauderdale Meeting 2003)
http://www.genome.gov/pages/research/wellcomereport0303.pdf
28. Citing Data Isn’t New
The Physical Sciences have been doing this for a while
DataCite and DOIs
Aim to “increase acceptance of research data as
legitimate, citable contributions to the
scholarly record”,
on the premise that
“data generated in the course of research
are just as valuable to the ongoing
academic discourse as papers and
monographs”.
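DataCite's model can be made concrete. The sketch below is illustrative only — the function and the truncated author list are my own; just the DOI, title, publisher, and year come from the dataset discussed earlier — showing the kind of mandatory metadata kernel that accompanies a dataset DOI:

```python
# Sketch of a minimal DataCite-style metadata kernel for a dataset DOI:
# identifier, creators, title, publisher, and publication year. This builds
# a plain dict for illustration; it is not a DataCite API call.

def datacite_kernel(doi, creators, title, publisher, year):
    """Assemble the core mandatory DataCite metadata fields as a dict."""
    return {
        "identifier": {"identifierType": "DOI", "identifier": doi},
        "creators": creators,
        "title": title,
        "publisher": publisher,
        "publicationYear": year,
    }

record = datacite_kernel(
    doi="10.5524/100001",
    creators=["Li, D", "Xi, F"],  # author list truncated for illustration
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    year=2011,
)
```

It is exactly this small, structured record that lets citation indexes treat a dataset like any other citable object.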
29. How We Envision Research Publication
(Communicating Science)
Open-access journal Data Publishing Platform
Data Sets in
GigaDB
Analyses in
GigaGalaxy
Paper in
GigaScience
Data Analysis Platform
30. Other Journals are now doing similar
This is most commonly done in the form of a Data Paper
rather than a release of data that is citable in itself.
• A Data Paper is effectively a Description of the Data
• Other journals that do Data Publishing as a formal
paper type
• F1000 Research (launched in 2012)
• Has Data papers as one of several types of papers
• Scientific Data (launched in 2014)
• Solely publishes Data Descriptors
• There are more…
31. Making the Data Itself Citable
We provide a linked database
The data are then directly linked to the paper, but can also be cited
separately through a Data DOI.
We can do this because we have a collaboration between BMC
(which handles the standard paper publication) and BGI (which has
enormous data storage capacity).
However: there are many community-available databases, so in
principle any journal can do this by taking advantage of such
available resources.
These include the usual suspects: EBI, NCBI, DDBJ, etc.
Databases that take all data types and provide Data DOIs: Dryad,
FigShare, etc.
There are also numerous smaller community databases specific to
different fields or data types.
32. For data citation to work, needs:
• Acceptance by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
35. Back to E. coli O104:H4
• As noted: articles on these early released and
citable data were published
• Also, the early releasers were not the first to
publish
• Nor were the data cited
37. The journal did not approve of inclusion of the data citation.
Nor was any indication given of where the genome information could be found.
39. This report was the first to be published, and it
included and used information from the
crowd-sourced release as well as the other early release.
Nowhere in the paper is there any indication of
where to obtain these data.
Nor is there an indication of where to obtain the
sequence data they generated.
40. This group made their O104:H4 sequence available
in the NCBI database at the time of completion, prior to publication.
Though no link to the Accession Number is easily found in the paper.
41. This report DID include a reference for the data
(even though they did not use it in their analysis)
43. For data citation to work, needs:
• Acceptance by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
45. • Data submitted to NCBI databases:
- Raw data SRA:SRA046843
- Assemblies of 3 strains Genbank:AHAO00000000-AHAQ00000000
- SNPs dbSNP:1056306
- CNVs, InDels, and SVs dbVar:nstd63
• Submission to public databases complemented by
its citable form in GigaDB (doi:10.5524/100012).
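Accessions like those above are what make the public-database copies programmatically reachable. A hedged sketch — URL construction only, no network call is made, and the database-name mapping is my reading of the slide's accessions — using NCBI's public E-utilities endpoint:

```python
from urllib.parse import urlencode

# Sketch (offline): building NCBI E-utilities esearch URLs for the
# accessions listed on the slide. EUTILS is NCBI's public eutils endpoint;
# the db/accession pairs mirror the slide's SRA and dbVar entries.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term):
    """Return an esearch query URL for a given database and accession."""
    return f"{EUTILS}?{urlencode({'db': db, 'term': term})}"

for db, acc in [("sra", "SRA046843"), ("dbvar", "nstd63")]:
    print(esearch_url(db, acc))
```

The GigaDB DOI then complements these: one citable landing page that gathers the data types the standard archives split across databases.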
52. The polar bear DATA were released, prepublication, in 2011.
They were used and cited in the following studies before the main paper on
the sequencing was published:
Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old
and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7.
doi:10.1126/science.1216424.
Cahill, JA et al., Genomic evidence for island population conversion resolves
conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345.
doi:10.1371/journal.pgen.1003345.
Morgan, CC et al., Heterogeneous models place the root of the placental
mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56.
doi:10.1093/molbev/mst117.
Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears
(Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus)
Derived from Genome Sequences. J Hered. 2014; 105(3):312-23.
doi:10.1093/jhered/est133.
Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive
Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4.
doi:10.1093/molbev/msu109
56. Removing data citations from the
references
One journal informed the authors that non-reviewed material could
not be cited in the references of the paper.
Another journal stripped the data citation from the references, and
went an extra step: it changed the citation in the Data Availability
section to the URL that the DOI resolved to at that time.
We happened to know about this one, and were able to create a forward to the
DOI'd page when the URL broke after we moved our database platform.
Note: much of this was due to a standard operating procedure in the
production department.
Lesson: if you decide to include Data Citations, tell your entire team.
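The "forward to the DOI'd page" mentioned above is, mechanically, just a redirect table maintained on the publisher side so the DOI keeps resolving after a platform move. A minimal sketch, with hypothetical paths (the real GigaDB URLs may differ):

```python
# Sketch of the forwarding that keeps a DOI 'permanent' after a platform
# move: retired database URLs are mapped to the pages the DOI should now
# resolve to. Both URLs below are illustrative, not real GigaDB paths.
LEGACY_REDIRECTS = {
    "/old-platform/dataset/100012": "https://gigadb.org/dataset/100012",
}

def forward(path):
    """Return the new location for a retired URL, or None if unknown."""
    return LEGACY_REDIRECTS.get(path)
```

The fragility the slide describes is exactly what happens when a journal cites the raw URL instead of the DOI: the DOI's redirect can be repaired centrally; a hard-coded URL in a published PDF cannot.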
57. For data citation to work, needs:
• Acceptance by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
59. For data citation to work, needs:
• Acceptance by journals.
• Data+Citation: inclusion in the references.
• Tracking by citation indexes.
• Usage of the metrics by the community…
This is a work in progress…
60. Data Citation Really is a Major Incentive
On Wednesday this week, we released the genome sequences
of 3,000 rice strains (13.4 TB of data)
• These data were also deposited in the NIH SRA repository
• So why did we do it too?
1. It is linked directly to the Data Paper that provides
details of data production, quality, and basic analysis
2. Authors were hesitant to release these data (a HUGE
community resource) prior to the analysis paper
publication (which, for 3000 strains… would take
years…). The opportunity to have these data citable
(and trackable) encouraged the authors and led to
their releasing these data and doing so in
collaboration with GigaScience’s Biocurator
The 3,000 Rice Genomes Project. (2014) GigaScience 3:7 http://dx.doi.org/10.1186/2047-217X-3-7;
The 3000 Rice Genomes Project (2014) GigaScience Database. http://dx.doi.org/10.5524/200001
61. No: your data is not too large to share
Rice 3K project: 3,000 rice genomes, 13.4 TB public data
IRRI GALAXY
62. Beyond Data Citation
Reviewing Data
Data Release policies include the need to
help authors
Data availability without metadata is
practically useless
63. Beyond Data Citation
Reviewing Data
It’s too hard- we can’t ask our reviewers
to do that!
Use Data Reviewers
64. Example in Neuroscience
1. Neuroscience Data
are not typically
shared
2. For most papers: Data
AND Tools are not
typically made
available to the
reviewers
3. Journal Editors think
Reviewers will not
want to review data
GigaScience 2014, 3:3 doi:10.1186/2047-217X-3-3
65. Example in Neuroscience
• Neuroscience Data are not typically shared
• Author Dr. Stephen Eglen said: “One way of encouraging neuroscientists to
share their data is to provide some form of academic credit.”
• We hosted with a DOI: 366 recordings from 12 electrophysiology datasets
• GigaDB is included in Thompson Reuters Data Citation Index
• Data AND Tools are not typically made available to the reviewers
• We made manuscript, data and tools all available to the reviewers.
• We make sure to include reviewers who are able to properly assess the data
itself and rerun the tools
• To reduce burdens- we sometimes select a reviewer who ONLY looks at the
data.
• Journal Editors think Reviewers will not want to review data
• What Reviewer Dr. Thomas Wachtler said: “The paper by Eglen and
colleagues is a shining example of openness in that it enables replicating the
results almost as easily as by pressing a button.”
• What Reviewer Dr. Christophe Pouzat said: “In addition to making the
presented research trustworthy, the reproducible research paradigm
definitely makes the reviewer’s job more fun!”
66. Beyond Data Citation
Data Release policies include the need to
help authors
Collaborations
With data repositories
With other journals
67. Consider Cross Journal Support
Competition is good…
….but sometimes we should collaborate
for the community good
• PLoS recent data deposition policies have led to
community concerns about feasibility.
• We support (and applaud) this …we have an even stricter
data deposition policy
• But PLoS ONE received a submission that was a
comparative study of earthworm morphology and
anatomy using a 3D non-invasive imaging technique
called micro-computed tomography (or microCT)… and
there was no good place to put this
• These data are extremely complex: videos and multiple files, with
several folders of ~10 GB
68. Consider Cross Journal Support
• GigaScience and PLOS ONE collaborated. They published
the main article; we published a Data Note describing the
data itself and hosted all the data on GigaDB under
separate citation.
• With our Aspera connection, reviewers could download
even the ~10 GB folders in ~1/2 hour
• Reviewer Dr. Sarah Faulwetter noted the usefulness of
having these data available, saying: “Instead of having to
go through the lengthy process of obtaining the physical
specimen from a museum, I can now download a fairly
accurate representation from the web.”
Lenihan et al (2014). GigaScience, 3:6 http://dx.doi.org/10.1186/2047-217X-3-6; Lenihan, et al (2014): GigaScience Database.
http://dx.doi.org/10.5524/100092; Fernández et al (2014) PLOS ONE 9 (5) e96617 http://dx.doi.org/10.1371/journal.pone.0096617
69. Beyond Data Citation
Data availability without metadata is
practically useless
Engage/Employ/Interact with Curators
70. Challenges for the future…
1. Lack of interoperability/sufficient metadata
2. Long tail of curation (“Democratization” of “big-data”)
?
71. Think about what you do… and what you can do…
• Promote- rather than inhibit- prepublication data sharing
• Promote Data Citation in the reference section
– incentivizes data release
– Makes it easier for readers to find
• Promote Data Sharing upon publication
– Consider your data release policies
• Form collaborations with repositories to aid authors in depositing
their work
– Identify community organizations with metadata standards
• Make data available for reviewers (author website, community
repositories, Dryad and similar; your publisher?)
– at least do a sanity check
– Use “data reviewers”
No- this isn’t easy, but do what you can now
And work toward the rest
Evolve
72. It’s Time to Move Beyond
Dead Trees
[Images: journals from 1665, 1812, and 1869]
73. Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Rob Davidson, Data Scientist
Xiao (Jesse) Si Zhe, Database Developer
Amye Kenall, Journal Development Manager
Contact us:
editorial@gigasciencejournal.com
database@gigasciencejournal.com
Follow us:
@GigaScience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog
www.gigasciencejournal.com
www.gigadb.org
Editor's Notes
Thank you very much to the Meeting Organizers for Inviting me to Speak.
Happily we live in the 2nd world, but the fact that it even gave them pause is telling.
Isn’t hyperbole fun?
And a paper by the group was published in a high impact journal even though the data were released early in a citable format
The data were released on an FTP server, and were given a data DOI should the data need to be cited in a more permanent fashion.
Raw data has been submitted to the SRA, the assembly submitted to GenBank (no number), SV data to dbVar (it’s the first plant data they’ve received). Complements the traditional public databases by having all these “extra” data types, it’s all in one place, and it’s citable.
(A) Cumulative base pairs in INSDC over time, excluding the Trace Archive (raw data from capillary sequencing platforms). (B) Base pairs in INSDC over time since 1980, broken down into selected data components. Cumulative data volume in base pairs broken down into assembled sequence (whole genome shotgun methods and others) and raw next-generation-sequence data.