Ways and Needs to Promote 
Rapid Data Sharing 
Laurie Goodman, PhD 
Editor-in-Chief GigaScience 
ORCID ID: 0000-0001-9724-5976
Scientific Communication 
Via Publication 
• Scholarly articles are merely advertisement of scholarship. 
The actual scholarly artefacts, i.e. the data and 
computational methods, which support the 
scholarship, remain largely inaccessible --- Jon B. 
Buckheit and David L. Donoho, WaveLab and reproducible 
research, 1995 
• Core scientific statements or assertions are intertwined and 
hidden in the conventional scholarly narratives 
• Lack of transparency, lack of credit for anything other than 
“regular” dead tree publication
A Tale of Two Bacteria 
1. On May 2, 2011, German doctors reported the first case of an 
E. coli infection accompanied by hemolytic-uremic 
syndrome 
2. On May 21, 2011, the first death from this bacterium occurred 
(denoted E. coli O104:H4) 
3. On June 3, 2011, BGI completed a draft sequence of E. coli 
O104:H4 from a sample provided by doctors at the University 
Medical Centre Hamburg-Eppendorf 
4. At this point, the leaders at BGI held a discussion about 
whether to release the sequence data immediately: what were 
the potential repercussions of doing so? 
The question arose: 
If the data were released now, would it affect 
their ability to publish later?
A Tale of Two Bacteria 
• In one world, the researchers (who were concerned about their 
ability to publish, as this is the way to obtain recognition and 
obtain grants, which are essential for them to work) waited. 
The first publication appeared on July 29th 
• In another world, the researchers (who decided public health 
was more important than obtaining a publication) released the 
data immediately. 
The first publication appeared on July 29th, but it was not 
from the group that released the data (though information on 
those data was included). 
We live in World 2 and what followed was exciting 
and had broad repercussions.
Whether the concern about the ability to publish 
if data are released early is real or imagined 
Researchers act on that concern
Note: Harmsen’s group DID share the O104:H4 outbreak strain data immediately 
upon sequencing. The data referred to here were from the 2001 strain, which was believed 
to be the strain involved in a 2001 outbreak of a similar type. 
This slide is meant only to highlight that concerns about being scooped drive early 
sharing decisions. 
Given that the first paper published did use the early-available O104:H4 data, it would 
be expected that these data, had they been shared, would have been used in that 
paper as well.
These data were put on an FTP 
server under a CC0 waiver and were also 
given a DOI to make access 
‘permanent’ 
To maximize its utility to the research community and aid those fighting 
the current epidemic, genomic data is released here into the public domain 
under a CC0 license. Until the publication of research papers on the 
assembly and whole-genome analysis of this isolate we would ask you to 
cite this dataset as: 
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, 
J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; 
Peng, Y; Pu, F; Sun, Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; 
Chen, F; Yin, X; Song,Y ; Rohde, H; Li, Y; Wang, J; Wang, J and the 
Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium 
(2011) 
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI 
Shenzhen. doi:10.5524/100001 
http://dx.doi.org/10.5524/100001 
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to 
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
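A citation in the form requested above can also be assembled programmatically. The sketch below is only illustrative: `format_data_citation` is a hypothetical helper (it is not part of GigaDB or DataCite), and the author list is shortened here to the first three names on the slide plus "et al."

```python
def format_data_citation(authors, year, title, publisher, doi):
    """Return a data citation in the 'Authors (Year) Title. Publisher. doi' form."""
    author_text = "; ".join(authors)
    # The resolver link is derived from the DOI rather than from a database URL.
    return (f"{author_text} ({year}) {title}. {publisher}. "
            f"doi:{doi} http://dx.doi.org/{doi}")


# Author list abbreviated for illustration; the full list is on the slide above.
citation = format_data_citation(
    authors=["Li, D", "Xi, F", "Zhao, M", "et al."],
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(citation)
```

Keeping the DOI as the only locator in the citation means the reference does not depend on where the files are hosted.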
Downstream consequences: 
1. Citations (~180) 
2. Therapeutics (primers, antimicrobials) 
3. Platform Comparisons 
4. Example for faster & more open science 
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli 
strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days 
for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could 
use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that 
allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and 
publish their work without wasting time on legal wrangling.”
1.3 The power of intelligently open data 
The benefits of intelligently open data were powerfully 
illustrated by events following an outbreak of a severe gastro-intestinal 
infection in Hamburg in Germany in May 2011. This 
spread through several European countries and the US, 
affecting about 4000 people and resulting in over 50 deaths. All 
tested positive for an unusual and little-known Shiga-toxin– 
producing E. coli bacterium. The strain was initially analysed by 
scientists at BGI-Shenzhen in China, working together with 
those in Hamburg, and three days later a draft genome was 
released under an open data licence. This generated interest 
from bioinformaticians on four continents. 24 hours after the 
release of the genome it had been assembled. Within a week 
two dozen reports had been filed on an open-source site 
dedicated to the analysis of the strain. These analyses 
provided crucial information about the strain’s virulence and 
resistance genes – how it spreads and which antibiotics are 
effective against it. They produced results in time to help 
contain the outbreak. By July 2011, scientists published papers 
based on this work. By opening up their early sequencing 
results to international collaboration, researchers in Hamburg 
produced results that were quickly tested by a wide range of 
experts, used to produce new knowledge and ultimately to 
control a public health emergency.
All that aside 
Can we all agree that releasing the E. coli data 
ahead of publication was ‘good’, 
at least from a public health perspective? 
Here are the numbers for the E.coli 2011 Outbreak 
In total, ~4000 people were infected and 53 died
If so, then from a Public Health 
perspective… Consider Deaths Worldwide* 
Infectious Disease 
Measles: 122,000 per year 
Hepatitis C-related liver disease: 350,000-500,000 per year 
Malaria: 627,000 per year 
HIV/AIDS: 1.4-1.7 million per year 
Non-communicable, with genetic predisposition 
Prostate cancer: 307,000 per year 
Breast cancer: 522,000 per year 
Suicide: 800,000 per year 
Diabetes: 1.5 million per year 
Cancer: 8.2 million per year 
Cardiovascular Disease: 17.5 million per year 
Non-genetic/Non-infectious 
Pesticide Poisoning: 250,000 per year 
Malnutrition: 2.8 million children (under 5) per year 
*World Health Organization Fact Sheets http://www.who.int/en/ 
Clearly —from a public 
health perspective— early 
data sharing in every area, 
especially prior to 
publication — is essential.
Sharing Data is Essential for Many 
Reasons
Sharing aids fields… 
Rice v Wheat: consequences of publicly available genome data 
[Bar chart comparing rice and wheat; vertical axis runs from 0 to 700] 
Every 10 datasets collected contribute to at least 4 papers in the 
following 3 years. 
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment Nature, 473 
(7347), 285-285 DOI: 10.1038/473285a
Sharing aids authors… 
Sharing Detailed Research 
Data Is Associated with 
Increased Citation Rate. 
Piwowar HA, Day RS, Fridsma DB (2007) 
PLoS ONE 2(3): e308. 
doi:10.1371/journal.pone.0000308
Lack of Sharing Impacts Reproducibility 
Out of 18 microarray papers, results 
from 10 could not be reproduced 
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Sharing can reduce retractions 
>15X increase in last decade 
Strong correlation of “retraction index” with 
higher impact factor 
At the current rate of increase, by 2045 
as many papers will be retracted as 
published! 
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
Data Sharing Hurdles 
? 
If only it were easy… 
There are numerous reasons why researchers 
do not share data, 
the majority of which are good reasons
Wiley Researcher Data Insights Survey 
Our objective was to establish a baseline view of data sharing 
practices, attitudes, and motivations globally, with participation 
from researchers in every scholarly field. 
In March 2014, more than 90,000 researchers around the world 
were invited to participate in Wiley’s Researcher Data Insights 
Survey. Participants were researchers who had published at least 
one journal article in the past year with any publisher. 
We received an overwhelming 2,886 responses from around the 
world. 
Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
Wiley Researcher Data Insights Survey 
Key Findings 
• Most researchers are sharing their data. 
• Those not sharing have a variety of reasons. 
• Data that’s being shared typically is <10 GB. 
• The most common type of data that is being 
shared is flat, tabular data (.csv, .txt, .xl) 
• Data is usually saved on hard drives. 
Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
Wiley Researcher Data Insights Survey 
Why Researchers Do Not Share 
• Intellectual property or confidentiality issues (59%) 
• Concerned research might be “scooped” (39%) 
• Concerns about misinterpretation or misuse (32%) 
• Concerns about attribution/citation credit (31%) 
• Ethical concerns (24%) 
• Insufficient time/resources (19%) 
• Funder/institution does not require sharing (13%) 
• Lack of funding (13%) 
• Not sure where to share (5%) 
• Not sure how to share (3%) 
Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley 
The report is underway, but see: 
http://exchanges.wiley.com/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont/ 
http://scholarlykitchen.sspnet.org/2014/11/11/to-share-or-not-to-share-that-is-the-research-data-question/
How Can Publishers Promote Data Sharing 
Researchers are never so captive as when they are publishing 
But we need to help, not just harass. 
Carrots and Sticks 
And why us? 
– Create Journal Data Release Policies 
– Check Data Release Policy is followed 
– Find Ways to Aid Researchers in Releasing Data 
– Consider ways to support/protect researchers 
who do share ahead of publications 
– Promote Data Citation
Incentives/credit 
Credit where credit is overdue: 
“One option would be to provide researchers who release data to 
public repositories with a means of accreditation.” 
“An ability to search the literature for all online papers that used a 
particular data set would enable appropriate attribution for those 
who share.” 
Nature Biotechnology 27, 579 (2009) 
Prepublication data sharing 
(Toronto International Data Release Workshop) 
“Data producers benefit from creating 
a citable reference, as it can 
later be used to reflect impact of the data sets.” 
Nature 461, 168-170 (2009)
Genomics Data Sharing Policies… 
Bermuda Accords 1996/1997/1998: 
1. Automatic release of sequence assemblies within 24 hours. 
2. Immediate publication of finished annotated sequences. 
3. Aim to make the entire sequence freely available in the public domain for 
both research and development in order to maximise benefits to society. 
Fort Lauderdale Agreement, 2003: 
1. Sequence traces from whole genome shotgun projects are to be 
deposited in a trace archive within one week of production. 
2. Whole genome assemblies are to be deposited in a public nucleotide 
sequence database as soon as possible after the assembled sequence 
has met a set of quality evaluation criteria. 
Toronto International data release workshop, 2009: 
The goal was to reaffirm and refine, where needed, the policies related to 
the early release of genomic data, and to extend, if possible, similar data 
release policies to other types of large biological datasets – whether from 
proteomics, biobanking or metabolite research.
Sharing Data from Large-scale Biological Research Projects: A System of 
Tripartite Responsibility (From the Fort Lauderdale Meeting 2003) 
http://www.genome.gov/pages/research/wellcomereport0303.pdf
Citing Data Isn’t New 
The Physical Sciences have been doing this for a while 
DataCite and DOIs 
Aims to: 
“increase acceptance of research data as 
legitimate, citable contributions to the 
scholarly record”. 
“data generated in the course of research 
are just as valuable to the ongoing 
academic discourse as papers and 
monographs”.
How We Envision Research Publication 
(Communicating Science) 
• Data Publishing Platform: Data Sets in GigaDB 
• Data Analysis Platform: Analyses in GigaGalaxy 
• Open-access journal: Paper in GigaScience
Other Journals are now doing similar 
This is most commonly done in the form of a Data Paper 
rather than a release of data that is citable in itself. 
• A Data Paper is effectively a Description of the Data 
• Other journals that do Data Publishing as a formal 
paper type 
• F1000 Research (launched in 2012) 
• Has Data papers as one of several types of papers 
• Scientific Data (launched in 2014) 
• Solely publishes Data Descriptors 
• There are more…
Making the Data Itself Citable 
We provide a linked database 
The data are then directly linked to the paper, but can also be cited 
separately through a Data DOI 
We can do this because we have a collaboration between BMC 
(which handles the standard paper publication) and BGI (which has 
enormous data storage capacity). 
However, there are many community-available databases, so in 
principle any journal can do this by taking advantage of such 
available resources. 
These include the usual suspects: EBI, NCBI, DDBJ etc. 
Databases that take all data types and provide Data DOIs: Dryad, 
FigShare, etc. 
There are also numerous smaller community databases specific to 
different fields or data types.
For data citation to work, needs: 
• Acceptance by journals. 
• Data+Citation: inclusion in the references. 
• Tracking by citation indexes. 
• Usage of the metrics by the community…
In Principle…
Back to E.coli O104:H4 
• As noted: articles on these early released and 
citable data were published 
• Also, the early releasers were not the first to 
publish 
• Nor were the data cited
This open-source 
analysis work 
was published on 
August 25th
The journal did 
not approve of 
inclusion of the 
data citation. 
Nor was there any 
indication of 
where the 
genome 
information 
could be found
This report was the first to 
be published, and it 
included and used 
information from the 
crowd-sourced release as 
well as the other early 
release. 
Nowhere in the paper is 
there any indication of 
where to obtain these data 
Nor is there an indication 
of where to obtain the 
sequence data they 
generated
This group made 
their O104:H4 
sequence available 
in the NCBI database 
at the time of 
completion, prior 
to publication. 
However, no link to 
the Accession 
Number is easily 
found in the paper.
This report DID include a reference for the data 
(even though they did not use it in their analysis) 
This link… leads to an empty site
Had they used 
the DOI, the 
data, though 
they had 
migrated to a 
different 
database, 
would have 
been found
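Why the DOI survives a migration while a copied URL does not can be shown with a toy model. This is a minimal sketch under stated assumptions: `doi_registry`, `live_pages`, `resolve` and the example.org addresses are all hypothetical stand-ins, not the real DOI resolver or the real database locations.

```python
# Hypothetical stand-in for a DOI resolver: maps a DOI to its current landing page.
doi_registry = {"10.5524/100001": "http://old-database.example.org/ecoli"}
# Hypothetical set of pages that currently exist on the web.
live_pages = {"http://old-database.example.org/ecoli"}


def resolve(doi):
    """Look up where a DOI points right now."""
    return doi_registry[doi]


# A paper that cites the bare URL freezes whatever the DOI pointed to at that time.
cited_url = resolve("10.5524/100001")

# The data migrate to a different database, and the DOI record is updated to match.
live_pages = {"http://new-database.example.org/ecoli"}
doi_registry["10.5524/100001"] = "http://new-database.example.org/ecoli"

print(cited_url in live_pages)                  # is the frozen URL still reachable?
print(resolve("10.5524/100001") in live_pages)  # does the DOI still find the data?
```

The extra lookup step is the whole point: only the registry entry has to change when the data move, so every paper that cited the DOI keeps working.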
For data citation to work, needs: 
• Acceptance by journals. 
• Data+Citation: inclusion in the references. 
• Tracking by citation indexes. 
• Usage of the metrics by the community…
In Practice…
• Data submitted to NCBI databases: 
- Raw data: SRA:SRA046843 
- Assemblies of 3 strains: Genbank:AHAO00000000-AHAQ00000000 
- SNPs: dbSNP:1056306 
- CNVs, InDels and SVs: dbVAR:nstd63 
• Submission to public databases complemented by 
its citable form in GigaDB (doi:10.5524/100012).
In the references…
Is the DOI…
In Practice… 
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
The polar bear data were released prepublication in 2011 
They were used and cited in the following studies before the main paper on the 
sequencing was published 
Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct 
bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424. 
Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting 
theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. 
doi:10.1371/journal.pgen.1003345. 
Morgan, CC et al., Heterogeneous models place the root of the placental mammal 
phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117. 
Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus 
maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from 
Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133. 
Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene 
Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109 
Cell Press Journals had indicated that 
publishing a dataset prior to publication 
could be considered as prior publication
Even though the data had 
been released 2 years earlier 
and cited in other papers, the 
main analysis paper was 
published in Cell 
However, this didn’t include the data citation…
One step forward — two steps back
Removing data citations from the 
references 
One journal informed the authors that non-reviewed material could 
not be cited in the references of the paper 
Another journal stripped the data citation from the references, then 
went a step further and changed the citation in the Data Availability 
section to the URL that the DOI resolved to at that time 
We happened to know about this one, and were able to create a redirect to the 
DOI’d page when the URL broke after we moved our database platform 
Note: Much of this was due to a standard operating procedure in the 
production department 
Lesson: If you decide to include Data Citations, tell your entire team
For data citation to work, needs: 
• Acceptance by journals. 
• Data+Citation: inclusion in the references. 
• Tracking by citation indexes. 
• Usage of the metrics by the community…
Data publication 
in databases is 
now being 
tracked by this 
and other 
tracking 
resources
For data citation to work, needs: 
• Acceptance by journals. 
• Data+Citation: inclusion in the references. 
• Tracking by citation indexes. 
• Usage of the metrics by the community… 
This is a work in progress… as data 
citation in the life sciences is still a 
new endeavor
Data Citation Really is a Major Incentive 
This year, we released the genome sequences from 3000 
Rice strains (13.4 TB of data) 
• These data were also deposited in NIH SRA repository 
• So why did we do it too? 
1. It is linked directly to the Data Paper that provides 
details of data production, quality, and basic analysis 
2. Authors were hesitant to release these data (a HUGE 
community resource) prior to the analysis paper 
publication (which, for 3000 strains… could possibly 
take years…). The opportunity to have these data 
citable (and trackable) encouraged the authors and 
led to their releasing these data and doing so in 
collaboration with GigaScience’s Biocurator 
The 3,000 Rice Genomes Project. (2014) GigaScience 3:7 http://dx.doi.org/10.1186/2047-217X-3-7; 
The 3000 Rice Genomes Project (2014) GigaScience Database. http://dx.doi.org/10.5524/200001
No: your data is not too large to share 
Rice 3K project: 3,000 rice genomes, 13.4TB public data 
IRRI GALAXY
Beyond Data Citation 
Reviewing Data 
Data Release policies include the need to 
help authors 
Data availability without metadata is 
practically useless
Beyond Data Citation 
Reviewing Data 
It’s too hard; we can’t ask our reviewers 
to do that! 
Use Data Reviewers
Example in Neuroscience 
1. Neuroscience Data 
are not typically 
shared 
2. For most papers: Data 
AND Tools are not 
typically made 
available to the 
reviewers 
3. Journal Editors think 
Reviewers will not 
want to review data 
GigaScience 2014, 3:3 doi:10.1186/2047-217X-3-3
Example in Neuroscience 
• Neuroscience Data are not typically shared 
• Author Dr. Stephen Eglen said: “One way of encouraging neuroscientists to 
share their data is to provide some form of academic credit.” 
• We hosted with a DOI: 366 recordings from 12 electrophysiology datasets 
• GigaDB is included in the Thomson Reuters Data Citation Index 
• Data AND Tools are not typically made available to the reviewers 
• We made manuscript, data and tools all available to the reviewers. 
• We make sure to include reviewers who are able to properly assess the data 
itself and rerun the tools 
• To reduce burdens, we sometimes select a reviewer who ONLY looks at the 
data. 
• Journal Editors think Reviewers will not want to review data 
• What Reviewer Dr. Thomas Wachtler said: “The paper by Eglen and 
colleagues is a shining example of openness in that it enables replicating the 
results almost as easily as by pressing a button.” 
• What Reviewer Dr. Christophe Pouzat said: “In addition to making the 
presented research trustworthy, the reproducible research paradigm 
definitely makes the reviewer’s job more fun!”
Beyond Data Citation 
Data Release policies include the need to 
help authors 
Collaborations 
With data repositories 
With other journals
Consider Cross Journal Support 
Competition is good… 
….but sometimes we should collaborate 
for the community good 
• PLOS’s recent data deposition policies have led to 
community concerns about feasibility. 
• We support (and applaud) this… we have an even stricter 
data deposition policy 
• But PLOS ONE received a submission that was a 
comparative study of earthworm morphology and 
anatomy using a 3D non-invasive imaging technique 
called micro-computed tomography (or microCT)… and 
there was no good place to put the data 
• These data are extremely complex: videos and multiple files, with 
several folders of ~10 GB
Consider Cross Journal Support 
• GigaScience and PLOS ONE collaborated. They published 
the main article; we published a Data Note describing the 
data itself and hosted all the data on GigaDB under a 
separate citation. 
• With our Aspera connection, reviewers could download 
even the 10 GB folders in ~1/2 hour 
• Reviewer Dr. Sarah Faulwetter noted the usefulness of 
having these data available, saying: “Instead of having to 
go through the lengthy process of obtaining the physical 
specimen from a museum, I can now download a fairly 
accurate representation from the web.” 
Lenihan et al (2014). GigaScience, 3:6 http://dx.doi.org/10.1186/2047-217X-3-6; Lenihan, et al (2014): GigaScience Database. 
http://dx.doi.org/10.5524/100092; Fernández et al (2014) PLOS ONE 9 (5) e96617 http://dx.doi.org/10.1371/journal.pone.0096617
Beyond Data Citation 
Data availability without metadata is 
practically useless 
Engage/Employ/Interact with Curators
Challenges for the future… 
1. Lack of interoperability/sufficient metadata 
2. Long tail of curation (“Democratization” of “big-data”) 
?
Think about what you do… and what you can do… 
• Promote- rather than inhibit- prepublication data sharing 
• Promote Data Citation in the reference section 
– incentivizes data release 
– Makes it easier for readers to find 
• Promote Data Sharing upon publication 
– Consider your data release policies 
• Form collaborations with repositories to aid authors in depositing 
their work 
– Identify community organizations with metadata standards 
• Make data available for reviewers (author website, community 
repositories, Dryad and similar, your publisher?) 
– at least do a sanity check 
– Use “data reviewers” 
No, this isn’t easy, but do what you can now 
And work toward the rest 
Evolve
It’s Time to Move Beyond 
Dead Trees 
1665 1812 1869
Thanks to: 
Scott Edmunds, Executive Editor 
Nicole Nogoy, Commissioning Editor 
Peter Li, Lead Data Manager 
Chris Hunter, Lead BioCurator 
Rob Davidson, Data Scientist 
Xiao (Jesse) Si Zhe, Database Developer 
Amye Kenall, Journal Development Manager 
Contact us: 
editorial@gigasciencejournal.com 
database@gigasciencejournal.com 
Follow us: 
@GigaScience 
facebook.com/GigaScience 
blogs.openaccesscentral.com/blogs/gigablog 
www.gigasciencejournal.com 
www.gigadb.org

Laurie Goodman at #crossref14: Ways and Needs to Promote Rapid Data Sharing

  • 1.
    Ways and Needsto Promote Rapid Data Sharing Laurie Goodman, PhD Editor-in-Chief GigaScience ORCID ID: 0000-0001-9724-5976
  • 2.
    Scientific Communication ViaPublication • Scholarly articles are merely advertisement of scholarship . The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995 • Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives • Lack of transparency, lack of credit for anything other than “regular” dead tree publication
  • 3.
    A Tale ofTwo Bacteria 1. On May 2, 2011 German Doctors Reported the first case of an E.coli infection, that was accompanied by hemolytic-uremic syndrome 2. On May 21, 2011 the first death occurred from this bacteria (denoted E.coli O104:H4) 3. On June 3, 2014, BGI completed a draft sequence of E.coli O104:H4 from a sample provided by doctors at the University Medical Centre Hamburg-Eppendorf 4. At this point- the leaders at BGI held a discussion about whether to release the sequence data immediately: what were the potential repercussions of doing so The question arose: If the data were released now- would it affect their ability to publish later?
  • 4.
    A Tale ofTwo Bacteria • In one world- the researchers — who were concerned about their ability to publish as this is the way to obtain recognition and obtain grants (which are essential for them to work) — waited. The first publication appeared on July 29th • In another world, the researchers — who decided public health was more important than obtaining a publication — released the data immediately. The first publication appeared on July 29th — but was not from that group who released the data (though information on that data was included. We live in World 2 and what followed was exciting and had broad repercussions.
  • 5.
    Whether the concernabout the ability to publish if data are released early is real or imagined Researchers act on that concern
  • 6.
    Whether the concernabout the ability to publish if data are released early is real or imagined Researchers act on that concern Note: Harmsen’s group DID share- immediately upon sequencing— the O104:H4 outbreak strain data. The data referred to here was the 2001 strain that was believed to be the strain involved in a 2001 outbreak of similar type. This slide is meant to only highlight that concerns about being scooped drive early sharing decisions. Given that the first paper published did use the early available O104:H4 data, it would be expected that these data, had they been shared, would have been used in that paper as well.
  • 7.
    These data wereput on an FTP server under a CCO waiver and also given a DOI to make access ‘permanent’ To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song,Y ; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
  • 10.
    Downstream consequences: 1.Citations (~180) 2. Therapeutics (primers, antimicrobials) 3. Platform Comparisons 4. Example for faster & more open science “Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew it that might take days for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
  • 11.
    1.3 The powerof intelligently open data The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin– producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
  • 12.
    All that aside Can we all agree that releasing the E.coli data ahead of publication was ‘good’ At least from a public health perspective Here are the numbers for the E.coli 2011 Outbreak In total, ~4000 people were infected and 53 died
  • 13.
    If so— thenfrom a Public Health perspective…Consider Deaths Worldwide* Infectious Disease Measles: 122,000 per year Hepatitis C-related liver disease: 350,000-500,000 per year Malaria: 627,000 per year HIV/AIDS: 1.4-1.7 million per year Non-communicable, with genetic predisposition Prostate cancer: 307,000 per year Breast cancer: 522,000 per year Suicide: 800,000 per year Diabetes: 1.5 million per year Cancer: 8.2 million per year Cardiovascular Disease: 17.5 million per year Non-genetic/Non-infectious Pesticide Poisoning: 250,000 per year Malnutrition: 2.8 million children (under 5) per year *World Health Organization Fact Sheets http://www.who.int/en/ Clearly —from a public health perspective— early data sharing in every area, especially prior to publication — is essential.
  • 14.
    Sharing Data is Essential for Many Reasons
  • 15.
    Sharing aids fields…
    Rice v Wheat: consequences of publicly available genome data
    [Figure: rice versus wheat; vertical axis from 0 to 700]
    Every 10 datasets collected contribute to at least 4 papers in the following 3 years.
    Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285. DOI: 10.1038/473285a
  • 16.
    Sharing aids authors… Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
  • 17.
    Lack of Sharing Impacts Reproducibility
    Out of 18 microarray papers, results from 10 could not be reproduced.
    1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
    2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
  • 18.
    Sharing can reduce retractions
    • >15X increase in the last decade
    • Strong correlation of “retraction index” with higher impact factor
    • At the current % increase, by 2045 as many papers will be retracted as published!
    1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
    2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
  • 19.
    Data Sharing Hurdles?
    If only it were easy…
    There are numerous reasons why researchers do not share data, the majority of which are good reasons.
  • 20.
    Wiley Researcher Data Insights Survey
    Our objective was to establish a baseline view of data sharing practices, attitudes, and motivations globally, with participation from researchers in every scholarly field. In March 2014, more than 90,000 researchers around the world were invited to participate in Wiley’s Researcher Data Insights Survey. Participants were researchers who had published at least one journal article in the past year with any publisher. We received an overwhelming 2,886 responses from around the world.
    Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
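    As a back-of-envelope check on the survey figures above (this calculation is not part of the Wiley slide): since "more than 90,000" researchers were invited, dividing the 2,886 responses by 90,000 gives an upper bound on the response rate. The function name is purely illustrative.

```python
def response_rate(responses: int, invited: int) -> float:
    """Return the fraction of invited researchers who responded."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    return responses / invited


# Figures from the slide: 2,886 responses and "more than 90,000" invitations,
# so using exactly 90,000 yields an upper bound on the true rate.
upper_bound = response_rate(2886, 90_000)
print(f"Response rate: at most {upper_bound:.1%}")
```

    So the baseline rests on roughly 3% of those invited, which is worth keeping in mind when reading the percentages reported from the survey.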
  • 21.
    Wiley Researcher Data Insights Survey
    Key Findings
    • Most researchers are sharing their data.
    • Those not sharing have a variety of reasons.
    • Data that’s being shared typically is <10 GB.
    • The most common type of data that is being shared is flat, tabular data (.csv, .txt, .xl)
    • Data is usually saved on hard drives.
    Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
  • 22.
    Wiley Researcher Data Insights Survey
    Why Researchers Do Not Share
    • Intellectual property or confidentiality issues (59%)
    • Concerned research might be “scooped” (39%)
    • Concerns about misinterpretation or misuse (32%)
    • Concerns about attribution/citation credit (31%)
    • Ethical concerns (24%)
    • Insufficient time/resources (19%)
    • Funder/institution does not require sharing (13%)
    • Lack of funding (13%)
    • Not sure where to share (5%)
    • Not sure how to share (3%)
    Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
    A report is underway, but see:
    http://exchanges.wiley.com/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont/
    http://scholarlykitchen.sspnet.org/2014/11/11/to-share-or-not-to-share-that-is-the-research-data-question/
  • 23.
    How Can Publishers Promote Data Sharing
    Researchers are never so captive as when they are publishing.
    But we need to help, not just harass: Carrots and Sticks
    And why us?
    – Create Journal Data Release Policies
    – Check Data Release Policy is followed
    – Find Ways to Aid Researchers in Releasing Data
    – Consider ways to support/protect researchers who do share ahead of publications
    – Promote Data Citation
  • 25.
    Incentives/credit
    Credit where credit is overdue:
    “One option would be to provide researchers who release data to public repositories with a means of accreditation.”
    “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
    Nature Biotechnology 27, 579 (2009)
    Prepublication data sharing (Toronto International Data Release Workshop):
    “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
    Nature 461, 168-170 (2009)
  • 26.
    Genomics Data Sharing Policies…
    Bermuda Accords 1996/1997/1998:
    1. Automatic release of sequence assemblies within 24 hours.
    2. Immediate publication of finished annotated sequences.
    3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
    Fort Lauderdale Agreement, 2003:
    1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production.
    2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.
    Toronto International data release workshop, 2009:
    The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.
  • 27.
    Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility
    (From the Fort Lauderdale Meeting 2003)
    http://www.genome.gov/pages/research/wellcomereport0303.pdf
  • 28.
    Citing Data Isn’t New
    The Physical Sciences have been doing this for a while
    DataCite and DOIs aim to “increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
    “Data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
  • 29.
    How We Envision Research Publication (Communicating Science)
    • Open-access journal: Paper in GigaScience
    • Data Publishing Platform: Data Sets in GigaDB
    • Data Analysis Platform: Analyses in GigaGalaxy
  • 30.
    Other Journals are now doing similar
    This is most commonly done in the form of a Data Paper rather than a release of data that is citable in itself.
    • A Data Paper is effectively a Description of the Data
    • Other journals that do Data Publishing as a formal paper type:
      – F1000 Research (launched in 2012): has Data papers as one of several types of papers
      – Scientific Data (launched in 2014): solely publishes Data Descriptors
      – There are more…
  • 31.
    Making the Data Itself Citable
    We provide a linked database. The data are then directly linked to the paper, but can also be cited separately through a Data DOI.
    We can do this because we have a collaboration between BMC (which handles the standard paper publication) and BGI (which has enormous data storage capacity).
    However: there are many community-available databases, so in principle any journal can do this by taking advantage of such available resources.
    • The usual suspects: EBI, NCBI, DDBJ etc.
    • Databases that take all data types and provide Data DOIs: Dryad, FigShare, etc.
    • Numerous smaller community databases specific to different fields or data types.
  • 32.
    For data citation to work, it needs:
    • Acceptance by journals.
    • Data+Citation: inclusion in the references.
    • Tracking by citation indexes.
    • Usage of the metrics by the community…
  • 34.
  • 35.
    Back to E. coli O104:H4
    • As noted: articles on these early released and citable data were published
    • Also, the early releasers were not the first to publish
    • Nor were the data cited
  • 36.
    This open-source analysis work was published on August 25th
  • 37.
    The journal did not approve of inclusion of the data citation. Nor was there any indication of where the genome information could be found.
  • 39.
    This report was the first to be published, and it included and used information from the crowd-sourced release as well as the other early release.
    Nowhere in the paper is there any indication of where to obtain this data. Nor is there an indication of where to obtain the sequence data they generated.
  • 40.
    This group made their O104:H4 sequence available in the NCBI database at the time of completion, prior to publication. However, no link to the Accession Number is easily found in the paper.
  • 41.
    This report DID include a reference for the data (even though they did not use it in their analysis).
    This link… leads to an empty site
  • 42.
    Had they used the DOI, the data, though they had migrated to a different database, would have been found
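    The point about DOIs can be made concrete. A DOI is resolved through a central resolver (https://doi.org/), and the repository can update where the DOI points when it moves its landing pages, whereas a hard-coded URL simply breaks. Below is a minimal sketch of turning a DOI from a reference list into a resolver link. The helper name `doi_to_url` is hypothetical, and the example uses a GigaDB DOI that appears elsewhere in these slides purely for illustration.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a bare DOI such as '10.5524/100012'."""
    doi = doi.strip()
    # Accept the common 'doi:' prefix used in reference lists.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.5524/100012"))
```

    Citing the identifier rather than the page it currently resolves to leaves the job of keeping the link alive with the repository, not with the journal or the reader.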
  • 43.
    For data citation to work, it needs:
    • Acceptance by journals.
    • Data+Citation: inclusion in the references.
    • Tracking by citation indexes.
    • Usage of the metrics by the community…
  • 44.
  • 45.
    • Data submitted to NCBI databases:
      - Raw data: SRA:SRA046843
      - Assemblies of 3 strains: Genbank:AHAO00000000-AHAQ00000000
      - SNPs: dbSNP:1056306
      - CNVs, InDels, SV: dbVAR:nstd63
    • Submission to public databases complemented by its citable form in GigaDB (doi:10.5524/100012).
  • 47.
  • 48.
  • 49.
  • 51.
  • 52.
    The polar bear DATA were released prepublication in 2011.
    They were used and cited in the following studies before the main paper on the sequencing was published:
    • Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.
    • Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345.
    • Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.
    • Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.
    • Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109
    http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
  • 53.
    Cell Press Journals had indicated that publishing a dataset prior to publication could be considered prior publication
  • 54.
    Even though the data had been released 2 years earlier and cited in other papers, the main analysis paper was published in Cell.
    However, this didn’t include the data citation…
  • 55.
    One step forward— two steps back
  • 56.
    Removing data citations from the references
    • One journal informed the authors that non-reviewed material could not be cited in the references of the paper
    • Another journal stripped the data citation from the references, then went an extra step and changed the citation in the Data Availability section to the URL that the DOI resolved to at that time
      – We happened to know about this one, and were able to create a forward to the DOI’d page when the URL broke after we moved our database platform
    Note: Much of this was due to a standard operating procedure in the production department
    Lesson: If you decide to include Data Citations, tell your entire team
  • 57.
    For data citation to work, it needs:
    • Acceptance by journals.
    • Data+Citation: inclusion in the references.
    • Tracking by citation indexes.
    • Usage of the metrics by the community…
  • 58.
    Data publication in databases is now being tracked by this and other tracking resources
  • 59.
    For data citation to work, it needs:
    • Acceptance by journals.
    • Data+Citation: inclusion in the references.
    • Tracking by citation indexes.
    • Usage of the metrics by the community…
    This is a work in progress… as data citation in the life sciences is still a new endeavor
  • 60.
    Data Citation Really is a Major Incentive
    This year, we released the genome sequences from 3000 Rice strains (13.4 TB of data)
    • These data were also deposited in the NIH SRA repository
    • So why did we do it too?
      1. It is linked directly to the Data Paper that provides details of data production, quality, and basic analysis
      2. Authors were hesitant to release these data (a HUGE community resource) prior to the analysis paper publication (which, for 3000 strains… could possibly take years…). The opportunity to have these data citable (and trackable) encouraged the authors and led to their releasing these data, in collaboration with GigaScience’s Biocurator
    The 3,000 Rice Genomes Project. (2014) GigaScience 3:7 http://dx.doi.org/10.1186/2047-217X-3-7
    The 3000 Rice Genomes Project (2014) GigaScience Database. http://dx.doi.org/10.5524/200001
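    To show what "citable (and trackable)" means in practice, here is a small sketch that assembles a dataset reference from its parts, in the same shape as the rice dataset reference on this slide. The class and field names are hypothetical, and the layout simply mirrors the slide's example rather than any journal's house style.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    creator: str
    year: int
    publisher: str
    doi: str
    title: Optional[str] = None  # the slide's example gives no separate title

    def format(self) -> str:
        """Render as 'Creator (Year) [Title.] Publisher. DOI-URL'."""
        parts = [f"{self.creator} ({self.year})"]
        if self.title:
            parts.append(f"{self.title}.")
        parts.append(f"{self.publisher}.")
        parts.append(f"http://dx.doi.org/{self.doi}")
        return " ".join(parts)


# Values taken from the dataset reference on this slide.
rice = DataCitation(
    creator="The 3000 Rice Genomes Project",
    year=2014,
    publisher="GigaScience Database",
    doi="10.5524/200001",
)
print(rice.format())
```

    Because the dataset has its own creator, year, publisher and DOI, it can sit in a reference list alongside the Data Paper and be counted by citation indexes in its own right.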
  • 61.
    No: your data is not too large to share
    Rice 3K project: 3,000 rice genomes, 13.4 TB public data
    IRRI GALAXY
  • 62.
    Beyond Data Citation
    • Reviewing Data
    • Data Release policies include the need to help authors
    • Data availability without metadata is practically useless
  • 63.
    Beyond Data Citation
    Reviewing Data
    It’s too hard: we can’t ask our reviewers to do that!
    Use Data Reviewers
  • 64.
    Example in Neuroscience
    1. Neuroscience Data are not typically shared
    2. For most papers: Data AND Tools are not typically made available to the reviewers
    3. Journal Editors think Reviewers will not want to review data
    GigaScience 2014, 3:3 doi:10.1186/2047-217X-3-3
  • 65.
    Example in Neuroscience
    • Neuroscience Data are not typically shared
      – Author Dr. Stephen Eglen said: “One way of encouraging neuroscientists to share their data is to provide some form of academic credit.”
      – We hosted with a DOI: 366 recordings from 12 electrophysiology datasets
      – GigaDB is included in Thomson Reuters Data Citation Index
    • Data AND Tools are not typically made available to the reviewers
      – We made manuscript, data and tools all available to the reviewers.
      – We make sure to include reviewers who are able to properly assess the data itself and rerun the tools
      – To reduce burdens, we sometimes select a reviewer who ONLY looks at the data.
    • Journal Editors think Reviewers will not want to review data
      – What Reviewer Dr. Thomas Wachtler said: “The paper by Eglen and colleagues is a shining example of openness in that it enables replicating the results almost as easily as by pressing a button.”
      – What Reviewer Dr. Christophe Pouzat said: “In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewers job more fun!”
  • 66.
    Beyond Data Citation
    Data Release policies include the need to help authors
    Collaborations
    • With data repositories
    • With other journals
  • 67.
    Consider Cross Journal Support
    Competition is good… but sometimes we should collaborate for the community good
    • PLoS’s recent data deposition policies have led to community concerns about feasibility.
    • We support (and applaud) this… we have an even stricter data deposition policy
    • But PLoS ONE received a submission that was a comparative study of earthworm morphology and anatomy using a 3D non-invasive imaging technique called micro-computed tomography (or microCT)… and there is no good place to put this
    • These data are extremely complex: videos and multiple files, with several folders of ~10 GB
  • 68.
    Consider Cross Journal Support
    • GigaScience and PLOS ONE collaborated. They published the main article; we published a Data Note describing the data itself and hosted all the data on GigaDB under separate citation.
    • With our Aspera Connection, reviewers could download even the 10 GB folders in ~1/2 hour
    • Reviewer Dr. Sarah Faulwetter noted the usefulness of having these data available, saying: “Instead of having to go through the lengthy process of obtaining the physical specimen from a museum, I can now download a fairly accurate representation from the web.”
    Lenihan et al (2014). GigaScience, 3:6 http://dx.doi.org/10.1186/2047-217X-3-6
    Lenihan, et al (2014): GigaScience Database. http://dx.doi.org/10.5524/100092
    Fernández et al (2014) PLOS ONE 9 (5) e96617 http://dx.doi.org/10.1371/journal.pone.0096617
  • 69.
    Beyond Data Citation
    Data availability without metadata is practically useless
    Engage/Employ/Interact with Curators
  • 70.
    Challenges for the future…
    1. Lack of interoperability/sufficient metadata
    2. Long tail of curation (“Democratization” of “big-data”)
  • 71.
    Think about what you do… and what you can do…
    • Promote, rather than inhibit, prepublication data sharing
    • Promote Data Citation in the reference section
      – Incentivizes data release
      – Makes it easier for readers to find
    • Promote Data Sharing upon publication
      – Consider your data release policies
    • Form collaborations with repositories to aid authors in depositing their work
      – Identify community organizations with metadata standards
    • Make data available for reviewers (author website, community repositories, Dryad and similar, your publisher?)
      – At least do a sanity check
      – Use “data reviewers”
    No, this isn’t easy, but do what you can now
    And work toward the rest
    Evolve
  • 72.
    It’s Time to Move Beyond Dead Trees
    1665 1812 1869
  • 73.
    Thanks to:
    • Scott Edmunds, Executive Editor
    • Nicole Nogoy, Commissioning Editor
    • Peter Li, Lead Data Manager
    • Chris Hunter, Lead BioCurator
    • Rob Davidson, Data Scientist
    • Xiao (Jesse) Si Zhe, Database Developer
    • Amye Kenall, Journal Development Manager
    Contact us: editorial@gigasciencejournal.com database@gigasciencejournal.com
    Follow us: @GigaScience facebook.com/GigaScience blogs.openaccesscentral.com/blogs/gigablog
    www.gigasciencejournal.com www.gigadb.org

Editor's Notes

  • #19 Isn’t hyperbole fun?
  • #37 And a paper by the group was published in a high impact journal even though the data were released early in a citable format
  • #46–#49 Raw data have been submitted to the SRA, the assembly submitted to GenBank (no number), and SV data to dbVar (it’s the first plant data they’ve received). GigaDB complements the traditional public databases by having all these “extra” data types: it’s all in one place, and it’s citable.
  • #71 (A) Cumulative base pairs in INSDC over time, excluding the Trace Archive (raw data from capillary sequencing platforms). (B) Base pairs in INSDC over time since 1980, broken down into selected data components. Cumulative data volume in base pairs broken down into assembled sequence (whole genome shotgun methods and others) and raw next-generation-sequence data.