
2014 CrossRef Annual Meeting Keynote: Ways and Needs to Promote Rapid Data Sharing

Keynote address: "Ways and Needs to Promote Rapid Data Sharing" by Laurie Goodman of GigaScience.

Data is the base upon which all scientific discoveries are built, and data availability speeds the rate at which discoveries are made. Given that the overall goal for research is to improve human health and our environment, waiting to release data until after the first publication (sometimes taking years) is unacceptable. There are myriad issues that impede researchers from openly, and most importantly, rapidly sharing data, including lack of incentives: no credit, limited funding benefits, and little impact on career advancement; and cultural issues: the fear of being scooped. However, scientific publishers —the communicators of science and a key mechanism by which a researcher’s productivity is measured— can, and should, play a central role in promoting data sharing. Data citation and publication are just some of the ways we can support and encourage researchers who share data. Here, I will provide examples to help make clear the need for publishers to play an active role in this process and provide potential ways to facilitate our ability to promote open and rapid data sharing. This is not easy; but it is essential.

Published in: Technology

  1. 1. Ways and Needs to Promote Rapid Data Sharing Laurie Goodman, PhD Editor-in-Chief GigaScience ORCID ID: 0000-0001-9724-5976
  2. 2. Scientific Communication Via Publication • Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods which support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995) • Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives • Lack of transparency, lack of credit for anything other than “regular” dead tree publication
  3. 3. A Tale of Two Bacteria 1. On May 2, 2011, German doctors reported the first case of an E. coli infection accompanied by hemolytic-uremic syndrome 2. On May 21, 2011, the first death occurred from this bacterium (denoted E. coli O104:H4) 3. On June 3, 2011, BGI completed a draft sequence of E. coli O104:H4 from a sample provided by doctors at the University Medical Centre Hamburg-Eppendorf 4. At this point, the leaders at BGI discussed whether to release the sequence data immediately and what the potential repercussions of doing so would be. The question arose: if the data were released now, would it affect their ability to publish later?
  4. 4. A Tale of Two Bacteria • In one world, the researchers (who were concerned about their ability to publish, since publication is how they obtain recognition and the grants that are essential for their work) waited. The first publication appeared on July 29th • In another world, the researchers (who decided public health was more important than obtaining a publication) released the data immediately. The first publication appeared on July 29th, but it was not from the group that released the data (though information on that data was included).
  5. 5. Whether the concern about the ability to publish if data are released early is real or imagined, researchers act on that concern
  7. 7. These data were put on an FTP server under a CC0 waiver and also given a DOI to make access ‘permanent’. To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
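The citation requested on the slide above follows a simple pattern: creators, year, title, publisher, DOI. The following is a minimal sketch of how a repository or journal might assemble such a string from metadata. The class and field names are hypothetical, the author list is abbreviated for brevity, and the link uses the current https://doi.org/ resolver form rather than the older dx.doi.org prefix shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal citation metadata for a dataset (hypothetical field names)."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str


def doi_url(doi: str) -> str:
    """Return a resolver link for a DOI, tolerating a leading 'doi:' prefix."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[4:].strip()
    return f"https://doi.org/{bare}"


def format_citation(rec: DatasetRecord) -> str:
    """Format as: Creators (Year) Title. Publisher. doi:... URL"""
    return (f"{rec.creators} ({rec.year}) {rec.title}. "
            f"{rec.publisher}. doi:{rec.doi} {doi_url(rec.doi)}")


ecoli = DatasetRecord(
    creators="Li, D; et al.",  # author list abbreviated for this sketch only
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(format_citation(ecoli))
```

Keeping the DOI as a separate field, rather than storing only a landing-page URL, is what lets the citation keep working if the hosting platform later moves.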
  8. 8. Downstream consequences: 1. Citations (~180) 2. Therapeutics (primers, antimicrobials) 3. Platform Comparisons 4. Example for faster & more open science “Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
  9. 9. 1.3 The power of intelligently open data The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin– producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
  10. 10. All that aside: can we all agree that releasing the E. coli data ahead of publication was ‘good’, at least from a public health perspective? Here are the numbers for the 2011 E. coli outbreak: in total, ~4000 people were infected and 53 died
  11. 11. From a Public Health perspective… Deaths Worldwide* Infectious disease: measles, 122,000 per year; hepatitis C-related liver disease, 350,000-500,000 per year; malaria, 627,000 per year; HIV/AIDS, 1.4-1.7 million per year. Non-communicable, with genetic predisposition: prostate cancer, 307,000 per year; breast cancer, 522,000 per year; suicide, 800,000 per year; diabetes, 1.5 million per year; cancer, 8.2 million per year; cardiovascular disease, 17.5 million per year. Non-genetic/non-infectious: pesticide poisoning, 250,000 per year; malnutrition, 2.8 million children (under 5) per year. *World Health Organization Fact Sheets http://www.who.int/en/
  12. 12. Sharing Data is Essential for Many Reasons
  13. 13. Sharing aids fields… Rice v Wheat: consequences of publicly available genome data [Figure: bar chart comparing rice and wheat, vertical axis 0 to 700] Every 10 datasets collected contribute to at least 4 papers in the following 3 years. Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285. DOI: 10.1038/473285a
  14. 14. Sharing aids authors… Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
  15. 15. Lack of Sharing Impacts Reproducibility Out of 18 microarray papers, results from 10 could not be reproduced 1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
  16. 16. Sharing can reduce retractions: >15X increase in last decade. Strong correlation of “retraction index” with higher impact factor. At current % increase, by 2045 as many papers published as retracted! 1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
  17. 17. Data Sharing Hurdles? If only it were easy… There are numerous reasons why researchers do not share data, the majority of which are good reasons
  18. 18. Wiley Researcher Data Insights Survey Our objective was to establish a baseline view of data sharing practices, attitudes, and motivations globally, with participation from researchers in every scholarly field. In March 2014, more than 90,000 researchers around the world were invited to participate in Wiley’s Researcher Data Insights Survey. Participants were researchers who had published at least one journal article in the past year with any publisher. We received an overwhelming 2,886 responses from around the world. Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
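For context on the survey figures above, the implied response rate can be worked out directly. This is a back-of-the-envelope sketch only: the slide says "more than 90,000" were invited, so 90,000 is treated as a lower bound on invitations and the result as an upper bound on the rate.

```python
invited = 90_000    # "more than 90,000" on the slide, so a lower bound
responses = 2_886   # responses reported on the slide

rate = responses / invited
print(f"Response rate: at most about {rate:.1%}")
```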
  19. 19. Wiley Researcher Data Insights Survey Key Findings • Most researchers are sharing their data. • Those not sharing have a variety of reasons. • Data that’s being shared typically is <10 GB. • The most common type of data that is being shared is flat, tabular data (.csv, .txt, .xl) • Data is usually saved on hard drives. Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley
  20. 20. Wiley Researcher Data Insights Survey Why Researchers Do Not Share • Intellectual property or confidentiality issues (59%) • Concerned research might be “scooped” (39%) • Concerns about misinterpretation or misuse (32%) • Concerns about attribution/citation credit (31%) • Ethical concerns (24%) • Insufficient time/resources (19%) • Funder/institution does not require sharing (13%) • Lack of funding (13%) • Not sure where to share (5%) • Not sure how to share (3%) Slide from Catherine Giffi, Director, Strategic Market Analysis, Global Research, Wiley See also: http://exchanges.wiley.com/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont/ http://scholarlykitchen.sspnet.org/2014/11/11/to-share-or-not-to-share-that-is-the-research-data-question/
  21. 21. How Can Publishers Promote Data Sharing? Researchers are never so captive as when they are publishing. But we need to help, not just harass. Carrots and Sticks. And why us? – Create Journal Data Release Policies – Check Data Release Policy is followed – Find Ways to Aid Researchers in Releasing Data – Consider ways to support/protect researchers who do share ahead of publications – Promote Data Citation
  23. 23. Incentives/credit Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009) Prepublication data sharing (Toronto International Data Release Workshop) “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009)
  24. 24. Genomics Data Sharing Policies… Bermuda Accords 1996/1997/1998: 1. Automatic release of sequence assemblies within 24 hours. 2. Immediate publication of finished annotated sequences. 3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society. Fort Lauderdale Agreement, 2003: 1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production. 2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria. Toronto International data release workshop, 2009: The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.
  25. 25. Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility (From the Fort Lauderdale Meeting 2003) http://www.genome.gov/pages/research/wellcomereport0303.pdf
  26. 26. Citing Data Isn’t New The Physical Sciences have been doing this for a while DataCite and DOIs “increase acceptance of research data as legitimate, citable contributions to the scholarly record”. Aims to: “data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
  27. 27. How We Envision Research Publication (Communicating Science) Open-access journal Data Publishing Platform Data Sets in GigaDB Analyses in GigaGalaxy Paper in GigaScience Data Analysis Platform
  28. 28. Other journals are now doing something similar. This is most commonly done in the form of a Data Paper rather than a release of data that is citable in itself. • A Data Paper is effectively a description of the data • Other journals that do Data Publishing as a formal paper type • F1000 Research (launched in 2012) • Has Data papers as one of several types of papers • Scientific Data (launched in 2014) • Solely publishes Data Descriptors • There are more…
  29. 29. Making the Data Itself Citable. We provide a linked database. The data are then directly linked to the paper, but can also be cited separately through a Data DOI. We can do this because we have a collaboration between BMC (which handles the standard paper publication) and BGI (which has enormous data storage capacity). However, there are many community-available databases, so in principle any journal can do this by taking advantage of such available resources. These include the usual suspects: EBI, NCBI, DDBJ etc. Databases that take all data types and provide Data DOIs: Dryad, FigShare, etc. There are also numerous smaller community databases specific to different fields or data types.
  30. 30. For data citation to work, needs: • Acceptance by journals. • Data+Citation: inclusion in the references. • Tracking by citation indexes. • Usage of the metrics by the community…
  32. 32. In Principle…
  33. 33. Back to E. coli O104:H4 • As noted, articles on these early released and citable data were published • Also, the early releasers were not the first to publish • Nor were the data cited
  34. 34. This open-source analysis work was published on August 25th
  35. 35. The journal did not approve of inclusion of the data citation. Nor was there any indication of where the genome information could be found
  36. 36. This report was the first to be published, and it included and used information from the crowd-sourced release as well as the other early release. Nowhere in the paper is there any indication of where to obtain this data. Nor is there an indication of where to obtain the sequence data they generated
  37. 37. This group made their O104:H4 sequence available in the NCBI database at the time of completion, prior to publication, though no link to the Accession Number is easily found in the paper.
  38. 38. This report DID include a reference for the data (even though they did not use it in their analysis)
  39. 39. For data citation to work, needs: • Acceptance by journals. • Data+Citation: inclusion in the references. • Tracking by citation indexes. • Usage of the metrics by the community…
  40. 40. In Practice…
  41. 41. • Data submitted to NCBI databases: raw data (SRA: SRA046843); assemblies of 3 strains (GenBank: AHAO00000000-AHAQ00000000); SNPs (dbSNP: 1056306); CNVs, InDels and SVs (dbVar: nstd63) • Submission to public databases complemented by its citable form in GigaDB (doi:10.5524/100012).
  42. 42. In the references…
  43. 43. Is the DOI…
  44. 44. In Practice…
  45. 45. In Practice… http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
  46. 46. The polar bear data were released prepublication in 2011. They were used and cited in the following studies before the main paper on the sequencing was published: Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424. Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345. Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117. Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133. Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109
  47. 47. Cell Press Journals
  48. 48. However, this didn’t include the citation…
  49. 49. One step forward — two steps back
  50. 50. Removing data citations from the references: One journal informed the authors that non-reviewed material could not be cited in the references of the paper. Another journal stripped the data citation from the references and went an extra step, changing the citation in the Data Availability section to the URL that the DOI resolved to at that time. We happened to know about this one, and were able to create a forward to the DOI’d page when the URL broke after we moved our database platform. Note: Much of this was due to a standard operating procedure in the production department. Lesson: If you decide to include Data Citations, tell your entire team
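One way to guard against the production problem described above is an automated check that every DOI in the submitted reference list still appears in the typeset one. The sketch below is illustrative only: the function names are hypothetical, the DOI pattern is deliberately loose rather than a full validator, and the two reference lists are made-up samples built from citations that appear elsewhere in this talk.

```python
import re

# Loose DOI pattern for illustration; it is not a full validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def extract_dois(text: str) -> set[str]:
    """Collect DOIs found in free text, lower-cased, trailing punctuation removed."""
    return {m.rstrip(".,;)").lower() for m in DOI_PATTERN.findall(text)}


def dropped_dois(submitted_refs: str, typeset_refs: str) -> set[str]:
    """Return DOIs present in the submitted references but missing after typesetting."""
    return extract_dois(submitted_refs) - extract_dois(typeset_refs)


# Sample input: the data citation (second entry) was removed during production.
submitted = (
    "1. Piwowar HA, et al. (2011) Nature 473:285. doi:10.1038/473285a\n"
    "2. Li D, et al. (2011) Genomic data from Escherichia coli O104:H4 "
    "isolate TY-2482. BGI Shenzhen. doi:10.5524/100001\n"
)
typeset = "1. Piwowar HA, et al. (2011) Nature 473:285. doi:10.1038/473285a\n"

missing = dropped_dois(submitted, typeset)
if missing:
    print("Data citations lost in production:", sorted(missing))
```

A check like this would also flag the second case on the slide, where a DOI was replaced by its landing-page URL, because the DOI string itself would no longer appear in the typeset text.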
  51. 51. For data citation to work, needs: • Acceptance by journals. • Data+Citation: inclusion in the references. • Tracking by citation indexes. • Usage of the metrics by the community…
  52. 52. For data citation to work, needs: • Acceptance by journals. • Data+Citation: inclusion in the references. • Tracking by citation indexes. • Usage of the metrics by the community… This is a work in progress…
  53. 53. Data Citation Really is a Major Incentive. On Wednesday this week, we released the genome sequences of 3000 rice strains (13.4 TB of data) • These data were also deposited in the NIH SRA repository • So why did we do it too? 1. It is linked directly to the Data Paper that provides details of data production, quality, and basic analysis 2. Authors were hesitant to release these data (a HUGE community resource) prior to the analysis paper publication (which, for 3000 strains… would take years…). The opportunity to have these data citable (and trackable) encouraged the authors and led to their releasing these data and doing so in collaboration with GigaScience’s Biocurator. The 3,000 Rice Genomes Project. (2014) GigaScience 3:7 http://dx.doi.org/10.1186/2047-217X-3-7; The 3000 Rice Genomes Project (2014) GigaScience Database. http://dx.doi.org/10.5524/200001
  54. 54. No: your data is not too large to share Rice 3K project: 3,000 rice genomes, 13.4TB public data IRRI GALAXY
  55. 55. Beyond Data Citation Reviewing Data Data Release policies include the need to help authors Data availability without metadata is practically useless
  56. 56. Beyond Data Citation Reviewing Data It’s too hard- we can’t ask our reviewers to do that! Use Data Reviewers
  57. 57. Example in Neuroscience 1. Neuroscience Data are not typically shared 2. For most papers: Data AND Tools are not typically made available to the reviewers 3. Journal Editors think Reviewers will not want to review data GigaScience 2014, 3:3 doi:10.1186/2047-217X-3-3
  58. 58. Example in Neuroscience • Neuroscience Data are not typically shared • Author Dr. Stephen Eglen said: “One way of encouraging neuroscientists to share their data is to provide some form of academic credit.” • We hosted with a DOI: 366 recordings from 12 electrophysiology datasets • GigaDB is included in Thomson Reuters Data Citation Index • Data AND Tools are not typically made available to the reviewers • We made manuscript, data and tools all available to the reviewers. • We make sure to include reviewers who are able to properly assess the data itself and rerun the tools • To reduce burdens, we sometimes select a reviewer who ONLY looks at the data. • Journal Editors think Reviewers will not want to review data • What Reviewer Dr. Thomas Wachtler said: “The paper by Eglen and colleagues is a shining example of openness in that it enables replicating the results almost as easily as by pressing a button.” • What Reviewer Dr. Christophe Pouzat said: “In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewer's job more fun!”
  59. 59. Beyond Data Citation Data Release policies include the need to help authors Collaborations With data repositories With other journals
  60. 60. Consider Cross Journal Support. Competition is good… but sometimes we should collaborate for the community good • PLOS's recent data deposition policies have led to community concerns about feasibility. • We support (and applaud) this… we have an even stricter data deposition policy • But PLOS ONE received a submission that was a comparative study of earthworm morphology and anatomy using a 3D non-invasive imaging technique called micro-computed tomography (or microCT)… and there is no good place to put this • These data are extremely complex: videos and multiple files, with several folders of ~10 GB
  61. 61. Consider Cross Journal Support • GigaScience and PLOS ONE collaborated. They published the main article; we published a Data Note describing the data itself and hosted all the data on GigaDB under separate citation. • With our Aspera connection, reviewers could download even the 10 GB folders in ~1/2 hour • Reviewer Dr. Sarah Faulwetter noted the usefulness of having these data available, saying: “Instead of having to go through the lengthy process of obtaining the physical specimen from a museum, I can now download a fairly accurate representation from the web.” Lenihan et al (2014). GigaScience, 3:6 http://dx.doi.org/10.1186/2047-217X-3-6; Lenihan, et al (2014): GigaScience Database. http://dx.doi.org/10.5524/100092; Fernández et al (2014) PLOS ONE 9 (5) e96617 http://dx.doi.org/10.1371/journal.pone.0096617
  62. 62. Beyond Data Citation Data availability without metadata is practically useless Engage/Employ/Interact with Curators
  63. 63. Challenges for the future… 1. Lack of interoperability/sufficient metadata 2. Long tail of curation (“Democratization” of “big-data”) ?
  64. 64. Think about what you do… and what you can do… • Promote, rather than inhibit, prepublication data sharing • Promote Data Citation in the reference section – Incentivizes data release – Makes it easier for readers to find • Promote Data Sharing upon publication – Consider your data release policies • Form collaborations with repositories to aid authors in depositing their work – Identify community organizations with metadata standards • Make data available for reviewers (author website, community repositories, Dryad and similar, your publisher?) – At least do a sanity check – Use “data reviewers” No, this isn’t easy, but do what you can now and work toward the rest. Evolve
  65. 65. It’s Time to Move Beyond Dead Trees 1665 1812 1869
  66. 66. Thanks to: Scott Edmunds, Executive Editor Nicole Nogoy, Commissioning Editor Peter Li, Lead Data Manager Chris Hunter, Lead BioCurator Rob Davidson, Data Scientist Xiao (Jesse) Si Zhe, Database Developer Amye Kenall, Journal Development Manager Contact us: editorial@gigasciencejournal.com database@gigasciencejournal.com Follow us: @GigaScience facebook.com/GigaScience blogs.openaccesscentral.com/blogs/gigablog www.gigasciencejournal.com www.gigadb.org
