Rice v Wheat: consequences of publicly available genome data.
  [Figure: chart comparing rice and wheat; axis scale 0 to 700]
Sharing aids everyone…
  Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
  Every 10 datasets collected contribute to at least 4 papers in the following 3 years. Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a
Are there now too many hurdles?
  Technical: volumes too large; too heterogeneous; no home for many data types; too time consuming
  Economic: too expensive; no long-term funding
  Cultural: inertia?; no incentives to share; unaware of how
Incentives/credit
  Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009)
  Prepublication data sharing (Toronto International Data Release Workshop): “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009)
Datacitation: Datacite and DOIs
  Digital Object Identifiers (DOIs) offer a solution
  Most widely used identifier for scientific articles
  Researchers, authors, publishers know how to use them
  Put datasets on the same playing field as articles
  Dataset: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
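A dataset DOI such as the PANGAEA one on this slide can be handled by the same tooling as an article DOI. The sketch below is illustrative only: it normalizes a DOI written in the forms seen in these slides (`doi:` prefix or resolver URL) and builds a resolver link. The function names, the regular expression, and the choice of `https://doi.org/` as resolver prefix are assumptions made for the example, not part of the talk.

```python
import re

# Assumption: "https://doi.org/" is used as the resolver prefix; the slides
# themselves cite the "http://dx.doi.org/" form.
RESOLVER = "https://doi.org/"

# Simplified shape check (an assumption, not the full DOI syntax):
# "10.<registrant digits>/<non-empty suffix without whitespace>".
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes commonly written in front of a bare DOI.
KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(raw: str) -> str:
    """Strip a leading 'doi:' or resolver URL and return the bare DOI."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def doi_url(raw: str) -> str:
    """Return a resolver URL for a DOI given in any supported form."""
    return RESOLVER + normalize_doi(raw)


if __name__ == "__main__":
    print(doi_url("doi:10.1594/PANGAEA.587840"))
```

Because the dataset and article identifiers share one syntax, the same two functions work for `doi:10.1371/journal.pone.0000308` and for `doi:10.1594/PANGAEA.587840`.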
Datacitation: Datacite and DOIs
  >1 million DOIs since Dec 2009
  Central metadata repository to link with WoS/ISI: finally can track and credit use!
Now taking submissions…
  Large-Scale Data Journal/Database
  In conjunction with:
  Editor-in-Chief: Laurie Goodman, PhD
  Editor: Scott Edmunds, PhD
  Assistant Editor: Alexandra Basford, PhD
  www.gigasciencejournal.com
Editorial Board: International
  Stephan Beck, UK
  Alvis Brazma, UK
  Ann-Shyn Chiang, Taiwan
  Richard Durbin, UK
  Paul Flicek, UK
  Robert Hanner, Canada
  Yoshihide Hayashizaki, Japan
  Henning Hermjakob, UK
  Wolfgang Huber, Germany
  Gary King, USA
  Tin-Lap Lee, Hong Kong
  Donald Moerman, Canada
  Karen Nelson, USA
  Francis Ouellette, Canada
  Stephen O'Brien, USA
  Hanchuan Peng, USA
  Russell Poldrack, USA
  Ming Qi, China/USA
  Susanna-Assunta Sansone, UK
  Michael Schatz, USA
  David Schwartz, USA
  Fritz Sommer, USA
  Lincoln Stein, Canada
  Sumio Sugano, Japan
  Thomas Wachtler, Germany
  Jun Wang, China
  Alistair Young, New Zealand
  Zang Yufeng, China
  Marie Zins, France
Editorial Board: International
  Stephan Beck, Epigenomics
  Alvis Brazma, Transcriptomics
  Ann-Shyn Chiang, Neuroscience
  Richard Durbin, Genetics/Genomics
  Paul Flicek, Genomics
  Robert Hanner, DNA Barcoding/Ecology
  Yoshihide Hayashizaki, Genomics
  Henning Hermjakob, Proteomics
  Wolfgang Huber, Functional Genomics
  Gary King, Medicine
  Tin-Lap Lee, Genomics
  Donald Moerman, Functional Genomics
  Karen Nelson, Metagenomics
  Francis Ouellette, Genomics
  Stephen O'Brien, Genomics
  Hanchuan Peng, Imaging/Neuro
  Russell Poldrack, Neuroscience
  Ming Qi, Genetics
  Susanna-Assunta Sansone, Standards
  Michael Schatz, Cloud Computing
  David Schwartz, Optical Mapping
  Fritz Sommer, Neuroscience
  Lincoln Stein, Cloud Computing
  Sumio Sugano, Genomics
  Thomas Wachtler, Neuroscience
  Jun Wang, Genomics
  Alistair Young, Medical Imaging
  Zang Yufeng, Neuroscience
  Marie Zins, Medicine
Criteria and Focus of Journal/Database
  Reproducibility/Reuse
  Utility/Usability
  Standards/Searchability/Scale/Sharing
  Data publishing/DOI
Use of Data = Importance (subjective?) + Usability (easier to assess)
Reproducibility/Reuse
  BGI Cloud Computing resources for handling and analyzing large-scale data.
  Integrated tools to promote more widespread access, viewing, and analysis of data.
  Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).
Special Series/Hub for cloud-based tools
  Technical notes: test tools in the BGI-Cloud.
  Tools + Test Data (BGI or user) in one place.
  Aids reproducibility.
  Aids reviewers (free).
  Aids authors: visibility (PubMed, etc.), hosting (included/free offers).
  Contact us: firstname.lastname@example.org
Standards/Searchability/Sharing
  ISA-Tab compatibility to aid and promote best practice in metadata reporting.
  All supporting data must be publicly available.
  Ask for MIBBI compliance and use of reporting checklists.
  Part of the Biosharing network.
Data publishing/DOI
  New journal format combines standard manuscript publication with an extensive database to host all associated data.
  Data hosting will follow standard funding agency and community guidelines.
  DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.
The era of the data consumer?
  Free access to data – but analysis hubs/nodes will form around it?
GDSAP: Genomic Data Submission and Analytical platform
  Big data from the “Sequencing Farm”: Data, Data, Data…
  Data Modeling
  Pipeline design
  Validation
  Commercial applications
  “Apps”
  Tin-Lap Lee, CUHK
BGI Datasets Get DOI®s
  doi:10.5524/100004
  Invertebrate
    Ant
      - Florida carpenter ant
      - Jerdon’s jumping ant
      - Leaf-cutter ant
    Roundworm
    Silkworm
  Vertebrates
    Giant panda
    Macaque
      - Chinese rhesus
      - Crab-eating
    Naked mole rat
    Penguin
      - Emperor penguin
      - Adelie penguin
    Pigeon, domestic
    Polar bear
    Sheep
    Tibetan antelope
  Plants
    Chinese cabbage
    Cucumber
    Foxtail millet
    Pigeonpea
    Potato
    Sorghum
  Human
    Asian individual (YH)
      - DNA Methylome
      - Genome Assembly
      - Transcriptome
    Ancient DNA (coming soon)
      - Saqqaq Eskimo
      - Aboriginal Australian
  Microbe
    E. coli O104:H4 TY-2482
  Cell-Line
    Chinese Hamster Ovary
BGI Datasets Get DOI®s
  Many unpublished…
Data also submitted to NCBI (including SV data to dbVar)
  Complemented by citable form, and data-types including: assemblies of 3 strains, raw data, SNPs, InDels, CNVs, SV
Our first DOI:
  To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
  Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. http://dx.doi.org/10.5524/100001
  To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
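The requested citation follows a simple pattern: authors, year, title, publisher, DOI link. As a rough sketch of how such a string could be assembled from a metadata record, the example below uses a hypothetical `DatasetRecord` class and `format_citation` function (both invented for illustration, not part of any GigaScience or DataCite tooling) and a deliberately shortened author list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata needed to cite a dataset."""
    authors: List[str]   # "Family, Initials" strings, in citation order
    year: int
    title: str
    publisher: str
    doi: str             # bare DOI, e.g. "10.5524/100001"


def format_citation(rec: DatasetRecord) -> str:
    """Build 'Authors (year) Title. Publisher. http://dx.doi.org/<doi>'."""
    authors = "; ".join(rec.authors)
    return (
        f"{authors} ({rec.year}) {rec.title}. "
        f"{rec.publisher}. http://dx.doi.org/{rec.doi}"
    )


if __name__ == "__main__":
    # Author list shortened for illustration; the slide gives the full list.
    record = DatasetRecord(
        authors=["Li, D", "Xi, F", "Zhao, M"],
        year=2011,
        title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
        publisher="BGI Shenzhen",
        doi="10.5524/100001",
    )
    print(format_citation(record))
```

Keeping the citation as structured fields rather than free text is what would let a central metadata repository match later uses of the dataset back to its producers.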
“The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.”
“German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”
We want your data! email@example.com@gigasciencejournal.com @gigascience facebook.com/GigaScience blogs.openaccesscentral.com/blogs/gigablog/ www.gigasciencejournal.com