Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling

Scott Edmunds's talk on GigaScience, Big Data, Data Citation and future data handling at the International Conference of Genomics on 15 November 2011.

Published in: Technology, Education

Transcript

  • 1. Scott Edmunds: Big Data, Data Citation and Future Data Handling. William Gibson: "Information is the currency of the future world" www.gigasciencejournal.com cc Flickr allan*
  • 2. Data Tsunami? Flickr cc: opensourceway
  • 3. Rice v Wheat: consequences of publicly available genome data. [Chart comparing rice and wheat; vertical axis 0 to 700]
  • 4. Sharing aids everyone… “Sharing Detailed Research Data Is Associated with Increased Citation Rate.” Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308. Every 10 datasets collected contribute to at least 4 papers in the following 3 years. Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285-285. DOI: 10.1038/473285a
  • 5. Problems? Flickr cc: opensourceway
  • 6. Sequencing cost ($ per Mbp). [Chart: Moore’s Law and Sequencing curves; ~100,000X] Source: E Lander/Broad
  • 7. Sequencing Output Data Moore’s/Kryder’s Law
  • 8. Sequencing Output Data Dissemination?
  • 9. Potential sequencing capacity: 1 Illumina HiSeq 2000 (+TruSeq upgrade) = 600 Gb/run (12 days). × 128 HiSeq = 6 Tb/day = >2 Pb/year = ~2000 human genomes/day
  • 10. Difficulties keeping up… Flickr cc: opensourceway
  • 11. Do we have models for long-term funding? Human Gene Mutation Database, Kyoto Encyclopedia of Genes and Genomes? Flickr cc: opensourceway
  • 12. Are there now too many hurdles? ?
  • 13. Are there now too many hurdles? Technical: too large volumes; too heterogeneous; no home for many data types; too time consuming. Economic: too expensive; no long-term funding. Cultural: inertia; no incentives to share; unaware of how.
  • 14. Potential solutions?
  • 15. Incentives/credit. Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009). Prepublication data sharing (Toronto International Data Release Workshop): “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009)
  • 16. Data citation: DataCite and DOIs. Digital Object Identifiers (DOIs) offer a solution: most widely used identifier for scientific articles; researchers, authors, publishers know how to use them; put datasets on the same playing field as articles. Dataset: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
  • 17. Data citation: DataCite and DOIs. >1 million DOIs since Dec 2009. Central metadata repository to link with WoS/ISI - finally can track and credit use!
  • 18. Now taking submissions… Large-Scale Data Journal/Database. In conjunction with: Editor-in-Chief: Laurie Goodman, PhD. Editor: Scott Edmunds, PhD. Assistant Editor: Alexandra Basford, PhD. www.gigasciencejournal.com
  • 19. Now taking submissions…
  • 20. Editorial Board: International. Stephan Beck, UK; Alvis Brazma, UK; Ann-Shyn Chiang, Taiwan; Richard Durbin, UK; Paul Flicek, UK; Robert Hanner, Canada; Yoshihide Hayashizaki, Japan; Henning Hermjakob, UK; Wolfgang Huber, Germany; Gary King, USA; Tin-Lap Lee, Hong Kong; Donald Moerman, Canada; Karen Nelson, USA; Francis Ouellette, Canada; Stephen O'Brien, USA; Hanchuan Peng, USA; Russell Poldrack, USA; Ming Qi, China/USA; Susanna-Assunta Sansone, UK; Michael Schatz, USA; David Schwartz, USA; Fritz Sommer, USA; Lincoln Stein, Canada; Sumio Sugano, Japan; Thomas Wachtler, Germany; Jun Wang, China; Alistair Young, New Zealand; Zang Yufeng, China; Marie Zins, France. www.gigasciencejournal.com
  • 21. Editorial Board: International. Stephan Beck, Epigenomics; Alvis Brazma, Transcriptomics; Ann-Shyn Chiang, Neuroscience; Richard Durbin, Genetics/Genomics; Paul Flicek, Genomics; Robert Hanner, DNA Barcoding/Ecology; Yoshihide Hayashizaki, Genomics; Henning Hermjakob, Proteomics; Wolfgang Huber, Functional Genomics; Gary King, Medicine; Tin-Lap Lee, Genomics; Donald Moerman, Functional Genomics; Karen Nelson, Metagenomics; Francis Ouellette, Genomics; Stephen O'Brien, Genomics; Hanchuan Peng, Imaging/Neuro; Russell Poldrack, Neuroscience; Ming Qi, Genetics; Susanna-Assunta Sansone, Standards; Michael Schatz, Cloud Computing; David Schwartz, Optical Mapping; Fritz Sommer, Neuroscience; Lincoln Stein, Cloud Computing; Sumio Sugano, Genomics; Thomas Wachtler, Neuroscience; Jun Wang, Genomics; Alistair Young, Medical Imaging; Zang Yufeng, Neuroscience; Marie Zins, Medicine. www.gigasciencejournal.com
  • 22. Criteria and Focus of Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Scale/Sharing; Data publishing/DOI. www.gigasciencejournal.com
  • 23. Use of Data = Importance (subjective?) + Usability (easier to assess). www.gigasciencejournal.com
  • 24. Reproducibility/Reuse: BGI Cloud Computing resources for handling and analyzing large-scale data. Integrated tools to promote more widespread access, viewing, and analysis of data. Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files). www.gigasciencejournal.com
  • 25. Special Series/Hub for cloud-based tools. Technical notes: test tools in the BGI-Cloud. Tools + Test Data (BGI or user) in one place. Aids reproducibility. Aids reviewers (free). Aids authors: visibility (PubMed, etc.), hosting (included/free offers). Contact us: editorial@gigasciencejournal.com Oledoe flickr cc www.gigasciencejournal.com
  • 26. Standards/Searchability/Sharing: ISA-Tab compatibility to aid and promote best practice in metadata reporting. All supporting data must be publicly available. Ask for MIBBI compliance and use of reporting checklists. Part of the Biosharing network. www.gigasciencejournal.com
  • 27. Data publishing/DOI: New journal format combines standard manuscript publication with an extensive database to host all associated data. Data hosting will follow standard funding agency and community guidelines. DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking. www.gigasciencejournal.com
  • 28. of data use/release?
  • 29. The era of the data consumer?
  • 30. The era of the data consumer??
  • 31. The era of the data consumer? Free access to data, but analysis hubs/nodes will form around it?
  • 32. GDSAP: Genomic Data Submission and Analytical Platform (Tin-Lap Lee, CUHK). Data, Data, Data…: big data from the “Sequencing Farm”; data modeling; pipeline design; validation; commercial applications; “Apps”
  • 33. New Database: www.gigaDB.org
  • 34. New Database: www.gigaDB.org
  • 35. BGI Datasets Get DOI®s. Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Silkworm. Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Ancient DNA (coming soon) (Saqqaq Eskimo, Aboriginal Australian). Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope. Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum. Microbe: E. coli O104:H4 TY-2482. Cell-Line: Chinese Hamster Ovary. doi:10.5524/100004
  • 36. BGI Datasets Get DOI®s. Many unpublished… Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Silkworm. Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Ancient DNA (coming soon) (Saqqaq Eskimo, Aboriginal Australian). Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope. Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum. Microbe: E. coli O104:H4 TY-2482. Cell-Line: Chinese Hamster Ovary. doi:10.5524/100004
  • 37. Data also submitted to NCBI (including SV data to dbVar). Complemented by citable form, and data-types including: assemblies of 3 strains; raw data; SNPs; InDels; CNVs; SV
  • 38. Our first DOI: To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011). Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
  • 39. “The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.” “German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”
  • 40. We want your data! scott@gigasciencejournal.com editorial@gigasciencejournal.com @gigascience facebook.com/GigaScience blogs.openaccesscentral.com/blogs/gigablog/ www.gigasciencejournal.com
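
The capacity figures on slide 9 can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope reconstruction, not the speaker's own calculation. It takes the per-run yield, run length, and instrument count from the slide. It assumes decimal units (1 Tb = 1000 Gb, 1 Pb = 1000 Tb) and a human genome of about 3.2 Gb counted at 1x coverage. The coverage assumption is not stated in the talk; it is simply the value that makes the "~2000 human genomes/day" figure come out.

```python
# Back-of-the-envelope check of the slide 9 throughput figures.
GB_PER_RUN = 600        # Gb per HiSeq 2000 run (from slide 9)
DAYS_PER_RUN = 12       # run length in days (from slide 9)
MACHINES = 128          # number of HiSeq instruments (from slide 9)
HUMAN_GENOME_GB = 3.2   # ASSUMPTION: ~3.2 Gb genome, counted at 1x coverage


def daily_output_gb(gb_per_run: float, days_per_run: float, machines: int) -> float:
    """Aggregate output in Gb per day across all machines."""
    return gb_per_run / days_per_run * machines


per_day_gb = daily_output_gb(GB_PER_RUN, DAYS_PER_RUN, MACHINES)
per_day_tb = per_day_gb / 1000            # ASSUMPTION: decimal units
per_year_pb = per_day_tb * 365 / 1000     # ASSUMPTION: decimal units
genomes_per_day = per_day_gb / HUMAN_GENOME_GB

print(f"{per_day_tb:.1f} Tb/day")
print(f"{per_year_pb:.1f} Pb/year")
print(f"{genomes_per_day:.0f} human genomes/day (at 1x coverage)")
```

Under these assumptions the totals come to roughly 6.4 Tb/day and 2.3 Pb/year, in line with the rounded "6 Tb/day" and ">2 Pb/year" on the slide. At a more typical resequencing depth such as 30x, the genomes-per-day figure would be about thirty times lower.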