Alexandra Basford, InCoB 2011: A Journal’s Perspective on Data Standards and Biocuration

Alexandra Basford's talk in the curation session at the InCoB meeting in Kuala Lumpur, 30/11/11, on GigaScience: A Journal’s Perspective on Data Standards and Biocuration

Published in: Technology
  • Integrated tools to promote more widespread access, viewing, and analysis of the stored data. BGI Cloud Computing resources for handling and analyzing large-scale data. All data are given a DOI to make datasets easy to find and cite, as well as for citation tracking.
  • Our facilities feature Sanger and next-generation sequencing technologies, providing the highest throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb of raw data per day, supported by several supercomputing centers with a total peak performance of up to 102 Tflops, 20 TB of memory, and 10 PB of storage. We provide stable and efficient resources to store and analyze the massive amounts of data generated by next-generation sequencing.
  • Not shown: 1,000 Mendelian Disorders Project, Autism Sequencing Project, Netherlands sequencing…
  • Assemblies and raw data are still going to NCBI.
  • Raw data has been submitted to the SRA, the assembly to GenBank (no number), and SV data to dbVar (it’s the first plant data they’ve received). GigaDB complements the traditional public databases by hosting all these “extra” data types: it’s all in one place, and it’s citable.
  • Have all of the metadata fields, working on integrating the tools.

    1. 1. A Journal’s Perspective on Data Standards and Biocuration. Alexandra Basford, PhD. www.gigasciencejournal.com
    2. 2. Overview: Introduction / The Curation Challenges of a Journal/Database (Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing) / Data Publishing: Our DOI Adventures
    3. 3. Overview: Introduction: How do we deal with “big data”? / The Curation Challenges of a Journal/Database (Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing) / Data Publishing: Our DOI Adventures
    4. 4. vs. ?
    5. 5. What is ?
    6. 6. www.gigasciencejournal.com
    7. 7. GigaScience is a new open-access, open-data journal for the publication of all types of biological studies that use or create large-scale data sets. The scope spans the biomedical and life sciences, including: - “Omics” - Ecology - Imaging - Medicine - Neuroscience - Systems biology … “big and sharable” Published by in partnership with
    8. 8. Editorial Board – International: Stephan Beck, UK; Stephen O'Brien, USA; Alvis Brazma, UK; Hanchuan Peng, USA; Ann-Shyn Chiang, Taiwan; Russell Poldrack, USA; Richard Durbin, UK; Ming Qi, China/USA; Paul Flicek, UK; Susanna-Assunta Sansone, UK; Robert Hanner, Canada; Michael Schatz, USA; Yoshihide Hayashizaki, Japan; David Schwartz, USA; Henning Hermjakob, UK; Fritz Sommer, USA; Wolfgang Huber, Germany; Lincoln Stein, Canada; Gary King, USA; Sumio Sugano, Japan; Tin-Lap Lee, Hong Kong; Thomas Wachtler, Germany; Donald Moerman, Canada; Jun Wang, China; Karen Nelson, USA; Alistair Young, New Zealand; Francis Ouellette, Canada; Zang Yufeng, China; Lennart Hammarström, Sweden; Marie Zins, France; Paul Horton, Japan
    9. 9. Editorial Board – Multidisciplinary: Stephan Beck, Epigenomics; Stephen O'Brien, Genomics; Alvis Brazma, Transcriptomics; Hanchuan Peng, Imaging/Neuro; Ann-Shyn Chiang, Neuroscience; Russell Poldrack, Neuroscience; Richard Durbin, Genetics/Genomics; Ming Qi, Genetics; Paul Flicek, Genomics; Susanna-Assunta Sansone, Standards; Robert Hanner, DNA Barcoding/Ecology; Michael Schatz, Cloud Computing; Yoshihide Hayashizaki, Genomics; David Schwartz, Optical Mapping; Henning Hermjakob, Proteomics; Fritz Sommer, Neuroscience; Wolfgang Huber, Functional Genomics; Lincoln Stein, Cloud Computing; Gary King, Medicine; Sumio Sugano, Genomics; Tin-Lap Lee, Genomics; Thomas Wachtler, Neuroscience; Donald Moerman, Functional Genomics; Jun Wang, Genomics; Karen Nelson, Metagenomics; Alistair Young, Medical Imaging; Francis Ouellette, Genomics; Zang Yufeng, Neuroscience; Lennart Hammarström, Immuno/Genetics; Marie Zins, Medicine; Paul Horton, Genetics/Tools
    10. 10. Now accepting submissions
    11. 11. What is ?
    12. 12. www.GigaDB.org
    13. 13. &✕vs. !
    14. 14. An Unusual Format • GigaScience combines standard manuscript publication with an ever-expanding database • Evolving data repository – Integrating tools for public access, viewing, and analysis of the stored data – Improvements driven by community input • All datasets are assigned digital object identifiers (DOIs) to make them easy to access, track, and cite
    15. 15. Data Sharing Hurdles • Technical – too large volumes – too heterogeneous – no home for many data types • Economic – too expensive – no long-term funding • Cultural – inertia – no incentives to share – unaware of how? – too time consuming
    16. 16. Changing Trends: Cultural shift towards data sharing. Growing/widening user base. The long tail of new “big-data” producers? Curation, curation, curation?
    17. 17. Use of Data = Importance (subjective?) + Usability (easier to assess)
    18. 18. Challenges for a Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing; Data publishing/DOI®
    19. 19. Why DOI®s? • Guarantee of permanency • Clear method for data tracking and data citation, allowing: – Increased searchability (and hopefully use) of data – Credit for data production, making it clear who produced the data and when – Credit to original authors for their data’s use – The ability to track and receive feedback on data usage – A data citation metric potentially rivaling and complementary to the impact factor – The potential to make the data available and receive credit for it earlier, then later publish papers on the dataset
    20. 20. Largest Sequencing Capacity in the World. Sequencers: 137 Illumina/HiSeq 2000; 27 LifeTech/SOLiD 4; 16 AB/3730xl + 110 MegaBACEs; 2 Illumina iScan. Data Production: 5.6 Tb / day; > 1500X of human genome / day. Multiple Supercomputing Centers: 157 TB Flops; 20 TB Memory; 12.6 PB Storage
    21. 21. BGI – “Sequence it.”
    22. 22. Early BGI DOI®s
    23. 23. Datasets. Invertebrates: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Silkworm. Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope. Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum. Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Ancient DNA (coming soon) (Saqqaq Eskimo, Aboriginal Australian). Microbe: E. coli O104:H4 TY-2482. Cell Line: Chinese Hamster Ovary
    24. 24. The Success of E. coli
    25. 25. Our First DOI® To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
    26. 26. N Engl J Med 2011; 365:718-724.
    27. 27. The Macaque Story
    28. 28. Analysis paper published
    29. 29. Data DOIs appear in the paper
    30. 30. Sorghum as the New Gold Standard
    31. 31. • Data also submitted to NCBI (including SV data to dbVar) • Submission to public databases complemented by its citable form in GigaDB: - Assemblies of three strains - Raw data - SNPs - InDels - CNVs - SV
    32. 32. In the paper…
    33. 33. In the references…
    34. 34. Is the DOI.
    35. 35. Progress! July: We begin issuing data DOIs. August: Journals accept articles with data that have data DOIs. October: Data DOIs listed in journal articles. November: Data DOIs are properly cited in the reference section of journal articles. (It’s been a busy year.)
    36. 36. Challenges for a Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing; Data publishing/DOI®
    37. 37. Challenges for /: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing; ✔ Data publishing/DOI®
    38. 38. Reproducibility/Reuse • BGI Cloud Computing resources for handling and analyzing large-scale data. • Integrated tools to promote more widespread access, viewing, and analysis of data. • Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).
    39. 39. Utility/Usability = ease of access • Special series/hub for cloud-based tools - Technical notes: test tools in the BGI-Cloud. - Tools + test data (BGI or user) in one place. - Aids reproducibility. - Aids reviewers (free). - Aids authors: visibility (PubMed, etc.), hosting (included/free offers). Contact us: editorial@gigasciencejournal.com Oledoe flickr cc
    40. 40. Utility/Usability = tools Tin-Lap Lee, CUHK
    41. 41. Standards/Searchability/Sharing • ISA-Tab compatibility to aid and promote best practice in metadata reporting. • All supporting data must be publicly available. • Ask for MIBBI compliance and use of reporting checklists. • Part of the BioSharing network and the International Neuroinformatics Coordinating Facility.
    42. 42. Big Data • Initiated 505 plant and animal genome projects • Completed fine or draft genome maps for over 100 species • Finished the sequencing of about 200 species. ldl.genomics.cn
    43. 43. Editor-in-Chief: Laurie Goodman, PhD. Editor: Scott Edmunds, PhD. Assistant Editor: Alexandra Basford, PhD. Contact: editorial@gigasciencejournal.com. Follow GigaScience on Twitter @GigaScience. www.gigasciencejournal.com www.gigaDB.org
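
Slide 25 gives the first dataset DOI in two forms: the citation form `doi:10.5524/100001` and the resolver link `http://dx.doi.org/10.5524/100001`. As a minimal sketch of how the two relate, the Python below strips common prefixes from a DOI string and rebuilds the resolver link. The function names, the prefix list, and the regular expression are illustrative assumptions, not part of GigaDB or of any DOI library, and the pattern is only a loose sanity check, not a full DOI syntax validator.

```python
import re

# Loose sanity check (an assumption for this sketch): "10.", a numeric
# registrant code, a slash, then a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that commonly wrap a DOI in citations and links (illustrative list).
KNOWN_PREFIXES = (
    "doi:",
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI (e.g. '10.5524/100001') from a citation or link."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


def resolver_url(raw: str, resolver: str = "http://dx.doi.org/") -> str:
    """Build a resolver link; the default resolver is the one shown on slide 25."""
    return resolver + normalize_doi(raw)


if __name__ == "__main__":
    # The dataset DOI quoted on slide 25.
    print(resolver_url("doi:10.5524/100001"))  # http://dx.doi.org/10.5524/100001
```

Normalizing to the bare DOI before storing or comparing citations means the same dataset is counted once, whichever form an article's reference list uses.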
