Alexandra Basford, InCoB 2011: A Journal’s Perspective on Data Standards and Biocuration

Alexandra Basford's talk in the curation session at the InCoB meeting in Kuala Lumpur, 30/11/11, on GigaScience: A Journal’s Perspective on Data Standards and Biocuration


Usage Rights

© All Rights Reserved

  • Integrated tools to promote more widespread access, viewing, and analysis of the stored data. BGI Cloud Computing resources for handling and analyzing large-scale data. All Data given a DOI to allow ease of finding and citing datasets, as well as for citation tracking.
  • Our facilities feature Sanger and next-generation sequencing technologies, providing the highest throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb raw data per day, supported by several supercomputing centers with a total peak performance up to 102 Tflops, 20 TB of memory, and 10 PB storage. We provide stable and efficient resources to store and analyze massive amounts of data generated by next-generation sequencing.
  • Not shown: 1,000 Mendelian Disorders Project, Autism Sequencing Project, Netherlands sequencing…
  • Assemblies and raw data are still going to NCBI.
  • Raw data has been submitted to the SRA, the assembly to GenBank (no number), and SV data to dbVar (it’s the first plant data they’ve received). GigaDB complements the traditional public databases by having all these “extra” data types: it’s all in one place, and it’s citable.
  • Have all of the metadata fields, working on integrating the tools.

Presentation Transcript

  • A Journal’s Perspective on Data Standards and Biocuration. Alexandra Basford, PhD. www.gigasciencejournal.com
  • Overview / Introduction. Data Publishing: Our DOI Adventures. The Curation Challenges of a Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing.
  • Overview / Introduction: How do we deal with “big data”? Data Publishing: Our DOI Adventures. The Curation Challenges of a Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing.
  • vs. ?
  • What is GigaScience?
  • www.gigasciencejournal.com
  • GigaScience is a new open-access, open-data journal for the publication of all types of biological studies that use or create large-scale data sets. The scope spans the biomedical and life sciences, including: “Omics”, Ecology, Imaging, Medicine, Neuroscience, Systems biology … “big and sharable”. Published by [logo] in partnership with [logo].
  • Editorial Board – International: Stephan Beck, UK; Stephen O’Brien, USA; Alvis Brazma, UK; Hanchuan Peng, USA; Ann-Shyn Chiang, Taiwan; Russell Poldrack, USA; Richard Durbin, UK; Ming Qi, China/USA; Paul Flicek, UK; Susanna-Assunta Sansone, UK; Robert Hanner, Canada; Michael Schatz, USA; Yoshihide Hayashizaki, Japan; David Schwartz, USA; Henning Hermjakob, UK; Fritz Sommer, USA; Wolfgang Huber, Germany; Lincoln Stein, Canada; Gary King, USA; Sumio Sugano, Japan; Tin-Lap Lee, Hong Kong; Thomas Wachtler, Germany; Donald Moerman, Canada; Jun Wang, China; Karen Nelson, USA; Alistair Young, New Zealand; Francis Ouellette, Canada; Zang Yufeng, China; Lennart Hammarström, Sweden; Marie Zins, France; Paul Horton, Japan
  • Editorial Board – Multidisciplinary: Stephan Beck, Epigenomics; Stephen O’Brien, Genomics; Alvis Brazma, Transcriptomics; Hanchuan Peng, Imaging/Neuro; Ann-Shyn Chiang, Neuroscience; Russell Poldrack, Neuroscience; Richard Durbin, Genetics/Genomics; Ming Qi, Genetics; Paul Flicek, Genomics; Susanna-Assunta Sansone, Standards; Robert Hanner, DNA Barcoding/Ecology; Michael Schatz, Cloud Computing; Yoshihide Hayashizaki, Genomics; David Schwartz, Optical Mapping; Henning Hermjakob, Proteomics; Fritz Sommer, Neuroscience; Wolfgang Huber, Functional Genomics; Lincoln Stein, Cloud Computing; Gary King, Medicine; Sumio Sugano, Genomics; Tin-Lap Lee, Genomics; Thomas Wachtler, Neuroscience; Donald Moerman, Functional Genomics; Jun Wang, Genomics; Karen Nelson, Metagenomics; Alistair Young, Medical Imaging; Francis Ouellette, Genomics; Zang Yufeng, Neuroscience; Lennart Hammarström, Immuno/Genetics; Marie Zins, Medicine; Paul Horton, Genetics/Tools
  • Now accepting submissions
  • What is GigaDB?
  • www.GigaDB.org
  • &✕vs. !
  • An Unusual Format: GigaScience combines standard manuscript publication with an ever-expanding database. Evolving data repository: integrating tools for public access, viewing, and analysis of the stored data; improvements driven by community input. All datasets are assigned data digital object identifiers (DOIs) to make them easy to access, track, and cite.
  • Data Sharing Hurdles. Technical: volumes too large; too heterogeneous; no home for many data types. Economic: too expensive; no long-term funding. Cultural: inertia; no incentives to share; unaware of how; too time-consuming.
  • Changing Trends: Cultural shift towards data sharing. Growing/widening user base. The long tail of new “big-data” producers? Curation, curation, curation?
  • Use of Data = Importance (subjective?) + Usability (easier to assess)
  • Challenges for a Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing; Data publishing/DOI®
  • Why DOI®s? Guarantee of permanency. Clear method for data tracking and data citation, allowing: increased searchability (and hopefully use) of data; credit for data production, making it clear who produced the data and when; credit to original authors for their data’s use; the ability to track and receive feedback on data usage; a data citation metric potentially rivaling and complementary to the impact factor; the potential to make the data available and receive credit for it earlier, then later publish papers on the dataset.
  • Largest Sequencing Capacity in the World. Sequencers: 137 Illumina/HiSeq 2000; 27 LifeTech/SOLiD 4; 16 AB/3730xl + 110 MegaBACEs; 2 Illumina iScan. Data Production: 5.6 Tb/day; > 1500X of human genome/day. Multiple Supercomputing Centers: 157 TB Flops; 20 TB Memory; 12.6 PB Storage.
  • BGI – “Sequence it.”
  • Early BGI DOI®s
  • Datasets. Invertebrates: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Silkworm. Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope. Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum. Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Ancient DNA (coming soon) (Saqqaq Eskimo, Aboriginal Australian). Microbe: E. coli O104:H4 TY-2482. Cell Line: Chinese Hamster Ovary.
  • The Success of E. coli
  • Our First DOI®: To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
  • N Engl J Med 2011; 365:718-724.
  • The Macaque Story
  • Analysis paper published
  • Data DOIs appear in the paper
  • Sorghum as the New Gold Standard
  • Data also submitted to NCBI (including SV data to dbVar). Submission to public databases complemented by its citable form in GigaDB: assemblies of three strains; raw data; SNPs; InDels; CNVs; SV.
  • In the paper…
  • In the references…
  • Is the DOI.
  • Progress! July: We begin issuing data DOIs. August: Journals accept articles with data that have data DOIs. October: Data DOIs listed in journal articles. November: Data DOIs are properly cited in the reference section of journal articles. (It’s been a busy year.)
  • Challenges for a Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing; Data publishing/DOI®
  • Challenges for [logo]/[logo]: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Sharing; ✔ Data publishing/DOI®
  • Reproducibility/Reuse: BGI Cloud Computing resources for handling and analyzing large-scale data. Integrated tools to promote more widespread access, viewing, and analysis of data. Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).
  • Utility/Usability = ease of access. Special series/hub for cloud-based tools. Technical notes: test tools in the BGI-Cloud. Tools + test data (BGI or user) in one place. Aids reproducibility. Aids reviewers (free). Aids authors: visibility (PubMed, etc.), hosting (included/free offers). Contact us: editorial@gigasciencejournal.com (Photo: Oledoe, Flickr, CC)
  • Utility/Usability = tools (Tin-Lap Lee, CUHK)
  • Standards/Searchability/Sharing: ISA-Tab compatibility to aid and promote best practice in metadata reporting. All supporting data must be publicly available. Ask for MIBBI compliance and use of reporting checklists. Part of the BioSharing network and the International Neuroinformatics Coordinating Facility.
  • Big Data: Initiated 505 plant and animal genome projects. Completed fine or draft genome maps for over 100 species. Finished the sequencing of about 200 species. ldl.genomics.cn
  • Editor-in-Chief: Laurie Goodman, PhD. Editor: Scott Edmunds, PhD. Assistant Editor: Alexandra Basford, PhD. Contact: editorial@gigasciencejournal.com. Follow GigaScience on Twitter @GigaScience. www.gigasciencejournal.com www.GigaDB.org