Scott Edmunds: GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami

Scott Edmunds talk at the HUPO congress in Geneva, September 6th 2011 on GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami.

  • BGI (formerly known as Beijing Genomics Institute) was founded in 1999 and has since become the largest genomic organization in the world, with a focus on research and applications in healthcare, agriculture, conservation, and bio-energy. Our goal is to make leading-edge genomics highly accessible to the global research community by leveraging the industry’s best technology, economies of scale and expert bioinformatics resources. BGI Americas was established as an interface with customers and collaborators in North and South America.
  • Our facilities feature Sanger and next-generation sequencing technologies, providing the highest-throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb raw data per day, supported by several supercomputing centers with a total peak performance up to 102 Tflops, 20 TB of memory, and 10 PB storage. We provide stable and efficient resources to store and analyze massive amounts of data generated by next-generation sequencing.
  • Helps reproducibility, but some debate over whether it can help that much regarding scaling.
  • Need to help authors and curators.
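
The capacity figures in these notes (5 Tb of raw data per day) and on the capacity slide in the transcript (5.6 Tb per day, "> 1500X of human genome / day") can be related with a little arithmetic. The sketch below is only a rough check, not anything taken from the talk: it assumes that "Tb" means terabases and that a human genome is about 3.2 billion bases, and the function name is made up for illustration.

```python
# Rough coverage arithmetic. The genome size is an assumed round figure,
# and "Tb" is read as terabases (10^12 bases); neither is stated in the talk.
HUMAN_GENOME_BASES = 3.2e9  # assumption: ~3.2 Gb human genome


def genome_equivalents_per_day(bases_per_day: float,
                               genome_bases: float = HUMAN_GENOME_BASES) -> float:
    """Return how many genome-lengths of raw sequence one day of output represents."""
    return bases_per_day / genome_bases


notes_capacity = 5.0e12  # 5 Tb/day, the December 2010 figure in the notes
slide_capacity = 5.6e12  # 5.6 Tb/day, the figure on the capacity slide

for label, capacity in [("notes", notes_capacity), ("slide", slide_capacity)]:
    fold = genome_equivalents_per_day(capacity)
    print(f"{label}: ~{fold:.0f}X human genome per day")
```

Under these assumptions both figures come out above the 1500X quoted on the slide.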
  • Transcript of "Scott Edmunds: GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami"

    1. GigaScience: a Journal or a Database? (Lessons learned from the Genomics “Tsunami”)
       Scott Edmunds
       HUPO Congress 2011, Geneva
       www.gigasciencejournal.com
    2. BGI Introduction
       Formerly known as Beijing Genomics Institute
       Founded in 1999
       Now the largest genomic organization in the world
       Goal:
       • Use genomics technology to impact society
       • Make leading-edge genomics highly accessible to the global research community
    3. Largest Sequencing Capacity in the World
       Sequencers:
       • 137 Illumina HiSeq 2000
       • 27 LifeTech SOLiD 4
       • 16 AB 3730xl + 110 MegaBACEs
       • 2 Illumina iScan
       Data production:
       • 5.6 Tb / day
       • > 1500X of human genome / day
       Multiple supercomputing centers:
       • 157 TFlops
       • 20 TB memory
       • 12.6 PB storage
    4. Mass spectrometry at BGI
       • QTRAP 5500, AB SCIEX
       • Orbitrap Velos, Thermo Scientific
       • maXis Q-TOF, Bruker
       • ultraflex, Bruker
    5. Products and Services Offered to Collaborators
       Protein profiling for any species (tying in with 1000 PARGP)
       Techniques:
       • Quantitative analysis
       • Post-translational modification
       • Target Proteomics
       • Metabolomics
    7. “Trans-Omics”
       Objective to integrate data from:
       • Genomics
       • Transcriptomics
       • Proteomics
       • Metabolomics
    10. BGI Proteomics Dept Focus:
        • RAW MS data storage and analysis
        • Upstream analysis
        • “Large-scale” screening/quantitative analysis
        Working on:
        • Automatic analysis pipelines/tools
        • Industrial usage/standards
    11. Lessons Learned: What went right?
    12. Lessons Learned: 1. Having a cool project helps…
        Bill Clinton:
        “We are here to celebrate the completion of the first survey of the entire human genome. Without a doubt, this is the most important, most wondrous map ever produced by humankind.”
        “Today we are learning the language in which God created life.”
    13. Lessons Learned: 2. Reproducibility is important…
        Helped by stability of:
        • Platforms
        • Infrastructure
        • Standards
        1st Gen
        2nd Gen
    14. Lessons Learned: 3. Sharing is important…
        V
    16. Lessons Learned: 3. Sharing is important…
        Bermuda Accords 1996/1997/1998:
        • Automatic release of sequence assemblies within 24 hours.
        • Immediate publication of finished annotated sequences.
        • Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
        Fort Lauderdale Agreement, 2003:
        • Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production.
        • Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.
        Toronto International Data Release Workshop, 2009:
        • The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets, whether from proteomics, biobanking or metabolite research.
    17. Benefits of Data-sharing
        Sharing Detailed Research Data Is Associated with Increased Citation Rate.
        Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
        Every 10 datasets collected contributes to at least 4 papers in the following 3 years.
        Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a
    18. Rice v Wheat: consequences of publicly available genome data.
    19. The Ecoresponsive Genome of Daphnia pulex
        Colbourne et al., Science, 4 February 2011:
        • 200 Mb genome, 30,907 genes
        • Duplicated genes most responsive to ecological challenges
    20. Daphnia Genome Consortium
        wFleabase: Mar 2006
        Genome release: July 2007
        Genome published: Feb 2011
        >58 companion papers
        https://daphnia.cgb.indiana.edu/Publications
    21. Problems?
        Flickr cc: opensourceway
    22. Lessons Learned: 4. Need to manage expectations…
        June 2000
        Thomas Michael Dexter (Wellcome Trust):
        “Mapping the human genome has been compared with putting a man on the moon, but I believe it is more than that. This is the outstanding achievement not only of our lifetime, but in terms of human history”
    23. Lessons Learned: 4. Need to manage expectations…
        June 2010
    24. Lessons Learned: 5. Data, data, data
        Sequencing cost ($ per Mbp)
        Moore’s Law
        ~100,000X
        Sequencing
        Source: E Lander/Broad
    25. Lessons Learned: 5. Data, data, data
        Sequencing Output
        Data
        Storage
        Moore’s/Kryder’s Law
    26. Lessons Learned: 5. Data, data, data
        Sequencing Output
        Data
        Publication
        Dissemination?
    27. Lessons Learned: 5. Data, data, data
        Can we keep up?
        Flickr cc: opensourceway
    28. Lessons Learned: 5. Data, data, data
        Do we have models for long-term funding?
        • Human Gene Mutation Database
        • Kyoto Encyclopedia of Genes and Genomes
        ?
        Flickr cc: opensourceway
    29. Lessons Learned: 5. Data, data, data
        Growing/widening user base.
        3rd Gen sequencers: “Democratizing sequencing”
        ?
    30. Lessons Learned: 5. Data, data, data
        Curation, curation, curation?
        ?
        The long tail of new “big-data” producers?
    31. Lessons Learned: 5. Data, data, data
        Are there now too many hurdles?
        ?
    32. Lessons Learned: 5. Data, data, data
        Are there now too many hurdles?
        Technical:
        • too large volumes
        • too heterogeneous
        • no home for many data types
        • too time consuming
        Economic:
        • too expensive, no long-term funding
        Cultural:
        • inertia
        • no incentives to share
        • unaware of how
        ?
    33. Potential solutions?
    34. Potential solutions: Better handling of data, data, data
        Cloud?
    35. Potential solutions: Better handling of data, data, data
        • What to save/what to throw away?
        • Better compression?
    36. Potential solutions: Better handling of metadata…
        Cloud solutions?
        Better tools for assessing data quality…
    37. Potential Solutions: New incentives/credit
        Credit where credit is overdue:
        “One option would be to provide researchers who release data to public repositories with a means of accreditation.”
        “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
        Nature Biotechnology 27, 579 (2009)
        Prepublication data sharing (Toronto International Data Release Workshop):
        “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
        Nature 461, 168-170 (2009)
        ?
    38. Data citation: DataCite and DOIs
        Digital Object Identifiers (DOIs) offer a solution
        • Most widely used identifier for scientific articles
        • Researchers, authors, publishers know how to use them
        • Put datasets on the same playing field as articles
        Dataset:
        Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
    41. Data citation: DataCite and DOIs
        >1 million DOIs since Dec 2009
        Central metadata repository to link with WoS/ISI
        - finally can track and credit use!
    42. How can we combine these?
        Databases
        ?
        Journals
    43. Now taking submissions…
        Large-Scale Data Journal/Database
        In conjunction with:
        Editor-in-Chief: Laurie Goodman, PhD
        Editor: Scott Edmunds, PhD
        Assistant Editor: Alexandra Basford, PhD
        www.gigasciencejournal.com
    44. Criteria and Focus of Journal/Database
        • Reproducibility/Reuse
        • Utility/Usability
        • Standards/Searchability/Scale/Sharing
        • Data publishing/DOI
    48. Data publishing/DOI
        • Data hosting will follow standard funding agency and community guidelines.
        • DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.
        • Datasets tracked by WoS/ISI, allowing additional metrics/credit for use.
    51. Reproducibility/Reuse
        • BGI Cloud Computing resources for handling and analyzing large-scale data.
        • Integrated tools to promote more widespread access, viewing, and analysis of data.
        • Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).
    54. Special Series/Hub for cloud-based tools
        • Technical notes: test tools in the BGI-Cloud.
        • Tools + Test Data (BGI or user) in one place.
        • Aids reproducibility.
        • Aids reviewers (free)
        • Aids authors: visibility (PubMed, etc.), hosting (included/free offers)
        Contact us: editorial@gigasciencejournal.com
        Flickr cc: Oledoe
    59. Standards/Searchability/Sharing
        • ISA-Tab compatibility to aid and promote best practice in metadata reporting.
        • All supporting data must be publicly available.
        • Ask for MIBBI compliance and use of reporting checklists.
        • Part of the Biosharing network.
    63. Our first DOI:
        To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
        Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
        To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
    66. “The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.”
        “German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”
    68. G10K Genomes Get DOI®s
        doi:10.5524/100004
    69. We want your data!
        scott@gigasciencejournal.com
        editorial@gigasciencejournal.com
        @gigascience
        www.gigasciencejournal.com
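
The dataset citations in the transcript pair a DOI with a resolver link (for example doi:10.5524/100001 and http://dx.doi.org/10.5524/100001). The following is a minimal sketch of that pairing, not GigaScience or DataCite code: the function names are hypothetical, the resolver prefix is simply the one shown on the slide, and the citation layout copies the PANGAEA example from the data citation slide.

```python
from urllib.parse import quote

# Resolver prefix as it appears on the "Our first DOI" slide.
RESOLVER = "http://dx.doi.org/"


def strip_doi_prefix(doi: str) -> str:
    """Return the bare DOI, dropping an optional leading 'doi:' label."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[4:]
    return bare


def doi_to_url(doi: str) -> str:
    """Build a resolver URL for a DOI (hypothetical helper, for illustration)."""
    # Keep '/' unescaped so the prefix/suffix structure of the DOI is preserved.
    return RESOLVER + quote(strip_doi_prefix(doi), safe="/")


def format_data_citation(creators: str, year: int, title: str,
                         publisher: str, doi: str) -> str:
    """Lay out a dataset citation in the style of the PANGAEA example."""
    return f"{creators} ({year}). {title}. {publisher}. doi:{strip_doi_prefix(doi)}"


print(doi_to_url("doi:10.5524/100001"))
print(format_data_citation("Yancheva et al", 2007,
                           "Analyses on sediment of Lake Maar",
                           "PANGAEA", "10.1594/PANGAEA.587840"))
```

Because the identifier resolves like an article DOI, a dataset cited this way can be found and tracked with the same tooling readers already use for papers, which is the point the data citation slides make.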