Scott Edmunds: GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami

GigaScience: a Journal or a Database? (Lessons learned from the Genomics “Tsunami”). Scott Edmunds, HUPO Congress 2011, Geneva. www.gigasciencejournal.com

BGI Introduction. Formerly known as Beijing Genomics Institute. Founded in 1999. Now the largest genomic organization in the world. Goal: use genomics technology to impact society; make leading-edge genomics highly accessible to the global research community.

Largest Sequencing Capacity in the World. Sequencers: 137 Illumina HiSeq 2000; 27 LifeTech SOLiD 4; 16 AB 3730xl + 110 MegaBACEs; 2 Illumina iScan. Data production: 5.6 Tb / day (> 1500X of human genome / day). Multiple supercomputing centers: 157 TB Flops; 20 TB memory; 12.6 PB storage.
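The "> 1500X of human genome / day" figure can be sanity-checked with a line of arithmetic. The sketch below assumes that "Tb" means terabases and that the haploid human genome is about 3.1 Gb; neither assumption is stated on the slide, and the function name is illustrative only.

```python
# Rough check of the "> 1500X of human genome / day" claim.
# Assumptions (not from the slides): "Tb" = terabases (1e12 bases),
# and a haploid human genome of roughly 3.1e9 bases.

DAILY_OUTPUT_BASES = 5.6e12   # 5.6 Tb / day, as quoted on the slide
HUMAN_GENOME_BASES = 3.1e9    # assumed genome size


def daily_fold_coverage(daily_bases: float, genome_bases: float) -> float:
    """Return how many genome-equivalents are sequenced per day."""
    return daily_bases / genome_bases


coverage = daily_fold_coverage(DAILY_OUTPUT_BASES, HUMAN_GENOME_BASES)
print(f"~{coverage:.0f}X human genome equivalents per day")
```

Under these assumptions the result is roughly 1,800X per day, which is consistent with the slide's "> 1500X".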

Mass spectrometry at BGI: QTRAP 5500 (AB SCIEX); Orbitrap Velos (Thermo Scientific); maXis Q-TOF (Bruker); ultraflex (Bruker).

Products and Services Offered to Collaborators. Protein profiling for any species (tying in with 1000 PARGP). Techniques: quantitative analysis; post-translational modification; target proteomics; metabolomics.

“Trans-Omics” Objective to integrate data from:

Lessons Learned: What went right?

Lessons Learned: 1. Having a cool project helps… Bill Clinton: “We are here to celebrate the completion of the first survey of the entire human genome. Without a doubt, this is the most important, most wondrous map ever produced by humankind.” “Today we are learning the language in which God created life.”

Lessons Learned: 2. Reproducibility is important… Helped by stability of: platforms; infrastructure; standards. 1st Gen, 2nd Gen.

Lessons Learned: 3. Sharing is important… Bermuda Accords 1996/1997/1998: Automatic release of sequence assemblies within 24 hours. Immediate publication of finished annotated sequences. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society. Fort Lauderdale Agreement, 2003: Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria. Toronto International data release workshop, 2009: The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.

Benefits of Data-sharing. Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007). PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308. Every 10 datasets collected contribute to at least 4 papers in the following 3 years. Piwowar HA, Vision TJ, Whitlock MC (2011). Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a

Rice v Wheat: consequences of publicly available genome data.

The Ecoresponsive Genome of Daphnia pulex. Colbourne et al., Science, 4 February 2011: 200 Mb genome, 30,907 genes. Duplicated genes most responsive to ecological challenges.

Daphnia Genome Consortium. wFleaBase: Mar 2006. Genome release: July 2007. Genome published: Feb 2011. >58 companion papers. https://daphnia.cgb.indiana.edu/Publications

Problems? Flickr cc: opensourceway

Lessons Learned: 4. Need to manage expectations… June 2000, Thomas Michael Dexter (Wellcome Trust): “Mapping the human genome has been compared with putting a man on the moon, but I believe it is more than that. This is the outstanding achievement not only of our lifetime, but in terms of human history.”

Lessons Learned: 4. Need to manage expectations… June 2010

Lessons Learned: 5. Data, data, data. Figure: sequencing cost ($ per Mbp) compared with Moore’s Law; ~100,000X for sequencing. Source: E Lander/Broad

Lessons Learned: 5. Data, data, data. Sequencing output; data storage; Moore’s/Kryder’s Law.

Lessons Learned: 5. Data, data, data Sequencing Output Data Publication Dissemination?

Lessons Learned: 5. Data, data, data Can we keep up? Flickr cc: opensourceway

Lessons Learned: 5. Data, data, data. Do we have models for long-term funding? Human Gene Mutation Database; Kyoto Encyclopedia of Genes and Genomes. Flickr cc: opensourceway

Lessons Learned: 5. Data, data, data Growing/widening user base. 3rd Gen sequencers: “Democratizing sequencing” ?

Lessons Learned: 5. Data, data, data. Curation, curation, curation? The long tail of new “big-data” producers?

Lessons Learned: 5. Data, data, data. Are there now too many hurdles? Technical: too large volumes; too heterogeneous; no home for many data types; too time consuming. Economic: too expensive; no long-term funding. Cultural: inertia; no incentives to share; unaware of how.

Potential solutions: Better handling of data, data, data Cloud?

Better Compression?

Potential Solutions: New incentives/credit. Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009). Prepublication data sharing (Toronto International Data Release Workshop): “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009)

Data citation: DataCite and DOIs. Digital Object Identifiers (DOIs) offer a solution:

Researchers, authors, publishers know how to use them

Put datasets on the same playing field as articles. Dataset: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
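To illustrate how a dataset DOI can be cited and followed like an article DOI, here is a minimal sketch that rebuilds the PANGAEA citation shown on the slide and turns its DOI into a resolvable link. The function names and the citation layout are hypothetical conveniences, not a DataCite API; the only external fact relied on is that DOIs resolve through the public https://doi.org/ resolver.

```python
# Minimal sketch: format a dataset citation and a resolver link from a DOI.
# Function names and citation layout are illustrative assumptions,
# not part of any DataCite or PANGAEA library.


def format_dataset_citation(creators: str, year: int, title: str,
                            publisher: str, doi: str) -> str:
    """Build a citation string in the style used on the slide."""
    return f"{creators} ({year}). {title}. {publisher}. doi:{doi}"


def doi_to_url(doi: str) -> str:
    """Turn a bare or 'doi:'-prefixed DOI into a doi.org resolver URL."""
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return f"https://doi.org/{doi}"


citation = format_dataset_citation(
    "Yancheva et al", 2007, "Analyses on sediment of Lake Maar",
    "PANGAEA", "10.1594/PANGAEA.587840",
)
print(citation)
print(doi_to_url("doi:10.1594/PANGAEA.587840"))
```

Because the identifier has the same form as an article DOI, the same reference managers and citation trackers can handle it without special cases.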

Data citation: DataCite and DOIs. >1 million DOIs since Dec 2009. Central metadata repository to link with WoS/ISI - finally can track and credit use!

How can we combine these? Databases ? Journals

Now taking submissions… Large-Scale Data Journal/Database. In conjunction with: Editor-in-Chief: Laurie Goodman, PhD; Editor: Scott Edmunds, PhD; Assistant Editor: Alexandra Basford, PhD. www.gigasciencejournal.com

Criteria and Focus of Journal/Database:

Standards/Searchability/Scale/Sharing

Data publishing/DOI. www.gigasciencejournal.com

DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.

Datasets tracked by WoS/ISI, allowing additional metrics/credit for use. www.gigasciencejournal.com

Reproducibility/Reuse
