Scott Edmunds at DataCite 2012: Adventures in Data Citation
Scott Edmunds at the DataCite summer meeting: Adventures in Data Citation. June 14th 2012

Published in: Technology
  • BGI (formerly known as Beijing Genomics Institute) was founded in 1999 and has since become the largest genomic organization in the world, with a focus on research and applications in healthcare, agriculture, conservation, and bio-energy. Our goal is to make leading-edge genomics highly accessible to the global research community by leveraging the industry’s best technology, economies of scale and expert bioinformatics resources. BGI Americas was established as an interface with customers and collaborators in North and South America.
  • Our facilities feature Sanger and next-generation sequencing technologies, providing the highest-throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb of raw data per day, supported by several supercomputing centers with a total peak performance of up to 102 Tflops, 20 TB of memory, and 10 PB of storage. We provide stable and efficient resources to store and analyze the massive amounts of data generated by next-generation sequencing.
  • Our facilities feature Sanger and next-generation sequencing technologies, providing the highest-throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb of raw data per day, supported by several supercomputing centers with a total peak performance of up to 102 Tflops, 20 TB of memory, and 15 PB of storage. We provide stable and efficient resources to store and analyze the massive amounts of data generated by next-generation sequencing. The LHC of Biology?
  • Raw data has been submitted to the SRA, the assembly to GenBank (no number), and SV data to dbVar (it’s the first plant data they’ve received). This complements the traditional public databases by having all these “extra” data types in one place, and it’s citable.
  • Transcript of "Scott Edmunds at DataCite 2012: Adventures in Data Citation"

    1. www.gigasciencejournal.com
    2. Overview: Genomics #101; Data-Sharing Issues; Introduction; How it’s working…; Adventures in Data Citation; Downstream consequences…; Our Examples; My two RMB/what is still needed…
    3. A brief history of genomics… Human Genome Project: 1990-2003. 1 Genome = $3 Billion. Source: http://www.genome.gov/Images/press_photos/highres/38-300.jpg
    4. A brief history of genomics… Source: http://www.genome.gov/sequencingcosts/ (with apologies)
    5. A brief history of genomics… 1st Gen; 2nd (next) Gen; 3rd (next-next) Gen? Source: http://www.genome.gov/sequencingcosts/ (with apologies)
    6. A brief history of genomics… 3rd (next-next) Gen? Source: http://www.genome.gov/sequencingcosts/ (with apologies)
    7. BGI Introduction
       • Formerly known as Beijing Genomics Institute
       • Founded in 1999 (1% of HGP)
       • Not-for-profit research institute funded by commercial sequencing-as-a-service
       • Now the largest genomic organization in the world
       • Goal
         – Use genomics technology to impact society
         – Make leading-edge genomics highly accessible to the global research community
    8. Global, with HQ in Shenzhen
    9. Global, with HQ in Shenzhen
    10. Global Sequencing Capacity
        Data Production: 5.6 Tb / day; > 1500X of human genome / day
        Multiple Supercomputing Centers: 157 TFlops; 20 TB Memory; 14.7 PB Storage
    11. BGI Sequencing Capacity
        Sequencers: 137 Illumina/HiSeq 2000; 27 LifeTech/SOLiD 4; 1 454 GS FLX+; 1372 Illumina iScan; 1 Illumina MiSeq; 1 Ion Torrent
        Data Production: 5.6 Tb / day; > 1500X of human genome / day
        Multiple Supercomputing Centers: 157 TFlops; 20 TB Memory; 14.7 PB Storage
    12. Goal – “Just sequence it.” M+M+M: Million Genome Projects
        • Plant and Animal Genomes: G10K, i5K...
        • Variation Genomes: 10K rice resequencing....
        • Human Genomes: Ancient, Population, Medical
        • Cell Genomes: cancer single cell
        • Micro Ecosystems: Metahit, EMP, etc.
        • Personal Genomes
    13. BGI Goes Denmark
    14. BGI Goes Denmark
    15. Genomics: the data-sharing success story?
    16. Sharing/reproducibility helped by stability of: platforms; repositories; standards (figure labels: 1st Gen, 2nd Gen)
    17. Genomics Data Sharing Policies…
        Bermuda Accords 1996/1997/1998:
        1. Automatic release of sequence assemblies within 24 hours.
        2. Immediate publication of finished annotated sequences.
        3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
        Fort Lauderdale Agreement, 2003:
        1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production.
        2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.
        Toronto International data release workshop, 2009:
        The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets, whether from proteomics, biobanking or metabolite research.
    18. Challenges for the future… (A) Cumulative base pairs in INSDC over time, excluding the Trace Archive. (B) Base pairs in INSDC, broken down into selected data components. Published by Oxford University Press 2011. Karsch-Mizrachi I et al. Nucl. Acids Res. 2012;40:D33-D37
    19. Challenges for the future…
        1. Data Volumes (transfer, backlogs, funding issues)
        2. Compliance
        3. Lack of interoperability/sufficient metadata
        4. Long tail of curation (“Democratization” of “big-data”)
    20. New incentives/credit
        Credit where credit is overdue:
        “One option would be to provide researchers who release data to public repositories with a means of accreditation.”
        “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
        Nature Biotechnology 27, 579 (2009)
        Prepublication data sharing (Toronto International Data Release Workshop):
        “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
        Nature 461, 168-170 (2009)
    21. New incentives/credit = Data Citation?
        “increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
        “data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
    22. First issue next month… Large-Scale Data Journal/Database. In conjunction with: Editor-in-Chief: Laurie Goodman, PhD; Editor: Scott Edmunds, PhD; Assistant Editor: Alexandra Basford, PhD; Lead Curator: Tam Sneddon, D.Phil. www.gigasciencejournal.com
    23. Associated Database: www.gigaDB.org
    24. Papers in the era of big-data. Goal: Executable Research Objects; Citable DOI
    25. Adventures in Data Citation. doi:10.5524/100001
    26. For data citation to work, needs:
        1. Proven utility/potential user base.
        2. Acceptance/inclusion by journals.
        3. Data+Citation: inclusion in the references.
        4. Tracking by citation indexes.
        5. Usage of the metrics by the community…
    27. Data citation 1: utility/user base. Establishment of data DOIs and use by databases:
        Shackleton NJ, Hall MA, Vincent E (2001): Mean stable carbon isotope ratios of Cibicidoides wuellerstorfi from sediment core MD95-2042 on the Iberian margin, North Atlantic. PANGAEA - Data Publisher for Earth & Environmental Science. http://doi.pangaea.de/10.1594/PANGAEA.58229
        Cited in: Pahnke K, Zahn R: Southern Hemisphere Water Mass Conversion Linked with North Atlantic Climate Variability. Science 2005, 307:1741-1746.
        Nocek B, Xu X, Savchenko A, Edwards A, Joachimiak A. 2007. PDB ID: 2P06 Crystal structure of a predicted coding region AF_0060 from Archaeoglobus fulgidus DSM 4304. 10.2210/pdb2p06/pdb.
        Cited in: Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008, 36:D419-425.
    28. BGI Datasets Get DOI®s. Many released pre-publication… doi:10.5524/100004
        Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Schistosoma; Silkworm
        Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum
        Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Mini-Pig; Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope
        Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Cancer (14TB); Ancient DNA (Saqqaq Eskimo, Aboriginal Australian)
        Microbe: E. coli O104:H4 TY-2482
        Cell-Line: Chinese Hamster Ovary
    29. Our first DOI:
        To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
        Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
        To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
    30. Downstream consequences:
        1. Therapeutics (primers, antimicrobials)
        2. Platform Comparisons (Loman et al., Nature Biotech 2012)
        3. Speed/legal-freedom
        “Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
    31. Data Citation 2: acceptance by journals
    32. Data Citation 2: acceptance by journals
    33. Data+Citation 3: inclusion in the references
    34. • Data submitted to NCBI databases:
          - Raw data: SRA:SRA046843
          - Assemblies of 3 strains: Genbank:AHAO00000000-AHAQ00000000
          - SNPs: dbSNP:1056306
          - CNVs, InDels, SV: dbVAR:nstd63
        • Submission to public databases complemented by its citable form in GigaDB (doi:10.5524/100012).
    35. In the references…
    36. Is the DOI…
    37. And now in Nature Biotech…
    38. And in more journals…
        Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma vilgalysii, a new basidiolichen from the New World. Dryad Digital Repository. doi:10.5061/dryad.j1g5dh23
        Cited in: Hodkinson BP, Uehling JK, Smith ME: Lepidostroma vilgalysii, a new basidiolichen from the New World. Mycological Progress 2012. Advance Online Publication.
        Roberts SB (2012) Herring Hepatic Transcriptome 34300 contigs.fa. Figshare. Available: hdl.handle.net/10779/084d34370fbda29bbc67b3c5ecb02575. Accessed 2012 Jan 20.
        Cited in: Roberts SB, Hauser L, Seeb LW, Seeb JE (2012) Development of Genomic Resources for Pacific Herring through Targeted Transcriptome Pyrosequencing. PLoS ONE 7(2): e30908. doi:10.1371/journal.pone.0030908
    39. For data citation to work, needs:
        1. Proven utility/potential user base. ✔
        2. Acceptance/inclusion by journals. ✔
        3. Data+Citation: inclusion in the references. ✔
        4. Tracking by citation indexes.
        5. Usage of the metrics by the community…
    40. Data citation 4: tracking?
    41. Data citation 4: tracking? ✗ FAIL
        DataCite metadata in harvestable form (OAI-PMH)
        - lists some DataCite DOIs, but says: Datasets listed are the “result of approximations in the indexing algorithms.”
        “Google Scholar’s intended coverage is for scholarly articles. At this point, we don’t include datasets.”
    42. Data citation 4: tracking? ✗ FAIL
        DataCite metadata in harvestable form (OAI-PMH) ✗
        Working on it. Coming soon? …the final challenge?
    43. Data citation 5: metrics? “As a result of diverse practices and tool limitations, data citations are currently very difficult to track.”
    44. Data citation 5: metrics? ✗ FAIL
        Research Remix, 29th May 2012: http://researchremix.wordpress.com/2012/05/29/dear-research-data-advocate-please-sign-the-petition-oamonday/
        “I’m afraid we are making promises to data creators about attribution and reward that we can’t keep. ‘Make your data citeable!’ is the cry. Ok. So citeable is step one. Cited is step two. But for the citation to be useful, it has to be indexed so that citation metrics can be tracked and admired and used. Who is indexing data citations right now? As far as I can tell: absolutely no one.”
    45. Where data citation is in 2012:
        1. Proven utility/potential user base. ✔
        2. Acceptance/inclusion by journals. ✔
        3. Data+Citation: inclusion in the references. ✔
        4. Tracking by citation indexes. ✗
        5. Usage of the metrics by the community… ✗
    46. Minor quibbles: export to citation managers
        DCC/DataCite recommended format:
        Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C; (2011): Genome data from sweet and grain sorghum (Sorghum bicolor); GigaScience. http://dx.doi.org/10.5524/100012
        formatting:
        Zheng, L-Y (2011). Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. Retrieved from http://dx.doi.org/10.5524/100012
        Mendeley formatting:
        Zheng L-Y; Guo X-S; He B; Sun L-J; Peng Y; Dong S-S; Liu T-F; Jiang S; Ramachandran S; Liu C-M; Jing H-C: Genome data from sweet and grain sorghum (Sorghum bicolor). 2011.
    47. Minor quibbles: clearer guidelines. Rules for versioning/where do you set granularity?
        Experiment (e.g. ACRG project): e.g. doi:10.5524/100001
        Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2
        Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
        Smaller still? Facts/Assertions (~10^13 in literature)
        (side labels: Papers; Data/Micropubs; Nanopubs)
    48. Papers in the era of big-data. Goal: Executable Research Objects
        July 2012: Wilson GA, Dhami P, Feber A, Cortázar D, Suzuki Y, Schulz R, Schär P, Beck S: Resources for methylome analysis suitable for gene knockout studies of potential epigenome modifiers. GigaScience 2012, 1:3. (in press)
        GigaDB hosting all data + tools (84GB total): doi:10.5524/100035
        + Partial (~80%) integration of workflow into our data platform (all the data processing steps, but not the enrichment analysis).
        Data in ISA-Tab compliant format.
        Next stage… Papers fully integrating all data + all workflows in our platform.
    49. Do you have interesting large-scale biological data sets? Submit to:
        • Rapid review/Open Access/High-visibility
        • Article Processing Charge covered by BGI
        • Hosting of any test datasets/workflows in GigaDB
        Interested in Reproducible Research? Take part in our session on: “Cloud and workflows for reproducible bioinformatics”
    50. Thanks to: Laurie Goodman; Alexandra Basford; Tam Sneddon; Shaoguang Liang; Tin-Lap Lee (CUHK); Qiong Luo (HKUST)
        Contact us: scott@gigasciencejournal.com; editorial@gigasciencejournal.com
        Follow us: @gigascience; facebook.com/GigaScience; blogs.openaccesscentral.com/blogs/gigablog/
        www.gigasciencejournal.com
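
A quick check of the capacity claim on slides 10 and 11 ("5.6 Tb / day", "> 1500X of human genome / day"): the sketch below divides the daily output by a genome size. It assumes "Tb" means terabases and uses a haploid human genome of roughly 3.1 Gb; neither is stated in the slides, so treat the result as a rough consistency check only.

```python
# Rough sanity check of the "> 1500X of human genome / day" figure.
# Assumptions (not from the slides): "Tb" means terabases, and one
# haploid human genome is about 3.1 gigabases.

DAILY_OUTPUT_BASES = 5.6e12   # 5.6 Tb per day, as quoted on the slides
HUMAN_GENOME_BASES = 3.1e9    # assumed genome size


def genome_equivalents_per_day(daily_bases: float, genome_bases: float) -> float:
    """Return how many genome-sized units of raw sequence are produced per day."""
    return daily_bases / genome_bases


fold = genome_equivalents_per_day(DAILY_OUTPUT_BASES, HUMAN_GENOME_BASES)
print(f"about {fold:.0f}x of a human genome per day")
```

Under these assumptions the ratio comes out at roughly 1,800-fold, which is consistent with the slide's "> 1500X".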
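
Slide 46 shows citation managers dropping authors or punctuation when importing a dataset citation. As a minimal sketch of what a lossless export could look like, the code below builds the string in the layout the slide attributes to DCC/DataCite and a BibTeX `@misc` entry that keeps every creator. The function names, the BibTeX key, and the choice of `@misc` are illustrative assumptions, not part of any tool named in the talk.

```python
from typing import List

# Creator list and metadata copied from slide 46 (sorghum dataset).
CREATORS = [
    "Zheng, L-Y", "Guo, X-S", "He, B", "Sun, L-J", "Peng, Y", "Dong, S-S",
    "Liu, T-F", "Jiang, S", "Ramachandran, S", "Liu, C-M", "Jing, H-C",
]
TITLE = "Genome data from sweet and grain sorghum (Sorghum bicolor)"
PUBLISHER = "GigaScience"
YEAR = 2011
DOI = "10.5524/100012"


def format_data_citation(creators: List[str], year: int, title: str,
                         publisher: str, doi: str) -> str:
    """Creators (Year): Title. Publisher. DOI-URL, keeping every creator."""
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {publisher}. http://dx.doi.org/{doi}"


def to_bibtex(key: str, creators: List[str], year: int, title: str,
              publisher: str, doi: str) -> str:
    """Build a BibTeX @misc entry; the key is a hypothetical label."""
    authors = " and ".join(creators)  # BibTeX separates authors with "and"
    lines = [
        f"@misc{{{key},",
        f"  author = {{{authors}}},",
        f"  title = {{{title}}},",
        f"  year = {{{year}}},",
        f"  publisher = {{{publisher}}},",
        f"  doi = {{{doi}}},",
        "}",
    ]
    return "\n".join(lines)


citation = format_data_citation(CREATORS, YEAR, TITLE, PUBLISHER, DOI)
bibtex = to_bibtex("zheng2011sorghum", CREATORS, YEAR, TITLE, PUBLISHER, DOI)
print(citation)
print(bibtex)
```

Keeping the creators as a list until the last step is the design point: each output format applies its own separator, so no author is lost in conversion.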
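
Slides 41 and 42 mention DataCite metadata being available in harvestable form over OAI-PMH. The sketch below shows, without making any network call, how a harvester might build a `ListRecords` request and pull identifiers out of a Dublin Core response. The base URL and the sample XML are assumptions made for illustration; only the `verb` and `metadataPrefix` parameters and the two XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; check the provider's documentation.
BASE_URL = "https://oai.datacite.org/oai"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL (no request is sent)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_identifiers(xml_text: str) -> list:
    """Collect the text of every dc:identifier element in a response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{DC_NS}identifier")]


# Hand-written sample response, not real harvested data.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:identifier>doi:10.5524/100001</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = build_list_records_url(BASE_URL)
identifiers = extract_identifiers(SAMPLE_RESPONSE)
print(url)
print(identifiers)
```

A citation index would still need to match these identifiers against reference lists, which is the tracking step the slides mark as failing in 2012.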
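
Slide 47 asks where DOI granularity should be set, with example identifiers such as doi:10.5524/100001, doi:10.5524/100001-2 and doi:10.5524/100001_xyz. The following sketch parses identifiers of that shape into a base record and an optional sub-identifier. The suffix grammar is inferred only from the slide's examples and is a hypothetical scheme, not a documented GigaDB or DataCite rule; note that a numeric suffix alone cannot say whether it names a dataset or a sample, which is the ambiguity the slide raises.

```python
import re
from typing import NamedTuple, Optional


class ParsedDoi(NamedTuple):
    prefix: str            # registrant prefix, e.g. "10.5524"
    base: str              # top-level record, e.g. "100001"
    suffix: Optional[str]  # sub-identifier, if any
    suffix_kind: str       # "none", "numeric", or "named"


# Hypothetical grammar: <prefix>/<digits>, then optionally -<digits> or _<name>.
_DOI_PATTERN = re.compile(r"^(?:doi:)?(10\.\d+)/(\d+)(?:-(\d+)|_(\w+))?$")


def parse_doi(text: str) -> ParsedDoi:
    """Split a DOI of the shape shown on slide 47 into base and sub-identifier."""
    match = _DOI_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised DOI shape: {text!r}")
    prefix, base, numeric, named = match.groups()
    if numeric is not None:
        return ParsedDoi(prefix, base, numeric, "numeric")
    if named is not None:
        return ParsedDoi(prefix, base, named, "named")
    return ParsedDoi(prefix, base, None, "none")


examples = [
    "doi:10.5524/100001",
    "doi:10.5524/100001-2",
    "doi:10.5524/100001-2000",
    "doi:10.5524/100001_xyz",
]
parsed = [parse_doi(e) for e in examples]
for item in parsed:
    print(item)
```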