Bio-IT World Asia Meeting, 7th June 2012. Scott Edmunds: Data dissemination in the era of “big data”. William Gibs...
Scott Edmunds: Data Dissemination in the era of "Big-Data"
Scott Edmunds's talk at the Bio-IT World Asia meeting in Singapore on data dissemination in the era of "Big-Data", 7th June 2012.

  • Our facilities feature Sanger and next-generation sequencing technologies, providing the highest-throughput sequencing capacity in the world. Powered by 137 Illumina HiSeq 2000 instruments and 27 Applied Biosystems SOLiD™ 4 Systems, we provide high-quality sequencing results with industry-leading turnaround time. As of December 2010, our sequencing capacity is 5 Tb of raw data per day, supported by several supercomputing centers with a total peak performance of up to 102 Tflops, 20 TB of memory, and 10 PB of storage. We provide stable and efficient resources to store and analyze the massive amounts of data generated by next-generation sequencing.
  • Helps reproducibility, but there is some debate over whether it helps much with scaling.
  • Raw data has been submitted to the SRA, the assembly to GenBank (no number), and SV data to dbVar (it’s the first plant data they’ve received). This complements the traditional public databases by having all these “extra” data types; it’s all in one place, and it’s citable.
  • Scott Edmunds: Data Dissemination in the era of "Big-Data"

    1. Bio-IT World Asia Meeting, 7th June 2012. Scott Edmunds: Data dissemination in the era of “big data”. William Gibson: “Information is the currency of the future world.” Sir Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves.” www.gigasciencejournal.com
    2. Is data “the new oil”? 1.2 zettabytes (10^21 bytes) of electronic data generated each year [1]. Data deluge? [1] Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
    3. Global Sequencing Capacity. Data production: 5.6 Tb/day (>1500X of human genome/day). Multiple supercomputing centers: 157 Tflops, 20 TB memory, 14.7 PB storage.
    4. BGI Sequencing Capacity. Sequencers: 137 Illumina HiSeq 2000; 27 LifeTech SOLiD 4; 1 454 GS FLX+; 1372 Illumina iScan; 1 Illumina MiSeq; 1 Ion Torrent. Data production: 5.6 Tb/day (>1500X of human genome/day). Multiple supercomputing centers: 157 Tflops, 20 TB memory, 14.7 PB storage.
    5. Now taking submissions… Large-Scale Data: Journal/Database/Platform. In conjunction with: Editor-in-Chief: Laurie Goodman, PhD; Editor: Scott Edmunds, PhD; Assistant Editor: Alexandra Basford, PhD; Lead BioCurator: Tam Sneddon, DPhil; Data Platform: Peter Li, PhD. www.gigasciencejournal.com
    6. Data-data everywhere?
    7. Data silos; interoperability; paywalls; metadata; $; ©
    8. There are many hurdles… ?
    9. There are many hurdles… Technical: too large volumes; too heterogeneous; no home for many data types; too time consuming. Cultural: inertia; no incentives to share; unaware of how. ?
    10. Technical challenges… Better handling of metadata… Novel tools/formats for data interoperability/handling. Cloud solutions?
    11. Technical challenges… Tools making work more easily reproducible… Interoperability/ease of use; workflows; data quality assessment.
    12. Technical challenges… More efficient handling of data… Cloud? Do we need to keep everything? Compression?
    13. Cultural challenges…
    14. Data re-use. Axes: Effort ($), Usability.
    15. Need to lower the hurdles… Axes: Effort ($), Usability.
    16. Better incentives? Axes: Effort ($), Usability.
    17. Incentives/credit. Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009). Prepublication data sharing (Toronto International Data Release Workshop): “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009).
    18. Data citation: DataCite and DOIs. Digital Object Identifiers (DOIs) offer a solution: most widely used identifier for scientific articles; researchers, authors, publishers know how to use them; put datasets on the same playing field as articles. Dataset: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840. Aims to: “increase acceptance of research data as legitimate, citable contributions to the scholarly record”; “data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
    19. Data citation: DataCite and DOIs. Central metadata repository: >1 million entries to date; stability; data discoverability; open & harvestable; potential to track & credit use.
    20. Data publishing/DOI. New journal format combines standard manuscript publication with an extensive database to host all associated data, and integrated tools. Data hosting will follow standard funding agency and community guidelines. DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking. www.gigasciencejournal.com
    21. Data publishing: www.gigaDB.org
    22. BGI Datasets Get DOI®s. Many released pre-publication… doi:10.5524/100004
        Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Schistosoma; Silkworm
        Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Cancer (14TB); Ancient DNA (Saqqaq Eskimo, Aboriginal Australian)
        Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum
        Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Mini-Pig; Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope
        Microbe: E. coli O104:H4 TY-2482
        Cell-Line: Chinese Hamster Ovary
    23. For data citation to work, it needs: proven utility/potential user base; acceptance/inclusion by journals; data+citation: inclusion in the references; tracking by citation indexes; usage of the metrics by the community…
    24. Data+Citation: inclusion in the references
    25. Data submitted to NCBI databases:
        Raw data: SRA: SRA046843
        Assemblies of 3 strains: GenBank: AHAO00000000-AHAQ00000000
        SNPs: dbSNP: 1056306
        CNVs, InDels, SV: dbVar: nstd63
        Submission to public databases complemented by its citable form in GigaDB (doi:10.5524/100012).
    26. In the references…
    27. Is the DOI…
    28. And now in Nature Biotech…
    29. Data citation: tracking? DataCite metadata in harvestable form (OAI-PMH). Plans in 2012 to link central metadata repository with WoS: will finally track and credit use! To be continued…
    30. Final step: open licensing
    31. Our first DOI: To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011). Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
    32. “The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.” “German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”
    33. Downstream consequences: (1) Therapeutics (primers, antimicrobials). (2) Platform comparisons (Loman et al., Nature Biotech 2012). (3) Speed/legal-freedom. “Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
    34. The era of the data consumer?
    35. The era of the data consumer??
    36. The era of the data consumer? Free access to data, but analysis hubs/nodes will form around it?
    37. GDSAP: Genomic Data Submission and Analytical Platform. Big data from the “Sequencing Oil Field”. Data, Data, Data… Data modeling; pipeline design; validation; commercial applications; “Apps”. Tin-Lap Lee, CUHK.
    38. GDSAP: Genomic Data Submission and Analytical Platform
    39. GDSAP: Genomic Data Submission and Analytical Platform. Mirror/open platform.
    40. Papers in the era of big-data. $1000 genome = million $ peer-review? To review (>6 TBp, >1500 datasets): S3 = $15,000; EC2 (BLASTx) = $500,000. Source: Folker Meyer/Wilkening et al. 2009, CLUSTER09. IEEE International Conference on Cluster Computing and Workshops.
    41. Papers in the era of big-data. Goal: Executable Research Objects. Citable DOI.
    42. Papers in the era of big-data. Goal: Executable Research Objects. Stage 1: Wilson GA, Dhami P, Feber A, Cortázar D, Suzuki Y, Schulz R, Schär P, Beck S: Resources for methylome analysis suitable for gene knockout studies of potential epigenome modifiers. GigaScience 2012, 1:3. (in press). GigaDB hosting all data + tools (84GB total): doi:10.5524/100035, plus partial (~80%) integration of workflow into our data platform (all the data processing steps, but not the enrichment analysis). Stage 2: Papers fully integrating all data + all workflows in our platform.
    43. Papers in the era of big-data. Interested in Reproducible Research? Take part in our session on: “Cloud and workflows for reproducible bioinformatics”. Submit to: rapid review/Open Access/high-visibility; Article Processing Charge covered by BGI; hosting of any test datasets/workflows in GigaDB.
    44. Thanks to: Laurie Goodman, Alexandra Basford, Tam Sneddon, Peter Li, Tin-Lap Lee (CUHK), Qiong Luo (HKUST). Contact us: scott@gigasciencejournal.com, editorial@gigasciencejournal.com. Follow us: @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/. www.gigasciencejournal.com
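
Slides 3 and 4 equate 5.6 Tb/day with more than 1500X of a human genome per day. The sketch below is a rough sanity check of that ratio, not a figure from the talk. It assumes "Tb" means terabases and uses a haploid human genome size of about 3.2 gigabases; both are assumptions, since the slides do not state either.

```python
# Rough check of the ">1500X of human genome / day" figure.
# ASSUMPTION: a haploid human genome is ~3.2e9 bases (not stated in the slides).
HUMAN_GENOME_BASES = 3.2e9


def genome_equivalents_per_day(terabases_per_day: float,
                               genome_bases: float = HUMAN_GENOME_BASES) -> float:
    """Return how many genome-sized units of sequence are produced per day."""
    return terabases_per_day * 1e12 / genome_bases


# 5.6 Tb/day is the slide figure; 5 Tb/day is the December 2010 figure in the notes.
slide_figure = genome_equivalents_per_day(5.6)
notes_figure = genome_equivalents_per_day(5.0)

print(f"5.6 Tb/day is about {slide_figure:.0f}X of a human genome per day")
print(f"5.0 Tb/day is about {notes_figure:.0f}X of a human genome per day")
```

Under these assumptions both the slide figure and the older figure in the notes come out above 1500X, so the two numbers on the slide are consistent with each other.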
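
Slides 18 and 31 show dataset citations in the pattern "creators (year). Title. Publisher. doi:…", together with a resolver link. A minimal sketch of how such a citation string and link could be assembled is given below. The `DatasetRecord` class and both helper functions are hypothetical names introduced for illustration, the creator list is abbreviated here with "et al.", and the `http://dx.doi.org/` prefix is the one shown on slide 31.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical container for the fields seen in the slide citations."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str


def resolver_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Build a resolver link; the default prefix is the one used on slide 31."""
    return resolver + doi


def format_citation(record: DatasetRecord) -> str:
    """Format a record as 'Creators (Year). Title. Publisher. doi:...'."""
    return (f"{record.creators} ({record.year}). {record.title}. "
            f"{record.publisher}. doi:{record.doi}")


# Values taken from slide 31; the author list is abbreviated for this example.
ecoli = DatasetRecord(
    creators="Li, D; Xi, F; Zhao, M; et al.",
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)

print(format_citation(ecoli))
print(resolver_url(ecoli.doi))
```

Keeping the DOI as a bare identifier and adding the resolver prefix only at display time means the same record can be cited in a reference list and linked on a web page.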
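
Slide 29 mentions that DataCite metadata is available in harvestable form via OAI-PMH. As a rough illustration of what harvesting involves, the sketch below builds a standard OAI-PMH `ListRecords` request URL and pulls record identifiers out of a response. The base URL `https://oai.example.org/oai` is a placeholder and the sample XML is a made-up minimal response for demonstration, not real harvested data; no network request is made.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional
from urllib.parse import urlencode

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def record_identifiers(xml_text: str) -> List[str]:
    """Extract the header identifier of every record in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]


# ASSUMPTION: placeholder endpoint and an invented minimal response.
BASE_URL = "https://oai.example.org/oai"
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:10.5524/100001</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL))
print(record_identifiers(SAMPLE_RESPONSE))
```

A real harvester would also follow the protocol's resumption tokens to page through large result sets; that step is left out here to keep the sketch short.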