Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Scott Edmunds' talk at the 7th International Conference on Genomics: "Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era". ICG7, Hong Kong, 1st December 2012

Published in: Technology
  • Leading on from that, current and future plans include collaborating with Tin-Lap Lee at the Chinese University of Hong Kong to integrate an instance of the Galaxy bioinformatics platform with GigaDB, so that users can make full use of the data in GigaDB by linking it to other resources, and so that we can incorporate fully executable papers. One such submission is a new SOAPdenovo pipeline. The SOAP tools have been wrapped in Galaxy, the workflow is defined in MyExperiment, and the data will be issued with a DOI and made accessible via GigaDB. Using the BGI cloud if necessary, users will then be able to reproduce all the steps described in the GigaScience paper to test, reanalyze and compare results. Since we would like GigaDB to host data types that have no other home, such as imaging data, we are investigating adding other tools, such as an image viewer, to support the accessibility and usability of the data. So, if you have a large-scale biological or biomedical dataset, or a pipeline or software that you would like to submit to GigaScience, we would love to hear from you, so please come and talk to Scott or myself.
  • That just leaves me to thank the GigaScience team (Laurie, Scott, Alexandra, Peter and Jesse); BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools (Tin-Lap, Qiong, Senhong, Yan and the Cogini web design team); DataCite for providing the DOI service; and the ISA Commons team for their support and advocacy of best practice in metadata reporting and sharing. Thank you for listening.

    1. Scott Edmunds, GigaScience/BGI Hong Kong. ICG7, Hong Kong, 1st December 2012. www.gigasciencejournal.com
    2. The challenges integrating papers + data:
        Technical issues:
        • Data volumes (1.2 zettabytes generated globally each year)
        • Exponential growth of genomics data
        • Technical challenges (VMs/cloud, compression)
        Cultural issues:
        • Lack of incentives (Data DOIs)
        • Data licensing (CC-BY, CC0)
        • Journal/funder policies
        Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
    3. The challenges integrating papers + data:
        Technical issues:
        • Data volumes (1.2 zettabytes generated globally each year)
        • Exponential growth of genomics data
        • Technical challenges (VMs/cloud, compression)
        Cultural issues:
        • Lack of incentives (Data DOIs)
        • Data licensing (CC-BY, CC0)
        • Journal/funder policies
        Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
        * T-Shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham
    4. Why is this important?
        • Transparency
        • Reproducibility
        • Re-use
        “Faked research is endemic in China”
        Source: New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html
    5. Why is this important?
        “Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.”
        “There have been widespread complaints from scientists inside and outside China about this lack of transparency.”
        “Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”
        Source: Nature 475, 267 (2011) http://www.nature.com/news/2011/110720/full/475267a.html?
    6. Global Issue: increasing number of retractions
        • >15X increase in last decade
        • Strong correlation of “retraction index” with higher impact factor
        1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
        2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
    7. Global Issue: unrepeatability of scientific results
        Out of 18 microarray papers, results from 10 could not be reproduced.
        Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155.
    8. Sharing aids authors…
        “Sharing Detailed Research Data Is Associated with Increased Citation Rate.” Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
        Every 10 datasets collected contributes to at least 4 papers in the following 3 years. Piwowar HA, Vision TJ, & Whitlock MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285-285. DOI: 10.1038/473285a
    9. Rice v Wheat: consequences of publicly available genome data. [Chart comparing rice and wheat; axis scale 0 to 700]
    10. Our first DOI:
        To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
        Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001
        To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
    11. Downstream consequences:
        1. Citations (~100)
        2. Therapeutics (primers, antimicrobials)
        3. Platform Comparisons
        4. Example for faster & more open science
        “Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
    12. 1.3 The power of intelligently open data
        The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
    13. Not just (data) quantity, but quality
        1. Lack of sufficient metadata
        2. Lack of interoperability
        3. Long tail of curation (“Democratization” of “Big-Data”)
    14. Not just (data) quantity, but quality
        • Better handling of metadata…
        • Novel tools/formats for data interoperability/handling.
        • Cloud solutions?
    15. Not just (data) quantity, but quality
        Tools making work more easily reproducible…
        • Interoperability/Ease of use
        • Workflows
        • Data quality assessment
    16. Large-Scale Data Journal/Database
        In conjunction with:
        Editor-in-Chief: Laurie Goodman, PhD
        Editor: Scott Edmunds, PhD
        Commissioning Editor: Nicole Nogoy, PhD
        Lead Curator: Tam Sneddon, D.Phil
        Data Platform: Peter Li, PhD
        www.gigasciencejournal.com
    17. Addressing the reproducibility gap: Computable methods/workflow systems
        Bioinformatics development; Biomedical and bioinformatics research; Publishing
    18. Redefining what is a paper in the era of big-data?
        Goal: Executable Research Objects; Citable DOI
    19. Integrating workflows into papers…
    20. Anatomy of a Publication: Idea; Study; Metadata; Data; Analysis; Answer
    21. Anatomy of a Data Publication: Idea; Study; Metadata; Data; Analysis; Answer
    22. Publication
        • Background
        • Methods
        • Results (Data)
        • Conclusions/Discussion doi:10.1186/2047-217X-1-3
    23. Data Publication
        • Background
        • Methods
        • Results (Data) doi:10.5524/100035
        • Conclusions/Discussion doi:10.1186/2047-217X-1-3
    24. Methods + Data + Publication
        • Background
        • Methods: DOI for workflows?
        • Results (Data) doi:10.5524/100035
        • Conclusions/Discussion doi:10.1186/2047-217X-1-3
    25. Data + Methods = Analysis
        doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
        DOI: A + DOI: X = DOI: 1
    26. Data + Methods = Analysis
        doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
        DOI: A + DOI: X = DOI: 1
        DOI: B + DOI: X = DOI: 2
    27. Data + Methods = Analysis
        doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
        DOI: A + DOI: X = DOI: 1
        DOI: B + DOI: X = DOI: 2
        DOI: A + DOI: Y = DOI: 3
    28. Data + Methods = Analysis
        doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
        DOI: A + DOI: X = DOI: 1
        DOI: B + DOI: X = DOI: 2
        DOI: A + DOI: Y = DOI: 3
        A, B, C… + X, Y, Z… = 4, 5, 6…
    29. Different shaped publishable objects
        • Data Papers
        • Executable (Methods) Papers
        • Analysis Papers
    30. Different shaped publishable objects: different levels of granularity
        • Papers: Experiment (e.g. ACRG project), e.g. doi:10.5524/100001
        • Data/Micropubs: Datasets (e.g. cancer type), e.g. doi:10.5524/100001-2
        • Sample (e.g. specimen xyz), e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
        • Nanopubs: Smaller still? Facts/Assertions (~10^14 in literature)
    31. Adding “value” publishing data
        • Scope for different shaped publishable objects
        • Scope for publishing methods/executable papers
        • Peer review of data problematic
            - Post publication peer review
            - Change criteria (assess on transparency/access only)
            - Better use of workflows/cloud/VMs
        DOIs are cheap*, data is precious: maximise its use
        * ish
    32. • Transparency • Reproducibility • Re-use } = Credit
    33. Thanks to:
        Laurie Goodman, Tam Sneddon, Nicole Nogoy, Alexandra Basford, Peter Li, Jesse Si Zhe
        Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Cogini
        Contact us: editorial@gigasciencejournal.com, database@gigasciencejournal.com
        Follow us: @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/
        www.gigadb.org
        www.gigasciencejournal.com
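
The "Data + Methods = Analysis" slides (25 to 28) suggest that each pairing of a citable dataset with a citable method could itself become a citable analysis. As a rough illustration only, the Python sketch below enumerates such pairings. The function name `pair_data_with_methods` is hypothetical, and the identifiers "DOI:A", "DOI:B", "DOI:X" and "DOI:Y" are the placeholders from the slides, not real DOIs. It does not assign analysis identifiers, because the talk does not describe how they would be minted.

```python
from itertools import product


def pair_data_with_methods(data_ids, method_ids):
    """Return every (dataset, method) pairing as a list of tuples.

    Each tuple stands for one candidate analysis object. This is an
    illustrative assumption, not a GigaDB or DataCite API.
    """
    return list(product(data_ids, method_ids))


# Placeholder identifiers taken from the slides.
datasets = ["DOI:A", "DOI:B"]
methods = ["DOI:X", "DOI:Y"]

for data_id, method_id in pair_data_with_methods(datasets, methods):
    print(f"{data_id} + {method_id} = one citable analysis")
```

Two datasets and two methods already give four possible analyses, which is the combinatorial growth that slide 28 points to with "A, B, C… X, Y, Z…".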
