Scott Edmunds: Data publication in the data deluge
Scott Edmunds's talk at COASP 2012 in Budapest, "Data publication in the data deluge", September 20th 2012

  • slide 21 must be one of your proudest achievements... and rightly so. I'm referring to the CC0/PD license
  • Youtube version here: http://www.youtube.com/watch?v=O_oufmkyZNY&feature=share&list=FLP6zzMI19PWqlrH3Vl6FNcw
  • and an advanced search option…
  • Raw data has been submitted to the SRA, the assembly submitted to GenBank (no number), SV data to dbVar (it’s the first plant data they’ve received). Complements the traditional public databases by having all these “extra” data types, it’s all in one place, and it’s citable.
  • Leading on from that, current and future plans include collaborating with Tin-Lap Lee at the Chinese University of Hong Kong to integrate an instance of the Galaxy bioinformatics platform with GigaDB so users can make full use of the data in GigaDB by linking it to other resources and we can incorporate fully executable papers. One such submission is a new SOAPdenovo pipeline. The SOAP tools have been wrapped in Galaxy, the workflow defined in MyExperiment and the data will be issued with a DOI and accessible via GigaDB. Utilizing the BGI cloud if necessary, users will then be able to reproduce all the steps described in the GigaScience paper to test, reanalyze, compare results etc. Since we would like GigaDB to be a host for data types that have no other home, such as imaging data, we are investigating adding other tools such as an image viewer and the like to support accessibility to and usability of the data. So, if you have a large-scale biological or biomedical dataset and/or a pipeline or software that you would like to submit to GigaScience we would love to hear from you, so please come and talk to Scott or myself.
  • That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse; BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan and the Cogini web design team; DataCite for providing the DOI service; and the isacommons team for their support and advocacy for best practice use of metadata reporting and sharing. Thank you for listening.

Presentation Transcript

  • Data publication in the data deluge. Scott Edmunds, GigaScience/BGI Hong Kong. COASP 2012, Budapest, 20th September 2012. www.gigasciencejournal.com
  • The Data Challenge: 1.2 zettabytes (10^21 bytes) of electronic data generated globally each year. Exponential growth of genomics data (and growth in imaging and MS data following); source: http://www.genome.gov/sequencingcosts/ (with apologies). Issues with reproducibility, hosting, curation, interoperability. Need for better incentives to overcome these. Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
  • Large-Scale Data Journal/Database. In conjunction with: Editor-in-Chief: Laurie Goodman, PhD; Editor: Scott Edmunds, PhD; Commissioning Editor: Nicole Nogoy, PhD; Lead Curator: Tam Sneddon, D.Phil; Data Platform: Peter Li, PhD. www.gigasciencejournal.com
  • GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”… (see more)
  • GDSAP: Genomic Data Submission and Analytical platform
  • Anatomy of a Publication: Idea, Study, Metadata, Data, Analysis, Answer
  • Anatomy of a Data Publication: Idea, Study, Metadata, Data, Analysis, Answer
  • Issues for Data Publication (Idea, Study, Metadata, Data, Analysis, Answer). Cultural issues. Technical issues.
  • Issues for Data Publication (Idea, Study, Metadata, Data, Analysis, Answer). Cultural issues: adoption held back by journal policies, citation, tracking… * T-Shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham
  • Issues for Data Publication (Idea, Study, Metadata, Data, Analysis, Answer). Technical issues: What do we do with the data?
  • Issues for Data Publication (Idea, Study, Metadata, Data, Analysis, Answer). Technical issues: What do we do with the data? Lightweight: metadata-only journals; get someone else to host. Heavyweight: become a repository.
  • To host or not to host? Against: the supplementary files argument. The Journal of Neuroscience (figure: average size of a Journal of Neuroscience article and supplemental material in megabytes). Announcement Regarding Supplemental Material: Beginning November 1, 2010, The Journal of Neuroscience will no longer allow authors to include supplemental material when they submit new manuscripts and will no longer host supplemental material on its web site for those articles. “While the size of articles has grown gradually over the past decade, the supplemental material associated with a typical Journal article appears to be growing exponentially and is rapidly approaching the size of an article. The sheer volume of supplemental material is adversely affecting peer review.” Maunsell J. J. Neurosci. 2010;30:10599-10600
  • $1000 genome = million $ peer-review? To review (>6TBp, >1500 datasets): S3 (storage) = $15,000; EC2 (analysis w/ BLASTx) = $500,000. Source: Folker Meyer / Wilkening et al. 2009, CLUSTER09, IEEE International Conference on Cluster Computing and Workshops
  • $1000 genome = million $ peer-review? To review (>6TBp, >1500 datasets): S3 (storage) = $15,000; EC2 (analysis w/ BLASTx) = $500,000. Source: Folker Meyer / Wilkening et al. 2009, CLUSTER09, IEEE International Conference on Cluster Computing and Workshops. ENCODE analysis Virtual Machine, containing input data, code bundles with scripts and processing steps, and outputs: AWS = ~$5,000. Source: James Taylor / http://encodeproject.org/ENCODE/integrativeAnalysis/VM
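The cost comparison on this slide can be worked through in a few lines. This is only a sketch using the dollar figures quoted above; the variable and function names are illustrative and do not come from the talk.

```python
# Dollar figures as quoted on the slide; names are illustrative only.
S3_STORAGE_USD = 15_000          # S3 storage for >6TBp, >1500 datasets
EC2_BLASTX_USD = 500_000         # EC2 analysis with BLASTx
ENCODE_VM_AWS_USD = 5_000        # slide gives "~$5,000" for the ENCODE VM


def review_cost(storage_usd: int, analysis_usd: int) -> int:
    """Total cost of re-running a review from raw data: storage plus compute."""
    return storage_usd + analysis_usd


full_reanalysis_usd = review_cost(S3_STORAGE_USD, EC2_BLASTX_USD)
ratio = full_reanalysis_usd / ENCODE_VM_AWS_USD

print(f"Full re-analysis: ${full_reanalysis_usd:,}")
print(f"Pre-packaged VM:  ${ENCODE_VM_AWS_USD:,}")
print(f"Ratio:            {ratio:.0f}x")
```

On these numbers, re-running the analysis from scratch costs roughly a hundred times more than re-running a pre-packaged virtual machine, which is the contrast the slide draws.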
  • To host or not to host? For: reproducibility. The Guardian, 14th September 2012: Replication is the only solution to scientific fraud. http://www.guardian.co.uk/commentisfree/2012/sep/14/solution-scientific-fraud-replication For: “data is the new oil”. William Gibson: “Information is the currency of the future world”. Sir Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves”. Move compute to the data: think EC2 rather than S3. DNA Nexus + 0.5PB SRA data = $15 million given by Google. Source: DNA Nexus/SRA http://techcrunch.com/2011/10/12/dnanexus-raises-15-million-teams-with-google-to-host-massive-dna-database/
  • Overcoming cultural hurdles… ?
  • Overcoming cultural hurdles… Adventures in Data Citation doi:10.5524/100001
  • For data citation to work, needs: 1. Proven utility/potential user base. 2. Acceptance/inclusion by journals. 3. Data+Citation: inclusion in the references. 4. Tracking by citation indexes. 5. Usage of the metrics by the community…
  • Data citation 1: utility/user base. Establishment of data DOIs and use by databases: Shackleton NJ, Hall MA, Vincent E (2001): Mean stable carbon isotope ratios of Cibicidoides wuellerstorfi from sediment core MD95-2042 on the Iberian margin, North Atlantic. PANGAEA - Data Publisher for Earth & Environmental Science. http://doi.pangaea.de/10.1594/PANGAEA.58229 Cited in: Pahnke K, Zahn R: Southern Hemisphere Water Mass Conversion Linked with North Atlantic Climate Variability. Science 2005, 307:1741-1746. Nocek B, Xu X, Savchenko A, Edwards A, Joachimiak A. 2007. PDB ID: 2P06 Crystal structure of a predicted coding region AF_0060 from Archaeoglobus fulgidus DSM 4304. 10.2210/pdb2p06/pdb. Cited in: Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008, 36:D419-425.
  • BGI Datasets Get DOI®s (legend: Released pre-publication; Paper Published in GigaScience). Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant), Roundworm, Schistosoma, Silkworm. Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome), Cancer (14TB), Single cell bladder cancer, HBV infected exomes, Ancient DNA (Saqqaq Eskimo, Aboriginal Australian). Vertebrates: Darwin’s Finch, Giant panda, Macaque (Chinese rhesus, Crab-eating), Mini-Pig, Naked mole rat, Parrot (Puerto Rican), Penguin (Emperor penguin, Adelie penguin), Pigeon (domestic), Polar bear, Sheep, Tibetan antelope. Microbe: E. coli O104:H4 TY-2482, T2D gut metagenome. Cell-Lines: Chinese Hamster Ovary, Mouse methylomes. Plants: Chinese cabbage, Cucumber, Foxtail millet, Pigeonpea, Potato, Sorghum.
  • Our first DOI: To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
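The dataset citation above follows a "creators (year) title. publisher. identifier" pattern. As a rough sketch of how such a string could be assembled, assuming that pattern and the dx.doi.org resolver shown on the slide: the function names are hypothetical, and the creator list is shortened to three names purely for illustration.

```python
def doi_to_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Turn 'doi:10.5524/100001' or '10.5524/100001' into a resolver URL."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[4:]
    return resolver + bare


def format_data_citation(creators, year, title, publisher, doi):
    """Assemble 'Creators (Year) Title. Publisher. doi:... URL'."""
    names = "; ".join(creators)
    bare = doi_to_url(doi, resolver="")  # strip any 'doi:' prefix
    return f"{names} ({year}) {title}. {publisher}. doi:{bare} {doi_to_url(doi)}"


citation = format_data_citation(
    creators=["Li, D", "Xi, F", "Zhao, M"],  # shortened list, illustration only
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="doi:10.5524/100001",
)
print(citation)
```

Keeping the DOI both as a bare identifier and as a resolvable URL mirrors the slide, which prints both forms.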
  • 1.3 The power of intelligently open data. The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
  • Data Citation 2: acceptance by journals
  • Data+Citation 3: inclusion in the references
  • In the references…
  • Is the DOI… * Certain types of genomics data must also be deposited in INSDC databases (SRA & GenBank).
  • And in more journals… Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma vilgalysii, a new basidiolichen from the New World. Dryad Digital Repository. doi:10.5061/dryad.j1g5dh23 Cited in: Hodkinson BP, Uehling JK, Smith ME: Lepidostroma vilgalysii, a new basidiolichen from the New World. Mycological Progress 2012. Advance Online Publication. Roberts SB (2012) Herring Hepatic Transcriptome 34300 contigs.fa. Figshare. Available: hdl.handle.net/10779/084d34370fbda29bbc67b3c5ecb02575. Accessed 2012 Jan 20. Cited in: Roberts SB, Hauser L, Seeb LW, Seeb JE (2012) Development of Genomic Resources for Pacific Herring through Targeted Transcriptome Pyrosequencing. PLoS ONE 7(2): e30908. doi:10.1371/journal.pone.0030908
  • For data citation to work, needs: 1. Proven utility/potential user base. ✔ 2. Acceptance/inclusion by journals. ✔ 3. Data+Citation: inclusion in the references. ✔ 4. Tracking by citation indexes. 5. Usage of the metrics by the community…
  • Data citation 4: tracking?
  • Data citation 4: tracking? ✗ FAIL. DataCite metadata in harvestable form (OAI-PMH) - lists some DataCite DOIs, but says: Datasets listed are the “result of approximations in the indexing algorithms.” “Google Scholar’s intended coverage is for scholarly articles. At this point, we don’t include datasets.”
  • Data citation 4: tracking? ✗ FAIL. DataCite metadata in harvestable form (OAI-PMH). ✗ Working on it. Coming soon…
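"Harvestable form (OAI-PMH)" means an indexer can pull metadata records with plain HTTP requests. Below is a minimal sketch of building such a request, using only the standard OAI-PMH `ListRecords` verb and the `oai_dc` metadata prefix. The base URL is a placeholder, not DataCite's actual endpoint, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # optional: restrict to one set
    if from_date:
        params["from"] = from_date      # optional: selective harvesting by date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; substitute the repository's real OAI-PMH base URL.
url = list_records_url("https://oai.example.org/oai", from_date="2012-01-01")
print(url)
```

A citation index would fetch this URL, parse the XML response, and follow any resumption tokens to page through the full record set.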
  • Data citation 5: metrics? “As a result of diverse practices and tool limitations, data citations are currently very difficult to track.”
  • Data citation 5: metrics? ✗ FAIL. Research Remix, 29th May 2012: http://researchremix.wordpress.com/2012/05/29/dear-research-data-advocate-please-sign-the-petition-oamonday/ “I’m afraid we are making promises to data creators about attribution and reward that we can’t keep. “Make your data citeable!” is the cry. OK. So citeable is step one. Cited is step two. But for the citation to be useful, it has to be indexed so that citation metrics can be tracked and admired and used. Who is indexing data citations right now? As far as I can tell: absolutely no one.”
  • Where data citation is in 2012: 1. Proven utility/potential user base. ✔ 2. Acceptance/inclusion by journals. ✔ 3. Data+Citation: inclusion in the references. ✔ 4. Tracking by citation indexes. ✔/✗ 5. Usage of the metrics by the community… ✗
  • Overcoming technical hurdles… ?
  • Addressing the reproducibility gap: computable methods/workflow systems. Bioinformatics development; biomedical and bioinformatics research; publishing.
  • Redefining what is a paper in the era of big-data? Goal: Executable Research Objects; citable DOI
  • Publication: Background; Methods; Results (Data); Conclusions/Discussion. doi:10.1186/2047-217X-1-3
  • Data Publication: Background; Methods; Results (Data): doi:10.5524/100035; Conclusions/Discussion: doi:10.1186/2047-217X-1-3
  • Methods + Data + Publication: Background; Methods: DOI for workflows?; Results (Data): doi:10.5524/100035; Conclusions/Discussion: doi:10.1186/2047-217X-1-3
  • Data + Methods = Analysis: doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3. DOI: A + DOI: X = DOI: 1
  • Data + Methods = Analysis: doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3. DOI: A + DOI: X = DOI: 1; DOI: B + DOI: X = DOI: 2
  • Data + Methods = Analysis: doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3. DOI: A + DOI: X = DOI: 1; DOI: B + DOI: X = DOI: 2; DOI: A + DOI: Y = DOI: 3
  • Data + Methods = Analysis: doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3. DOI: A + DOI: X = DOI: 1; DOI: B + DOI: X = DOI: 2; DOI: A + DOI: Y = DOI: 3. A, B, C… X, Y, Z… = 4, 5, 6…
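The idea on these slides is that each pairing of a citable dataset with a citable method yields its own citable analysis. A small sketch of that bookkeeping follows, with placeholder identifiers ("DOI: A", "DOI: X", and so on) taken from the slide. The function name and the sequential numbering scheme are assumptions made for illustration.

```python
from itertools import product


def enumerate_analyses(datasets, methods):
    """Map every (dataset, method) pair to a placeholder analysis identifier.

    Methods vary slowest so the numbering matches the slide:
    A+X=1, B+X=2, A+Y=3, ...
    """
    pairs = product(methods, datasets)
    return {
        (data, method): f"DOI: {n}"
        for n, (method, data) in enumerate(pairs, start=1)
    }


analyses = enumerate_analyses(["DOI: A", "DOI: B"], ["DOI: X", "DOI: Y"])
for (data, method), analysis in analyses.items():
    print(f"{data} + {method} = {analysis}")
```

With m datasets and n methods this gives m × n distinct analysis objects, which is why the slide's list continues "= 4, 5, 6…" as more data and methods are added.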
  • Different shaped publishable objects: Data Papers; Executable (Methods) Papers; Analysis Papers
  • Different shaped publishable objects: different levels of granularity. Experiment (e.g. ACRG project): e.g. doi:10.5524/100001. Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2. Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz. Smaller still? Facts/Assertions (~10^13 in literature). Alongside: Papers, Data/Micropubs, Nanopubs.
  • Adding “value” publishing data: Scope for different shaped publishable objects. Scope for publishing methods/executable papers. Peer review of data problematic: post publication peer review; change criteria (assess on transparency/access only); better use of workflows/cloud/VMs. DOIs are cheap*, data is precious: maximise its use. * ish
  • Adding “value” publishing data. DOIs are cheap*, data is precious: maximise its use. * ish. Source: Ross Mounce CC-BY http://rossmounce.co.uk/2012/09/04/the-gold-oa-plot-v0-2/
  • Thanks to: Laurie Goodman, Tam Sneddon, Nicole Nogoy, Alexandra Basford, Peter Li, Jesse Si Zhe; Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Cogini. Contact us: editorial@gigasciencejournal.com, database@gigasciencejournal.com. Follow us: @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/. www.gigadb.org www.gigasciencejournal.com