Scott Edmunds ISMB talk on Big Data Publishing
Scott Edmunds talk on Big Data Publishing at the "What Bioinformaticians need to know about digital publishing beyond the PDF" workshop at ISMB 2013, July 22nd 2013

  • Over 20,000 users on the main server; over 500 papers citing the use of Galaxy; over 55 servers deployed on the Web.
  • The investigation file is a high-level aggregator for related studies. It contains all the information needed to understand the overall goals of an experiment, including investigators involved, associated publications, the experimental design, experimental factors, protocols, funding agencies and so on… Here is an example of some of the elements in the Investigation file for the SOAPdenovo2 investigation.
  • An investigation can have one or more studies. A study is the central unit of the experimental description and it contains information on the subject(s) under study, their characteristics, and any treatments applied. In the SOAPdenovo2 case, there is a single study file, which describes the sample collection workflow. The elements can be associated with ontology terms. In the table shown, the source names are associated with a term from the NCBI Taxonomy to indicate their organism.
  • Each study has one or more associated assays. The assay is the test performed either on the subject or on material taken from the subject, which produces qualitative and/or quantitative measurements. The assay file in the SOAPdenovo2 case describes the different protocols applied, the raw data and how it is processed. The assay file aggregates this information and points to the specific data, analysis methods and scripts, i.e. resources of different types. In the example, the assay file points to an FTP site with the data, to a table in the paper, and to the workflow available in the Galaxy-CBIIT instance.
  • The ISA representation is available as 1) a tabular format (ISA-TAB), which is a spreadsheet-like format targeted at biologists/experimentalists, and 2) an RDF representation (produced by the ISA2OWL project), following the semantic web/linked data approach. This representation is targeted at bioinformaticians/software developers; it facilitates the integration of data, rich querying and reasoning over the data. The ISA framework also has support for submission to public repositories, by direct submission to databases that support the format (e.g. Metabolights, GIGA-DB).
  • That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse, BGI for their support - specifically Shaoguang for IT and bioinformatics support – our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan, the Cogini web design team, Datacite for providing the DOI service and the isacommons team for their support and advocacy for best practice use of metadata reporting and sharing. Thank you for listening.

Presentation Transcript

  • Big Data publishing: beyond dead trees, and a case study. Scott Edmunds, ISMB, 22nd July 2013. @gigascience
  • The problems with publishing • Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995 • Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives • Lack of transparency, lack of credit for anything other than “regular” dead tree publication
  • Time to move beyond: 1665, 1812, 1869
  • Problem: growing replication gap • Out of 18 microarray papers, results from 10 could not be reproduced • More retractions: >15X increase in last decade • At the current rate of increase, by 2045 as many papers will be retracted as published. Sources: 1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
  • Motivation • Scholarly artefacts must be – Treated as first-class objects, in scientific investigations and in scholarly communications – Made machine-readable for the convenience of reasoning – Represented in an interoperable manner • Truly “add value” to publishing – Provide infrastructure to aid & reward replication • Trial of ISA+Nanopublication+RO – Three similar approaches that should complement each other for the representation of scholarly artifacts
  • Motivation • Scholarly artefacts must be – Treated as first-class objects, in scientific investigations and in scholarly communications • Data* citation (*and more): “data* generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”; “increase acceptance of research data* as legitimate, citable contributions to the scholarly record”.
  • Data, software, re-use… = credit
  • GigaSolution: deconstructing the paper. Need to credit and reward: • Data/software availability • Metadata/curation • Interoperability • Availability of workflows • Transparent analyses. Components: Data, Metadata, Methods, Analyses
  • GigaSolution: deconstructing the paper. Combines and integrates: • Open-access journal • Data publishing platform • Data analysis platform. Utilizes big-data infrastructure and expertise: 20PB storage, 20.5K cores, 212TFlops, >1000 bioinformaticians. www.gigadb.org www.gigasciencejournal.com
  • What should we reward?
  • Different levels of granularity: • Experiment (e.g. Rice 10K project), e.g. doi:10.5524/100001 • Datasets (e.g. species, variety), e.g. doi:10.5524/100001-2 • Sample (e.g. specimen xyz), e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz • Smaller still? Papers; Data/Micropubs; Nanopubs; Facts/Assertions (~10^14 in literature). Reward different shaped publishable objects
  • Rewarding open data
  • Submission Workflow: 1) Submitter logs in to GigaDB website and uploads Excel submission file. 2) Validation checks: Fail – submitter is provided error report; Pass – dataset is uploaded to GigaDB. 3) Curator review. 4) GigaDB DOI assigned. 5) Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues). 6) Submitter provides files by ftp or Aspera. 7) DataCite XML file is generated and registered with DataCite. 8) Curator makes dataset public (can be set as future date if required). Example of a public GigaDB dataset: DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
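The "XML is generated and registered with DataCite" step might look roughly like the sketch below. It is only an illustration under stated assumptions: the kernel-4 namespace, the publisher string and the creator name are placeholders that do not come from the talk, and the record is not claimed to be schema-valid or to match GigaDB's real export. Only the DOI, title and year are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Assumption: kernel-4 namespace used for illustration only; the schema
# version GigaDB registered against is not stated in the talk.
DATACITE_NS = "http://datacite.org/schema/kernel-4"


def build_datacite_record(doi, title, year, creators, publisher):
    """Return a minimal DataCite-style XML string for one dataset."""
    ET.register_namespace("", DATACITE_NS)

    def tag(name):
        # Qualify an element name with the DataCite namespace.
        return f"{{{DATACITE_NS}}}{name}"

    resource = ET.Element(tag("resource"))
    identifier = ET.SubElement(resource, tag("identifier"), identifierType="DOI")
    identifier.text = doi

    creators_el = ET.SubElement(resource, tag("creators"))
    for name in creators:
        creator = ET.SubElement(creators_el, tag("creator"))
        ET.SubElement(creator, tag("creatorName")).text = name

    titles_el = ET.SubElement(resource, tag("titles"))
    ET.SubElement(titles_el, tag("title")).text = title

    ET.SubElement(resource, tag("publisher")).text = publisher
    ET.SubElement(resource, tag("publicationYear")).text = str(year)
    return ET.tostring(resource, encoding="unicode")


record = build_datacite_record(
    doi="10.5524/100003",  # from the slide
    title="Genomic data from the crab-eating macaque/cynomolgus monkey "
          "(Macaca fascicularis)",  # from the slide
    year=2011,  # from the slide
    creators=["Placeholder, Creator"],  # hypothetical: not given in the talk
    publisher="Example Data Publisher",  # hypothetical placeholder
)
print(record)
```

Registering the record with DataCite is a separate network step that this sketch deliberately leaves out.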
  • Reward open & transparent review: End reviewer 3. Download parody videos, now!
  • Reward open & transparent review. Real-time open-review = paper in arXiv + blogged reviews. http://tmblr.co/ZzXdssfOMJfy www.gigasciencejournal.com/content/2/1/10
  • Cloud solutions? Reward better handling of metadata… Novel tools/formats for data interoperability/handling. BMC Research Awards 2013: winner of open data award
  • Rewarding transparent methods/results • Software availability • Open code • Accessible pipelines • Sharable workflows • Research Objects. Ease of replication
  • Implement workflows in a community-accepted format: http://galaxyproject.org • Over 36,000 main Galaxy server users • Over 500 papers citing Galaxy use • Over 55 Galaxy servers deployed • Open source
  • galaxy.cbiit.cuhk.edu.hk
  • Research Objects An aggregation of scholarly artefacts: • Data used or results produced in an experiment study • Methods employed to produce and analyse that data • Provenance and setting information about the experiments • People involved in the investigation • Annotations about these resources, that are essential to the understanding and interpretation of the scientific outcomes captured by a research object.
  • Example
  • How are we supporting data reproducibility? Data sets and analyses are each linked to a DOI. • Open-Paper: DOI:10.1186/2047-217X-1-18, >11000 accesses • Open-Review: 8 reviewers tested data in ftp server & named reports published • Open-Data: 78GB CC0 data • Open-Code: code in sourceforge under GPLv3 (http://soapdenovo2.sourceforge.net/), >5000 downloads; enabled code to be picked apart by bloggers in wiki (http://homolog.us/wiki/index.php?title=SOAPdenovo2) • Open-Pipelines • Open-Workflows • DOIs shown: DOI:10.5524/100044, DOI:10.5524/100038
  • 8 referees downloaded & tested data, then signed reports Reward open & transparent review
  • Reward open & transparent review. Post publication: bloggers pull apart code/reviews in blogs + wiki. SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2 Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
  • SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
  • SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk. Implemented entire workflow in our Galaxy server, inc.: • 3 pre-processing steps • 4 SOAPdenovo modules • 1 post-processing step • Evaluation and visualization tools. Also will be available to download by >36K Galaxy users in
  • How much further can we take this?
  • How much further can we take this? ISA + RO + Nanopub case study • Understand how each of the three models can support representation of the actual scholarly artefacts, which are essential first-class objects in scholarly communication • Demonstrate added value to life science, publishing and scholarly communication communities on how these models should be used together to describe scholarly artefacts from life sciences domains
  • Data models (instructions for authors for digital publishing) • Research Object – An encapsulation of essential information related to experiments and investigations • The ISA (Investigation + Study + Assay) framework – Includes a format and a set of software tools that enable its international user community to provide rich description of the experimental workflows in life science, environmental and biomedical domains • Nanopublication – Dissemination of individual data (assertions) with/without an accompanying scholarly article – Enables attribution to the scientists for sharing their data
  • The Big Picture. Types of resources in an RO: Data; Method/Experimental protocol; Findings. Models to describe each resource type: Wfdesc/ISA-TAB/ISA2OWL
  • The SOAPdenovo2 Case study • The Data • The method, as a Galaxy workflow • The findings, Table 2 in the paper (doi:10.1186/2047-217X-1-18)
  • Investigation/Study/Assay infrastructure. The investigation file is a high-level aggregator for related studies. It contains all the information needed to understand the overall goals of an experiment, including investigators involved, associated publications, the experimental design, experimental factors, protocols, funding agencies and so on… [Diagram: investigation]
  • Investigation/Study/Assay infrastructure. An investigation can have one or more studies. A study is the central unit of the experimental description and it contains information on the subject(s) under study, their characteristics, and any treatments applied. [Diagram: investigation with two studies]
  • Investigation/Study/Assay infrastructure. Each study has one or more associated assays. The assay is the test performed either on the subject or on material taken from the subject, which produces qualitative and/or quantitative measurements. [Diagram: investigation with two studies and four assays]
  • Investigation/Study/Assay infrastructure (ISA). [Diagram: investigation with two studies and four assays, pointing to data, analysis method and script resources]
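The investigation, study and assay nesting described on these slides can be pictured as a small data structure. The sketch below is not the ISA tools' own API; the class names, field names and the study and assay labels are hypothetical choices for illustration. The two resource URLs are the ones shown on the SOAPdenovo2 assay slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A test performed on a subject or on material taken from it."""
    name: str
    # Pointers to data, analysis methods or scripts.
    resources: List[str] = field(default_factory=list)


@dataclass
class Study:
    """Central unit of the experimental description."""
    name: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level aggregator for related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_resources(self):
        """Collect every resource pointed to by any assay, in order."""
        return [r for s in self.studies for a in s.assays for r in a.resources]


# The study and assay names are illustrative labels, not values from the
# real ISA-TAB files.
assay = Assay(
    name="genome assembly",
    resources=[
        "ftp://public.genomics.org.cn/BGI/SOAPdenovo2",
        "http://galaxy.cbiit.cuhk.edu.hk/",
    ],
)
investigation = Investigation(
    title="SOAPdenovo2",
    studies=[Study(name="sample collection", assays=[assay])],
)
print(investigation.all_resources())
```

Keeping the resources on the assay mirrors the slides' point that the assay file is where data, methods and scripts are tied together.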
  • ISA framework, SOAPdenovo2. Investigation: Study Design
  • ISA framework, SOAPdenovo2. Investigation > study. Study Samples: a table rendering the sample collection workflow
  • ISA framework, SOAPdenovo2. Investigation > study > assay. FTP: ftp://public.genomics.org.cn/BGI/SOAPdenovo2 http://galaxy.cbiit.cuhk.edu.hk/
  • ISA framework (investigation > study > assay). Representation available in: • Tabular format – Spreadsheet-like format – For biologists/experimentalists • RDF/OWL format for Semantic Web/Linked Data users – For bioinformaticians/software developers – Facilitating data integration, querying, reasoning • Support for submission to public repositories and data publication platforms • Tools support for curation, creation, storage, analysis… • Large and diverse life science user/collaborator communities
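To make the contrast between the tabular and the linked-data representation concrete, here is a minimal sketch that reads a tab-delimited study table and emits subject, predicate, object tuples. The column headers follow ISA-TAB naming conventions, but the row values are invented for the example and the `ex:` predicates are hypothetical; they are not the vocabulary produced by ISA2OWL.

```python
import csv
import io

# Illustrative study table. The row values are made up for this sketch.
STUDY_TABLE = (
    "Source Name\tCharacteristics[organism]\tTerm Source REF\tSample Name\n"
    "source1\tStaphylococcus aureus\tNCBITAXON\tsample1\n"
)


def rows_to_triples(tsv_text):
    """Turn each table row into (subject, predicate, object) tuples."""
    triples = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        source = row["Source Name"]
        # "ex:" predicates are hypothetical, not the ISA2OWL vocabulary.
        triples.append((source, "ex:hasOrganism", row["Characteristics[organism]"]))
        triples.append((source, "ex:organismTermSource", row["Term Source REF"]))
        triples.append((row["Sample Name"], "ex:derivedFrom", source))
    return triples


for triple in rows_to_triples(STUDY_TABLE):
    print(triple)
```

The same facts sit in both forms; the table suits manual curation, while the tuples are easier to merge and query across datasets.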
  • RO + ISA • Scientific Workflow-specific ROs: scientific, computational experiments, non-wet lab protocols • ISA experiment and data description: focus on wet-lab or non-computational experimental protocols
  • An RO for the Case Study. A Research Object: input sequence data; a Galaxy workflow; descriptions about the workflow; some nanopub statements. The Research Object contains the following artefacts: • The input sequence data, represented in ISA-TAB format • The Galaxy workflow that reflects the computational steps taken for generating the results used to produce Table 2 • Machine-readable descriptions about the workflow • The nanopublication statements that represent claims based on the content of Table 2
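As a rough picture of what "an aggregation of scholarly artefacts" means for this case study, the sketch below lists the four artefacts in a plain manifest. The manifest layout, key names and role labels are hypothetical; real Research Objects are described with dedicated vocabularies that this sketch does not attempt to reproduce.

```python
import json

# Hypothetical manifest layout. The descriptions paraphrase the slide.
research_object = {
    "title": "SOAPdenovo2 case study",
    "aggregates": [
        {"role": "input data",
         "description": "Input sequence data in ISA-TAB format"},
        {"role": "method",
         "description": "Galaxy workflow producing the results behind Table 2"},
        {"role": "workflow description",
         "description": "Machine-readable descriptions about the workflow"},
        {"role": "findings",
         "description": "Nanopublication statements based on Table 2"},
    ],
}


def roles(ro):
    """List the role of each aggregated artefact."""
    return [item["role"] for item in ro["aggregates"]]


print(json.dumps(research_object, indent=2))
print(roles(research_object))
```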
  • An RO for the Case Study
  • Nanopublication (URL: unique, persistent and resolvable identifier) • Assertion: an individual association between concepts (statement or declaration; measurement; hypothetical inference; quantitative or qualitative). Example: association a sio:statisticalAssociation; sio:has-measurementValue Association_1_p_value. Association_1_p_value a sio:probability-value; sio:has-value 6.56e-5^^xsd:float. Also shown: sio:refers-to, dcterms:DOI, … • Provenance: how this assertion came to be, methods, evidence, context, etc. Terms shown on the assertion: opm:wasDerivedFrom, opm:wasGeneratedBy • PublicationInfo: detailed attribution for authors, institutions, lab technicians, curators; license info; publication date. Terms shown on this nanopub: dcterms:created, pav:authoredBy • Integrity Key: guarantee immutability after publication
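The three-part anatomy on this slide (assertion, provenance, publication info) can be sketched as three small lists of triples. This is an illustration only: the prefixed names are copied from the slide as opaque strings rather than resolved IRIs, every `ex:` term is a hypothetical placeholder, and the objects of the provenance and publication-info statements are invented because the slide does not show them. Only the p-value comes from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Nanopublication:
    """Three parts, as on the slide: assertion, provenance, publication info."""
    uri: str
    assertion: List[Triple] = field(default_factory=list)
    provenance: List[Triple] = field(default_factory=list)
    publication_info: List[Triple] = field(default_factory=list)

    def to_lines(self):
        """Render each part as a labelled block of 'subject predicate object .' lines."""
        lines = []
        for label, graph in (
            ("assertion", self.assertion),
            ("provenance", self.provenance),
            ("publicationInfo", self.publication_info),
        ):
            lines.append(f"# {label}")
            lines.extend(f"{s} {p} {o} ." for s, p, o in graph)
        return lines


nanopub = Nanopublication(
    uri="ex:nanopub1",  # hypothetical identifier
    assertion=[
        ("ex:association_1", "a", "sio:statisticalAssociation"),
        ("ex:association_1", "sio:has-measurementValue", "ex:association_1_p_value"),
        ("ex:association_1_p_value", "a", "sio:probability-value"),
        ("ex:association_1_p_value", "sio:has-value", '"6.56e-5"^^xsd:float'),
    ],
    provenance=[
        ("ex:assertion", "opm:wasDerivedFrom", "ex:source_table"),  # hypothetical object
        ("ex:assertion", "opm:wasGeneratedBy", "ex:workflow_run"),  # hypothetical object
    ],
    publication_info=[
        ("ex:nanopub1", "dcterms:created", '"2013-07-22"'),  # hypothetical date
        ("ex:nanopub1", "pav:authoredBy", "ex:some_author"),  # hypothetical author
    ],
)
print("\n".join(nanopub.to_lines()))
```

Keeping the parts separate is what lets the claim itself be cited and attributed independently of how it was produced.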
  • A Nanopublication-Centric View • Improvements of SOAPdenovo2 have also been observed in assembling GAGE [8] dataset (see Additional file 1: Supplementary Method 6 and Tables 2 and 3). As shown in Tables 2 and 3, the correct assembly length of SOAPdenovo2 increased by approximately 3 to 80-fold comparing with that of SOAPdenovo1.
  • SOAPdenovo2 S. aureus pipeline
  • How do we generate a nanopub from this? …stay tuned for Tech Track talk #34 by Marco Roos, ICC Lounge 81, Tuesday 23rd: 3.40pm-4.05pm
  • Final step: visualization. NC_010079.pdf gi_161510924_ref_NC_010063.1_.pdf CONTIGuator 2 (thanks Marco Galardini) https://github.com/combogenomics/CONTIGuator
  • Lessons learned: • It is possible to push button(s) & recreate a result from a paper • Reproducibility is COSTLY. How much are you willing to spend? • You learn a huge amount about the study, and it provides lots of information not present in the paper • Much easier to do this before rather than after publication
  • Next steps • Complete the case study on the release of ISA-OWL, nanopubs & ROs • Extend the case study by including more than one dataset or RO, in order to show how related or conflicting information can be more easily interlinked • Create community guidelines on how these three models should be used together, e.g. recommended patterns or vocabulary terms
  • Give us your data & pipelines!* Want to go beyond dead trees & the PDF? Contact us: scott@gigasciencejournal.com editorial@gigasciencejournal.com database@gigasciencejournal.com www.gigasciencejournal.com * APCs currently generously covered by BGI in 2013
  • Thanks to: • Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST) • Team: Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman • Case study: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Oxford), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford) • Funding from: CBIIT • @gigascience facebook.com/GigaScience blogs.openaccesscentral.com/blogs/gigablog/ www.gigadb.org galaxy.cbiit.cuhk.edu.hk www.gigasciencejournal.com