ISMB Workshop 2014

This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.

Published in: Data & Analytics

  1. 1. What was the plan? A role for data standards, models and computational workflows in scholarly data publishing. Alejandra González-Beltrán, PhD; Philippe Rocca-Serra, PhD. Oxford e-Research Centre, University of Oxford. {alejandra.gonzalezbeltran,philippe.rocca-serra}@oerc.ox.ac.uk. ISMB Workshop: What Bioinformaticians need to know about digital publishing beyond the PDF2. July 15th, 2014, Boston, USA
  2. 2. The experimental workflow (diagram labels): Data Scientist; Visualization; Analysis; Planning; Data Management; Data Collection; Publication; Use existing data; Perform new experiment
  3. 3. The experimental workflow (diagram labels): Data Scientist; Visualization; Analysis; Planning; Data Management; Data Collection; Publication; Use existing data; Perform new experiment; metadata
  4. 4. The experimental workflow (diagram labels): Data Scientist; Visualization; Analysis; Planning; Data Management; Data Collection; Publication; Use existing data; Perform new experiment; metadata
  5. 5. The experimental workflow (diagram labels): Data Scientist; Visualization; Analysis; Planning; Data Management; Data Collection; Publication; Use existing data; Perform new experiment; Data Interoperability; Reproducibility; Data Review
  6. 6. The experimental workflow (diagram shown twice, same labels in each copy): Data Scientist; Visualization; Analysis; Planning; Data Management; Data Collection; Publication; Use existing data; Perform new experiment; Data Reusability
  7. 7. The experimental plan - life sciences case: experimental design; sample characteristic(s); experimental variable(s). 2-week systemic rat study using male Wistar rats (N=15 per dose group). 14 proprietary drug candidates from participating companies and 2 reference toxic compounds. InnoMed PredTox Project
  8. 8. The experimental plan - life sciences case: experimental design; sample characteristic(s); experimental variable(s); technology(ies); measurement(s); protocol(s); data file(s); …
  9. 9. The experimental plan - computational case: evaluation of the SOAPdenovo2 tool for the de novo assembly of genomes from small DNA segment reads generated by next generation sequencing, implementing improvements over the SOAPdenovo1 assembler. • open peer-review • availability of data, analysis scripts and documentation
  10. 10. The experimental plan - computational case. Predictor Variables (Factor Name, Factor Type): genome assembly algorithm; genome size
  11. 11. The experimental plan - computational case. Predictor Variables (Factor Name, Factor Type): genome assembly algorithm (SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG); genome size
  12. 12. The experimental plan - computational case. Predictor Variables (Factor Name, Factor Type): genome assembly algorithm (SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG); genome size (bacterial genome, insect genome, human genome)
  13. 13. The experimental plan - computational case. Predictor Variables (Factor Name, Factor Type): genome assembly algorithm (SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG), each crossed with genome size (bacterial genome, insect genome, human genome). 3x3 factorial design, 9 study groups
  14. 14. The experimental plan - computational case. Predictor Variables (Factor Name, Factor Type): genome assembly algorithm (SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG), each crossed with genome size (bacterial genome, insect genome, human genome). Genomes: S. aureus; R. sphaeroides; B. impatiens; Chinese Han genome (or YH genome)
  15. 15. The experimental plan - computational case. Predictor Variables (Factor Name, Factor Type): genome assembly algorithm (SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG), each crossed with genome size (bacterial genome, insect genome, human genome). Response Variables: genome coverage; computation run time; memory consumption
  16. 16. http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png
  17. 17. A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including: • stem cell discovery • systems biology • transcriptomics • toxicogenomics • environmental health • environmental genomics • metabolomics • metagenomics • nanotechnology • proteomics • also used by communities working to build a library of cellular signatures
  18. 18. General-purpose, configurable format designed to support: • description of the experimental metadata, making the annotation explicit and discoverable • provenance tracking • use of community standards, such as minimal reporting guidelines and terminologies • conversion to a growing number of other metadata formats, e.g. those used by the European Bioinformatics Institute (EBI) repositories
  19. 19. [Figure: sample-to-data-file provenance, shown in tabular and graph form. Sources: H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years). Samples: H1.sample1, H1.sample2, H2.sample1. Labeling: H1.sample1.labeled, H2.sample1.labeled. Scanning: h1-s1.cel, h1-s2.cel, h2-s1.cel. ...]
  20. 20. [Figure: the same table and graph as the previous slide, annotated with ontology terms: obi:material entity; obi:material sample; obi:material processing; obi:processed material; obi:planned process; isa:raw data file; bfo:derives from]
  21. 21. http://gigasciencejournal.com http://gigadb.org/dataset/100035
  22. 22. http://gigasciencejournal.com http://gigadb.org/dataset/100035
  23. 23. A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these. Article or narrative component (PDF and HTML). Experimental metadata or structured component (in-house curated, machine-readable formats). • Credit for sharing your data • Focused on reuse and reproducibility • Peer reviewed, curated • Promoting Community Data Repositories • Open Access
  24. 24. SOAPdenovo2 http://isa-tools.github.io/soapdenovo2
  25. 25. SOAPdenovo2 http://isa-tools.github.io/soapdenovo2
  26. 26. SOAPdenovo2 http://isa-tools.github.io/soapdenovo2 Galaxy workflows: to re-enact the data analysis
  27. 27. SOAPdenovo2 http://isa-tools.github.io/soapdenovo2 Nanopub: represents structured data along with its provenance in a single publishable and citable entity
  28. 28. SOAPdenovo2 http://isa-tools.github.io/soapdenovo2 ResearchObject: enables the aggregation of the digital resources contributing to findings of computational research, including results, data and software, as citable compound digital objects
  29. 29. Reproducing SOAPdenovo2 results Galaxy workflows S. aureus pipeline
  30. 30. Reproducing SOAPdenovo2 results Galaxy workflows
  31. 31. Reproducing SOAPdenovo2 results Galaxy workflows
  32. 32. Reproducing SOAPdenovo2 results Galaxy workflows
  33. 33. “genome coverage increased over the human data when comparing SOAPdenovo2 against SOAPdenovo1”. Response Variables: genome coverage; memory consumption
  34. 34. OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets. Maguire et al. (2013) Bioinformatics. Widget for ontology annotation and tagging on Google spreadsheets relying on BioPortal and Linked Open Vocabularies services
  35. 35. OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets. Maguire et al. (2013) Bioinformatics. Widget for ontology annotation and tagging on Google spreadsheets relying on BioPortal and Linked Open Vocabularies services. NanoMaton https://github.com/ISA-tools/NanoMaton Ontology for Biomedical Investigations; Semanticscience Integrated Ontology
  36. 36. FAIR data: Findable, Accessible, Interoperable, Reusable. The experimental workflow (diagram labels): Data Scientist; Visualization; Analysis; Planning; Data Management; Data Collection; Publication; Use existing data; Perform new experiment
  37. 37. Contributing to MetaboLights and ISA • BBSRC UK-China Award & BGI funded Hackathon • Venue: BGI Hong Kong • Participants: MetaboLights/BGI/ISA/Birmingham/Hong Kong University • Outcome: ISA-Tab web viewer code; Functional Specifications & Code for DoE Wizard API; 4 datasets coded in ISA format; conversion of MetaboLights datasets to RDF
  38. 38. Funders. Acknowledgements: Scott Edmunds, GigaScience; Peter Li, GigaScience; Jun Zhao, Lancaster University; María Susana Avila García, Oxford University; Marco Roos, Leiden University; Mark Thompson, Leiden University; Ruibang Luo, University of Hong Kong; Tin-Lap Lee, Chinese University of Hong Kong; Tak-wah Lam, University of Hong Kong
  39. 39. Questions? You can email us... isatools@googlegroups.com. View our blog: http://isatools.wordpress.com. Follow us on Twitter: @isatools. View our websites. View our Git repo & contribute: http://github.com/ISA-tools. Thanks for your attention!
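
The 3x3 factorial design described for the computational case (slides 10 to 15) can be illustrated with a minimal sketch. It crosses the two predictor variables named on the slides, genome assembly algorithm and genome size, to enumerate the 9 study groups, and attaches the three response variables as empty placeholders. The function name, variable names and dictionary layout are assumptions made for illustration only; they are not taken from the ISA tools or from the authors' own code.

```python
from itertools import product

# Factor levels and response variables as listed on the slides.
# All identifiers below are hypothetical, chosen for this sketch.
ASSEMBLERS = ["SOAPdenovo2", "SOAPdenovo1", "ALL-PATHS-LG"]
GENOME_SIZES = ["bacterial genome", "insect genome", "human genome"]
RESPONSE_VARIABLES = ["genome coverage", "computation run time", "memory consumption"]


def build_study_groups(algorithms, genome_sizes, responses):
    """Return one study group per combination of factor levels (full factorial)."""
    groups = []
    for algorithm, genome_size in product(algorithms, genome_sizes):
        groups.append({
            "genome assembly algorithm": algorithm,
            "genome size": genome_size,
            # Response values are unknown until the assemblies are run.
            "responses": {name: None for name in responses},
        })
    return groups


study_groups = build_study_groups(ASSEMBLERS, GENOME_SIZES, RESPONSE_VARIABLES)

if __name__ == "__main__":
    for index, group in enumerate(study_groups, start=1):
        print(index, group["genome assembly algorithm"], "x", group["genome size"])
```

Declaring the factors and response variables up front, before any assembly is run, is what lets a reviewer check afterwards that every planned study group was actually reported.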
