Metagenomic Data Provenance and Management using the ISA infrastructure --- overview, implementation patterns & software tools

Slides presented at EBI Metagenomics Bioinformatics course: http://www.ebi.ac.uk/training/course/metagenomics2014

  1. 1. Metagenomic Data Provenance and Management using the ISA infrastructure: overview, implementation patterns & software tools. Alejandra Gonzalez-Beltran, PhD (alejandra.gonzalezbeltran@oerc.ox.ac.uk) and Eamonn Maguire (eamonn.maguire@oerc.ox.ac.uk), University of Oxford e-Research Centre, UK. Metagenomics Bioinformatics, EMBL-EBI, Hinxton, UK, September 2014.
  2. 2. Experimental Metadata Roadmap
  4. 4. Experimental Metadata Roadmap link to analysis platforms
  5. 5. Experimental Metadata Roadmap link to analysis platforms submission to public repositories
  7. 7. Experimental Metadata Roadmap link to analysis platforms submission to public repositories data publication
  8. 8. Experimental Metadata: notes in lab notebooks (information for humans); spreadsheets & tables; RDF statements (information for machines). It is all about structuring experimental information to make it available to computers and software agents, to enable: • provenance tracking • assessment and evaluation • accountability, reliability, trust, evidence • conservation, preservation, storage, archiving and mining
  10. 10. http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png
  11. 11. The community
  12. 12. A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including: • stem cell discovery • systems biology • transcriptomics • toxicogenomics • also by communities working to build a library of cellular signatures • environmental health • environmental genomics • metabolomics • metagenomics • nanotechnology • proteomics
  13. 13. The format
  14. 14. Why ISA format and Tools? Investigation: high-level concept to link related studies. Study: the central unit, containing information on the subject under study, its characteristics and any treatments applied; a study has associated assays. Assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data); assays hold pointers to data file names/locations, with the data kept in external files in native or other formats. ISA metadata specifications: • workflow and process orientated • compatible with checklist enforcement • compatible with external vocabulary resources • compatible by design with existing schemas. [Figure: example investigation with subjects H1 and H2 (H. Sapiens, aged 35 and 33 years), samples H1.sample1, H1.sample2 and H2.sample1, a Labeling step producing H1.sample1.labeled and H2.sample1.labeled, and data files h1-s1.cel, h1-s2.cel and h2-s1.cel; the figure also names MAGE-Tab, Pride-xml and SRA-xml.]
  15. 15. Essentials about ISA syntax • 3 types of files • Investigation file: at most 1 (think executive summary) –Why? general study description –How? methods / protocol declaration –How? variable declarations (factors and response variables) –Who? contact and affiliation information • Study file: true table (think sorting, filtering) –What? Listing all biological materials collected over the study course. • Assay file: true table (think sorting, filtering) –Results! Listing all data files collected by a given assay –n files, as many as there are assay types declared
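As a rough illustration of the three file types, the sketch below sorts the contents of an ISA-Tab archive by the conventional filename prefixes (`i_`, `s_`, `a_`). The file names in the example are hypothetical, and treating every other file as data is a simplifying assumption, not a rule from the slides.

```python
from collections import defaultdict

# Conventional ISA-Tab filename prefixes for the three metadata file types.
PREFIX_TO_KIND = {"i_": "investigation", "s_": "study", "a_": "assay"}


def classify_isatab_files(filenames):
    """Group file names by ISA-Tab file type; anything else is treated as data."""
    groups = defaultdict(list)
    for name in filenames:
        if name.endswith(".txt"):
            kind = PREFIX_TO_KIND.get(name[:2], "data")
        else:
            kind = "data"
        groups[kind].append(name)
    # The slide's rule: at most one investigation file per archive.
    if len(groups.get("investigation", [])) > 1:
        raise ValueError("an ISA-Tab archive holds at most one investigation file")
    return dict(groups)


# Hypothetical archive listing.
example = [
    "i_investigation.txt",
    "s_study.txt",
    "a_metagenome.txt",
    "a_metatranscriptome.txt",
    "reads.sff",
]
result = classify_isatab_files(example)
```

Here `result["assay"]` lists two files, one per declared assay type, which matches the "n files" rule above.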
  16. 16. Essentials about ISA syntax • Material Transformations: – Inputs and Outputs of Protocols are Material Nodes (Source Name, Sample Name, Extract Name, Labeled Extract Name). [Figure: Material → Protocol (Process) → Material → Data File Node (Raw Data File, Derived Data File). Material Node attributes: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]. Protocol attributes: Parameter Value[…], Performer (operator effect), Date (day effect).]
  17. 17. Basic coding patterns
  18. 18. Essentials about ISA syntax –Branching events: Tabular Representation. One source material (human volunteer 1) yields three sample materials (peripheral blood, muscle biopsy, liver biopsy):
      Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
      volunteer 1 | Homo sapiens | sample collection | heparinated tube, room temperature | volunteer 1 - sample1 | peripheral blood
      volunteer 1 | Homo sapiens | sample collection | liquid nitrogen | volunteer 1 - sample2 | muscle
      volunteer 1 | Homo sapiens | sample collection | liquid nitrogen | volunteer 1 - sample3 | liver
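A minimal sketch of how the branching pattern can be read back from a tab-delimited study table, using only the Python standard library. The table is the slide's example; the helper name `samples_per_source` is made up for illustration. Note that `csv.DictReader` assumes unique column headers, which holds here but not for tables that repeat headers such as Term Source REF.

```python
import csv
import io
from collections import defaultdict

HEADERS = [
    "Source Name", "Characteristics[organism]", "Protocol REF",
    "Parameter Value[storage condition]", "Sample Name", "Characteristics[organ]",
]
ROWS = [
    ["volunteer 1", "Homo sapiens", "sample collection",
     "heparinated tube, room temperature", "volunteer 1 - sample1", "peripheral blood"],
    ["volunteer 1", "Homo sapiens", "sample collection",
     "liquid nitrogen", "volunteer 1 - sample2", "muscle"],
    ["volunteer 1", "Homo sapiens", "sample collection",
     "liquid nitrogen", "volunteer 1 - sample3", "liver"],
]
# Serialise as tab-delimited text, the way a study file stores it.
STUDY_TABLE = "\n".join("\t".join(row) for row in [HEADERS] + ROWS)


def samples_per_source(tsv_text):
    """Map each Source Name to the Sample Names derived from it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    derived = defaultdict(list)
    for row in reader:
        derived[row["Source Name"]].append(row["Sample Name"])
    return dict(derived)


# A source branches when more than one sample is derived from it.
branching = {
    source: samples
    for source, samples in samples_per_source(STUDY_TABLE).items()
    if len(samples) > 1
}
```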
  19. 19. Essentials about ISA syntax –Pooling events: Tabular Representation. Three source materials (animal 1, animal 2, animal 3) are pooled into one sample material (salivary glands):
      Source Name | Characteristics[organism] | Protocol REF | Parameter Value[storage condition] | Sample Name | Characteristics[organ]
      animal 1 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
      animal 2 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
      animal 3 | Mus musculus | sample collection | heparinated tube, room temperature | pool1 | salivary gland
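Pooling is the mirror image of branching: the same Sample Name appears on rows with different Source Names. The sketch below detects that from (source, sample) pairs taken from the slide's example; `find_pools` is a hypothetical helper, not part of any ISA tool.

```python
from collections import defaultdict

# (Source Name, Sample Name) pairs from the pooling example.
PAIRS = [
    ("animal 1", "pool1"),
    ("animal 2", "pool1"),
    ("animal 3", "pool1"),
]


def find_pools(source_sample_pairs):
    """Return samples that were derived from more than one distinct source."""
    sources_by_sample = defaultdict(set)
    for source, sample in source_sample_pairs:
        sources_by_sample[sample].add(source)
    return {
        sample: sorted(sources)
        for sample, sources in sources_by_sample.items()
        if len(sources) > 1
    }


pools = find_pools(PAIRS)
```

Running the same function on a branching table returns an empty dictionary, since each sample there has exactly one source.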
  20. 20. Essentials about ISA syntax: Tagging with Terminologies • Implicit column order matters: a Term Source REF, Term Accession Number or Unit column qualifies the column that precedes it. • ISA tools (ISAcreator - ISAconfigurator) provide ontology term selection and term tagging facilities to help users.
      Header without term tagging (row: individual1 | human):
      Source Name | Characteristics[organism] | Factor Value[compound agent] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration] | Factor Value[washout period] | Factor Value[duration] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
      Header with term tagging:
      Source Name | Characteristics[organism] | Term Source REF | Term Accession Number | Characteristics[duration] | Unit | Term Source REF | Term Accession Number | Factor Value[compound (htppt://purl] | Term Source REF | Term Accession Number
      individual1 | Homo sapiens | NCBITax | 9606 | 12 | week | UO | UO:wwerw ta | aspirin | CHEBI | 1231354
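Because qualifier columns are tied to their neighbour only by position, a reader has to walk the header left to right. The following sketch shows that idea under simplifying assumptions: the header is shortened to `Factor Value[compound]`, the values are the slide's placeholders, and a unit's own term source would simply be appended to the same list rather than nested. The function name is invented for illustration.

```python
# Columns that qualify whatever attribute column comes before them.
QUALIFIERS = {"Term Source REF", "Term Accession Number", "Unit"}


def attach_qualifiers(headers, values):
    """Pair each attribute column with the qualifier columns that follow it."""
    annotated = []
    for header, value in zip(headers, values):
        if header in QUALIFIERS:
            if not annotated:
                raise ValueError(f"{header!r} has no preceding column to qualify")
            annotated[-1]["qualifiers"].append((header, value))
        else:
            annotated.append({"column": header, "value": value, "qualifiers": []})
    return annotated


headers = [
    "Source Name",
    "Characteristics[organism]", "Term Source REF", "Term Accession Number",
    "Factor Value[compound]", "Term Source REF", "Term Accession Number",
]
values = ["individual1", "Homo sapiens", "NCBITax", "9606", "aspirin", "CHEBI", "1231354"]

annotated = attach_qualifiers(headers, values)
```

Reordering the columns would silently attach `NCBITax` / `9606` to a different attribute, which is why tools that manage the order for the user are helpful.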
  21. 21. Experimental design and workflows
  22. 22. Parallel group design (source: http://dx.doi.org/10.1016/S1569-9056(02)00115-X; figure 1)
  23. 23. Essentials about ISA syntax: Representing interventions and treatments • expressing treatments as sets of factor levels • example: treatment is a tadalafil supplementation • Factors will be 'compound', 'dose' and 'duration' • (what?, how much?, how long for?) • Implicit column order matters but this is independent from the ISA syntax specification
      Source Name | Characteristics[organism] | Protocol REF | Factor Value[compound] | Factor Value[dose] | Factor Value[duration]
      volunteer 1 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
      volunteer 2 | Homo sapiens | treatment | tadalafil | 250 mg/day | 12 weeks
      volunteer 3 | Homo sapiens | treatment | placebo | 20 mg/day | 12 weeks
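Treating a treatment as a set of factor levels means that subjects sharing the same combination of Factor Value cells belong to the same treatment group. A small sketch of that grouping, using the slide's three volunteers; the function name `treatment_groups` is hypothetical.

```python
from collections import defaultdict

ROWS = [
    {"Source Name": "volunteer 1", "Factor Value[compound]": "tadalafil",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "volunteer 2", "Factor Value[compound]": "tadalafil",
     "Factor Value[dose]": "250 mg/day", "Factor Value[duration]": "12 weeks"},
    {"Source Name": "volunteer 3", "Factor Value[compound]": "placebo",
     "Factor Value[dose]": "20 mg/day", "Factor Value[duration]": "12 weeks"},
]


def treatment_groups(rows):
    """Group subjects by their combination of factor levels."""
    groups = defaultdict(list)
    for row in rows:
        factor_columns = sorted(k for k in row if k.startswith("Factor Value["))
        # The treatment is the full set of (factor, level) pairs.
        treatment = tuple((k, row[k]) for k in factor_columns)
        groups[treatment].append(row["Source Name"])
    return dict(groups)


groups = treatment_groups(ROWS)
```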
  24. 24. Cross-over design (source: Roberts et al. Journal of the International Society of Sports Nutrition 2007 4:25 doi:10.1186/1550-2783-4-25)
  25. 25. 08/26/13 Cross-over design (source: doi:10.1371/journal.pone.0037479)
  26. 26. Cross-over design: treatment declaration
  27. 27. Cross-over design (source: doi:10.1371/journal.pone.0037479)
  28. 28. Assays: NMR
  31. 31. The software suite
  32. 32. 1
  33. 33. ISA configurations Available from: http://isa-tools.org/configurations.html https://github.com/ISA-tools/Configuration-Files • Assembling workflow archetypes • Setting annotation requirements –for compliance with database schemas (SRA, MAGE, PRIDE) –for compliance with community based requirements (MIAME, MIAPE, MIMS, MIxS, …) • Guide users –Provide pre-assembled templates –Specify vocabulary support ISAconfigurator: Supporting tool https://github.com/ISA-tools/ISAconfigurator
  34. 34. ISA configurations Available from: http://isa-tools.org/configurations.html https://github.com/ISA-tools/Configuration-Files • Minimum information about any (x) sequence (MIxS) Guidelines as issued by Genomic Standards Consortium • ENA-GSC-MIxS checklist XML document: –based on MIxS guidelines –augmented with a number of regular expressions to further validate/ regularize input –fixing a number of units used to report measurement –issued July 2013 (version 3.0), July 2014 (version 4.0) • SRA 1.5 schema requirements (mandatory information and required terminology, e.g. Library Selection or Library Strategy) • All this information is used to derive ISA MIxS configurations allowing all those annotation requirements to be embedded in spreadsheet tables
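To illustrate what "regular expressions to further validate/regularize input" can look like in practice, here is a small sketch. The field names and patterns are hypothetical and are not copied from the ENA-GSC-MIxS checklist, which defines its own fields, patterns and units.

```python
import re

# Hypothetical rules for illustration only; the real checklist differs.
RULES = {
    "collection date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # e.g. 2014-09-01
    "depth (m)": re.compile(r"\d+(\.\d+)?"),              # number only, unit fixed by the column
}


def validate(record):
    """Return a list of human-readable problems found in one metadata record."""
    errors = []
    for field, pattern in RULES.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not pattern.fullmatch(value):
            errors.append(f"{field}: {value!r} does not match {pattern.pattern}")
    return errors


ok = validate({"collection date": "2014-09-01", "depth (m)": "10.5"})
bad = validate({"collection date": "01/09/2014", "depth (m)": "10 m"})
```

Fixing the unit in the column definition, as the checklist does, is what lets the value pattern stay a plain number.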
  35. 35. ISAconfigurator Tables
  36. 36. ISAconfigurator Tables
  37. 37. Things to bear in mind with NGS data: important considerations for managing data and submitting to public repositories –be aware of supported file formats • FastA, FastQ, SFF, ... –be aware of the need to demultiplex reads –SRA schema evolves and updates are needed • e.g. Study replaced by Project • Updates to the ISAconverter • Mapping from ISA is straightforward, as the change brings in a number of elements that ISA already supported
  38. 38. Tools for creating ISA-Tab documents isacreator
  39. 39. isacreator Java desktop application Developed to be a user friendly way to enter standards-compliant metadata: it has lots of features... But these are just some of them… we also have a data entry wizard and an import utility...
  40. 40. ISAcreator features: automatic template generation
  41. 41. ISAcreator Wizard: automatic template generation. Prerequisites and conditions of use: -supports factorial design experiments, meaning sets of discrete factor levels combined together to define a treatment: 2x2 factorial design as in 2 compounds and 2 time points; 2x2x3 factorial design as in 2 compounds, 2 time points, 3 doses -assumes one sample collection event (all samples collected at sacrifice time) -supports some but not all currently available assay types -supports fractional factorial design -supports unbalanced factor group population sizes (ethical considerations for high dose toxic exposures) -automatically generates sample identifiers, human-readable and meaningful labels and, if requested, barcodes -does not support 'crossover design', which has to be coded manually -does not support sample collection timeline management (under development)
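The "sets of discrete factor levels combined together" idea can be sketched with `itertools.product`. This is not how the ISAcreator wizard is implemented; it only enumerates the treatment groups of a full 2x2x3 design (2 compounds, 2 time points, 3 dose levels), and all level names are invented placeholders.

```python
from itertools import product

# Hypothetical factor levels for a 2x2x3 full factorial design.
FACTORS = {
    "compound": ["compound A", "compound B"],
    "time point": ["4 h", "24 h"],
    "dose": ["low", "medium", "high"],
}


def full_factorial(factors):
    """Enumerate every combination of factor levels as one treatment group."""
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in product(*(factors[name] for name in names))
    ]


treatments = full_factorial(FACTORS)
```

A fractional factorial design would keep only a chosen subset of `treatments`, and unbalanced group sizes would be handled when subjects are assigned to each group.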
  42. 42. Importing your own spreadsheet: mapping to third party table
  43. 43. ISAcreator features: visualizing experimental workflows. Work completed during investigation of new approach for creation of glyphs with use of taxonomy for guidance. See Maguire et al, Taxonomy-Based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments, IEEE Transactions on Visualization and Computer Graphics, 2012
  44. 44. OntoMaton: a BioPortal powered Ontology widget for Google Spreadsheets (Maguire et al, 2013, Bioinformatics). Tools for creating ISA-Tab documents. http://www.slideshare.net/proccaserra/ontomaton-icbo2013alternative-ordertwv3 http://isatools.wordpress.com/2012/07/13/introducing-ontomaton-ontology-search-tagging-for-google-spreadsheets/
  45. 45. Potential Issues and known hurdles • The problem of conflicting versions –especially high when working with big consortia –distributed, decentralised groups of users • Lack of version control and history • Absence of collaborative features –Looking for new solutions while retaining the features
  46. 46. Bioportal meets Google Spreadsheet
  47. 47. Searching and Tagging Templates: https://drive.google.com/templates?type=spreadsheets&q=ontomaton
  50. 50. 2
  51. 51. 3
  52. 52. Risa - ISA-Tab manipulation for analysis in R • RISA R-package
  53. 53. • R package available since BioConductor 2.11: http://www.bioconductor.org/packages/release/bioc/html/Risa.html • Functionality for parsing ISA-Tab datasets into R objects, saving and updating them. • It bridges the ISA-Tab metadata to analysis pipelines of specific assay types, by building objects for use in other R packages downstream – currently considering mass spectrometry (xcms package, xcmsSet) and DNA microarray (Biobase package, ExpressionSet). [Figure: workflow from Experiment Design through Collect Samples and Run Assays to Analysis, with samples SAMPLE1 to SAMPLE11 and their data files; Arabidopsis thaliana; treatment groups 70%, 90%, 100%.]
  54. 54. http://isatools.wordpress.com/2013/06/18/isacreator-available-in-genomespace/
  57. 57. 4
  58. 58. Submission Tool https://github.com/ISA-tools/ISAcreator/wiki/ENASubmissionTool
  59. 59. Pre-requirements: – registration to ENA/EBI Metagenomics – data upload by one of the methods provided by ENA: http://www.ebi.ac.uk/ena/about/sra_data_upload
  61. 61. https://github.com/ISA-tools/ISAcreator/wiki/ENASubmissionTool
  66. 66. ISA-Tab creation → ISA-Tab validation → ISA-Tab to SRA conversion (SRA-xml schema) → Submission to ENA
  69. 69. 5
  70. 70. http://gigasciencejournal.com http://gigadb.org/dataset/100035
  72. 72. • New open-access, online-only publication for descriptions of scientifically valuable datasets • Only content type: Data Descriptor, narrative + structured parts • Initially focused on the life, environmental and biomedical sciences • Data Descriptor will be complementary to traditional research journals and data repositories • Designed to foster data sharing and reuse, and ultimately to accelerate scientific discovery www.nature.com/scientificdata
  73. 73. Data Descriptors served by Scientific Data. Narrative Section: a brief article-like document with: • Title • Abstract • Background & Summary • Methods • Technical Validation • Usage Notes • Figures & Tables • References. Structured Section: detailed descriptions of the experimental procedures used to produce the data • Following community-defined minimum information requirements, for a level of detail sufficient to reproduce the experiments • Using ontologies & controlled vocabularies, to maximise consistency of the descriptions. www.nature.com/scientificdata
  75. 75. Training Material http://isa-tools.org/training.html
  76. 76. http://isa-tools.org/training.html Hands-on Material • Software: –ISAcreator 1.7.8 (see pre-release) –ISAconfigurator 1.6 • Configurations: –ISA-ENA-MIxS Configuration –default MultiAssay Configuration • ISA-Tab formatted datasets –BII-S-3: Western Channel Water Samples metagenome and metatranscriptome –BII-S-7: Human gut microbiome targeted gene survey • Google Templates and OntoMaton • ISA mapping file
  77. 77. The Exemplar Datasets • BII-S-3: Metagenome and Metatranscriptome on 454
  78. 78. The Exemplar Datasets • BII-S-7: Targeted Gene Survey (16S rRNA) on 454 • Submitted to ENA via ISAcreator: ERP000133
  79. 79. Experimental Metadata Roadmap link to analysis platforms submission to public repositories data publication
  80. 80. ebiteams funders 81
  81. 81. Thanks for your attention! Questions? You can email us... isatools@googlegroups.com View our websites View our Git repo & contribute http://github.com/ISA-tools View our blog http://isatools.wordpress.com Follow us on Twitter @isatools
