Biodiversity science is the field concerned with documenting the Earth’s life forms wherever they are found. For vertebrates and many macroscopic species, the outlook seems grim, as recent headlines show, exemplified here by a BBC News title dating back to September 30th. This is all the more troubling as we are only starting to have the molecular tools to probe life’s very diverse niches.
We are all too aware of the advances in sequencing technologies, with Illumina instruments dominating the market. While those instruments are fast, they are still bulky, and competitors are working hard at developing smaller alternatives (shown here is the Oxford Nanopore MinION, a USB-connected nanopore sequencer), for which a first dataset has been published in BMC GigaScience.
For a long time, scientists have been limited in their exploration by the ‘lens’ through which they were looking. This cannot be more explicitly demonstrated than in the world of microbiology, where only what could be grown in lab conditions would be characterized. The advent of fast, accurate sequencing techniques opened entirely new horizons for the exploration of life. Here are a few examples, from our happy scientists at the zoology department in Oxford, collecting new deep sea samples, to colleagues monitoring extreme habitats such as mining waters. Other projects such as Tara Oceans retrace some of the sea trails followed by the XV century explorers in an attempt to provide a snapshot of marine biodiversity. Finally, biodiversity is found within us too, as shown in this famous Nature article and by projects such as the American Gut.
When it comes to biodiversity studies relying on sequencing techniques, there are in fact 2 main approaches: global or targeted. In the first case, one tries to sequence as much as possible, which means sequencing deeply enough to trawl the rarest (i.e. least abundant) species. But deep sequencing is expensive and requires long machine time, which can be an issue with a limited number of instruments and a vast number of samples to process. The other approach is much more parsimonious but only provides an indirect measure of biodiversity. The technique relies on identifying a genomic region specific to a genus, but variable enough to estimate the spread of subspecies within that genus. Such genomic regions are often coding genes inherited from a common ancestor, which have accumulated mutations and can be used as a proxy to estimate the distance between relatives. For Bacteria, the 16S rRNA gene is used; for Fungi, hyper-variable regions of the ITS region are the prime tool; and the COI gene is often used for Eukaryotes.
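To make the idea of using accumulated mutations as a proxy for distance concrete, here is a minimal sketch, assuming two marker-gene fragments that have already been aligned to the same length. It computes the simple proportion of differing positions (often called the p-distance). The function name and the toy sequences are made up for illustration; real surveys rely on dedicated alignment and phylogenetic tools.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of positions that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Toy, invented marker fragments (not real sequence data).
reference = "ACGTACGTAC"
relative = "ACGTTCGTAA"

# The more mutations separate two relatives, the larger the value.
print(p_distance(reference, relative))
```

A value of 0 means the two fragments are identical over the compared region; values closer to 1 indicate more divergent relatives.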
This brings the need to disambiguate 2 very distinct meanings (even though related in their metaphor) of the notion of ‘barcode’. You will remember we mentioned that instrument occupancy is still a bottleneck (as are reagent costs). Here, multiplexing techniques offer an extremely valuable solution for increasing throughput. Once more, advances in the computational treatment of sequencing reads made it possible to devise library construction techniques that allow tagged samples to be pooled, so that a single reaction well can be used to produce signal. Since the genomic DNA of each sample has been tagged with a ‘multiplex identifier’ (MID), colloquially called a ‘barcode’, it is possible to apply a deconvolution protocol and group together all sequencing reads associated with a tag, and therefore with a sample. This is the first meaning of ‘barcode sequencing’ in the field.
But ‘barcode’ is also met in the ‘Barcode of Life’ project. Here the aim is to define a true nucleic acid profile (if possible in a single gene region) which would uniquely identify a given species. This slide shows the overall workflow and ambition.
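The deconvolution step described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the tag sits at the very start of each read, all tags have the same length, and matching is exact (no tolerance for sequencing errors). The tag sequences, sample names and the `demultiplex` function are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_length):
    """Group pooled reads by sample, using the tag at the start of each read."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        # Reads whose tag is not recognised are kept aside, not discarded.
        sample = tag_to_sample.get(tag, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical multiplex identifiers and pooled reads.
tag_to_sample = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTGGGTTA", "NNNNGGGTTT"]

pooled = demultiplex(reads, tag_to_sample, tag_length=4)
for sample, inserts in pooled.items():
    print(sample, inserts)
```

Production demultiplexers usually allow a small number of mismatches in the tag; exact matching is used here only to keep the sketch short.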
(all on the slide)
Fine, huge amounts of sequencing data are being generated, but those will be of little value if contextual data is missing. The criticality of such annotation was outlined in a 2011 Nature Biotechnology paper by Yilmaz et al., who published the MIXS/MIMARKS minimal information specification. This work was carried out under the Genomic Standards Consortium (GSC) initiative.
The MIXS/MIMARKS checklist provides a framework detailing which metadata to collect, with specific requirements for specific sample types. It is meant to facilitate the exchange of data between centres collecting and archiving environmental samples. We will now show how these guidelines have been implemented by the ISA Team, which generated a set of configurations defining data collection templates.
A quick introduction to the ISA tool suite, which supports data collection, persistence and conversion to a set of formats supported by public repositories. The ecosystem revolves around the ISA-TAB format and supports massively parallel datasets. The slide shows a gradient from left to right, from configuration (annotation guidelines) and curation tools to analysis and usage; people can choose the path that is most convenient for their use case. More recently, we became involved with publishers (NPG and BMC GigaScience).
The main job consisted of 2 steps: i. Step 1 consisted of creating the ISA configurations from the MIMARKS guidelines. This meant binning the metadata tags defined by GSC to the relevant ISA syntactic element. For instance, MIXS geo_loc (geographical location) has been mapped to the ISA Source Name element, while ‘collection device’ has been mapped as a Parameter Value associated with a ‘Protocol’. The screenshot shown here illustrates the ‘distribution’ of MIMARKS tags over an ISA workflow, showing only the annotation related to library preparation and data acquisition. ii. Step 2 consisted of adjusting the ISA SRA converter and mapping the metadata into SRA schema objects. This is where we realized that the same information (MIMARKS) can be mapped in different ways to the same schema (SRA).
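As a rough illustration of the binning described in step 1, the sketch below stores the two mappings mentioned above in a lookup table and derives a column header from each. The table layout, the `isa_column` helper and the protocol name `sample collection` are assumptions made for this example; they are not taken from the actual ISA configuration files. The bracketed `Parameter Value[...]` header follows the usual ISA-TAB column naming style.

```python
# Hypothetical lookup table: MIXS tag -> ISA syntactic element it is binned to.
MIXS_TO_ISA = {
    "geo_loc": {"isa_element": "Source Name"},
    "collection device": {
        "isa_element": "Parameter Value",
        "protocol": "sample collection",  # assumed protocol name
    },
}


def isa_column(tag):
    """Return the column header under which a MIXS tag would be recorded."""
    entry = MIXS_TO_ISA[tag]
    if entry["isa_element"] == "Parameter Value":
        # Parameter values are qualified by the name of the tag they carry.
        return f"Parameter Value[{tag}]"
    return entry["isa_element"]


for tag in MIXS_TO_ISA:
    print(tag, "->", isa_column(tag))
```

Keeping the mapping in a plain table like this makes it easy to review which checklist tag ends up where before any converter is adjusted.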
The example we consider here is that of an environmental gene survey performed on the same sample but using 3 different sets of PCR primers to amplify genomic regions targeting 3 different taxonomic groups. Following ISA templates, the conversion retains that feature, i.e. all libraries are recorded as derived from the same sample. However, other tools will create 3 distinct SRA samples, in which case the identity of the underlying sample has to be assumed. This experience has been used to fully describe these types of assays in a BFO-based ontological framework, in order to ensure semantic accuracy and avoid such pitfalls. The following 2 slides present a graphical representation (in the form of a concept map drawn with CMAPtools) of the ‘targeted gene survey’ assay, exploiting the OBI assay design pattern and augmenting it to accommodate the specifics of the procedure.
This shows the component corresponding to the biomaterial sample collection and preservation.
This shows the component corresponding to the biomaterial processing that generates sequencing libraries, preceding the data acquisition and treatment processes, which ultimately produce an information artifact about a population.
The representation can therefore be exploited to convert ISA spreadsheets holding this type of information and fully clarify the semantics of the tables. Such a mapping can be fed to the ISA RDF conversion module (LinkedISA) as a means of making biodiversity data more linked. Obviously, this pattern is independent of any ISA-based representation, but the same representation can be used as a mapping template, thus providing a pattern to represent such data consistently.
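The difference between the two outcomes can be shown with a toy data structure. This is a sketch only: the record layout, the identifiers and the `count_samples` helper are invented for illustration and do not reproduce the SRA schema.

```python
def count_samples(library_records):
    """Count the distinct sample identifiers referenced by library records."""
    return len({record["sample_id"] for record in library_records})


# ISA-style conversion: the three libraries point back to one sample record.
shared_sample = [
    {"library": "lib_16S", "marker": "16S rRNA", "sample_id": "soil_core_1"},
    {"library": "lib_ITS", "marker": "ITS", "sample_id": "soil_core_1"},
    {"library": "lib_COI", "marker": "COI", "sample_id": "soil_core_1"},
]

# Alternative submission: one sample record is created per library.
sample_per_library = [
    {"library": "lib_16S", "marker": "16S rRNA", "sample_id": "soil_core_1_16S"},
    {"library": "lib_ITS", "marker": "ITS", "sample_id": "soil_core_1_ITS"},
    {"library": "lib_COI", "marker": "COI", "sample_id": "soil_core_1_COI"},
]

print(count_samples(shared_sample))       # libraries share one sample
print(count_samples(sample_per_library))  # each library declares its own sample
```

The same physical material is described in both cases, but a consumer that counts sample records gets a different sample size depending on which representation was deposited.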
Conclusions: (all on the slide, really)
In digging into the details of sequence-based biodiversity assays, we have identified a potential issue in existing representations that affects the ability to accurately assess true sample size. This may result in inconsistencies between the sample size declared in experimental reports and the sample size computed from deposited data. While remedial heuristics can be devised to compensate, they have a cost. Those methods will have to rely on computing distance metrics based on vectors of metadata values and try to infer identity of origin. The key question will be to understand how this may influence downstream data analysis.
This leads to a discussion of the future directions of work that PCO and OBI could look into. These could start with capturing the specifics of the sampling procedures used in environmental and biodiversity studies, for which a number of protocols and guidelines exist, such as the “Marine macrofauna grab sampling method” to give one example. Development could also look into clarifying the actual measurements produced by such studies. Ideally, this work would happen under the OBO Foundry: as people grow more familiar with its development conventions and practices, cross talk becomes more productive, with term dispatch and composition protocols becoming more refined and detailed. This also encourages cross-domain development and outreach to existing and sometimes overlapping efforts. OBI and IAO are currently outlining a plan for alignment; these are encouraging signals for the community.
A big thank you to Ramona Walls, Paula Mabee and the RCN Phenotype group for organizing and leading these twin events; to all the participants of the PCO meeting (Robert Garulnick, Pier Luigi Buttigieg, Adam among others…); to Jie Zheng and the OBI folks; and of course to all my colleagues of the ISA Team (Alejandra, Eamonn, Susanna and Milo), and to you for your attention.
I have to insist on a heartfelt acknowledgment, as it meant swapping this (Oxford floods in February) for this (Arizona desert, February, same year). It was nice to be somewhere dry and in such great company.
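One possible shape for such a remedial heuristic is sketched below, purely as an illustration of the idea and of its cost. It compares sample records field by field (a simple matching score over a vector of metadata values) and treats records that agree above a threshold as having the same origin. The field names, the threshold and both function names are assumptions made for this example.

```python
def metadata_similarity(a, b, fields):
    """Fraction of the chosen metadata fields on which two records agree."""
    matches = sum(1 for field in fields if a.get(field) == b.get(field))
    return matches / len(fields)


def infer_sample_count(records, fields, threshold=1.0):
    """Greedy estimate of distinct origins: a record starts a new group
    unless it is similar enough to an existing group representative."""
    representatives = []
    for record in records:
        if not any(
            metadata_similarity(record, rep, fields) >= threshold
            for rep in representatives
        ):
            representatives.append(record)
    return len(representatives)


# Hypothetical metadata fields and deposited sample records.
fields = ["geo_loc", "collection_date", "depth_m"]
deposited = [
    {"id": "s1", "geo_loc": "site A", "collection_date": "2014-02-10", "depth_m": 5},
    {"id": "s2", "geo_loc": "site A", "collection_date": "2014-02-10", "depth_m": 5},
    {"id": "s3", "geo_loc": "site A", "collection_date": "2014-02-10", "depth_m": 5},
    {"id": "s4", "geo_loc": "site B", "collection_date": "2014-02-11", "depth_m": 8},
]

print(infer_sample_count(deposited, fields))
```

The cost mentioned above shows up directly: two genuinely independent samples taken at the same place, date and depth would be merged by this heuristic, while a typo in one metadata value would split a single sample in two.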
Modeling a Microbial Community and Biodiversity Assay with OBI and PCO OBO Foundry Ontologies: The Interoperability Gains of a Modular Approach
ICBO2014, in Houston Oct 6-9
Philippe Rocca-Serra, Ramona Walls, Jacob Parnell, Rachel Gallery, Jie
Zheng, Susanna Assunta Sansone and Alejandra Gonzalez-Beltran
Biodiversity in the
• Grim headlines
• True for many
• Mankind only now
starts to build tools
exploration of diversity
Exploring the world biodiversity
• Game changing progress in sequencing
– Oxford Nanopore MinION
Biodiversity studies with molecular
• Shotgun sequencing:
– Sequencing as much as possible (probing is
limited by sequencing depth available, the
rarer the species, the deeper the sequencing
needs to be)
• Targeted sequencing:
– Reliance on a ‘marker gene’ whose variability
will be used to estimate distance between
‘Barcode’ as in Multiplexed
Genomic DNA isolated from each individual sample is:
-ligated to a unique short DNA tag (i.e. the ‘barcode’)
-PCR amplified and sequenced
-output as a single collection of reads, which can subsequently be sorted
using the short DNA tag by computational means (the deconvolution process)
‘Barcode’ as in Barcode of Life
• What is a barcode or what is a barcoding
– Metaphors are impenetrable to computers.
– Need to make representation unambiguous
– Barcoding, meaning a technique for
processing more samples in one go ->
another word for multiplexing
– Barcoding, meaning the creation of a unique
profile as a means to identify types of living
Heaps of sequence data for
• What is the value in
the absence of
• Essential annotation
to ascertain identity
and origin, sampling
Helping Data Management
• MIXS Guidelines checklist
• SRA xml schema, Genbank records…
• Tabular Templates for Data Collection
• Wealth of RDF conversion tools
– R2RML W3C data standards
• Even when using the same XML and the same guidelines,
ambiguities nevertheless remain
ISA templates for Microbial
• Integrating MIXS checklist in the ISA
• Mapping MIXS entities into SRA XML
– Properties of sample
– Properties of sample processing
– Properties of resulting libraries
– Properties of data processing
• Library Experiment Sample unicity
• Use Case: creation of libraries for
Bacteria, Fungi, Eukaryota with specific genes
(16S rRNA, ITS, COI)
• ISA conversion to ENA:
– 1 sample -> 3 libraries
• SRA/ENA submission:
– 3 libraries -> 3 samples
Working with OBI, PCO,SO, CHEBI
Drawn using CMAPtools: http://cmap.ihmc.us
Working with OBI, PCO,SO, CHEBI
Drawn using CMAPtools: http://cmap.ihmc.us
OBI-PCO based representation
• ‘targeted gene survey’
• has part some ‘library preparation’ (OBI_0000711)
• ‘polymerase chain reaction’ (OBI_0000415) is_part_of ‘library preparation’ (OBI_0000711)
• ‘polymerase chain reaction’(OBI_0000415)
• has_specified_input some ‘forward pcr primer’ (OBI_0000722)
• has_specified_input some ‘reverse pcr primer’ (OBI_0001951)
• has_specified_input some ‘multiplexing sequence identifier’
• has_specified_input some ‘DNA extract’ (OBI_0001051)
• ‘library preparation’ (OBI_0000711) ‘has_specified_output’ some ‘single fragment library’
• ‘library preparation’ (OBI_0000711) precedes ‘DNA sequencing’(OBI_0000626)
• ‘library sequence deconvolution’ is_preceded_by ‘DNA sequencing’(OBI_0000626)
• ‘library sequence deconvolution’ is_followed_by ‘sequence analysis data transformation’ (OBI_0200187)
• ‘sequence analysis data transformation’ (OBI_0200187) has_specified_output some ‘data
item’ (IAO_0000027) and is about ‘population quality’ (PCO_0000003)
• We have clarified the OWL representation of
several assays commonly used in biodiversity
• We have outlined good practice for serializing
biodiversity experimental processes using the ISA,
SRA and RDF formats
• We have shown how synergies obtained from
resources of the OBO Foundry can greatly benefit
fast development of fit for purpose tabular data
collection templates which greatly help compliance
with annotation standard guidelines.
Why does it matter?
• Correct sample size assessment
• Assessing independence of samples and
• Is it really possible to ascertain the identity of
samples by relying solely on metadata?
• How can such uncertainties affect
downstream analysis / meta analysis?
• Sample Collection Protocols and
Procedures as applied in biodiversity
studies (field studies, “Marine macrofauna
grab sampling method” and so forth)
• Clarify the reporting of actual results
• Keep working with PCO and OBO
Foundry related efforts.
• Dr. Ramona Walls (iPlant, Uni of Arizona)
• Prof. Paula Mabee (Uni South Dakota)
• RCN: Phenotype Ontology Research Coordination
Network , National Science Foundation (NSF-DEB-
0956049), (2010 - 2015)
• Dr. Jie Zheng and OBI companions
• PCO coworkers and RCN workshop participants
• ISA Team