BioSamples Database Linked Data
Marco Brandizi, Functional Genomics Team
SWAT4LS Tutorial, Dec 9th, 2013
Find this presentation at http://tiny.cc/bsdswt13
Why a BioSamples Database (aka BioSD)?
• A reference system for searching and browsing information about biological
samples used, or usable, in biomedical experiments
• Focused on the sample context (i.e., independent of the specific assay
type/technology)
• Supports heterogeneous experiments
– A single place that assay repositories can link to (reference samples,
authoritative source for repositories like
Metagenomics/ENA/ArrayExpress)
– A single place for searches and for related-to or same-as relationships
(e.g., see the 'myEquivalents' project)
• Allows for consistency/standardisation of sample attributes/annotations
• Common IT interfaces to access sample information and links to specific
data/repositories (e.g., web, XML/REST, RDF)
Why Linked Data for BioSD?
• Yet another type of interface, potentially useful to application developers
and Linked Data tools
• Integration with similar/related data sets (see the example queries below!)
• Exploitation of ontologies (see below!)
– Standardisation
– A little semantics goes a long way
• Enhanced modelling of certain aspects
– e.g., numbers, intervals, dates and units are detected from string value
labels and triplified.
• Who knows?
– Apps!
– See the Hackathon ideas below!
The BioSD Model
Sample Groups
Submission
External links
Samples
http://www.ebi.ac.uk/biosamples
The BioSD Model
Group's (or Submission's) samples
Sample's (or Group's) attribute types
and values
External links
BioSD Data (External Data Sources)
SPARQL Source: http://tinyurl.com/o95xa5v
Tag Cloud made with http://www.wordle.net
SPARQL Source: http://tinyurl.com/ocyb2ld
BioSD Data (Common Attribute Types)
SPARQL Source: http://tinyurl.com/pjgdtzs
Tag Cloud made with http://www.wordle.net
BioSD Linked Data Model (Main Entities)
Please have a look at:
http://tinyurl.com/lo33ncc
BioSD Linked Data Model (Sample Attributes)
Please have a look at:
http://tinyurl.com/n5oyvyd
Find Samples and attributes
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE
{
  ?smp
    a biosd-terms:Sample;
    biosd-terms:has-bio-characteristic | sio:SIO_000332 ?pv. # is about

  ?pv
    rdfs:label ?pvLabel;
    biosd-terms:has-bio-characteristic-type ?pvType.

  ?pvType
    rdfs:label ?propTypeLabel.
}
• Exercise: use FILTER()/REGEX() to find samples with organism = homo sapiens
• Exercise: find the sample provenance repositories and their links
– Hint: explore the sample's links (?smp) and see what a RepositoryWebRecord
looks like
Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql
Exercise Solution: see the examples on that page
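A possible sketch for the first exercise, not the official solution. It assumes that the attribute type is labelled "organism" and that the value label contains "homo sapiens"; the actual labels in the data may vary, so the patterns may need loosening. The LIMIT is only there to keep the result set small.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE
{
  ?smp
    a biosd-terms:Sample;
    biosd-terms:has-bio-characteristic ?pv.

  ?pv
    rdfs:label ?pvLabel;
    biosd-terms:has-bio-characteristic-type ?pvType.

  ?pvType
    rdfs:label ?propTypeLabel.

  # Assumption: the type label is exactly "organism" (case-insensitive)
  FILTER ( REGEX ( STR ( ?propTypeLabel ), "^organism$", "i" ) ).
  # Assumption: the value label mentions the species name as plain text
  FILTER ( REGEX ( STR ( ?pvLabel ), "homo sapiens", "i" ) ).
}
LIMIT 100
```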
Samples about a given organism
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>

SELECT DISTINCT ?smp ?pvLabel ?propTypeLabel
WHERE {
  ?smp biosd-terms:has-bio-characteristic ?pv.

  ?pv biosd-terms:has-bio-characteristic-type ?pvType;
      rdfs:label ?pvLabel.

  ?pvType a ?pvTypeClass.

  ?pvTypeClass
    rdfs:label ?propTypeLabel;
    # '*' gives you the transitive closure, even when inference is disabled
    rdfs:subClassOf* <http://purl.obolibrary.org/obo/NCBITaxon_1637>. # Listeria
}
• Exercise: use the Bioportal service to first find all subclasses of 'alcohol' (obo:CHEBI_30879)
and then search for samples annotated with such subclasses
– Hint: use SERVICE <http://sparql.bioontology.org/ontologies/sparql/?apikey=KEY>
Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql
Exercise Solution: see one of the examples on that page
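One way the federated exercise might be approached, as a sketch only. KEY is the placeholder from the hint and must be replaced with a real Bioportal API key. The query assumes that Bioportal exposes rdfs:subClassOf for CHEBI under the same purl.obolibrary.org URIs, and it fetches direct subclasses only; whether the remote endpoint supports the '*' property path is not something these slides state.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT DISTINCT ?smp ?pvLabel ?alcoholClass
WHERE {
  # Step 1: ask Bioportal for the (direct) subclasses of 'alcohol'
  SERVICE <http://sparql.bioontology.org/ontologies/sparql/?apikey=KEY> {
    ?alcoholClass rdfs:subClassOf obo:CHEBI_30879.
  }

  # Step 2: find BioSD samples whose attribute type is an instance of such a class
  ?smp biosd-terms:has-bio-characteristic ?pv.

  ?pv biosd-terms:has-bio-characteristic-type ?pvType;
      rdfs:label ?pvLabel.

  ?pvType a ?alcoholClass.
}
LIMIT 100
```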
Geo-located Samples/Sample Groups
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?item ?latVal ?longVal WHERE {
  ?item biosd-terms:has-bio-characteristic ?latPv, ?longPv.

  ?latPv
    biosd-terms:has-bio-characteristic-type [ rdfs:label ?latLabel ];
    sio:SIO_000300 ?latVal. # sio:has value
  FILTER ( REGEX ( ?latLabel, "latitude", "i" ) ).

  ?longPv
    biosd-terms:has-bio-characteristic-type [ rdfs:label ?longLabel ];
    sio:SIO_000300 ?longVal. # sio:has value
  FILTER ( REGEX ( ?longLabel, "longitude", "i" ) ).
}
• Exercise: find all samples having an attribute of type temperature, with a numerical value and a unit
specified. Hint: use sio:SIO_000221 (has unit), sio:SIO_000300 (has value)
• Exercise: find samples/groups annotated with intervals, which use the properties
biosd-terms:has-low-value and biosd-terms:has-high-value and optionally have a unit.
Try it at: http://www.ebi.ac.uk/rdf/services/biosamples/sparql
Exercise Solutions: see the examples on that page
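A sketch for the temperature exercise, under assumptions: both sio:SIO_000300 (has value) and sio:SIO_000221 (has unit) are taken to be attached to the attribute value node, as in the latitude/longitude query above, and the unit is assumed to carry an rdfs:label. Neither detail is confirmed by these slides, so check the model diagrams before relying on it.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?smp ?propTypeLabel ?val ?unitLabel
WHERE {
  ?smp
    a biosd-terms:Sample;
    biosd-terms:has-bio-characteristic ?pv.

  ?pv
    biosd-terms:has-bio-characteristic-type [ rdfs:label ?propTypeLabel ];
    sio:SIO_000300 ?val;   # sio:has value
    sio:SIO_000221 ?unit.  # sio:has unit (assumed to hang off the value node)

  FILTER ( REGEX ( STR ( ?propTypeLabel ), "temperature", "i" ) ).
  # Keep only values that were detected as numbers
  FILTER ( isNumeric ( ?val ) ).

  # Assumption: units have a human-readable label
  OPTIONAL { ?unit rdfs:label ?unitLabel. }
}
LIMIT 100
```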
Expressed Genes and Samples
• For http://purl.uniprot.org/uniprot/P04637 (P53 in human):
• Find the EFO classes for which it is up-regulated in the Atlas (p-value < 1E-9)
• And show the Atlas expression value label. Hints:
– Start from the example http://tinyurl.com/kvvhw6b,
– Use the Atlas endpoint: http://www.ebi.ac.uk/rdf/services/atlas/sparql
• Find the samples having attributes that are instances of such EFO classes
• and which come from a repository other than 'ArrayExpress'
• Hints:
– Use SERVICE <http://www.ebi.ac.uk/rdf/services/biosamples/sparql> and a sub-query
– Search for property values linked to property types that are instances of the
experimental factors found by the Atlas
– Then link to the samples, the samples to the submissions, the submissions to the web
records
• Or just have a look: http://tinyurl.com/ln3m7nv (will take a while...)
Ideas for the Hackathon
• Refer to http://tinyurl.com/mo7wgye for details
• From geo-located samples (samples annotated with latitude/longitude) to Google
maps, e.g., by using Exhibit (http://www.simile-widgets.org/exhibit/)
• Take similar data sets (e.g., MAASTRO, Breast Cancer Data, your data), unify the
schemas (e.g., using CONSTRUCT), define federated queries
• Use the Shape or OpenPHACTS validator to define sensible rules for BioSD and
similar data sets, e.g., must contain an organism, should have a treatment
• Design/build an App (or Web widget) that asks for eligibility criteria, i.e., pairs of
attribute value/type, and translates them into a SPARQL query (or a more complex
search based on SPARQL) to find samples
– Use common ontologies for auto-completion over property types
– Use string-based auto-completion for values
– Consider numerical values, intervals, units
– Do approximate matching, i.e., matching 8/10 of the specified pairs is good.
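The schema-unification idea above (using CONSTRUCT) could start from something like the following sketch. The ex: namespace and its terms (ex:Specimen, ex:hasAttribute, ex:attributeName, ex:attributeValue) are hypothetical and stand in for whatever common schema the participating data sets agree on.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biosd-terms: <http://rdf.ebi.ac.uk/terms/biosd/>
# Hypothetical target vocabulary, for illustration only
PREFIX ex: <http://example.org/unified-schema/>

CONSTRUCT {
  ?smp
    a ex:Specimen;
    ex:hasAttribute ?pv.

  ?pv
    ex:attributeName ?propTypeLabel;
    ex:attributeValue ?pvLabel.
}
WHERE {
  ?smp
    a biosd-terms:Sample;
    biosd-terms:has-bio-characteristic ?pv.

  ?pv
    rdfs:label ?pvLabel;
    biosd-terms:has-bio-characteristic-type [ rdfs:label ?propTypeLabel ].
}
LIMIT 1000
```

A similar CONSTRUCT over each of the other data sets would yield triples in the same shape, which a federated query could then treat uniformly.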
Acknowledgements
• BioSD Team - Alvis Brazma, Tony Burdett, Adam
Faulconbridge, Mike Gostev, Helen Parkinson, Rui Perreria,
Ugis Sarkans, Drashtti Vasant
• Tony Burdett for the help with Zooma
• Simon Jupp, Andy Jenkinson, James Malone, for their great
help with developing and setting up BioSD/RDF
– The rest of the Linked Data team @EBI
(http://www.ebi.ac.uk/rdf)
• BiomedBridges FP7 project (http://www.biomedbridges.eu), for
funding us
And you all!
Sorry, we have 2.7M samples, but not all of them...
(Source: http://en.wikipedia.org/wiki/File:Assorted_computer_mice_-_MfK_Bern.jpg)
Contact info:
www.ebi.ac.uk/biosamples
www.marcobrandizi.info
Main Ontologies used in BioSD / Linked Data
• biosd-terms (http://tiny.cc/biosd_terms)
– a small application ontology defining specific classes and properties, e.g.,
sample, sample group, has-knowledgeable-person
• Experimental Factor Ontology (EFO)
– mainly to define/annotate sample attributes
• Ontology for Biomedical Investigations (OBI)
• Information Artifact Ontology (IAO)
• Semanticscience Integrated Ontology (SIO)
– to define main classes in BioSD/RDF
• Bibliographic Ontology (BIBO)
– We link publications about submissions/sample sets
• Dublin Core, schema.org, FOAF
– for general categories and in the Linked Data spirit
• Linked automatically by Zooma: many more (e.g., CHEBI, NCBI-Tax, GO)