Ontologies and Semantic Web technologies play an important role in the life sciences to help make data more interoperable and reusable. There are now many publicly available ontologies that enable biologists to describe everything from gene function through to animal physiology and disease.
Various efforts such as the Open Biomedical Ontologies (OBO) Foundry provide central registries for biomedical ontologies and ensure they remain interoperable through a set of shared development principles.
At EMBL-EBI we contribute to the development of biomedical ontologies and make extensive use of them in the annotation of public datasets. Biological data typically comes with rich and often complex metadata, so the ontologies provide a standard way to capture “what the data is about” and give us hooks to connect to more data about similar things.
These ontology annotations have been put to good use in a number of large-scale data integration efforts and there’s an increasing recognition of the need for ontologies in making data FAIR (Findable, Accessible, Interoperable and Reusable).
EMBL-EBI build a number of integrative data platforms where ontologies are at the core of our domain models. One example is the Open Targets platform, where data about disease from 18 different databases can be aggregated and grouped based on therapeutic areas in the ontology and used to identify potential drug targets.
The ontologies team at EMBL-EBI provide a suite of services that are aimed at making ontologies more accessible for both humans and machines. We work with scientific data curators and software developers to integrate ontologies and semantics into both the data generation and data presentation workflows. We provide:
– An Ontology Lookup Service (OLS) that provides search and visualisation services for over 200 ontologies
– Services for automating the annotation of metadata and learning from previous annotations (Zooma)
– An ontology mapping and alignment service (OxO)
– Tools for working with metadata and ontologies in spreadsheets (Webulous)
– Software for enriching documents in search engines to support “semantic” query expansion
I’ll present how we are using these services at EMBL-EBI to scale up the semantic annotation of metadata. I’ll talk about our open source technology stack and describe how we utilise a polyglot persistence approach (graph databases, triple stores, document stores, etc.) to optimize how we deliver ontologies and semantics to our users.
Technical Coordinator / Ontology Project Lead
Samples, Phenotypes and Ontologies Team
EMBL-EBI European Bioinformatics Institute
Ontology services for connecting
Connected Data London,
October 4th, 2019
What is EMBL-EBI?
• Europe’s home for biological data services, research
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory,
an intergovernmental research organisation
• International: 650 members of staff from 66 nations
From molecules to medicine
We are always seeking new ways to read
and understand DNA
New technologies provide ways to collect,
compare and visualise molecular
Bioinformatics enables new applications:
• molecular medicine
• environmental sciences
There’s a lot of metadata...
tissues cell lines diseases
How many ways can you say “female”?
18-day pregnant females female (lactating) individual female worker caste (female)
2 yr old female female (pregnant) lgb*cc females sex: female
400 yr. old female female (outbred) mare female, other
adult female female parent female (worker) female child
asexual female female plant monosex female femal
castrate female female with eggs ovigerous female 3 female
cf.female female worker oviparous sexual females female (phenotype)
cystocarpic female female, 6-8 weeks old worker bee female mice
dikaryon female, virgin female enriched female, spayed
dioecious female female, worker pseudohermaprhoditic female femlale
diploid female female(gynoecious) remale metafemale
f femele semi-engorged female sterile female
famale female, pooled sexual oviparous female normal female
femail femalen sterile female worker sf
female females strictly female vitellogenic replete female
female - worker females only tetraploid female worker
female (alate sexual) gynoecious thelytoky hexaploid female
female (calf) healthy female female (gynoecious) female (f-o)
hen probably female (based on morphology)
female (note: this sample was originally provided as a "male" sample to us and therefore labeled this way in the brawand et al. paper
and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a
Courtesy of N. Silvester, European Nucleotide Archive, EMBL-EBI
Need for terminology standards
• Need to ensure we’re all talking about the same thing
• The biomedical science community has been busy
building ontologies and terminology standards
• Over 100 freely-available ontologies from the Open
Biological Ontology (OBO) community
• Most developed with formal semantics in OWL
• Many more terminology standards in use in biomedicine
EBI Ontologies Team
• Build services to make
ontologies accessible for
humans and machines
• Ensure a consistent set of
interoperable ontologies are
used across public datasets to
• Scale up the process to millions
of data points
• Work with software and
database developers to utilise
Data to knowledge
The end result is integrated data with
Ontology driven search
• Semantic query across 20 integrated datasets to identify
potential new drug targets for disease
Aligning data to our ontologies
Organism: Homo sapiens
cell type: Mast cell
Disease: Type II diabetes mellitus
Cell type ontology
Where do you start?
• How do I access ontologies?
• How do I annotate data with ontologies?
• Which ontologies should I use?
• What about data that doesn’t map easily?
• How can I translate from one ontology to another?
• How can I extend an ontology?
• How do I build “ontology aware” applications?
The Ontology Toolkit
Open Source Software
Ontology Lookup Service
• Ontology search engine
• Ontology term history tracking
• Ontology visualisation
• RESTful API
Repository of over 200 pre-selected biomedical ontologies (5+ million terms)
• Provides unified mechanism to access
• 6,000 users / 50 million hits per month
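As a minimal sketch of what calling the OLS RESTful API from a script might look like: the base URL, the `q` and `ontology` query parameters, and the `response.docs` layout with `label`, `obo_id` and `iri` fields are assumptions based on the public OLS search endpoint and may differ between OLS versions. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public OLS search endpoint; check the OLS documentation.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols4/api/search"


def build_search_url(query, ontology=None, base_url=OLS_SEARCH_URL):
    """Build a search URL for a free-text query, optionally limited to one ontology."""
    params = {"q": query}
    if ontology:
        params["ontology"] = ontology
    return base_url + "?" + urlencode(params)


def extract_terms(payload):
    """Pull (label, obo_id, iri) tuples out of a search response.

    Assumes a Solr-style layout: {"response": {"docs": [...]}}.
    Missing fields come back as None rather than raising.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [(d.get("label"), d.get("obo_id"), d.get("iri")) for d in docs]


def search_terms(query, ontology=None):
    """Run the search over HTTP (needs network access)."""
    with urlopen(build_search_url(query, ontology)) as resp:
        return extract_terms(json.load(resp))
```

For example, `search_terms("mast cell", "cl")` would ask for matching terms in the Cell Ontology only. URL building and response parsing are kept separate so the parsing can be exercised without network access.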
The problem with just an ontology lookup
…knowing what you’re looking for
Data annotation services
• Supporting data curation to map to the “right”
• Based on what other databases are doing
• Collect mappings from 10 databases at EBI
and use as a training set to predict how new
unseen data should map to ontologies
• Using previously curated data sources
• Using only ontologies
• Curators review output and feed back into Zooma
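The idea of predicting annotations from previously curated data, with ontology labels as a fallback, can be sketched roughly as follows. This is not Zooma's actual algorithm or API: the class, the scoring rule (most distinct curated sources wins) and the term identifiers in the usage note are illustrative assumptions.

```python
from collections import defaultdict


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


class AnnotationPredictor:
    """Toy predictor: prefer curated mappings, fall back to ontology labels."""

    def __init__(self, ontology_labels):
        # ontology_labels: {label: term_id}, used when no curation exists.
        self.ontology_labels = {normalise(l): t for l, t in ontology_labels.items()}
        # normalised text -> term_id -> set of sources that curated that mapping
        self.curated = defaultdict(lambda: defaultdict(set))

    def add_curated(self, text, term_id, source):
        """Record that a curated source mapped `text` to `term_id`."""
        self.curated[normalise(text)][term_id].add(source)

    def predict(self, text):
        """Return (term_id, evidence); term_id is None for unmapped text."""
        key = normalise(text)
        candidates = self.curated.get(key)
        if candidates:
            # The term backed by the most distinct sources wins; ties break on id.
            best = max(candidates, key=lambda t: (len(candidates[t]), t))
            return best, "curated"
        if key in self.ontology_labels:
            return self.ontology_labels[key], "ontology label"
        return None, "unmapped"
```

With the "female" values shown earlier, a curated mapping for "hen" would be reused for new submissions, "Female " would match the ontology label after normalisation, and a misspelling such as "femlale" would stay unmapped until a curator reviews it and the decision is added with `add_curated`.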
• We’re increasingly seeing data that is described using
• But we don’t always agree on the ontologies to use
Datasource 1 Datasource 2
Ontology Mapping Service (OxO)
• Graph database (Neo4j) of mappings from a number of public sources
• Mappings are often semantically vague (exact, broader, narrower,
• We use the graph to infer potential new mappings, and identify
conflicting sources of mappings
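Inferring potential mappings from a graph of cross-references can be illustrated with a small in-memory sketch. It assumes mappings are treated as undirected edges and that a term reachable within a few hops is a candidate mapping; the function names and the `PREFIX:ID` convention used to spot conflicts are assumptions, not OxO's implementation.

```python
from collections import defaultdict, deque


def build_graph(mappings):
    """mappings: iterable of (term_a, term_b, source) cross-references."""
    graph = defaultdict(set)
    for a, b, _source in mappings:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def mappings_within(graph, start, max_distance=2):
    """Breadth-first search returning {term: hops} for terms up to max_distance away."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        term = queue.popleft()
        if distances[term] == max_distance:
            continue  # do not expand beyond the distance limit
        for neighbour in graph.get(term, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[term] + 1
                queue.append(neighbour)
    del distances[start]
    return distances


def conflicts(reachable):
    """Group reachable terms by ontology prefix; several hits in one ontology
    suggest that the mapping sources disagree."""
    by_prefix = defaultdict(set)
    for term in reachable:
        by_prefix[term.split(":")[0]].add(term)
    return {p: sorted(ts) for p, ts in by_prefix.items() if len(ts) > 1}
```

Terms at distance 1 are asserted mappings, while terms at distance 2 or more are inferred candidates that a curator would still need to review, especially given that the underlying mappings are often semantically vague.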
Under the hood we use Neo4j
• We import OWL ontologies into Neo4j
• Simplify the OWL representation so that it is optimized for common queries
• Model for the application needs
• Scalable applications that are more developer friendly than triple stores
Powerful yet simple queries
• Get the full partonomy and classification of “heart” with
WHERE n.label = "heart"
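Only the WHERE clause of the slide's query survives, so the following is a hedged guess at what such a query could look like, not the original. The `Class` node label and the `SUBCLASSOF` and `PART_OF` relationship types are assumptions about the graph model; the `session` argument is expected to behave like a Neo4j Python driver session.

```python
# Hypothetical graph model: (:Class {label}) nodes linked by SUBCLASSOF and PART_OF.
# The variable-length pattern follows either relationship type over any number of hops.
ANCESTORS_QUERY = """
MATCH (n:Class)-[:SUBCLASSOF|PART_OF*]->(ancestor:Class)
WHERE n.label = $label
RETURN DISTINCT ancestor.label AS ancestor
"""


def ancestors(session, label):
    """Return the sorted labels of everything `label` is a kind of or part of.

    `session` is a Neo4j driver session, or any object with a compatible run().
    """
    result = session.run(ANCESTORS_QUERY, label=label)
    return sorted(record["ancestor"] for record in result)
```

With the official driver, a session would typically come from `GraphDatabase.driver(uri, auth=(user, password)).session()`. Passing the label as a query parameter rather than a literal lets the same query serve any term.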
Using ontologies in our search indexes
Enrich your search
index with ontology
• For text search we compute the closure of all
relationships into our text index
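Computing the closure of relationships into a text index might look like the sketch below. It assumes is-a and part-of edges have been merged into one parent map and that documents carry a list of ontology annotations; the field names and the `EX:` identifiers in the usage note are hypothetical.

```python
def closure(term, parents):
    """All ancestors of `term` reachable through any edge in `parents`.

    parents: {term: set of direct parents}, e.g. merged is-a and part-of edges.
    """
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def enrich(document, parents, labels):
    """Return a copy of the document with ancestor labels added for indexing."""
    terms = set(document["annotations"])
    for term in document["annotations"]:
        terms |= closure(term, parents)
    enriched = dict(document)
    enriched["ontology_text"] = sorted(labels[t] for t in terms)
    return enriched


def search(index, query):
    """Naive exact-label search over the enriched field (labels stored lower-case)."""
    return [d["id"] for d in index if query.lower() in d["ontology_text"]]
```

A sample annotated only with `EX:heart` would then also be found by a search for "cardiovascular system", because the ancestor labels were written into the index at load time rather than resolved at query time.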
Semantic search and data integration with
Publishing the data
• EBI RDF platform contains 7 EBI databases connected
by shared ontologies
• SPARQL access to a subset of EBI data
• But maintenance is hard as it’s not the source of truth for
Aligning schemas to a single model is hard
[Figure: schema diagram of the EBI RDF platform. Recoverable labels: Gene, RNA transcript and Protein (each via identifiers.org); rdfs:seeAlso (not currently linking to identifiers.org but soon); gene expression ratio (Gene Expression Atlas); sio:'is attribute of'; GO BP, GO MF, GO CC; BioModels; gene function; systems.]
Is JSON-LD the answer?
e.g. most services produce JSON via REST
BioSchemas & Schema.org
• Low cost investment (markup in HTML)
• Growing community in the life sciences
• JSON-LD emerging as popular microformat language
• EBI BioSamples database has over 10 million pages
marked up with semantic markup
• Great potential for dataset discovery (finding data
generated from the same samples)
• But not clear who will do the crawling and build the
What we’ve learnt along the way
• The data we see is getting better as the ontologies have matured and
consensus has grown around which ontologies should be used
• Crowdsourcing through tools like Zooma and OxO has good economies of
scale with respect to data curation
• Retrofitting the semantics in this way has limits: there’s still a long tail of data
that we miss
• OWL semantics are essential for building and maintaining our ontologies, but
we’ve had to devise custom ways to utilise the ontologies when building
applications and populating databases
• Developers want more conventional access to semantics (i.e. REST+JSON)
Warren Read, Ola Ajigboye
• EMBL and OpenTargets
• CORBEL This project receives funding from the
European Union’s Horizon 2020 research and
innovation programme under grant agreement No
• EJP cofund
• EXCELERATE ELIXIR-EXCELERATE is funded by
the European Commission within the Research
Infrastructures programme of Horizon 2020, grant
agreement number 676559.
• Funding for Human Cell Atlas from Chan-Zuckerberg
Paola Roncaglia, Henriette Harmse