Ontology Services for the Biomedical Sciences


Ontologies and Semantic Web technologies play an important role in the life sciences to help make data more interoperable and reusable. There are now many publicly available ontologies that enable biologists to describe everything from gene function through to animal physiology and disease.

Various efforts such as the Open Biomedical Ontologies (OBO) Foundry provide central registries for biomedical ontologies and keep them interoperable through a set of shared development principles.

At EMBL-EBI we contribute to the development of biomedical ontologies and make extensive use of them in the annotation of public datasets. Biological data typically comes with rich and often complex metadata, so the ontologies provide a standard way to capture “what the data is about” and give us hooks to connect to more data about similar things.

These ontology annotations have been put to good use in a number of large-scale data integration efforts and there’s an increasing recognition of the need for ontologies in making data FAIR (Findable, Accessible, Interoperable and Reusable).

EMBL-EBI build a number of integrative data platforms where ontologies are at the core of our domain models. One example is the Open Targets platform, where data about disease from 18 different databases can be aggregated and grouped based on therapeutic areas in the ontology and used to identify potential drug targets.

The ontologies team at EMBL-EBI provide a suite of services that are aimed at making ontologies more accessible for both humans and machines. We work with scientific data curators and software developers to integrate ontologies and semantics into both the data generation and data presentation workflows. We provide:

– An ontology lookup service (OLS) that provides search and visualisation services for over 200 ontologies

– Services for automating the annotation of metadata and learning from previous annotations (Zooma)

– An ontology mapping and alignment service (OxO)

– Tools for working with metadata and ontologies in spreadsheets (Webulous)

– Software for enriching documents in search engines to support “semantic” query expansion

I’ll present how we are using these services at EMBL-EBI to scale up the semantic annotation of metadata. I’ll talk about our open source technology stack and describe how we use a polyglot persistence approach (graph databases, triple stores, document stores, etc.) to optimize how we deliver ontologies and semantics to our users.

Published in: Data & Analytics
Ontology Services for the Biomedical Sciences

  1. 1. Simon Jupp Technical Coordinator / Ontology Project Lead Samples, Phenotypes and Ontologies Team EMBL-EBI European Bioinformatics Institute Ontology services for connecting biomedical data Connected Data London, October 4th, 2019
  2. 2. What is EMBL-EBI? • Europe’s home for biological data services, research and training • A trusted data provider for the life sciences • Part of the European Molecular Biology Laboratory, an intergovernmental research organisation • International: 650 members of staff from 66 nations
  3. 3. From molecules to medicine We are always seeking new ways to read and understand DNA New technologies provide ways to collect, compare and visualise molecular information Bioinformatics enables new applications: • molecular medicine • agriculture • food • environmental sciences
  4. 4. Data resources at EMBL-EBI
  5. 5. There’s a lot of metadata... tissues cell lines diseases
  6. 6. How many ways can you say “female”? 18-day pregnant females female (lactating) individual female worker caste (female) 2 yr old female female (pregnant) lgb*cc females sex: female 400 yr. old female female (outbred) mare female, other adult female female parent female (worker) female child asexual female female plant monosex female femal castrate female female with eggs ovigerous female 3 female cf.female female worker oviparous sexual females female (phenotype) cystocarpic female female, 6-8 weeks old worker bee female mice dikaryon female, virgin female enriched female, spayed dioecious female female, worker pseudohermaprhoditic female femlale diploid female female(gynoecious) remale metafemale f femele semi-engorged female sterile female famale female, pooled sexual oviparous female normal female femail femalen sterile female worker sf female females strictly female vitellogenic replete female female - worker females only tetraploid female worker female (alate sexual) gynoecious thelytoky hexaploid female female (calf) healthy female female (gynoecious) female (f-o) hen probably female (based on morphology) female (note: this sample was originally provided as a "male" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a female individual) Courtesy of N. Silvester, European Nucleotide Archive, EMBL-EBI
  7. 7. Need for terminology standards • Need to ensure we’re all talking about the same thing • The biomedical science community has been busy building ontologies and terminology standards • Over 100 freely-available ontologies from the Open Biological Ontology (OBO) community • Most developed with formal semantics in OWL • Many more terminology standards in use in biomedicine Tibia?
  8. 8. EBI Ontologies Team • Build services to make ontologies accessible for humans and machines • Ensure a consistent set of interoperable ontologies are used across public datasets to maximise interoperability • Scale up the process to millions of data points • Work with software and database developers to utilise the ontologies Data to knowledge
  9. 9. The end result is integrated data with semantic search Expression Atlas GWAS catalog
  10. 10. Ontology driven search • Semantic query across 20 integrated datasets to identify potential new drug targets for disease
  11. 11. Aligning data to our ontologies Organism: Homo sapiens cell type: Mast cell Disease: Type II diabetes mellitus Organism part: pancreas CL:0000097 Cell type ontology Where do you start?
  12. 12. Typical questions • How do I access ontologies? • How do I annotate data with ontologies? • Which ontologies should I use? • What about data that doesn’t map easily? • How can I translate from one ontology to another? • How can I extend an ontology? • How do I build “ontology aware” applications?
  13. 13. The Ontology Toolkit Open Source Software
  14. 14. Ontology Lookup Service • Ontology search engine • Ontology term history tracking • Ontology visualisation • RESTful API Repository of over 200 pre-selected biomedical ontologies (5+ million terms) • Provides unified mechanism to access multiple ontologies • 6,000 users / 50 million hits per month
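A minimal sketch of how a client might use the RESTful API mentioned on this slide. The base URL, the `q`, `ontology` and `rows` parameters, and the Solr-style response shape (`response.docs` with `obo_id` and `label` fields) are assumptions drawn from the public OLS search endpoint and should be checked against the current OLS documentation. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL for the OLS search endpoint; verify against the OLS docs.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols4/api/search"


def build_search_url(query, ontology=None, rows=10):
    """Build a search URL for a free-text query, optionally limited to one ontology."""
    params = {"q": query, "rows": rows}
    if ontology:
        params["ontology"] = ontology
    return OLS_SEARCH_URL + "?" + urlencode(params)


def extract_terms(payload):
    """Pull (obo_id, label) pairs out of a Solr-style search response (assumed shape)."""
    docs = payload.get("response", {}).get("docs", [])
    return [(doc.get("obo_id"), doc.get("label")) for doc in docs]
```

The URL can be fetched with any HTTP client, for example `urllib.request.urlopen`, and the decoded JSON passed to `extract_terms`. Keeping the URL building and the parsing separate from the network call makes both easy to test offline.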
  15. 15. Visualisation tools
  16. 16. The problem with just an ontology lookup …knowing what you’re looking for
  17. 17. Data annotation services • Supporting data curation to map to the “right” terms • Based on what other databases are doing • Collect mappings from 10 databases at EBI and use as a training set to predict how new unseen data should map to ontologies “Heart” UBERON:0000948 + Context (where, when?)
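The idea of learning from what other databases have already curated can be illustrated with a toy frequency model. This is not Zooma's actual algorithm, only a sketch under simple assumptions: prior annotations are (text, term ID) pairs, matching is on normalised text, and confidence is the share of prior curators who chose the winning term. All function names, and the `EX:0000001` identifier used in the usage note, are hypothetical.

```python
from collections import Counter, defaultdict


def normalise(text):
    """Lower-case and collapse whitespace so trivially different labels match."""
    return " ".join(text.lower().split())


def build_index(curated):
    """Count how often each normalised text value was mapped to each term ID."""
    index = defaultdict(Counter)
    for text, term_id in curated:
        index[normalise(text)][term_id] += 1
    return index


def predict(index, text):
    """Return (term_id, confidence) for the most frequent prior mapping, or None."""
    counts = index.get(normalise(text))
    if not counts:
        return None
    term_id, hits = counts.most_common(1)[0]
    return term_id, hits / sum(counts.values())
```

For example, if "Heart" and "heart " were both previously mapped to UBERON:0000948 and "heart" was once mapped to a placeholder term EX:0000001, then `predict(index, "HEART")` proposes UBERON:0000948 with a confidence of two thirds. Values with no prior mapping return `None` and would be left for a curator.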
  18. 18.
  19. 19. • Using previously curated data sources
  20. 20. • Using only ontologies • Curators review output and feed back into Zooma Reviewers
  21. 21. • We’re increasingly seeing data that is described using ontologies • But we don’t always agree on the ontologies to use Datasource 1 Datasource 2 Human Phenotype Ontology SNOMED-CT Mappings Ontology Mapping Service (OxO)
  22. 22. Ontology Mapping Service (OxO) • Graph database (Neo4j) of mappings from a number of public sources • Mappings are often semantically vague (exact, broader, narrower, related) • We use the graph to infer potential new mappings, and identify conflicting sources of mappings
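One way to picture inferring potential mappings from a graph is a bounded breadth-first search: terms that are not directly mapped but sit a few hops apart become candidates for review. The sketch below is an in-memory stand-in, not the OxO implementation or its Neo4j queries. The term IDs in the usage note are placeholders, and treating every mapping as an undirected edge is a simplifying assumption.

```python
from collections import deque


def build_graph(mappings):
    """Treat each asserted mapping as an undirected edge between two term IDs."""
    graph = {}
    for source, target in mappings:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set()).add(source)
    return graph


def mappings_within(graph, start, max_distance=2):
    """Breadth-first search returning {term_id: hops} for terms within max_distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_distance:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    del distances[start]
    return distances
```

With asserted mappings A:1 to B:1, B:1 to C:1 and C:1 to D:1, a search from A:1 with `max_distance=2` returns B:1 at one hop (asserted) and C:1 at two hops (inferred candidate). Because mappings are often vague, longer paths are weaker evidence, which is why the hop count is returned rather than a yes/no answer.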
  23. 23. Under the hood we use Neo4j • We import OWL ontologies into Neo4j • Simplify the OWL representation so that it is optimized for common queries • Model for the application needs • Scalable applications that are more developer friendly than triple stores
  24. 24. Powerful yet simple queries • Get the full partonomy and classification of “heart” with Cypher: MATCH (n)-[r:SUBCLASSOF|PARTOF*]->(parents) WHERE n.label = "heart" RETURN parents
  25. 25. Using ontologies in our search indexes Enrich your search index with ontology goodness • For text search we compute the closure of all relationships into our text index
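Computing the closure of relationships for a text index can be sketched as follows: for each annotated term, collect every ancestor reachable over subclass and part-of edges, then index the ancestors alongside the term. The tiny parent map in the usage note is illustrative only and is not taken from a real ontology release. The function names are hypothetical, and the slide's actual indexing pipeline is not shown in the source.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def index_text(term, parents):
    """Text to index for a record annotated with `term`: the term plus its closure."""
    return " ".join([term] + sorted(ancestors(term, parents)))
```

Given a toy map `{"left ventricle": ["heart"], "heart": ["organ"]}`, a record annotated with "left ventricle" is indexed with "heart" and "organ" as well, so a plain text search for "heart" finds it without the search engine knowing anything about the ontology at query time.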
  26. 26. Semantic search and data integration with ontologies
  27. 27. Publishing the data • EBI RDF platform contains 7 EBI databases connected by shared ontologies • SPARQL access to a subset of EBI data • But maintenance is hard as it’s not the source of truth for the data
  28. 28. Aligning schemas to a single model is hard [Diagram: RDF schema linking Gene Expression Atlas, Ensembl, UniProt, Gene Ontology (GO BP, GO MF, GO CC), ChEMBL, BioModels, ChEBI, Reactome, SBO and PDB through predicates such as rdfs:seeAlso, bq:is, bq:isVersionOf, bq:hasPart, bq:occursIn, uniprot:classifiedWith, uniprot:organism, uniprot:transcribedFrom and uniprot:translatedTo, covering genes, drugs, species, proteins, protein structure, reactions, gene function, systems and disease. Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema]
  29. 29. Is JSON-LD the answer? e.g. Most services produce JSON via REST API
  30. 30. Ensembl REST API
  31. 31. Slight tweak to make RDF compatible "@context" : { "@vocab" : "", "obo" : "", "dcterms" : "", "faldo" : "", "biotype" : { "@id" : "", "@type" : "@vocab" }, "protein_coding" : "obo:SO_0001217", "id" : "dcterms:identifier", "homo_sapiens" : "", "species" : { "@id" : "obo:OBO_0100026", "@type" : "@vocab" }, "description" : "dcterms:description", "display_name" : "" } Using JSON-LD to assign ontology semantics to existing data
  32. 32. Ensembl JSON as RDF triples "@context" : { "@vocab" : "", "obo" : "", "dcterms" : "", "faldo" : "", "biotype" : { "@id" : "", "@type" : "@vocab" }, "protein_coding" : "obo:SO_0001217", "id" : "dcterms:identifier", "homo_sapiens" : "", "species" : { "@id" : "obo:OBO_0100026", "@type" : "@vocab" }, "description" : "dcterms:description", "display_name" : "" }
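The "slight tweak" can be sketched as attaching an `@context` to a record the REST API already returns, leaving the data untouched. The prefix IRIs below are assumptions, because the context values on the slide did not survive extraction. The two used here are the standard OBO PURL and Dublin Core Terms namespaces. The `expand` helper is a deliberately simplified stand-in for a real JSON-LD processor and handles only prefixed names.

```python
# Prefix IRIs are assumptions: the slide's own context values were lost.
PREFIXES = {
    "obo": "http://purl.obolibrary.org/obo/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Keys and values from the existing JSON mapped to compact IRIs, as on the slide.
TERMS = {
    "id": "dcterms:identifier",
    "description": "dcterms:description",
    "protein_coding": "obo:SO_0001217",
}


def expand(compact):
    """Expand a compact IRI such as 'obo:SO_0001217' using PREFIXES."""
    prefix, _, local = compact.partition(":")
    return PREFIXES[prefix] + local if prefix in PREFIXES else compact


def with_context(record):
    """Return a copy of the record with an '@context' added; the data is unchanged."""
    context = dict(PREFIXES)
    context.update(TERMS)
    return {"@context": context, **record}
```

Existing consumers can keep reading `id` and `description` as before, while an RDF-aware client can interpret the same keys as `dcterms:identifier` and `dcterms:description`. A full conversion to triples would need a JSON-LD library rather than this helper.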
  33. 33. BioSchemas & • Low cost investment (markup in HTML) • Community growing for life sciences • • JSON-LD emerging as popular microformat language • EBI BioSamples database has over 10 million pages marked up with semantic markup • Great potential for dataset discovery (finding data generated from the same samples) • But not clear who will do the crawling and build the indexes…
  34. 34. What we’ve learnt along the way • The data we see is getting better as the ontologies have matured and consensus has grown around which ontologies should be used • Crowdsourcing through tools like Zooma and OxO has good economies of scale with respect to data curation • Retrofitting the semantics in this way has limits; there’s still a long tail of data that we miss • OWL semantics are essential for building and maintaining our ontologies, but we’ve had to devise custom ways to utilise the ontologies when building applications and populating databases • Developers want more conventional access to semantics (i.e. REST+JSON)
  35. 35. Ontology team Helen Parkinson Warren Read Ola Ajigboye Funding • EMBL and OpenTargets • CORBEL This project receives funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 654248. • EJP cofund • EOSC-Life • EXCELERATE ELIXIR-EXCELERATE is funded by the European Commission within the Research Infrastructures programme of Horizon 2020, grant agreement number 676559. • Funding for Human Cell Atlas from Chan-Zuckerberg Initiative Paola Roncaglia Henriette Harmse Simon Jupp Zoe Pendlington Nicolas Matentzoglu David Osumi-Sutherland