12th June, 2016
BioHackathon 2016 Symposium, Japan
Facilitating Semantic Alignment of EBI Resources
Simon Jupp
Ontology Project Lead
Samples, Phenotypes and Ontologies Team
www.ebi.ac.uk
SPOT team - Adding value with ontologies
Data Exploration and Cleanup
Data structuring
Ontology Annotation
Data cleaning and mapping
Ontology building
Structured data
Data Enrichment Services
• Building an interoperability toolkit for Europe (ELIXIR)
• Micro-service architecture
• Technology-agnostic
• Pushing boundaries of ontology “embedding”
New ontology lookup service!
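A term lookup against such a service could look roughly like the sketch below. The endpoint URL and the response field names (`response`, `docs`, `label`, `iri`, `ontology_name`) are assumptions modelled on the public OLS search API and may differ between versions; the sample payload and its `example.org` IRI are invented for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Assumed search endpoint; check the current OLS documentation before use.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols4/api/search"


def build_search_url(query, ontology=None, rows=10):
    """Build a search URL for a free-text query, optionally limited to one ontology."""
    params = {"q": query, "rows": rows}
    if ontology:
        params["ontology"] = ontology
    return OLS_SEARCH_URL + "?" + urlencode(params)


def parse_hits(payload):
    """Extract (label, IRI, ontology) tuples from a search response.

    The payload layout is an assumption, not a documented contract.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [(d.get("label"), d.get("iri"), d.get("ontology_name")) for d in docs]


# Invented sample response, used instead of a live request.
SAMPLE = json.loads("""
{"response": {"docs": [
  {"label": "asthma", "iri": "http://example.org/EX_0000001", "ontology_name": "efo"}
]}}
""")

url = build_search_url("asthma", ontology="efo", rows=5)
hits = parse_hits(SAMPLE)
```

Keeping URL construction and response parsing as separate functions means either side can change (a new endpoint, a new payload shape) without touching the other, which fits the technology-agnostic, micro-service framing of the slide.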
Building an ontology toolkit
Data Exploration and Cleanup
Data structuring
Ontology Annotation
Data cleaning and mapping
Ontology building
Webulous
OxO mapping service
Building metadata rich resources
• Ontology markup of experimental variables/samples
• Focus on Phenotype/Disease annotation
• Linking common to rare disease
ArrayExpress
Gene Expression Atlas
EFO mapped coverage (bar chart, axis 0 to 100; bar values 89, 77, 78, 100, 99)
Open Targets Data Mapping Process
Source | Data type | Ontology / vocabulary
Reactome | Metabolic pathways | DOID
GWAS Catalog | Common Disease (GWAS) | EFO
Atlas | Expression | EFO
UniProt | Rare Disease (Expert-reviewed OMIM) | OMIM + own controlled vocab
European Variation Archive | Rare Disease | OMIM + Orphanet + SNOMED + Genetic Alliance + HPO
ChEMBL | Bioactivity data | ATC classification (14 terms)
EuropePMC | Literature Mining | UMLS
IMPC | Mouse Models | MPO + HPO
Cancer Gene Census | Somatic Mutations | own controlled vocab + NCIT
Process: Acquire → Clean → Map to Ontology → Curate → Add new terms → Iterate
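The acquire, clean, map, curate, add-terms, iterate cycle can be sketched as a small loop. This is a minimal illustration using exact matching on normalised labels and synonyms, not the team's actual pipeline; the term IDs (`EX:…`), labels and raw values are hypothetical.

```python
import re


def clean(label):
    """Lower-case the label and collapse punctuation/whitespace runs to one space."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def build_index(terms):
    """Map every cleaned label or synonym to its term ID."""
    index = {}
    for term_id, names in terms.items():
        for name in names:
            index[clean(name)] = term_id
    return index


def map_labels(raw_labels, index):
    """Split raw labels into mapped (label -> term ID) and unmapped (for curation)."""
    mapped, unmapped = {}, []
    for raw in raw_labels:
        term_id = index.get(clean(raw))
        if term_id is None:
            unmapped.append(raw)
        else:
            mapped[raw] = term_id
    return mapped, unmapped


def coverage(mapped, unmapped):
    """Percentage of distinct raw labels that received an ontology term."""
    total = len(mapped) + len(unmapped)
    return 100.0 * len(mapped) / total if total else 0.0


# Hypothetical ontology: term ID -> label plus synonyms.
TERMS = {
    "EX:0001": ["asthma", "bronchial asthma"],
    "EX:0002": ["type 2 diabetes", "T2D"],
}
RAW = ["Asthma ", "Type-2 diabetes", "t2d", "sore elbow"]

# First pass: acquire, clean, map.
mapped, unmapped = map_labels(RAW, build_index(TERMS))

# Curate: a curator adds a new term for the unmapped value, then we iterate.
TERMS["EX:0003"] = ["sore elbow"]
mapped2, unmapped2 = map_labels(RAW, build_index(TERMS))
```

In this toy run, coverage rises from 75% to 100% after one curation round; the unmapped list is what would be handed to curators at each iteration.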
Experimental Factor Ontology: Data Driven Application Ontology
• EFO is an application ontology in OWL, built for use in production services
• Imports from ~10 ontologies, which isolates us from external churn
• Cross-referenced to 25 additional ontologies
• Continuous integration build process, reasoning, manual error checking, multi-editor environment
Chemical Entities of Biological Interest (ChEBI)
Gene Ontology
Cell Type
Anatomy
Phenotype
Disease
Ontologies Data
Managing data evolution in production
Ontology
Annotation
Provenance: who, when, context
Disease
Anatomy
Cell types
Gene function
(GO, HP, MP, UBERON, DO, ORDO)
Phenotype
…
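One way to picture annotation provenance (who, when, context) is an append-only log, where re-annotating a value after an ontology change adds a record rather than overwriting the old one. The class and field names below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    who: str        # curator or pipeline that made the annotation
    when: datetime  # time the annotation was made
    context: str    # e.g. ontology release or curation task


@dataclass(frozen=True)
class Annotation:
    data_value: str  # the raw value found in the data
    term_id: str     # the ontology term assigned to it
    provenance: Provenance


class AnnotationLog:
    """Append-only store: earlier annotations are kept, never overwritten."""

    def __init__(self):
        self._history = {}

    def annotate(self, data_value, term_id, who, context, when=None):
        when = when or datetime.now(timezone.utc)
        ann = Annotation(data_value, term_id, Provenance(who, when, context))
        self._history.setdefault(data_value, []).append(ann)
        return ann

    def current(self, data_value):
        """Most recent annotation for a value, or None if it was never annotated."""
        history = self._history.get(data_value)
        return history[-1] if history else None

    def history(self, data_value):
        """All annotations for a value, oldest first."""
        return list(self._history.get(data_value, []))


log = AnnotationLog()
log.annotate("lung cancer", "EX:0001", who="curator_a", context="release 1")
# The term is replaced in a later ontology release, so the value is re-annotated.
log.annotate("lung cancer", "EX:0042", who="pipeline", context="release 2")
```

Because the history is preserved, a production service can answer both "what is this value mapped to now?" and "who mapped it to what, and under which release?".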
Ontologies in applications
Smarter searching
Data visualisation
Data analysis
Data integration
Open Targets
Which other diseases are associated with PDE4D?
View diseases grouped in therapeutic areas or organised in a tree
View more information about PDE4D
Filter by therapeutic area
BioSolr
“BioSolr aims to significantly advance the state of the art with regards to indexing and querying biomedical data with freely available open source software”
flaxsearch/BioSolr
Solr documents with ontology annotation
Enriched Solr with ontology content (synonyms, structure, relations)
Solr/Elastic plugin
Query expansion and hierarchical faceting
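Ontology-driven query expansion can be illustrated without any search engine: a query term is expanded to its subclasses and synonyms, and the result is joined into an OR query. This sketch is not the BioSolr plugin API; the hierarchy, synonyms and field name are hypothetical, and quote escaping is ignored for brevity.

```python
def descendants(term, children):
    """Return the term plus all of its direct and indirect subclasses."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def expand_query(field, term, children, synonyms):
    """Build an OR query over a term, its subclasses and all their synonyms."""
    names = set()
    for t in descendants(term, children):
        names.add(t)
        names.update(synonyms.get(t, []))
    # Sorted only to make the output deterministic.
    return " OR ".join(f'{field}:"{name}"' for name in sorted(names))


# Hypothetical ontology fragment: parent -> direct subclasses, term -> synonyms.
CHILDREN = {"lung disease": ["asthma", "COPD"]}
SYNONYMS = {"COPD": ["chronic obstructive pulmonary disease"]}

query = expand_query("disease", "lung disease", CHILDREN, SYNONYMS)
```

A search for "lung disease" then also retrieves documents annotated only with "asthma" or "COPD", which is the effect the enriched index aims for; the same `descendants` walk is what a hierarchical facet would count over.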
Making it all FAIR
Data resources at EMBL-EBI
Genes, genomes & variation
RNA Central
ArrayExpress
Expression Atlas
MetaboLights
PRIDE
InterPro Pfam UniProt
ChEMBL SureChEMBL ChEBI
Molecular structures
Protein Data Bank in Europe
Electron Microscopy Data Bank
European Nucleotide Archive
European Variation Archive
European Genome-phenome Archive
Gene, protein & metabolite expression
Protein sequences, families & motifs
Chemical biology
Reactions, interactions & pathways
IntAct Reactome MetaboLights
Systems
BioModels Enzyme Portal BioSamples
Ensembl
Ensembl Genomes
GWAS Catalog
Metagenomics portal
Europe PubMed Central
BioStudies
Gene Ontology
Experimental Factor Ontology
Literature & ontologies
Product of previous biohackathons
EBI RDF Platform
Successes
• Novel queries possible over EBI datasets
• Production quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10-50 million hits per month)
• Lot of interest from industry
• Catalyst for new RDF efforts
Lessons
• Public SPARQL endpoints problematic
• Query federation not performant
• Inference support limited
• Not scalable for all EBI data, e.g. Variation, ENA
• Lack of expertise in service teams
• Too much overhead to get started quickly in this space
Challenges for RDF at EMBL-EBI
• Most EBI resources publish data in forms that support common use cases (pre-integrated)
• Individual teams do the hard work so you don’t have to
• RDF representation not optimised for performance
• Barrier to building real (killer) applications
• Technology not mature enough / developer frameworks lacking
• Doing RDF shouldn’t mandate a technology choice anyway
• RDF not yet a “core” activity for EMBL-EBI
Where we are going next with RDF
• Virtualised infrastructure for RDF
• Simpler cloud deployment
• Building a single EBI RDF cache
• Simpler to manage
• More interesting queries
• Exploring cheaper paths to RDF
• RDF from REST + JSON-LD
• Via Wikidata
• RDFa and schema.org (bioschemas)
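The "RDF from REST + JSON-LD" path can be sketched as follows: an ordinary JSON record from a REST API becomes JSON-LD by attaching an `@context`, and triples fall out of the term mappings. The context, vocabulary IRIs and record are hypothetical, and `to_ntriples` handles only flat documents with string values; a real deployment would use a full JSON-LD processor.

```python
import json

# Hypothetical context mapping JSON keys to predicate IRIs.
CONTEXT = {
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "disease": {"@id": "http://example.org/vocab/disease", "@type": "@id"},
}


def to_jsonld(record, base_iri, context=CONTEXT):
    """Turn a plain REST record into JSON-LD by adding @context and @id."""
    doc = {"@context": context, "@id": base_iri + str(record["id"])}
    doc.update({k: v for k, v in record.items() if k != "id"})
    return doc


def to_ntriples(doc):
    """Expand a flat JSON-LD document to N-Triples lines (tiny subset only).

    Keys without a context mapping are dropped, as a JSON-LD processor would do
    when no default vocabulary is set.
    """
    ctx = doc["@context"]
    lines = []
    for key, value in doc.items():
        if key.startswith("@") or key not in ctx:
            continue
        term = ctx[key]
        if isinstance(term, dict):
            predicate, is_iri = term["@id"], term.get("@type") == "@id"
        else:
            predicate, is_iri = term, False
        # json.dumps gives a quoted, escaped string usable as a plain literal.
        obj = f"<{value}>" if is_iri else json.dumps(value)
        lines.append(f"<{doc['@id']}> <{predicate}> {obj} .")
    return lines


record = {"id": "S1", "label": "lung biopsy",
          "disease": "http://example.org/EX_0001", "internal_flag": "x"}
doc = to_jsonld(record, "http://example.org/sample/")
triples = to_ntriples(doc)
```

The appeal of this route is that the service team keeps serving plain JSON; only the context has to be maintained, and RDF is generated on demand rather than stored.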
Acknowledgements
• Samples, Phenotypes and Ontologies Team
• Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Tony Burdett, Helen Parkinson
• Funding
• European Molecular Biology Laboratory (EMBL)
• European Union projects: DIACHRON, BioMedBridges, CORBEL and Excelerate
Topics of interest for the hackathon
• Ontology Mapping
• Disease (rare, common, phenotypes)
• Data annotation (automated, machine learning, text mining)
• Virtualised RDF data deployment
• RDF on the fly
• RDF over Mongo, Neo4j, Solr, Elastic
• REST + JSON-LD
