Tony Burdett's slides from his talk at Connected Data London. Tony is a Senior Software Engineer at the European Bioinformatics Institute (EMBL-EBI). He presented the complexity of the data held at EMBL-EBI and the institute's approach to making sense of it.
Connecting life sciences data at the European Bioinformatics Institute
1. 12th July, 2016
Connecting life sciences data at the
European Bioinformatics Institute
Tony Burdett
Technical Co-ordinator – Samples, Phenotypes and Ontologies Team
www.ebi.ac.uk
3. What is EMBL-EBI?
• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
• International: 570 members of staff from 57 nations
• Home of the ELIXIR Technical hub
4. OUR MISSION
To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress
5. Big data, big demand
• ~18.5 million requests to EMBL-EBI websites every day
• 60 petabytes of EMBL-EBI storage capacity
• EMBL-EBI handles 9.2 million jobs on average per month
• Scientists at over 5 million unique sites use EMBL-EBI websites
6. From molecules to medicine
Biology is changing:
• Lower-cost sequencing
• More data produced
• New types of data
• Emphasis on systems biology
Bioinformatics enables new applications:
• molecular medicine
• agriculture
• food
• environmental sciences
7. Data resources at EMBL-EBI
Genes, genomes & variation
RNA Central
ArrayExpress
Expression Atlas
MetaboLights
PRIDE
InterPro Pfam UniProt
ChEMBL SureChEMBL ChEBI
Molecular structures
Protein Data Bank in Europe
Electron Microscopy Data Bank
European Nucleotide Archive
European Variation Archive
European Genome-phenome Archive
Gene, protein & metabolite expression
Protein sequences, families & motifs
Chemical biology
Reactions, interactions & pathways
IntAct Reactome MetaboLights
Systems
BioModels Enzyme Portal BioSamples
Ensembl
Ensembl Genomes
GWAS Catalog
Metagenomics portal
Europe PubMed Central
BioStudies
Gene Ontology
Experimental Factor Ontology
Literature & ontologies
8. Database interactions
• Collaborative community facilitates social, scientific and technical interactions
• Right: internal interactions between data resources as determined by the exchange of data
• Width of each internal arc weighted according to the number of different data types exchanged
9. Biology 101 – Central Dogma
Dhorspool at en.wikipedia [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)
or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
10. Sadly, it’s not *quite* that simple…
User:Dhorspool [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)
or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
16. How do we turn data into Linked Data?
(Example from the Gene Expression Atlas)
Relational Data to RDF graph conversion
• Give “things” URIs
• Type “things” with ontologies
• Link “things” to other related “things”
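The three steps above can be sketched in a few lines of Python. This is only an illustration, not the Expression Atlas pipeline: the row layout, the `row_to_triples` helper, the `expressedIn` property and the Ensembl URI prefix are assumptions, and the row values are invented.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed URI patterns and vocabulary, for illustration only.
ENSEMBL_PREFIX = "http://rdf.ebi.ac.uk/resource/ensembl/"
EXPRESSED_IN = "http://example.org/vocab/expressedIn"  # hypothetical property

# One invented relational row: a gene, its ontology type and a related tissue.
ROW = {
    "gene_id": "ENSMUSG00000001467",
    "type_uri": "http://purl.obolibrary.org/obo/SO_0001217",
    "tissue_uri": "http://purl.obolibrary.org/obo/UBERON_0000948",
}


def row_to_triples(row):
    """Turn one row (a dict) into a list of (subject, predicate, object) URIs."""
    gene_uri = ENSEMBL_PREFIX + row["gene_id"]             # 1. give the thing a URI
    return [
        (gene_uri, RDF_TYPE, row["type_uri"]),             # 2. type it with an ontology class
        (gene_uri, EXPRESSED_IN, row["tissue_uri"]),       # 3. link it to a related thing
    ]


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)


if __name__ == "__main__":
    print(to_ntriples(row_to_triples(ROW)))
```

Each row yields one typing triple and one linking triple, so the graph grows linearly with the number of rows and columns mapped.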
17. Modeling data vs biology
• Typing and semantics are the main strengths of RDF, so we focused on this aspect
• There are a lot of ontologies for the life sciences
• However, most of them model biology rather than data
• What does an Ensembl entry represent? Is an Ensembl identifier really an instance of a Sequence Ontology Gene class?
ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene'
18. Database Entry or Real World Entity?
• Practically, it makes sense to treat database entries as proxies for the real world entities they represent
• The alternative introduces a layer of indirection that would only make linking resources harder
• It means we can use biologically meaningful relationships
• But this may or may not work for all use cases
ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene'
ensembl:ENSMUST00000001507 rdf:type so:'transcript'
ensembl:ENSMUST00000001507 so:'transcribed from' ensembl:ENSMUSG00000001467
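The cost of the extra layer of indirection can be made concrete with a small sketch. Both tiny triple sets below are illustrative: the `ex:describes` property and the `ex:gene_1` / `ex:transcript_1` entity nodes are hypothetical names for the alternative model, not something used by the EBI RDF Platform.

```python
# Model A: database entries act as proxies for the real-world entities.
PROXY = {
    ("ensembl:ENSMUST00000001507", "so:transcribed_from", "ensembl:ENSMUSG00000001467"),
}

# Model B: each entry "describes" a separate entity node (hypothetical names).
INDIRECT = {
    ("ensembl:ENSMUST00000001507", "ex:describes", "ex:transcript_1"),
    ("ensembl:ENSMUSG00000001467", "ex:describes", "ex:gene_1"),
    ("ex:transcript_1", "so:transcribed_from", "ex:gene_1"),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def gene_entry_proxy(transcript_entry):
    """One hop: follow the biological relationship directly."""
    return objects(PROXY, transcript_entry, "so:transcribed_from")


def gene_entry_indirect(transcript_entry):
    """Three hops: entry -> entity -> related entity -> entry."""
    found = set()
    for transcript in objects(INDIRECT, transcript_entry, "ex:describes"):
        for gene in objects(INDIRECT, transcript, "so:transcribed_from"):
            found |= subjects(INDIRECT, "ex:describes", gene)
    return found
```

Both functions answer the same question, but the indirect model needs three triples and three lookups where the proxy model needs one.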
19. Knowledge representation challenges
• The semantics of our data is complex
• The provenance models are even more complex
• The relationships are hard to define
• Balancing use-cases with representation is a major challenge
• The harder you try to get representation correct, the harder it is for users to query
• Performance drops off for simple queries
21. EBI RDF Platform
Successes
• Novel queries possible over EBI datasets
• Production quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10-50 million hits per month)
• A lot of interest from industry
• Catalyst for new RDF efforts
Lessons
• Public SPARQL endpoints are problematic
• Query federation is not performant
• Inference support is limited
• Not scalable for all EBI data, e.g. Variation, ENA
• Lack of expertise in service teams
• Too much overhead to get started quickly in this space
22. Ontologies for life sciences
Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue / enzyme source
Development
Anatomy
Phenotype
Plasmodium life cycle
- Sequence types and features
- Genetic Context
- Molecule role
- Molecular Function
- Biological process
- Cellular component
- Protein covalent bond
- Protein domain
- UniProt taxonomy
- Pathway ontology
- Event (INOH pathway ontology)
- Systems Biology
- Protein-protein interaction
- Arabidopsis development
- Cereal plant development
- Plant growth and developmental stage
- C. elegans development
- Drosophila development (FBdv)
- Human developmental anatomy, abstract version
- Human developmental anatomy, timed version
- Mosquito gross anatomy
- Mouse adult gross anatomy
- Mouse gross anatomy and development
- C. elegans gross anatomy
- Arabidopsis gross anatomy
- Cereal plant gross anatomy
- Drosophila gross anatomy
- Dictyostelium discoideum anatomy
- Fungal gross anatomy (FAO)
- Plant structure
- Maize gross anatomy
- Medaka fish anatomy and development
- Zebrafish anatomy and development
- NCI Thesaurus
- Mouse pathology
- Human disease
- Cereal plant trait
- PATO (attribute and value)
- Mammalian phenotype
- Human phenotype
- Habronattus courtship
- Loggerhead nesting
- Animal natural history and life history
eVOC (Expressed Sequence Annotation for Humans)
23. Ontologies as Graphs
• OWL ontologies aren’t graphs, but…
… can be represented as an RDF graph
… people want to use them as graphs
• Plenty of RDF databases around
• But incomplete w.r.t. OWL semantics
• SPARQL is an acquired taste
24. Ontology repository use-cases
• Search for ontology terms
• labels, synonyms, descriptions
• Querying the structure
• Get parent/child terms
• Querying transitive closure
• Get ancestor/descendant terms
• Querying across relations
• Partonomy or development stages
• We can satisfy these requirements with Neo4j
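To make these use-cases concrete, here is a minimal in-memory sketch in Python. It is not how the ontology repository is implemented; the toy edges and the synonym table are invented and are not real UBERON content. It only shows that each use-case is a simple graph operation once the ontology is treated as labelled edges.

```python
from collections import deque

# Toy edges (child, relation, parent), invented for illustration.
EDGES = [
    ("heart", "SUBCLASSOF", "organ"),
    ("organ", "SUBCLASSOF", "anatomical structure"),
    ("heart", "part_of", "circulatory system"),
    ("cardiac ventricle", "part_of", "heart"),
]
SYNONYMS = {"heart": ["cardiac organ"]}  # hypothetical synonym table


def search(text):
    """Search terms by label or synonym (case-insensitive substring match)."""
    text = text.lower()
    terms = {t for s, _, o in EDGES for t in (s, o)}
    return sorted(
        t for t in terms
        if text in t.lower()
        or any(text in syn.lower() for syn in SYNONYMS.get(t, []))
    )


def parents(term, relations=("SUBCLASSOF",)):
    """Direct parents of a term over the chosen relations."""
    return {o for s, r, o in EDGES if s == term and r in relations}


def children(term, relations=("SUBCLASSOF",)):
    """Direct children of a term over the chosen relations."""
    return {s for s, r, o in EDGES if o == term and r in relations}


def closure(term, step, relations):
    """Breadth-first transitive closure of `step` starting from `term`."""
    seen, queue = set(), deque([term])
    while queue:
        for nxt in step(queue.popleft(), relations):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def ancestors(term, relations=("SUBCLASSOF",)):
    return closure(term, parents, relations)


def descendants(term, relations=("SUBCLASSOF",)):
    return closure(term, children, relations)
```

Passing `("SUBCLASSOF", "part_of")` as `relations` gives the "querying across relations" case, mixing the class hierarchy with partonomy in a single traversal.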
25. OWL to Neo4j schema
• Label every node by type (e.g. class, property or individual) and ontology id
• Label every relation by name
• Include an additional index for “special relations” like partonomy and subsets
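One way to picture this schema is to generate the Cypher that would load it. The sketch below is an assumption-heavy illustration, not the actual loader: the helper names, the `label` property and the choice to model the "additional index" as an extra `RelatedTree` relationship are guesses based on the query shown on the next slide. It uses the `$name` parameter style of newer Neo4j releases rather than the older `{0}` style.

```python
# Relations that get an extra RelatedTree edge (assumed reading of "special relations").
SPECIAL_RELATIONS = {"part_of"}


def create_node(node_type, ontology, iri, label):
    """Build a (query, params) pair for a node labelled by type and ontology id.

    Labels cannot be passed as Cypher parameters, so they are interpolated
    (backtick-quoted); property values go through parameters.
    """
    query = (
        "CREATE (n:`%s`:`%s` {iri: $iri, label: $label, ontology_name: $ontology})"
        % (node_type, ontology)
    )
    return query, {"iri": iri, "label": label, "ontology": ontology}


def create_relation(name, source_iri, target_iri):
    """Build (query, params) pairs for a relation named after the ontology property."""
    rel_types = [name]
    if name in SPECIAL_RELATIONS:
        rel_types.append("RelatedTree")  # assumed extra edge for special relations
    params = {"source": source_iri, "target": target_iri}
    return [
        (
            "MATCH (a {iri: $source}), (b {iri: $target}) CREATE (a)-[:`%s`]->(b)" % rel,
            params,
        )
        for rel in rel_types
    ]
```

The resulting pairs could be handed to any Neo4j client that accepts a query string plus a parameter map.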
26. Powerful yet simple queries
• Get the transitive closure for “heart” following parent and
partonomy relations from the UBERON anatomy ontology
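// {0} = ontology name, {1} = IRI of the starting term
MATCH path = (n:Class)-[r:SUBCLASSOF|RelatedTree*]->(parent)<-[r2:SUBCLASSOF|RelatedTree]-(sibling:Class)
WHERE n.ontology_name = {0}
  AND n.iri = {1}
RETURN path  // assumed: the slide shows no RETURN clause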
27. Final thoughts – Neo4j and JSON-LD?
• A lot of frameworks now make it trivial to produce good APIs
• What’s currently missing is how to integrate data from two or more independent APIs
• Hard to crawl independent datasets for connections without a human to interpret semantics
• Still a need to express a schema alongside the data
• W3C standards like RDF/RDFS/SKOS/OWL provide the basic vocabularies and semantics for expressing data schemas
• JSON-LD is bridging the gap from JSON to RDF
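As a rough illustration of how JSON-LD bridges JSON and RDF, the sketch below maps a small JSON document to triples using its `@context`. It is a deliberately simplified reading of JSON-LD that handles only flat documents and two kinds of context entry; a conforming JSON-LD processor does far more. The `example.org` URIs and the `jsonld_to_ntriples` helper are hypothetical.

```python
import json

# A plain JSON API response plus an @context that says what the keys mean.
DOC = json.loads("""
{
  "@context": {
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "transcribed_from": {
      "@id": "http://example.org/vocab/transcribed_from",
      "@type": "@id"
    }
  },
  "@id": "http://example.org/transcript/ENSMUST00000001507",
  "label": "example transcript",
  "transcribed_from": "http://example.org/gene/ENSMUSG00000001467"
}
""")


def jsonld_to_ntriples(doc):
    """Map a flat JSON-LD document to N-Triples lines (simplified sketch)."""
    context = doc["@context"]
    subject = doc["@id"]
    lines = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # keywords such as @context and @id are not properties
        term = context[key]
        if isinstance(term, dict):
            predicate = term["@id"]
            value_is_iri = term.get("@type") == "@id"
        else:
            predicate, value_is_iri = term, False
        # IRIs go in angle brackets; other values become quoted string literals.
        obj = "<%s>" % value if value_is_iri else json.dumps(value)
        lines.append("<%s> <%s> %s ." % (subject, predicate, obj))
    return lines


if __name__ == "__main__":
    print("\n".join(jsonld_to_ntriples(DOC)))
```

The JSON body stays exactly what an API consumer expects; only the `@context` carries the schema that lets a second, independent dataset be linked to it.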
28. Acknowledgements
• Samples, Phenotypes and Ontologies
• Simon Jupp, Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen Parkinson
• Funding
• European Molecular Biology Laboratory (EMBL)
• European Union projects: DIACHRON, BioMedBridges and CORBEL, Excelerate