
Connecting life sciences data at the European Bioinformatics Institute



Tony Burdett's slides from his talk at Connected Data London. Tony is a Senior Software Engineer at the European Bioinformatics Institute. He presented the complexity of data at EMBL-EBI and the institute's approach to making sense of it.

Published in: Technology


  1. 1. 12th July, 2016 Connecting life sciences data at the European Bioinformatics Institute Tony Burdett Technical Co-ordinator – Samples, Phenotypes and Ontologies Team
  2. 2. Bioinformatics is the science of storing, retrieving and analysing large amounts of biological information.
  3. 3. What is EMBL-EBI? • Europe’s home for biological data services, research and training • A trusted data provider for the life sciences • Part of the European Molecular Biology Laboratory, an intergovernmental research organisation • International: 570 members of staff from 57 nations • Home of the ELIXIR Technical hub.
  4. 4. OUR MISSION To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress
  5. 5. Big data, big demand • ~18.5 million requests to EMBL-EBI websites every day • 60 petabytes of EMBL-EBI storage capacity • EMBL-EBI handles 9.2 million jobs on average per month • Scientists at over 5 million unique sites use EMBL-EBI websites
  6. 6. From molecules to medicine Biology is changing: • Lower-cost sequencing • More data produced • New types of data • Emphasis on systems biology Bioinformatics enables new applications: • molecular medicine • agriculture • food • environmental sciences
  7. 7. Data resources at EMBL-EBI • Genes, genomes & variation: RNAcentral, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive • Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE • Protein sequences, families & motifs: InterPro, Pfam, UniProt • Chemical biology: ChEMBL, SureChEMBL, ChEBI • Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank • Reactions, interactions & pathways: IntAct, Reactome, MetaboLights • Systems: BioModels, Enzyme Portal, BioSamples • Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
  8. 8. Database interactions • Collaborative community facilitates social, scientific and technical interactions • Right: internal interactions between data resources as determined by the exchange of data. • Width of each internal arc weighted according to the number of different data types exchanged.
  9. 9. Biology 101 – Central Dogma Dhorspool at en.wikipedia [CC BY-SA 3.0 or GFDL], via Wikimedia Commons
  10. 10. Sadly, it’s not *quite* that simple… User:Dhorspool [CC BY-SA 3.0 or GFDL], via Wikimedia Commons
  11. 11. Nope, not that simple either… [Figure labels: Genome, Proteome, Metabolome, tissue, mRNA array, miRNA array, antibody array, LC-MS/MS, CE-MS, mass spectrum (Intensity vs. m/z), Pathways, Protein Interaction, Drug targets]
  12. 12. Connections between Databases [Diagram: RDF links between Gene Expression Atlas, Ensembl (Gene, RNA transcript, Protein, via identifiers.org/ensembl), UniProt, Gene Ontology (GO BP, GO MF, GO CC), Organisms (Organism/taxon), ChEMBL (Assay, Target), BioModels (SBMLModel, Reaction, Species, Compartment), ChEBI, Reactome (Pathway), SBO and PDB (Structure). Predicates shown: rdfs:seeAlso, sio:'is attribute of' (sio:SIO_000011), uniprot:classifiedWith, uniprot:organism, uniprot:transcribedFrom, uniprot:translatedTo, chembl:hasTarget, bq:is, bq:isVersionOf, bq:isHomologTo, bq:hasPart, bq:occursIn. Expression values are typed as discretized differential gene expression ratio (sio:SIO_001078). One link is marked "not currently linking to but soon".] Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema
  13. 13. We get REALLY good at doing this…
  14. 14. We get REALLY good at doing this…
  15. 15.
  16. 16. How do we turn data into Linked Data (Example from the Gene Expression Atlas) Relational Data to RDF graph conversion • Give “things” URIs • Type “things” with ontologies • Link “things” to other related “things”
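The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the Gene Expression Atlas pipeline: the `example.org` base URI, the `GENE_CLASS` placeholder and the `row_to_triples` helper are all assumptions made for the example, and only the `identifiers.org/ensembl` prefix comes from the slides.

```python
# Minimal sketch: turn a relational row into RDF triples (N-Triples syntax).
# All example.org URIs are hypothetical placeholders, not EBI's real URI scheme.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
EXAMPLE_BASE = "http://example.org/atlas/gene/"   # assumed base URI
ENSEMBL_BASE = "http://identifiers.org/ensembl/"
GENE_CLASS = "http://example.org/ontology/Gene"   # stand-in for an ontology class


def row_to_triples(row):
    """Apply the three steps to one row of a relational table."""
    subject = EXAMPLE_BASE + row["gene_id"]        # 1. give the "thing" a URI
    return [
        (subject, RDF_TYPE, GENE_CLASS),           # 2. type it with an ontology class
        (subject, RDFS_SEE_ALSO, ENSEMBL_BASE + row["gene_id"]),  # 3. link it
    ]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


rows = [{"gene_id": "ENSMUSG00000001467"}]
triples = [t for row in rows for t in row_to_triples(row)]
print(to_ntriples(triples))
```

In practice a library such as rdflib would handle serialisation and literals; plain tuples are used here only to keep the sketch dependency-free.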
  17. 17. Modeling data vs biology • Typing and semantics are the main strength of RDF, so we focused on this aspect • There are a lot of ontologies for the life sciences • However, most of them model biology rather than data • What does an Ensembl entry represent? Is an Ensembl identifier really an instance of a Sequence Ontology Gene class? ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene'
  18. 18. Database Entry or Real World Entity? • Practically, it makes sense to treat database entries as proxies for the real-world entities they represent • The alternative introduces a layer of indirection that would only make linking resources harder • It means we can use biologically meaningful relationships • But this may not work for all use cases ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene' ensembl:ENSMUST00000001507 rdf:type so:'transcript' ensembl:ENSMUST00000001507 so:'transcribed from' ensembl:ENSMUSG00000001467
  19. 19. Knowledge representation challenges • The semantics of our data are complex • The provenance models are even more complex • The relationships are hard to define • Balancing use cases with representation is a major challenge • The harder you try to get the representation correct, the harder it is for users to query • Performance drops off for simple queries
  20. 20. Connecting Gene and Protein in EBI RDF
  21. 21. EBI RDF Platform Successes • Novel queries possible over EBI datasets • Production quality RDF releases • Community of users • Highly available public SPARQL endpoints • 500+ users (10-50 million hits per month) • A lot of interest from industry • Catalyst for new RDF efforts Lessons • Public SPARQL endpoints are problematic • Query federation is not performant • Inference support is limited • Not scalable for all EBI data, e.g. Variation, ENA • Lack of expertise in service teams • Too much overhead to get started quickly in this space
  22. 22. Ontologies for life sciences • Domains: Genotype, Phenotype, Sequence, Proteins, Gene products, Transcript, Pathways, Cell type, BRENDA tissue / enzyme source, Development, Anatomy, Plasmodium life cycle • Ontologies listed: Sequence types and features, Genetic Context, Molecule role, Molecular Function, Biological process, Cellular component, Protein covalent bond, Protein domain, UniProt taxonomy, Pathway ontology, Event (INOH pathway ontology), Systems Biology, Protein-protein interaction, Arabidopsis development, Cereal plant development, Plant growth and developmental stage, C. elegans development, Drosophila development (FBdv), Human developmental anatomy (abstract version), Human developmental anatomy (timed version), Mosquito gross anatomy, Mouse adult gross anatomy, Mouse gross anatomy and development, C. elegans gross anatomy, Arabidopsis gross anatomy, Cereal plant gross anatomy, Drosophila gross anatomy, Dictyostelium discoideum anatomy, Fungal gross anatomy (FAO), Plant structure, Maize gross anatomy, Medaka fish anatomy and development, Zebrafish anatomy and development, NCI Thesaurus, Mouse pathology, Human disease, Cereal plant trait, PATO, Mammalian phenotype, Human phenotype, Habronattus courtship, Loggerhead nesting, Animal natural history and life history, eVOC (Expressed Sequence Annotation for Humans)
  23. 23. Ontologies as Graphs • OWL ontologies aren’t graphs, but… … can be represented as an RDF graph … people want to use them as graphs • Plenty of RDF databases around • But incomplete w.r.t. OWL semantics • SPARQL is an acquired taste
  24. 24. Ontology repository use-cases • Search for ontology terms • labels, synonyms, descriptions • Querying the structure • Get parent/child terms • Querying transitive closure • Get ancestor/descendant terms • Querying across relations • Partonomy or development stages • We can satisfy these requirements with Neo4J
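The difference between "querying the structure" and "querying transitive closure" can be shown with a small in-memory sketch. The `PARENTS` map below is toy data invented for illustration (it is not real UBERON content), and the function names are assumptions; a graph database performs the same traversal on the stored ontology.

```python
from collections import deque

# Toy hierarchy: term -> direct parents. Illustrative only, not real ontology content.
PARENTS = {
    "heart": ["organ"],
    "lung": ["organ"],
    "organ": ["anatomical structure"],
}


def children(term, parents):
    """Structure query: terms that list `term` as a direct parent."""
    return {child for child, ps in parents.items() if term in ps}


def ancestors(term, parents):
    """Transitive closure upwards: every term reachable via parent links."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def descendants(term, parents):
    """Transitive closure downwards: every term reachable via child links."""
    seen = set()
    queue = deque(children(term, parents))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(children(current, parents))
    return seen


print(sorted(ancestors("heart", PARENTS)))
print(sorted(descendants("anatomical structure", PARENTS)))
```

Querying across other relations, such as partonomy, would follow the same pattern with a second edge map alongside `PARENTS`.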
  25. 25. OWL to Neo4j schema • Label every node by type (e.g. class, property or individual) and ontology id • Label every relation by name • Include an additional index for “special relations” like partonomy and subsets
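As a rough picture of this schema, the sketch below models labelled nodes and named relations as plain Python objects and builds the extra index for special relations. It does not use Neo4j at all; the `Node` and `Relation` classes, the `SPECIAL_RELATIONS` set, the `part_of` relation name and the `example.org` IRIs are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed set of "special relations" that get their own index.
SPECIAL_RELATIONS = {"part_of"}


@dataclass(frozen=True)
class Node:
    iri: str
    labels: frozenset  # entity type plus ontology id, as on the slide


@dataclass(frozen=True)
class Relation:
    source: str
    name: str          # every relation is labelled by its name
    target: str


def make_node(iri, entity_type, ontology_id):
    """Label a node by its type (Class, Property, Individual) and ontology id."""
    return Node(iri=iri, labels=frozenset({entity_type, ontology_id}))


def build_special_index(relations):
    """Group special relations by name so they can be looked up directly."""
    index = {}
    for rel in relations:
        if rel.name in SPECIAL_RELATIONS:
            index.setdefault(rel.name, []).append(rel)
    return index


# Hypothetical IRIs for illustration.
heart = make_node("http://example.org/onto/heart", "Class", "uberon")
relations = [
    Relation("http://example.org/onto/heart", "SUBCLASSOF", "http://example.org/onto/organ"),
    Relation("http://example.org/onto/heart", "part_of", "http://example.org/onto/body"),
]
special_index = build_special_index(relations)
print(sorted(heart.labels), list(special_index))
```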
  26. 26. Powerful yet simple queries • Get the transitive closure for “heart” following parent and partonomy relations from the UBERON anatomy ontology MATCH path = (n:Class)-[r:SUBCLASSOF|RelatedTree*]->(parent)<-[r2:SUBCLASSOF|RelatedTree]-(sibling:Class) WHERE n.ontology_name = {0} AND n.iri = {1}
  27. 27. Final thoughts – Neo4j and JSON-LD? • A lot of frameworks now make it trivial to produce good APIs • What’s currently missing is a way to integrate data from two or more independent APIs • It is hard to crawl independent datasets for connections without a human to interpret the semantics • There is still a need to express a schema alongside the data • W3C standards such as RDF/RDFS/SKOS/OWL provide the basic vocabularies and semantics for expressing data schemas • JSON-LD is bridging the gap from JSON to RDF
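To make the last bullet concrete, the sketch below shows how a `@context` lets ordinary JSON keys be read as RDF predicates. It is a deliberately simplified illustration handling only flat documents, not a conformant JSON-LD processor; the document, its `example.org` IRIs and the `to_triples` helper are assumptions made for the example.

```python
import json

# A small JSON-LD style document; IRIs under example.org are hypothetical.
doc = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "sameAs": {"@id": "http://schema.org/sameAs", "@type": "@id"}
  },
  "@id": "http://example.org/sample/1",
  "name": "heart sample",
  "sameAs": "http://example.org/other/1"
}
""")


def to_triples(document):
    """Map flat key/value pairs to triples using the @context (simplified)."""
    context = document.get("@context", {})
    subject = document["@id"]
    triples = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # keywords such as @id and @context are not data
        term = context.get(key)
        if term is None:
            continue  # keys without a mapping carry no shared semantics
        if isinstance(term, dict):
            predicate = term["@id"]
            kind = "iri" if term.get("@type") == "@id" else "literal"
        else:
            predicate, kind = term, "literal"
        triples.append((subject, predicate, value, kind))
    return triples


for triple in to_triples(doc):
    print(triple)
```

The point of the sketch is that the JSON payload stays ordinary JSON for API consumers, while the context supplies the schema needed to link it with other datasets.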
  28. 28. Acknowledgements • Samples, Phenotypes and Ontologies Team • Simon Jupp, Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen Parkinson • Funding • European Molecular Biology Laboratory (EMBL) • European Union projects: DIACHRON, BioMedBridges, CORBEL and Excelerate
  29. 29. Questions?