Knowledge graphs, Ilaria Maresi, The Hyve, 23 Apr 2020

Data for drug discovery and healthcare is often trapped in silos, which hampers effective interpretation and reuse. To remedy this, such data needs to be linked both internally and to external sources to create a FAIR data landscape that can power semantic models and knowledge graphs.

Published in: Health & Medicine

  1. 1. Knowledge graphs and semantic models for drug discovery and healthcare. Thurs, 23rd April 2020 at 16:00 BST. Hosted by: Ian Harrow, Pistoia Alliance. Speaker: Ilaria Maresi, The Hyve. FAIR/OM projects Community of Interest webinar series.
  2. 2. This webinar is being recorded
  3. 3. Audience Q&A: please use the questions box
  4. 4. ©PistoiaAlliance Ilaria Maresi
     • Ilaria is a Data Engineer at The Hyve, specialising in Semantic Modelling and Knowledge Graphs with applications in healthcare and drug discovery.
     • As a mathematician by training, Ilaria came to the bioinformatics field through her interest in the intersection of biology, mathematics and engineering.
     • In her free time, she tries to get away from her computer, and enjoys cooking and spending time outside.
  5. 5. Knowledge graphs and semantic models for drug discovery and healthcare. Ilaria Maresi, Data Engineer, ilaria@thehyve.nl
  6. 6. Agenda
     ● About me
     ● The problem
     ● What are semantic models and knowledge graphs?
       ○ Creating
       ○ Querying
     ● Semantic models in action
       ○ Clinical trials
       ○ Drug discovery landscape
     ● Wrap up
     ● Q&A
  7. 7. We enable open science by developing and implementing open source solutions and FAIRifying data in life sciences
  8. 8. About me: Data Engineer at The Hyve, Utrecht, The Netherlands
     ● Semantic modelling
     ● Knowledge graphs
     ● ETL pipelines
     ● FAIR data services
  9. 9. The problem
  10. 10. Drug discovery process
      [Figure: the drug discovery process. Source: PhRMA, Biopharmaceutical R&D: The Process Behind New Medicines, https://www.phrma.org/en/Report/Biopharmaceutical-R-and-D-The-Process-Behind-New-Medicines]
  11. 11. Differing and unlinked identifiers
      [Figure: the drug discovery process annotated with example identifiers: CMP102401, acetylsalicylic acid, Aspirin. Source: PhRMA, Biopharmaceutical R&D: The Process Behind New Medicines, https://www.phrma.org/en/Report/Biopharmaceutical-R-and-D-The-Process-Behind-New-Medicines]
      Query: "I want to see the trajectory of compound xxx"
  12. 12. Differing terminology
      [Figure: the drug discovery process annotated with example terms: omics experiment, lab test, genomics assay. Source: PhRMA, Biopharmaceutical R&D: The Process Behind New Medicines, https://www.phrma.org/en/Report/Biopharmaceutical-R-and-D-The-Process-Behind-New-Medicines]
      Query: "I want to see all the genomics data related to compound xxx"
  13. 13. Differing terminology
      [Figure: the drug discovery process annotated with example terms: non-small cell lung cancer, NSCLC, non-small cell malignant neoplasm of lung; study statuses shown: terminated, recruiting, terminated. Source: PhRMA, Biopharmaceutical R&D: The Process Behind New Medicines, https://www.phrma.org/en/Report/Biopharmaceutical-R-and-D-The-Process-Behind-New-Medicines]
      Query: "I want to see all terminated studies for non-small cell lung cancer that study compound xxx"
  14. 14. What are semantic models and Knowledge Graphs?
  15. 15. Semantic models are ...
      “Semantic models of data sources represent the implicit meaning of the data by specifying the concepts and the relationships within the data. Such models are the key ingredients to automatically publish the data into knowledge graphs” – USC semantic modelling
  16. 16. What about knowledge graphs? A knowledge graph is several things:
      ● Database: contains actual data
      ● Graph: data items and concepts are connected via relationships (i.e. nodes and edges)
      ● Semantic: the meaning of the data is encoded in the graph, allowing for meaning to be inferred
      ● Alive: constantly refreshed with new data, and can be extended and revised as new data comes in
  17. 17. The power of a knowledge graph
      ● Linking concepts that are the same or similar
        ○ Model level – entities
        ○ Data (instance) level – identifiers
      ● Harmonising data sources without transforming or forcing a common standard on the data
      ● The more diverse your data, the more powerful your KG will be
        ○ Google Knowledge Graph
        ○ Diffbot
        ○ Wikidata
  18. 18. A knowledge graph example: DBpedia
      ● Community effort to extract structured content from the information created in various Wikimedia projects
      ● Data comes in various formats and from various sources
      ● The KG unifies the data and enables querying
  19. 19. Using RDF to create semantic models
      ● Resource Description Framework (RDF)
      ● Information encoded in triples
      ● Almost everything needs a Uniform Resource Identifier (URI)
      [Figure: triple pattern Subject, predicate, Object. Model level: schema:Patient, schema:healthCondition, schema:MedicalCondition. Instance level: patient_1231, schema:healthCondition, AnginaPectoris]
  20. 20. Using RDF to create semantic models
      [Figure: graph of instance data. Nodes: patient_1231, AnginaPectoris, Heart, AnatomicalStructure, 12/12/1950, Ranolazine, Ranexa. Edge labels: schema:healthCondition, schema:associatedAnatomy, rdf:type, schema:birthDate, schema:drug, schema:nonProprietaryName]
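
The triples on slides 19 and 20 can be sketched in plain Python. This is a minimal illustration and not the speaker's tooling: it uses tuples rather than an RDF library, and the `ex:` namespace and the `Literal`, `expand` and `to_ntriples` helpers are hypothetical. Only the `patient_1231 schema:healthCondition AnginaPectoris` triple is stated explicitly on the slides; the subject of each remaining edge is an assumed reading of the graph figure.

```python
from typing import NamedTuple

# Namespace prefixes. "ex" is a hypothetical namespace for instance data.
PREFIXES = {
    "schema": "https://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/",
}


class Literal(NamedTuple):
    """A plain string literal (as opposed to a resource with a URI)."""
    value: str


def expand(curie: str) -> str:
    """Turn a prefixed name such as 'schema:Patient' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        obj = f'"{o.value}"' if isinstance(o, Literal) else f"<{expand(o)}>"
        lines.append(f"<{expand(s)}> <{expand(p)}> {obj} .")
    return lines


TRIPLES = [
    # Stated on slide 19.
    ("ex:patient_1231", "schema:healthCondition", "ex:AnginaPectoris"),
    # Assumed edge directions, read off the flattened figure on slide 20.
    ("ex:AnginaPectoris", "schema:associatedAnatomy", "ex:Heart"),
    ("ex:Heart", "rdf:type", "schema:AnatomicalStructure"),
    ("ex:patient_1231", "schema:birthDate", Literal("12/12/1950")),
]

if __name__ == "__main__":
    for line in to_ntriples(TRIPLES):
        print(line)
```

Every subject and predicate expands to a URI, while the birth date stays a literal, which is the sense in which "almost everything" needs a URI.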
  21. 21. Querying a knowledge graph
      ● SPARQL Protocol and RDF Query Language (SPARQL)
  22. 22. Querying a knowledge graph
      For a single gene (CDX2), these are all its associated identifiers in Wikidata: Ensembl ID, Entrez ID, NCBI Gene ID
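
A query along the lines of slide 22 might look like the sketch below, which only builds the SPARQL text and a request URL and does not contact any server. The slide does not show the query itself, so everything here is an assumption: the function names are hypothetical, and the Wikidata property IDs (P353 for the HGNC gene symbol, P594 for the Ensembl gene ID, P351 for the Entrez Gene ID) should be verified against Wikidata before use.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def gene_identifier_query(symbol: str) -> str:
    """Build a SPARQL query for the identifiers linked to a gene symbol.

    The property IDs below are assumptions to verify against Wikidata.
    """
    # Reject anything that is not a simple symbol, to avoid query injection.
    if not symbol.replace("-", "").isalnum():
        raise ValueError(f"unexpected gene symbol: {symbol!r}")
    return f"""PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?gene ?ensembl ?entrez WHERE {{
  ?gene wdt:P353 "{symbol}" .
  OPTIONAL {{ ?gene wdt:P594 ?ensembl . }}
  OPTIONAL {{ ?gene wdt:P351 ?entrez . }}
}}"""


def request_url(query: str) -> str:
    """URL that would submit the query and ask for JSON results."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    query = gene_identifier_query("CDX2")
    print(query)
    print(request_url(query))
```

The `OPTIONAL` blocks keep the gene in the result even when one of the identifiers is missing, which matters when sources are only partially linked.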
  23. 23. Linked data is FAIR(er) data
      ● Findable: uniform resource identifiers that persist across an organization
      ● Accessible: triple store access can be open or only to select users
      ● Interoperable: RDF gives meaning to data to both humans and machines (include community standard ontologies!)
      ● Reusable: RDF is independent of tools/systems and can encode provenance of (meta)data
  24. 24. Semantic models in action
  25. 25. A semantic model of clinical studies
      Problem
        ○ Data across stages of clinical studies exists mostly in silos and is not harmonised
        ○ Cross domain analytics is currently burdensome
        ○ Unclear what processes data has gone through
      Goals
        ○ Data Conformance Layer to semantically represent source data on clinical trials
        ○ Leverage external data
        ○ Represent provenance in model
  26. 26. The solution
      ● 1300+ triples
      ● 65 classes
      ● 153 properties
      ● 13 ontologies
  27. 27. Ontologies and interoperability
      ● Use ontologies that fit domain and data
      ● BioPortal
      ● It’s possible to use multiple ontologies in one model!
      ● FAIR principles I2 & I3
  28. 28. The solution
      ● Harmonising terms (clinical trial vs. clinical study vs. medical trial)
      ● Metadata annotation (alt labels for compound)
      ● Controlled vocabularies (indication)
      ● Provenance (what activity generated the dataset and when)
  29. 29. Knowledge graph: from R&D to market
      Problem
        ○ “How much data do we have and how is it connected?”
        ○ Data across stages of R&D is unlinked
      Goals
        ○ Understand quantity and qualitative details of existing data assets
        ○ Map relationships between data assets
  30. 30. The solution
      ● Semantic model representing data flows and major entities
      ● Encode provenance
      [Diagram: data flows between Compound registry, Electronic Lab Notebook, Data Warehouse and Outside vendor; edge labels: compound_id, assay_results, assay_results]
  31. 31. The solution
      ● Instantiate semantic model with (meta)data
      ● Queryable knowledge graph with ~14M triples
      ● Different “views”
      ● Evolving model
  32. 32. The solution
      Query: how many genomics assays exist across all sectors of my organisation?
      [Figure: unlinked assay types (RNA Seq experiment, Single-cell RNA-seq assay, Whole genome sequencing, WGS experiment) and assay records (RSQ_12965, AssayRNA_31, SC4050_EM, 8019240201G, Exp_WGS_412, 8019240202G)]
  33. 33. The solution
      Query: how many genomics assays exist across all sectors of my organisation?
      [Figure: the same assay types and records, now linked. Classes: RNA Seq experiment, Single-cell RNA-seq assay, Whole genome sequencing, WGS experiment, Genomics Assay. Records: RSQ_12965, AssayRNA_31, SC4050_EM, 8019240201G, Exp_WGS_412, 8019240202G. Edge labels: rdf:type (×5), rdfs:subClassOf (×4), owl:equivalentClass, owl:sameAs]
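
How the links on slide 33 answer the query can be sketched in plain Python. This is an illustration under stated assumptions, not a reasoner and not the speaker's implementation: the transcript preserves the node and edge labels but not which edge connects which nodes, so the wiring in `TRIPLES` is assumed, the class names are written without spaces for convenience, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Assumed wiring of the figure: the labels come from the slide, the
# connections between them are a guess made for this sketch.
TRIPLES = [
    ("RNASeqExperiment", "rdfs:subClassOf", "GenomicsAssay"),
    ("SingleCellRNASeqAssay", "rdfs:subClassOf", "GenomicsAssay"),
    ("WholeGenomeSequencing", "rdfs:subClassOf", "GenomicsAssay"),
    ("WGSExperiment", "rdfs:subClassOf", "GenomicsAssay"),
    ("WholeGenomeSequencing", "owl:equivalentClass", "WGSExperiment"),
    ("RSQ_12965", "rdf:type", "RNASeqExperiment"),
    ("AssayRNA_31", "rdf:type", "RNASeqExperiment"),
    ("SC4050_EM", "rdf:type", "SingleCellRNASeqAssay"),
    ("8019240201G", "rdf:type", "WholeGenomeSequencing"),
    ("8019240202G", "rdf:type", "WholeGenomeSequencing"),
    ("Exp_WGS_412", "owl:sameAs", "8019240201G"),
]


def superclasses(cls, triples):
    """Classes reachable from cls via subClassOf or equivalentClass (incl. cls)."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            parents[s].add(o)
        elif p == "owl:equivalentClass":
            # Equivalence is treated as subclassing in both directions.
            parents[s].add(o)
            parents[o].add(s)
    seen, stack = {cls}, [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def count_instances(target, triples):
    """Count distinct individuals of target, merging owl:sameAs aliases."""
    canonical = {}

    def find(x):
        # Follow alias links until reaching a representative identifier.
        while canonical.get(x, x) != x:
            x = canonical[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            canonical[find(s)] = find(o)

    return len({
        find(s)
        for s, p, o in triples
        if p == "rdf:type" and target in superclasses(o, triples)
    })


if __name__ == "__main__":
    print(count_instances("GenomicsAssay", TRIPLES))
```

With this assumed wiring, the subclass links let one query cover all four assay types, and the `owl:sameAs` link prevents the record known under two identifiers from being counted twice.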
  34. 34. Wrap up
  35. 35. Potential pitfalls
      ● RDF is flexible: triples could contradict your model, and inference could easily propagate wrong information through the graph
        ○ Use SHACL (Shapes Constraint Language) validation
      ● Overloading the triple store = slow query times
        ○ Restrict triples to important information
      ● URI schema needs to be maintained
        ○ Automated URI generation and URI update
      ● Sometimes a Knowledge Graph is overkill
        ○ Semantic models or a common data model may suffice
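
The first pitfall on slide 35 can be illustrated with a plain-Python check. To be clear, this is not SHACL and not a SHACL engine: it is a hand-written stand-in for the kind of cardinality constraint a SHACL shape would express (for example, each patient has exactly one birth date). The `check_cardinality` function, the `ex:` identifiers other than `patient_1231`, and the contradictory second birth date are all hypothetical.

```python
def check_cardinality(triples, target_class, path, min_count=0, max_count=None):
    """Report subjects of target_class whose number of `path` values is out of range.

    Returns a list of (subject, path, observed_count) tuples.
    """
    subjects = [s for s, p, o in triples if p == "rdf:type" and o == target_class]
    violations = []
    for subject in subjects:
        count = sum(1 for s, p, _ in triples if s == subject and p == path)
        too_few = count < min_count
        too_many = max_count is not None and count > max_count
        if too_few or too_many:
            violations.append((subject, path, count))
    return violations


TRIPLES = [
    ("ex:patient_1231", "rdf:type", "schema:Patient"),
    ("ex:patient_1231", "schema:birthDate", "12/12/1950"),
    # Hypothetical contradictory triple: a second birth date for the same patient.
    ("ex:patient_1231", "schema:birthDate", "01/01/1960"),
    # Hypothetical patient with no birth date at all.
    ("ex:patient_2000", "rdf:type", "schema:Patient"),
    # Hypothetical patient that satisfies the constraint.
    ("ex:patient_3000", "rdf:type", "schema:Patient"),
    ("ex:patient_3000", "schema:birthDate", "03/03/1970"),
]

if __name__ == "__main__":
    for violation in check_cardinality(
        TRIPLES, "schema:Patient", "schema:birthDate", min_count=1, max_count=1
    ):
        print(violation)
```

RDF itself accepts all six triples; only an explicit constraint check of this kind flags the patient with two birth dates and the patient with none before inference spreads the inconsistency.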
  36. 36. Why you should consider KGs
      ● Connecting data and representing it as a KG enables:
        ○ Querying across an organisation
        ○ Querying across stages of drug discovery
        ○ Adoption of common terminologies
        ○ Cleaner data
        ○ Clear provenance of data
      ● Linked data is FAIR(er) data!
  37. 37. Why you should consider KGs
      ● Machine learning and knowledge graphs
        ○ ML can be used to develop a KG
        ○ Conversely, a KG can be used for ML algorithms: “...knowledge graphs are a step towards enabling machines to more deeply understand data ... that don’t fit neatly into the rows and columns of a relational database” [1]
      [1] https://www.forbes.com/sites/bernardmarr/2019/06/26/knowledge-graphs-and-machine-learning-the-future-of-ai-analytics/#7b825f0c3a36
  38. 38. Materials
      Data engineering tools for knowledge graphs
        ● Data Discovery Toolkit
        ● Model Repository
        ● Knowledge Base
        ● Data Mapping Framework
        ● Data Visualization Framework
      FAIR data governance is like a fractal
        ● Purpose and scope
          ○ Metrics & KPIs
          ○ New insights
        ● Retrospective vs prospective
      A Data Engineer’s Guide to Semantic Models (coming soon!)
        ● Getting started on semantic models
        ● RDF, SPARQL, SHACL
      Common Data Models for FAIR biomedical data
        ● Choosing data models
        ● OMOP CDM
      Leveraging the OMOP Common Data Model for Clinical Trials
        ● Repurposing OMOP for clinical trials
  39. 39. Audience Q&A: please use the questions box
  40. 40. Join us for the next FAIR/OM CoI webinar: Semantics of data matrices and the STATO ontology. Speaker: Philippe Rocca-Serra, Oxford University. Thurs 28th May at 16:00 BST
  41. 41. Thanks for your attention. info@pistoiaalliance.org, @pistoiaalliance, www.pistoiaalliance.org
