FAIR & AI Ready Knowledge Graphs
for Explainable Predictions
Michel Dumontier, PhD
Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Maastricht University
SeWeBMeDA :: 26-05-2024
@micheldumontier::KAUST-Hackathon:2023-02-07
Vast amounts of (largely open) data are now available
A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM (2013) 210(11):2205. DOI: 10.1084/jem.20122709
Main Findings:
1. CRM of 11 overexpressed genes predicted future injury to a graft
2. Mice treated with existing drugs against specific CRM genes extended graft survival
3. Retrospective EHR data analysis supports treatment prediction
Key Observations:
1. Meta-analysis offers a more reliable estimate of the direction and magnitude of the effect
2. Existing data can be used to generate and validate new hypotheses
Significant effort is needed to find the right data, make sense of them, and use them for a new purpose.
Data scientists could be more productive.
Surprisingly low reproducibility of landmark studies:
39% (39/100) in psychology [1]
21% (14/67) in pharmacology [2]
11% (6/53) in cancer [3]
unsatisfactory in machine learning [4]
[1] doi:10.1038/nature.2015.17433
[2] doi:10.1038/nrd3439-c1
[3] doi:10.1038/483531a
[4] https://openreview.net/pdf?id=By4l2PbQ-
"Most published research findings are false."
- John Ioannidis, Stanford University. PLoS Med 2005;2(8):e124.
Probability to launch. Nature Reviews Drug Discovery 18, 495-496 (2019). https://doi.org/10.1038/d41573-019-00074-z
Do we have access to the data we need to build good AI models?
To what extent will AI models improve the success of developing effective treatments?
Can they provide adequate explanations?
Data are challenging to reuse:
● difficult to obtain
● poorly described
● in different formats
● hard to integrate with other data

Models have significant limitations:
● built from limited, biased, or erroneous data
● aren't robust to different inputs
● difficult to explain decisions, and explanations aren't satisfying

Translational Failure
High-quality, linked, machine-accessible, machine-interpretable (meta)data from multiple sources and data types.
Trustworthy, data-driven and knowledge-aligned, explainable models and predictions.
Translational Success
Human-machine collaboration is crucial to our future work.
Machines need to be able to discover and reuse data (and arguably any digital resource).
Research Directions
i) organize knowledge to answer questions about what we know
and what we don’t know (but should)
ii) build models to predict, explain, and justify
iii) human-AI collaboration to create, maintain, correct, and
complete knowledge
Knowledge Infrastructure Explainable Predictions
Neurosymbolic AI
http://www.nature.com/articles/sdata201618
Making FAIR Data
1. Collect: the raw data
2. Describe: create standardized metadata; use ontologies + vocabularies; use a standard metadata format
3. Transform: produce standardized data; use ontologies + vocabularies; use a standard data format
4. Publish: add provenance and a license for data + metadata; assign a persistent data identifier and a persistent metadata identifier; deposit the standardized data and metadata in a data repository
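The Describe and Publish steps can be sketched as a machine-readable metadata record. This is a minimal illustrative example assuming DCAT/Dublin Core terms; the dataset IRI, ORCID, and title are hypothetical placeholders, not real identifiers.

```python
import json

# Illustrative DCAT-style metadata record (step 2: Describe, step 4: Publish).
# All identifiers below are hypothetical placeholders.
metadata = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@id": "https://example.org/dataset/123",      # persistent data identifier
    "@type": "dcat:Dataset",
    "dct:title": "Transplant rejection gene expression",
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "dct:creator": "https://orcid.org/0000-0000-0000-0000",  # provenance
    "dcat:distribution": {
        "dcat:mediaType": "text/csv"               # standard data format
    },
}

# Serialize in a standard metadata format (JSON-LD) for deposit in a repository.
print(json.dumps(metadata, indent=2))
```

A repository would then index this record so that both humans and machines can discover the dataset.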
Communities are publishing recipes
to make FAIR data
How do we know it’s FAIR?
• FAIR Enough is a system that performs automated assessment of the technical quality of a FAIRness implementation.
• Uses a collection of metrics, implemented as web services.
• Fast owing to parallel execution.
• Keeps track of past assessments to monitor status.
• Offers search and query services.
• Anybody can extend it via the service-based framework.
• Open source and Docker-deployable.
https://fair-enough.semanticscience.org
Linked Data for the Life Sciences
https://lod-cloud.net/
Bio2RDF is an open source project that uses semantic web technologies to make it easier to reuse biomedical data. It provides Linked Data and a queryable RDF knowledge graph.
• Since 2007, last updated 2014
• 30+ biomedical data sources
• 10B+ interlinked statements
• NCBI, EBI, SIB, DBCLS, NCBO, and many others (chem2bio2rdf) produce this content: chemicals/drugs/formulations; genomes/genes/proteins; domains, interactions, complexes & pathways; animal models and phenotypes; diseases, genetic markers, treatments; terminologies & publications
Belleau et al. JBI 2008. 41(5):706-716.
Callahan et al. ESWC 2013. 200-212.
The Triple as a base unit of knowledge representation
"diclofenac is a drug": subject = diclofenac, predicate = is a, object = drug

Formalization of "diclofenac is a drug":
drugbank:DB00586 rdf:type drugbank:Drug
RDF N-Triples format (standardized, machine interpretable):
<https://bio2rdf.org/drugbank:DB00586> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://bio2rdf.org/drugbank_vocabulary:Drug> .
A human-readable label is attached with rdfs:label, e.g. drugbank:DB00586 rdfs:label "diclofenac".
1. Use RDF
2. Assign/reuse identifiers
3. Use or develop vocabularies
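The N-Triples line above can be produced with a few lines of plain Python (no RDF library; a real pipeline would typically use one):

```python
# Minimal sketch: represent the triple as a Python tuple and emit the
# standardized N-Triples statement shown above.
def ntriple(subject, predicate, obj):
    """Serialize one triple of full IRIs as an N-Triples statement."""
    return f"<{subject}> <{predicate}> <{obj}> ."

triple = (
    "https://bio2rdf.org/drugbank:DB00586",                 # diclofenac
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",      # rdf:type
    "https://bio2rdf.org/drugbank_vocabulary:Drug",         # the Drug class
)

print(ntriple(*triple))
```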
1. Use RDF
2. Assign/reuse
identifiers
3. Use or
develop
vocabularies
Biomedical Linked Data
● has detailed provenance
● linked to other resources
● semantically typed
● http(s) identifiers
● rich descriptions
An Interoperable Biomedical Knowledge Graph
Example statements: diclofenac has db:category NSAID; a drug-gene association links diclofenac (via pharmgkb:drug) to the CYP2C9*1 haplotype (via pharmgkb:haplotype).
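These statements can be sketched as an in-memory edge list with a tiny triple-pattern query; the CURIEs follow the slide and are illustrative only, not calls against a live endpoint.

```python
# Toy knowledge graph: the slide's example statements as (s, p, o) triples.
kg = [
    ("diclofenac", "db:category", "NSAID"),
    ("drug-gene association", "pharmgkb:drug", "diclofenac"),
    ("drug-gene association", "pharmgkb:haplotype", "CYP2C9*1"),
]

def match(graph, s=None, p=None, o=None):
    """Return triples matching the given pattern (None = wildcard)."""
    return [(s2, p2, o2) for (s2, p2, o2) in graph
            if s in (None, s2) and p in (None, p2) and o in (None, o2)]

# Which drug does the drug-gene association record refer to?
print(match(kg, s="drug-gene association", p="pharmgkb:drug"))
```

In practice such queries are expressed in SPARQL against an RDF triplestore; the pattern-matching idea is the same.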
Reproducible ML: new uses for existing drugs
Exploration: drug-target-disease networks
https://doi.org/10.7717/peerj-cs.281
https://doi.org/10.7717/peerj-cs.106
Custom Knowledge Portal: EbolaKB
https://doi.org/10.1093/database/bav049
Information Retrieval: Phenotypes of knock-out mouse models for the targets of a selected drug
Related biomedical knowledge graphs: het.io and DRKG (integrative KGs); ROBOKOP and PheKnowLator (semantic KGs)
Knowledge Collaboratory (for small data)
An AI-powered (NLP model or GPT) web user interface to annotate biomedical text (NER, RE) and create standards-compliant statements (BioLink model) that can be made publicly available as author-signed nanopublications.
📝 collaboratory.semanticscience.org/annotate
Technology to publish assertions using RDF
Contains RDF triples to specify the assertion,
its provenance, and digital object metadata
Digitally signed by agent
TrustyURI hash to provide globally unique,
persistent, immutable, verifiable identifier and
payload
Robust, Reproducible, Explainable Predictions
Neurosymbolic Artificial Intelligence (NAI)
NAI aims to combine symbolic reasoning methods (logic-based reasoning & rules) with
sub-symbolic methods (neural networks, deep learning) to create models with high
predictive performance and explainability.
Specifically, it:
● integrates knowledge from different modalities
● performs complex reasoning (e.g. deduction, induction, synthesis)
● learns from examples/small data as well as from big data
● offers explanations (e.g. a causal account of the phenomenon) and justifications (the evidence that supports the claims)
● is robust to noise and nonsense
● handles cases outside the learning distribution
Predict new drug applications in a documented and reproducible manner
Original report: AUC 0.90 across all therapeutic indications. Scripts not available; feature tables available. Not reproducible!
Reproduction result: ROC AUC 0.83.
Celebi R, Rebelo Moreira J, Hassan AA, Ayyar S, Ridder L, Kuhn T, Dumontier M. 2020. Towards FAIR protocols and workflows: the OpenPREDICT use case. PeerJ Computer Science 6:e281. https://doi.org/10.7717/peerj-cs.281
Explainable AI
• XAI methods such as SHAP provide information about feature importance, both for the model overall and for individual predictions
• When applied to OpenPredict, it is too complicated to understand the contributions of the derived features
• However, it is clearer when using a single-feature predictor
XPREDICT
Identify a set of ranked paths through the KG that offer plausible explanations for
a predicted drug indication based on their similarities to known drug-disease pairs.
1. Path generation
2. Path ranking
3. Explanation generation
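Step 1 (path generation) can be sketched as a depth-limited search over a toy KG. All node and edge names below are invented for illustration; they do not come from the actual XPREDICT graph.

```python
from collections import defaultdict

# Toy KG edges (subject, predicate, object); names are hypothetical.
edges = [("drugX", "targets", "geneA"),
         ("geneA", "participates_in", "pathwayP"),
         ("pathwayP", "associated_with", "diseaseY"),
         ("drugX", "resembles", "drugZ"),
         ("drugZ", "treats", "diseaseY")]

adj = defaultdict(list)
for s, p, o in edges:
    adj[s].append((p, o))

def paths(start, goal, max_len=4, seen=()):
    """Yield simple predicate-labelled paths from start to goal."""
    if start == goal:
        yield []
        return
    if len(seen) >= max_len:
        return
    for pred, nxt in adj[start]:
        if nxt not in seen:  # avoid cycles
            for rest in paths(nxt, goal, max_len, seen + (start,)):
                yield [(start, pred, nxt)] + rest

for path in paths("drugX", "diseaseY"):
    print(path)
```

Steps 2 and 3 would then rank these candidate paths by similarity to paths of known drug-disease pairs and render the top-ranked ones as explanations.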
Graph Representation Learning
We want to automatically discover effective
representations needed for classification from raw data.
In graph representation learning, we encode the
topology, node attributes, and edge information into
low-dimensional vectors (or embeddings).
These vectors can then be used as features to train classifiers for link prediction, node classification, graph classification, etc.
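As a minimal sketch of how such embeddings feed link prediction, the hand-set toy vectors below are scored with a dot product, a common decoder choice. Real embeddings would be learned; these values and entity names are illustrative only.

```python
# Toy low-dimensional node embeddings (hand-set, not learned).
emb = {
    "donepezil": [0.9, 0.1, 0.3],
    "alzheimer": [0.8, 0.2, 0.4],
    "influenza": [-0.5, 0.7, 0.0],
}

def score(u, v):
    """Dot-product link score between two node embeddings."""
    return sum(a * b for a, b in zip(emb[u], emb[v]))

print(score("donepezil", "alzheimer"))  # higher score: more plausible link
print(score("donepezil", "influenza"))  # lower score: less plausible link
```

Note that the individual dimensions carry no interpretable meaning, which is exactly the explainability problem discussed below.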
Graph Neural Networks (GNNs)
GNNs iteratively update node representations by aggregating features from neighboring nodes and, possibly, edges.
Several methods (e.g. saliency maps) exist to extract a model-wise explanation for link prediction and node/graph classification.
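A single round of neighborhood aggregation can be sketched as follows. This toy version uses a plain mean over self and neighbor features; a real GNN layer would also apply learned weight matrices and a nonlinearity. Graph and feature values are invented.

```python
# Toy graph: adjacency and 2-dimensional node features (illustrative values).
neighbours = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

def aggregate(node):
    """Mean of the node's own features and its neighbours' features."""
    group = [features[node]] + [features[n] for n in neighbours[node]]
    return [sum(col) / len(group) for col in zip(*group)]

# One message-passing round: every node's representation is updated.
updated = {node: aggregate(node) for node in features}
print(updated)
```

Stacking such rounds lets information propagate across multiple hops, which is what allows a GNN to encode graph topology into the node representations.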
But I don't find these explanations salient at all.
Embeddings don't lend themselves to meaningful explanations, because each dimension does not necessarily represent or correspond with something we recognize.
Methodology
1. Use Graph Neural Networks to capture
semantics, graph structure and relationships
between nodes
2. Apply Saliency Maps on predictions made by
GNNs to identify relevant nodes for a
specific prediction; this provides valuable
insights into the graph's topology and
highlights the most important components
3. Saliency Maps assign a score to each node in
the network, which can be used to rank
paths involving genes, pathways, diseases,
and compounds
Experiments
Task: Link Prediction: Predict (Compound, treats, Disease)
Dataset: Drug Repurposing Knowledge Graph
Models: GraphSAGE, Graph Convolutional Network (GCN), Graph Attention
Network (GAT)
Metrics:
• Precision: ratio of true positives to predicted positives
• Recall: ratio of true positives to actual positives
• Hits@5 & Hits@10: for each test instance we predicted the tail and assessed whether the true tail ranked within the top 5 or top 10
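The Hits@k metric described above can be sketched as follows; the example ranks are invented for illustration.

```python
def hits_at_k(ranks, k):
    """Fraction of test instances whose true tail has 1-based rank <= k."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 7, 12, 2]     # toy ranks of the true tail for five test triples
print(hits_at_k(ranks, 5))   # 3 of 5 within the top 5
print(hits_at_k(ranks, 10))  # 4 of 5 within the top 10
```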
Prediction & Explanation for: Donepezil treats Alzheimer's disease
The primary goal of Alzheimer's drugs, including donepezil, is to maintain elevated acetylcholine (ACh) levels, thereby compensating for the loss of functioning cholinergic brain cells.
The explanation subgraph emphasizes the crucial role of donepezil binding to acetylcholinesterase (AChE) and butyrylcholinesterase (BChE).
Summary
Finding effective drugs remains an outstanding challenge in biomedicine.
Much progress has been made in constructing semantic biomedical knowledge graphs, and new infrastructure is available to publish, discover, and reuse biomedical knowledge in a manner that is FAIR and decentralized.
Embedding-based prediction approaches show promising performance (notwithstanding data bias), but remain difficult to explain.
Neurosymbolic systems retain the semantics of the knowledge, and show promise in generating predictions that can recover logical explanations.
Michel Dumontier, PhD
Distinguished Professor of Data Science
Founder and Director, Institute of Data Science
Department of Advanced Computing Sciences
Maastricht University
michel.dumontier@maastrichtuniversity.nl
Knowledge Infrastructure
Explainable Predictions
Neurosymbolic AI
FAIR & AI Ready Knowledge Graphs for Explainable Predictions