Biomedical AI applications increasingly rely on multi-domain, heterogeneous data, especially in areas such as personalised medicine and systems biology. Biomedical ontologies are a golden opportunity in this area because they add meaning to the underlying data. That meaning can be used to support heterogeneous data integration, to provide scientific context that augments AI performance, and to afford explanatory mechanisms that contextualize AI predictions. In particular, ontologies and knowledge graphs support the computation of semantic similarity between objects, providing an understanding of why certain objects are considered similar or different. This is a basic aspect of explainability and is at the core of many machine learning applications. However, when data covers multiple domains, it may be necessary to integrate different ontologies to cover the full semantic landscape of the underlying data.
In this talk I will present our recent work on building an integrated knowledge graph based on the semantic annotation and interlinking of heterogeneous data into a holistic semantic landscape that supports semantic similarity assessments. I will discuss the challenges in building the knowledge graph from public resources, the methodology we are using, and the road ahead in biomedical ontology and knowledge graph alignment as AI becomes an integral part of biomedical research.
Powering Biomedical Artificial Intelligence with a Holistic Knowledge Graph (SeWebMeDa2022)
1. Powering Biomedical Artificial Intelligence with a Holistic Knowledge Graph
5th Workshop on Semantic Web solutions for large-scale biomedical data analytics (SeWeBMeDa)
Catia Pesquita
clpesquita@ciencias.ulisboa.pt
2. Why biomedical AI needs ontologies
1. Large amounts of data in non-standard formats, which need to be converted, interpreted, and merged into readable formats
2. Heterogeneous and complex data, which current ML approaches process without context
3. Lack of sufficiently large datasets to train DL in some scenarios
4. Lack of explainability
Ontologies are key to tackling ALL challenges
Catia Pesquita, LASIGE, ULisboa
3. What happens when data is complex, heterogeneous and multi-domain?
4. One ontology is not enough
Systems Biology, Systems Medicine and Personalized Medicine require holistic representations
JD Ferreira, DC Teixeira, C Pesquita. Biomedical Ontologies: Coverage, Access and Use. 2020. Systems Medicine: Integrative, Qualitative and Computational Approaches, Academic Press, Elsevier
5. What we have
Ontology Alignment
• OA systems find the optimal set of mappings between entities in two ontologies
• BioPortal has a mappings repository
Logical Definitions
• define concepts in terms of other, more elementary (atomic) concepts (building blocks)
• OBO LDs cover multiple ontologies
6. Available mappings are not enough
55M mappings:
• ~2/3 Naïve string matching (label or URI)
• ~1/3 UMLS
97% of mappings are skos:closeMatch
or skos:relatedMatch
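To make "naïve string matching" concrete, here is a minimal sketch of label-based matching. The two toy ontologies, their identifiers (`A:1`, `B:7`, and so on) and the function names are hypothetical and only illustrate the idea; this is not the procedure behind the repository's mappings.

```python
def normalize(label: str) -> str:
    """Lowercase a label and collapse hyphens, underscores and extra spaces."""
    return " ".join(label.lower().replace("-", " ").replace("_", " ").split())


def naive_label_match(source: dict, target: dict) -> list:
    """Return (source_id, target_id) pairs whose normalized labels are identical."""
    index = {}
    for target_id, label in target.items():
        index.setdefault(normalize(label), []).append(target_id)
    mappings = []
    for source_id, label in source.items():
        for target_id in index.get(normalize(label), []):
            mappings.append((source_id, target_id))
    return mappings


# Hypothetical toy ontologies (identifier -> label).
onto_a = {"A:1": "Respiratory tract infection", "A:2": "Gum"}
onto_b = {"B:7": "respiratory-tract infection", "B:9": "Gingiva"}

mappings = naive_label_match(onto_a, onto_b)
print(mappings)
```

A matcher of this kind only relates labels that are identical after normalization. It cannot relate labels that differ as strings, such as Gum and Gingiva, and it cannot tell whether two identical strings denote the same concept.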
7. OBO logical definitions are also not enough
Schlegel, D. R., Seppälä, S., & Elkin, P. L. (2016). Definition Coverage in the OBO Foundry Ontologies: The Big Picture. In ICBO/BioCreative.
8. State-of-the-art OM systems are not enough
Simple equivalences
Anatomy: 94.1% F-measure
Large Bio: 72-84% F-measure
Biodiversity: 81-84% F-measure
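For readers less familiar with how alignments are scored, the sketch below computes precision, recall and F-measure of a predicted alignment against a reference alignment. The mapping pairs are made-up placeholders, and the function name is an assumption for illustration.

```python
def evaluate_alignment(predicted: set, reference: set):
    """Return (precision, recall, F-measure) of predicted mappings vs. a reference."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical mappings, each a (source entity, target entity) pair.
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a5", "b9")}

precision, recall, f_measure = evaluate_alignment(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} F-measure={f_measure:.3f}")
```

F-measure is the harmonic mean of precision and recall, so a system cannot score well by optimising only one of the two.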
9. What do we need to create holistic representations?
Cover multiple domains
• Align multiple ontologies
• Scalability
Ensure rich semantic integration
• Related but not equal domains
• Complex relations involving more than one ontology
Provide high quality alignments
• Support human interaction
• Visualize the context of a mapping
• Balance cognitive overload and informativeness
MC Silva, D Faria, and C Pesquita. Integrating knowledge graphs for explainable artificial intelligence in biomedicine. Ontology Matching workshop 2021
10. Rethinking biomedical ontology alignment
Cover multiple domains
• Pairwise ontology alignment → Holistic ontology alignment of multiple ontologies
Ensure rich semantic integration
• Simple equivalence ontology alignment → Complex ontology alignment
Provide high quality alignments
• User validation → Human-in-the-loop Interactive Alignment
MC Silva, D Faria, and C Pesquita. Integrating knowledge graphs for explainable artificial intelligence in biomedicine. Ontology Matching workshop 2021
11. Holistic Matching (CIA) runs in half the time, aims for high recall within similar domains and high precision across domains
Silva, M.C., Faria, D. & Pesquita, C. (2022). Matching Multiple Ontologies to Build a Knowledge Graph for Personalized Medicine.
ESWC2022 (See it on Wednesday 11:30)
12. Beyond simple equivalence matching
Subsumption Matching: two ontologies, similar domain, different granularity
• Example: Respiratory tract infection and Respiratory finding
Complex Matching: two ontologies, similar domain, different model
• Example: Respiratory tract infection and Respiratory tract + Infection
Compound Matching: multiple ontologies, different domains
• Example: abnormal immune system cell morphology, decomposed into its component concepts
13. Complex Matching with Targeted Pattern Mining
Can we improve the performance of complex ontology matching (COM) by using targeted pattern mining-based algorithms?
• Requires shared (or matched) instances
• Patterns are used a priori, to tailor Association Rule Mining
• More effective, as we optimise the search and selection of each pattern
• More efficient, as we reduce the search space
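The idea of letting a pattern restrict what is mined can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it searches only rules of one shape (a source class implies a property-value restriction in the target ontology) and scores them by confidence over shared instances. All class, property and instance names are hypothetical, as is the `min_confidence` threshold.

```python
from collections import defaultdict

# Hypothetical shared instances: class memberships in the source ontology...
source_types = {
    "p1": {"src:AcceptedPaper"},
    "p2": {"src:AcceptedPaper"},
    "p3": {"src:RejectedPaper"},
    "p4": {"src:AcceptedPaper"},
}
# ...and (property, value) facts about the same instances in the target ontology.
target_facts = {
    "p1": {("tgt:hasDecision", "tgt:Acceptance")},
    "p2": {("tgt:hasDecision", "tgt:Acceptance")},
    "p3": {("tgt:hasDecision", "tgt:Rejection")},
    "p4": {("tgt:hasDecision", "tgt:Rejection")},
}


def mine_class_to_restriction(source_types, target_facts, min_confidence=0.6):
    """Mine rules of the single shape SourceClass -> (property, value).

    Fixing the pattern up front means only co-occurrences of this shape are
    counted, instead of enumerating every possible itemset.
    """
    class_count = defaultdict(int)
    joint_count = defaultdict(int)
    for instance, classes in source_types.items():
        for cls in classes:
            class_count[cls] += 1
            for fact in target_facts.get(instance, ()):
                joint_count[(cls, fact)] += 1
    rules = []
    for (cls, fact), joint in joint_count.items():
        confidence = joint / class_count[cls]
        if confidence >= min_confidence:
            rules.append((cls, fact, confidence))
    # Highest-confidence rules first.
    return sorted(rules, key=lambda rule: -rule[2])


rules = mine_class_to_restriction(source_types, target_facts)
for cls, (prop, value), confidence in rules:
    print(f"{cls} -> {prop} {value} (confidence {confidence:.2f})")
```

Because the pattern is fixed before mining, the candidate space grows with the number of class and fact pairs that actually co-occur, which is the efficiency argument made on the slide.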
Lima, B., Faria, D. & Pesquita, C. (2021). Pattern-Guided Association Rule Mining for Complex Ontology Alignment. ISWC2021 P&D.
14. Complex Matching with targeted pattern mining finds mappings with good precision, many of them new
Manual evaluation of the OAEI cmt-conference alignment
Lima, B., Faria, D. & Pesquita, C. (2021). Pattern-Guided Association Rule Mining for Complex Ontology Alignment. ISWC2021 P&D.
15. Compound Matching for ontology triples
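Can we find mappings involving multiple ontologies using lexical approaches and search space pruning?
[Figure: worked example. The source class HP:0001650 "aortic stenosis" is matched in two steps (Step 1, Step 2) to FMA:3734 "aorta" and PATO:0001847 "constricted". After the first step the mapped words are removed from the source label, leaving HP:0001650 "(…) stenosis" for the second step.]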
Filter unmapped source classes
Remove mapped words from class labels.
Selection of best scoring mappings
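The steps listed above can be sketched as a small two-pass lexical procedure. This is a toy illustration under stated assumptions, not the published algorithm: the word-overlap score, the averaging of the two step scores, and the extra label variants ("aortic" for FMA:3734, "stenosis" for PATO:0001847) are hypothetical choices made so that the slide's example can be matched by exact word lookup.

```python
def tokens(label):
    """Split a label into lowercase words."""
    return label.lower().split()


def best_match(source_words, target_labels):
    """Return (target_id, matched_words, score) for the best-covering target class.

    score is the fraction of source words found among a target class's labels;
    (None, set(), 0.0) means that nothing matched.
    """
    best = (None, set(), 0.0)
    for target_id, labels in target_labels.items():
        vocabulary = {word for label in labels for word in tokens(label)}
        matched = {word for word in source_words if word in vocabulary}
        score = len(matched) / len(source_words) if source_words else 0.0
        if score > best[2]:
            best = (target_id, matched, score)
    return best


def compound_match(source_label, first_targets, second_targets):
    """Map one source label to a pair of classes from two target ontologies."""
    source_words = tokens(source_label)
    # Step 1: match the full label against the first target ontology.
    first_id, matched, first_score = best_match(source_words, first_targets)
    if first_id is None:
        return None  # filter unmapped source classes
    # Remove the mapped words from the source label.
    remaining = [word for word in source_words if word not in matched]
    # Step 2: match what is left against the second target ontology.
    second_id, _, second_score = best_match(remaining, second_targets)
    if second_id is None:
        return None
    return first_id, second_id, (first_score + second_score) / 2


# Labels per class; the second entry of each list is a hypothetical variant.
fma_labels = {"FMA:3734": ["aorta", "aortic"]}
pato_labels = {"PATO:0001847": ["constricted", "stenosis"]}

result = compound_match("aortic stenosis", fma_labels, pato_labels)
print("HP:0001650 ->", result)
```

Removing the mapped words before the second step is what prunes the search space: the second ontology is only searched for the part of the label that the first ontology could not explain.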
Oliveira, D., & Pesquita, C. (2018). Improving the interoperability of biomedical ontologies with compound alignments. J. Biomedical Semantics.
16. Compound Matching finds (mostly) high precision ternary mappings, many of them new
Ontology sets  | Mappings | Correct (New)  | Incorrect
MP-CL-PATO     | 448      | 47.1% (17.6%)  | 17.6%
MP-GO-PATO     | 875      | 86.9% (22.3%)  | 4.1%
MP-NBO-PATO    | 169      | 70.4% (20.7%)  | 0.0%
MP-UBERON-PATO | 1413     | 83.5% (24.7%)  | 2.9%
WBP-GO-PATO    | 272      | 44.9% (33.3%)  | 4.7%
HP-FMA-PATO    | 1270     | 81.5% (44.1%)  | 4.1%
Example of a correct new mapping for MP-UBERON-PATO: “absent thoracic vertebrae” (MP:0004655) → “thoracic vertebra” (UBERON:0002347) and “lacks all parts of type” (PATO:0002000).
Mappings were evaluated against existing logical definitions, and a subset of new mappings was evaluated by two specialists.
Oliveira, D., & Pesquita, C. (2018). Improving the interoperability of biomedical ontologies with compound alignments. J. Biomedical Semantics.
18. Graph visualization and interaction support context and are both preferred and more frequently used
• Running user evaluations with domain experts is a challenge
• Still not true human-in-the-loop
Guerreiro, A., Faria, D. & Pesquita, C. (2021) VOWLMap: Graph-based Ontology Alignment Visualization and Editing. VOILA workshop (ISWC 2021)
22. Knowledge Graph Embeddings for ICU readmission prediction
• Prevents 40% of too-early ICU releases while ensuring that all patients who are released do not return to the ICU
Carvalho, R., Oliveira, D. & Pesquita, C. (2022). Knowledge Graph Embeddings for ICU readmission prediction. https://doi.org/10.21203/rs.3.rs-1507573/v1
23. Predicting Gene-Disease Associations
Can using more than one ontology and exploring logical definitions improve the prediction of gene-disease associations based on KG embeddings?
Nunes, S., Sousa, R. T., & Pesquita, C. (2021). Predicting Gene-Disease Associations with Knowledge Graph Embeddings over Multiple Ontologies.
Bio-Ontologies COSI-ISMB/ECCB.
24. Predicting Gene-Disease Associations
[Figure: results reported as Weighted Average F-measure.]
Nunes, S., Sousa, R. T., & Pesquita, C. (2021). Predicting Gene-Disease Associations with Knowledge Graph Embeddings over Multiple Ontologies. Bio-Ontologies COSI-ISMB/ECCB.
26. Trust in AI
• the user successfully comprehends how the model arrives at an outcome → represent inputs, outputs and processes
• the model’s outcomes/workings match the user’s prior knowledge → build a shared context
Jacovi, Alon, et al. "Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI." Proceedings of the 2021 ACM conference on
fairness, accountability, and transparency. 2021.
27. Trust in AI
• the user successfully comprehends how the model arrives at an outcome → represent a model’s inputs, outputs and processes
• the model’s outcomes/workings match the user’s prior knowledge → evaluation
28. Trust in AI for biomedical and clinical applications
• the user successfully comprehends how the model arrives at an outcome → represent a model’s inputs, outputs and processes
• the model’s outcomes/workings match the user’s prior knowledge → evaluation
29. Trust in AI for biomedical and clinical applications
• the user successfully comprehends how the model arrives at an outcome → represent a model’s inputs, outputs and processes
• the model’s outcomes/workings match the user’s prior knowledge → evaluation
30. Trust in AI for biomedical and clinical applications
• the user successfully comprehends how the model arrives at an outcome → represent a model’s inputs, outputs and processes
• the model’s outcomes/workings match the user’s prior knowledge → represent the scientific context
31. Knowledge Science can help mitigate bias in biomedical AI
Mitigating bias in machine learning for medicine. 10.1038/s43856-021-00028-w
32. Knowledge Science for Trust in Biomedical AI
Assessing trustworthiness requires data, domain and user context
Data context: represent data provenance and transformations/processing
Domain/background knowledge: represent the scientific context of the data and application
User context: different users will trust based on different expectations
33. Data Context
Augment explanations with the data creation and processing context
• Provide a rich contextual semantic layer to the underlying data using domain ontologies and knowledge graphs.
• Preserve uncertainty and highlight potential ambiguity and incompleteness at the data level
34. Data Context is key for trust
25% of the works that developed ML approaches to diagnose COVID-19 in adults based on chest X-rays and CT scans used pediatric (ages 1-5) pneumonia images as control.
Roberts, Michael, et al. "Common pitfalls and recommendations for using machine learning to detect and
prognosticate for COVID-19 using chest radiographs and CT scans." Nature Machine Intelligence 3.3 (2021): 199-217.
10.1016/j.cell.2018.02.010
COVID-19
10.1016/j.rxeng.2020.11.001
35. Domain Knowledge Context
Contextualize an explanation within existing knowledge
• Include prior knowledge through links to ontologies
• Enrich the contextual semantic layer with links and relations across domains of knowledge
36. Domain Knowledge Context is key for trust
Term 1 Term 2 Similarity
Gingiva Gum 0.98
37. Domain Knowledge Context is key for trust
Term 1 Term 2 Similarity
Gingiva Gum 0.98
[Figure: the two terms, with similarity 0.98, shown against their domain context: Anatomical Part and Chemical Substance.]
38. User Context
Trusting an AI outcome depends on the user context: task, prior knowledge, expectation, etc.
Biochemist: "Protein A and Protein B are very similar because they perform the same molecular function!"
Clinician: "Protein A and Protein B are not similar at all since they are involved in unrelated diseases…"
39. Ontologies can support XAI
Reasoning
• Patient has Cough and Cough is_a Respiratory Finding → Patient has Respiratory Finding
Querying
• find all patients annotated with Fever and Respiratory Finding
Similarity modelling
• find all patients with similar profiles (semantic similarity)
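The reasoning and querying examples above can be sketched in a few lines. This is a minimal illustration with a toy is_a hierarchy; the patient identifiers, the "Clinical Finding" parent class and the function names are assumptions, and a real system would use an OWL reasoner or a SPARQL endpoint instead.

```python
# Toy is_a hierarchy (child -> parents). "Clinical Finding" is a hypothetical parent.
IS_A = {
    "Cough": {"Respiratory Finding"},
    "Respiratory Finding": {"Clinical Finding"},
    "Fever": {"Clinical Finding"},
}


def ancestors(term, is_a):
    """Return every class reachable from term through is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_annotations(annotations, is_a):
    """Reasoning: add all superclasses of the asserted annotations."""
    closed = set(annotations)
    for term in annotations:
        closed |= ancestors(term, is_a)
    return closed


def query(patients, required, is_a):
    """Querying: patients whose asserted or inferred annotations cover required."""
    return sorted(
        patient
        for patient, annotations in patients.items()
        if required <= inferred_annotations(annotations, is_a)
    )


# Hypothetical patient annotations.
patients = {
    "patient_1": {"Cough", "Fever"},
    "patient_2": {"Fever"},
    "patient_3": {"Cough"},
}

hits = query(patients, {"Fever", "Respiratory Finding"}, IS_A)
print(hits)
```

No patient is directly annotated with Respiratory Finding here; patient_1 is returned only because the annotation is inferred from Cough, which is the kind of traceable step that can be shown to a user as an explanation.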
40. Explaining Protein-Protein Interaction (PPI) Predictions
[Figure: Protein P1 and Protein P2 linked by a question mark.]
Experts want to understand the biological mechanisms that underlie the natural phenomena they are predicting.
41. Combining Genetic Programming and Semantic Similarity
[Figure: pipeline. Input: entity pairs with targets, here PPI from the STRING Database with Random Negative Sampling (example targets: P1 P2 = 1, P1 P3 = 1, P2 P3 = 0), with proteins annotated in the Gene Ontology KG. Steps: (1) computing KG-based semantic similarity between entity pairs for each semantic aspect, giving a matrix of # entity pairs by # semantic aspects; (2) evolving a GP model; (3) predicting on unseen data using the GP model. The figure also shows an excerpt of the Gene Ontology hierarchy annotating P1, P2 and P3.]
Median weighted average of F-measures (WAF): RandomForest 0.914; GP6x 0.866
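The first step of this pipeline, one similarity value per semantic aspect for each entity pair, can be sketched as follows. This is an illustration under assumptions: the similarity measure is a plain Jaccard index over ancestor-closed annotation sets, chosen for brevity and not stated in the talk; the is_a edges are a small hypothetical excerpt; and the annotations of P1 and P2 are made up.

```python
# Toy is_a edges (child -> parents); a hypothetical excerpt, not the full Gene Ontology.
PARENTS = {
    "cellular anatomical entity": {"cellular component"},
    "extracellular space": {"cellular anatomical entity"},
    "binding": {"molecular function"},
    "organic cyclic compound binding": {"binding"},
    "nucleic acid binding": {"organic cyclic compound binding"},
}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS.get(term, ()))
    return closed


def aspect_similarity(terms_a, terms_b, aspect_root):
    """Jaccard similarity of two annotation sets, restricted to one semantic aspect."""
    def in_aspect(term):
        return aspect_root in with_ancestors({term})

    a = {term for term in with_ancestors(terms_a) if in_aspect(term)}
    b = {term for term in with_ancestors(terms_b) if in_aspect(term)}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Each semantic aspect is the subgraph rooted at one class.
ASPECTS = ["cellular component", "molecular function"]

# Hypothetical annotations for two proteins.
P1 = {"extracellular space", "nucleic acid binding"}
P2 = {"extracellular space", "binding"}

# One feature per aspect: this vector is what a downstream model would consume.
features = [aspect_similarity(P1, P2, aspect) for aspect in ASPECTS]
print(dict(zip(ASPECTS, features)))
```

Keeping one feature per aspect, rather than a single overall similarity, is what lets the learned model be read in terms of named biological aspects.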
42. A Model That Fits with Biology
[Figure: two panels, each with an excerpt of the GO graph for a protein pair and a bar chart of Semantic Similarity (0.0 to 1.0) per semantic aspect. Aspects shown: cellular anatomical entity, binding, structural molecule activity, interspecies interaction between organisms, metabolic process, biological regulation, protein containing complex, localization, cellular process.]
True Positive (+/+): 40S ribosomal protein S12 and 40S ribosomal protein S10
True Negative (-/-): Kinetochore-associated protein 1 and Tubulin β-6 chain
GP model: max(SS catalytic activity, SS cellular process, SS molecular adaptor activity, SS molecular function regulator, SS multicellular organismal process, SS signaling, SS behavior + SS immune system process)
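Because the evolved model on this slide is a readable expression over per-aspect similarities (SS), it can be evaluated and inspected directly. The sketch below evaluates that expression and reports which branch of the max determined the output. The similarity values are hypothetical, and how the output is turned into an interaction label is not shown on the slide, so it is left out.

```python
SINGLE_ASPECTS = [
    "catalytic activity",
    "cellular process",
    "molecular adaptor activity",
    "molecular function regulator",
    "multicellular organismal process",
    "signaling",
]


def gp_model(ss):
    """Evaluate the slide's max(...) expression and name the winning branch."""
    candidates = {aspect: ss[aspect] for aspect in SINGLE_ASPECTS}
    # The last argument of max(...) is a sum of two aspect similarities.
    candidates["behavior + immune system process"] = (
        ss["behavior"] + ss["immune system process"]
    )
    driver = max(candidates, key=candidates.get)
    return candidates[driver], driver


# Hypothetical per-aspect semantic similarities for one protein pair.
ss_pair = {
    "catalytic activity": 0.10,
    "cellular process": 0.62,
    "molecular adaptor activity": 0.05,
    "molecular function regulator": 0.20,
    "multicellular organismal process": 0.15,
    "signaling": 0.30,
    "behavior": 0.40,
    "immune system process": 0.35,
}

output, driver = gp_model(ss_pair)
print(f"model output {output:.2f}, driven by: {driver}")
```

Naming the branch that produced the maximum gives a simple, aspect-level account of why a pair received its score.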
43. But it is even more interesting when it fails
False Negative (+/-): S100-A10 protein and neuroblast differentiation-associated protein AHNAK
Protein S100-A10 works together with neuroblast differentiation-associated protein AHNAK in the development of the intracellular membrane, but this information is missing from the GO annotations.
[Figure: GO graph excerpt for S100-A10 and AHNAK (GO:0045121 membrane raft, GO:0070062 extracellular exosome, GO:0110165 cellular anatomical entity, GO:0005575 cellular component) and a bar chart of Semantic Similarity (0.0 to 1.0) for the aspects cellular anatomical entity, binding and biological regulation.]
44. But it is even more interesting when it fails
False Positive (-/+): Protransforming growth factor α and Disks large homolog 2
The literature describes interactions between proteins of the same family of the pair, indicating that this is likely a true but still unknown interaction.
[Figure: GO graph excerpt for the pair, including Dlg2 (GO:0008151 biological process, GO:0000165 MAPK cascade, GO:0016323 basolateral plasma membrane, GO:0005575 cellular component, GO:0110165 cellular anatomical entity, GO:0051179 localization, GO:0009987 cellular process, GO:0065007 biological regulation) and a bar chart of Semantic Similarity (0.0 to 1.0) for the aspects cellular anatomical entity, binding, localization, metabolic process, biological regulation and cellular process.]
46. Building a KG for Explainable AI for Personalized Oncology
47. The KG can be used to contextualize the patient and the drug recommendation
[Figure: knowledge graph around a Patient and an AI recommendation. Nodes: Patient; Ex-smoker; Smoker; Diabetes; MET; HLA-A2; cytokine-mediated signaling pathway; Renal cell carcinoma, somatic; Sunitinib; Antineoplastic Agent; Tyrosine kinase activity; Cancer; AI recommendation. Edge labels: presents, has mutation, inhibits, promotes, treats, treated with, is risk factor.]
48. The KG can be used to contextualize the patient and the drug recommendation
[Figure: same knowledge graph as on slide 47.]
How diverse is my dataset in terms of patient features?
49. The KG can be used to contextualize the patient and the drug recommendation
[Figure: same knowledge graph as on slide 47.]
How **similar** are my negative and positive cases?
50. The KG can be used to contextualize the patient and the drug recommendation
[Figure: same knowledge graph as on slide 47.]
How does the prediction match current scientific knowledge?
51. The KG can be used to contextualize the patient and the drug recommendation
[Figure: same knowledge graph as on slide 47.]
How **similar** are patients treated with the same drug?
54. Acknowledgements
Daniel Faria, LASIGE/Biodata.pt, Portugal
Sara Silva, LASIGE, Portugal
Isabel Cruz, U. Illinois, USA
Daniela Oliveira, Novartis
Booma S. Balasubramani, Microsoft
and many others
Past and present students:
Rita Sousa
Marta Silva
Susana Nunes
Ricardo Carvalho
Ana Guerreiro
Patrícia Eugénio
Filipa Serrano
Beatriz Lima
Carlota Cardoso
and many others
This work was supported by FCT through the LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020). It was also partially supported by the KATY project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017453.