Identification of insulin-resistance genes with Knowledge Graphs topology and embeddings
1. How well can embeddings represent the
biology of genes related to the complex
pathophysiology of insulin resistance?
Identification of Insulin Resistance-Related Genes with Biomedical Knowledge Graphs Topology and Embeddings
M. Lisandra Zepeda Mendoza, Tankred Ott, Marc Boubnovski, Viktor Sandberg, Ramneek Gupta
2. Executive summary
M. Lisandra Zepeda M., Specialist in Biomedical Knowledge Representation, NNRCO
Challenge
• It is difficult to identify the entire set of genes associated with insulin resistance (IR) because of its complexity and multifactorial nature.
• Knowledge graphs (KGs) model relevant biomedical entities (proteins, diseases, pathways, etc.) in many different ways, and the specific data model can impact the results.
• Various algorithms are available.
Question
• How well can embeddings represent the biology of genes related to the complex pathophysiology of insulin resistance?
Goal
• Understand the complexity of insulin resistance.
• Identify genes related to insulin resistance with knowledge graph embeddings, using a data-driven approach.
3. Appendix
Novo Nordisk company presentation
Background
4. What is insulin resistance?
The insulin signaling pathway
Picture from https://www.nature.com/articles/s41392-022-01073-0
5. No, wait… actually, it’s tissue-specific
A unified concept of insulin resistance in humans
Picture from https://www.nature.com/articles/s41586-019-1797-8
Picture from https://www.nature.com/articles/s41392-022-01073-0
Insulin resistance-related diseases in humans
6. Developing a framework to explore the IR
landscape using biomedical KG
What to consider?
o KG schema
o Information within the knowledge graph:
    o Quality
    o Amount
    o Relevance for task
o Methods used to predict IR-related genes/proteins
7. Methods
8. Which KGs to use? Enriched benchmarking KGs
OpenBioLink
(IR node present as phenotype, 55 Gene-IR links)
Hetionet
(IR node absent)
Picture from https://doi.org/10.1093/bioinformatics/btaa274
Picture from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5640425/
• We use general-purpose biomedical knowledge graphs and enrich them with selected information.
• Add a link between the IR node and each gene predicted to be related to IR in the bioinformatics study by Gao et al. [PMID: 32651353].
• This added 624 Gene-IR links to the KGs (i.e. improved our training set).
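The enrichment step can be sketched as adding one triple per predicted gene to the KG's edge list. This is a minimal illustration only: the triple layout, the relation name, and the node identifiers are hypothetical placeholders, not the actual OpenBioLink or Hetionet vocabulary, and the toy genes stand in for the Gao et al. list.

```python
# Hypothetical identifiers for illustration; the real KGs use their own
# node IDs and relation types.
IR_NODE = "PHENOTYPE:insulin_resistance"
GENE_IR_RELATION = "GENE_PHENOTYPE"


def enrich_with_gene_ir_links(triples, predicted_genes):
    """Return (enriched_triples, n_added), adding one Gene-IR triple per gene.

    Triples already present in the KG are not duplicated.
    """
    existing = set(triples)
    enriched = list(triples)
    n_added = 0
    for gene in predicted_genes:
        triple = (gene, GENE_IR_RELATION, IR_NODE)
        if triple not in existing:
            enriched.append(triple)
            existing.add(triple)
            n_added += 1
    return enriched, n_added


# Toy KG: one Gene-IR link already present, one gene-gene link.
kg = [
    ("GENE:A", GENE_IR_RELATION, IR_NODE),
    ("GENE:A", "GENE_GENE", "GENE:B"),
]
enriched_kg, n_added = enrich_with_gene_ir_links(kg, ["GENE:A", "GENE:B", "GENE:C"])
print(n_added, len(enriched_kg))
```

Deduplicating against the existing triples matters here because some predicted genes may already be linked to the IR node in the source KG.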
11. Methods Overview
Exploring IR
Step 1: Biomedical KG
• OpenBioLink
• Hetionet
Step 2: Feature Engineering
• Topological features
• Embeddings
Step 3: Models
• Link prediction
• Outlier detection & PU
• RFs ensemble model
Step 4: Biological context
• GSEA
• Euclidean distance for clustering
• MSI CMD drugs’ MoAs
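To make "topological features" concrete, the sketch below computes a few simple graph features for a gene relative to a target node such as the IR node. The slides do not say which topological features were used, so degree, common-neighbour count, and Jaccard coefficient are assumptions chosen for illustration, and the toy graph is invented.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def topological_features(adj, gene, target):
    """Simple topology features of `gene` with respect to `target`."""
    gene_nb = adj.get(gene, set()) - {target}
    target_nb = adj.get(target, set()) - {gene}
    common = gene_nb & target_nb
    union = gene_nb | target_nb
    return {
        "degree": len(adj.get(gene, set())),
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
    }


# Toy graph: g1 shares the neighbour p1 with the IR node.
edges = [("g1", "p1"), ("IR", "p1"), ("g1", "p2"), ("IR", "g2"), ("g2", "p2")]
adj = build_adjacency(edges)
features_g1 = topological_features(adj, "g1", "IR")
print(features_g1)
```

Feature vectors like this one could then be fed to any standard classifier; which classifier and which features work best is an empirical question.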
12. Diffusion profiles | Potential drug MoAs from the public Multiscale Interactome KG
https://doi.org/10.1038/s41467-021-21770-8
• Diffusion profile: captures the most relevant paths connecting a drug and a disease, giving insight into the drug’s possible MoA.
• Implement the MSI KG, and the methodology to calculate diffusion profiles of CMD-related drugs, in-house.
• Identify which genes, and which biological functions of those genes, score significantly high in the diffusion profiles of CMD drugs.
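As a rough intuition for how a diffusion profile can be obtained, the sketch below runs a plain random walk with restart from a drug node and reports how much probability mass each node receives. This is a simplified stand-in: the Multiscale Interactome method uses biased walks with tuned weights, whereas the uniform transition probabilities, the restart value, and the toy graph here are all assumptions.

```python
def random_walk_with_restart(adj, start, restart=0.15, n_iter=100):
    """Approximate the stationary visit distribution of a walk restarting at `start`.

    `adj` maps every node to a list of its neighbours (all nodes must be keys).
    """
    profile = {node: 0.0 for node in adj}
    profile[start] = 1.0
    for _ in range(n_iter):
        nxt = {node: 0.0 for node in adj}
        for node, mass in profile.items():
            neighbours = adj[node]
            if not neighbours:
                # Dangling node: send its walking mass back to the start.
                nxt[start] += (1 - restart) * mass
                continue
            share = (1 - restart) * mass / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        # The restart mass always returns to the start node.
        nxt[start] += restart
        profile = nxt
    return profile


# Toy graph: drug -> p1 -> {p2, p3}, and p2 -> disease.
toy_graph = {
    "drug": ["p1"],
    "p1": ["drug", "p2", "p3"],
    "p2": ["p1", "disease"],
    "p3": ["p1"],
    "disease": ["p2"],
}
drug_profile = random_walk_with_restart(toy_graph, "drug")
for node, mass in sorted(drug_profile.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {mass:.3f}")
```

Nodes that lie on well-connected routes from the drug accumulate more mass, which is the sense in which genes can be "significantly high" in a profile.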
13. Novo Nordisk company presentation
Results
14. OpenBioLink @100 predictions
Top Performers:
• Topology-based approaches trained with large training sets, on both the enriched and non-enriched OpenBioLink datasets, outperformed other models.
Close Contenders:
• The Elkan-Noto (Elkanoto) PU classifier with an XGBoost model, applied to OpenBioLink with large training sets and using embeddings from RotatE link prediction on the same biomedical knowledge graph (biomedKG), nearly matched the top topology-based models.
Underperformers:
• Models based on Local Outlier Factor (LOF) were among the least effective.
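The slides do not define the "@100" metric. One plausible reading, sketched below as an assumption, is precision at k: the fraction of the k top-scored genes that belong to a held-out set of known IR-related genes. The scores and gene names are toy values.

```python
def precision_at_k(scores, known_positives, k=100):
    """Fraction of the k highest-scored items that are known positives.

    `scores` maps item -> model score (higher means more likely positive).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for item in ranked if item in known_positives)
    return hits / len(ranked)


# Toy example with k=2: the top two genes are g1 and g2, and only g1 is known.
toy_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.1}
toy_positives = {"g1", "g3"}
p_at_2 = precision_at_k(toy_scores, toy_positives, k=2)
print(p_at_2)
```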
15. Models vary in the consistency of the top predictions
• Consistency of the @100 predictions across 10 replicates for each modelling approach.
• The topology model is very precise, with small variance.
• All other models show significantly more variance; as expected, the worst-performing model shows the most.
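One way to quantify consistency across replicates is the mean pairwise Jaccard overlap between their top-k prediction sets. The slides do not state which consistency measure was used, so this choice is an assumption, and the replicate sets below are toy data.

```python
from itertools import combinations


def mean_pairwise_jaccard(top_sets):
    """Average Jaccard overlap over all pairs of top-k prediction sets."""
    pairs = list(combinations(top_sets, 2))
    if not pairs:
        return 1.0  # a single replicate is trivially consistent with itself
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return total / len(pairs)


# Three toy replicates; the second swaps g3 for g4.
replicates = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3"},
]
consistency = mean_pairwise_jaccard(replicates)
print(round(consistency, 3))
```

A value near 1 means the replicates nominate almost the same genes; a value near 0 means the top predictions are unstable.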
16. Which features are most relevant in the topology
modelling approach?
[Figure: feature relevance in the topology modelling approach; panels: Small, Large]
17. Euclidean distances of known vs unknown IR-related genes
Embedding Quality:
• TransE produced the lowest-quality embeddings: IR-related genes from the positive training set were furthest from the IR node.
Training Set Impact:
• In most models, enriched training sets decrease the distance to the IR node for both known and unknown IR-related genes.
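The distance comparison can be sketched as the mean Euclidean distance between each gene group's embeddings and the IR node embedding. The two-dimensional vectors and gene names below are toy values for illustration, not results from the study.

```python
import math


def mean_distance_to_node(embeddings, genes, target):
    """Mean Euclidean distance from each gene embedding to the target embedding."""
    target_vec = embeddings[target]
    distances = [math.dist(embeddings[g], target_vec) for g in genes]
    return sum(distances) / len(distances)


# Toy 2-D embeddings (real KG embeddings have many more dimensions).
toy_embeddings = {
    "IR": (0.0, 0.0),
    "g1": (3.0, 4.0),
    "g2": (0.0, 1.0),
    "g3": (6.0, 8.0),
}
known_mean = mean_distance_to_node(toy_embeddings, ["g1", "g2"], "IR")
unknown_mean = mean_distance_to_node(toy_embeddings, ["g3"], "IR")
print(known_mean, unknown_mean)
```

Note that `math.dist` requires Python 3.8 or later.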
18. GSEA of the top 100 predicted genes and the positive set
• Best models matched known IR pathways and discovered new aspects.
• Worst models identified broad or organ-specific pathways that were not IR-related.
• The training set contained the known IR pathways, plus unexpected links to the Chagas disease pathway and to cancer.
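Full GSEA is rank-based and uses a running-sum statistic. As a simpler stand-in, the sketch below shows an over-representation test: a hypergeometric tail probability for the overlap between a gene list (for example the top 100 predictions) and one pathway. This is a different, swapped-in technique, and all the numbers are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, pathway_size, n_selected, n_overlap):
    """P(overlap >= n_overlap) when drawing n_selected genes without replacement.

    universe_size: total number of genes considered
    pathway_size:  number of genes annotated to the pathway
    n_selected:    size of the gene list being tested
    n_overlap:     observed number of list genes in the pathway
    """
    total = comb(universe_size, n_selected)
    upper = min(pathway_size, n_selected)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy case: 10 genes in total, 3 in the pathway, 3 selected, all 3 overlap.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
print(p_value)
```

In practice the p-values would be computed per pathway and corrected for multiple testing.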
19. Biological Context MSI
• The best-performing link prediction method (RotatE on the enriched OpenBioLink) found more genes associated with impaired glucose tolerance, i.e. it generalized better than the other good-scoring topology-based and PUL-based models.
• Models could also identify obesity-related diseases.
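For readers unfamiliar with RotatE: it embeds entities as complex vectors and models each relation as an element-wise rotation, scoring a triple by how close the rotated head lands to the tail. The sketch below illustrates that scoring idea with toy vectors; summing element-wise moduli as the distance is an assumption here, and this is not the training code used in the study.

```python
import cmath
import math


def rotate_score(head, relation_phases, tail):
    """RotatE-style score: negative distance between rotated head and tail.

    head, tail:       sequences of complex numbers (entity embeddings)
    relation_phases:  rotation angle in radians for each dimension
    """
    distance = 0.0
    for h, phase, t in zip(head, relation_phases, tail):
        rotated = h * cmath.exp(1j * phase)  # unit-modulus rotation
        distance += abs(rotated - t)
    return -distance


# Toy triple where the rotation maps the head exactly onto the tail.
head = [1 + 0j, 0 + 1j]
phases = [math.pi / 2, math.pi]
matching_tail = [0 + 1j, 0 - 1j]
other_tail = [1 + 0j, 1 + 0j]
good_score = rotate_score(head, phases, matching_tail)
bad_score = rotate_score(head, phases, other_tail)
print(good_score, bad_score)
```

Candidate genes for a (gene, relation, IR) link can then be ranked by this score, with scores closer to zero indicating more plausible links.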
20. Novo Nordisk company presentation
Perspectives
21. Perspectives
Tissue-specific KG
What would the embeddings look like if we explored the disease in a tissue-specific manner rather than a systemic one (all diseases, all tissues, all genes in a single schema)?
Foundation models
Explore the possibility of using foundation models on KGs to perform few-/zero-shot inductive inference.
Complex queries
Use more complex reasoning-based KG-querying approaches to identify the relations/connections between each gene found and the IR node, to facilitate interpretability.
Validation of results
In-house in vitro validation of results.