Evotec - How can Knowledge Graphs support Drug Discovery
1. Polina Shpudeiko
Scientific Programmer, Computational Biology
How can Knowledge Graphs
Support Drug Discovery?
Graph Summit Neo4j, Frankfurt, October 10th, 2023
2. PAGE 2
Agenda
1. Why do we need knowledge graphs in drug discovery?
2. How can we build them to solve our challenges?
3. What can be done with the power of graphs?
4. Where will it lead us?
3. PAGE 3
Why do we need knowledge
graphs in drug discovery?
4. PAGE 4
Integration of public and internal knowledge
Towards a comprehensive understanding of diseases and therapies
Public knowledge Internal knowledge
5. PAGE 5
The life science space is diverse
Navigating the complexity of biology, chemistry and clinics
Genes
Proteins
Mutations
Tissues
Pathways
Cell types
Compounds
Diseases
6. PAGE 6
The life science space is diverse
Various databases capture the structured public knowledge
And these are only a few examples...
Genes
Proteins
Compounds
Mutations
Tissues
Diseases
Pathways
Cell types
7. PAGE 7
Literature space is adding more complexity
Scientific articles are the key for sharing novel knowledge
Statistics
There are approximately 30,000 journals in the world, with the number increasing at a rate of 5-7% per year
Scientific research is evolving rapidly, with an annual influx of approximately 2 million new articles
There are already 36 million articles in the open source database for articles
The ideas in these articles can be extracted using natural language processing (NLP)
8. PAGE 8
The drug discovery process as our in-house data source
Each step generates novel insights and requires dedicated expertise
Areas of interest: Disease biology, Screening and compound chemistry, Clinics
Steps: Target ID and validation, Hit ID and optimization, Lead optimisation, Candidate selection, Candidate profiling, Clinical trial
9. PAGE 9
Mission – harmonize data, understand diseases and support the development of new therapies
Integration of public and internal knowledge
Combining them will lead us towards novel target discovery at Evotec
Public knowledge Internal knowledge
12. PAGE 12
Experimental data
The public ontology space is not standardized: multiple distinct ontologies for Diseases
The public ontology space is incomplete: ontologies do not cover cutting-edge science and novel associations
The ontological space is complex and incomplete
Stable and reliable data models and custom ontologies are essential
13. PAGE 13
Knowledge graph as harmonisation tool
Bringing together heterogeneous biological data in one place
• 15 databases
• 30 million nodes and 100 million connections, and counting
• Deep understanding of ontologies (hierarchical/semantic connections between different entities) was required for harmonisation of diseases and traits
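As an illustration of the harmonisation step, here is a minimal Python sketch of one possible approach: grouping identifiers from different ontologies into a single node whenever they are linked by cross-references. The slides do not describe the actual method, so the function name `harmonise`, the use of union-find, and the placeholder identifiers such as `ONTO_A:1` are all assumptions made for illustration.

```python
from collections import defaultdict


def harmonise(xrefs):
    """Group identifiers that are linked by cross-references.

    xrefs: iterable of (id_a, id_b) pairs stating that two ontology
    identifiers describe the same concept (hypothetical input format).
    Returns a dict mapping a canonical ID to the set of merged IDs.
    """
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in xrefs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Deterministic choice: the smaller ID becomes the root.
            parent[max(ra, rb)] = min(ra, rb)

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return dict(clusters)


# Placeholder identifiers, not real ontology terms.
xrefs = [
    ("ONTO_A:1", "ONTO_B:7"),
    ("ONTO_B:7", "ONTO_C:3"),
    ("ONTO_A:2", "ONTO_C:9"),
]
for canonical, members in sorted(harmonise(xrefs).items()):
    print(canonical, sorted(members))
```

Each resulting cluster could then be loaded as one disease or trait node carrying all of its source identifiers. Hierarchical relations (parent/child terms) would need separate handling.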
14. PAGE 14
• Extract public knowledge from scientific articles with NLP
• Overlay de-novo mined knowledge with ontological database space
Knowledge graph as integration tool
Integration of literature data with NLP approaches
Pathways
Tissue
Genes
Compounds
Diseases
Traits
Mutations
NLP
Article
15. PAGE 15
Figure: Knowledge graph example with the labels PMID, prevent, Depressive disorder and BAIAP2
• Natural Language Processing (NLP) extracts key terms mentioned in the articles
• NLP-powered search engines can understand the context and semantics of queries
• Ontologies help to harmonize the extracted knowledge in one graph
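The bullets above can be illustrated with a deliberately simplified sketch. It replaces a real NLP pipeline with plain dictionary matching, which is an assumption and not the method used in the talk. The vocabularies, the function name `co_mentions`, the article ID `PMID:0` and the example sentence are all made up for illustration.

```python
import re
from itertools import product

# Tiny hypothetical vocabularies; a real system would draw these from
# harmonised ontologies.
GENES = {"BAIAP2"}
DISEASES = {"depressive disorder"}


def co_mentions(article_id, text):
    """Return (article, gene, disease) triples for terms found in text."""
    lowered = text.lower()
    genes = {
        g for g in GENES
        if re.search(rf"\b{re.escape(g.lower())}\b", lowered)
    }
    diseases = {
        d for d in DISEASES
        if re.search(rf"\b{re.escape(d)}\b", lowered)
    }
    # Every gene/disease pair in the same text counts as one co-mention.
    return sorted((article_id, g, d) for g, d in product(genes, diseases))


# Made-up sentence used only to exercise the function.
print(co_mentions("PMID:0", "BAIAP2 was studied in depressive disorder."))
```

Each triple maps naturally onto the graph: an article node connected to a gene node and a disease node, which ontologies can then harmonize with the rest of the graph.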
Knowledge graph as tool for integration
Integration of literature data with NLP approaches
16. PAGE 16
• Signatures are a representation of internal experimental knowledge
• Example: genes whose expression changes in response to therapy
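A signature of this kind could be turned into graph edges roughly as follows. This is a sketch under stated assumptions: the input format (gene, log2 fold change, adjusted p-value), the thresholds, the edge labels `UP`/`DOWN` and the function name `signature_edges` are hypothetical and do not come from the talk.

```python
def signature_edges(name, results, max_padj=0.05, min_abs_lfc=1.0):
    """Convert differential-expression results into signature edges.

    results: iterable of (gene, log2_fold_change, adjusted_p_value).
    The default thresholds are illustrative assumptions only.
    Returns a list of (signature, direction, gene) tuples.
    """
    edges = []
    for gene, lfc, padj in results:
        if padj <= max_padj and abs(lfc) >= min_abs_lfc:
            direction = "UP" if lfc > 0 else "DOWN"
            edges.append((name, direction, gene))
    return edges


# Made-up values with placeholder gene names.
results = [
    ("GENE_1", 2.0, 0.01),    # kept, up-regulated
    ("GENE_2", -1.5, 0.001),  # kept, down-regulated
    ("GENE_3", 0.2, 0.0001),  # dropped: change too small
    ("GENE_4", 3.0, 0.5),     # dropped: not significant
]
print(signature_edges("therapy_response", results))
```

Storing the direction on the edge keeps the signature node itself simple, so it can be connected to the same gene nodes that public sources already point to.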
Pathways
Tissue
Genes
Compounds
Diseases
Traits
Mutations
NLP
Article
Knowledge graph as tool for unification
Combining internal knowledge with public data
Signatures
18. PAGE 18
Disease tree ontology:
Fibrosis
Integration of Public and Internal data in graph
Using public knowledge for target identification from patient-derived signatures
Graph representation
of a Signature
19. PAGE 19
• Expression data in a large patient cohort can be enriched with heterogeneous data from public (NLP, pathways, cell types) and internal (in-house signatures) resources
• This allows us to better understand the underlying mechanisms that drive the disease
Patient-stratification-based signature
Integration of patient signatures and experimental models
Translational research from animal to human
In vivo
signatures
In vitro
signatures
20. PAGE 20
Figure columns: Kidney Diseases, Genes, Any Connected Disease
• Disease space is defined by the parent term of the ontology (Kidney Disease)
• All NLP co-mentions of child diseases with genes are collected
• To determine specificity, all other diseases that were co-mentioned are added
• Co-mention edges are weighted by the number of unique articles
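The four steps above can be sketched in plain Python. This is an illustrative sketch, not the implementation behind the slides: the data structures (a parent-to-children dict and a list of (article, gene, disease) co-mentions) and the names `descendants` and `disease_space` are assumptions, and the gene names are placeholders.

```python
from collections import defaultdict


def descendants(children, root):
    """Collect all child terms below root in an ontology tree."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def disease_space(children, mentions, root):
    """Build weighted gene-disease edges for the disease space of root.

    mentions: iterable of (article_id, gene, disease) co-mentions.
    Returns {(gene, disease): number_of_unique_articles}.
    """
    mentions = list(mentions)
    in_space = descendants(children, root)
    # Step 2: genes co-mentioned with any child disease.
    genes = {g for _, g, d in mentions if d in in_space}
    # Step 3: add every other disease co-mentioned with those genes.
    articles = defaultdict(set)
    for article_id, gene, disease in mentions:
        if gene in genes:
            articles[(gene, disease)].add(article_id)
    # Step 4: weight each edge by the number of unique articles.
    return {edge: len(ids) for edge, ids in articles.items()}


children = {"Kidney Diseases": ["Polycystic Kidney Diseases"]}
mentions = [
    ("a1", "GENE_1", "Polycystic Kidney Diseases"),
    ("a1", "GENE_1", "Polycystic Kidney Diseases"),  # duplicate article
    ("a2", "GENE_1", "Polycystic Kidney Diseases"),
    ("a3", "GENE_1", "Neoplasms"),
    ("a4", "GENE_2", "Neoplasms"),  # never co-mentioned in the space
]
print(disease_space(children, mentions, "Kidney Diseases"))
```

Collecting article IDs in a set is what makes the weight count unique articles, so repeated mentions within the same article do not inflate an edge.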
Defining molecular disease spaces
Based on internal experimental data and NLP-mined external knowledge
21. PAGE 21
Genes associated with genetic kidney diseases
Genes involved in infectious diseases
Neoplasms
Infectious Diseases
Kidney Diseases
Defining molecular disease spaces
Identification of kidney-specific genes in the embeddings of kidney disease space
Polycystic Kidney Diseases
Genes which take part in cancer and are not specific to the disease space of interest
Genes which drive kidney diseases are the most important target candidates
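The slide identifies kidney-specific genes from graph embeddings, and the embedding method is not given. As a much simpler stand-in (an assumption, not the approach from the talk), specificity can be approximated as the share of a gene's co-mention weight that falls inside the disease space. The function name `specificity`, the gene names and the weights below are hypothetical.

```python
def specificity(edge_weights, in_space):
    """Fraction of each gene's co-mention weight inside the disease space.

    edge_weights: {(gene, disease): unique_article_count}
    in_space: set of diseases belonging to the space of interest.
    """
    inside, total = {}, {}
    for (gene, disease), weight in edge_weights.items():
        total[gene] = total.get(gene, 0) + weight
        if disease in in_space:
            inside[gene] = inside.get(gene, 0) + weight
    return {gene: inside.get(gene, 0) / total[gene] for gene in total}


# Made-up weights with placeholder gene names.
edge_weights = {
    ("GENE_1", "Polycystic Kidney Diseases"): 3,
    ("GENE_1", "Neoplasms"): 1,
    ("GENE_2", "Polycystic Kidney Diseases"): 1,
    ("GENE_2", "Neoplasms"): 9,
}
scores = specificity(edge_weights, {"Polycystic Kidney Diseases"})
for gene, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(gene, score)
```

A gene mentioned mostly alongside neoplasms scores low here, mirroring the slide's distinction between cancer-related genes and genes specific to the disease space of interest.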
22. PAGE 22
Sharing of data insights
NeoDash solution for internal knowledge sharing
24. PAGE 24
Summary and outlook
• Graphs are powerful tools for data harmonization in the diverse life science space – bringing ontologies together
• Bringing public and internal knowledge into one place with graphs allowed us to characterize internal signatures in the most efficient way
• Applying diverse graph algorithms helps us uncover hidden insights in our data – identifying specific genes for the disease of interest with the highest potential for Target ID