The Challenge
Variety of users / diversity of scientific questions
Scientists, medical doctors, data scientists
Graph database
Biological question:
Do human T2D genes encode enzymes that act on metabolites which are in turn regulated in a pig diabetes model?
The actual question (from a data-point-of-view):
Is there a connection between A and R?
=> 3s to look into the Excel sheet
Why graph? Easy scientific question
The actual question (from a data-point-of-view):
Is there a connection between A and R?
=> 3s to look into the graph
[Figure: example graph with nodes A, B, C, D, E, F, G, K, Q, R, S, U, W, Z]
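The "is there a connection between A and R?" question can be sketched as a breadth-first search. This is a minimal illustration, not the DZDconnect implementation: the slide only names the nodes, so the edge list `EDGES` and the helper names are assumptions made up for this example.

```python
from collections import deque

# Hypothetical edges: the slide shows only node names, so these links are invented.
EDGES = [("A", "B"), ("B", "C"), ("C", "E"), ("E", "K"), ("K", "R"),
         ("D", "F"), ("F", "G"), ("Q", "S"), ("W", "Z"), ("Z", "U")]


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbours mapping."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the shortest node path, or None if unconnected."""
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(adjacency[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


ADJACENCY = build_adjacency(EDGES)
print(shortest_path(ADJACENCY, "A", "R"))
```

In a spreadsheet the same question means joining one lookup per hop; in a graph it is a single traversal from A.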
Why graph? Easy scientific question
Back to the question
Do human T2D genes encode enzymes that act on metabolites which are in turn regulated in a pig diabetes model?
Genomics (human diabetic data): Genes, SNPs, Proteins, Enzymes, Pathways, Metabolites
Metabolomics (prediabetic pig): Metabolites
List of SNPs
List of Genes (species 1)
List of Proteins (species 1)
List of loci
List of Enzymes (species 1)
List of Pathways (species 1)
List of Metabolites (species 1)
List of Metabolites (species 2)
graph
Why graph? -> why not relational
• biomedical data / healthcare data is highly connected
• => variety of data
=> unstructured
=> heterogeneous
=> not connected
=> unFAIR
• easy to model
• extremely flexible / easily adaptable („re-shaping the graph“) vs. static SQL model
• scalable (billions of nodes + relationships on a single machine)
• easy to query (cyclic dependencies)
• GraphDataScience library + graph embeddings
DZDconnect: stats
• PROD-Server: 323m nodes, 1.1bn relationships => 480GB
• DEV-Server: 1.1bn nodes, 4.8bn relationships
• Single server (60 CPUs, 256GB memory, only SSDs)
• 4 developers
• Neo4j enterprise (live backup, GDS)
• UI: flask web server, SemSpect, Neo4j browser
• Visualization for interactive browsing (SemSpect by derive GmbH)
• Bloom (semi-natural-language queries)
Strata Data Award finalist 2019
bytes4diabetes Award 2020
Graphie Award 2018
We have
DB role model
DZDconnect: data integration + ML
Gene -[CODES]-> RNA -[CODES]-> Protein
Gene -[CODES*]-> Protein
• Python
• Py2Neo, GraphIO
• Docker Pipeline for orchestration (open-source by DZD)
• Based on integrated data => annotate / enrich
• text matching + Natural Language Processing
• „shortcuts“ for queries (reduce #hops)
• inferring knowledge
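The „shortcuts“ idea (fewer hops per query) can be sketched as materializing a direct edge for every two-hop CODES chain. This is only an illustration under assumptions: the triples, the identifiers and the relationship name `CODES_SHORTCUT` are hypothetical, and the slides do not say how the shortcuts are built in DZDconnect.

```python
# Toy (source, relationship, target) triples; all identifiers are invented.
TRIPLES = [
    ("gene:G1", "CODES", "transcript:T1"),
    ("gene:G1", "CODES", "transcript:T2"),
    ("transcript:T1", "CODES", "protein:P1"),
    ("transcript:T2", "CODES", "protein:P2"),
    ("gene:G2", "CODES", "transcript:T3"),  # no protein attached, so no shortcut
]


def add_shortcuts(triples, rel="CODES", shortcut="CODES_SHORTCUT"):
    """Return one shortcut triple for every two-hop chain a -rel-> b -rel-> c."""
    outgoing = {}
    for source, relationship, target in triples:
        if relationship == rel:
            outgoing.setdefault(source, set()).add(target)
    shortcuts = set()
    for source, middles in outgoing.items():
        for middle in middles:
            for target in outgoing.get(middle, ()):
                shortcuts.add((source, shortcut, target))
    return sorted(shortcuts)


SHORTCUTS = add_shortcuts(TRIPLES)
for triple in SHORTCUTS:
    print(triple)
```

A query that previously walked Gene -> Transcript -> Protein can then follow a single relationship, at the cost of keeping the shortcut edges in sync with the underlying data.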
The Challenge
User with a specific input => specific output
Scientist -> multi-omics experiment -> Flask app -> output
The Challenge
User: „start somewhere -> freely explore the knowledge“
Scientist or medical doctor -> start from any node -> SemSpect interactive browsing
The Challenge
User with data analysis skills / computer scientist
Scientist -> start from any node -> Cypher query language, Graph Data Science
Use case 1
Handle mapping identifiers of molecular entities
Knowledge Graph
Query „friends of a friend“ on a gene level
Example: diabetes-relevant gene ‚TCF7L2‘
match path=(g:Gene{sid:'TCF7L2'})-[:MAPS|SYNONYM*0..2]-(g1:Gene) return path
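What the variable-length pattern `[:MAPS|SYNONYM*0..2]` does can be sketched in plain Python: follow only the allowed relationship types, ignore direction, and stop after two hops. The link data below is hypothetical (only 'TCF7L2' comes from the slide). Unlike the Cypher query, the sketch returns reached nodes with their hop distance instead of full paths, and it does not check node labels.

```python
# Hypothetical identifier links; only 'TCF7L2' appears in the original slide.
LINKS = [
    ("TCF7L2", "MAPS", "id-a"),
    ("id-a", "SYNONYM", "id-b"),
    ("id-b", "MAPS", "id-c"),            # three hops away: outside 0..2
    ("TCF7L2", "CODES", "transcript-x"),  # relationship type not allowed
]


def within_hops(links, start, allowed=("MAPS", "SYNONYM"), max_hops=2):
    """Map each node reachable in 0..max_hops hops to its hop distance."""
    adjacency = {}
    for source, relationship, target in links:
        if relationship in allowed:
            adjacency.setdefault(source, set()).add(target)
            adjacency.setdefault(target, set()).add(source)
    reached = {start: 0}  # zero hops: the start node itself
    frontier = [start]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in sorted(adjacency.get(node, ())):
                if neighbour not in reached:
                    reached[neighbour] = hop
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return reached


print(within_hops(LINKS, "TCF7L2"))
```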
Use case 2
Find information that is NOW connected
Knowledge Graph
Query for SNPs (mutations) associated with diabetes
Output: relevant protein and its function (ontology terms)
match (tr:Trait)
where tr.name contains 'diabetes mellitus'
with tr as disease
match path=(disease)<-[:ASSOCIATED_WITH_TRAIT]-(asso:Association)<-[:SNP_HAS_ASSOCIATION]-(snp:SNP)
-[:SNP_HAS_GENE]-(gene:Gene)-[:MAPS]-(g1:Gene)-[x:CODES]->(transcript:Transcript)-[:CODES]->
(prot:Protein)-[:ASSOCIATION]->(term:Term)--(o:Ontology)
return path
Use case 3
Using graph algorithms to infer new insights
Natural Language Processing
Ontologies
Knowledge Graph
Google’s PageRank algorithm - find the most relevant gene
Finding ACE2 - the receptor the SARS-CoV-2 virus uses to enter the cell
• 140,000 abstracts from COVID-19-related publications
• Named entity recognition of gene names
• PageRank identified ‚ACE2‘ as the most relevant gene
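The ranking step can be sketched with a plain power-iteration PageRank. This is not the Graph Data Science library implementation used in DZDconnect; the abstract-to-gene edges and the names `GENE_X` and `GENE_Y` are made-up toy data, and how the real graph was built from the abstracts is not stated in the slides.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Toy data: each abstract points to the genes it mentions (invented).
MENTIONS = [
    ("abstract-1", "ACE2"), ("abstract-1", "GENE_X"),
    ("abstract-2", "ACE2"),
    ("abstract-3", "ACE2"), ("abstract-3", "GENE_Y"),
]

RANKS = pagerank(MENTIONS)
print(max(RANKS, key=RANKS.get))
```

In this toy graph 'ACE2' ranks highest simply because most abstracts mention it; on a real literature graph the score also depends on how the mentioning nodes themselves are ranked.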
k-nearest neighbour clustering with k=5, representing the 5 diabetes subtypes
[Figure: patients 01 to 05, rearranged by graph algorithms]
subphenotyping of diabetic patients
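The neighbour search behind k-nearest neighbours can be sketched as follows. The two-dimensional feature vectors are invented for illustration (the slides do not show which patient features are used), and k=1 is chosen only because the toy set has five patients; the slide itself uses k=5.

```python
import math

# Hypothetical feature vectors per patient (for example, two scaled lab values).
PATIENTS = {
    "patient 01": (1.0, 1.0),
    "patient 02": (1.2, 0.9),
    "patient 03": (5.0, 5.1),
    "patient 04": (5.2, 4.9),
    "patient 05": (9.0, 1.0),
}


def knn_graph(features, k):
    """Link every item to its k nearest neighbours by Euclidean distance."""
    graph = {}
    for name, vector in features.items():
        distances = [
            (math.dist(vector, other_vector), other)
            for other, other_vector in features.items()
            if other != name
        ]
        graph[name] = [other for _, other in sorted(distances)[:k]]
    return graph


NEAREST = knn_graph(PATIENTS, k=1)
for patient, neighbours in NEAREST.items():
    print(patient, "->", neighbours)
```

The resulting neighbour links form a similarity graph; grouping patients into subtypes is a further step on top of that graph.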
DZDconnect: connect patient data with knowledge graph
Transcript, Gene, Synonyms, Abstract, PubMed, Article, Keyword, MeSH-term, Ontology term
Hello role-model :-)
Take home message
• Knowledge graph
• as single point of truth
• connect in-house data
• scalability
• infer new insights
• Use cases:
• simple and advanced (Cypher) queries
• Graph Data Science library (PageRank, kNN)
• Node embeddings for complex data
• NLP
• Visualization of graph
• different users
• flask app, browser, SemSpect,…