Drug Repurposing using Deep Learning on Knowledge Graphs

Or how to leverage AI to recycle (old) new drugs

About Us
Alex Thomas is a principal data scientist at Wisecube. He's
used natural language processing and machine learning
with clinical data, identity data, employer and jobseeker
data, and now biochemical data. Alex is also the author of
Natural Language Processing with Spark NLP.
Vishnu is the CTO and Founder of Wisecube AI and has over two
decades of experience building data science teams and
platforms. Vishnu has extensive experience with various graph
databases including Neo4J, TitanDB (now JanusGraph) and
more recently OrientDB and AWS Neptune.

Drug Discovery is Broken
- Every year, around US$200 billion is
spent globally on biomedical
research
- 75% of potential drug target
research could not be reproduced
- The number of new drugs approved per
billion US$ spent on R&D has halved
roughly every 9 years since 1950
- This trend is now called Eroom’s
Law (the opposite of Moore’s Law)

Drug Repurposing: looking for (old) new cures
Given the high attrition rates, substantial costs and
slow pace of new drug discovery and development,
repurposing of 'old' drugs is a viable alternative.
Repurposing drugs to treat both common and rare
diseases is becoming increasingly attractive
because it involves the use of de-risked
compounds.
Various data-driven and experimental approaches
have been suggested for the identification of
repurposable drug candidates.

AI (NLP + Knowledge Graphs + Deep Graph Learning) to the rescue
Wisecube works with research
and pharmaceutical
organizations to help leverage
the power of AI to accelerate
drug discovery and repurposing.
We are currently working with
St. John’s Institute to repurpose
drug candidates.

Wisecube Drug Repurposing Pipeline Overview

Pipeline Deep Dive
● Datasets
○ Ingesting Data
○ Graph Building
○ Link Prediction

Datasets
❏ Drug Repurposing Knowledge
Graph (DRKG)
❏ “Drug Repurposing Knowledge Graph (DRKG) is a comprehensive
biological knowledge graph relating genes, compounds, diseases,
biological processes, side effects and symptoms.”
❏ https://github.com/gnn4dr/DRKG
❏ ChEMBL
❏ “ChEMBL is a manually curated database of bioactive molecules with
drug-like properties.”
❏ https://www.ebi.ac.uk/chembl/
❏ PubChem
❏ “PubChem is an open chemistry database at the National Institutes of
Health (NIH).”
❏ https://pubchemdocs.ncbi.nlm.nih.gov/about

Datasets: DRKG
❏ DrugBank
❏ “DrugBank is a pharmaceutical knowledge base that is enabling major advances across the data-driven medicine
industry.”
❏ Link: https://go.drugbank.com/
❏ GNBR
❏ “A global network of biomedical relationships derived from text”
❏ https://zenodo.org/record/1134693#.WqQe1GbVSL9
❏ Hetionet
❏ “Hetionet is an integrative network of biomedical knowledge assembled from 29 different databases of genes,
compounds, diseases, and more.”
❏ https://het.io/
❏ StringDB
❏ “STRING is a database of known and predicted protein-protein interactions.”
❏ https://string-db.org/cgi/about
❏ IntAct
❏ “IntAct provides a freely available, open source database system and analysis tools for molecular interaction data.”
❏ https://www.ebi.ac.uk/intact/
❏ DGIdb
❏ “[I]nformation on drug-gene interactions and the druggable genome, mined from over thirty trusted sources.”
❏ https://www.dgidb.org/
[Figure: Hetionet]

Pipeline Deep Dive
✓ Datasets
● Ingesting Data
○ Graph Building
○ Link Prediction

Ingesting Data
❏ Unifying the data
❏ Loading the data
❏ Post-processing the data

Ingesting Data: Unification
❏ DrugBankID -> NCBI CID -> ChEMBLID
❏ PUG REST API
❏ https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
❏ PUG VIEW REST API
❏ https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
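The identifier chain above can be sketched with a few helper functions. This is a minimal sketch, not the pipeline's actual code: it assumes DrugBank accessions are exposed in PubChem as depositor `RegistryID` cross-references, and it assumes the ChEMBL ID can be read from the compound's synonym list via PUG REST (the slide also lists PUG VIEW, which is not used here). The function names are hypothetical, and no network call is made; the URLs would be fetched with an HTTP client of your choice.

```python
from urllib.parse import quote

# Base URL of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(registry_id):
    """URL that asks PUG REST for the CIDs cross-referenced to a registry ID.

    Assumption: a DrugBank accession such as DB00945 is stored as a
    depositor RegistryID.
    """
    return f"{PUG_REST}/compound/xref/RegistryID/{quote(registry_id)}/cids/JSON"


def synonyms_url(cid):
    """URL that asks PUG REST for all synonyms of a compound CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/synonyms/JSON"


def chembl_ids_from_synonyms(payload):
    """Pick ChEMBL-style identifiers out of a decoded synonyms response."""
    found = []
    for info in payload.get("InformationList", {}).get("Information", []):
        for synonym in info.get("Synonym", []):
            name = synonym.upper()
            if name.startswith("CHEMBL") and name[6:].isdigit():
                found.append(name)
    return found


# Hand-written stand-in for a decoded synonyms response (aspirin, CID 2244).
sample = {"InformationList": {"Information": [
    {"CID": 2244, "Synonym": ["aspirin", "CHEMBL25", "50-78-2"]}
]}}
chembl_ids = chembl_ids_from_synonyms(sample)
```

Keeping URL construction and response parsing separate from the HTTP call makes the mapping step easy to test offline and to rate-limit independently.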

Ingesting Data: Loading
❏ Ingest into Graph DB
❏ Neptune
❏ CosmosDB
❏ Any Graph DB which supports Gremlin
❏ Graph DB vs Triple Store
❏ Most open data is in RDF triple formats (RDF/XML, Turtle,
N-Triples)
❏ Modern graph DBs are faster than triple stores
@prefix sio: <http://semanticscience.org/resource/> .
@prefix compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/> .
@prefix descriptor: <http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/> .
compound:CID400516 sio:has-attribute
descriptor:CID400516_Isomeric_SMILES ,
descriptor:CID400516_Isotope_Atom_Count ,
descriptor:CID400516_Molecular_Formula ,
descriptor:CID400516_Molecular_Weight ,
descriptor:CID400516_Mono_Isotopic_Weight ,
descriptor:CID400516_Non-hydrogen_Atom_Count ,
~id    ~label    articles:String[]  source_ids:String[]  name:String        SMILES:String
8647   COMPOUND  13961;...          CHEMBL1200689        Nitric oxide       [N]=O
344    COMPOUND  268975;...         CHEMBL142438         Nitrogen           N#N
18030  COMPOUND  10081;...          CHEMBL925            TYROSINE           N[C@@H](Cc1ccc(O)cc1)C(=O)O
1534   COMPOUND  211538;...         CHEMBL1616046        HYPOCHLOROUS ACID  OCl
18800  COMPOUND  13464;...          CHEMBL978            Methacholine       CC(=O)OC(C)C[N+](C)(C)C
26747  COMPOUND  226005;....        CHEMBL863            Cysteine           N[C@@H](CS)C(=O)O
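A vertex file with the column layout shown above could be produced along these lines. This is a sketch under assumptions: the header names are copied from the slide, array-valued columns are written as semicolon-separated strings, and the input dictionaries, the function name, and the article IDs `a1` and `a2` are hypothetical placeholders rather than real data.

```python
import csv
import io

# Column headers as shown on the slide: ~id and ~label are system columns,
# the remaining ones are typed properties ([] marks an array property).
HEADER = ["~id", "~label", "articles:String[]", "source_ids:String[]",
          "name:String", "SMILES:String"]


def compounds_to_csv(compounds):
    """Serialize compound dictionaries into a bulk-load style vertex CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for compound in compounds:
        writer.writerow([
            compound["id"],
            "COMPOUND",
            ";".join(compound["articles"]),     # array values joined with ';'
            ";".join(compound["source_ids"]),
            compound["name"],
            compound["smiles"],
        ])
    return buffer.getvalue()


# Hypothetical input record; the article IDs are placeholders.
vertex_csv = compounds_to_csv([
    {"id": "344", "articles": ["a1", "a2"], "source_ids": ["CHEMBL142438"],
     "name": "Nitrogen", "smiles": "N#N"},
])
```

Using the `csv` module rather than string concatenation means names or SMILES strings that contain commas or quotes are escaped correctly.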

Ingesting Data: Post-processing
1. Save predictions
2. Experts review
3. Ingest new edges

Pipeline Deep Dive
✓ Datasets
✓ Ingesting Data
● Graph Building
○ Link Prediction

Graph Building
❏ Explicit Relationships
❏ Literature-based Relationships
❏ Link Prediction Relationships

Graph Building: Explicit Relationships
❏ Explicit Relationships
❏ Triples data
❏ Inherently represents relationships
❏ Tabular data (flattened graph)
❏ 2 (or more) entities or IDs in each row
❏ Need to determine which fields are associated with which entity or edge
❏ RDBMS data
❏ Foreign keys
❏ Join tables
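For the tabular case, deciding which fields belong to which entity or edge can be sketched as follows. This is an illustrative sketch only: the column names, labels, and IDs are hypothetical, and every column that is not an entity ID is assumed to describe the edge.

```python
def rows_to_graph(rows, source_col, target_col,
                  source_label, target_label, edge_label):
    """Turn flattened rows into a node-label map and a list of edges.

    Each row is expected to carry two entity IDs; all remaining fields
    are attached to the edge as properties (an assumption of this sketch).
    """
    nodes = {}
    edges = []
    for row in rows:
        source, target = row[source_col], row[target_col]
        nodes[source] = source_label
        nodes[target] = target_label
        properties = {key: value for key, value in row.items()
                      if key not in (source_col, target_col)}
        edges.append((source, edge_label, target, properties))
    return nodes, edges


# Hypothetical flattened drug-gene table.
table = [
    {"drug_id": "D1", "gene_id": "G1", "action": "inhibitor"},
    {"drug_id": "D1", "gene_id": "G2", "action": "agonist"},
]
nodes, edges = rows_to_graph(table, "drug_id", "gene_id",
                             "COMPOUND", "GENE", "TARGETS")
```

The same shape applies to RDBMS data: a join table supplies the two ID columns, and its remaining columns become edge properties.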

Graph Building: from Literature
❏ Heuristic vs Model
❏ Relationship extraction data sets are rare compared to NER data sets
❏ Creating labels requires experts
❏ Heuristics with labels
❏ Stated relationships may span across multiple sentences
❏ Certain styles of language are excessively verbose
❏ Especially academic language

Graph Building: from Literature
1. Given two terms, u and v
2. Calculate TF.IDF for extracted entities
3. Sum TF.IDF for u and v over all documents
• TF.IDF(u), TF.IDF(v)
4. Identify documents where u and v share a
context
• Sentence, window, paragraph, whole document
5. Sum TF.IDF for u and v over all documents
where u and v share a context
• TF.IDF(u,v)
6. The weight for the potential u~v edge is
the ratio of these two sums
7. Accept edges over chosen threshold
• Top 10%
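The steps above can be sketched in plain Python. Several choices here are assumptions rather than details from the slides: the shared context is the whole document, the TF.IDF variant uses a smoothed IDF, and the ratio is read as TF.IDF(u,v) divided by TF.IDF(u) + TF.IDF(v). The entity names in the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_per_document(docs):
    """Return one {entity: TF.IDF} dictionary per document.

    docs is a list of documents, each a list of extracted entity strings.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for entities in docs:
        doc_freq.update(set(entities))
    table = []
    for entities in docs:
        counts = Counter(entities)
        table.append({
            entity: (count / len(entities))
            * (math.log((1 + n_docs) / (1 + doc_freq[entity])) + 1)
            for entity, count in counts.items()
        })
    return table


def edge_weight(table, u, v):
    """Ratio of TF.IDF mass in shared documents to TF.IDF mass overall."""
    total = sum(row.get(u, 0.0) + row.get(v, 0.0) for row in table)
    shared = sum(row[u] + row[v] for row in table if u in row and v in row)
    return shared / total if total else 0.0


def top_fraction(weights, fraction=0.1):
    """Keep the highest-weighted share of candidate edges (at least one)."""
    keep = max(1, int(len(weights) * fraction))
    return sorted(weights, key=weights.get, reverse=True)[:keep]


# Toy corpus of extracted entities, one list per document.
corpus = [["drug_a", "gene_x"], ["drug_a", "gene_x"], ["drug_a", "disease_y"]]
scores = tfidf_per_document(corpus)
weight = edge_weight(scores, "drug_a", "gene_x")
```

A weight of 1 means the two terms only ever appear together, and 0 means they never share a context. Narrower contexts such as a sentence or a window would only change how the documents are split before calling `tfidf_per_document`.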

Pipeline Deep Dive
✓ Datasets
✓ Ingesting Data
✓ Graph Building
● Link Prediction

Link Prediction
❏ Untyped models
❏ Jaccard
❏ DeepWalk
❏ Typed models
❏ TransE-L2
❏ DGL
❏ “Deep Graph Library (DGL) is a Python package
built for easy implementation of graph neural
network model family, on top of existing DL
frameworks (currently supporting PyTorch, MXNet
and TensorFlow).”
❏ https://docs.dgl.ai/

Link Prediction: Jaccard
❏ Intuition
❏ Unconnected nodes that share many neighbors may themselves be connected
❏ Pros
❏ No training necessary
❏ Cons
❏ The intuition is often unrealistic
❏ Jaccard similarity
❏ For nodes u and v
❏ N(u): set of nodes connected to u
❏ N(v): set of nodes connected to v
❏ Jaccard similarity is |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
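The Jaccard score for a candidate link takes only a few lines. This is a minimal sketch that assumes the graph is held as a dictionary mapping each node to the set of its neighbors; the node names are hypothetical.

```python
def jaccard(adjacency, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    neighbors_u = adjacency.get(u, set())
    neighbors_v = adjacency.get(v, set())
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0  # neither node has any neighbor
    return len(neighbors_u & neighbors_v) / len(union)


# Hypothetical toy graph: two drugs that share one gene neighbor.
graph = {
    "drug_a": {"gene_1", "gene_2"},
    "drug_b": {"gene_2", "gene_3"},
}
similarity = jaccard(graph, "drug_a", "drug_b")
```

Here the two drugs share one neighbor out of three distinct neighbors overall, so the score is one third.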

DeepWalk
❏ Intuition
❏ A node can be characterized by the paths it occurs in
❏ Creates embeddings (vector representations)
❏ Pros
❏ Easy to train as it relies on models used in NLP
❏ Cons
❏ Does not take into account the edge type
❏ DeepWalk
❏ For each node u, generate K random paths of length L with u in the
middle of the path
❏ Using these paths, build a model to predict u given the nodes before
and after it
❏ Model
❏ Build a model to predict if two nodes (represented by their
embeddings) are connected
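The path-generation step can be sketched as below. This covers only the random walks, not the embedding model, and makes several assumptions: the graph is undirected and stored as a dictionary of neighbor lists, each path is built from two independent random walks that start at u (one of them reversed so that u sits in the middle), and the function and parameter names are hypothetical.

```python
import random


def centered_walk(adjacency, u, half_length, rng):
    """Random path with u in the middle and up to half_length nodes per side."""
    def walk_from(start):
        path, node = [], start
        for _ in range(half_length):
            neighbors = adjacency.get(node)
            if not neighbors:
                break  # dead end: stop this side early
            node = rng.choice(neighbors)
            path.append(node)
        return path

    before = walk_from(u)[::-1]  # reversed so the path ends at u
    after = walk_from(u)
    return before + [u] + after


def generate_walks(adjacency, walks_per_node, half_length, seed=0):
    """Generate walks_per_node centered paths for every node in the graph."""
    rng = random.Random(seed)
    return [centered_walk(adjacency, node, half_length, rng)
            for node in adjacency
            for _ in range(walks_per_node)]


# Toy undirected graph: a cycle of four nodes.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walks = generate_walks(cycle, walks_per_node=3, half_length=2)
```

The resulting paths play the role of sentences: they can be fed to a word-embedding style model that predicts the middle node from its context, which is the NLP model the slide refers to.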

TransE L2
❏ Intuition
❏ Learn embeddings that directly predict edges
❏ Pros
❏ Directly predicts edges
❏ After embeddings are built, no additional model is needed
❏ Learns representations for relationships
❏ Cons
❏ More sophisticated model (more parameters) takes longer to train
❏ TransE L2
❏ u, v are node representations (vectors)
❏ r is an edge type representation
❏ Train a model that assumes ||u + r - v||_2 = 0 if u and v are connected by an edge of type r
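Once embeddings exist, scoring a candidate edge under the TransE L2 assumption is a distance computation. The sketch below shows only the scoring and ranking step, with hand-picked toy vectors standing in for trained embeddings. Training, for example with DGL, is out of scope here, and the function names are hypothetical.

```python
import math


def transe_l2_score(u, r, v):
    """Negative L2 distance ||u + r - v||_2; higher means more plausible."""
    return -math.sqrt(sum((ui + ri - vi) ** 2 for ui, ri, vi in zip(u, r, v)))


def rank_tails(u, r, candidates):
    """Order candidate tail nodes from most to least plausible."""
    return sorted(candidates,
                  key=lambda name: transe_l2_score(u, r, candidates[name]),
                  reverse=True)


# Toy vectors, not trained embeddings.
drug = [0.0, 0.0]
treats = [1.0, 1.0]
diseases = {"disease_a": [1.0, 1.0], "disease_b": [4.0, 5.0]}
ranking = rank_tails(drug, treats, diseases)
```

Because the relation vector is part of the score, the same pair of nodes can rank differently under different edge types, which is what the untyped models above cannot express.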

Research Case Study: Early Results
We worked with St. John’s
Institute (part of Providence
Healthcare) to repurpose
drugs to inhibit a kinase
target related to Alzheimer’s
disease and have submitted
the first round of drug
candidates for expert review.

In Summary
• Drug discovery scientists are drowning
in disjointed datasets, and bringing new
drugs to market is expensive and slow
• Drug repurposing is one way to bring
new cures using old drugs
• NLP, knowledge graphs and deep
graph learning are key to leveraging
the combined knowledge of
experimental and literature-based
evidence for accelerating drug
repurposing and research

