Drug Repurposing using Deep Learning
on Knowledge Graphs
Or how to leverage AI to recycle (old) new
drugs
About Us
Alex Thomas is a principal data scientist at Wisecube. He's
used natural language processing and machine learning
with clinical data, identity data, employer and jobseeker
data, and now biochemical data. Alex is also the author of
Natural Language Processing with Spark NLP.
Vishnu is the CTO and Founder of Wisecube AI and has over two
decades of experience building data science teams and
platforms. Vishnu has extensive experience with various graph
databases including Neo4J, TitanDB (now JanusGraph) and
more recently OrientDB and AWS Neptune.
Drug Discovery is Broken
- Every year, around US$200 billion is
spent globally on biomedical
research
- 75% of potential drug target
research could not be reproduced
- New drugs approved / Billion$ spent
on R&D has halved every 9 years
since 1950
- This trend is now called Eroom’s
Law (opposite of Moore’s law)
Drug Repurposing: looking for (old) new cures
Given the high attrition rates, substantial costs and
slow pace of new drug discovery and development,
repurposing of 'old' drugs is a viable alternative.
Repurposing drugs to treat both common and rare
diseases is increasingly becoming an attractive
proposition because it involves the use of de-risked
compounds.
Various data-driven and experimental approaches
have been suggested for the identification of
repurposable drug candidates.
AI (NLP + Knowledge Graphs + Deep Graph Learning) to the rescue
Wisecube works with Research
and Pharmaceutical
organizations to help leverage
the power of AI to accelerate
drug discovery and repurposing
We are currently working with
St. John’s Institute to repurpose
drug candidates
Wisecube Drug Repurposing Pipeline Overview
Pipeline Deep Dive
● Datasets
○ Ingesting Data
○ Graph Building
○ Link Prediction
Datasets
❏ Drug Repurposing Knowledge
Graph (DRKG)
❏ “Drug Repurposing Knowledge Graph (DRKG) is a comprehensive
biological knowledge graph relating genes, compounds, diseases,
biological processes, side effects and symptoms.”
❏ https://github.com/gnn4dr/DRKG
❏ ChEMBL
❏ “ChEMBL is a manually curated database of bioactive molecules with
drug-like properties.”
❏ https://www.ebi.ac.uk/chembl/
❏ PubChem
❏ “PubChem is an open chemistry database at the National Institutes of
Health (NIH).”
❏ https://pubchemdocs.ncbi.nlm.nih.gov/about
Datasets: DRKG
❏ DrugBank
❏ “DrugBank is a pharmaceutical knowledge base that is enabling major advances across the data-driven medicine
industry.”
❏ Link: https://go.drugbank.com/
❏ GNBR
❏ “A global network of biomedical relationships derived from text”
❏ https://zenodo.org/record/1134693#.WqQe1GbVSL9
❏ Hetionet
❏ “Hetionet is an integrative network of biomedical knowledge assembled from 29 different databases of genes,
compounds, diseases, and more.”
❏ https://het.io/
❏ StringDB
❏ “STRING is a database of known and predicted protein-protein interactions.”
❏ https://string-db.org/cgi/about
❏ IntAct
❏ “IntAct provides a freely available, open source database system and analysis tools for molecular interaction data.”
❏ https://www.ebi.ac.uk/intact/
❏ DGIdb
❏ “[I]nformation on drug-gene interactions and the druggable genome, mined from over thirty trusted sources.”
❏ https://www.dgidb.org/
[Figure: Hetionet]
Pipeline Deep Dive
✓ Datasets
● Ingesting Data
○ Graph Building
○ Link Prediction
Ingesting Data
❏ Unifying the data
❏ Loading the data
❏ Post-processing the data
Ingesting Data: Unification
❏ DrugBankID -> NCBI CID -> ChEMBLID
❏ PUG REST API
❏ https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
❏ PUG VIEW REST API
❏ https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
NCBI CID <- DrugBankID
NCBI CID -> ChEMBLID
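The ID chain above can be sketched against PUG REST. This is a minimal sketch, not the pipeline's actual code: it assumes DrugBank IDs are searchable as `RegistryID` cross-references, that ChEMBL IDs appear among a compound's synonyms, and that responses have the JSON shapes parsed below. The helper names are hypothetical.

```python
import json
import re
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(drugbank_id):
    # Assumption: DrugBank IDs are exposed as RegistryID cross-references.
    return f"{PUG_REST}/compound/xref/RegistryID/{quote(drugbank_id)}/cids/JSON"


def synonyms_url(cid):
    return f"{PUG_REST}/compound/cid/{int(cid)}/synonyms/JSON"


def extract_cids(payload):
    # Assumed response shape: {"IdentifierList": {"CID": [...]}}
    return payload.get("IdentifierList", {}).get("CID", [])


def extract_chembl_ids(payload):
    # Assumed shape: {"InformationList": {"Information": [{"Synonym": [...]}]}}
    infos = payload.get("InformationList", {}).get("Information", [])
    return [
        synonym
        for info in infos
        for synonym in info.get("Synonym", [])
        if re.fullmatch(r"CHEMBL\d+", synonym)
    ]


def fetch_json(url):
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def drugbank_to_chembl(drugbank_id):
    """Map one DrugBank ID to {CID: [ChEMBL IDs]}; requires network access."""
    cids = extract_cids(fetch_json(cid_lookup_url(drugbank_id)))
    return {cid: extract_chembl_ids(fetch_json(synonyms_url(cid))) for cid in cids}
```

Keeping URL construction and response parsing separate from the network call makes the mapping logic checkable offline against saved responses.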
Ingesting Data: Loading
❏ Ingest into Graph DB
❏ Neptune
❏ CosmosDB
❏ Any Graph DB which supports Gremlin
❏ Graph DB vs Triple Store
❏ Most open data is in RDF triples formats (RDF/XML, Turtle,
N-Triples)
❏ Modern graph DBs are faster than triple stores
@prefix sio: <http://semanticscience.org/resource/> .
@prefix compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/> .
@prefix descriptor: <http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/> .
compound:CID400516 sio:has-attribute
    descriptor:CID400516_Isomeric_SMILES ,
    descriptor:CID400516_Isotope_Atom_Count ,
    descriptor:CID400516_Molecular_Formula ,
    descriptor:CID400516_Molecular_Weight ,
    descriptor:CID400516_Mono_Isotopic_Weight ,
    descriptor:CID400516_Non-hydrogen_Atom_Count ,
~id    ~label    articles:String[]  source_ids:String[]  name:String        SMILES:String
8647   COMPOUND  13961;...          CHEMBL1200689        Nitric oxide       [N]=O
344    COMPOUND  268975;...         CHEMBL142438         Nitrogen           N#N
18030  COMPOUND  10081;...          CHEMBL925            TYROSINE           N[C@@H](Cc1ccc(O)cc1)C(=O)O
1534   COMPOUND  211538;...         CHEMBL1616046        HYPOCHLOROUS ACID  OCl
18800  COMPOUND  13464;...          CHEMBL978            Methacholine       CC(=O)OC(C)C[N+](C)(C)C
26747  COMPOUND  226005;....        CHEMBL863            Cysteine           N[C@@H](CS)C(=O)O
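A vertex file with the header shown above could be produced along these lines. This is a minimal sketch, assuming the Gremlin bulk-load CSV convention of `~id` and `~label` system columns and semicolon-separated array values; the `vertices_to_csv` helper and its dictionary keys are hypothetical, and the elided `articles` column is left out.

```python
import csv
import io

# Column headers follow the table above, minus the elided articles column.
HEADER = ["~id", "~label", "source_ids:String[]", "name:String", "SMILES:String"]


def vertices_to_csv(vertices):
    """Serialize vertex dictionaries to bulk-load style CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for vertex in vertices:
        writer.writerow([
            vertex["id"],
            vertex["label"],
            ";".join(vertex["source_ids"]),  # array values joined with ';'
            vertex["name"],
            vertex["smiles"],
        ])
    return buffer.getvalue()


compounds = [
    {"id": 8647, "label": "COMPOUND", "source_ids": ["CHEMBL1200689"],
     "name": "Nitric oxide", "smiles": "[N]=O"},
    {"id": 344, "label": "COMPOUND", "source_ids": ["CHEMBL142438"],
     "name": "Nitrogen", "smiles": "N#N"},
]
print(vertices_to_csv(compounds))
```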
Ingesting Data: Post-processing
1. Save predictions
2. Experts review
3. Ingest new edges
Pipeline Deep Dive
✓ Datasets
✓ Ingesting Data
● Graph Building
○ Link Prediction
Graph Building
❏ Explicit Relationships
❏ Literature-based Relationships
❏ Link Prediction Relationships
Graph Building: Explicit Relationships
❏ Explicit Relationships
❏ Triples data
❏ Inherently represents relationships
❏ Tabular data (flattened graph)
❏ 2 (or more) entities or IDs in each row
❏ Need to determine which fields are associated with which entity or edge
❏ RDBMS data
❏ Foreign keys
❏ Join tables
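For tabular data, the core step is deciding which columns name entities and which describe the edge. A minimal sketch of that mapping follows; the column names and rows are made-up placeholders, not data from any of the listed sources.

```python
def rows_to_graph(rows, source_col, target_col, edge_col):
    """Turn flattened rows into a node set and (source, edge_type, target) triples."""
    nodes = set()
    edges = []
    for row in rows:
        source, target = row[source_col], row[target_col]
        nodes.update((source, target))
        edges.append((source, row[edge_col], target))
    return nodes, edges


# Hypothetical drug-gene interaction rows, one relationship per row.
rows = [
    {"drug_id": "DRUG_A", "gene_id": "GENE_X", "interaction": "inhibits"},
    {"drug_id": "DRUG_A", "gene_id": "GENE_Y", "interaction": "binds"},
    {"drug_id": "DRUG_B", "gene_id": "GENE_X", "interaction": "activates"},
]
nodes, edges = rows_to_graph(rows, "drug_id", "gene_id", "interaction")
print(sorted(nodes))
print(edges)
```

The same function covers RDBMS join tables, where the two foreign-key columns play the role of source and target.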
Graph Building: from Literature
❏ Heuristic vs Model
❏ Relationship extraction data sets are rare compared to NER data sets
❏ Creating labels requires experts
❏ Heuristics with labels
❏ Stated relationships may span across multiple sentences
❏ Certain styles of language are excessively verbose
❏ Especially academic language
Graph Building: from Literature
1. Given two terms, u and v
2. Calculate TF.IDF for extracted entities
3. Sum TF.IDF for u and v over all documents
• TF.IDF(u), TF.IDF(v)
4. Identify documents where u and v share a
context
• Sentence, window, paragraph, whole document
5. Sum TF.IDF for u and v over all documents
where u and v share a context
• TF.IDF(u,v)
6. The weight for the potential u~v edges is
the ratio of these two sums
7. Accept edges over chosen threshold
• Top 10%
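The steps above can be sketched as follows. This is one reading of the heuristic, with several assumptions: the shared context is the whole document, TF.IDF uses a smoothed IDF, and the edge weight is the co-occurrence sum divided by the all-documents sum. Function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_per_document(docs):
    """docs: list of entity lists. Returns one {entity: tf-idf} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            entity: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[entity])) + 1.0)  # smoothed IDF (assumption)
            for entity, count in counts.items()
        })
    return scores


def edge_weights(docs):
    """Weight of u~v: TF.IDF mass in shared documents over TF.IDF mass in all documents."""
    scores = tfidf_per_document(docs)
    entities = sorted({entity for doc in docs for entity in doc})
    weights = {}
    for u, v in combinations(entities, 2):
        total = sum(s.get(u, 0.0) + s.get(v, 0.0) for s in scores)
        shared = sum(s[u] + s[v] for s in scores if u in s and v in s)
        if shared > 0:
            weights[(u, v)] = shared / total
    return weights


def accept_top_fraction(weights, fraction=0.10):
    """Keep the highest-weighted fraction of candidate edges (at least one)."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


docs = [
    ["aspirin", "cox1"],
    ["aspirin", "cox1", "pain"],
    ["pain", "ibuprofen"],
]
weights = edge_weights(docs)
print(accept_top_fraction(weights))
```

Narrowing the context to a sentence or window only changes which occurrences count as shared; the ratio itself stays the same.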
Pipeline Deep Dive
✓ Datasets
✓ Ingesting Data
✓ Graph Building
● Link Prediction
Link Prediction
❏ Untyped models
❏ Jaccard
❏ DeepWalk
❏ Typed models
❏ TransE-L2
❏ DGL
❏ “Deep Graph Library (DGL) is a Python package
built for easy implementation of graph neural
network model family, on top of existing DL
frameworks (currently supporting PyTorch, MXNet
and TensorFlow).”
❏ https://docs.dgl.ai/
Link Prediction: Jaccard
❏ Intuition
❏ Unconnected nodes that share many of the same neighbors may themselves be connected
❏ Pros
❏ No training necessary
❏ Cons
❏ Intuition is unrealistic
❏ Jaccard similarity
❏ For nodes u and v
❏ N(u): set of nodes connected to u
❏ N(v): set of nodes connected to v
❏ Jaccard similarity is |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
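Jaccard link scoring can be sketched in a few lines. This is a minimal sketch on a made-up toy graph, assuming an undirected graph stored as a dictionary of neighbor sets; the node names are placeholders.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency: node -> set of neighbors."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard(adjacency, u, v):
    """|N(u) ∩ N(v)| / |N(u) ∪ N(v)|, or 0.0 when both neighborhoods are empty."""
    neighbors_u = adjacency.get(u, set())
    neighbors_v = adjacency.get(v, set())
    union = neighbors_u | neighbors_v
    return len(neighbors_u & neighbors_v) / len(union) if union else 0.0


def rank_candidate_links(adjacency):
    """Score every unconnected pair and sort by descending similarity."""
    scored = [
        (jaccard(adjacency, u, v), u, v)
        for u, v in combinations(sorted(adjacency), 2)
        if v not in adjacency[u]
    ]
    return sorted((item for item in scored if item[0] > 0), reverse=True)


# Hypothetical toy graph: two drugs that share gene targets.
adjacency = build_adjacency([
    ("drug_a", "gene_1"), ("drug_a", "gene_2"),
    ("drug_b", "gene_1"), ("drug_b", "gene_2"), ("drug_b", "gene_3"),
])
print(rank_candidate_links(adjacency)[:3])
```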
DeepWalk
❏ Intuition
❏ A node can be characterized by the paths it occurs in
❏ Creates embeddings (vector representations)
❏ Pros
❏ Easy to train as it relies on models used in NLP
❏ Cons
❏ Does not take the edge type into account
❏ DeepWalk
❏ For each node u, generate K random paths of length L with u in the
middle of the path
❏ Using these paths, build a model to predict u given the nodes before
and after it
❏ Model
❏ Build a model to predict if two nodes (represented by their
embeddings) are connected
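The path-generation step can be sketched as below. This is a minimal sketch, assuming an undirected graph as a dictionary of neighbor sets; it only produces (context, target) training pairs and leaves out the embedding model itself, which would be a word2vec-style network. Function names are hypothetical.

```python
import random


def random_walk(adjacency, start, steps, rng):
    """Walk up to `steps` hops from `start`, choosing neighbors uniformly."""
    walk = [start]
    for _ in range(steps):
        neighbors = sorted(adjacency[walk[-1]])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def centered_context(adjacency, u, half_length, rng):
    """Nodes before and after u on a random path that has u in the middle."""
    left = random_walk(adjacency, u, half_length, rng)
    right = random_walk(adjacency, u, half_length, rng)
    before = left[:0:-1]  # reversed walk, excluding u itself
    after = right[1:]
    return before, after


def training_examples(adjacency, paths_per_node, half_length, seed=0):
    """K paths per node, each turned into a (context nodes, target node) pair."""
    rng = random.Random(seed)
    examples = []
    for u in sorted(adjacency):
        for _ in range(paths_per_node):
            before, after = centered_context(adjacency, u, half_length, rng)
            examples.append((before + after, u))
    return examples


# Hypothetical four-node ring: a - b - c - d - a.
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
examples = training_examples(ring, paths_per_node=2, half_length=2)
print(examples[0])
```

Each pair feeds a model that predicts the target node from its context, the same setup used to train word embeddings.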
TransE L2
❏ Intuition
❏ Learn embeddings that directly predict edges
❏ Pros
❏ Directly predicts edges
❏ After embeddings are built, no additional model is needed
❏ Learns representations for relationships
❏ Cons
❏ More sophisticated model (more parameters) takes longer to train
❏ TransE L2
❏ u, v are node representations (vectors)
❏ r is an edge type representation
❏ Train a model that assumes ||u + r - v||_2 = 0 if u and v are connected by an edge of type r
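The TransE L2 assumption can be made concrete with a small scoring sketch. This is a minimal illustration in plain Python, not the DGL implementation: the vectors are made up, and the margin ranking loss is an assumption about how training pushes true triples toward zero distance and corrupted triples away.

```python
import math


def transe_l2_distance(u, r, v):
    """||u + r - v||_2: near zero when (u, r, v) is a plausible edge."""
    return math.sqrt(sum((ui + ri - vi) ** 2 for ui, ri, vi in zip(u, r, v)))


def margin_loss(positive, negative, margin=1.0):
    """Zero once the true triple is closer than the corrupted one by `margin`."""
    return max(0.0, margin + transe_l2_distance(*positive) - transe_l2_distance(*negative))


# Hypothetical 2-d embeddings for a head node, an edge type and two tail nodes.
u = [0.0, 0.0]
r = [1.0, 0.0]
v_true = [1.0, 0.0]
v_corrupt = [4.0, 4.0]

print(transe_l2_distance(u, r, v_true))
print(transe_l2_distance(u, r, v_corrupt))
print(margin_loss((u, r, v_true), (u, r, v_corrupt)))
```

Once trained, candidate edges of type r from u are ranked by this distance alone, which is why no separate link classifier is needed.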
Research Case Study: Early Results
We worked with St. John’s
Institute (Part of Providence
Healthcare) to repurpose
drugs to inhibit a kinase
target related to Alzheimer's
disease and have submitted
the first round of drug
candidates for expert review
In Summary
• Drug discovery scientists are drowning
in disjointed datasets, and bringing new
drugs to market is expensive and slow
• Drug repurposing is one way to bring
new cures using old drugs
• NLP, knowledge graphs and deep
graph learning are key to leveraging
the combined knowledge of
experimental and literature-based
evidence for accelerating drug
repurposing and research
