dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020

Tudor I. Oprea
University of New Mexico, Albuquerque NM
10/23/2020
dkNET: Connecting Researchers to Resources
Via Zoom
Funding: NIH U24 CA224370 & NIH U24 TR002278
http://druggablegenome.net/
http://datascience.unm.edu/

75% of protein research is still
focused on the 10% of genes known
before the human genome was mapped
AM Edwards et al, Nature, 2011
This prompted NIH to start the
Illuminating the Druggable Genome
Initiative

 Informatics, Data Science and
Machine Learning (“AI”) can be
used as follows:
 Diseases: EMR processing,
nosology, ontology, & EMR-based ML
 Targets: drug target selection &
validation, phenotype associations,
ML
 Drugs: Identifying novel therapeutic
modalities using in silico methods
 IDG is developing methods
applicable to each of these 3 areas
8/24/20 revision
Diseases image credit: Julie McMurry, Melissa Haendel (OHSU).
All other images credit: Nature Reviews Drug Discovery cover page

2/4/20 revision
R. Santos et al., Nature Rev. Drug Discov. 2017, 16:19-34 link
We curated 667 human genome-derived
proteins and 226 pathogen-derived
biomolecules through which 1,578 US FDA-
approved drugs act.
This set included 1004 orally formulated
drugs as well as 530 injectable drugs
(approved through June 2016).
Data captured in DrugCentral (link)

2/4/20 revision
RFA-RM-16-026
(DRGC)
GPCRs
U24 DK116195:
Bryan Roth, M.D., Ph.D. (UNC)
Brian Shoichet, Ph.D. (UCSF)
Ion
Channels
U24 DK116214:
Lily Jan, Ph.D. (UCSF)
Michael T. McManus, Ph.D. (UCSF)
Kinases
U24 DK116204:
Gary L. Johnson, Ph.D. (UNC)
RFA-RM-16-025
(RDOC)
Outreach
U24 TR002278:
Stephan C. Schürer, Ph.D. (UMiami)
Tudor Oprea, M.D., Ph.D. (UNM)
Larry A. Sklar, Ph.D. (UNM)
RFA-RM-16-024
(KMC) Data
U24 CA224260:
Avi Ma’ayan, Ph.D. (ISMMS)
U24 CA224370:
Tudor Oprea, M.D., Ph.D. (UNM)
RFA-RM-18-011
(CEIT)
Tools
U01 CA239106: N Kannan, PhD & KJ Kochut (UGA)
U01 CA239108: PN Robinson, MD PhD (JAX), CJ Mungall
(LBL), T Oprea (UNM)
U01 CA239069: G Wu, PhD (OHSU), PG D’Eustachio PhD
(NYU), Lincoln D Stein, PhD (OICR)
T. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link

 Most protein classification schemes are
based on structural and functional criteria.
 For therapeutic development, it is useful to
understand how much and what types of
data are available for a given protein,
thereby highlighting well-studied and
understudied targets.
 Tclin: Proteins annotated as drug targets
 Tchem: Proteins for which potent small
molecules are known
 Tbio: Proteins for which biology is better
understood
 Tdark: These proteins lack antibodies,
publications or Gene RIFs
T. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link 2/10/20 revision
2020 Update: Tdark 31.2%; Tbio 57.7%; Tchem 8%; Tclin 3.1%

4/25/19 revision
T. Oprea, Mammalian Genome, 2019, 30:192-200 https://bit.ly/2NUK0BK
Further information
Email: idg.rdoc@gmail.com
Follow: @DruggableGenome
URLs:
https://druggablegenome.net/
https://commonfund.nih.gov/idg/
IDG Knowledge User-Interface
Email: pharos@mail.nih.gov
Follow: @IDG_Pharos
URL: https://pharos.nih.gov/

2/4/20 revision
Mathias SL et al., IDG F2F Poster 2019

 Tclin proteins are associated
with drug Mechanism of Action
(MoA) – NRDD 2017
 Tchem proteins have
bioactivities in ChEMBL and
DrugCentral, + human curation
for some targets
 Kinases: <= 30nM
 GPCRs: <= 100nM
 Nuclear Receptors: <= 100nM
 Ion Channels: <= 10μM
 Non-IDG Family Targets: <= 1μM
10/19/16 revision
Bioactivities of approved drugs (by Target class)
ChEMBL: database of bioactive chemicals
https://www.ebi.ac.uk/chembl/
DrugCentral: online drug compendium
http://drugcentral.org/
R. Santos et al., Nature Rev.Drug Discov. 2017, 16:19-34 link

 Tbio proteins lack small molecule annotation cf. Tchem criteria,
and satisfy one of these criteria:
 protein is above the cutoff criteria for Tdark
 protein is annotated with a GO Molecular Function or Biological Process
leaf term(s) with an Experimental Evidence code
 protein has confirmed OMIM phenotype(s)
 Tdark (“ignorome”) have little information available, and satisfy
these criteria:
 PubMed text-mining score from Jensen Lab < 5
 <= 3 Gene RIFs
 <= 50 Antibodies available according to antibodypedia.com
8/20/15 revision
T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link
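The TDL criteria on the two slides above can be read as a small decision procedure. The sketch below is a minimal illustration, not the TCRD implementation: the `Target` record, its field names, and the `min_dark_criteria` parameter are hypothetical. In particular, the slides do not state how many of the three Tdark conditions must hold; the default of two is an assumption and can be changed.

```python
from dataclasses import dataclass
from typing import Optional

# Potency cutoffs in nM per target family, as listed on the Tchem slide.
POTENCY_CUTOFF_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear_receptor": 100.0,
    "ion_channel": 10_000.0,  # 10 uM
}
DEFAULT_CUTOFF_NM = 1_000.0  # non-IDG family targets: 1 uM


@dataclass
class Target:
    """Hypothetical record holding the evidence used for TDL assignment."""
    family: str
    has_drug_moa: bool = False               # annotated drug mechanism of action
    best_potency_nm: Optional[float] = None  # most potent known small molecule
    pubmed_score: float = 0.0                # Jensen Lab text-mining score
    gene_rifs: int = 0
    antibodies: int = 0
    has_experimental_go_leaf: bool = False
    has_omim_phenotype: bool = False


def classify_tdl(t: Target, min_dark_criteria: int = 2) -> str:
    """Assign Tclin, Tchem, Tbio or Tdark following the slide criteria."""
    if t.has_drug_moa:
        return "Tclin"
    cutoff = POTENCY_CUTOFF_NM.get(t.family, DEFAULT_CUTOFF_NM)
    if t.best_potency_nm is not None and t.best_potency_nm <= cutoff:
        return "Tchem"
    # Count how many of the three "little information" conditions hold.
    dark_hits = sum([t.pubmed_score < 5, t.gene_rifs <= 3, t.antibodies <= 50])
    above_dark_cutoff = dark_hits < min_dark_criteria  # assumption, see lead-in
    if above_dark_cutoff or t.has_experimental_go_leaf or t.has_omim_phenotype:
        return "Tbio"
    return "Tdark"


# A kinase whose best ligand is at 50 nM misses the 30 nM Tchem cutoff.
example = Target(family="kinase", best_potency_nm=50.0)
print(classify_tdl(example))
```

The checks run in order of decreasing evidence (Tclin, then Tchem, then Tbio), so a protein lands in the most advanced level whose criteria it meets.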

Tdark parameters differ from the other TDLs across the 4 external
metrics (cf. Kruskal-Wallis test with post-hoc pairwise Dunn tests)
2/23/18 revision
T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link

https://rpubs.com/cbologa/TDL7
Tdark:
9199 proteins in 2013
Tclin:
10/12/20 revision
T. Sheils, S.L. Mathias et al., Nucleic Acids Research 2021 doi:10.1093/nar/gkaa993

T. Sheils, S.L. Mathias et al., Nucleic Acids Research 2021 doi:10.1093/nar/gkaa993 10/12/20 revision

2/4/20 revision
Haendel M, et al. Nature Rev. Drug Discov. 2020 19:77-78 link
 We revised the number of RDs from ~7,000 to
10,393 using Disease Ontology, OrphaNet,
GARD, NCIT, OMIM and the Monarch
Initiative MONDO system
 We also pointed out the lack of a uniform
definition for rare diseases, and called for
coordinated efforts to precisely define them
 We surveyed therapeutic modalities
available to translate advances in the
scientific understanding of rare diseases into
therapies, and discussed overarching issues
in drug development for rare diseases.

Tambuyzer E, et al. Nature Rev.Drug Discov. 2020 19:93-111 link 2/4/20 revision

 6077 human proteins are associated
with at least one Rare Disease.
 Sources: Disease Ontology (RD-slim),
eRAM and OrphaNet
 ~50% agreement (gene level)
 Contrast: Tclin at 3% & Tchem at 7%
overall vs. RD subset: 6.94% Tclin and
14.1% Tchem.
 20% of the RD proteome is Tclin &
Tchem. This means hope for cures.
 Potentially significant opportunities for
target & drug repurposing.
2/4/20 revision
Tambuyzer E, et al. Nature Rev. Drug Discov. 2020 19:93-111 link

3/12/18 revision
~35% of the proteins remain
poorly described (Tdark)
~11% of the Proteome (Tclin & Tchem) are currently targeted by
small molecule probes
With help from rare disease patient advocacy groups, rare disease
research is likely to witness a significant increase in translation

IN GOD WE TRUST.
All others bring Data.
Quote attributed to W. Edwards Deming, controversial:
Other attributions: George A. Box and Robert W. Hayden.
Bernhard Fisher, MD has said this to a journalist

https://pharos.nih.gov/targets/KCNJ11
The IDG KMC tracks 11 information
channels for protein-disease
associations, accessible via the
Pharos portal.
Our challenge is to harmonize
disease concepts, and to enable
computational use: e.g., KCNJ11 (Kir6.2) and
ABCC8 (sulfonylurea receptor 1) form the
receptor complex that is the MoA drug target for
glibenclamide (type 2 diabetes).
10/23/20 revision
The challenge for ML & AI: How to prioritize targets? i.e., which protein-
disease associations are clinically actionable?
(involved is not the same as committed)

Sorin Avram et al., Nucl Acids Res, database issue, 2021, doi: 10.1093/nar/gkaa997 10/23/20 revision

10/23/20 revision
http://drugcentral.org/drugcard/1679

9/09/20 revision
G. KC, G. Bocci et al., Nature Machine Intell 2020, submitted link
We used data from the NCATS COVID19
portal to develop a suite of ML models for
six assays related to SARS-CoV-2 activities:
• viral entry (Spike/ACE2 via AlphaLISA;
counterscreens TruHit & ACE2 inhibition)
• viral replication (3CL or Mpro)
• live virus infectivity (CPE & cytotoxicity)
REDIAL-2020 prediction workflow
Input: SMILES
Drug Name
PubChem CID
ML: Fingerprints
Pharmacophores
Phys-chem
based on:
RDKit
scikit-learn
External set predictions
a) CPE, 24 actives;
b) CPE, 14 actives;
c) 3CL, 6 actives.
http://drugcentral.org/Redial

9/09/20 revision
G. KC, G. Bocci et al., Nature Machine Intell 2020, submitted link
http://drugcentral.org/Redial

 IDG KMC2 seeks knowledge gaps
across the five branches of the
“knowledge tree”:
 Genotype; Phenotype; Interactions
& Pathways; Structure & Function;
and Expression, respectively.
 We can use biological systems
network modeling to infer novel
relationships based on available
evidence, and infer new “function”
and “role in disease” data based
on other layers of evidence
 Primary focus on Tdark & Tbio
O. Ursu, T. Oprea et al., IDG2 KMC 2/01/18 revision

O. Ursu et al., manuscript in preparation
Data source        Data type                                      Data points
CCLE               Gene expression                                19,006,134
GTEx               Gene expression                                2,612,227
Protein Atlas      Gene & Protein expression                      949,199
Reactome           Biological pathways                            303,681
KEGG               Biological pathways                            27,683
StringDB           Protein-Protein interactions                   5,080,023
Gene ontology      Biological pathways & Gene function            434,317
InterPro           Protein structure and function                 467,163
ClinVar            Human Gene - Disease/Phenotype associations    881,357
GWAS               Gene - Disease/Phenotype associations          54,360
OMIM               Human Gene - Disease/Phenotype associations    25,557
UniProt Disease    Human Gene - Disease/Phenotype associations    5,365
JensenLab DISEASE  Gene - Disease associations from text mining   44,829
NCBI Homology      Homology mapping of human/mouse/rat genes      70,922
IMPC               Mouse Gene - Phenotype associations            2,153,999
RGD                Rat Gene - Phenotype associations              117,606
LINCS              Drug induced gene signatures                   230,111,315
We developed automated
methods for data collection
(TCRD), visualization (Pharos)
and data aggregation.
These aggregated datasets
were used to build machine
learning models for 20+
diseases and 73 mouse
phenotypes.
Each knowledge graph
contains ~22,000 metapaths
and 284 million path instances.
10/07/18 revision

 a meta-path is a path consisting of
a sequence of relations defined
between different object types
(i.e., structural paths at the meta
level)
 Our metapaths encode type-
specific network topology
between the source node (e.g.,
Protein) and the destination node
(e.g., Disease).
 This approach enables the trans-
formation of assertions/evidence
chains of heterogeneous
biological data types into a ML
ready format.
G. Fu et al., BMC Bioinformatics 2016, 17:160 is an early example for drug-target interactions 10/01/18 revision
Similar assertions or evidence form metapaths (white).
Instances of metapath (paths) are used to determine the strength of the
evidence linking a gene to disease/phenotype/function.
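To make the metapath idea concrete, the sketch below counts path instances between a protein and a disease in a toy heterogeneous graph and groups them by the sequence of node types along each path. All node names, node types and edges are hypothetical placeholders, and the plain depth-first enumeration is an illustrative choice, not the IDG KMC pipeline.

```python
from collections import Counter, defaultdict

# Hypothetical toy graph: node -> node type.
NODE_TYPE = {
    "P1": "Protein", "P2": "Protein", "P3": "Protein",
    "PathwayA": "Pathway", "DiseaseX": "Disease",
}
# Hypothetical undirected edges (assertions / evidence links).
EDGES = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "PathwayA"),
    ("PathwayA", "P2"), ("P2", "DiseaseX"), ("P3", "DiseaseX"),
]


def metapath_counts(source, target, max_hops=3):
    """Count simple paths from source to target, grouped by metapath.

    A metapath is the tuple of node types along a path; the count of its
    instances serves as one feature for the (source, target) pair.
    """
    adjacency = defaultdict(set)
    for a, b in EDGES:
        adjacency[a].add(b)
        adjacency[b].add(a)

    counts = Counter()

    def walk(node, path):
        if node == target and len(path) > 1:
            counts[tuple(NODE_TYPE[n] for n in path)] += 1
            return
        if len(path) > max_hops:  # path already uses max_hops edges
            return
        for neighbour in adjacency[node]:
            if neighbour not in path:  # keep paths simple (no revisits)
                walk(neighbour, path + [neighbour])

    walk(source, [source])
    return counts


features = metapath_counts("P1", "DiseaseX")
for metapath, n_instances in sorted(features.items()):
    print(" -> ".join(metapath), n_instances)
```

Each (protein, disease) pair then becomes one row of a matrix whose columns are metapaths, which is the ML-ready format the slide refers to.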

one protein-disease
association at a time
O. Ursu, T. Oprea et al., IDG2 KMC 2/01/18 revision
Genes associated with a disease/phenotype are positive examples, whereas genes lacking the same
association are negative examples. The Metapath approach transforms assertions/evidence chains into
classification problems that can be solved using suitably designed machine learning algorithms.

All datasets are merged, via R
scripts, into a PostgreSQL database.
A Python version is under development.
Graph embedding transforms
evidence paths into vectors,
converting data into matrices.
Input genes are positive
labels. OMIM genes (not in the input) are
negative labels (we prefer true
negatives where possible).
XGBoost runs 100 models. The
“median model” (AUC, F1) is
then selected for analysis and
prediction to avoid overfitting.
10/15/19 revision
J.J. Yang, P. Kumar, D. Byrd et al., IDG2 KMC
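One way to read the "median model" step is sketched below. The slide does not say how AUC and F1 are combined, so ranking by AUC with F1 as tie-breaker is an assumption, and the `runs` records (with their `seed`, `auc` and `f1` fields) are hypothetical stand-ins for the results of repeated XGBoost training runs.

```python
def select_median_model(runs):
    """Return the run in the middle of the ranking by (AUC, F1).

    Picking the middle run rather than the best one avoids reporting an
    optimistic outlier. With an even number of runs this returns the
    upper of the two middle runs.
    """
    ranked = sorted(runs, key=lambda r: (r["auc"], r["f1"]))
    return ranked[len(ranked) // 2]


# Hypothetical metrics from five training runs with different seeds.
runs = [
    {"seed": 0, "auc": 0.81, "f1": 0.40},
    {"seed": 1, "auc": 0.74, "f1": 0.35},
    {"seed": 2, "auc": 0.78, "f1": 0.38},
    {"seed": 3, "auc": 0.90, "f1": 0.55},
    {"seed": 4, "auc": 0.70, "f1": 0.30},
]
median_run = select_median_model(runs)
print(median_run)
```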

A soccer match at RoboCup, Nagoya 2017
Image searching for “Bad AI”

Build data matrix from “Alzheimer’s disease” in
TCRD subset
 protein knowledge graph along metapaths:
 Protein – Protein Interactions
 Pathways
 GO terms
 Gene expression
 ...
 Training set: 53 genes associated with
Alzheimer’s disease (positives); 3,952 genes
associated with other pathologies from OMIM
were assumed to be negative
 Test set: 23 genes associated with Alzheimer's
(positives) and 200 genes not associated with
Alzheimer's (negatives)  from Text Mining
 “Complete forest” binary classifier using
XGBoost & 5-fold cross-validation.
 Weighted model is better than balanced model
2/14/18 revision
ML work by Oleg Ursu
Balanced model (rows: actual; columns: predicted)
        Pos   Neg
Pos     16    7
Neg     94    106
Weighted model (rows: actual; columns: predicted)
        Pos   Neg
Pos     20    3
Neg     41    159
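The claim that the weighted model outperforms the balanced one can be checked directly from the two confusion matrices. The short calculation below assumes rows are actual classes and columns are predicted classes, which matches the test set of 23 positives and 200 negatives.

```python
def summarize(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # recall on actual positives
        "specificity": tn / (tn + fp),   # recall on actual negatives
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


balanced = summarize(tp=16, fn=7, fp=94, tn=106)
weighted = summarize(tp=20, fn=3, fp=41, tn=159)

for name, metrics in (("balanced", balanced), ("weighted", weighted)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The weighted model is higher on all four metrics, with the largest gain in specificity (0.53 to 0.795), i.e. far fewer false positives.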

 The top most important features are interactions with
proteins mediating inflammatory processes (JAK2/Tclin,
IL10 & IL2 / Tchem), response to oxidative stress
(GSTP1/Tchem), nervous system development (BDNF/Tbio)
and glycolysis (GAPDH/Tchem).
 LINCS drug-induced gene expression perturbations are
the largest category of features for these predictions.
 Brain cortex expression is a necessary requirement.
 One Reactome pathway (AU-rich mRNA elements binding
proteins) is also important.
 Weighted approach showed better performance in the
test set for Alzheimer's Disease, Schizophrenia, and Dilated
Cardiomyopathy.
4/23/18 revision
ML work by Oleg Ursu

 We tested the top 20 genes identified
by PKG/m-p/ML with a high-
throughput validation system by
measuring AD-relevant
hyperphosphorylated (at
S199/S202/T205) tau protein (AT8-Tau
and AT180-Tau) using a Cellomics®
high-content microscope; as well as
gene expression and
immunochemistry analysis via human
AD induced pluripotent stem cells
and human AD brain tissue
8/24/20 revision
AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement

[Figure: validation of the top 20 predicted genes]
Panel: SHSY5Y in vitro siRNA knock-downs measuring ∆pTau (AT8) levels, unbiased Cellomics
Panel: qPCR gene expression, human induced pluripotent stem cells derived into neurons, AD vs Ctrl
x-axis: the 20 predicted genes; y-axis: fold change (2^-∆∆Ct) relative to Ctrl
Legend: AX0018, sAD2.1; significance markers: *, **, ****

[Figure: qPCR gene expression, human brain tissue]
3 different AD patients vs 3 ctrl patients
x-axis: the 20 predicted genes; y-axis: fold change (2^-∆∆Ct) relative to Ctrl
Legend: Control, AD1, AD2, AD3; significance markers: *, **, ***, # = p***, & = p****
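The qPCR panels report fold change as 2^-∆∆Ct relative to control. A minimal sketch of that calculation follows, assuming the common convention in which each target Ct is first normalized to a reference (housekeeping) gene; the slides do not name the reference gene, and the Ct values in the example are made up for illustration.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression by the 2^-ddCt method.

    dCt  = Ct(target) - Ct(reference), computed per condition
    ddCt = dCt(sample) - dCt(control)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    ddct = delta_sample - delta_ctrl
    return 2.0 ** (-ddct)


# Illustrative Ct values only: the target crosses threshold one cycle
# earlier in the AD sample, which corresponds to a two-fold increase.
print(fold_change(24.0, 18.0, 25.0, 18.0))
```

A value of 1.0 means no change relative to control; values above 1 indicate higher expression in the sample.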

Top 20 Genes
predicted by the
XGBoost/Metapath
model, clustered by
functional roles

We proposed to validate ML models for the top 20 genes:
AKNA, BCO2, CCNY, CRTAM, FAM92B, FOXP4, FRRS1, GRIN2C, IL17REL,
LILRA3, LMO4, NDRG2, PIBF1, RAB40A, SCGB3A1, SLC44A2, SPOP,
STARD3, TMEFF2, TXNDC12
The most obvious effects, based on the combined Cellomics & qPCR
of iPSNs & autopsy brains, suggest that AKNA, LILRA3, PIBF1 and
TXNDC12 significantly increased pTau (as tracked by two different
antibodies for T180, S202 and S205)
 PIBF1, LILRA3 and CRTAM show the most significant effect on tau
phosphorylation; two (CRTAM and LILRA3) novel genes are
implicated in innate immune pathways

ML work by Tudor Oprea
Genes 51
Source https://omim.org/entry/125853
AUC 0.72±0.02
1/16/19 revision
First model: 51 OMIM genes
associated with T2D vs. 3,954
OMIM genes associated with
other pathologies. AUC = 0.72 ±
0.08.
VIP-ranked variables include
HFE & HMOX1, which relate to
hemochromatosis (80% leads to
T2D), and IL1B & IL10 (suggests
an immune component).

From: Mark McCarthy <mark.mccarthy@drl.ox.ac.uk>
Sent: Friday, December 7, 2018 11:10 AM
The general summary is that we don’t see any enrichment for T2D associations
in either exome or GWAS data from the predicted gene sets (however we slice
them up).
But having said that, we don’t really see anything in the TRAINING set either: No
association in the exomes, and a weak (just nominal) association in the GWAS
data.
To be honest, I think, now we’ve taken a look at it, we’d all question the
training set: I had missed that this came from OMIM, which is simply not a
reliable source of information in this regard.
1/3/19 revision

ML work by Tudor Oprea
Genes 54
Source Causal T2DM transcripts
AUC 0.79±0.01
1/16/19 revision
• Second model: 54 causal transcripts
provided by Anuba Mahajan & Mark
McCarthy vs. 3,954 OMIM genes.
AUC = 0.79 ± 0.01.
 Genes confirmed by GWAS (9 in
top 24): C2CD4B, C2CD4A,
JAZF1, ADAMTS9, CRY2,
LINGO2, THADA, TMEM18 &
SEC16B. 4 genes have GO terms
for insulin secretion: CPLX1,
ADRA2A, SYT7 & SYTL4
 Top 4 VIP-ranked variables include
2 PPI nodes: SLC30A8 (rs13266634)
and GIPR (rs8108269), which have
GWAS-T2D associations.

Mouse Phenotype MP-ML models relevant for T2D
Specific Mouse Phenotype              MP Number    Input genes   Top-score predicted genes   Evidence supporting predictions (GWAS)
abnormal circulating glucose level    MP_0000188   155           98                          7 (4)
abnormal circulating insulin level    MP_0001560   76            100                         21 (5)
abnormal glucose tolerance            MP_0005291   146           100                         12 (2)
increased circulating glucose level   MP_0005559   78            98                          19 (5)
decreased circulating glucose level   MP_0005560   63            98                          7 (4)
 Human genes predicted from *glucose level MP-MLs: COX4I2, FOXQ1, DCD, APELA, FCRL3,
PALM2, OSTN, NXNL1, TLL1, PYY, MAP3K14, EDIL3, DISC1, EPM2AIP1, PSD3, GFRA2, DDR2,
ST3GAL3, MTURN, USP54, CPT1B, TYW1B, UGT1A5, UGT1A8, UGT1A3, UGT1A9, PPP1R15B,
NUFIP2, TMEM167A, ITGA9, MRPL51, GBA, FOXRED1, DDIAS, BHLHA15, NAGS, RBM20, GKN1,
C1orf43, TPGS2, MTPN, BEND3, CPEB3, ARHGAP40, CYSTM1
 Human genes predicted from insulin level MP-ML: COQ8B, VAX1, SLC47A2, CCSER1, CMYA5,
DNAH17, MTRNR2L12, IL17C, NLRP7, NLRP6, RASGRF2, ANKRD31, LAYN, UGT1A6, AMY2A,
FAM19A2, FAM209B, RBM44, RNASE10, IL17RC, RLN2
3/28/19 revision

 Mackmyra tasked Microsoft and Fourkind to create novel
whisky recipes using AI
 From an input of 75 recipes, “AI” could generate 70 million
combinations.
 Nr 36 on the AI ranked combinations was approved by
humans 
https://www.geekwire.com/2019/microsoft-got-creation-worlds-first-whisky-formulated-ai/ 9/22/19 revision

How long does it take to move from “natural” language processing
to AI-driven large-dataset mining? Klingon, anyone? tlhIngan, vay'?
9/25/19 revision
Tomáš Mikolov (Google) developed an efficient algorithm to compute the
distributed representation of words, Word2vec. It’s currently used for automatic
translation, spam filtering and speech recognition. Word2vec encodes words
using a distribution of weights across 100s of elements that compose the vectors.
Each element contributes to many words.
T. Mikolov et al.,ICLR 2013
10/10/19 revision
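The idea of a distributed representation can be illustrated with cosine similarity between word vectors: related words point in similar directions because they share weight on the same elements. The four-element vectors below are invented for illustration only; real Word2vec embeddings are learned from text and have hundreds of elements.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (not learned from any corpus).
toy_vectors = {
    "insulin": [0.9, 0.1, 0.3, 0.0],
    "glucose": [0.8, 0.2, 0.4, 0.1],
    "whisky":  [0.0, 0.9, 0.1, 0.7],
}

related = cosine(toy_vectors["insulin"], toy_vectors["glucose"])
unrelated = cosine(toy_vectors["insulin"], toy_vectors["whisky"])
print(round(related, 3), round(unrelated, 3))
```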

Alexahealth™: Given today’s health status and my calorie budget,
what food should I shop/prepare today?
Expanding on current models, IDG KMC could use AI/ML to integrate context-specific
computational reasoning tools (“AMI”) with real-time -omics,
biomarker and biomedical literature data.
These could be plugged into hospital / EMR data to improve patient services.
10/10/19 revision

8/24/20 revision
Predictivity between different models for the same disease (even
using the same ML methods) may differ due to input variations
High quality data is really hard to obtain
Weakest components:
‘Ground Truth’ (true negatives) and Domain Expertise
