Tudor I. Oprea
University of New Mexico, Albuquerque NM
10/23/2020
dkNET: Connecting Researchers to Resources
Via Zoom
Funding: NIH U24 CA224370 & NIH U24 TR002278
http://druggablegenome.net/
http://datascience.unm.edu/
75% of protein research is still focused on the 10% of genes known before the human genome was mapped
AM Edwards et al, Nature, 2011
This prompted NIH to start the Illuminating the Druggable Genome Initiative
• Informatics, Data Science and Machine Learning (“AI”) can be used as follows:
  • Diseases: EMR processing, nosology, ontology, & EMR-based ML
  • Targets: drug target selection & validation, phenotype associations, ML
  • Drugs: identifying novel therapeutic modalities using in silico methods
• IDG is developing methods applicable to each of these 3 areas
8/24/20 revision
Diseases image credit: Julie McMurry, Melissa Haendel (OHSU).
All other images credit: Nature Reviews Drug Discovery cover page
2/4/20 revision
R. Santos et al., Nature Rev. Drug Discov. 2017, 16:19-34 link
We curated 667 human genome-derived
proteins and 226 pathogen-derived
biomolecules through which 1,578 US FDA-approved
drugs act.
This set included 1004 orally formulated
drugs as well as 530 injectable drugs
(approved through June 2016).
Data captured in DrugCentral (link)
2/4/20 revision
RFA-RM-16-026 (DRGC)
  GPCRs, U24 DK116195: Bryan Roth, M.D., Ph.D. (UNC); Brian Shoichet, Ph.D. (UCSF)
  Ion Channels, U24 DK116214: Lily Jan, Ph.D. (UCSF); Michael T. McManus, Ph.D. (UCSF)
  Kinases, U24 DK116204: Gary L. Johnson, Ph.D. (UNC)
RFA-RM-16-025 (RDOC), Outreach
  U24 TR002278: Stephan C. Schürer, Ph.D. (UMiami); Tudor Oprea, M.D., Ph.D. (UNM); Larry A. Sklar, Ph.D. (UNM)
RFA-RM-16-024 (KMC), Data
  U24 CA224260: Avi Ma’ayan, Ph.D. (ISMMS)
  U24 CA224370: Tudor Oprea, M.D., Ph.D. (UNM)
RFA-RM-18-011 (CEIT), Tools
  U01 CA239106: N Kannan, PhD & KJ Kochut (UGA)
  U01 CA239108: PN Robinson, MD PhD (JAX), CJ Mungall (LBL), T Oprea (UNM)
  U01 CA239069: G Wu, PhD (OHSU), PG D’Eustachio PhD (NYU), Lincoln D Stein, PhD (OICR)
T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link
• Most protein classification schemes are based on structural and functional criteria.
• For therapeutic development, it is useful to understand how much and what types of data are available for a given protein, thereby highlighting well-studied and understudied targets.
  • Tclin: Proteins annotated as drug targets
  • Tchem: Proteins for which potent small molecules are known
  • Tbio: Proteins for which biology is better understood
  • Tdark: These proteins lack antibodies, publications or Gene RIFs
T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link
2/10/20 revision
2020 Update: Tdark 31.2%; Tbio 57.7%; Tchem 8%; Tclin 3.1%
4/25/19 revision
T. Oprea, Mammalian Genome, 2019, 30:192-200 https://bit.ly/2NUK0BK
Further information
Email: idg.rdoc@gmail.com
Follow: @DruggableGenome
URLs:
https://druggablegenome.net/
https://commonfund.nih.gov/idg/
IDG Knowledge User-Interface
Email: pharos@mail.nih.gov
Follow: @IDG_Pharos
URL: https://pharos.nih.gov/
2/4/20 revision
Mathias SL et al., IDG F2F Poster 2019
• Tclin proteins are associated with drug Mechanism of Action (MoA) (NRDD 2017)
• Tchem proteins have bioactivities in ChEMBL and DrugCentral, plus human curation for some targets
  • Kinases: <= 30 nM
  • GPCRs: <= 100 nM
  • Nuclear Receptors: <= 100 nM
  • Ion Channels: <= 10 μM
  • Non-IDG Family Targets: <= 1 μM
10/19/16 revision
Bioactivities of approved drugs (by Target class)
ChEMBL: database of bioactive chemicals
https://www.ebi.ac.uk/chembl/
DrugCentral: online drug compendium
http://drugcentral.org/
R. Santos et al., Nature Rev. Drug Discov. 2017, 16:19-34 link
• Tbio proteins lack small molecule annotation per the Tchem criteria, and satisfy one of these criteria:
  • protein is above the cutoff criteria for Tdark
  • protein is annotated with a GO Molecular Function or Biological Process leaf term(s) with an Experimental Evidence code
  • protein has confirmed OMIM phenotype(s)
• Tdark (“ignorome”) proteins have little information available, and satisfy these criteria:
  • PubMed text-mining score from Jensen Lab < 5
  • <= 3 Gene RIFs
  • <= 50 antibodies available according to antibodypedia.com
8/20/15 revision
T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link
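The Tclin/Tchem/Tbio/Tdark rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the TCRD implementation: the `ProteinEvidence` record and its field names are hypothetical, and the slide does not say how many of the three Tdark criteria must hold, so that is exposed as the `min_dark_criteria` parameter (the default of 2 is an assumption; pass 3 to require all of them).

```python
from dataclasses import dataclass
from typing import Optional

# Potency cutoffs for Tchem, in nM, as listed on the slide.
TCHEM_CUTOFF_NM = {
    "Kinase": 30.0,
    "GPCR": 100.0,
    "Nuclear Receptor": 100.0,
    "Ion Channel": 10_000.0,  # 10 uM
}
DEFAULT_CUTOFF_NM = 1_000.0  # non-IDG family targets, 1 uM


@dataclass
class ProteinEvidence:
    """Hypothetical evidence record; field names are illustrative only."""
    family: str
    has_moa_drug: bool = False               # drug with a known mechanism of action
    best_potency_nm: Optional[float] = None  # most potent small-molecule activity
    pubmed_score: float = 0.0                # Jensen Lab text-mining score
    gene_rifs: int = 0
    antibodies: int = 0
    has_go_experimental_leaf: bool = False
    has_omim_phenotype: bool = False


def dark_criteria_met(p: ProteinEvidence) -> int:
    """Count how many of the three Tdark criteria the protein satisfies."""
    return sum([p.pubmed_score < 5, p.gene_rifs <= 3, p.antibodies <= 50])


def assign_tdl(p: ProteinEvidence, min_dark_criteria: int = 2) -> str:
    """Assign a target development level, checking Tclin, Tchem, Tbio, Tdark in order."""
    if p.has_moa_drug:
        return "Tclin"
    cutoff = TCHEM_CUTOFF_NM.get(p.family, DEFAULT_CUTOFF_NM)
    if p.best_potency_nm is not None and p.best_potency_nm <= cutoff:
        return "Tchem"
    above_dark_cutoff = dark_criteria_met(p) < min_dark_criteria
    if above_dark_cutoff or p.has_go_experimental_leaf or p.has_omim_phenotype:
        return "Tbio"
    return "Tdark"


print(assign_tdl(ProteinEvidence(family="Kinase", best_potency_nm=25.0)))
```

Checking the levels in this order mirrors the slide: Tbio is defined by the absence of Tchem-level small-molecule data, and Tdark is what remains when none of the Tbio conditions hold.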
Tdark parameters differ from the other TDLs across the 4 external metrics (Kruskal-Wallis with post-hoc pairwise Dunn tests)
2/23/18 revision
T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link
https://rpubs.com/cbologa/TDL7
Tdark:
9199 proteins in 2013
7658 proteins in 2016
6368 proteins in 2020
Tclin:
601 proteins in 2013
592 proteins in 2016
659 proteins in 2020
10/12/20 revision
T. Sheils, S.L. Mathias et al., Nucleic Acids Research 2021 doi:10.1093/nar/gkaa993
2/4/20 revision
Haendel M, et al. Nature Rev. Drug Discov. 2020 19:77-78 link
• We revised the number of RDs from ~7,000 to 10,393 using Disease Ontology, OrphaNet, GARD, NCIT, OMIM and the Monarch Initiative MONDO system
• We also pointed out the lack of a uniform definition for rare diseases, and called for coordinated efforts to precisely define them
• We surveyed therapeutic modalities available to translate advances in the scientific understanding of rare diseases into therapies, and discussed overarching issues in drug development for rare diseases.
Tambuyzer E, et al. Nature Rev. Drug Discov. 2020 19:93-111 link
2/4/20 revision
• 6077 human proteins are associated with at least one Rare Disease.
  • Sources: Disease Ontology (RD-slim), eRAM and OrphaNet
  • ~50% agreement (gene level)
• Contrast: Tclin at 3% & Tchem at 7% overall vs. RD subset: 6.94% Tclin and 14.1% for Tchem.
• 20% of the RD proteome is Tclin & Tchem. This means hope for cures.
• Potentially significant opportunities for target & drug repurposing.
2/4/20 revision
Tambuyzer E, et al. Nature Rev. Drug Discov. 2020 19:93-111 link
3/12/18 revision
~35% of the proteins remain
poorly described (Tdark)
~11% of the Proteome (Tclin & Tchem) are currently targeted by
small molecule probes
With help from rare disease patient advocacy groups, rare disease
research is likely to witness a significant increase in translation
IN GOD WE TRUST.
All others bring Data.
Quote attributed to W. Edwards Deming, though the attribution is disputed.
Other attributions: George A. Box and Robert W. Hayden.
Bernard Fisher, MD, said this to a journalist.
https://pharos.nih.gov/targets/KCNJ11
The IDG KMC tracks 11 information
channels for protein-disease
associations, accessible via the
Pharos portal.
Our challenge is to harmonize disease concepts and to enable computational use. For example, KCNJ11 (Kir6.2) and ABCC8 (sulfonylurea receptor 1) together form the MoA drug target for glibenclamide (type 2 diabetes).
10/23/20 revision
The challenge for ML & AI: How to prioritize targets? I.e., which protein-disease associations are clinically actionable?
(involved is not the same as committed)
Sorin Avram et al., Nucl Acids Res, database issue, 2021, doi: 10.1093/nar/gkaa997 10/23/20 revision
10/23/20 revision
http://drugcentral.org/drugcard/1679
9/09/20 revision
G. KC, G Bocci et al., Nature Machine Intell 2020, submitted link
We used data from the NCATS COVID19
portal to develop a suite of ML models for
six assays related to SARS-CoV-2 activities:
• viral entry (Spike/ACE2 via AlphaLISA;
counterscreens TruHit & ACE2 inhibition)
• viral replication (3CL or Mpro)
• live virus infectivity (CPE & cytotoxicity)
REDIAL-2020 prediction workflow
Input: SMILES
Drug Name
PubChem CID
ML: Fingerprints
Pharmacophores
Phys-chem
based on:
RDKit
scikit-learn
External set predictions
a) CPE, 24 actives;
b) CPE, 14 actives;
c) 3CL, 6 actives.
http://drugcentral.org/Redial
• IDG KMC2 seeks knowledge gaps across the five branches of the “knowledge tree”:
  • Genotype; Phenotype; Interactions & Pathways; Structure & Function; and Expression.
• We can use biological systems network modeling to infer novel relationships based on available evidence, and infer new “function” and “role in disease” data based on other layers of evidence
• Primary focus on Tdark & Tbio
O. Ursu, T Oprea et al., IDG2 KMC
2/01/18 revision
O. Ursu et al., manuscript in preparation
Data source        Data type                                      Data points
CCLE               Gene expression                                 19,006,134
GTEx               Gene expression                                  2,612,227
Protein Atlas      Gene & Protein expression                          949,199
Reactome           Biological pathways                                303,681
KEGG               Biological pathways                                 27,683
StringDB           Protein-Protein interactions                     5,080,023
Gene ontology      Biological pathways & Gene function                434,317
InterPro           Protein structure and function                     467,163
ClinVar            Human Gene - Disease/Phenotype associations        881,357
GWAS               Gene - Disease/Phenotype associations               54,360
OMIM               Human Gene - Disease/Phenotype associations         25,557
UniProt Disease    Human Gene - Disease/Phenotype associations          5,365
JensenLab DISEASE  Gene - Disease associations from text mining        44,829
NCBI Homology      Homology mapping of human/mouse/rat genes           70,922
IMPC               Mouse Gene - Phenotype associations              2,153,999
RGD                Rat Gene - Phenotype associations                  117,606
LINCS              Drug induced gene signatures                   230,111,315
We developed automated
methods for data collection
(TCRD), visualization (Pharos)
and data aggregation.
These aggregated datasets were used to build machine learning models for 20+ diseases and 73 mouse phenotypes.
Each knowledge graph
contains ~22,000 metapaths
and 284 million path instances.
10/07/18 revision
• A meta-path is a path consisting of a sequence of relations defined between different object types (i.e., structural paths at the meta level)
• Our metapaths encode type-specific network topology between the source node (e.g., Protein) and the destination node (e.g., Disease).
• This approach enables the transformation of assertions/evidence chains of heterogeneous biological data types into an ML-ready format.
G. Fu et al., BMC Bioinformatics 2016, 17:160 is an early example for drug-target interactions
10/01/18 revision
Similar assertions or evidence form metapaths (white).
Instances of a metapath (paths) are used to determine the strength of the evidence linking a gene to disease/phenotype/function.
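As an illustration of the metapath idea, the sketch below enumerates simple paths between a source protein and a disease in a toy heterogeneous graph and counts how many path instances fall under each metapath. Everything here is hypothetical: the node names, the edges, and the simplification that a metapath is identified by its sequence of node types (the definition above uses sequences of relations, which coincide in this toy graph because each pair of types has one relation).

```python
from collections import Counter

# Toy heterogeneous graph; all nodes and edges are made up for illustration.
NODE_TYPE = {
    "GENE_A": "Protein",
    "GENE_B": "Protein",
    "GENE_C": "Protein",
    "PW_1": "Pathway",
    "DISEASE_X": "Disease",
}
EDGES = [
    ("GENE_A", "GENE_C"),     # protein-protein interaction
    ("GENE_A", "PW_1"),       # pathway membership
    ("GENE_B", "PW_1"),
    ("GENE_C", "PW_1"),
    ("GENE_B", "DISEASE_X"),  # known gene-disease association
    ("GENE_C", "DISEASE_X"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def enumerate_paths(adj, source, target, max_edges):
    """Return all simple paths from source to target with at most max_edges edges."""
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_edges:
            continue
        for nxt in sorted(adj.get(node, ())):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return paths


def metapath_counts(paths, node_type):
    """Group path instances by metapath (sequence of node types) and count them."""
    return Counter(tuple(node_type[n] for n in p) for p in paths)


adjacency = build_adjacency(EDGES)
paths = enumerate_paths(adjacency, "GENE_A", "DISEASE_X", max_edges=3)
counts = metapath_counts(paths, NODE_TYPE)
for metapath, n in sorted(counts.items()):
    print(" > ".join(metapath), n)
```

The per-metapath counts for one protein-disease pair form one row of a feature matrix; repeating this for every candidate protein gives the ML-ready format described above.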
one protein-disease association at a time
O. Ursu, T Oprea et al., IDG2 KMC
2/01/18 revision
Genes associated with a disease/phenotype are positive examples, whereas genes lacking the same
association are negative examples. The Metapath approach transforms assertions/evidence chains into
classification problems that can be solved using suitably designed machine learning algorithms.
All datasets are merged, via R scripts, into a PostgreSQL database. A Python version is under development.
Graph embedding transforms evidence paths into vectors, converting data into matrices.
Input genes are positive labels. OMIM genes (not in the input) are negative labels (we prefer true negatives where possible).
XGBoost runs 100 models. The “median model” (AUC, F1) is then selected for analysis and prediction to avoid overfitting.
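One possible reading of the “median model” step is sketched below. It assumes each of the 100 runs has already been scored, ranks the runs by AUC with F1 as a tie-breaker, and keeps the run in the middle of that ranking. How AUC and F1 are actually combined is not stated on the slide, so the ranking rule is an assumption, and the scores are random placeholders rather than real results.

```python
import random


def median_model(runs):
    """Return the run in the middle of the (AUC, F1) ranking."""
    ranked = sorted(runs, key=lambda r: (r["auc"], r["f1"]))
    return ranked[(len(ranked) - 1) // 2]


# Placeholder scores standing in for 100 trained models (not real results).
rng = random.Random(0)
runs = [
    {"seed": seed, "auc": rng.uniform(0.65, 0.85), "f1": rng.uniform(0.30, 0.60)}
    for seed in range(100)
]

chosen = median_model(runs)
print(chosen["seed"], round(chosen["auc"], 3), round(chosen["f1"], 3))
```

Picking the middle run rather than the best one avoids reporting a model that scored well only because of a favorable random split.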
10/15/19 revision
J.J. Yang, P. Kumar, D. Byrd et al., IDG2 KMC
A soccer match at RoboCup, Nagoya 2017
Image searching for “Bad AI”
Build data matrix from “Alzheimer’s disease” in TCRD subset
• protein knowledge graph along metapaths:
  • Protein-Protein Interactions
  • Pathways
  • GO terms
  • Gene expression
  • ...
• Training set: 53 genes associated with Alzheimer’s disease (positives); 3,952 genes associated with other pathologies from OMIM were assumed to be negative
• Test set: 23 genes associated with Alzheimer's (positives) and 200 genes not associated with Alzheimer's (negatives)
  • from Text Mining
• “Complete forest” binary classifier using XGBoost & 5-fold cross-validation.
• Weighted model is better than balanced model
2/14/18 revision
ML work by Oleg Ursu
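Balanced model (rows: actual; columns: predicted)
             Pred. Pos  Pred. Neg
Actual Pos          16          7
Actual Neg          94        106
Weighted model (rows: actual; columns: predicted)
             Pred. Pos  Pred. Neg
Actual Pos          20          3
Actual Neg          41        159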
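To make the comparison between the two confusion matrices concrete, the sketch below derives standard metrics from the counts shown above. It assumes rows are actual classes and columns are predicted classes, which is consistent with the 23 positives and 200 negatives in the test set; the function and variable names are illustrative.

```python
def summarize(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


balanced = summarize(tp=16, fn=7, fp=94, tn=106)
weighted = summarize(tp=20, fn=3, fp=41, tn=159)

for name, metrics in (("balanced", balanced), ("weighted", weighted)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Sensitivity rises from 16/23 to 20/23 and specificity from 106/200 to 159/200, so the weighted model improves on both axes at once.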
• The most important features are interactions with proteins mediating inflammatory processes (JAK2/Tclin, IL10 & IL2/Tchem), response to oxidative stress (GSTP1/Tchem), nervous system development (BDNF/Tbio) and glycolysis (GAPDH/Tchem).
• LINCS drug-induced gene expression perturbations are the largest category of features for these predictions.
• Brain cortex expression is a necessary requirement.
• One Reactome pathway (AU-rich mRNA elements binding proteins) is also important.
• The weighted approach showed better performance in the test set for Alzheimer's Disease, Schizophrenia, and Dilated Cardiomyopathy.
4/23/18 revision
ML work by Oleg Ursu
• We tested the top 20 genes identified by PKG/m-p/ML with a high-throughput validation system by measuring AD-relevant hyperphosphorylated (at S199/S202/T205) tau protein (AT8-Tau and AT180-Tau) using a Cellomics® high-content microscope, as well as by gene expression and immunochemistry analysis via human AD induced pluripotent stem cells and human AD brain tissue
8/24/20 revision
AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
2/14/19 revision
AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
SHSY5Y’s in vitro
siRNA knock-downs
measuring ∆pTau
(AT8) levels –
unbiased cellomics
qPCR gene
expression
Human induced
pluripotent stem
cells derived into
neurons –AD vs Ctrl
[Figure: qPCR fold change (2^-∆∆Ct) relative to control for the top 20 predicted genes (AKNA through TXNDC12); series AX0018 and sAD2.1; asterisks mark significance levels]
[Figure: qPCR fold change (2^-∆∆Ct) relative to control for the same 20 genes; series Control, AD1, AD2 and AD3; significance marked with asterisks, # = p*** and & = p****]
2/14/19 revision
AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
qPCR gene expression
Human brain tissue
3 different AD patients
vs 3 ctrl patients
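The qPCR plots report fold change as 2^-∆∆Ct relative to control. A minimal sketch of that calculation follows; it assumes the usual comparative Ct setup with one reference (housekeeping) gene and roughly 100% amplification efficiency, and the Ct values in the example are invented for illustration.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression by the comparative Ct method: 2 ** (-ddCt)."""
    delta_ct_sample = ct_target_sample - ct_ref_sample  # normalize to reference gene
    delta_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    delta_delta_ct = delta_ct_sample - delta_ct_ctrl    # compare sample to control
    return 2 ** (-delta_delta_ct)


# Invented Ct values: the target crosses threshold 2 cycles earlier in the AD sample.
print(fold_change(ct_target_sample=24.0, ct_ref_sample=18.0,
                  ct_target_ctrl=26.0, ct_ref_ctrl=18.0))
```

A value above 1 means higher expression in the sample than in the control; a value of 1 means no change.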
5/22/19 revision
AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
Top 20 Genes
predicted by the
XGBoost/Metapath
model, clustered by
functional roles
8/24/20 revision
AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
We proposed to validate ML models for the top 20 genes:
AKNA, BCO2, CCNY, CRTAM, FAM92B, FOXP4, FRRS1, GRIN2C, IL17REL, LILRA3, LMO4, NDRG2, PIBF1, RAB40A, SCGB3A1, SLC44A2, SPOP, STARD3, TMEFF2, TXNDC12
The most obvious effects, based on the combined Cellomics & qPCR of iPSNs & autopsy brains, suggest that AKNA, LILRA3, PIBF1 and TXNDC12 significantly increased pTau (as tracked by two different antibodies for T180, S202 and S205)
• PIBF1, LILRA3 and CRTAM show the most significant effect on tau phosphorylation; two novel genes (CRTAM and LILRA3) are implicated in innate immune pathways
ML work by Tudor Oprea
Genes 51
Source https://omim.org/entry/125853
AUC 0.72±0.02
1/16/19 revision
First model: 51 OMIM genes
associated with T2D vs. 3,954
OMIM genes associated with
other pathologies. AUC = 0.72 ±
0.08.
VIP-ranked variables include
HFE & HMOX1, which relate to
hemochromatosis (80% leads to
T2D), and IL1B & IL10 (suggests
an immune component).
From: Mark McCarthy <mark.mccarthy@drl.ox.ac.uk>
Sent: Friday, December 7, 2018 11:10 AM
The general summary is that we don’t see any enrichment for T2D associations
in either exome or GWAS data from the predicted gene sets (however we slice
them up).
But having said that, we don’t really see anything in the TRAINING set either: No
association in the exomes, and a weak (just nominal) association in the GWAS
data.
To be honest, I think, now we’ve taken a look at it, we’d all question the
training set: I had missed that this came from OMIM, which is simply not a
reliable source of information in this regard.
1/3/19 revision
ML work by Tudor Oprea
Genes 54
Source Causal T2DM transcripts
AUC 0.79±0.01
1/16/19 revision
• Second model: 54 causal transcripts provided by Anubha Mahajan & Mark McCarthy vs. 3,954 OMIM genes. AUC = 0.79 ± 0.01.
• Genes confirmed by GWAS (9 in top 24): C2CD4B, C2CD4A, JAZF1, ADAMTS9, CRY2, LINGO2, THADA, TMEM18 & SEC16B. 4 genes have GO terms for insulin secretion: CPLX1, ADRA2A, SYT7 & SYTL4
• Top 4 VIP-ranked variables include 2 PPI nodes: SLC30A8 (rs13266634) and GIPR (rs8108269), which have GWAS-T2D associations.
Mouse Phenotype MP-ML models relevant for T2D
Specific Mouse Phenotype             MP Number   Input Genes  Top score predicted genes  Evidence supporting predictions (GWAS)
abnormal circulating glucose level   MP_0000188  155          98                         7 (4)
abnormal circulating insulin level   MP_0001560  76           100                        21 (5)
abnormal glucose tolerance           MP_0005291  146          100                        12 (2)
increased circulating glucose level  MP_0005559  78           98                         19 (5)
decreased circulating glucose level  MP_0005560  63           98                         7 (4)
• Human genes predicted from *glucose level MP-MLs: COX4I2, FOXQ1, DCD, APELA, FCRL3, PALM2, OSTN, NXNL1, TLL1, PYY, MAP3K14, EDIL3, DISC1, EPM2AIP1, PSD3, GFRA2, DDR2, ST3GAL3, MTURN, USP54, CPT1B, TYW1B, UGT1A5, UGT1A8, UGT1A3, UGT1A9, PPP1R15B, NUFIP2, TMEM167A, ITGA9, MRPL51, GBA, FOXRED1, DDIAS, BHLHA15, NAGS, RBM20, GKN1, C1orf43, TPGS2, MTPN, BEND3, CPEB3, ARHGAP40, CYSTM1
• Human genes predicted from insulin level MP-ML: COQ8B, VAX1, SLC47A2, CCSER1, CMYA5, DNAH17, MTRNR2L12, IL17C, NLRP7, NLRP6, RASGRF2, ANKRD31, LAYN, UGT1A6, AMY2A, FAM19A2, FAM209B, RBM44, RNASE10, IL17RC, RLN2
3/28/19 revision
• Mackmyra tasked Microsoft and Fourkind to create novel whisky recipes using AI
• From an input of 75 recipes, “AI” could generate 70 million combinations.
• Nr 36 on the AI-ranked combinations was approved by humans
https://www.geekwire.com/2019/microsoft-got-creation-worlds-first-whisky-formulated-ai/
9/22/19 revision
How long does it take to move from “natural” language processing
to AI-driven large-dataset mining? Klingon, anyone? tlhIngan, vay'?
9/25/19 revision
Tomáš Mikolov (Google) developed Word2Vec, an efficient algorithm to compute distributed representations of words. It is currently used for automatic translation, spam filtering and speech recognition. Word2Vec encodes each word as a vector of weights distributed across hundreds of elements; each element contributes to many words.
T. Mikolov et al., ICLR 2013
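To show what a distributed representation buys in practice, the sketch below compares word vectors by cosine similarity. The vectors are hand-made 4-element toys chosen for illustration; real Word2Vec vectors are learned from text and have hundreds of elements.

```python
import math

# Hypothetical toy embeddings; real vectors are learned, not hand-written.
VECTORS = {
    "kinase": [0.9, 0.1, 0.3, 0.0],
    "phosphatase": [0.8, 0.2, 0.4, 0.1],
    "whisky": [0.0, 0.9, 0.1, 0.7],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
    )


print(most_similar("kinase", VECTORS))
```

Because every element contributes to many words, related terms end up pointing in similar directions, and similarity reduces to a simple geometric comparison.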
10/10/19 revision
Alexahealth™: Given today’s health status and my calorie budget,
what food should I shop/prepare today?
Expanding on current models, IDG KMC could use AI/ML to integrate context-specific computational reasoning tools (“AMI”) with real-time -omics, biomarker and biomedical literature data.
These could be plugged into hospital / EMR data to improve patient services.
10/10/19 revision
8/24/20 revision
Predictivity between different models for the same disease (even
using the same ML methods) may differ due to input variations
High quality data is really hard to obtain
Weakest components:
‘Ground Truth’ (true negatives) and Domain Expertise

dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020

  • 1.
    Tudor I. Oprea Universityof New Mexico, Albuquerque NM 10/23/2020 dkNET: Connecting Researchers to Resources Via Zoom Funding: NIH U24 CA224370 & NIH U24 TR002278 http://druggablegenome.net/ http://datascience.unm.edu/
  • 2.
    75% of proteinresearch still focused on 10% genes known before human genome was mapped AM Edwards et al, Nature, 2011 This prompted NIH to start the Illuminating the Druggable Genome Initiative
  • 3.
     Informatics, DataScience and Machine Learning (“AI”) can be used as follows:  Diseases: EMR processing, nosology, ontology, & EMR-based ML  Targets: drug target selection & validation, phenotype associations, ML  Drugs: Identifying novel therapeutic modalities using in silico methods  IDG is developing methods applicable to each of these 3 areas 8/24/20 revision Diseases image credit: Julie McMurry, Melissa Haendel (OHSU). All other images credit: Nature Reviews Drug Discovery cover page
  • 4.
    2/4/20 revisionR. Santoset al., Nature Rev.Drug Discov. 2017, 16:19-34 link We curated 667 human genome-derived proteins and 226 pathogen-derived biomolecules through which 1,578 US FDA- approved drugs act. This set included 1004 orally formulated drugs as well as 530 injectable drugs (approved through June 2016). Data captured in DrugCentral (link)
  • 5.
    2/4/20 revision RFA-RM-16-026 (DRGC) GPCRs U24 DK116195: BryanRoth, M.D., Ph.D. (UNC) Brian Shoichet, Ph.D. (UCSF) Ion Channels U24 DK116214: Lily Jan, Ph.D. (UCSF) Michael T. McManus, Ph.D. (UCSF) Kinases U24 DK116204: Gary L. Johnson, Ph.D. (UNC) RFA-RM-16-025 (RDOC) Outreach U24 TR002278: Stephan C. Schürer, Ph.D. (UMiami) Tudor Oprea, M.D., Ph.D. (UNM) Larry A. Sklar, Ph.D. (UNM) RFA-RM-16-024 (KMC) Data U24 CA224260: Avi Ma’ayan, Ph.D. (ISMMS) U24 CA224370: Tudor Oprea, M.D., Ph.D. (UNM) RFA-RM-18-011 (CEIT) Tools U01 CA239106: N Kannan, PhD & KJ Kochut (UGA) U01 CA239108: PN Robinson, MD PhD (JAX), CJ Mungall (LBL), T Oprea (UNM) U01 CA239069: G Wu, PhD (OHSU), PG D’Eustachio PhD (NYU), Lincoln D Stein, PhD (OICR) T. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link
  • 6.
     Most proteinclassification schemes are based on structural and functional criteria.  For therapeutic development, it is useful to understand how much and what types of data are available for a given protein, thereby highlighting well-studied and understudied targets.  Tclin: Proteins annotated as drug targets  Tchem: Proteins for which potent small molecules are known  Tbio: Proteins for which biology is better understood  Tdark: These proteins lack antibodies, publications or Gene RIFs T. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link 2/10/20 revision 2020 Update: Tdark 31.2%;Tbio 57.7%;Tchem 8%;Tclin 3.1%
  • 7.
    4/25/19 revisionT. Oprea,Mammalian Genome, 2019, 30:192-200 https://bit.ly/2NUK0BK Further information Email: idg.rdoc@gmail.com Follow: @DruggableGenome URLs: https://druggablegenome.net/ https://commonfund.nih.gov/idg/ IDG Knowledge User-Interface Email: pharos@mail.nih.gov Follow: @IDG_ Pharos URL: https://pharos.nih.gov/
  • 8.
    2/4/20 revisionMathias SLet al., IDG F2F Poster 2019
  • 9.
     Tclin proteinsare associated with drug Mechanism of Action (MoA) – NRDD 2017  Tchem proteins have bioactivitis in ChEMBL and DrugCentral, + human curation for some targets  Kinases: <= 30nM  GPCRs: <= 100nM  Nuclear Receptors: <= 100nM  Ion Channels: <= 10μM  Non-IDG Family Targets: <= 1μM 10/19/16 revision Bioactivities of approved drugs (by Target class) ChEMBL: database of bioactive chemicals https://www.ebi.ac.uk/chembl/ DrugCentral: online drug compendium http://drugcentral.org/ R. Santos et al., Nature Rev.Drug Discov. 2017, 16:19-34 link
  • 10.
     Tbio proteinslack small molecule annotation cf.Tchem criteria, and satisfy one of these criteria:  protein is above the cutoff criteria for Tdark  protein is annotated with a GO Molecular Function or Biological Process leaf term(s) with an Experimental Evidence code  protein has confirmed OMIM phenotype(s)  Tdark (“ignorome”) have little information available, and satisfy these criteria:  PubMed text-mining score from Jensen Lab < 5  <= 3 Gene RIFs  <= 50 Antibodies available according to antibodypedia.com 8/20/15 revisionT. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link
  • 11.
    Tdark parameters differfrom the other TDLs across the 4 external metrics cf.Kruskal-Wallis post-hoc pairwise Dunn tests 2/23/18 revisionT. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link
  • 12.
    https://rpubs.com/ cbologa/TDL7 Tdark: 9199 proteins in2013 7658 proteins in 2016 6368 proteins in 2020 Tclin: 601 proteins in 2013 592 proteins in 2016 659 proteins in 2020 10/12/20 revisionT. Sheils, S.L. Mathias et al., Nucleic Acids Research 2021 doi:10.1093/nar/gkaa993
  • 13.
    T. Sheils, S.L.Mathias et al., Nucleic Acids Research 2021 doi:10.1093/nar/gkaa993 10/12/20 revision
  • 14.
    2/4/20 revisionHaendel M,et al. Nature Rev.Drug Discov. 2020 19:77-78 link  We revised the number of RDs from ~7,000 to 10,393 using Disease Ontology, OrphaNet, GARD, NCIT, OMIM and the Monarch Initiative MONDO system  We also pointed out the lack of a uniform definition for rare diseases, and called for coordinated efforts to precisely define them  We surveyed therapeutic modalities available to translate advances in the scientific understanding of rare diseases into therapies, and discussed overarching issues in drug development for rare diseases.
  • 15.
    Tambuyzer E, etal. Nature Rev.Drug Discov. 2020 19:93-111 link 2/4/20 revision
  • 16.
     6077 humanproteins are associated with at least one Rare Disease.  Sources: Disease Ontology (RD-slim), eRAM and OrphaNet  ~50% agreement (gene level)  Contrast:Tclin at 3% & Tchem at 7% overall vs. RD subset: 6.94% Tclin and 14.1% for Tchem.  20% of the RD proteome is Tclin & Tchem. This means hope for cures.  Potentially significant opportunities for target & drug repurposing. 2/4/20 revisionTambuyzer E, et al. Nature Rev.Drug Discov. 2020 19:93-111 link
  • 17.
    3/12/18 revision ~35% ofthe proteins remain poorly described (Tdark) ~11% of the Proteome (Tclin & Tchem) are currently targeted by small molecule probes With help from rare disease patient advocacy groups, rare disease research is likely to witness a significant increase in translation
  • 18.
    IN GOD WETRUST. All others bring Data. Quote attributed to W. Edwards Deming, controversial: Other attributions: George A. Box and Robert W. Hayden. Bernhard Fisher, MD has said this to a journalist
  • 19.
    https://pharos.nih.gov/targets/KCNJ11 The IDG KMCtracks 11 information channels for protein-disease associations, accessible via the Pharos portal. Our challenge is to harmonize disease concepts, and to enable computational use: e.g., KCNJ11 with ABCC8 form the Sulfonylurea 1 Kir6.2 receptor, MoA drug target for glibenclamide (type 2 diabetes). 10/23/20 revision The challenge for ML & AI: How to prioritize targets? i.e., which protein- disease associations are clinically actionable? (involved is not the same as committed)
  • 20.
    Sorin Avram etal., Nucl Acids Res, database issue, 2021, doi: 10.1093/nar/gkaa997 10/23/20 revision
  • 21.
  • 22.
    9/09/20 revisionG. KC,G Bocci et al., Nature Machine Intell 2020, submitted link We used data from the NCATS COVID19 portal to develop a suite of ML models for six assays related to SARS-CoV-2 activities: • viral entry (Spike/ACE2 via AlphaLISA; counterscrens TruHit & ACE2 inhibition) • viral replication (3CL or Mpro) • live virus infectivity (CPE & cytotoxicity) REDIAL-2020 prediction workflow Input: SMILES Drug Name PubChem CID ML: Fingerprints Pharmacophores Phys-chem based on: RDKit scikit-learn External set predictions a) CPE, 24 actives; b) CPE, 14 actives; c) 3CL, 6 actives. http://drugcentral.org/Redial
  • 23.
    9/09/20 revisionG. KC,G Bocci et al., Nature Machine Intell 2020, submitted link http://drugcentral.org/Redial
  • 24.
     IDG KMC2seeks knowledge gaps across the five branches of the “knowledge tree”:  Genotype; Phenotype; Interactions & Pathways; Structure & Function; and Expression, respectively.  We can use biological systems network modeling to infer novel relationships based on available evidence, and infer new “function” and “role in disease” data based on other layers of evidence  Primary focus on Tdark & Tbio O. Ursu,T Oprea et al., IDG2 KMC 2/01/18 revision
  • 25.
    O. Ursu etal., manuscript in preparation Data source Data type Data points CCLE Gene expression 19,006,134 GTEx Gene expression 2,612,227 Protein Atlas Gene & Protein expression 949,199 Reactome Biological pathways 303,681 KEGG Biological pathways 27,683 StringDB Protein-Protein interactions 5,080,023 Gene ontology Biological pathways & Gene function 434,317 InterPro Protein structure and function 467,163 ClinVar Human Gene - Disease/Phenotype associations 881,357 GWAS Gene - Disease/Phenotype associations 54,360 OMIM Human Gene - Disease/Phenotype associations 25,557 UniProt Disease Human Gene - Disease/Phenotype associations 5,365 JensenLab DISEASE Gene - Disease associations from text mining 44,829 NCBI Homology Homology mapping of human/mouse/rat genes 70,922 IMPC Mouse Gene - Phenotype associations 2,153,999 RGD Rat Gene - Phenotype associations 117,606 LINCS Drug induced gene signatures 230,111,315 We developed automated methods for data collection (TCRD), visualization (Pharos) and data aggregation. These aggregated datasets were used to build machine learning models for 20+ disease and 73 mouse phenotype. Each knowledge graph contains ~22,000 metapaths and 284 million path instances. 10/07/18 revision
  • 26.
     a meta-pathis a path consisting of a sequence of relations defined between different object types (i.e., structural paths at the meta level)  Our metapaths encode type- specific network topology between the source node (e.g., Protein) and the destination node (e.g., Disease).  This approach enables the trans- formation of assertions/evidence chains of heterogeneous biological data types into a ML ready format. G. Fu et al., BMC Bioinformatics 2016, 17:160 is an early example for drug-target interactions 10/01/18 revision Similar assertions or evidence form metapaths (white). Instances of metapath (paths) are used to determine the strength of the evidence linking a gene to disease/phenotype/function.
  • 27.
    one protein-disease association atthe time O. Ursu,T Oprea et al., IDG2 KMC 2/01/18 revision Genes associated with a disease/phenotype are positive examples, whereas genes lacking the same association are negative examples. The Metapath approach transforms assertions/evidence chains into classification problems that can be solved using suitably designed machine learning algorithms.
  • 28.
    All datasets aremerged, via R scripts, into a PostgreSQL. Python under development. Graph embedding transforms evidence paths into vectors, converting data into matrices. Input genes are positive labels. OMIM (not input) are negative labels (we prefer true negatives where possible). XGBoost runs 100 models.The “median model” (AUC, F1) is then selected for analysis and prediction to avoid overfitting. 10/15/19 revisionJ.J.Yang, P. Kumar, D. Byrd et al., IDG2 KMC
  • 29.
    A soccer matchat RoboCup, Nagoya 2017 Image searching for “Bad AI”
  • 30.
    Build data matrixfrom “Alzheimer’s disease” in TCRD subset  protein knowledge graph along metapaths:  Protein – Protein Interactions  Pathways  GO terms  Gene expression  ...  Training set: 53 genes associated with Alzheimer’s disease (positives); 3,952 genes associated with other pathologies from OMIM were assumed to be negative  Test set: 23 genes associated with Alzheimer's (positives) and 200 genes not associated with Alzheimer's (negatives)  from Text Mining  “Complete forest” binary classifier using XGBoost & 5-fold cross-validation.  Weighted model is better than balanced model 2/14/18 revisionML work by Oleg Ursu Bal. Predicted Actual Pos Neg Pos 16 7 Neg 94 106 Wtd Predicted Actual Pos Neg Pos 20 3 Neg 41 159
  • 31.
     The topmost important features are interactions with proteins mediating inflammatory processes (JAK2/Tclin, IL10 & IL2 / Tchem), response to oxidative stress (GSTP1/Tchem), nervous system development (BDNF/Tbio) and glycolysis (GAPDH/Tchem).  LINCS drug-induced gene expression perturbations are the largest category of features for these predictions.  Brain cortex expression is a necessary requirement.  One Reactome pathway (AU-rich mRNA elements binding proteins) is also important.  Weighted approached showed better performance in the test set for Alzheimer's Disease, Schizophrenia, and Dilated Cardiomyopathy. 4/23/18 revisionML work by Oleg Ursu
  • 32.
    - We tested the top 20 genes identified by PKG/m-p/ML with a high-throughput validation system, measuring AD-relevant hyperphosphorylated (at S199/S202/T205) tau protein (AT8-Tau and AT180-Tau) using a Cellomics® high-content microscope, as well as by gene expression and immunochemistry analysis in human AD induced pluripotent stem cells and human AD brain tissue.
    AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement. 8/24/20 revision
  • 33.
    AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement. 2/14/19 revision
    Panel: SH-SY5Y in vitro siRNA knock-downs measuring ∆pTau (AT8) levels (unbiased Cellomics).
    Panel: qPCR gene expression in human induced pluripotent stem cells derived into neurons, AD vs Ctrl.
    [Figure: bar chart of fold change (2^-∆∆Ct) relative to Ctrl, series AX0018 and sAD2.1, for the genes AKNA, BCO2, CCNY, CRTAM, FAM92B, FOXP4, FRRS1, GRIN2C, IL17REL, LILRA3, LMO4, NDRG2, PIBF1, RAB40A, SCGB3A1, SLC44A2, SPOP, STARD3, TMEFF2, TXNDC12.]
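The fold-change axis in this figure, 2^-∆∆Ct, follows the standard comparative Ct calculation for qPCR. The sketch below shows the arithmetic. The Ct values are invented for illustration, and normalization to a single reference gene is an assumption, since the slide does not state which reference was used.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression by the comparative Ct method (2^-ddCt)."""
    delta_sample = ct_target_sample - ct_ref_sample  # dCt in the AD sample
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl        # dCt in the control
    ddct = delta_sample - delta_ctrl
    return 2 ** -ddct

# Hypothetical Ct values: the target amplifies two cycles earlier in the sample.
print(fold_change(24.0, 18.0, 26.0, 18.0))
```

A value above 1 means higher expression than in the control, and a value of exactly 1 means no change.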
  • 34.
  • 35.
    AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement. 5/22/19 revision
    Top 20 genes predicted by the XGBoost/Metapath model, clustered by functional roles
  • 36.
    AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement. 8/24/20 revision
    We proposed to validate the ML models for the top 20 genes: AKNA, BCO2, CCNY, CRTAM, FAM92B, FOXP4, FRRS1, GRIN2C, IL17REL, LILRA3, LMO4, NDRG2, PIBF1, RAB40A, SCGB3A1, SLC44A2, SPOP, STARD3, TMEFF2, TXNDC12
    The most obvious effects, based on the combined Cellomics & qPCR of iPSNs & autopsy brains, suggest that AKNA, LILRA3, PIBF1 and TXNDC12 significantly increased pTau (as tracked by two different antibodies for T180, S202 and S205)
    - PIBF1, LILRA3 and CRTAM show the most significant effect on tau phosphorylation; two novel genes (CRTAM and LILRA3) are implicated in innate immune pathways
  • 37.
    ML work by Tudor Oprea. 1/16/19 revision
    Genes: 51. Source: https://omim.org/entry/125853. AUC: 0.72±0.02
    First model: 51 OMIM genes associated with T2D vs. 3,954 OMIM genes associated with other pathologies. AUC = 0.72 ± 0.08. VIP-ranked variables include HFE & HMOX1, which relate to hemochromatosis (80% leads to T2D), and IL1B & IL10 (suggests an immune component).
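For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive gene is scored above a randomly chosen negative gene. The sketch below computes it by direct pairwise comparison and summarizes repeated runs as mean ± standard deviation. All scores and per-run values are invented for illustration and are not the T2D model outputs.

```python
import statistics

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores for two positives and two negatives.
print(auc([0.9, 0.4], [0.5, 0.1]))

# Hypothetical AUCs from repeated runs, summarized as mean and spread.
runs = [0.70, 0.72, 0.74]
print(f"{statistics.mean(runs):.2f} ± {statistics.stdev(runs):.2f}")
```

The pairwise loop is quadratic in the number of genes, which is fine for a sketch but slow for thousands of negatives. A rank-based formulation gives the same value faster.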
  • 38.
    From: Mark McCarthy <mark.mccarthy@drl.ox.ac.uk>
    Sent: Friday, December 7, 2018 11:10 AM
    The general summary is that we don’t see any enrichment for T2D associations in either exome or GWAS data from the predicted gene sets (however we slice them up). But having said that, we don’t really see anything in the TRAINING set either: no association in the exomes, and a weak (just nominal) association in the GWAS data. To be honest, I think, now we’ve taken a look at it, we’d all question the training set: I had missed that this came from OMIM, which is simply not a reliable source of information in this regard.
    1/3/19 revision
  • 39.
    ML work by Tudor Oprea. 1/16/19 revision
    Genes: 54. Source: Causal T2DM transcripts. AUC: 0.79±0.01
    - Second model: 54 causal transcripts provided by Anubha Mahajan & Mark McCarthy vs. 3,954 OMIM genes. AUC = 0.79 ± 0.01.
    - Genes confirmed by GWAS (9 in top 24): C2CD4B, C2CD4A, JAZF1, ADAMTS9, CRY2, LINGO2, THADA, TMEM18 & SEC16B. 4 genes have GO terms for insulin secretion: CPLX1, ADRA2A, SYT7 & SYTL4
    - Top 4 VIP-ranked variables include 2 PPI nodes: SLC30A8 (rs13266634) and GIPR (rs8108269), which have GWAS-T2D associations.
  • 40.
    Mouse Phenotype MP-ML models relevant for T2D

    Specific Mouse Phenotype              MP Number    Input Genes   Top score predicted genes   Evidence supporting predictions (GWAS)
    abnormal circulating glucose level    MP_0000188   155           98                          7 (4)
    abnormal circulating insulin level    MP_0001560   76            100                         21 (5)
    abnormal glucose tolerance            MP_0005291   146           100                         12 (2)
    increased circulating glucose level   MP_0005559   78            98                          19 (5)
    decreased circulating glucose level   MP_0005560   63            98                          7 (4)

    - Human genes predicted from *glucose level MP-MLs: COX4I2, FOXQ1, DCD, APELA, FCRL3, PALM2, OSTN, NXNL1, TLL1, PYY, MAP3K14, EDIL3, DISC1, EPM2AIP1, PSD3, GFRA2, DDR2, ST3GAL3, MTURN, USP54, CPT1B, TYW1B, UGT1A5, UGT1A8, UGT1A3, UGT1A9, PPP1R15B, NUFIP2, TMEM167A, ITGA9, MRPL51, GBA, FOXRED1, DDIAS, BHLHA15, NAGS, RBM20, GKN1, C1orf43, TPGS2, MTPN, BEND3, CPEB3, ARHGAP40, CYSTM1
    - Human genes predicted from insulin level MP-ML: COQ8B, VAX1, SLC47A2, CCSER1, CMYA5, DNAH17, MTRNR2L12, IL17C, NLRP7, NLRP6, RASGRF2, ANKRD31, LAYN, UGT1A6, AMY2A, FAM19A2, FAM209B, RBM44, RNASE10, IL17RC, RLN2
    3/28/19 revision
  • 41.
    - Mackmyra tasked Microsoft and Fourkind to create novel whisky recipes using AI
    - From an input of 75 recipes, “AI” could generate 70 million combinations.
    - Nr 36 on the AI-ranked combinations was approved by humans
    - https://www.geekwire.com/2019/microsoft-got-creation-worlds-first-whisky-formulated-ai/
    9/22/19 revision
  • 42.
    How long does it take to move from “natural” language processing to AI-driven large-dataset mining? Klingon, anyone? tlhIngan, vay'?
    9/25/19 revision
    Tomáš Mikolov (Google) developed Word2Vec, an efficient algorithm to compute the distributed representation of words. It’s currently used for automatic translation, spam filtering and speech recognition. Word2Vec encodes words using a distribution of weights across 100s of elements that compose the vectors. Each element contributes to many words.
    T. Mikolov et al., ICLR 2013. 10/10/19 revision
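A toy sketch of what a distributed representation buys you: once words are vectors, relatedness becomes a geometric comparison. The three-element vectors below are invented for illustration (real Word2Vec vectors have hundreds of elements and are learned from text), and cosine similarity is used as a common way to compare such vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; every element contributes to every word.
VECTORS = {
    "insulin": [0.9, 0.1, 0.3],
    "glucose": [0.8, 0.2, 0.4],
    "whisky": [0.1, 0.9, 0.2],
}

print(round(cosine(VECTORS["insulin"], VECTORS["glucose"]), 3))
print(round(cosine(VECTORS["insulin"], VECTORS["whisky"]), 3))
```

In this toy setup "insulin" sits much closer to "glucose" than to "whisky", which is the kind of relationship a trained embedding is meant to capture.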
  • 43.
    Alexahealth™: Given today’s health status and my calorie budget, what food should I shop for or prepare today?
    Expanding on current models, IDG KMC could use AI/ML to integrate context-specific computational reasoning tools (“AMI”) with real-time -omics, biomarker and biomedical literature data. These could be plugged into hospital / EMR data to improve patient services.
    10/10/19 revision
  • 44.
    8/24/20 revision
    - Predictivity between different models for the same disease (even using the same ML methods) may differ due to input variations
    - High-quality data is really hard to obtain
    - Weakest components: ‘Ground Truth’ (true negatives) and Domain Expertise