▪ (Long-term) goals
1. Develop a workflow for building prediction models for the
bioactivity of small molecules against a given target, using the
public-domain data available in PubChem.
• Structure-based (molecular docking)
• Ligand-based (2-D/3-D similarity)
• Machine-learning based
2. Automate the workflow to develop models for all targets for
which sufficient bioactivity data are publicly available.
▪ Computational prediction models for the RXRA activity of small
molecules were developed using six machine learning methods
and public bioactivity data in PubChem.
▪ The RXRA activity data from the Tox21 project were used to build
and test the models, and the ChEMBL and NCGC data sets were
used as external test sets to further evaluate the general
applicability of the models.
▪ The best model, developed using the random forest algorithm and
the PubChem fingerprint, gave an AUC score of 0.77 and a BACC of
0.70.
▪ When the models were tested against the two external data sets,
their performance against the ChEMBL set was found to be very
different from that against the Tox21 test set and the NCGC set.
▪ This indicates that the compound coverage of the Tox21 project is
somewhat different from what scientists in the medicinal
chemistry and natural product areas have been studying.
▪ This study showcases how public data contained in PubChem can
be used to develop prediction models for small-molecule
bioactivity, using open-source software and PubChem’s public
services.
Development of machine learning-based prediction models for chemical modulators of
the retinoid X receptor (RXR) signaling pathway using public-domain bioactivity data
Sunghwan Kim, Ph.D., M.Sc. (sunghwan.kim@nih.gov)
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
▪ Retinoid X receptor α (RXRA)
▪ Nuclear receptor activated by 9-cis-retinoic acid.
▪ Forms heterodimers with other nuclear receptors, including
RARs, PPARs, T3R/TR-B, VDR, LXRs, PXR, FXR, CAR/NR1I3, and NR4A2.
▪ Forms homodimers and homotetramers.
▪ Considered an important regulator of a wide range of
biological pathways regulated by its dimerization partners.
▪ Has potential roles in metabolic signaling pathways, skin alopecia
(spot baldness), dermal cysts, cardiac development, insulin
sensitization, etc.
▪ Molecular descriptors
▪ Nine molecular fingerprints were generated using PaDEL.
[Yap CW, J. Comput. Chem., 2011, 32(7):1466-1474.]
Abbr.   Descriptor                     Length (bits)
AP      Atom pairs 2D fingerprint      780
ESTAT   E-state fingerprint            79
EXTFP   CDK extended fingerprint       1,024
FP      CDK fingerprint                1,024
GOFP    CDK graph-only fingerprint     1,024
KR      Klekota-Roth fingerprint       4,860
MACCS   MACCS fingerprint              166
PUB     PubChem fingerprint            881
SUB     Substructure fingerprint       307
Abbr.   Algorithm                Hyperparameters optimized
NB      Naïve Bayes              α (10^-10 ~ 1)
DT      Decision tree            max_depth_range (3 ~ 7);
                                 min_samples_split_range (3 ~ 7);
                                 min_samples_leaf_range (2 ~ 6)
KNN     k-Nearest neighbors      weights (uniform, minkowski, jaccard);
                                 n_neighbors (1 ~ 25)
RF      Random forest            n_estimators (10 ~ 200)
SVM     Support vector machine   C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
NN      Neural network           solver (lbfgs, adam); α (10^-7 ~ 10^7)
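The search space in the table can be written down as plain Python data, which makes its size explicit. This is a minimal sketch and not the study's code: the poster gives only the ranges, so the step sizes (integer steps, powers of 2 or 10, and steps of 10 for n_estimators) are assumptions, and `PARAM_GRIDS` and `iter_candidates` are hypothetical names. Option names are copied from the table as listed.

```python
from itertools import product

# Ranges are taken from the table; the sampling within each range
# (integer steps, log-spaced powers) is an assumption for illustration.
PARAM_GRIDS = {
    "NB": {"alpha": [10.0 ** e for e in range(-10, 1)]},        # 10^-10 .. 1
    "DT": {
        "max_depth_range": list(range(3, 8)),                   # 3 .. 7
        "min_samples_split_range": list(range(3, 8)),           # 3 .. 7
        "min_samples_leaf_range": list(range(2, 7)),            # 2 .. 6
    },
    "KNN": {
        "weights": ["uniform", "minkowski", "jaccard"],         # as listed
        "n_neighbors": list(range(1, 26)),                      # 1 .. 25
    },
    "RF": {"n_estimators": list(range(10, 201, 10))},           # 10 .. 200
    "SVM": {
        "C": [2.0 ** e for e in range(-10, 11)],                # 2^-10 .. 2^10
        "gamma": [2.0 ** e for e in range(-10, 11)],            # 2^-10 .. 2^10
    },
    "NN": {
        "solver": ["lbfgs", "adam"],
        "alpha": [10.0 ** e for e in range(-7, 8)],             # 10^-7 .. 10^7
    },
}


def iter_candidates(grid):
    """Yield every hyperparameter combination in a grid as a dict."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Number of candidate settings per algorithm under these assumed steps.
for abbr, grid in PARAM_GRIDS.items():
    print(abbr, sum(1 for _ in iter_candidates(grid)))
```

Each candidate would then be scored by cross-validated AUC and the best-scoring setting kept; with finer or coarser steps the candidate counts change accordingly.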
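▪ Metrics for model performance evaluation
▪ Area under the receiver operating characteristic curve (AUC)
→ used for hyperparameter optimization.
▪ Balanced accuracy (BACC)
$\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$
▪ Sensitivity: $\mathrm{SENS} = \frac{TP}{TP+FN}$
▪ Specificity: $\mathrm{SPEC} = \frac{TN}{TN+FP}$
Here TP, TN, FP, and FN are the numbers of true positives, true
negatives, false positives, and false negatives, respectively.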
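The metric definitions can be checked numerically with a short sketch. The function names are illustrative and do not come from the poster. The AUC routine uses the pairwise-comparison formulation (the probability that a randomly chosen active is scored above a randomly chosen inactive), which is one standard way to compute it and not necessarily what the study's software did.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, and balanced accuracy from counts."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return {"SENS": sens, "SPEC": spec, "BACC": 0.5 * (sens + spec)}


def auc_from_scores(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for active, 0 for inactive. Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A classifier that calls every Tox21 test compound inactive
# (53 actives, 494 inactives): all actives become false negatives.
trivial = confusion_metrics(tp=0, fn=53, tn=494, fp=0)
print(trivial["BACC"])  # 0.5
```

The example shows why balanced accuracy suits these data: on the predominantly inactive test set, the all-inactive classifier has a plain accuracy of 494/547 (about 0.90) but a BACC of only 0.5.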
[Figure panels: AUC, BACC]
▪ General applicability of the models
                        Test               Ext1                   Ext2
Data sources            Tox21              ChEMBL                 NCGC
Measurement methods     qHTS               literature-extracted   qHTS
Chemical domains        environmental      medicinal chemistry,   environmental
                        chemicals          natural product        chemicals
Compounds within        76.2%              22.5%                  86.2%
applicability domain    (=417/547)         (=50/222)              (=422/489)
of the best model [a]
[Poster section headings: Executive Summary, Introduction, Methods, Results]
[Results figure panels: AUC, BACC, Sensitivity, Specificity]
▪ Preprocessing
1. Salts and mixtures were replaced with their parent compounds.
2. Duplicate compounds and those with conflicting bioactivities
were removed.
3. Molecular properties were downloaded from PubChem.
4. 10% of the remaining compounds were randomly selected and
set aside as a test set.
5. The training set was balanced by downsampling the inactives.
• All 471 actives in the training set were kept.
• 471 inactives were selected after grouping the inactive
compounds in the training set (using k-means clustering
with the eight molecular properties as descriptors).
6. The external sets were preprocessed in the same manner as the
training/test sets.
7. External-set compounds that also occurred in the Tox21 set
were removed.
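Steps 2, 4, and 5 above can be sketched in a few lines of standard-library Python. All function names here are hypothetical. Two points are assumptions because the poster does not specify them: cluster labels are taken as already computed (for example by k-means on the molecular properties), and inactives are drawn from the clusters in round-robin order so that the kept compounds are spread across clusters.

```python
import random
from collections import defaultdict


def drop_duplicates_and_conflicts(records):
    """Step 2: keep one label per compound; drop compounds whose
    duplicate records disagree. records: iterable of (compound_id, label)."""
    labels = defaultdict(set)
    for cid, label in records:
        labels[cid].add(label)
    return {cid: next(iter(ls)) for cid, ls in labels.items() if len(ls) == 1}


def split_train_test(compound_ids, test_fraction=0.1, seed=0):
    """Step 4: randomly set aside a fraction of compounds as a test set."""
    ids = sorted(compound_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def downsample_by_cluster(cluster_of, n_keep, seed=0):
    """Step 5 (assumed scheme): pick n_keep compounds, cycling through
    the clusters. cluster_of maps compound_id -> cluster label."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for cid in sorted(cluster_of):
        groups[cluster_of[cid]].append(cid)
    pools = [groups[key] for key in sorted(groups)]
    for pool in pools:
        rng.shuffle(pool)
    picked = []
    while len(picked) < n_keep and any(pools):
        for pool in pools:
            if pool and len(picked) < n_keep:
                picked.append(pool.pop())
    return picked


# Toy data: compound 2 has conflicting records and is dropped.
records = [(1, "active"), (1, "active"), (2, "inactive"),
           (2, "active"), (3, "inactive")]
print(drop_duplicates_and_conflicts(records))
```

In the study's setting, `n_keep` would equal the number of training actives (471), giving the balanced 942-compound training set.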
Acknowledgements
▪ This research was supported by the Intramural Research Program
of the National Library of Medicine, National Institutes of Health,
U.S. Department of Health and Human Services.
▪ Data sets
• Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive.
  → Preprocessing → Training (942 compounds: 471 actives, 471 inactives) and Test (547 compounds: 53 actives, 494 inactives)
• ChEMBL (45 assays): data extracted from journal articles; predominantly active.
  → Preprocessing → Ext1 (222 compounds: 205 actives, 17 inactives)
• NCGC (2 assays): qHTS data; predominantly inactive; some overlap with the Tox21 data.
  → Preprocessing → Ext2 (489 compounds: 20 actives, 469 inactives)
▪ Machine-learning algorithms used for model building
▪ Ten-fold cross-validation was used for hyperparameter
optimization.
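As an illustration of how the training set can be divided for 10-fold cross-validation, the sketch below builds the folds with the standard library only. The poster does not say whether the folds were stratified or how they were drawn; preserving the class ratio in each fold is an assumption here, and `stratified_kfold` is a hypothetical name.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds,
    dealing the members of each class evenly across the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        valid = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, valid


# Balanced training set from the poster: 471 actives and 471 inactives.
labels = [1] * 471 + [0] * 471
for train_idx, valid_idx in stratified_kfold(labels, k=10):
    n_active = sum(labels[i] for i in valid_idx)
    print(len(train_idx), len(valid_idx), n_active)
```

Each hyperparameter setting would be trained on the nine training folds and scored on the held-out fold, with the ten scores averaged to rank the settings.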
The present study used RXRA activity data in PubChem
(https://pubchem.ncbi.nlm.nih.gov) to develop machine learning-based
prediction models for the RXRA activity of small molecules.
▪ Applicability domain of the models
▪ The applicability domain of the developed models was evaluated
using the distance-based approach described in Shen et al.
[J. Med. Chem., 2002, 45(13):2811-2823.]
[a] Generated using the random forest algorithm and the PubChem fingerprint.
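A distance-based applicability domain check can be sketched as follows. This is a common formulation of the idea, not a reproduction of the study's implementation: a cutoff is set to the mean nearest-neighbor distance within the training set plus Z standard deviations, and a query compound is considered inside the domain if its nearest training neighbor lies within that cutoff. The Euclidean metric, the use of a single nearest neighbor, the value Z = 0.5, and all function names are assumptions.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_distance(x, others):
    """Distance from x to its closest vector in others."""
    return min(euclidean(x, o) for o in others)


def domain_threshold(train, z=0.5):
    """Cutoff = mean + z * std of each training compound's distance
    to its nearest neighbor among the other training compounds."""
    nn = [nearest_distance(x, train[:i] + train[i + 1:])
          for i, x in enumerate(train)]
    return mean(nn) + z * pstdev(nn)


def in_domain(x, train, threshold):
    """True if x is at least as close to the training set as the cutoff."""
    return nearest_distance(x, train) <= threshold


# Toy descriptors: four training points on a unit square.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
cutoff = domain_threshold(train)
print(in_domain((0.5, 0.5), train, cutoff))  # True
print(in_domain((5.0, 5.0), train, cutoff))  # False
```

Applied to a whole data set, the fraction of compounds for which `in_domain` returns True gives a coverage figure of the kind reported in the general applicability table.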
