Using Open Bioactivity Data for Developing
Machine-Learning Prediction Models for Chemical
Modulators of the Retinoid X Receptor (RXR)
Signaling Pathway
Sunghwan Kim, Ph.D., M.Sc.
(sunghwan.kim@nih.gov)
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
ACS Fall 2018 National Meeting in Boston, MA
Wednesday, August 22, 2018
Contents
1. Introduction
• PubChem and its bioactivity data
• Retinoid X Receptor α (RXRA)
2. Methods
3. Results
4. Summary
Introduction
What is PubChem?
• NIH’s chemical information resource.
• Collects public-domain chemical data from >620 data sources.
• Disseminates it back to the public free of charge.
[Figure: Data collection and dissemination. PubChem collects data from government agencies, university labs, publishers, pharma companies, chemical vendors, and chemical biology resources, and disseminates it to the public free of charge.]
Data Organization in PubChem
[Figure: Data contributors make substance depositions (depositor-provided substance descriptions) and assay depositions (depositor-provided bioactivity test results). Unique chemical structures are extracted from the substances through standardization. Assay records hold the activity of tested “substances”; the activity of “compounds” is derived from the associated “substances”.]
Monthly Unique Users (interactive users only)
[Figure: Number of users (millions) versus time, January 2015 to July 2018. Source: Google Analytics]
For the past 12 months, 1.5 ~ 3.0 million unique users per month.
PubChem as a Source of Big Chemistry Data
The largest collection of public-domain chemical data.
• >96.5 million unique chemical structures (as of August 2018)
• Covers various chemical entities:
  small molecules, siRNAs, miRNAs, carbohydrates, lipids, peptides, chemically modified macromolecules, ...
• Contains various types of information:
  • 3-D structure     • Names/synonyms         • Bioactivity
  • Pharmacology      • Safety and handling    • Gene/protein targets
  • Patents           • Toxicology             • Regulation
  • Literature        • Environmental health   • Classifications
  ...
Bioactivity Data in PubChem
Category                         Compounds       % of all compounds
All compounds                    96.5 million    100.00%
Not tested                       93.5 million     96.92%
Tested                            3.0 million      3.08%
  Inactive                        1.8 million      1.90%
  Active (AC ≤ 1 nM)              50 thousand      0.05%
  Active (1 nM < AC ≤ 1 μM)      589 thousand      0.61%
  Active (others)                499 thousand      0.52%
Bioactivity Data in PubChem
High-throughput screening data
• From the Molecular Libraries Program and other HTS projects.
• Many inactives
• False hits (e.g., aggregators, autofluorescent compounds)
• Often measured at a single concentration
Literature-extracted data
• From manual curation or data mining
• No (or few) inactives
• Provided by various PubChem depositors, including ChEMBL, BindingDB, PDBbind, and Guide to Pharmacology
(Long-term) Goals
1. Develop a workflow for building a prediction model for the bioactivity of small molecules against a given target, using the public-domain data available in PubChem.
   • Structure-based (molecular docking)
   • Ligand-based (2-D/3-D similarity)
   • Machine-learning based
2. Automate the workflow to develop models for all targets for which enough bioactivity data are publicly available.
This study focuses on the first goal.
Retinoid X Receptor α (RXRA)
• Nuclear receptor, activated by 9-cis-retinoic acid.
• Forms heterodimers with other nuclear receptors:
  • retinoic acid receptors (RARs)
  • peroxisome proliferator-activated receptors (PPARs)
  • thyroid hormone receptors (T3R and TR-B)
  • vitamin D receptor (VDR)
  • liver X receptors (LXRs)
  • pregnane X receptor (PXR)
  • farnesoid X receptor (FXR)
  • constitutive androstane receptor (CAR or NR1I3)
  • nuclear receptor related 1 protein (NR4A2)
  • ...
• Forms homodimers and homotetramers.
Retinoid X Receptor α (RXRA)
• Involved in the regulation of gene expression in various biological processes.
• Potential roles in:
  • metabolic signaling pathways
  • skin alopecia (spot baldness)
  • dermal cysts
  • cardiac development
  • insulin sensitization
  • ...
Retinoid X Receptor α (RXRA)
• Bioactivity data for RXRA in PubChem (https://pubchem.ncbi.nlm.nih.gov/target/gene/RXRA)
  • 202 assays
  • 14,415 tested compounds
  • 1,022 active compounds
• These data are very heterogeneous!
  • different techniques: (q)HTS vs. literature-extracted
  • different definitions of active/inactive compounds
  • ...
• The present study used these data (with care) to develop a machine learning-based prediction model for the RXRA activity of small molecules.
Methods
Data Set for Model Development
• Training/test sets for model development
• AID 1159531 (https://pubchem.ncbi.nlm.nih.gov/bioassay/1159531)
  • Quantitative HTS (qHTS) data from the Tox21 project
  • Activity against RXRA was measured.

Activity Class          # Substances
Active                           919
  Active agonist                 251
  Active antagonist              668
Inactive                       7,083
Inconclusive                   1,165
Total                          9,667
Preprocessing
1. Salts and mixtures were replaced with their parent compounds.
   Parent compound:
   • Conceptually, the "important" part of a molecule that has multiple covalent units.
   • Must have at least one carbon and contain at least 70% of the heavy (non-hydrogen) atoms of all the unique covalent units (ignoring stoichiometry).
2. Remove duplicate compounds and those with conflicting bioactivities.
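The two rules above can be illustrated with a minimal sketch. This is not PubChem's own implementation, which the slides do not show. The input layouts (tuples of unit name, heavy-atom count, and a carbon flag; pairs of compound ID and activity label) and the function names `pick_parent` and `resolve_activities` are assumptions made only for this example.

```python
from collections import defaultdict

def pick_parent(units, min_fraction=0.7):
    """Pick the parent among the UNIQUE covalent units of a substance.

    units: list of (name, heavy_atom_count, has_carbon) tuples (hypothetical layout).
    Returns the unit that contains carbon and holds at least `min_fraction`
    of the heavy atoms summed over all unique units, or None if no unit qualifies.
    """
    total_heavy = sum(heavy for _, heavy, _ in units)
    for name, heavy, has_carbon in units:
        if has_carbon and heavy >= min_fraction * total_heavy:
            return name
    return None

def resolve_activities(records):
    """Collapse (cid, label) records to one label per compound.

    Duplicate records are merged; compounds whose records disagree
    (for example both "active" and "inactive") are dropped.
    """
    labels_by_cid = defaultdict(set)
    for cid, label in records:
        labels_by_cid[cid].add(label)
    return {cid: next(iter(labels))
            for cid, labels in labels_by_cid.items() if len(labels) == 1}

# Sodium acetate: acetate has 4 of the 5 heavy atoms and contains carbon.
parent = pick_parent([("acetate", 4, True), ("sodium", 1, False)])
# A mixture of two equally sized units has no parent under the 70% rule.
no_parent = pick_parent([("A", 10, True), ("B", 10, True)])

records = [(1, "active"), (1, "active"), (2, "inactive"),
           (3, "active"), (3, "inactive")]
clean = resolve_activities(records)
```

In this toy input, compound 1 is deduplicated, compound 2 is kept, and compound 3 is discarded because its two records conflict.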
Preprocessing
3. Download molecular properties from PubChem.
   • Molecular weight        • XLogP
   • Heavy atom count        • Topological polar surface area (TPSA)
   • Rotatable bond count    • Hydrogen bond donor count
   • Molecular complexity    • Hydrogen bond acceptor count
   o This step effectively removes compounds containing covalently bonded inorganic elements (because they are not supported in XLogP computation).
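The slides do not say which PubChem interface was used for this download. As one possibility, the sketch below builds a PUG-REST property request for the eight properties and filters out records without an XLogP value. The helper names `property_url` and `drop_missing_xlogp`, and the dict-per-row layout, are assumptions for illustration; no network call is made here.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# PUG-REST names for the eight properties listed on the slide.
PROPERTIES = [
    "MolecularWeight", "XLogP", "HeavyAtomCount", "TPSA",
    "RotatableBondCount", "HBondDonorCount", "HBondAcceptorCount", "Complexity",
]

def property_url(cids, properties=PROPERTIES, fmt="CSV"):
    """Build a PUG-REST URL that requests `properties` for the given CIDs."""
    cid_part = ",".join(str(int(cid)) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/{fmt}"

def drop_missing_xlogp(rows):
    """Keep only rows (dicts keyed by property name) with an XLogP value."""
    return [row for row in rows if row.get("XLogP") not in (None, "")]

url = property_url([2244, 3672])

# Toy rows standing in for a parsed response; values are placeholders.
rows = [{"CID": "1", "XLogP": "1.2"}, {"CID": "2", "XLogP": ""}, {"CID": "3"}]
kept = drop_missing_xlogp(rows)
```

Fetching `url` (for example with `urllib.request.urlopen`) and parsing the CSV with the `csv` module would yield rows of the kind that `drop_missing_xlogp` expects.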
Preprocessing
4. Randomly select 10% of the remaining compounds and set them aside as a test set.

                 Actives   Inactives   Total   Inactive/Active
Training set         471       4,445   4,916               9.4
Test set              53         494     547               9.3

5. Balance the training data set.
   • All 471 actives in the training set were kept.
   • 471 inactives were selected after grouping the inactive compounds in the training set using k-means clustering with the eight molecular properties as descriptors.
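Step 5 can be sketched as follows. The slides do not state how many clusters were used or how inactives were drawn from them, so this example makes assumptions that should be read as such: the number of clusters equals the number of inactives to keep, the properties are standardized first, and the member closest to each centroid is selected. The data are random placeholders, and `undersample_by_kmeans` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def undersample_by_kmeans(X_inactive, n_keep, seed=0):
    """Return sorted row indices of n_keep inactives, one per k-means cluster."""
    # Assumption: scale the properties so no single one dominates the distance.
    X = StandardScaler().fit_transform(X_inactive)
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(X)
    chosen = []
    for k in range(n_keep):
        members = np.flatnonzero(km.labels_ == k)
        if members.size == 0:
            continue  # guard against an empty cluster
        # Assumption: keep the cluster member nearest to the centroid.
        dists = np.linalg.norm(X[members] - km.cluster_centers_[k], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return sorted(chosen)

rng = np.random.default_rng(0)
X_inactive = rng.normal(size=(200, 8))  # toy stand-in for eight properties
picked = undersample_by_kmeans(X_inactive, n_keep=20)
```

With the real data, `n_keep` would be 471 and `X_inactive` would hold the eight properties of the 4,445 training-set inactives.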
Model Building
• Molecular descriptors
  • Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32(7): 1466-1474]

Abbreviation   Name                            Length
AP             AtomPairs 2D fingerprint           780
ESTAT          EState fingerprint                  79
EXTFP*         CDK extended fingerprint         1,024
FP*            CDK fingerprint                  1,024
GOFP*          CDK graph-only fingerprint       1,024
KR             Klekota-Roth fingerprint         4,860
MACCS          MACCS fingerprint                  166
PUB            PubChem fingerprint                881
SUB            Substructure fingerprint           307
* Hashed fingerprints
Model Building
• Machine-learning algorithms (implemented in scikit-learn)

Abbreviation   Name                     Hyperparameters optimized
NB             Naïve Bayes              α (10^-10 ~ 1)
DT             Decision tree            max_depth_range (3 ~ 7)
                                        min_samples_split_range (3 ~ 7)
                                        min_samples_leaf_range (2 ~ 6)
kNN            K-nearest neighbors      weights (uniform, minkowski, jaccard)
                                        n_neighbors (1 ~ 25)
RF             Random forest            n_estimators (10 ~ 200)
SVM            Support vector machine   C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
NN             Neural network           solver (lbfgs or adam); α (10^-7 ~ 10^7)

• 10-fold cross-validation was used for hyperparameter optimization.
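As an illustration of this setup, the sketch below tunes `n_estimators` for a random forest with 10-fold cross-validation and AUC as the selection score, using scikit-learn's `GridSearchCV`. The grid values, the random seeds, the use of stratified folds, and the toy fingerprint matrix are assumptions; the slides give only the range (10 ~ 200) and the number of folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 32))  # toy binary fingerprints
y = np.array([0, 1] * 50)               # balanced toy labels

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 100, 200]},  # assumed grid within 10 ~ 200
    scoring="roc_auc",                            # AUC drives the selection
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

best_n = search.best_params_["n_estimators"]
best_auc = search.best_score_  # mean AUC over the 10 folds
```

The same pattern applies to the other five algorithms by swapping the estimator and its parameter grid.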
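Model Performance Evaluation
• Area under the receiver operating characteristic curve (AUC)
  • Used for hyperparameter optimization.
• Balanced accuracy (BACC):
  $$\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$$
• Sensitivity (SENS):
  $$\mathrm{SENS} = \frac{TP}{TP+FN}$$
• Specificity (SPEC):
  $$\mathrm{SPEC} = \frac{TN}{TN+FP}$$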
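A short worked example of these three measures, computed from binary labels (1 = active, 0 = inactive). The function names and the toy labels are chosen only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def sens_spec_bacc(y_true, y_pred):
    """Return (SENS, SPEC, BACC) following the definitions above."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, 0.5 * (sens + spec)

# Four actives (three found) and four inactives (two correctly rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec, bacc = sens_spec_bacc(y_true, y_pred)  # 0.75, 0.5, 0.625
```

Because BACC averages the two per-class rates, it stays informative on sets such as the test set here, where inactives outnumber actives by roughly nine to one.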
External Data Sets
• External sets for testing the general applicability of the models.
• Ext1: 222 compounds from 45 ChEMBL assays
  • Literature-extracted
  • Predominantly active compounds
• Ext2: 489 compounds from 2 NCGC assays (AIDs 588544 and 588546)
  • Quantitative HTS data
  • Predominantly inactive
• The external sets were preprocessed in the same manner as the training/test sets.
Applicability Domain
• The applicability domain of the developed models was assessed using a distance-based approach [Shen et al. (2002). J. Med. Chem., 45(13): 2811-2823].

  $$D_T = \bar{d} + Z \cdot \sigma$$

1. For each training set compound, the (Manhattan) distance to its nearest neighbor in the training set was collected. The mean ($\bar{d}$) and standard deviation ($\sigma$) of these distances, together with an adjustable parameter $Z$, determine the applicability domain threshold $D_T$.
2. If a test/external set compound does not have a neighbor in the training set closer than $D_T$, it was considered to be out of the applicability domain of the model, and the prediction for that compound was deemed to be unreliable.
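The two steps can be sketched in plain Python on toy binary fingerprints. The value Z = 0.5 and the use of the population standard deviation are assumptions made for this example; the slides do not state either choice.

```python
from statistics import mean, pstdev

def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_distance(query, pool):
    """Distance from `query` to its nearest neighbor in `pool`."""
    return min(manhattan(query, other) for other in pool)

def ad_threshold(train, z=0.5):
    """D_T = mean + z * std of nearest-neighbor distances within the training set."""
    nn = [nearest_distance(fp, train[:i] + train[i + 1:])
          for i, fp in enumerate(train)]
    return mean(nn) + z * pstdev(nn)

def in_domain(query, train, threshold):
    """True if `query` has a training-set neighbor closer than the threshold."""
    return nearest_distance(query, train) < threshold

train = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 1, 1, 0), (1, 0, 0, 0), (0, 0, 1, 1)]
d_t = ad_threshold(train)                    # nn distances [1, 1, 1, 1, 2]
inside = in_domain((1, 1, 0, 1), train, d_t)   # nearest neighbor at distance 1
outside = in_domain((0, 1, 0, 1), train, d_t)  # nearest neighbor at distance 2
```

Here the nearest-neighbor distances have mean 1.2 and standard deviation 0.4, so the threshold is 1.4; the first query falls inside the domain and the second outside.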
Results
Performance of the models: area under ROC curve (AUC)
• AUC scores of 0.7 were observed for models developed using PubChem/MACCS/CDK-FP with NN/SVM/RF/kNN.
• Maximum AUC score (0.77): PubChem fingerprint with RF
Performance of the models: balanced accuracy (BACC)
• The BACC scores of the models show trends similar to those of the AUC scores.
• Maximum BACC score (0.70) for:
  • PubChem-RF
  • MACCS-RF
  • MACCS-SVM
Performance of the models: sensitivity and specificity
• NB and DT show poor specificities (in general) → poor performances
General applicability of the models
[Figure: Area under ROC curve (AUC), inactive-to-active ratio = 1]
[Figure: Balanced accuracy (BACC), inactive-to-active ratio = 1]
General applicability of the models

                              External Set 1 (Ext1)    External Set 2 (Ext2)
Performance measures          not very similar         similar
(compared to the test set)
Data sources                  ChEMBL                   NCGC
Measurement methods           literature-extracted     qHTS
Chemical domains              medicinal chemistry,     environmental chemicals
                              natural product
Compounds within              22.5% (=50/222)          86.2% (=422/489)
applicability domain (a)

(a) The fraction of “test set” compounds within the applicability domain is 76.2% (=417/547).
Summary
• Bioactivity data contained in PubChem were used to develop computational prediction models for the RXRA activity of small molecules.
• The RXRA activity data from the Tox21 project were used to build and test the models, and the ChEMBL and NCGC data sets were used as external test sets to further test the general applicability of the models.
• Six machine learning methods in conjunction with nine molecular fingerprints were used to build the models.
• The best model generated from the balanced training set, developed using random forest and the PubChem fingerprint, gave an AUC score of 0.77 and a BACC of 0.70.
• When the models were tested against the two external data sets, their performance against the ChEMBL set was found to be very different from that against the Tox21 test set and the NCGC set.
• This indicates that the compound coverage of the Tox21 project is somewhat different from what scientists in the medicinal chemistry and natural product areas have been studying.
• This study showcases how public data contained in PubChem can be used to develop prediction models for small-molecule bioactivity, using open-source software and PubChem’s public services.
Acknowledgements
Evan Bolton
Jie Chen
Tiejun Cheng
Asta Gindulyte
Jia He
Siqian He
Qingliang Li
Benjamin Shoemaker
Paul Thiessen
Bo Yu
Leonid Zaslavsky
Jian Zhang
• The PubChem Team
• PubChem depositors, users, and collaborators
• Funded by the National Library of Medicine
