Cibb2013

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
Advanced Circuits, Architecture, and Computing Lab
Molecular Docking for Drug Discovery:
Machine-Learning Approaches for Native
Pose Prediction for Protein-Ligand Complexes
Hossam M. Ashtawy (ashtawy@egr.msu.edu)
Nihar R. Mahapatra (nrm@egr.msu.edu)
Tenth International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2013)
June 20, 2013
Department of Electrical & Computer Engineering
Michigan State University, East Lansing, MI, U.S.A.
© 2013

Motivation
• Accurately predicting the binding affinity (BA) of large sets of diverse protein-ligand complexes remains one of the most challenging unsolved problems in computational biomolecular science
• Conventional scoring functions (SFs) have been shown to have limited predictive and docking power
• The size and diversity of protein-ligand complexes with known experimental BA are limited
– Large and diverse datasets of protein-ligand complexes help in building more accurate statistical-based SFs

Outline
• Motivation
• Background and Scope of Work
– Scoring Functions
– Our Approach and Scope of Work
• Materials and Methods
– Compound Database and Characterization
– Machine Learning Methods
• Experiments, Results, and Discussion
– Tuning, Training, and Testing Scoring Functions
– Evaluation and Comparison of Scoring Functions
• Concluding Remarks

Background and Scope of Work

Scoring & Docking Challenges
• Lack of accurate accounting of intermolecular physicochemical interactions
• Imprecise solvent modeling
• Uncertainties in collected experimental affinity data
• Inability to capture inherent nonlinear relationships correlating intermolecular interactions to binding affinity or native binding pose

Our Approach & Scope of Work
• Predict the binding pose explicitly
• Use sophisticated machine-learning methods to model the closeness of a pose to the native conformation
• Use these nonparametric techniques in conjunction with physicochemical features describing intermolecular interactions between proteins and ligands
• Train predictive models on a large and diverse dataset of high-quality protein-ligand complexes
• Evaluate the docking accuracies of the resulting SFs on diverse protein families

Compound Database: PDBbind
• Protein-ligand complexes obtained from PDBbind 2007
• PDBbind is a selective compilation of the Protein Data Bank (PDB) database [1]

PDBbind: Refined Set
PDB entries are kept only if they satisfy all of the following:
• Ligand’s MW ≤ 1000
• Number of non-hydrogen atoms of the ligand ≥ 6
• Only one ligand is bound to the protein
• Protein and ligand are non-covalently bound
• Resolution of the complex crystal structure ≤ 2.5 Å
• Elements in the complex must be C, N, O, P, S, F, Cl, Br, I, H
• Known Kd or Ki
The retained structures then undergo:
• Hydrogenation
• Protonation & deprotonation
Result: refined set of PDBbind
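The refined-set criteria can be read as a single boolean filter over PDB entries. The following is a minimal sketch under stated assumptions: the `Complex` record and its field names are hypothetical and do not reflect PDBbind's actual schema or tooling; only the thresholds come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

# Elements permitted in a refined-set complex (from the slide).
ALLOWED_ELEMENTS = {"C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "H"}


@dataclass
class Complex:
    # Hypothetical record; field names are illustrative only.
    ligand_mw: float
    ligand_heavy_atoms: int
    n_bound_ligands: int
    covalently_bound: bool
    resolution_angstrom: float
    elements: frozenset
    kd_or_ki: Optional[float]  # None when neither Kd nor Ki is known


def passes_refined_filters(c: Complex) -> bool:
    """Return True if the complex meets every refined-set criterion."""
    return (
        c.ligand_mw <= 1000
        and c.ligand_heavy_atoms >= 6
        and c.n_bound_ligands == 1
        and not c.covalently_bound
        and c.resolution_angstrom <= 2.5
        and c.elements <= ALLOWED_ELEMENTS  # subset check
        and c.kd_or_ki is not None
    )
```

Hydrogenation and protonation/deprotonation are structure-preparation steps and are deliberately left out of this sketch.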

PDBbind: Core Set
• Start from the refined set
• Similarity search using BLAST
• Similarity cutoff of 90%
• Keep clusters with ≥ 4 complexes
• Binding affinity of the highest-affinity complex is 100-fold the affinity of the lowest one
• Take the first, middle, and lowest affinity complexes from each cluster
• Result: Core Set in PDBbind [2]
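The selection step of the core-set pipeline can be sketched as follows. This is an illustration under assumptions, not the PDBbind procedure itself: cluster labels are taken as given (the BLAST search at the 90% cutoff is not shown), affinities are represented as dissociation constants, the 100-fold condition is read as "at least 100-fold", and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def select_core_set(complexes, min_cluster_size=4, min_fold=100.0):
    """Pick three representatives per qualifying cluster.

    complexes: iterable of (complex_id, cluster_id, kd) tuples, where
    cluster_id is assumed to come from a prior sequence-similarity
    clustering and kd is a dissociation constant (lower = tighter binding).
    """
    clusters = defaultdict(list)
    for complex_id, cluster_id, kd in complexes:
        clusters[cluster_id].append((kd, complex_id))

    core = []
    for members in clusters.values():
        if len(members) < min_cluster_size:
            continue  # cluster too small
        members.sort()  # ascending Kd: strongest binder first
        strongest_kd, weakest_kd = members[0][0], members[-1][0]
        if weakest_kd / strongest_kd < min_fold:
            continue  # affinity range too narrow
        middle = members[len(members) // 2]
        core.extend(cid for _, cid in (members[0], middle, members[-1]))
    return core
```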

Decoy Generation
• Start from a protein-ligand complex
• Generate a random low-energy conformation
• Generate ~2000 conformations using 4 different docking protocols
• Discard poses > 10 Å from the native pose
• Group poses into 10 bins of 1 Å each based on their RMSD values
• Further cluster each bin into 10 clusters
• Choose the pose with the lowest energy from each sub-cluster
• Result: 100 decoys [2]
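The binning and selection steps can be sketched as below. The sketch assumes each pose already carries a sub-cluster label (the slide does not say how poses are clustered inside a bin, so that step is not shown), and the dictionary keys are hypothetical names chosen for illustration.

```python
def select_decoys(poses, max_rmsd=10.0, bin_width=1.0):
    """Keep the lowest-energy pose per (RMSD bin, sub-cluster) pair.

    poses: iterable of dicts with keys 'rmsd' (in Å from the native pose),
    'energy', and 'subcluster' (label assigned by clustering the poses
    inside each RMSD bin; assumed to be precomputed).
    """
    n_bins = int(max_rmsd / bin_width)
    best = {}
    for pose in poses:
        if pose["rmsd"] > max_rmsd:
            continue  # discard poses farther than max_rmsd from the native pose
        # An RMSD exactly equal to max_rmsd falls into the last bin.
        bin_idx = min(int(pose["rmsd"] // bin_width), n_bins - 1)
        key = (bin_idx, pose["subcluster"])
        if key not in best or pose["energy"] < best[key]["energy"]:
            best[key] = pose
    # With 10 bins and 10 sub-clusters per bin this yields at most 100 decoys.
    return [best[key] for key in sorted(best)]
```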

Compound Characterization
• Extracted features calculated for the following scoring functions:
– X-Score (6 features)
– AffiScore (30 features)
– RF-Score (36 features)
– GOLD (14 features)
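One way to picture the characterization step is as concatenating per-tool feature vectors into a single descriptor for each pose. The sketch below is an assumption about how the feature sets might be combined, not the authors' code; only the feature counts come from the slide, and `combine_features` is a hypothetical helper.

```python
# Feature counts per tool, as listed on the slide.
FEATURE_COUNTS = {"X-Score": 6, "AffiScore": 30, "RF-Score": 36, "GOLD": 14}


def combine_features(per_tool_features, tools):
    """Concatenate the feature vectors of the selected tools, in order."""
    combined = []
    for tool in tools:
        vector = per_tool_features[tool]
        expected = FEATURE_COUNTS[tool]
        if len(vector) != expected:
            raise ValueError(
                f"{tool}: expected {expected} features, got {len(vector)}"
            )
        combined.extend(vector)
    return combined
```

Using all four tools gives a descriptor of 6 + 30 + 36 + 14 = 86 features.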

Training and Test Datasets
• Primary training set (Pr): 1105 (Y = BA); 39,085 (Y = RMSD)
• Core test set (Cr): 16,554

Machine Learning Methods
• Single models
– Multiple linear regression (MLR)
– Multivariate adaptive regression splines (MARS)
– k-Nearest neighbors (kNN)
– Support vector machine (SVM)
• Ensemble models
– Random forests (RF)
– Boosted regression trees (BRT)
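As an illustration of the simplest model in the list, a kNN regressor that predicts a pose's RMSD from its feature vector might look like the sketch below. It assumes Euclidean distance and an unweighted mean over neighbors; the slides do not state the distance metric, weighting, or value of k actually used, so these are placeholders.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a target (e.g., RMSD) as the mean over the k nearest
    training points under Euclidean distance."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k
```

Poses of a complex can then be ranked by ascending predicted RMSD, so the pose predicted to be closest to the native conformation comes first.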

Conventional SFs
SF Type \ Software  Discovery Studio   SYBYL                GOLD             Schrodinger   Standalone   |SFs|
Empirical           PLP, JAIN, LUDI    ChemScore, F-Score   ChemScore, ASP   GlideScore    X-Score      9
Knowledge-based     LigScore, PMF      PMF-Score            -                -             DrugScore    4
Force-field         -                  D-Score, G-Score     GoldScore        -             -            3
|SFs|               5                  5                    3                1             2            16

Experiments, Results, and Discussion

SF Construction & Application Workflow

Evaluation of Scoring Functions
• Docking power: Measures the ability of an SF to distinguish a promising binding mode from a less promising one
• $S_{N}^{C}$ (in %): Success rate, i.e., the percentage of times an SF is able to find a pose whose RMSD from the native pose is within a predefined cutoff value of C Å when only the N topmost poses, ranked by their predicted scores, are considered
• C (e.g., 0, 1, 2, and 3 Å); N (e.g., 1, 2, 3, and 5)
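The success-rate statistic defined above can be computed as in the following sketch. The data layout (one list of `(predicted_score, true_rmsd)` pairs per complex) and the `lower_is_better` flag are assumptions made for illustration: an SF trained to predict RMSD ranks lower scores first, while an SF that outputs a binding affinity would typically rank higher scores first.

```python
def success_rate(complexes, cutoff, top_n, lower_is_better=True):
    """Percentage of complexes with a near-native pose among the top N.

    complexes: list of pose lists; each pose is (predicted_score, true_rmsd).
    A complex counts as a success if at least one of its top_n ranked poses
    has true_rmsd <= cutoff (in Å).
    """
    if not complexes:
        raise ValueError("need at least one complex")
    hits = 0
    for poses in complexes:
        ranked = sorted(poses, key=lambda p: p[0], reverse=not lower_is_better)
        if any(rmsd <= cutoff for _, rmsd in ranked[:top_n]):
            hits += 1
    return 100.0 * hits / len(complexes)
```

For example, `success_rate(complexes, cutoff=1.0, top_n=1)` corresponds to the statistic with C = 1 Å and N = 1.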

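Success rates of Conv. & ML SFs: Cr
[Figure: success rates of conventional and ML SFs on the core test set Cr. Plot annotations: > 60%, ~50%, > 70%, ~80%, $S_{1}^{2}$ (cutoff C = 2 Å, N = 1 top-ranked pose), < 5%]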

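Success rates of Conv. & ML SFs: Core test set
[Figure: success rates of conventional and ML SFs on the core test set. Plot annotations: GOLD::ASP from 82% to 92%; RF::RG from 87% to 96%; $S_{5}^{0}$ ~60%; $S_{5}^{0}$ ~77%]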

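Success rates of Conv. & ML SFs: HIV & TRY
[Figure: success rates of conventional and ML SFs on the HIV protease (HIV) and trypsin (TRY) test sets. Panel annotations, in order of appearance:]
• MLR: $S_{1}^{1}$ = 72%, $S_{1}^{3}$ = 90%; MLR: $S_{1}^{1}$ = 80%, $S_{1}^{3}$ = 95%
• MLR: $S_{1}^{0}$ = 50%, $S_{5}^{0}$ = 90%; MARS: $S_{1}^{0}$ = 48%, $S_{5}^{0}$ = 83%
• MLR: $S_{1}^{1}$ = 41%, $S_{1}^{3}$ = 80%; MLR: $S_{1}^{1}$ = 66%, $S_{1}^{3}$ = 90%
• MARS: $S_{1}^{0}$ = 36%, $S_{5}^{0}$ = 80%; MARS: $S_{1}^{0}$ = 23%, $S_{5}^{0}$ = 68%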

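Success rates of Conv. & ML SFs: CAR & THR
[Figure: success rates of conventional and ML SFs on the carbonic anhydrase (CAR) and thrombin (THR) test sets. Panel annotations, in order of appearance:]
• MLR: $S_{1}^{1}$ = 22%, $S_{1}^{3}$ = 53%; SVM: $S_{1}^{1}$ = 32%, $S_{1}^{3}$ = 62%
• MLR: $S_{1}^{0}$ = 40%, $S_{5}^{0}$ = 79%; MARS: $S_{1}^{0}$ = 24%, $S_{5}^{0}$ = 74%
• MLR: $S_{1}^{0}$ = 15%, $S_{5}^{0}$ = 34%; MARS: $S_{1}^{0}$ = 9%, $S_{5}^{0}$ = 33%
• MLR: $S_{1}^{1}$ = 58%, $S_{1}^{3}$ = 82%; MLR: $S_{1}^{1}$ = 92%, $S_{1}^{3}$ = 95%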

Concluding Remarks
• ML models trained to explicitly predict RMSD values significantly outperform all conventional SFs
• Estimated RMSD values of such models have a correlation of 0.7 on average with the true RMSD values, whereas predicted BAs have a correlation as low as 0.2 with the measured RMSD values
• The empirical SF GOLD::ASP achieved a success rate of 70% in identifying a pose that lies within 1 Å of the native pose across 195 different complexes
• Our top RMSD-based SF, MARS::XARG, has a success rate of ~80% on the same test set
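The correlation figures quoted above compare model outputs with true RMSD values. The slides do not say which correlation coefficient was used; the sketch below assumes the Pearson coefficient purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the predicted values (RMSD or BA) for a set of poses and `ys` their true RMSD values.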

References
[1] Berman, H. et al.: The Protein Data Bank. Nucleic Acids Research 28 (1) (2000) 235–242.
[2] Cheng, T., Li, X., Li, Y., Liu, Z., Wang, R.: Comparative assessment of scoring functions on a diverse test set. Journal of Chemical Information and Modeling 49 (4) (2009) 1079–1093.
