Computational tools for drug discovery

Akos Tarcsay

https://doi.org/10.1038/sj.ijir.3901522

Fate of drugs in the human body
https://doi.org/10.1038/nrd1032

Anti-targets

Target: the William Tell challenge
~20k non-modified (canonical) human proteins
1 target to engage with

Rubik's cube for medicinal chemists
- Medicinal chemistry optimization is a multi-dimensional problem
- Each chemical modification corresponds to a series of biological activity changes
- A Rubik's cube has 43 252 003 274 489 856 000 (~4.3 × 10^19) different configurations
- All configurations can be solved in ~20 steps
- Druggable chemical space is ~10^60
https://doi.org/10.1016/j.drudis.2011.05.005

https://doi.org/10.1038/nrd.2017.232

Protein-ligand interaction

https://doi.org/10.1002/1521-3773(20020802)41:15<2644::AID-ANIE2644>3.0.CO;2-O

Scale
Ki or IC50 or EC50

M [mol/dm3]  Concentration  ΔG [kJ/mol]  ΔG [kcal/mol]  Affinity
1.00E-01     100 mM          -5.7         -1.4          Weak
1.00E-02     10 mM          -11.4         -2.7          Weak
1.00E-03     1 mM           -17.1         -4.1          Weak
1.00E-04     100 µM         -22.8         -5.5          Weak
1.00E-05     10 µM          -28.5         -6.8          Medium
1.00E-06     1 µM           -34.2         -8.2          Medium
1.00E-07     100 nM         -39.9         -9.5          Strong
1.00E-08     10 nM          -45.6        -10.9          Strong
1.00E-09     1 nM           -51.3        -12.3          Strong
1.00E-10     100 pM         -57.0        -13.6          Very strong
1.00E-11     10 pM          -62.8        -15.0          Very strong
1.00E-12     1 pM           -68.5        -16.4          Very strong
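
The ΔG columns of the scale follow from ΔG = RT ln K. A minimal Python sketch of that conversion is given here; the function name is hypothetical, and the temperature (298 K) and gas constant (8.314 J/(mol·K)) are assumptions chosen because they reproduce the tabulated values, since the slide does not state them.

```python
import math

R_KJ_PER_MOL_K = 8.314e-3   # gas constant in kJ/(mol*K) (assumed value)
KJ_PER_KCAL = 4.184         # thermochemical calorie


def binding_free_energy(k_molar, temperature=298.0):
    """Return (dG in kJ/mol, dG in kcal/mol) for a Ki/IC50/EC50 given in mol/dm3.

    Uses dG = R*T*ln(K); the 298 K default is an assumption, not stated on the slide.
    """
    if k_molar <= 0:
        raise ValueError("the constant must be positive")
    dg_kj = R_KJ_PER_MOL_K * temperature * math.log(k_molar)
    return dg_kj, dg_kj / KJ_PER_KCAL


for k in (1e-3, 1e-6, 1e-9, 1e-12):
    dg_kj, dg_kcal = binding_free_energy(k)
    print(f"{k:.0e} M  {dg_kj:6.1f} kJ/mol  {dg_kcal:6.1f} kcal/mol")
```

Each tenfold improvement in the constant is worth about 5.7 kJ/mol (1.4 kcal/mol) at this temperature.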

https://doi.org/10.1021/jm100112j

Ki of 1 nM.
Replacing the isopropyl group (marked in red) with hydrogen weakens the affinity to a Ki of 39 μM.
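
As a rough worked estimate of what this single substituent is worth, the two Ki values can be converted to a free energy difference. The temperature (298 K, so RT ≈ 2.48 kJ/mol) is an assumption, and the superscripts H and iPr are labels introduced here for the two compounds.

```latex
\Delta\Delta G
  = RT \ln\frac{K_i^{\mathrm{H}}}{K_i^{i\mathrm{Pr}}}
  = RT \ln\frac{39 \times 10^{-6}}{1 \times 10^{-9}}
  \approx 2.48\ \mathrm{kJ/mol} \times 10.57
  \approx 26.2\ \mathrm{kJ/mol}
  \approx 6.3\ \mathrm{kcal/mol}
```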

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3209665/

Molecules are chameleons
https://pixabay.com/photos/chameleon-mimicry-green-557367/

The Nature of Chemistry
Environment: pH, solvent, …
Impacts:
- Tautomerism
- Ionization, microspecies
- Solubility
- Lipophilicity
By IconMark, PH from the Noun Project

Histidine, a simple amino acid
Strongest basic pKa: 9.44
Strongest acidic pKa: 1.85

Basic pKa: 9.44, 6.61
Acidic pKa: 1.85, 12.94
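
To illustrate how pH changes the ionization state, the sketch below applies the Henderson-Hasselbalch relation to one site at a time. It is a simplification: each site is treated independently, whereas a microspecies calculation accounts for coupling between sites. The function name is hypothetical, and assigning the basic pKa of 6.61 to the imidazole ring is an assumption, since the slide lists the values without assignment.

```python
def protonated_fraction(pka, ph):
    """Fraction of a single titratable site that carries its proton at the given pH.

    Henderson-Hasselbalch for an isolated site; coupling between sites is ignored.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


IMIDAZOLE_PKA = 6.61  # from the slide; the site assignment is an assumption

for ph in (2.0, 6.61, 7.4):
    fraction = protonated_fraction(IMIDAZOLE_PKA, ph)
    print(f"pH {ph:5.2f}: {fraction:.1%} protonated")
```

At pH equal to the pKa the site is half protonated; near physiological pH the protonated form of this site becomes the minor species.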

https://disco.chemaxon.com/calculators/demo/playground/

https://disco.chemaxon.com/calculators/demo/plugins/tautomers/
Tautomerization

https://pubs.acs.org/doi/10.1021/acs.jcim.0c00232
Tautomerization effect
CBR14->CBR18: QM-based rule update
Calculated phys-chem properties


“The fundamental laws necessary for the mathematical treatment of a large part of physics and the whole of chemistry are thus completely known, and the difficulty lies only in the fact that application of these laws leads to equations that are too complex to be solved.”
(Paul Dirac, 1929)

Levels of approximation
- Ligand only
  QM: ESP, torsional scan, tautomers, interactions
  MM: flexible-rigid alignment
- Flexible ligand and rigid protein
  Docking, scoring functions

QM application: Halogen bond
https://pubs.acs.org/doi/10.1021/jm3012068


QM application: Site of metabolism prediction: SMARTCyp
B3LYP, 6-311++G(2d,2p)
https://doi.org/10.1021/ml100016x


https://pubs.acs.org/doi/10.1021/ci100436p
Docking

https://doi.org/10.1007/s12539-019-00327-w
Scoring

Validation
https://doi.org/10.1021/ci600426e

https://doi.org/10.1007/s10822-015-9883-y
Docking into MD frames: 5-HT6

- Flexible ligand and flexible protein
  Induced fit docking
  MD - Ensemble docking
  Free energy perturbation

Free energy perturbation
Thermodynamic cycle
https://doi.org/10.1007/978-1-4939-9608-7
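
The thermodynamic cycle can be written out as follows for two ligands A and B. The subscript names are chosen here for illustration; the relation itself is the standard relative binding free energy identity, in which the alchemical A to B transformation is carried out once in the protein complex and once in solvent.

```latex
\Delta\Delta G_{\mathrm{bind}}(A \to B)
  = \Delta G_{\mathrm{bind}}(B) - \Delta G_{\mathrm{bind}}(A)
  = \Delta G_{\mathrm{complex}}(A \to B) - \Delta G_{\mathrm{solvent}}(A \to B)
```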

Alchemical transformation
FEP+
- Hamiltonian replica exchange method
- The region surrounding the protein binding pocket is “heated up”
- The rest of the system stays “cold”
- A GPU calculation involving ~6000 atoms requires ~6 h (4 per day)
https://doi.org/10.1007/978-1-4939-9608-7

https://doi.org/10.1007/978-1-4939-9608-7
FEP+ results

Constraints
1. Availability of at least one high-quality crystal structure with co-crystallized series ligand.
2. A reasonable expectation of a conserved binding mode across the series.
3. Minimal tautomeric, ionization state, and stereochemistry uncertainties across the series.
4. High reliability experimental binding data from the same assay for all compounds.
5. Assay data and crystal structures are for the same protein construct.
https://doi.org/10.1007/978-1-4939-9608-7

In God we trust, all others bring data.
William Edwards Deming
Trevor Hastie, Robert Tibshirani, Jerome Friedman
The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Drug discovery data is expensive

https://doi.org/10.1021/ci100258p

https://www.youtube.com/watch?v=UjBRNeqaDJA

http://dx.doi.org/10.1021/acs.jmedchem.7b00935
Sharing confidential data via MMP

Validation and error prediction

https://doi.org/10.1021/ci400084k
Model validation

Random
Time
Neighbor
Prospective
https://doi.org/10.1021/ci400084k
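
Of the split strategies listed, the time split is the easiest to sketch: train on the oldest measurements and test on the most recent ones, which mimics prospective use. The record layout, field names and compound IDs below are hypothetical.

```python
from datetime import date


def time_split(records, test_fraction=0.2):
    """Train on the oldest measurements and hold out the most recent ones."""
    ordered = sorted(records, key=lambda r: r["measured"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical compounds with measurement dates (deliberately unsorted).
records = [
    {"id": "cpd-3", "measured": date(2020, 6, 1), "pAct": 7.1},
    {"id": "cpd-1", "measured": date(2019, 3, 1), "pAct": 6.2},
    {"id": "cpd-5", "measured": date(2021, 9, 1), "pAct": 8.0},
    {"id": "cpd-2", "measured": date(2019, 11, 1), "pAct": 5.8},
    {"id": "cpd-4", "measured": date(2021, 1, 1), "pAct": 7.4},
]

train, test = time_split(records)
print([r["id"] for r in train], [r["id"] for r in test])
```

A random split would instead shuffle the records before cutting, which typically gives more optimistic statistics because close analogues end up on both sides of the split.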

https://www.kaggle.com/alexisbcook/cross-validation
Cross validation

Bootstrap methods
Training set Z = (z1, z2, ..., zN), where zi = (xi, yi)
Draw N samples from Z with replacement; repeat B times, producing B bootstrap datasets
S(Z) is any quantity computed from the data Z
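
A minimal sketch of the procedure using only the Python standard library follows. The function name and the example values are hypothetical; S(Z) is taken to be the mean purely for illustration.

```python
import random
import statistics


def bootstrap(data, statistic, n_boot=1000, seed=0):
    """Compute `statistic` on n_boot resamples of `data` drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]


values = [5.1, 6.3, 7.2, 6.8, 5.9, 8.1, 7.4, 6.0]  # hypothetical pAct values
estimates = bootstrap(values, statistics.mean)

# The spread of the bootstrap estimates approximates the standard error of S(Z).
standard_error = statistics.stdev(estimates)
print(f"mean = {statistics.mean(values):.2f}, bootstrap SE = {standard_error:.2f}")
```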

https://yashuseth.blog/2017/12/02/bootstrapping-a-resampling-method-in-statistics/

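Conformal prediction
[Diagram: the training set is split into a proper training set and a calibration set. The proper training set feeds the model and the error model; the calibration set gives the calibration factor (ɑ) at a chosen confidence level, e.g. P(80%), which is applied to the error prediction.]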
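
One common way to realize this scheme is normalized inductive conformal regression, sketched below. This is an illustration under stated assumptions, not the implementation behind the slide: the function names are hypothetical, the model and error model are assumed to have been fitted already on the proper training set, and only their outputs on the calibration set are used.

```python
import math


def calibration_factor(y_true, y_pred, err_pred, confidence=0.80):
    """Scaling factor alpha from calibration-set nonconformity scores.

    The score is |y - y_hat| / predicted_error; alpha is the score at rank
    ceil((n + 1) * confidence), the usual finite-sample corrected quantile.
    """
    scores = sorted(abs(t - p) / e for t, p, e in zip(y_true, y_pred, err_pred))
    rank = math.ceil((len(scores) + 1) * confidence)
    if rank > len(scores):
        raise ValueError("calibration set too small for this confidence level")
    return scores[rank - 1]


def prediction_interval(y_pred, err_pred, alpha):
    """Interval expected to contain the true value at the calibrated confidence."""
    half_width = alpha * err_pred
    return y_pred - half_width, y_pred + half_width


# Hypothetical calibration set: nine compounds, all predicted as 5.0 with unit error.
cal_true = [5.0 + 0.1 * i for i in range(1, 10)]
cal_pred = [5.0] * 9
cal_err = [1.0] * 9

alpha = calibration_factor(cal_true, cal_pred, cal_err, confidence=0.80)
low, high = prediction_interval(6.0, 0.5, alpha)
print(f"alpha = {alpha:.2f}, interval = ({low:.2f}, {high:.2f})")
```

Compounds for which the error model predicts a larger error receive a wider interval, while alpha keeps the overall coverage at the chosen level.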

Classification validation: confusion matrix
https://doi.org/10.1186/s12864-019-6413-7

Bayes theorem
P(A|B) = P(B|A) × P(A) / P(B)

Bayes theorem
99% sensitivity
99% specificity
0.5% positive cases
P(TP|+) = 0.99 × 0.005 / [0.99 × 0.005 + 0.01 × 0.995] = 33.2%
If the test is positive, there is still only a 33.2% chance that it is a true positive.
Example: 1000 cases, 995 negative, 5 positive
995 × 0.01 ≈ 10 false positives
5 × 0.99 ≈ 5 true positives
Total positives 15, true positives 5 (33%)
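
The same calculation as a small Python function, so that other prevalence values can be tried. The function name is hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(truly positive | test positive) from Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)


ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.005)
print(f"{ppv:.1%}")  # 33.2%
```

With a prevalence of 50% the same test would give a positive predictive value of 99%, which shows how strongly the result depends on the share of true positives in the screened set.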

Generating models on confidential data
https://arxiv.org/pdf/1610.05755.pdf

Descriptors
Phys-chem: lipophilicity (logP/logD), pKa, solubility, donors, acceptors
Topological: polar surface area, ring count, bond counts, Fsp3, graph distance indices, donor count, acceptor count

Radial fingerprint
https://doi.org/10.1021/ci100050t

Similarity definitions: Tanimoto
T = AND(A,B) / OR(A,B)
A = 24, B = 21, A&B = 19
T = 19 / (24 + 21 - 19) = 0.73
T > 0.85: similar
A = 40, B = 30, A&B = 30
T = 30 / (40 + 30 - 30) = 0.75 (can be a substructure)
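
A minimal sketch of the Tanimoto coefficient on fingerprints represented as sets of on-bit positions. The function name and the bit positions are hypothetical; the bit counts are chosen to reproduce the two examples above.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.
    return len(a & b) / union if union else 0.0


# First example: 24 and 21 bits set, 19 in common.
fp_a = set(range(24))
fp_b = set(range(5, 24)) | {100, 101}
print(round(tanimoto(fp_a, fp_b), 2))  # 0.73

# Second example: every bit of B is also set in A, as for a substructure.
fp_c = set(range(40))
fp_d = set(range(30))
print(round(tanimoto(fp_c, fp_d), 2))  # 0.75
```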

https://doi.org/10.1021/ci100062n
Fingerprint comparison methods

Panels: CFP (linear), Tanimoto | CFP (linear), Euclidean
https://disco.chemaxon.com/madfast-demo

Panels: CFP (linear), Tanimoto | CFP (linear), Manhattan

Panels: CFP (linear), Tanimoto | ECFP (radial), Tanimoto

808 × 977 M ≈ 8 × 10^11 dissimilarity data points
https://chemaxon.com/poster/similarity-implicated-exploration-of-the-fragment-galaxy

https://disco.chemaxon.com/products/madfast/latest//doc/basic-search-workflow.html

https://doi.org/10.1038/nchem.1243

Model building
- Linear regression (PLS, LASSO)
- Decision tree (CART) and Random forest
- Support Vector Machine
- Neural Network (Deep, Convolutional Neural Network)

Decision tree (CART)
Depth
Node size

Random Forest
https://doi.org/10.1038/nmeth.4438


Support vector machine (SVM)
https://doi.org/10.1517/17460441.2014.866943

A problem that is not separable in 2D
https://doi.org/10.1517/17460441.2014.866943

Neural networks
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/, https://doi.org/10.1016/j.drudis.2019.07.006

https://doi.org/10.3389/fenvs.2015.00080

Hierarchical composition of complex features.

Feature Construction by Deep Learning.

Tox21 Data Challenge

Workflow overview
Training data (SDF with labelled data)
Training module
Build
- Descriptor generation
- Model building
- Validation
Model management
- Persistence
- Execution
New model
Icon by Aficons from Noun Project
https://disco.chemaxon.com/calculators/trainer-engine/

Precise automatic models: under the hood
- Feature engineering
  ChemAxon descriptors
  User defined descriptors
- Model building
  Type: regression, classification
  Models: RF, SVR, GB, GC
  Hyperparameter optimization: pre-optimized preset, optimizer
Icon by modgekar from Noun Project

Quality assessment
- Validation statistics and report
  Training/test set split
  Retrospective accuracy
- Reliability
  - Applicability domain: most similar structures
  - Prediction error: Conformal prediction
- Overfitting
  - Scramble Y
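
The "Scramble Y" check can be sketched as follows: the activity values are shuffled against the descriptors and the model statistic is recomputed. If the scrambled data score nearly as well as the real data, the model is fitting noise. The sketch uses a one-descriptor linear fit and synthetic data purely for illustration; all names and values are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-descriptor linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_scramble(x, y, n_rounds=200, seed=0):
    """R^2 values obtained after randomly permuting y against x."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(r_squared(x, shuffled))
    return scores


# Synthetic data: activity depends linearly on the descriptor, plus a small offset.
x = [float(i) for i in range(20)]
y = [0.5 * v + (0.3 if i % 2 else -0.3) for i, v in enumerate(x)]

real_score = r_squared(x, y)
scrambled_scores = y_scramble(x, y)
mean_scrambled = sum(scrambled_scores) / len(scrambled_scores)
print(f"real R2 = {real_score:.2f}, mean scrambled R2 = {mean_scrambled:.2f}")
```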

Application Study on ChEMBL
Dataset: Journal of Cheminformatics volume 9, Article number: 45 (2017)
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
- 163 ChEMBL targets
- Data points in range: 500-4703 per target
- 10-90% test-training set split
- Target pAct
- Pearson, RMSE

Test set results

         Pearson (R)  RMSE (pAct)
Average  0.804        0.671
Median   0.817        0.672
Count    163          163
STDev    0.071        0.100
Min      0.569        0.421
Max      0.936        1.030
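
For reference, the two statistics reported per target can be computed as in the sketch below. The function names and the example values are hypothetical and are not taken from the study.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (pAct here)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


measured = [5.2, 6.1, 7.4, 8.0, 6.6]   # hypothetical pAct values
predicted = [5.6, 6.0, 6.9, 7.5, 6.9]  # hypothetical model output
print(f"R = {pearson_r(measured, predicted):.3f}, RMSE = {rmse(measured, predicted):.3f}")
```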

https://doi.org/10.1038/s41573-019-0024-5

Specs:
- Using MadFast dev version
- Machine: a single Amazon EC2 x1.16xlarge instance (976 GiB RAM, 64 cores, 2 TB SSD, $6.7/h on-demand)
- Dataset: Enamine REAL 2019q34, 1.2B molecules
- Fingerprint: CFP7, 512 bit
Importing:
- Importing time was 6 h 16 min (ran concurrently with an ECFP import, using half of the cores)
- Result binary blob: 167 GiB
Server startup:
- 448 s (~7.5 min) to read 167 GiB into memory (~380 MiB/s throughput)
- Of which 169 s for mols, 74 s for IDs, 203 s for fingerprints
Fast similarity search

Dissim limit  Hit count limit  Runs  Avg search time
0.4           1                500   0.45 s
0.4           9                500   0.96 s
0.4           81               500   1.16 s
0.4           729              500   1.25 s
0.4           2187             500   1.26 s
0.4           6561             500   1.42 s
0.4           15000            500   1.89 s
1.0           1                50    0.61 s
1.0           9                50    0.98 s
1.0           81               50    1.29 s
1.0           729              50    1.39 s
1.0           2187             50    1.65 s
1.0           6561             50    3.22 s
1.0           15000            50    9.67 s

Substructure search
- Pre-screen
  Fingerprint match: all query bits present in target
  Descriptor screen (Mw, counts)
- Graph isomorphism check
  A query graph S matches a target graph G if S is isomorphic to a subgraph of G (Ullmann, VF2, VF2+)
Tutorials in Chemoinformatics, 395-448 John Wiley & Sons Ltd, Chichester, UK, 2017; https://doi.org/10.1186/1758-2946-4-13
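
The fingerprint pre-screen can be sketched with integer bitmasks: a target can only contain the query as a substructure if every bit set in the query fingerprint is also set in the target fingerprint. The function names, molecule labels and bit patterns below are hypothetical. The screen is necessary but not sufficient, so surviving candidates still need the graph isomorphism check.

```python
def passes_fingerprint_screen(query_fp: int, target_fp: int) -> bool:
    """True if all bits set in the query fingerprint are also set in the target."""
    return (query_fp & target_fp) == query_fp


def prescreen(query_fp, library):
    """Names of library entries that survive the screen and need atom-by-atom matching."""
    return [name for name, fp in library.items()
            if passes_fingerprint_screen(query_fp, fp)]


# Hypothetical 6-bit fingerprints.
library = {
    "mol-a": 0b101101,
    "mol-b": 0b001001,
    "mol-c": 0b111111,
}
query = 0b001101

candidates = prescreen(query, library)
print(candidates)  # ['mol-a', 'mol-c']
```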

JChem PostgreSQL cartridge test runs
● Data set: The Enamine library containing 1.2 billion structures was imported into the database cluster.
● Hardware:
○ A Citus cluster was set up in AWS to provide a distributed PostgreSQL database.
○ The cluster included one coordinator node and 20 worker nodes.
■ The coordinator node was installed on a t2.xlarge EC2 instance (4 cores, 16 GiB memory)
■ Worker nodes were installed on c5a.4xlarge instances (16 cores and 32 GiB memory per instance)
● Data upload and chemical indexing:
○ Upload of the data took ~12 h
○ Chemical index creation with JChem PostgreSQL Cartridge took 19.3 h
● Search types:
○ Full structure, substructure and similarity searches, as well as combined queries with one, two or three additional properties.
○ Queries were limited to return only the top 100 results.

Gather information
Array of services
- Data access
- Preprocessed data
  MMPA
- Derived models
  QSAR
  Docking, AI, Machine Learning

Dynamic
Design Hub
Gather information

Challenges
● Complexity of the human body
○ Single target to interact with
○ Multiple targets to avoid
● Complexity of the binding interactions
○ Influence of small structural changes
○ Balancing speed and accuracy
○ Need for structural information
● Pitfalls of machine learning
○ Validation strategy
○ Overfitting
● Size of chemical space
○ Searching in the (multi)billion chemical space
● Accessibility
○ Connecting all the models with designers and medicinal chemists

“Essentially, all models are wrong, but some are useful.”
Box, G. E. P., and Draper, N. R. (1987), Empirical Model-Building and Response Surfaces, John Wiley & Sons, New York, NY.

Akos Tarcsay
atarcsay@chemaxon.com
