Discovering a novel drug is an optimization challenge across an array of chemical and biological attributes, with the goal of reaching the desired efficacy and safety profile. The immense complexity of the human body, combined with the astronomically large druggable chemical space, hinders the selection of molecules with such a balanced profile. The medicinal chemistry toolbox therefore embraces all computational techniques with predictive power that can focus the chemical space on the most promising candidates for synthesis and testing. These techniques include data analysis tools, physics-based simulations, and approaches driven by the biological target structure or based on the ligand structure [1-3], while the size of the compound collections to process varies from a couple of close analogues up to billions of virtual compounds [4]. This presentation will highlight general concepts and techniques applied in computer-aided drug design, focusing on data- and ligand-based computational chemistry approaches, and showcase solutions developed by ChemAxon.
[1] Gisbert Schneider, David E Clark, Angew Chem Int Ed Engl, 2019, 58(32):10792-10803.
[2] John G Cumming, Andrew M Davis, Sorel Muresan, Markus Haeberlein, Hongming Chen, Nat Rev Drug Discov, 2013, 12(12):948-62.
[3] Yu-Chen Lo, Stefano E Rensi, Wen Torng, Russ B Altman, Drug Discov Today, 2018, 23(8):1538-1546.
[4] Torsten Hoffmann, Marcus Gastreich, Drug Discov Today, 2019, 24(5):1148-1156.
9. Target, the William Tell’s challenge
- ~20k non-modified (canonical) human proteins
- 1 target to engage with
https://doi.org/10.1038/nrd892
10. Rubik’s cube for medicinal chemists
- Medicinal chemistry optimization is a multi-dimensional problem
- Each chemical modification corresponds to a series of biological activity changes
- Rubik’s cube has 43 252 003 274 489 856 000 (~4.3 × 10^19) different configurations
- All configurations can be solved in ~20 steps
- Druggable chemical space is ~10^60
https://doi.org/10.1016/j.drudis.2011.05.005
18. Scale
Ki or IC50 or EC50, M [mol/dm3]   Concentration   ΔG [kJ/mol]   ΔG [kcal/mol]   Affinity
0.1                               100 mM          -5.7          -1.4            Weak
0.01                              10 mM           -11.4         -2.7
0.001                             1 mM            -17.1         -4.1
0.0001                            100 uM          -22.8         -5.5
0.00001                           10 uM           -28.5         -6.8            Medium
1.00E-06                          1 uM            -34.2         -8.2
1.00E-07                          100 nM          -39.9         -9.5            Strong
1.00E-08                          10 nM           -45.6         -10.9
1.00E-09                          1 nM            -51.3         -12.3
1.00E-10                          100 pM          -57.0         -13.6           Very strong
1.00E-11                          10 pM           -62.8         -15.0
1.00E-12                          1 pM            -68.5         -16.4
19. Scale
https://doi.org/10.1021/jm100112j
20. Ki of 1 nM.
Replacing the isopropyl group (marked in red) by hydrogen weakens the affinity to a Ki of 39 μM.
https://doi.org/10.1021/jm100112j
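The scale table and the isopropyl example are linked by the relation ΔG = RT ln K. The following is a minimal sketch of that conversion, assuming T = 298 K (the slides do not state a temperature, but this value approximately reproduces the tabulated numbers) and a 1 mol/dm3 standard state. The function name is illustrative and not taken from the presentation.

```python
import math

R_KJ = 8.314462618e-3   # gas constant in kJ/(mol*K)
KJ_PER_KCAL = 4.184     # thermochemical calorie

def binding_free_energy(k_molar, temperature=298.0):
    """Return (dG in kJ/mol, dG in kcal/mol) for a Ki/IC50/EC50 given in mol/dm3.

    Assumption: the constant is treated as a dissociation constant
    relative to a 1 mol/dm3 standard state.
    """
    dg_kj = R_KJ * temperature * math.log(k_molar)
    return dg_kj, dg_kj / KJ_PER_KCAL

dg_parent, _ = binding_free_energy(1e-9)     # Ki = 1 nM
dg_des_ipr, _ = binding_free_energy(39e-6)   # Ki = 39 uM without the isopropyl group
ddg_kj = dg_des_ipr - dg_parent              # free energy lost by the modification
print(f"{dg_parent:.1f} kJ/mol -> {dg_des_ipr:.1f} kJ/mol, difference {ddg_kj:.1f} kJ/mol")
```

Under these assumptions the single isopropyl-to-hydrogen change costs roughly 26 kJ/mol (about 6 kcal/mol), which moves the compound from the strong end of the scale to the boundary between weak and medium.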
34. “The fundamental laws necessary for the mathematical treatment of a large part of physics and the whole of chemistry are thus completely known, and the difficulty lies only in the fact that application of these laws leads to equations that are too complex to be solved.”
(Paul Dirac, 1929)
47. Alchemical transformation
FEP+
- Hamiltonian replica exchange method
- region surrounding the protein binding pocket is “heated up”
- the rest of the system stays “cold”
- GPU calculation involving ~6000 atoms requires ~6h (4/day)
https://doi.org/10.1007/978-1-4939-9608-7
49. Constraints
1. Availability of at least one high-quality crystal structure with a co-crystallized series ligand.
2. A reasonable expectation of a conserved binding mode across the series.
3. Minimal tautomeric, ionization state, and stereochemistry uncertainties across the series.
4. High-reliability experimental binding data from the same assay for all compounds.
5. Assay data and crystal structures are for the same protein construct.
https://doi.org/10.1007/978-1-4939-9608-7
51. In God we trust, all others bring data.
William Edwards Deming
Trevor Hastie, Robert Tibshirani, Jerome Friedman
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
65. Bootstrap methods
Z = (z_1, z_2, ..., z_N), where z_i = (x_i, y_i)
Datasets of size N are drawn at random from Z with replacement; this is repeated B times, producing B bootstrap datasets
S(Z) is any quantity computed from the data Z
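The resampling scheme on this slide can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library; the data points, the choice of S(Z) as the mean of y, and the function name are hypothetical and serve only to show the mechanics.

```python
import random
import statistics

def bootstrap(data, statistic, n_boot=1000, seed=0):
    """Evaluate statistic on n_boot resamples of data drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # One bootstrap dataset: N draws with replacement from Z
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return replicates

def mean_y(sample):
    """S(Z) for this example: the mean of the y values."""
    return statistics.fmean(y for _, y in sample)

# Hypothetical training data Z with z_i = (x_i, y_i)
z = [(1.0, 5.1), (2.0, 5.9), (3.0, 7.2), (4.0, 7.8), (5.0, 9.1), (6.0, 9.8)]

reps = bootstrap(z, mean_y, n_boot=500)
std_error = statistics.stdev(reps)   # spread of S(Z*) over the B replicates
print(f"S(Z) = {mean_y(z):.2f}, bootstrap standard error ~ {std_error:.2f}")
```

The spread of the B replicate values gives an estimate of the uncertainty of S(Z) without any distributional assumption beyond the sample itself.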
71. Bayes theorem
P(A|B) = P(B|A) × P(A) / P(B)
Test with 99% sensitivity and 99% specificity; 0.5% positive cases
P(TP|+) = 0.99 × 0.005 / [0.99 × 0.005 + 0.01 × 0.995] = 33.2%
Even if the test is positive, there is only a 33.2% chance that it is a true positive.
Example with 1000 cases: 995 negative, 5 positive
995 × 0.01 ≈ 10 false positives
5 × 0.99 ≈ 5 true positives
15 positive results in total, of which 5 are true positives (33%)
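The calculation on this slide can be reproduced directly from Bayes' theorem. The sketch below uses the slide's numbers; the function name is illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(true positive | positive test) from Bayes' theorem."""
    true_pos = sensitivity * prevalence                     # P(+|A) * P(A)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)    # P(+|not A) * P(not A)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.005)
print(f"{ppv:.1%}")   # 33.2%
```

The low prevalence dominates the result: raising the share of positive cases to 5% with the same test would lift the value to about 84%.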
87. Model building
- Linear regression (PLS, LASSO)
- Decision tree (CART) and Random forest
- Support Vector Machine
- Neural Network (Deep, Convolutional Neural Network)
106. Workflow overview
Training data (SDF, with labelled data)
Training module
Build
- Descriptor generation
- Model building
- Validation
Model management
- Persistence
- Execution
New model
https://disco.chemaxon.com/calculators/trainer-engine/
107. Precise automatic models: under the hood
- Feature engineering
  ChemAxon descriptors
  User defined descriptors
- Model building
  Type: regression, classification
  Models: RF, SVR, GB, GC
  Hyperparameter optimization: pre-optimized preset, optimizer
108. Quality assessment
- Validation statistics and report
  Training test set split
  Retrospective accuracy
- Reliability
  - Applicability domain: most similar structures
  - Prediction error: Conformal prediction
- Overfitting
  - Scramble Y
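The "Scramble Y" check listed above can be illustrated as follows: the model is refitted after randomly permuting the response values, and a model that has learned a real relationship should score far worse on the scrambled data. This is a minimal sketch under strong simplifications: a one-descriptor least-squares fit stands in for the real model, the data are synthetic, and all names are hypothetical. It does not describe how the ChemAxon tool implements the test.

```python
import random
import statistics

def r_squared(x, y):
    """Training R2 of a one-descriptor least-squares fit (squared Pearson r)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def y_scramble(x, y, n_rounds=200, seed=0):
    """Refit on randomly permuted y values and collect the resulting R2 scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)   # breaks the x-y relationship
        scores.append(r_squared(x, shuffled))
    return scores

# Synthetic toy data: y depends linearly on x plus Gaussian noise
noise = random.Random(42)
x = [float(i) for i in range(20)]
y = [0.5 * xi + noise.gauss(0.0, 1.0) for xi in x]

real_r2 = r_squared(x, y)
scrambled_r2 = statistics.fmean(y_scramble(x, y))
print(f"R2 on real y: {real_r2:.2f}, mean R2 on scrambled y: {scrambled_r2:.2f}")
```

If the scrambled scores come close to the real one, the apparent fit is likely an artefact of model flexibility rather than of signal in the data.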
109. Application Study on ChEMBL
Dataset: Journal of Cheminformatics volume 9, Article number: 45 (2017)
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
- 163 ChEMBL targets
- Data points in range: 500-4703 per target
- 10-90% test-training set split
- Target pAct
- Pearson, RMSE
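The two statistics named in the last bullet can be computed as shown below. This is a small standard-library sketch; the observed and predicted pAct values are made up for illustration and are not taken from the ChEMBL study.

```python
import math

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def rmse(observed, predicted):
    """Root mean squared error, in the same units as the target (here pAct)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical test-set values
observed = [5.2, 6.1, 7.4, 8.0, 6.6]
predicted = [5.5, 6.0, 7.0, 7.7, 6.9]

r = pearson_r(observed, predicted)
err = rmse(observed, predicted)
print(f"Pearson r = {r:.2f}, RMSE = {err:.2f}")
```

The two numbers are complementary: Pearson r measures how well the ranking and trend are captured, while RMSE reports the typical size of the error in pAct units.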
118. Specs:
- Using MadFast dev version
- machine: a single Amazon EC2 x1.16xlarge instance
(976 GiB RAM, 64 cores, 2T SSD, $6.7 / h on-demand)
- dataset: Enamine Real 2019q34, 1.2B molecules,
- fingerprint: CFP7, 512 bit
Importing:
- importing time was 6h 16m (ran concurrently with an
ECFP import, using half of the cores)
- result binary blob: 167 GiB
Server startup
- 448 s (~7.5 min) to read 167 GiB to memory (~380 MB/s
throughput)
- Of which 169 s for mols, 74 s for ids, 203 s for fingerprints
Fast similarity search
Dissim limit   Hit count limit   Runs   Avg search time
0.4            1                 500    0.45 s
0.4            9                 500    0.96 s
0.4            81                500    1.16 s
0.4            729               500    1.25 s
0.4            2187              500    1.26 s
0.4            6561              500    1.42 s
0.4            15000             500    1.89 s
1.0            1                 50     0.61 s
1.0            9                 50     0.98 s
1.0            81                50     1.29 s
1.0            729               50     1.39 s
1.0            2187              50     1.65 s
1.0            6561              50     3.22 s
1.0            15000             50     9.67 s
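The two search parameters in the table, the dissimilarity limit and the hit count limit, can be illustrated with a naive linear scan over binary fingerprints. This is only a conceptual sketch: it assumes Tanimoto dissimilarity, stores fingerprints as Python integers, and uses made-up 8-bit patterns and identifiers. It says nothing about how MadFast is implemented or how it reaches the timings above.

```python
import heapq

def tanimoto_dissimilarity(a, b):
    """1 - |a AND b| / |a OR b| for fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0   # two empty fingerprints are treated as identical
    return 1.0 - bin(a & b).count("1") / union

def most_similar(query, library, dissim_limit, hit_limit):
    """Return up to hit_limit (dissimilarity, id) pairs within dissim_limit, closest first."""
    scored = ((tanimoto_dissimilarity(query, fp), mol_id) for mol_id, fp in library.items())
    within = (hit for hit in scored if hit[0] <= dissim_limit)
    return heapq.nsmallest(hit_limit, within)

# Hypothetical 8-bit fingerprints; real ones in the test were 512 bits wide
library = {
    "mol-a": 0b10110110,
    "mol-b": 0b10110010,
    "mol-c": 0b01001001,
    "mol-d": 0b11110110,
}
query = 0b10110110

result = most_similar(query, library, dissim_limit=0.4, hit_limit=3)
for dissim, mol_id in result:
    print(f"{mol_id}: {dissim:.2f}")
```

A tighter dissimilarity limit lets a search engine discard most of the library early, which is consistent with the shorter times in the 0.4 rows of the table compared with the 1.0 rows.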
119. Substructure search
- Pre-screen
  Fingerprint match: all query bits must be present in the target
  Descriptor screen (Mw, counts)
- Graph isomorphism check
  A query graph S is a substructure of a target graph G if S is isomorphic to a subgraph of G (Ullmann, VF2, VF2+)
Tutorials in Chemoinformatics, 395-448 John Wiley & Sons Ltd, Chichester, UK, 2017; https://doi.org/10.1186/1758-2946-4-13
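The two stages on this slide can be sketched as follows. The pre-screen is a bitwise test on fingerprints; the graph check below is a plain backtracking search over atom labels and bonds, which is a simple stand-in and not an implementation of Ullmann, VF2, or VF2+. Molecules are represented by hypothetical atom lists and adjacency dictionaries, bond orders are ignored, and the fingerprint bit patterns are made up.

```python
def passes_fingerprint_screen(query_fp, target_fp):
    """True when every bit set in the query fingerprint is also set in the target."""
    return (query_fp & target_fp) == query_fp

def is_substructure(q_atoms, q_adj, t_atoms, t_adj):
    """Backtracking test: can the query graph be mapped onto part of the target graph?"""
    mapping = {}    # query atom index -> target atom index
    used = set()    # target atoms already assigned

    def extend(qi):
        if qi == len(q_atoms):
            return True   # every query atom has been placed
        for ti in range(len(t_atoms)):
            if ti in used or t_atoms[ti] != q_atoms[qi]:
                continue
            # Each bond to an already placed query atom must exist in the target
            if all(mapping[qn] in t_adj[ti] for qn in q_adj[qi] if qn in mapping):
                mapping[qi] = ti
                used.add(ti)
                if extend(qi + 1):
                    return True
                del mapping[qi]
                used.discard(ti)
        return False

    return extend(0)

# Target: ethanol heavy atoms C-C-O; queries: C-O and C-N
ethanol_atoms = ["C", "C", "O"]
ethanol_adj = {0: {1}, 1: {0, 2}, 2: {1}}
co_atoms, co_adj = ["C", "O"], {0: {1}, 1: {0}}
cn_atoms, cn_adj = ["C", "N"], {0: {1}, 1: {0}}

print(passes_fingerprint_screen(0b0101, 0b1101))   # True: candidate goes to the graph check
print(is_substructure(co_atoms, co_adj, ethanol_atoms, ethanol_adj))   # True
print(is_substructure(cn_atoms, cn_adj, ethanol_atoms, ethanol_adj))   # False
```

The screen can only rule targets out: a target that passes may still fail the graph check, so the expensive matching step is run only on the candidates that survive the cheap bitwise test.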
120. JChem PostgreSQL cartridge test runs
● Data set: The Enamine library containing 1.2 billion structures was imported into the database cluster.
● Hardware:
○ A Citus cluster was set up in AWS to provide a distributed PostgreSQL database.
○ The cluster included one coordinator node and 20 worker nodes.
■ The coordinator node was installed on a t2.xlarge type EC2 instance (4 cores, 16 GiB memory)
■ Worker nodes were installed on c5a.4xlarge type instances (16 cores and 32 GiB memory per instance)
● Data upload and chemical indexing:
○ Upload of the data took ~12h;
○ Chemical index creation with JChem PostgreSQL Cartridge: 19.3h
● Search types:
○ Full structure, substructure and similarity searches, as well as different combined queries, were used with one, two or three additional properties.
○ Queries were limited to return only the top 100 results.
128. Challenges
● Complexity of the human body
○ Single target to interact with
○ Multiple targets to avoid
● Complexity of the binding interactions
○ Influence of small structural changes
○ Balancing speed and accuracy
○ Need for structural information
● Pitfalls of machine learning
○ Validation strategy
○ Overfitting
● Size of chemical space
○ Searching in the (multi)billion chemical space
● Accessibility
○ Connecting all the models with designers and medicinal chemists
129. “Essentially, all models are wrong, but some are useful.”
Box, G. E. P., and Draper, N. R. (1987), Empirical Model-Building and Response Surfaces, John Wiley & Sons, New York, NY.