Developing Chemoinformatics Models

AACIMP 2009 Summer School lecture by Igor Tetko. "Environmental Chemoinformatics" course.


1. Developing chemoinformatics models: one can’t embrace the unembraceable. Igor V. Tetko, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Bioinformatics & Systems Biology. Kyiv, 10 August 2009, Summer School.
2. Representation of molecules. A molecule can be defined with calculated properties (logP, quantum-chemical parameters, etc.) or with a set of structural descriptors (topological, 2D, 3D, etc.). The goal is to correlate descriptors with some properties. One of these sets of descriptors could be used to determine the applicability domain of a model. [Figure: two molecular structures with their descriptor vectors (12.3, 4.6, ..., 13.2, 10.1) and (13.7, 4.8, ..., 15.8, 12.0), labeled "Distance to model".]
3. “One cannot embrace the unembraceable.” (Kozma Prutkov)
   • Possible: $10^{60}$ - $10^{100}$ molecules theoretically exist (> $10^{80}$ atoms in the Universe)
   • Achievable: $10^{20}$ - $10^{24}$ can be synthesized now by companies (the weight of the Moon is ca. $10^{23}$ kg)
   • Available: $2 \times 10^{7}$ molecules are on the market
   • Measured: $10^{2}$ - $10^{4}$ molecules with data
   Problem: to predict these properties for just the molecules on the market, we must extrapolate data from one molecule to 1,000 - 100,000 molecules! There is a need for methods that can estimate the accuracy of predictions!
4. Models can fail because of chemical diversity between the training and test sets (i.e. predictions outside of the applicability domain). [Figure legend: training set data used to develop a model; our model given the training set; new data to be estimated; correct model.]
5. Declining R&D productivity in the pharmaceutical industry. [Figure: approved medicines.] Source: http://www.frost.com/prod/servlet/market-insight-top.pag?docid=128394740
6. Benchmarking data: logP prediction
   Public dataset: N=266 molecules (1)
   • N=223 Star set (supported with experimental values from the CLOGP v5.0 program)
   • N=43 Non-Star set (no experimental logP values in CLOGP v5.0)
   Industrial datasets:
   • N=95809 (Pfizer Inc.) (2)
   • N=882 (Nycomed GmbH) (3)
   (1) Provided by A. Avdeef, “Absorption and drug development. Solubility, permeability and charge state”, Hoboken, NJ: Wiley-Interscience, 2003.
   (2) logP and logD (pH 7.4) measurements.
   (3) logP measurements only.
   Mannhold, R. et al, J. Pharm. Sci., 2009, 98(3), 861-893.
7. Performance of algorithms for the public dataset (1)
   The columns <0.5, 0.5-1 and >1 give the % of molecules within each error range.

    Method          Star set (N = 223)          Non-Star set (N = 43)
                    RMSE  rank <0.5 0.5-1  >1   RMSE  rank <0.5 0.5-1  >1
    AB/LogP         0.41  I      84    12   4   1.00  I      42    23  35
    S+logP          0.45  I      76    22   3   0.87  I      40    35  26
    ACD/logP        0.50  I      75    17   7   1.00  I      44    33  23
    Consensus log P 0.50  I      74    18   8   0.80  I      47    28  26
    CLOGP           0.52  II     74    20   6   0.91  I      47    28  26
    VLOGP OPS       0.52  II     64    21   7   1.07  I      33    28  26
    ALOGPS          0.53  II     71    23   6   0.82  I      42    30  28
    MiLogP          0.57  II     69    22   9   0.86  I      49    30  21
    XLOGP           0.62  II     60    30  10   0.89  I      47    23  30
    KowWIN          0.64  II     68    21  11   1.05  I      40    30  30
    CSlogP          0.65  II     66    22  12   0.93  I      58    19  23
    ALOGP (Dragon)  0.69  II     60    25  16   0.92  I      28    40  33
    MolLogP         0.69  II     61    25  14   0.93  I      40    35  26
    ALOGP98         0.70  II     61    26  13   1.00  I      30    37  33
    OsirisP         0.71  II     59    26  16   0.94  I      42    26  33
    VLOGP           0.72  II     65    22  14   1.13  I      40    28  33
    TLOGP           0.74  II     67    16  13   1.12  I      30    37  30
    ABSOLV          0.75  II     53    30  17   1.02  I      49    28  23
    QikProp         0.77  II     53    30  17   1.24  II     40    26  35
    QuantlogP       0.80  II     47    30  22   1.17  II     35    26  40
    SLIPPER-2002    0.80  II     62    22  15   1.16  II     35    23  42
    COSMOFrag       0.84  II     48    26  19   1.23  II     26    40  33
    XLOGP2          0.87  II     57    22  20   1.16  II     35    23  42
    QLOGP           0.96  II     48    26  25   1.42  II     21    26  53
    VEGA            1.04  II     47    27  26   1.24  II     28    30  42
    CLIP            1.05  II     41    25  30   1.54  III    33     9  49
    LSER            1.07  II     44    26  30   1.26  II     35    16  49
    MLOGP (Sim+)    1.26  II     38    30  33   1.56  III    26    28  47
    NC+NHET         1.35  III    29    26  45   1.71  III    19    16  65
    SPARC           1.36  III    45    22  32   1.70  III    28    21  49
    MLOGP(Dragon)   1.52  III    39    26  35   2.45  III    23    30  47
    LSER UFZ        1.60  III    36    23  41   2.79  III    19    12  67
    AAM             1.62  III    22    24  53   2.10  III    19    28  53
    VLOGP-NOPS      1.76  III     1     1   7   1.39  III     7     0   7
    HINT            1.80  III    34    22  44   2.72  III    30     5  65
    GBLOGP          1.98  III    32    26  42   1.75  III    19    16  65

   AAM = average logP used as the predicted value for all molecules ($R^2 = 0$).
   Bootstrap test:
   • rank I: similar to the “best model”
   • rank II: better than AAM
   • rank III: similar to AAM
   (1) Provided by A. Avdeef, “Absorption and drug development. Solubility, permeability and charge state”, Hoboken, NJ: Wiley-Interscience, 2003.
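The three measures reported in this table can be computed for any set of predictions. The following minimal Python sketch calculates the RMSE, the percentage of molecules in each absolute-error range, and the AAM baseline described on the slide. The function names and the four logP values are invented for illustration. The mean is taken over the same set that is evaluated, which is an assumption because the slide does not say which set was averaged. The bootstrap ranking is not reproduced.

```python
import math

def rmse(predicted, experimental):
    """Root mean squared error between predicted and experimental values."""
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))

def error_ranges(predicted, experimental):
    """Percent of molecules with absolute error <0.5, 0.5-1 and >1."""
    errors = [abs(p - e) for p, e in zip(predicted, experimental)]
    n = len(errors)
    low = sum(1 for err in errors if err < 0.5)
    high = sum(1 for err in errors if err > 1.0)
    middle = n - low - high
    return tuple(100.0 * count / n for count in (low, middle, high))

def aam_predictions(experimental):
    """AAM baseline: the average value is predicted for every molecule.
    Assumption: the average is taken over the evaluated set itself."""
    mean = sum(experimental) / len(experimental)
    return [mean] * len(experimental)

# Made-up logP values, for illustration only.
exp_logp = [1.0, 2.0, 3.0, 4.0]
pred_logp = [1.2, 2.6, 2.9, 5.5]

print("model RMSE:", round(rmse(pred_logp, exp_logp), 2))
print("model % in error ranges:", error_ranges(pred_logp, exp_logp))
print("AAM RMSE:", round(rmse(aam_predictions(exp_logp), exp_logp), 2))
```

A method whose RMSE is not clearly below the AAM value adds little over predicting the mean, which is the comparison behind ranks II and III.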
8. Performance of algorithms for in-house datasets: benchmarking of logP methods for in-house data of Pfizer & Nycomed
   The columns <0.5, 0.5-1 and >1 give the % of molecules within each error range.

    Method          Pfizer set (N = 95 809)                       Nycomed set (N = 882)
                    RMSE  Failed(1) rank <0.5 0.5-1  >1  RMSE(2)  RMSE  rank <0.5 0.5-1  >1
    Consensus log P 0.95            I      48    29  24  0.94     0.58  I      61    32   7
    ALOGPS          1.02            I      41    30  29  1.01     0.68  I      51    34  15
    S+logP          1.02            I      44    29  27  1.00     0.69  I      58    27  15
    NC+NHET         1.04            II     38    30  32  1.04     0.88  III    42    32  26
    MLOGP(S+)       1.05            II     40    29  31  1.05     1.17  III    32    26  41
    XLOGP3          1.07            II     43    28  29  1.06     0.65  I      55    34  12
    MiLogP          1.10  27        II     41    28  30  1.09     0.67  I      60    26  14
    AB/LogP         1.12  24        II     39    29  33  1.11     0.88  III    45    28  27
    ALOGP           1.12            II     39    29  32  1.12     0.72  II     52    33  15
    ALOGP98         1.12            II     40    28  32  1.10     0.73  II     52    31  17
    OsirisP         1.13  6         II     39    28  33  1.12     0.85  II     43    33  24
    AAM             1.16            III    33    29  38  1.16     0.94  III    42    31  27
    CLOGP           1.23            III    37    28  35  1.21     1.01  III    46    28  22
    ACD/logP        1.28            III    35    27  38  1.28     0.87  III    46    34  21
    CSlogP          1.29  20        III    37    27  36  1.28     1.06  III    38    29  33
    COSMOFrag       1.30  1088(3)   III    32    27  40  1.30     1.06  III    29    31  40
    QikProp         1.32  103       III    31    26  43  1.32     1.17  III    27    24  49
    KowWIN          1.32  16        III    33    26  41  1.31     1.20  III    29    27  44
    QLogP           1.33  24        III    34    27  39  1.32     0.80  II     50    33  17
    XLOGP2          1.80            III    15    17  68  1.80     0.94  III    39    31  29
    MLOGP(Dragon)   2.03            III    34    24  42  2.03     0.90  III    45    30  25

   • A large number of methods could not perform better than the AAM model.
   • The best results are obtained with the Consensus logP model.
   • log P = 1.46 + 0.11 (NC - NHET): N=95 809, RMSE=1.04, $R^2 = 0.2$.
   • Different MLOGP implementations demonstrate very different performances for both sets.
   N.B.! Do we really compare methods or their implementations?
   (1) Number of molecules with calculation failures due to errors or crashes of programs. All methods predicted all molecules for the Nycomed dataset.
   (2) RMSE calculated after excluding 769 zwitterionic compounds from the Pfizer dataset.
   (3) Most molecules failed by COSMOFrag are zwitterions.
   Mannhold, R. et al, J. Pharm. Sci., 2009, 98(3), 861-893.
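The NC+NHET equation on this slide is simple enough to evaluate directly. The sketch below assumes that NC is the number of carbon atoms and NHET the number of heteroatoms, taken here as every atom other than C and H; the slide does not define either symbol. The formula parser, the function names and the example formula are hypothetical helpers for illustration only.

```python
import re

def count_atoms(formula):
    """Parse a simple molecular formula such as 'C10H12N2O' into element counts.
    Brackets, charges and isotopes are not handled."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def nc_nhet_logp(formula):
    """log P = 1.46 + 0.11 (NC - NHET), coefficients as given on the slide.
    Assumption: NHET counts every atom that is neither carbon nor hydrogen."""
    counts = count_atoms(formula)
    nc = counts.get("C", 0)
    nhet = sum(n for element, n in counts.items() if element not in ("C", "H"))
    return 1.46 + 0.11 * (nc - nhet)

# Hypothetical example formula: NC = 10, NHET = 3.
print(round(nc_nhet_logp("C10H12N2O"), 2))
```

That a two-parameter count model reaches rank II on the Pfizer set is what makes it a useful sanity baseline next to AAM.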
9. Optimization of a drug: we need to find a road in multidimensional space
10. ...and to arrive at a good result as soon as possible!
11. We need to optimize our road: we can’t see the mountains!
12. Examples of distances to models: Euclidean, Mahalanobis, city-block, probability-density. Jaworska et al, ATLA, 33, 445-459, 2005. Glende et al, Mutation Res., 2001.
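Three of the distances named on this slide can be written in a few lines. The sketch below measures the distance from a query point to the centroid of a training set, which is one common choice and an assumption here, since the slide does not say what the distance is measured to. The Mahalanobis version is restricted to two descriptors so that the covariance matrix can be inverted by hand. All names and data points are invented, and the probability-density distance is not covered.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def mahalanobis_2d(query, points):
    """Mahalanobis distance from query to the centroid of 2-D points.
    Uses the sample covariance; fails if the points are collinear (det = 0)."""
    n = len(points)
    mx, my = centroid(points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = query[0] - mx, query[1] - my
    # Quadratic form with the inverse of [[sxx, sxy], [sxy, syy]].
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Made-up training set with two strongly correlated descriptors.
training = [(0.0, 0.0), (1.0, 1.2), (2.0, 1.8), (3.0, 3.0)]
center = centroid(training)
for label, query in (("along the trend", (2.5, 2.5)), ("across the trend", (2.5, 0.5))):
    print(label,
          round(euclidean(query, center), 2),
          round(city_block(query, center), 2),
          round(mahalanobis_2d(query, training), 2))
```

The two queries are equally far from the centroid in the Euclidean and city-block sense, but the Mahalanobis distance separates them because it accounts for the correlation between the descriptors.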
13. Distances to a model in descriptor space. Jaworska et al, ATLA, 33, 445-459, 2005.
14. Nearest neighbors in the input space. [Figure: points in the (x1, x2) input space.]
15. Nearest neighbors and activity. The nearest neighbors in descriptor space are not the neighbors in property! [Figure: top, activity $y = \exp(-(x_1 + x_2)^2)$ plotted against $x = x_1 + x_2$; bottom, the same points in the (x1, x2) descriptor space.]
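The point of this slide can be checked numerically with the activity function shown in the figure, y = exp(-(x1+x2)^2). In the minimal sketch below the query and the two candidate points are invented for illustration; only the activity function comes from the slide.

```python
import math

def activity(point):
    """Activity from the slide: y = exp(-(x1 + x2)^2)."""
    x1, x2 = point
    return math.exp(-(x1 + x2) ** 2)

query = (1.0, -1.0)                      # x1 + x2 = 0, so y = 1
candidates = {
    "A": (1.5, -0.5),                    # close in descriptor space, x1 + x2 = 1
    "B": (-2.0, 2.0),                    # far in descriptor space, x1 + x2 = 0
}

nearest_in_space = min(
    candidates, key=lambda name: math.dist(query, candidates[name]))
nearest_in_property = min(
    candidates, key=lambda name: abs(activity(query) - activity(candidates[name])))

print("nearest in descriptor space:", nearest_in_space)
print("nearest in property:", nearest_in_property)
```

Point A is the Euclidean nearest neighbor of the query, yet point B, which lies much farther away, has exactly the same activity.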
16. Ensemble methods
   • Hansen, L.K.; Salamon, P. IEEE Trans. Pattern Anal. Mach. Intell., 1990, 12, 993.
   • Tetko, I. V.; Luik, A. I.; Poda, G. I. J. Med. Chem., 1993, 36, 811.
   • Tetko, I.V.; Livingstone, D. J.; Luik, A. I. Neural Network Studies. 1. Comparison of Overfitting and Overtraining. J. Chem. Inf. Comput. Sci., 1995, 35(5), 826.
17. Which of them would be elected in 1904? 2000: G.W. Bush (Republican) vs A. Gore (Democrat).
18. Election results in 1904: T. Roosevelt (Republican) vs A. Parker (Democrat).
19. Presidential elections in the USA by state. The distribution of votes across states creates a typical fingerprint, which is invisible when we use just average values. Democrats of 2000 == Republicans of 1904.
20. Ensemble-based distance. CORREL: correlation between model vectors for any two molecules. [Figure: Morphinan-3-ol, 17-methyl- (logP=3.11) and Levallorphan (logP=3.48), shown with the vectors (12.3, 4.6, ..., 13.2, 10.1) and (13.7, 4.8, ..., 15.8, 12.0) respectively, and each with a vector of ensemble outputs (net 1, net 2, ..., net 63, net 64).]
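A correlation-based similarity between two molecules can be sketched by treating the predictions of the individual ensemble members as a vector for each molecule. The code below uses the squared Pearson correlation between two such vectors and takes the maximum over the training molecules, following the definition CORREL(a) = max_j R2(Ya, Yj) given later in the deck. The function names and the four-member vectors are invented for illustration, and any preprocessing applied in the original software is not known from the slides.

```python
import math

def pearson_r(u, v):
    """Pearson correlation of two equal-length vectors.
    Raises ZeroDivisionError if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def correl(query_vector, training_vectors):
    """Highest squared correlation between the query's ensemble predictions
    and those of any training molecule."""
    return max(pearson_r(query_vector, t) ** 2 for t in training_vectors)

# Made-up predictions of a four-member ensemble for three molecules.
query = [1.0, 2.0, 3.0, 4.0]
training = [
    [1.0, 3.0, 2.0, 4.0],   # partly similar response pattern
    [2.0, 4.0, 6.0, 8.0],   # same pattern, different scale
]
print(round(correl(query, training), 2))
```

A value close to 1 means that some training molecule makes the ensemble members respond in the same pattern as the query, whatever the two molecules look like in the original descriptor space.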
21. Nearest neighbors and activity. The CORREL measure correctly detects the nearest neighbors! [Figure: panels A and B, y plotted against x = x1+x2; panel C, x2 plotted against x1.]
22. Local models: instant learning of logP for Pt(II) molecules. [Figure: structures of Pt(II) complexes.] Prediction of new classes of compounds can be extremely difficult, as exemplified by the absence of correlation between predicted and experimental values obtained with our ALOGPS program. Tetko et al, J. Inorg. Biochem, 2008, 102, 1424-37.
23. Local models: instant learning by knowledge transfer. The use of LIBRARY mode (local correction of the global model) decreased logP errors dramatically (5 times!). Tetko et al, J. Inorg. Biochem, 2008, 102, 1424-37.
24. Performance of algorithms for in-house datasets: the benchmark table for the Pfizer set (N = 95 809) and the Nycomed set (N = 882) from slide 8 is shown again. Mannhold, R. et al, J. Pharm. Sci., 2009, 98(3), 861-893.
25. Local models: instant learning of in-house data (Pfizer Inc.), N=95809. ALOGPS blind prediction: RMSE=1.02. ALOGPS LIBRARY: RMSE=0.59, in less than 30 minutes of calculations on a notebook! Tetko, Poda, Ostermann, Mannhold, QSAR Comb. Sci, 2009, in press.
26. REACH and QSAR (Quantitative Structure Activity Relationship) models
   > 140,000 chemicals to be registered … is a lot! It is expensive to measure all of them (\$200,000 per compound), and it requires a lot of animal testing.
   QSAR models can be used to prioritize compounds:
   • Compound is predicted to be toxic: biological testing will be done to prove/disprove the models.
   • Compound is predicted to be not toxic: tests can be avoided, saving money and animals, but ... only if we are confident in the predictions.
   • FP7 project CADASTER (http://www.cadaster.eu) aims to develop a strategy for the use of in silico methods in REACH.
27. Estimation of toxicity against T. pyriformis
   • Initial dataset (1,2): n=983 molecules (n=644 training set, n=339 test set 1)
   • Test set 2: n=110 molecules (1,2)
   The overall goal is to predict toxicity against T. pyriformis for chemicals directly from their structure and to assess the reliability of these predictions.
   (1) Zhu et al, J. Chem. Inf. Comput. Sci, 2008, 48(4), 766-784.
   (2) Schultz et al, QSAR Comb Sci, 2007, 26(2), 238-254.
28. Overview of analyzed distances to models (DMs)
   • EUCLID: $EU_m = \frac{1}{k} \sum_{j=1}^{k} d_j$, where $k$ is the number of nearest neighbors and $m$ is the index of the model; $EUCLID = \overline{EU_m}$.
   • TANIMOTO: $Tanimoto(a,b) = \frac{\sum_i x_{a,i} x_{b,i}}{\sum_i x_{a,i} x_{a,i} + \sum_i x_{b,i} x_{b,i} - \sum_i x_{a,i} x_{b,i}}$, where $x_{a,i}$ and $x_{b,i}$ are fragment counts.
   • LEVERAGE: $LEVERAGE = x^T (X^T X)^{-1} x$.
   • PLSEU (DModX): error in approximation (restoration) of the vector of input variables from the latent variables and PLS weights.
   • STD: $STD = \sqrt{\frac{1}{N-1} \sum_i (y_i - \bar{y})^2}$, where $y_i$ is the value calculated with model $i$ and $\bar{y}$ is the average value.
   • CORREL: $CORREL(a) = \max_j CORREL(a,j) = R^2(Y_a^{calc}, Y_j^{calc})$, where $Y_a = (y_1, \ldots, y_N)$ is the vector of predictions for molecule $a$.
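Three of the distances to models on this slide translate directly into code. The sketch below covers the Tanimoto coefficient on fragment counts, the standard deviation of ensemble predictions (STD), and the mean distance to the k nearest training neighbors for a single model (EU_m). It assumes that d_j is the Euclidean distance to the j-th nearest training molecule, which the slide does not state. LEVERAGE and PLSEU are left out because they need matrix algebra beyond a short example. All names and numbers are invented.

```python
import math

def tanimoto(xa, xb):
    """Tanimoto coefficient for two vectors of fragment counts."""
    ab = sum(a * b for a, b in zip(xa, xb))
    aa = sum(a * a for a in xa)
    bb = sum(b * b for b in xb)
    return ab / (aa + bb - ab)

def ensemble_std(predictions):
    """STD: sample standard deviation of the predictions of N models."""
    n = len(predictions)
    mean = sum(predictions) / n
    return math.sqrt(sum((y - mean) ** 2 for y in predictions) / (n - 1))

def mean_knn_distance(query, training, k):
    """EU_m: average distance from the query to its k nearest training points.
    Assumption: Euclidean distance in the descriptor space of one model."""
    distances = sorted(math.dist(query, point) for point in training)
    return sum(distances[:k]) / k

# Made-up inputs for illustration.
print(round(tanimoto([1, 2, 0], [1, 1, 1]), 2))
print(round(ensemble_std([2.9, 3.1, 3.0, 3.4]), 2))
print(round(mean_knn_distance((0.0, 0.0), [(3.0, 4.0), (0.0, 1.0), (6.0, 8.0)], 2), 2))
```

Note the direction of each measure: a larger STD or EU_m means the molecule is farther from the model, whereas a larger Tanimoto value means it is closer.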
29. Mixture of Gaussian Distributions (MGD). The idea is to find an MGD that maximizes the likelihood (probability) $\prod_i N(0, \sigma^2(e_i))$ of the observed distribution of errors.
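The likelihood idea can be illustrated with a deliberately simplified stand-in: molecules are sorted by their distance to model, split into equal-sized groups, and each group receives its own zero-mean Gaussian whose sigma is the maximum-likelihood estimate. This is not the fitting procedure used in the lecture, which the slides do not describe. The function names, the binning scheme and the data are all assumptions made for illustration.

```python
import math

def mle_sigma(errors):
    """Maximum-likelihood sigma of a zero-mean Gaussian for these errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def gaussian_loglik(errors, sigma):
    """Log-likelihood of the errors under N(0, sigma^2)."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma ** 2) - e * e / (2.0 * sigma ** 2)
               for e in errors)

def binned_loglik(dm, errors, n_bins):
    """Total log-likelihood when the molecules, sorted by distance to model,
    are split into n_bins groups with one Gaussian per group.
    Assumes every group is non-empty and has a non-zero sigma."""
    order = sorted(range(len(dm)), key=lambda i: dm[i])
    total = 0.0
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        group = [errors[i] for i in order[start:stop]]
        total += gaussian_loglik(group, mle_sigma(group))
    return total

# Made-up data: errors grow with the distance to model.
dm = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
errors = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]

print("one Gaussian: ", round(binned_loglik(dm, errors, 1), 2))
print("two Gaussians:", round(binned_loglik(dm, errors, 2), 2))
```

Adding Gaussians tends to raise the likelihood on the same data, so a significance check is needed before a mixture is accepted; the slides do not say which criterion was used.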
30. MGDs for the simulated datasets. A) No significant MGD was found. B) An MGD composed of 3 Gaussian distributions was found.
31. Distance does not work: Neural Network model
32. Distance works: Neural Network model
33. Estimations based on training set errors. Tetko et al, J Chem Inf Model, 2008, 48(9):1733-46.
34. Estimations of errors using MGD and 5-fold cross-validation. Tetko et al, J Chem Inf Model, 2008, 48(9):1733-46.
35. Prediction accuracy for training and two external sets. Estimated experimental accuracy is about SE = 0.38. HPV (High Production Volume): 3182; EINECS (REACH): 48774. Challenge to predict toxicity organized in collaboration with the European Neural Network Society at http://www.cadaster.eu. Tetko et al, J Chem Inf Model, 2008, 48(9):1733-46.
36. Challenges and solutions. [Figure labels: new measurement; reliable predictions; new series to predict.]
   Our methodology allows confident navigation in a defined molecular space.
   • It can be used to develop targeted (local) models covering specific series.
   • It can be used to estimate which compounds can or can’t be reliably predicted.
   • It can be used to guide experimental design and to minimize the cost of new measurements.
37. Acknowledgements
   • My group: Mr I. Sushko, Mr S. Novotarskyi, Mr A.K. Pandey, Mr R. Körner, Mr S. Brandmaier, Dr M. Rupp
   • Collaborators: Dr. C. Höfer (DMPKore), Dr. G. Poda (Pfizer), Dr. C. Ostermann (Nycomed), Prof. R. Mannhold (Düsseldorf University), Prof. A. Tropsha (NC, USA), Prof. T. Oprea (New Mexico, USA) + many other colleagues & co-authors
   • Visiting Scientists: Dr. V. Kovalishyn, Dr. V. Prokopenko, Prof. J. Emmersen
   • Funding: FP7 MC ITN ECO; GO-Bio BMBF http://qspr.eu; FP7 CADASTER http://www.cadaster.eu; Germany-Ukraine grant UKR 08/006; FP6 INTAS VCCLAB http://www.vcclab.org; DFG TE 380/1-1