Predicting Activity Cliffs - Can Machine Learning Handle Special Cases?


Published in: Technology, Education
  • Outliers in a cliff prediction model are not as severe since SALI changes more slowly than just activity differences
  • For SALI = 0, had to set log10(SALI) = 0. Performance is similar if we use SALI rather than log10(SALI), and at least more % variance is explained. Still fail on the most significant cliffs

    1. Predicting Activity Cliffs - Can We Use Machine Learning for Special Cases?
       Rajarshi Guha
       NIH Center for Translational Therapeutics
       August 4, 2011
       Joint Statistical Meeting, Miami Beach
    2. Outline
       • Structure-activity landscapes
       • Characterization
       • Prediction
    3. Structure Activity Relationships
       • Similar molecules will have similar activities
       • Small changes in structure will lead to small changes in activity
       • One implication is that SAR’s are additive
       • This is the basis for QSAR modeling
       Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358
    4. Exceptions Are Easy to Find
       • Ki = 39.0 nM
       • Ki = 1.8 nM
       • Ki = 10.0 nM
       • Ki = 1.0 nM
       Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176
    5. Structure Activity Landscapes
       • Rugged gorges or rolling hills?
       • Small structural changes associated with large activity changes represent steep slopes in the landscape
       • But traditionally, QSAR assumes gentle slopes
       • Machine learning is not very good for special cases
       Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
    6. Characterizing the Landscape
       • A cliff can be numerically characterized
       • Structure Activity Landscape Index (SALI)
       • Cliffs are characterized by elements of the matrix with very large values
       Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
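The slide's formula did not survive extraction. In the cited Guha and Van Drie paper, SALI for a pair of molecules is the absolute activity difference divided by one minus their structural similarity: SALI(i,j) = |A_i - A_j| / (1 - sim(i,j)). The following is a minimal sketch of the pairwise matrix, assuming fingerprints are given as sets of on-bit indices and similarity is Tanimoto. The function names, the toy data, and the handling of identical fingerprints (infinite SALI) are illustrative assumptions, not the author's code.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def sali(act_i, act_j, sim):
    """SALI = |A_i - A_j| / (1 - sim).

    Assumption: identical fingerprints (sim == 1) with different activities
    are reported as infinity; a real analysis may cap or replace these.
    """
    diff = abs(act_i - act_j)
    if sim >= 1.0:
        return float("inf") if diff > 0 else 0.0
    return diff / (1.0 - sim)


def sali_matrix(activities, fingerprints):
    """Symmetric matrix of pairwise SALI values (diagonal left at 0)."""
    n = len(activities)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        matrix[i][j] = matrix[j][i] = sali(activities[i], activities[j], sim)
    return matrix


# Hypothetical toy data: three molecules with pIC50-like activities.
activities = [7.4, 8.7, 8.0]
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {6, 7, 8}]
matrix = sali_matrix(activities, fingerprints)
```

In the toy data, molecules 0 and 1 are similar (Tanimoto 0.6) yet differ by 1.3 log units, so their entry is the largest in the matrix; that is the kind of element the slide calls a cliff.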
    7. Visualizing SALI Values
       • The SALI graph
       • Compounds are nodes
       • Nodes i,j are connected if SALI(i,j) > X
       • Only display connected nodes
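The graph construction described on this slide can be sketched as a simple threshold over the pairwise matrix. The edge direction used here (from the less active to the more active compound) is an assumption made for illustration; the slide only states the connection rule. All names and the toy matrix are hypothetical.

```python
def sali_graph(activities, sali_values, cutoff):
    """Build a SALI graph: connect i and j when SALI(i, j) > cutoff.

    Returns (nodes, edges). Only connected nodes are returned, and each
    edge is stored as (less_active, more_active), which is an assumption.
    """
    edges = []
    n = len(activities)
    for i in range(n):
        for j in range(i + 1, n):
            if sali_values[i][j] > cutoff:
                low, high = (i, j) if activities[i] <= activities[j] else (j, i)
                edges.append((low, high))
    nodes = sorted({k for edge in edges for k in edge})
    return nodes, edges


# Hypothetical toy data: activities and a precomputed SALI matrix.
toy_activities = [7.4, 8.7, 8.0]
toy_sali = [
    [0.0, 3.25, 0.6],
    [3.25, 0.0, 0.7],
    [0.6, 0.7, 0.0],
]
nodes, edges = sali_graph(toy_activities, toy_sali, cutoff=1.0)
```

Raising the cutoff X prunes the graph down to the steepest cliffs; lowering it toward zero connects nearly every pair.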
    8. What Can We Do With SALI’s?
       • SALI characterizes cliffs & non-cliffs
       • For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape
       • Models try to encode this landscape
       • Use the landscape to guide descriptor or model selection
    9. Descriptor Space Smoothness
       • Edge count of the SALI graph for varying cutoffs
       • Measures smoothness of the descriptor space
       • Can reduce this to a single number (AUC)
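One way to read the smoothness measure above is as a curve of edge counts against the cutoff, summarized by its area. The sketch below assumes cutoffs are expressed as fractions of the largest finite SALI value, edge counts are normalized by the number of pairs, and the area is taken with the trapezoidal rule. These normalization choices and all names are assumptions for illustration; the slide does not specify them.

```python
def edge_fraction_curve(sali_pairs, n_steps=10):
    """Fraction of pairs that remain edges as the cutoff rises.

    sali_pairs: flat list of pairwise SALI values (one per molecule pair).
    Cutoffs run from 0 to the largest finite SALI value in n_steps steps.
    """
    finite = [s for s in sali_pairs if s != float("inf")]
    top = max(finite)
    curve = []
    for k in range(n_steps + 1):
        frac = k / n_steps
        cutoff = frac * top
        count = sum(1 for s in sali_pairs if s > cutoff)
        curve.append((frac, count / len(sali_pairs)))
    return curve


def auc(curve):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    )


# Hypothetical toy data: three pairwise SALI values.
toy_pairs = [3.25, 0.6, 0.7]
curve = edge_fraction_curve(toy_pairs)
smoothness = auc(curve)
```

Under these assumptions, a curve that drops quickly (small area) means only a few pairs have SALI values near the maximum, while a slowly decaying curve means many pairs are comparably steep.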
    10. Feature Selection Using SALI
       • Instead of fingerprints, we use molecular descriptors
       • SALI denominator now uses Euclidean distance
       • 2D & 3D random descriptor sets
       • None are really good: too rough, or too flat
       [Figure panels: 2D, 3D]
    11. Measuring Model Quality
       • A QSAR model should easily encode the “rolling hills”
       • A good model captures the most significant cliffs
       • Can be formalized as: how many of the edge orderings of a SALI graph does the model predict correctly?
       • Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X
       • Repeat for varying X and obtain the SALI curve
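The idea of scoring a model by edge orderings can be sketched as follows. For each pair connected at threshold X, the model is counted as correct when it ranks the two molecules in the same order as the observed activities. Here S(X) is reported as the fraction of edges ordered correctly; that normalization, the function names, and the toy data are assumptions for illustration and may differ from the definition used in the talk.

```python
def sali_curve_point(observed, predicted, sali_values, cutoff):
    """Fraction of SALI-graph edges at `cutoff` whose ordering is predicted.

    Returns None when no pair exceeds the cutoff (no edges to score).
    """
    total = 0
    correct = 0
    n = len(observed)
    for i in range(n):
        for j in range(i + 1, n):
            if sali_values[i][j] > cutoff:
                total += 1
                # Same sign of the difference means the same ordering.
                if (observed[i] - observed[j]) * (predicted[i] - predicted[j]) > 0:
                    correct += 1
    return correct / total if total else None


def sali_curve(observed, predicted, sali_values, cutoffs):
    """S(X) evaluated at each cutoff X."""
    return [
        (x, sali_curve_point(observed, predicted, sali_values, x))
        for x in cutoffs
    ]


# Hypothetical toy data.
obs = [7.4, 8.7, 8.0]
pred = [7.9, 8.5, 7.6]
toy_matrix = [
    [0.0, 3.25, 0.6],
    [3.25, 0.0, 0.7],
    [0.6, 0.7, 0.0],
]
points = sali_curve(obs, pred, toy_matrix, cutoffs=[0.0, 1.0, 5.0])
```

In this toy case the model gets two of the three orderings right at X = 0, and the single remaining edge right at X = 1, which is the pattern a good model should show: accuracy that holds up as the threshold isolates the steepest cliffs.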
    12. SALI Curves
    13. Predicting the Landscape
       • Rather than predicting activity directly, we can try to predict the SAR landscape
       • Implies that we attempt to directly predict cliffs
       • Observations are now pairs of molecules
       • A more complex problem
       • Choice of features is trickier
       • Still face the problem of cliffs as outliers
       • Somewhat similar to predicting activity differences
       Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
    14. Motivation
       • Predicting activity cliffs corresponds to extending the SAR landscape
       • Identify whether a new molecule will perform better or worse compared to the specific molecules in the dataset
       • Can be useful for guiding lead optimization, but not necessarily useful for lead hopping
    15. Predicting Cliffs
       • Dependent variables are pairwise SALI values, calculated using fingerprints
       • Independent variables are molecular descriptors, but considered pairwise
       • Absolute difference of descriptor pairs, or
       • Geometric mean of descriptor pairs
       • …
       • Develop a model to correlate pairwise descriptors to pairwise SALI values
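The two descriptor aggregation functions named above can be sketched directly. Each molecule has a descriptor vector, and each pair of molecules gets one aggregated vector that serves as the feature row for that pair's SALI value. The function names and toy descriptors are hypothetical, and the geometric mean as written assumes non-negative descriptor values.

```python
from itertools import combinations
from math import sqrt


def abs_diff(x, y):
    """Element-wise absolute difference of two descriptor vectors."""
    return [abs(a - b) for a, b in zip(x, y)]


def geo_mean(x, y):
    """Element-wise geometric mean; assumes non-negative descriptors."""
    return [sqrt(a * b) for a, b in zip(x, y)]


def pairwise_features(descriptors, aggregate):
    """Map each molecule pair (i, j), i < j, to its aggregated descriptor vector."""
    return {
        (i, j): aggregate(descriptors[i], descriptors[j])
        for i, j in combinations(range(len(descriptors)), 2)
    }


# Hypothetical toy data: three molecules, two descriptors each.
toy_descriptors = [[1.0, 4.0], [4.0, 9.0], [9.0, 1.0]]
diff_features = pairwise_features(toy_descriptors, abs_diff)
mean_features = pairwise_features(toy_descriptors, geo_mean)
```

Because observations are pairs, n molecules give n(n-1)/2 rows: 30 molecules yield the 435 observations quoted for the first test case in this deck.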
    16. A Test Case
       • We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50’s
       • Evaluate topological and physicochemical descriptors
       • Developed random forest models
       • On the original observed values (30 obs)
       • On the SALI values (435 observations)
       Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853
    17. Double Counting Structures?
       • The dependent and independent variables both encode structure.
       • But pretty low correlations between individual pairwise descriptors and the SALI values
    18. Model Summaries
       • Original pIC50: RMSE = 0.97
       • SALI, AbsDiff: RMSE = 1.10
       • SALI, GeoMean: RMSE = 1.04
       • All models explain similar % of variance of their respective datasets
       • Using geometric mean as the descriptor aggregation function seems to perform best
       • SALI models are more robust due to larger size of the dataset
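The two summary statistics used on these slides can be sketched as below. The deck does not say how "% of variance explained" was computed, so the sketch assumes the common form 1 - SS_res / SS_tot; the function names and toy values are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def variance_explained(observed, predicted):
    """1 - SS_res / SS_tot (assumed definition of fraction of variance explained)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values.
toy_obs = [1.0, 2.0, 3.0]
toy_pred = [1.0, 2.0, 4.0]
toy_rmse = rmse(toy_obs, toy_pred)
toy_r2 = variance_explained(toy_obs, toy_pred)
```

RMSE is expressed in the units of the response, so an RMSE on pIC50 values and an RMSE on SALI values are not directly comparable; the fraction of variance explained is the scale-free quantity to compare across these models.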
    19. Test Case 2
       • Considered the Holloway docking dataset, 32 molecules with pIC50’s and Einter
       • Similar strategy as before
       • Need to transform SALI values
       • Descriptors show minimal correlation
       Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317
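The slide notes for this deck mention that log10(SALI) had to be set to 0 when SALI = 0. Assuming the transform referred to on this slide is that log10 transform, a minimal sketch looks like this (the function name is hypothetical):

```python
from math import log10


def log_sali(value):
    """log10 of a SALI value, with SALI = 0 mapped to 0 as in the slide notes."""
    return log10(value) if value > 0 else 0.0


# Hypothetical toy values spanning several orders of magnitude.
raw = [0.0, 1.0, 10.0, 100.0]
transformed = [log_sali(v) for v in raw]
```

One consequence of this convention is that SALI = 0 and SALI = 1 both map to 0 after the transform, so the two cases become indistinguishable to the model.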
    20. Model Summaries
       • Original pIC50: RMSE = 1.05
       • SALI, AbsDiff: RMSE = 0.48
       • SALI, GeoMean: RMSE = 0.48
       • The SALI models perform much more poorly in terms of % of variance explained
       • Descriptor aggregation method does not seem to have much effect
       • The SALI models appear to perform decently on the cliffs, but miss the most significant ones
    21. Model Summaries
       • Original pIC50: RMSE = 1.05
       • SALI, AbsDiff: RMSE = 9.76
       • SALI, GeoMean: RMSE = 10.01
       • With untransformed SALI values, models perform similarly in terms of % of variance explained
       • The most significant cliffs correspond to stereoisomers
    22. Test Case 3
       • 38 adenosine receptor antagonists with reported Ki values; use 35 for training and 3 for testing
       • Random forest model on the SALI values performed reasonably well (RMSE = 7.51, R² = 0.62)
       • Upper end of SALI range is better predicted
       Kalla, R.V. et al, J. Med. Chem., 2006, 48, 1984-2008
    23. Test Case 3
       • The dataset does not contain really big cliffs
       • Generally, performance is poorer for smaller cliffs
       • For any given hold-out molecule, the range of error in SALI prediction is large
       • Suggests that some form of domain applicability metric would be useful
    24. Model Caveats
       • Models based on SALI values are dependent on there being an SAR in the original activity data
       • Scrambling results for these models are poorer than the original models but aren’t as random as expected
    25. Conclusions
       • SALI is the first step in characterizing the SAR landscape
       • Allows us to directly analyze the landscape, as opposed to individual molecules
       • Being able to predict the landscape could serve as a useful way to extend an SAR landscape
    26. Acknowledgements
       • John Van Drie
       • Gerry Maggiora
       • Mic Lajiness
       • Jurgen Bajorath