

- 1. Predicting Activity Cliffs - Can We Use Machine Learning for Special Cases?<br />Rajarshi Guha<br />NIH Center for Translational Therapeutics<br />August 4, 2011<br />Joint Statistical Meeting, Miami Beach<br />
- 2. Outline<br />Structure-activity landscapes<br />Characterization<br />Prediction<br />
- 3. Structure Activity Relationships<br />Similar molecules will have similar activities<br />Small changes in structure will lead to small changes in activity<br />One implication is that SARs are additive<br />This is the basis for QSAR modeling<br />Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358<br />
- 4. Exceptions Are Easy to Find<br />Ki = 39.0 nM<br />Ki = 1.8 nM<br />Ki = 10.0 nM<br />Ki = 1.0 nM<br />Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176<br />
- 5. Structure Activity Landscapes<br />Rugged gorges or rolling hills?<br />Small structural changes associated with large activity changes represent steep slopes in the landscape<br />But traditionally, QSAR assumes gentle slopes<br />Machine learning is not very good for special cases<br />Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535<br />
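- 6. Characterizing the Landscape<br />A cliff can be numerically characterized<br />Structure Activity Landscape Index (SALI)<br />SALI(i,j) = |A_i − A_j| / (1 − sim(i,j)), where A_i is the activity of molecule i and sim(i,j) is the structural similarity between molecules i and j<br />Cliffs are characterized by elements of the SALI matrix with very large values<br />Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658<br />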
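
As a rough illustration of the index named on this slide, the sketch below computes a pairwise SALI matrix as the absolute activity difference divided by one minus the structural similarity. The function name `sali_matrix`, the `eps` guard for identical structures, and the toy activities and similarities are assumptions made for this example; they are not taken from the talk.

```python
from itertools import combinations


def sali_matrix(activities, similarity, eps=1e-6):
    """Return a symmetric matrix of pairwise SALI values.

    activities: one activity per molecule (e.g. pIC50).
    similarity: n x n nested list of pairwise similarities in [0, 1].
    eps: hypothetical guard so that identical structures (similarity 1)
         do not cause a division by zero.
    """
    n = len(activities)
    sali = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        distance = max(1.0 - similarity[i][j], eps)
        value = abs(activities[i] - activities[j]) / distance
        sali[i][j] = sali[j][i] = value
    return sali


# Toy data: molecules 0 and 1 are very similar but differ strongly in activity.
activities = [6.0, 8.0, 6.5]
similarity = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
sali = sali_matrix(activities, similarity)
```

In this toy set the pair (0, 1) gets by far the largest value, which is the kind of matrix element the slide calls a cliff.
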
- 7. Visualizing SALI Values<br />The SALI graph<br />Compounds are nodes<br />Nodes i,j are connected if SALI(i,j) > X<br />Only display connected nodes<br />
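
The graph construction on this slide could be sketched as follows: keep every pair whose SALI value exceeds the cutoff X, and report only nodes that end up with at least one edge. Orienting each edge from the less active to the more active molecule is an assumption of this sketch, as are the function name `sali_graph` and the toy values.

```python
def sali_graph(sali, activities, cutoff):
    """Return (nodes, edges) for all pairs with SALI above the cutoff.

    Each edge is stored as (less_active, more_active); this orientation
    is an assumption of the sketch. Nodes without edges are omitted.
    """
    n = len(activities)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if sali[i][j] > cutoff:
                if activities[i] <= activities[j]:
                    edges.append((i, j))
                else:
                    edges.append((j, i))
    nodes = sorted({node for edge in edges for node in edge})
    return nodes, edges


# Toy SALI matrix and activities, made up for illustration.
activities = [6.0, 8.0, 6.5]
sali = [
    [0.0, 20.0, 1.0],
    [20.0, 0.0, 3.0],
    [1.0, 3.0, 0.0],
]
nodes_low, edges_low = sali_graph(sali, activities, cutoff=2.0)
nodes_high, edges_high = sali_graph(sali, activities, cutoff=5.0)
```

Raising the cutoff drops the weaker pair and, with it, molecule 2 from the displayed graph.
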
- 8. What Can We Do With SALI Values?<br />SALI characterizes cliffs & non-cliffs<br />For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape<br />Models try to encode this landscape<br />Use the landscape to guide descriptor or model selection<br />
- 9. Descriptor Space Smoothness<br />Edge count of the SALI graph for varying cutoffs<br />Measures smoothness of the descriptor space<br />Can reduce this to a single number (AUC)<br />
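
One way to read this slide: count the edges of the SALI graph at each cutoff, then summarize the resulting curve by its area. The sketch below uses a plain trapezoidal rule; the slides do not say how the AUC was computed or normalized, so that choice, the function names, and the toy matrix are all assumptions.

```python
def edge_count_curve(sali, cutoffs):
    """Number of SALI graph edges (pairs with SALI > cutoff) per cutoff."""
    n = len(sali)
    pairs = [sali[i][j] for i in range(n) for j in range(i + 1, n)]
    return [sum(1 for value in pairs if value > x) for x in cutoffs]


def trapezoid_auc(xs, ys):
    """Area under a piecewise-linear curve through the points (xs, ys)."""
    return sum(
        (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
        for k in range(len(xs) - 1)
    )


# Toy SALI matrix, made up for illustration.
sali = [
    [0.0, 20.0, 1.0],
    [20.0, 0.0, 3.0],
    [1.0, 3.0, 0.0],
]
cutoffs = [0.0, 2.0, 10.0, 25.0]
counts = edge_count_curve(sali, cutoffs)
auc = trapezoid_auc(cutoffs, counts)
```

A curve that drops to zero quickly (small area) would suggest a smoother descriptor space than one that keeps many edges at high cutoffs.
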
- 10. Feature Selection Using SALI<br />Instead of fingerprints, we use molecular descriptors<br />SALI denominator now uses Euclidean distance<br />2D & 3D random descriptor sets<br />None are really good<br />Too rough, or<br />Too flat<br />(Figure panels: 2D, 3D)<br />
- 11. Measuring Model Quality<br />A QSAR model should easily encode the “rolling hills”<br />A good model captures the most significant cliffs<br />Can be formalized as:<br />How many of the edge orderings of a SALI graph does the model predict correctly?<br />Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X<br />Repeat for varying X and obtain the SALI curve<br />
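
A minimal sketch of S(X) as described on this slide: for each threshold X, take the pairs whose SALI exceeds X and count how many have the same activity ordering in the model's predictions as in the observed data. The function name `sali_curve` and the toy data are assumptions; any normalization used in the published method is not reproduced here.

```python
def sali_curve(sali, observed, predicted, cutoffs):
    """For each cutoff X, count SALI graph edges whose ordering is predicted correctly.

    An edge (i, j) counts as correct when the predicted activities rank
    the two molecules in the same order as the observed activities.
    """
    n = len(observed)
    curve = []
    for x in cutoffs:
        correct = 0
        for i in range(n):
            for j in range(i + 1, n):
                if sali[i][j] <= x:
                    continue  # not an edge at this threshold
                observed_diff = observed[i] - observed[j]
                predicted_diff = predicted[i] - predicted[j]
                if observed_diff * predicted_diff > 0:
                    correct += 1
        curve.append(correct)
    return curve


# Toy data: the model misorders molecules 1 and 2.
sali = [
    [0.0, 20.0, 1.0],
    [20.0, 0.0, 3.0],
    [1.0, 3.0, 0.0],
]
observed = [6.0, 8.0, 6.5]
predicted = [6.2, 6.4, 6.6]
curve = sali_curve(sali, observed, predicted, cutoffs=[0.0, 2.0, 10.0])
```

Here the one remaining edge at the highest cutoff is ordered correctly, so this toy model captures the most significant cliff while missing a smaller one.
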
- 12. SALI Curves<br />
- 13. Predicting the Landscape<br />Rather than predicting activity directly, we can try to predict the SAR landscape<br />Implies that we attempt to directly predict cliffs<br />Observations are now pairs of molecules<br />A more complex problem<br />Choice of features is trickier<br />Still face the problem of cliffs as outliers<br />Somewhat similar to predicting activity differences<br />Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122<br />
- 14. Motivation<br />Predicting activity cliffs corresponds to extending the SAR landscape<br />Identify whether a new molecule will perform better or worse compared to the specific molecules in the dataset<br />Can be useful for guiding lead optimization, but not necessarily useful for lead hopping<br />
- 15. Predicting Cliffs<br />Dependent variables are pairwise SALI values, calculated using fingerprints<br />Independent variables are molecular descriptors – but considered pairwise<br />Absolute difference of descriptor pairs, or<br />Geometric mean of descriptor pairs<br />…<br />Develop a model to correlate pairwise descriptors to pairwise SALI values<br />
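
The two aggregation functions named on this slide could be sketched as below: each unordered pair of molecules becomes one observation whose features are either the element-wise absolute difference or the element-wise geometric mean of the two descriptor vectors. The function name, the `how` parameter, and the restriction of the geometric mean to non-negative descriptors are assumptions of this sketch.

```python
from itertools import combinations
from math import sqrt


def pairwise_features(descriptors, how="absdiff"):
    """Map each pair (i, j), i < j, to an aggregated descriptor vector.

    how="absdiff": element-wise absolute difference.
    how="geomean": element-wise geometric mean (non-negative values only,
                   an assumption of this sketch).
    """
    rows = {}
    for i, j in combinations(range(len(descriptors)), 2):
        a, b = descriptors[i], descriptors[j]
        if how == "absdiff":
            rows[(i, j)] = [abs(x - y) for x, y in zip(a, b)]
        elif how == "geomean":
            if any(x < 0 or y < 0 for x, y in zip(a, b)):
                raise ValueError("geometric mean needs non-negative descriptors")
            rows[(i, j)] = [sqrt(x * y) for x, y in zip(a, b)]
        else:
            raise ValueError("unknown aggregation: " + how)
    return rows


# Toy descriptor vectors for three molecules, made up for illustration.
descriptors = [[1.0, 4.0], [4.0, 9.0], [9.0, 16.0]]
abs_diff = pairwise_features(descriptors, how="absdiff")
geo_mean = pairwise_features(descriptors, how="geomean")
```

For n molecules this yields n(n − 1)/2 pairwise observations, so the pairwise dataset grows much faster than the original one.
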
- 16. A Test Case<br />We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50 values<br />Evaluate topological and physicochemical descriptors<br />Developed random forest models<br />On the original observed values (30 observations)<br />On the SALI values (435 observations)<br />Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853<br />
- 17. Double Counting Structures?<br />The dependent and independent variables both encode structure<br />But there are pretty low correlations between individual pairwise descriptors and the SALI values<br />
- 18. Model Summaries<br />Original pIC50<br />RMSE = 0.97<br />SALI, AbsDiff<br />RMSE = 1.10<br />SALI, GeoMean<br />RMSE = 1.04<br />All models explain similar % of variance of their respective datasets <br />Using geometric mean as the descriptor aggregation function seems to perform best<br />SALI models are more robust due to larger size of the dataset<br />
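
The two summary statistics quoted on this slide can be computed as in the sketch below. The slides do not say which software or which exact definition of "% of variance explained" was used, so the formula here (one minus the residual sum of squares over the total sum of squares) is an assumption, as are the toy numbers.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


def variance_explained(observed, predicted):
    """Fraction of variance explained: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_total = sum((o - mean) ** 2 for o in observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


# Toy values, made up for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
error = rmse(observed, predicted)
explained = variance_explained(observed, predicted)
```

RMSE depends on the scale of the dependent variable, so RMSE values for pIC50 models and SALI models are not directly comparable; the fraction of variance explained is the scale-free figure.
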
- 19. Test Case 2<br />Considered the Holloway docking dataset, 32 molecules with pIC50’s and Einter<br />Similar strategy as before<br />Need to transform SALI values <br />Descriptors show minimal correlation<br />Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317<br />
- 20. Model Summaries<br />Original pIC50<br />RMSE = 1.05<br />SALI, AbsDiff<br />RMSE = 0.48<br />SALI, GeoMean<br />RMSE = 0.48<br />The SALI models perform much worse in terms of % of variance explained<br />Descriptor aggregation method does not seem to have much effect<br />The SALI models appear to perform decently on the cliffs – but miss the most significant ones<br />
- 21. Model Summaries<br />Original pIC50<br />RMSE = 1.05<br />SALI, AbsDiff<br />RMSE = 9.76<br />SALI, GeoMean<br />RMSE = 10.01<br />With untransformed SALI values, models perform similarly in terms of % of variance explained<br />The most significant cliffs correspond to stereoisomers<br />
- 22. Test Case 3<br />38 adenosine receptor antagonists with reported Ki values; use 35 for training and 3 for testing<br />Random forest model on the SALI values performed reasonably well (RMSE = 7.51, R² = 0.62)<br />Upper end of SALI range is better predicted<br />Kalla, R.V. et al, J. Med. Chem., 2006, 48, 1984-2008<br />
- 23. Test Case 3<br />The dataset does not contain really big cliffs<br />Generally, performance is poorer for smaller cliffs<br />For any given hold-out molecule, range of error in SALI prediction is large<br />Suggests that some form of domain applicability metric would be useful<br />
- 24. Model Caveats<br />Models based on SALI values are dependent on there being an SAR in the original activity data<br />Scrambling results for these models are poorer than the original models but aren’t as random as expected<br />
- 25. Conclusions<br />SALI is the first step in characterizing the SAR landscape<br />Allows us to directly analyze the landscape, as opposed to individual molecules<br />Being able to predict the landscape could serve as a useful way to extend an SAR landscape<br />
- 26. Acknowledgements<br />John Van Drie<br />Gerry Maggiora<br />Mic Lajiness<br />Jürgen Bajorath<br />
