Outliers in a cliff prediction model are not as severe since SALI changes more slowly than just activity differences
For SALI = 0, had to set log10(SALI) = 0
Similar performance if we use SALI and not log10(SALI); at least more % variance is explained. Still fails on the most significant cliffs
Transcript of "Predicting Activity Cliffs - Can Machine Learning Handle Special Cases?"
1.
Predicting Activity Cliffs - Can We Use Machine Learning for Special Cases?<br />Rajarshi Guha<br />NIH Center for Translational Therapeutics<br />August 4, 2011<br />Joint Statistical Meeting, Miami Beach<br />
3.
Structure Activity Relationships<br />Similar molecules will have similar activities<br />Small changes in structure will lead to small changes in activity<br />One implication is that SAR’s are additive<br />This is the basis for QSAR modeling<br />Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358<br />
4.
Exceptions Are Easy to Find<br />Ki = 39.0 nM<br />Ki = 1.8 nM<br />Ki = 10.0 nM<br />Ki = 1.0 nM<br />Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176<br />
5.
Structure Activity Landscapes<br />Rugged gorges or rolling hills?<br />Small structural changes associated with large activity changes represent steep slopes in the landscape<br />But traditionally, QSAR assumes gentle slopes<br />Machine learning is not very good for special cases<br />Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535<br />
6.
Characterizing the Landscape<br />A cliff can be numerically characterized<br />Structure Activity Landscape Index (SALI)<br />Cliffs are characterized by elements of the matrix with very large values<br />Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658<br />
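The index in the cited Guha and Van Drie paper is SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)), where A is the activity and sim is a structural similarity. The following is a minimal sketch of that definition, assuming fingerprints stored as Python sets of on-bit indices and Tanimoto similarity. The function names are illustrative, and returning infinity for identical fingerprints with different activities is an assumption, not something the slides specify.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def sali(act_i, act_j, sim_ij):
    """SALI(i, j) = |A_i - A_j| / (1 - sim(i, j))."""
    if sim_ij >= 1.0:
        # Assumption: identical fingerprints make the ratio undefined;
        # report 0 for equal activities and infinity otherwise.
        return 0.0 if act_i == act_j else math.inf
    return abs(act_i - act_j) / (1.0 - sim_ij)


def sali_matrix(activities, fingerprints):
    """Symmetric matrix of pairwise SALI values (diagonal left at 0)."""
    n = len(activities)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = tanimoto(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = sali(activities[i], activities[j], sim)
    return matrix
```

Large matrix entries then mark the cliffs: a big activity difference divided by a small structural distance.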
7.
Visualizing SALI Values<br />The SALI graph<br />Compounds are nodes<br />Nodes i,j are connected if SALI(i,j) > X<br />Only display connected nodes<br />
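The graph construction on this slide can be sketched as follows, assuming a precomputed symmetric SALI matrix as a list of lists. The name `sali_graph` and the undirected edge representation are illustrative choices, not taken from the talk.

```python
def sali_graph(sali, cutoff):
    """Return (nodes, edges) of the SALI graph at a given cutoff.

    An edge (i, j) is kept when SALI(i, j) > cutoff. Compounds with no
    edges are dropped, so only connected nodes are returned.
    """
    n = len(sali)
    edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sali[i][j] > cutoff
    ]
    nodes = sorted({k for edge in edges for k in edge})
    return nodes, edges
```

Raising the cutoff X leaves only the steepest cliffs in the graph.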
8.
What Can We Do With SALI’s?<br />SALI characterizes cliffs & non-cliffs<br />For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape<br />Models try to encode this landscape<br />Use the landscape to guide descriptor or model selection<br />
9.
Descriptor Space Smoothness<br />Edge count of the SALI graph for varying cutoffs<br />Measures smoothness of the descriptor space<br />Can reduce this to a single number (AUC)<br />
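One way to sketch the edge-count curve and its single-number summary is shown below. The evenly spaced cutoffs between 0 and the maximum SALI value, the normalization by the total number of pairs, and the trapezoidal area are all assumptions made for illustration; the slides do not give these details. Infinite SALI values are assumed to have been filtered out beforehand.

```python
def edge_count_curve(sali_values, n_steps=11):
    """Fraction of pairs that remain edges at each cutoff.

    sali_values: flat list of finite pairwise SALI values.
    Cutoffs are evenly spaced from 0 to the maximum value (an assumption).
    """
    top = max(sali_values)
    cutoffs = [top * k / (n_steps - 1) for k in range(n_steps)]
    fractions = [
        sum(v > c for v in sali_values) / len(sali_values) for c in cutoffs
    ]
    return cutoffs, fractions


def curve_auc(xs, ys):
    """Trapezoidal area under the curve defined by points (xs, ys)."""
    return sum(
        (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2 for k in range(len(xs) - 1)
    )
```

Comparing this area across descriptor sets gives one number per representation, which is the kind of reduction the slide refers to.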
10.
Feature Selection Using SALI<br />Instead of fingerprints, we use molecular descriptors<br />SALI denominator now uses Euclidean distance<br />2D & 3D random descriptor sets<br />None are really good<br />Too rough, or<br />Too flat<br />[Plot panels: 2D, 3D]<br />
11.
Measuring Model Quality<br />A QSAR model should easily encode the “rolling hills”<br />A good model captures the most significant cliffs<br />Can be formalized as<br />How many of the edge orderings of a SALI graph does the model predict correctly?<br />Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X<br />Repeat for varying X and obtain the SALI curve<br />
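The S(X) described above can be sketched as a count of edges whose activity ordering the model reproduces. This is a minimal illustration that follows the slide's wording (a raw count); any normalization or penalty for wrong orderings, and the treatment of ties, are assumptions left out or chosen here for simplicity. The function names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sali_curve_point(sali, observed, predicted, cutoff):
    """S(X): number of edges (SALI > cutoff) with correctly predicted ordering."""
    correct = 0
    n = len(observed)
    for i in range(n):
        for j in range(i + 1, n):
            if sali[i][j] > cutoff:
                # An edge is "correct" when the model ranks the two
                # compounds in the same order as the measured activities.
                if sign(observed[i] - observed[j]) == sign(predicted[i] - predicted[j]):
                    correct += 1
    return correct


def sali_curve(sali, observed, predicted, cutoffs):
    """Evaluate S(X) at each cutoff to obtain the SALI curve."""
    return [sali_curve_point(sali, observed, predicted, c) for c in cutoffs]
```

A model that only captures the gentle parts of the landscape would score well at low cutoffs and lose edges as X rises toward the most significant cliffs.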
13.
Predicting the Landscape<br />Rather than predicting activity directly, we can try to predict the SAR landscape<br />Implies that we attempt to directly predict cliffs<br />Observations are now pairs of molecules<br />A more complex problem<br />Choice of features is trickier<br />Still face the problem of cliffs as outliers<br />Somewhat similar to predicting activity differences<br />Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122<br />
14.
Motivation<br />Predicting activity cliffs corresponds to extending the SAR landscape<br />Identify whether a new molecule will perform better or worse compared to the specific molecules in the dataset<br />Can be useful for guiding lead optimization, but not necessarily useful for lead hopping<br />
15.
Predicting Cliffs<br />Dependent variables are pairwise SALI values, calculated using fingerprints<br />Independent variables are molecular descriptors – but considered pairwise<br />Absolute difference of descriptor pairs, or<br />Geometric mean of descriptor pairs<br />…<br />Develop a model to correlate pairwise descriptors to pairwise SALI values<br />
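The two aggregation functions named above could be sketched like this, assuming each molecule's descriptors are a list of floats of equal length. The function name and the `mode` strings are illustrative; the geometric mean here additionally assumes non-negative descriptor values, which the slides do not discuss.

```python
import math
from itertools import combinations


def pairwise_features(descriptors, mode="absdiff"):
    """Build one feature row per unordered pair of molecules.

    descriptors: list of per-molecule descriptor lists.
    mode: "absdiff" for |x - y|, "geomean" for sqrt(x * y).
    Returns a dict mapping (i, j) to the aggregated descriptor row.
    """
    rows = {}
    for i, j in combinations(range(len(descriptors)), 2):
        a, b = descriptors[i], descriptors[j]
        if mode == "absdiff":
            rows[(i, j)] = [abs(x - y) for x, y in zip(a, b)]
        elif mode == "geomean":
            # Assumption: descriptor values are non-negative.
            rows[(i, j)] = [math.sqrt(x * y) for x, y in zip(a, b)]
        else:
            raise ValueError(f"unknown mode: {mode}")
    return rows
```

Each row would then be paired with the SALI value of the same (i, j) pair to form the training set for the regression model.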
16.
A Test Case<br />We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50’s<br />Evaluate topological and physicochemical descriptors<br />Developed random forest models<br />On the original observed values (30 obs)<br />On the SALI values (435 observations)<br />Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853<br />
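The 435 SALI observations are the unordered pairs that can be formed from 30 molecules:

```latex
\binom{30}{2} = \frac{30 \times 29}{2} = 435
```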
17.
Double Counting Structures?<br />The dependent and independent variables both encode structure.<br />But pretty low correlations between individual pairwise descriptors and the SALI values<br />
18.
Model Summaries<br />Original pIC50<br />RMSE = 0.97<br />SALI, AbsDiff<br />RMSE = 1.10<br />SALI, GeoMean<br />RMSE = 1.04<br />All models explain similar % of variance of their respective datasets <br />Using geometric mean as the descriptor aggregation function seems to perform best<br />SALI models are more robust due to larger size of the dataset<br />
19.
Test Case 2<br />Considered the Holloway docking dataset, 32 molecules with pIC50’s and Einter<br />Similar strategy as before<br />Need to transform SALI values <br />Descriptors show minimal correlation<br />Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317<br />
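The speaker notes for this deck mention a log10 transform of the SALI values, with log10(SALI) set to 0 when SALI = 0. A minimal sketch of that transform, with a hypothetical function name:

```python
import math


def log_sali(value):
    """log10 transform of a SALI value; SALI = 0 is mapped to 0."""
    return 0.0 if value == 0 else math.log10(value)
```

Note that under this convention SALI = 0 and SALI = 1 both map to 0, so the transformed values do not distinguish those two cases.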
20.
Model Summaries<br />Original pIC50<br />RMSE = 1.05<br />SALI, AbsDiff<br />RMSE = 0.48<br />SALI, GeoMean<br />RMSE = 0.48<br />The SALI models perform much worse in terms of % of variance explained<br />Descriptor aggregation method does not seem to have much effect<br />The SALI models appear to perform decently on the cliffs – but miss the most significant ones<br />
21.
Model Summaries<br />Original pIC50<br />RMSE = 1.05<br />SALI, AbsDiff<br />RMSE = 9.76<br />SALI, GeoMean<br />RMSE = 10.01<br />With untransformed SALI values, models perform similarly in terms of % of variance explained<br />The most significant cliffs correspond to stereoisomers<br />
22.
Test Case 3<br />38 adenosine receptor antagonists with reported Ki values; use 35 for training and 3 for testing<br />Random forest model on the SALI values performed reasonably well (RMSE = 7.51, R2 = 0.62)<br />Upper end of SALI range is better predicted<br />Kalla, R.V. et al, J. Med. Chem., 2006, 48, 1984-2008<br />
23.
Test Case 3<br />The dataset does not contain really big cliffs<br />Generally, performance is poorer for smaller cliffs<br />For any given hold-out molecule, the range of error in SALI prediction is large<br />Suggests that some form of domain applicability metric would be useful<br />
25.
Model Caveats<br />Models based on SALI values are dependent on there being an SAR in the original activity data<br />Scrambling results for these models are poorer than the original models but aren’t as random as expected<br />
26.
Conclusions<br />SALI is the first step in characterizing the SAR landscape<br />Allows us to directly analyze the landscape, as opposed to individual molecules<br />Being able to predict the landscape could serve as a useful way to extend an SAR landscape<br />
27.
Acknowledgements<br />John Van Drie<br />Gerry Maggiora<br />Mic Lajiness<br />Jurgen Bajorath<br />