Predicting Activity Cliffs - Can Machine Learning Handle Special Cases?
Published in: Technology, Education
  • Outliers in a cliff prediction model are not as severe since SALI changes more slowly than just activity differences
  • For SALI = 0, had to set log10(SALI) = 0. Similar performance if we use SALI and not log10(SALI); at least more % variance is explained. Still fails on the most significant cliffs
  • Transcript

    • 1. Predicting Activity Cliffs - Can We Use Machine Learning for Special Cases?
      Rajarshi Guha
      NIH Center for Translational Therapeutics
      August 4, 2011
      Joint Statistical Meeting, Miami Beach
    • 2. Outline
      Structure-activity landscapes
    • 3. Structure Activity Relationships
      Similar molecules will have similar activities
      Small changes in structure will lead to small changes in activity
      One implication is that SARs are additive
      This is the basis for QSAR modeling
      Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358
    • 4. Exceptions Are Easy to Find
      Ki = 39.0 nM
      Ki = 1.8 nM
      Ki = 10.0 nM
      Ki = 1.0 nM
      Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176
    • 5. Structure Activity Landscapes
      Rugged gorges or rolling hills?
      Small structural changes associated with large activity changes represent steep slopes in the landscape
      But traditionally, QSAR assumes gentle slopes
      Machine learning is not very good for special cases
      Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
    • 6. Characterizing the Landscape
      A cliff can be numerically characterized
      Structure Activity Landscape Index (SALI)
      Cliffs are characterized by elements of the matrix with very large values
      Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
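The SALI definition itself appeared on the slide as an image and did not survive in the transcript. In the cited Guha and Van Drie paper, the index for a pair of compounds is the absolute activity difference divided by one minus their structural similarity. The following is a minimal Python sketch under that definition; the function name `sali_matrix`, the toy activities and similarities, and the choice to return infinity for identical representations are assumptions made for illustration.

```python
from itertools import combinations


def sali_matrix(activities, similarity):
    """Return a symmetric matrix (list of lists) of pairwise SALI values.

    activities: one activity value per compound (e.g. pIC50).
    similarity: square matrix of pairwise similarities in [0, 1].
    """
    n = len(activities)
    sali = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        distance = 1.0 - similarity[i][j]
        diff = abs(activities[i] - activities[j])
        if distance > 0:
            value = diff / distance
        else:
            # Identical representations make the ratio undefined;
            # returning infinity is a choice made for this sketch.
            value = float("inf") if diff > 0 else 0.0
        sali[i][j] = sali[j][i] = value
    return sali


# Toy data (hypothetical): compounds 0 and 1 are very similar but differ
# strongly in activity, so their SALI value is large.
activities = [7.0, 9.0, 7.2]
similarity = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
matrix = sali_matrix(activities, similarity)
```

In this toy example the pair (0, 1) has by far the largest entry, which is what the slide means by a cliff showing up as a very large matrix element.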
    • 7. Visualizing SALI Values
      The SALI graph
      Compounds are nodes
      Nodes i,j are connected if SALI(i,j) > X
      Only display connected nodes
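The graph construction on this slide can be sketched in a few lines of Python. The slide only states the thresholding rule; pointing each edge from the less active to the more active compound is an assumed convention here, and the function name `sali_graph` and the toy numbers are hypothetical.

```python
def sali_graph(activities, sali, cutoff):
    """Return (nodes, edges) for compound pairs whose SALI exceeds cutoff.

    Only nodes that take part in at least one edge are returned,
    mirroring "only display connected nodes".
    """
    edges = []
    n = len(activities)
    for i in range(n):
        for j in range(i + 1, n):
            if sali[i][j] > cutoff:
                # Assumed convention: tail = less active, head = more active.
                if activities[i] <= activities[j]:
                    edges.append((i, j))
                else:
                    edges.append((j, i))
    nodes = sorted({k for edge in edges for k in edge})
    return nodes, edges


# Toy SALI matrix (hypothetical values).
activities = [7.0, 9.0, 7.2]
sali = [
    [0.0, 20.0, 0.4],
    [20.0, 0.0, 3.6],
    [0.4, 3.6, 0.0],
]
nodes_low, edges_low = sali_graph(activities, sali, cutoff=3.0)
nodes_high, edges_high = sali_graph(activities, sali, cutoff=10.0)
```

Raising the cutoff removes edges, and nodes left without edges drop out of the display, so only the steepest cliffs remain.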
    • 8. What Can We Do With SALI’s?
      SALI characterizes cliffs & non-cliffs
      For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape
      Models try to encode this landscape
      Use the landscape to guide descriptor or model selection
    • 9. Descriptor Space Smoothness
      Edge count of the SALI graph for varying cutoffs
      Measures smoothness of the descriptor space
      Can reduce this to a single number (AUC)
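One way to read this slide is: count the edges that survive at each cutoff, then integrate that curve to get a single number. The sketch below does this in Python. Expressing cutoffs as fractions of the maximum SALI value, normalizing the edge count by the number of pairs, and using the trapezoid rule are all assumptions of this sketch, not details taken from the slides.

```python
def edge_fraction_curve(sali_values, cutoffs):
    """Fraction of compound pairs whose SALI exceeds each cutoff."""
    total = len(sali_values)
    return [sum(1 for v in sali_values if v > c) / total for c in cutoffs]


def trapezoid_auc(xs, ys):
    """Area under a piecewise-linear curve by the trapezoid rule."""
    return sum(
        (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
        for k in range(len(xs) - 1)
    )


# One SALI value per compound pair (hypothetical numbers).
sali_values = [20.0, 0.4, 3.6]
max_sali = max(sali_values)

# Cutoffs expressed as fractions of the largest SALI value (assumption).
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
cutoffs = [f * max_sali for f in fractions]

curve = edge_fraction_curve(sali_values, cutoffs)
auc = trapezoid_auc(fractions, curve)
```

Computing the same number for several descriptor sets would allow them to be compared on how quickly edges disappear as the cutoff rises.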
    • 10. Feature Selection Using SALI
      Instead of fingerprints, we use molecular descriptors
      SALI denominator now uses Euclidean distance
      2D & 3D random descriptor sets
      None are really good
      Too rough, or
      Too flat
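When descriptors replace fingerprints, the similarity-based denominator is swapped for a distance in descriptor space. A minimal Python sketch of that variant follows; using the raw, unscaled Euclidean distance and returning infinity for coincident descriptor vectors are assumptions, and `sali_euclidean` is a hypothetical name.

```python
import math


def sali_euclidean(act_i, act_j, desc_i, desc_j):
    """SALI-style ratio with a Euclidean distance in the denominator."""
    distance = math.dist(desc_i, desc_j)
    diff = abs(act_i - act_j)
    if distance == 0:
        # Coincident descriptor vectors: the ratio is undefined, so this
        # sketch returns infinity (or 0 when the activities also match).
        return float("inf") if diff > 0 else 0.0
    return diff / distance


# Hypothetical pair: activity difference 2.0, descriptor distance 5.0.
value = sali_euclidean(7.0, 9.0, [0.0, 0.0], [3.0, 4.0])
```

Because Euclidean distance is unbounded, descriptor scaling changes the resulting values, which may be one reason a descriptor set can look too rough or too flat.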
    • 11. Measuring Model Quality
      A QSAR model should easily encode the “rolling hills”
      A good model captures the most significant cliffs
      Can be formalized as
      How many of the edge orderings of a SALI graph does the model predict correctly?
      Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X
      Repeat for varying X and obtain the SALI curve
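The procedure above can be sketched in Python as follows. An edge counts as correctly predicted when the model ranks the two compounds in the same order as the observed activities. This sketch reports the raw count and the number of edges at each threshold; any normalization of S(X) is left out because the transcript does not give one. The function names and toy values are hypothetical.

```python
def edges_correct(edges, predicted):
    """Count edges (low, high) whose predicted activities keep that order."""
    return sum(1 for low, high in edges if predicted[low] < predicted[high])


def sali_curve(observed, predicted, sali_pairs, cutoffs):
    """Return a list of (cutoff, n_correct, n_edges) tuples.

    sali_pairs maps a compound pair (i, j) to its SALI value.
    """
    curve = []
    for cutoff in cutoffs:
        edges = []
        for (i, j), value in sali_pairs.items():
            if value > cutoff:
                # Orient each edge from less active to more active (observed).
                edges.append((i, j) if observed[i] <= observed[j] else (j, i))
        curve.append((cutoff, edges_correct(edges, predicted), len(edges)))
    return curve


# Toy data (hypothetical): the model mis-orders compounds 1 and 2.
observed = [7.0, 9.0, 7.2]
predicted = [7.5, 8.0, 8.4]
sali_pairs = {(0, 1): 20.0, (0, 2): 0.4, (1, 2): 3.6}
curve = sali_curve(observed, predicted, sali_pairs, [0.0, 3.0, 10.0])
```

Plotting the correct count (or its ratio to the edge count) against the cutoff gives the kind of curve the next slide refers to.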
    • 12. SALI Curves
    • 13. Predicting the Landscape
      Rather than predicting activity directly, we can try to predict the SAR landscape
      Implies that we attempt to directly predict cliffs
      Observations are now pairs of molecules
      A more complex problem
      Choice of features is trickier
      Still face the problem of cliffs as outliers
      Somewhat similar to predicting activity differences
      Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
    • 14. Motivation
      Predicting activity cliffs corresponds to extending the SAR landscape
      Identify whether a new molecule will perform better or worse compared to the specific molecules in the dataset
      Can be useful for guiding lead optimization, but not necessarily useful for lead hopping
    • 15. Predicting Cliffs
      The dependent variables are pairwise SALI values, calculated using fingerprints
      Independent variables are molecular descriptors – but considered pairwise
      Absolute difference of descriptor pairs, or
      Geometric mean of descriptor pairs

      Develop a model to correlate pairwise descriptors to pairwise SALI values
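The two aggregation functions named on this slide can be illustrated with a short Python sketch. It turns per-molecule descriptor vectors into per-pair feature vectors. The function names and descriptor values are hypothetical, and the geometric mean is applied here only to non-negative descriptors, which is an assumption of the sketch.

```python
import math
from itertools import combinations


def abs_diff(a, b):
    """Element-wise absolute difference of two descriptor vectors."""
    return [abs(x - y) for x, y in zip(a, b)]


def geo_mean(a, b):
    """Element-wise geometric mean; assumes non-negative descriptor values."""
    return [math.sqrt(x * y) for x, y in zip(a, b)]


def pairwise_features(descriptors, aggregate):
    """Map each unordered compound pair (i, j) to an aggregated vector."""
    return {
        (i, j): aggregate(descriptors[i], descriptors[j])
        for i, j in combinations(range(len(descriptors)), 2)
    }


# Three molecules with two descriptors each (hypothetical values).
descriptors = [[1.0, 4.0], [4.0, 9.0], [2.0, 1.0]]
diff_features = pairwise_features(descriptors, abs_diff)
mean_features = pairwise_features(descriptors, geo_mean)
```

Each resulting row lines up with one pairwise SALI value, so the pair, not the molecule, becomes the observation for model fitting.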
    • 16. A Test Case
      We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50 values
      Evaluate topological and physicochemical descriptors
      Developed random forest models
      On the original observed values (30 obs)
      On the SALI values (435 observations)
      Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853
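The jump from 30 to 435 observations follows from counting unordered pairs of molecules, n(n - 1)/2. A one-line check in Python (the helper name `n_pairs` is hypothetical):

```python
import math


def n_pairs(n_molecules):
    """Number of unordered molecule pairs, i.e. pairwise SALI observations."""
    return math.comb(n_molecules, 2)


cavalli_pairs = n_pairs(30)  # 30 * 29 / 2
```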
    • 17. Double Counting Structures?
      The dependent and independent variables both encode structure.
      But pretty low correlations between individual pairwise descriptors and the SALI values
    • 18. Model Summaries
      Original pIC50
      RMSE = 0.97
      SALI, AbsDiff
      RMSE = 1.10
      SALI, GeoMean
      RMSE = 1.04
      All models explain similar % of variance of their respective datasets
      Using geometric mean as the descriptor aggregation function seems to perform best
      SALI models are more robust due to larger size of the dataset
    • 19. Test Case 2
      Considered the Holloway docking dataset, 32 molecules with pIC50 values and Einter
      Similar strategy as before
      Need to transform SALI values
      Descriptors show minimal correlation
      Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317
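According to the speaker notes, the transformation used here was log10(SALI), with log10(SALI) set to 0 when SALI = 0. A minimal Python sketch of that rule follows; the function name `log_sali` is hypothetical.

```python
import math


def log_sali(value):
    """log10 transform of a SALI value, with SALI = 0 mapped to 0."""
    if value == 0:
        # log10(0) is undefined; the notes state it was set to 0.
        return 0.0
    return math.log10(value)


# Hypothetical SALI values spanning several orders of magnitude.
transformed = [log_sali(v) for v in [0.0, 0.1, 1.0, 100.0]]
```

Note that under this rule a SALI of 0 and a SALI of 1 both map to 0, so the transformed scale cannot distinguish them.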
    • 20. Model Summaries
      Original pIC50
      RMSE = 1.05
      SALI, AbsDiff
      RMSE = 0.48
      SALI, GeoMean
      RMSE = 0.48
      The SALI models perform much worse in terms of % of variance explained
      Descriptor aggregation method does not seem to have much effect
      The SALI models appear to perform decently on the cliffs, but miss the most significant ones
    • 21. Model Summaries
      Original pIC50
      RMSE = 1.05
      SALI, AbsDiff
      RMSE = 9.76
      SALI, GeoMean
      RMSE = 10.01
      With untransformed SALI values, models perform similarly in terms of % of variance explained
      The most significant cliffs correspond to stereoisomers
    • 22. Test Case 3
      38 adenosine receptor antagonists with reported Ki values; use 35 for training and 3 for testing
      Random forest model on the SALI values performed reasonably well (RMSE = 7.51, R2 = 0.62)
      Upper end of SALI range is better predicted
      Kalla, R.V. et al, J. Med. Chem., 2006, 48, 1984-2008
    • 23. Test Case 3
      The dataset does not contain really big cliffs
    • 24. Generally, performance is poorer for smaller cliffs
      For any given hold out molecule, range of error in SALI prediction is large
      Suggests that some form of domain applicability metric would be useful
    • 25. Model Caveats
      Models based on SALI values are dependent on there being an SAR in the original activity data
      Scrambling results for these models are poorer than the original models but aren’t as random as expected
    • 26. Conclusions
      SALI is the first step in characterizing the SAR landscape
      Allows us to directly analyze the landscape, as opposed to individual molecules
      Being able to predict the landscape could serve as a useful way to extend an SAR landscape
    • 27. Acknowledgements
      John Van Drie
      Gerry Maggiora