Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Navigating Molecular Haystacks: Tools & Applications

1,160 views

Published on

  • Login to see the comments

  • Be the first to like this

Navigating Molecular Haystacks: Tools & Applications

  1. 1. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Navigating Molecular Haystacks: Tools & Applications Finding Interesting Molecules & Doing it Fast Rajarshi Guha Department of Chemistry Pennsylvania State University 20th April, 2006
  2. 2. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work QSAR Applications Artemisinin analogs PDGFR inhibitors Bleaching agents Linear & non-linear methods QSAR Methods Representative QSAR sets Model interpretation Model applicability
  3. 3. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Cheminformatics Automated QSAR pipeline Contributions to the CDK Cheminformatics webservices R packages and snippets
  4. 4. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Chemical Data Mining Approximate k-NN Local regression Outlier detection VS protocol
  5. 5. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Chemical Data Mining Approximate k-NN Local regression Outlier detection VS protocol
  6. 6. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  7. 7. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  8. 8. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Diversity Analysis Why is it Important? Compound acquisition Lead hopping Knowledge of the distribution of compounds in a descriptor space may improve predictive models
  9. 9. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Approaches to Diversity Analysis Cell based q q q Divide space into bins q q q q qq q q qq q q q qq q q q q q q q qq q q qq qq q q qq q qq q q q q qq q q q Compounds are mapped to bins q q q q q qqq q qq q q qqq q qq q q q q q q q q q q q qqq q q q q q qqq q q q q q q q q q q qq q q q q q q q q qq q q q q q q qq q q q q q q q q q q q qq q q q q q q Disadvantages q qq q q q q Not useful for high dimensional data Choosing the bin size can be tricky Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45 Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
  10. 10. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Approaches to Diversity Analysis q Distance based q q q q q q Considers distance between q q q q q compounds in a space q q q q q q q q q q q Generally requires pairwise distance q q q q calculation q q q q q q q q q q q q q Can be sped up by kD trees, MVP q q q q q trees etc. q q q q q Agrafiotis, D. K.; Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
  11. 11. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  12. 12. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  13. 13. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Algorithm q q 50% q q Dmax ← max pairwise distance q q q q qq 30% for molecule in dataset do q q q q q q q q q q q q q q q q q q q R ← 0.01 × Dmax q q q q q q q qq q q q q q q q q qq q q q q q q 10% q q q q q q q q q q q 5% while R ≤ Dmax do q q q q qq qq q q q q q q qq q q q q qq q q q q q q q q q q q q qq q q q q q q q q Find NN’s within radius R q q q q q q q q q q q q q q q q q q q Increment R q q q q q q q q q q q q q q q q q end while q end for q Plot NN count vs. R Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.; J. Chem. Inf. Model., submitted
  14. 14. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve qq qq qqqq qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqq q qqqqqq qqqqq q qqqq qqqq 250 q 250 qq qqq qq Number of Neighbors Number of Neighbors q q 200 200 qq q q q 150 150 qq q q q q 100 100 q qq q q 50 q 50 qq q qq qq q qqq qq qqqqqqqqqq qqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqq 0 qqq qq 0 0 20 40 60 80 100 0 20 40 60 80 100 R R Sparse Dense
  15. 15. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to Numbers Since R-NN curves are sigmoidal, fit them to the logistic equation 1 + m e −R/τ NN = a · 1 + n e −R/τ m, n should characterize the curve Problems Two parameters Non-linear fitting is dependent on the starting point For some starting points, the fit does not converge and requires repetition
  16. 16. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  17. 17. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  18. 18. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing Multiple R-NN Curves Problem Visual inspection of curves is useful for a few molecules 80 For larger datasets we need to summarize R-NN curves 60 Rmax S 40 Solution Plot Rmax(S) values for each 20 molecule in the dataset Points at the top of the plot are 0 0 50 100 150 200 250 Serial Number located in the sparsest regions Points at the bottom are located in the densest regions
  19. 19. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves How Can We Use It For Large Datasets? Breaking the O(n2 ) barrier Traditional NN detection has a time complexity of O(n2 ) Modern NN algorithms such as kD-trees have lower time complexity restricted to the exact NN problem Solution is to use approximate NN algorithms such as Locality Sensitive Hashing (LSH) Bentley, J.; Commun. ACM 1980, 23, 214–229 Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004 Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  20. 20. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves How Can We Use It For Large Datasets? −1.0 −1.5 log [Mean Query Time (sec)] Why LSH? −2.0 Theoretically sublinear −2.5 Shown to be 3 orders of −3.0 LSH (142 descriptors) LSH (20 descriptors) magnitude faster than kNN (142 descriptors) kNN (20 descriptors) −3.5 traditional kNN 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Very accurate (> 94%) Radius Comparison of NN detection speed on a 42,000 compound dataset using a 200 point query set Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  21. 21. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Alternatives? Why not use PCA? R-NN curves are fundamentally a 4 form of dimension reduction 2 Principal Components Analysis is Principal Component 2 0 also a form of dimension reduction −2 Disadvantages −4 Eigendecomposition via SVD is −6 O(n3 ) 0 5 10 Principal Component 1 Difficult to visualize more than 2 or 3 PC’s at the same time
  22. 22. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Datasets Boiling point dataset 277 molecules Average: MW = 115, Stanimoto = 0.20 Calculated 214 descriptors, reduced to 64 Kazius-AMES dataset 4337 molecules Average MW = 240, Stanimoto = 0.21 Known to have a number of significant outliers Calculated 142 MOE descriptors, reduced to 45 Goll, E.; Jurs, P.; J. Chem. Inf. Comput. Sci. 1999, 39, 974–983 Kazius, J.; McGuire, R.; Bursi, R.; J. Med. Chem. 2005, 48, 312–320
  23. 23. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Choosing a Descriptor Space Boiling point dataset We had previously generated linear regression models using a GA to search for subsets Best model had 4 descriptors Kazius-AMES dataset No models were generated for this dataset Considered random descriptor subsets Used a 5-descriptor subset to represent the results The R-NN curve approach focuses on the distribution of molecules in a given descriptor space. Hence a good or random descriptor subset should be able to highlight the utility of the method
  24. 24. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Choosing a Descriptor Space Boiling point dataset We had previously generated linear regression models using a GA to search for subsets Best model had 4 descriptors Kazius-AMES dataset No models were generated for this dataset Considered random descriptor subsets Used a 5-descriptor subset to represent the results The R-NN curve approach focuses on the distribution of molecules in a given descriptor space. Hence a good or random descriptor subset should be able to highlight the utility of the method

×