Outlier Detection Using R-NN Curves
Searching for HIV Integrase Inhibitors
                            Summary




     Na...
Outlier Detection Using R-NN Curves
           Searching for HIV Integrase Inhibitors
                                    ...
Outlier Detection Using R-NN Curves
          Searching for HIV Integrase Inhibitors
                                     ...
Outlier Detection Using R-NN Curves
           Searching for HIV Integrase Inhibitors
                                    ...
Outlier Detection Using R-NN Curves
           Searching for HIV Integrase Inhibitors
                                    ...
Outlier Detection Using R-NN Curves
            Searching for HIV Integrase Inhibitors
                                   ...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
            Searching for HIV Integrase Inhibitors...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
          Searching for HIV Integrase Inhibitors  ...
Outlier Detection Using R-NN Curves       Methods for Diversity Analysis
                  Searching for HIV Integrase Inh...
Outlier Detection Using R-NN Curves       Methods for Diversity Analysis
                  Searching for HIV Integrase Inh...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Outlier Detection Using R-NN Curves        Methods for Diversity Analysis
                  Searching for HIV Integrase In...
Outlier Detection Using R-NN Curves                        Methods for Diversity Analysis
                                ...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Outlier Detection Using R-NN Curves        Methods for Diversity Analysis
                  Searching for HIV Integrase In...
Outlier Detection Using R-NN Curves        Methods for Diversity Analysis
                  Searching for HIV Integrase In...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
          Searching for HIV Integrase Inhibitors  ...
Outlier Detection Using R-NN Curves         Methods for Diversity Analysis
                  Searching for HIV Integrase I...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Outlier Detection Using R-NN Curves      Methods for Diversity Analysis
           Searching for HIV Integrase Inhibitors ...
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Navigating Molecular Haystacks: Tools & Applications
Upcoming SlideShare
Loading in...5
×

Navigating Molecular Haystacks: Tools & Applications

810

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
810
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
21
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Navigating Molecular Haystacks: Tools & Applications

  1. 1. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Navigating Molecular Haystacks: Tools & Applications Finding Interesting Molecules & Doing it Fast Rajarshi Guha Department of Chemistry Pennsylvania State University 20th April, 2006
  2. 2. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work QSAR Applications Artemisinin analogs PDGFR inhibitors Bleaching agents Linear & non-linear methods QSAR Methods Representative QSAR sets Model interpretation Model applicability
  3. 3. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Cheminformatics Automated QSAR pipeline Contributions to the CDK Cheminformatics webservices R packages and snippets
  4. 4. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Chemical Data Mining Approximate k-NN Local regression Outlier detection VS protocol
  5. 5. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Chemical Data Mining Approximate k-NN Local regression Outlier detection VS protocol
  6. 6. Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  7. 7. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  8. 8. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Diversity Analysis Why is it Important? Compound acquisition Lead hopping Knowledge of the distribution of compounds in a descriptor space may improve predictive models
  9. 9. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Approaches to Diversity Analysis Cell based q q q Divide space into bins q q q q qq q q qq q q q qq q q q q q q q qq q q qq qq q q qq q qq q q q q qq q q q Compounds are mapped to bins q q q q q qqq q qq q q qqq q qq q q q q q q q q q q q qqq q q q q q qqq q q q q q q q q q q qq q q q q q q q q qq q q q q q q qq q q q q q q q q q q q qq q q q q q q Disadvantages q qq q q q q Not useful for high dimensional data Choosing the bin size can be tricky Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45 Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
  10. 10. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Approaches to Diversity Analysis q Distance based q q q q q q Considers distance between q q q q q compounds in a space q q q q q q q q q q q Generally requires pairwise distance q q q q calculation q q q q q q q q q q q q q Can be sped up by kD trees, MVP q q q q q trees etc. q q q q q Agrafiotis, D. K.; Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
  11. 11. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  12. 12. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  13. 13. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Algorithm q q 50% q q Dmax ← max pairwise distance q q q q qq 30% for molecule in dataset do q q q q q q q q q q q q q q q q q q q R ← 0.01 × Dmax q q q q q q q qq q q q q q q q q qq q q q q q q 10% q q q q q q q q q q q 5% while R ≤ Dmax do q q q q qq qq q q q q q q qq q q q q qq q q q q q q q q q q q q qq q q q q q q q q Find NN’s within radius R q q q q q q q q q q q q q q q q q q q Increment R q q q q q q q q q q q q q q q q q end while q end for q Plot NN count vs. R Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.; J. Chem. Inf. Model., submitted
  14. 14. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve qq qq qqqq qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqq q qqqqqq qqqqq q qqqq qqqq 250 q 250 qq qqq qq Number of Neighbors Number of Neighbors q q 200 200 qq q q q 150 150 qq q q q q 100 100 q qq q q 50 q 50 qq q qq qq q qqq qq qqqqqqqqqq qqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqq 0 qqq qq 0 0 20 40 60 80 100 0 20 40 60 80 100 R R Sparse Dense
  15. 15. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to Numbers Since R-NN curves are sigmoidal, fit them to the logistic equation 1 + m e −R/τ NN = a · 1 + n e −R/τ m, n should characterize the curve Problems Two parameters Non-linear fitting is dependent on the starting point For some starting points, the fit does not converge and requires repetition
  16. 16. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  17. 17. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  18. 18. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing Multiple R-NN Curves Problem Visual inspection of curves is useful for a few molecules 80 For larger datasets we need to summarize R-NN curves 60 Rmax S 40 Solution Plot Rmax(S) values for each 20 molecule in the dataset Points at the top of the plot are 0 0 50 100 150 200 250 Serial Number located in the sparsest regions Points at the bottom are located in the densest regions
  19. 19. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves How Can We Use It For Large Datasets? Breaking the O(n2 ) barrier Traditional NN detection has a time complexity of O(n2 ) Modern NN algorithms such as kD-trees have lower time complexity restricted to the exact NN problem Solution is to use approximate NN algorithms such as Locality Sensitive Hashing (LSH) Bentley, J.; Commun. ACM 1980, 23, 214–229 Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004 Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  20. 20. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves How Can We Use It For Large Datasets? −1.0 −1.5 log [Mean Query Time (sec)] Why LSH? −2.0 Theoretically sublinear −2.5 Shown to be 3 orders of −3.0 LSH (142 descriptors) LSH (20 descriptors) magnitude faster than kNN (142 descriptors) kNN (20 descriptors) −3.5 traditional kNN 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Very accurate (> 94%) Radius Comparison of NN detection speed on a 42,000 compound dataset using a 200 point query set Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  21. 21. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Alternatives? Why not use PCA? R-NN curves are fundamentally a 4 form of dimension reduction 2 Principal Components Analysis is Principal Component 2 0 also a form of dimension reduction −2 Disadvantages −4 Eigendecomposition via SVD is −6 O(n3 ) 0 5 10 Principal Component 1 Difficult to visualize more than 2 or 3 PC’s at the same time
  22. 22. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Datasets Boiling point dataset 277 molecules Average: MW = 115, Stanimoto = 0.20 Calculated 214 descriptors, reduced to 64 Kazius-AMES dataset 4337 molecules Average MW = 240, Stanimoto = 0.21 Known to have a number of significant outliers Calculated 142 MOE descriptors, reduced to 45 Goll, E.; Jurs, P.; J. Chem. Inf. Comput. Sci. 1999, 39, 974–983 Kazius, J.; McGuire, R.; Bursi, R.; J. Med. Chem. 2005, 48, 312–320
  23. 23. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Choosing a Descriptor Space Boiling point dataset We had previously generated linear regression models using a GA to search for subsets Best model had 4 descriptors Kazius-AMES dataset No models were generated for this dataset Considered random descriptor subsets Used a 5-descriptor subset to represent the results The R-NN curve approach focuses on the distribution of molecules in a given descriptor space. Hence a good or random descriptor subset should be able to highlight the utility of the method
  24. 24. Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Choosing a Descriptor Space Boiling point dataset We had previously generated linear regression models using a GA to search for subsets Best model had 4 descriptors Kazius-AMES dataset No models were generated for this dataset Considered random descriptor subsets Used a 5-descriptor subset to represent the results The R-NN curve approach focuses on the distribution of molecules in a given descriptor space. Hence a good or random descriptor subset should be able to highlight the utility of the method
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×