Characterizing the Density of Chemical Spaces and
    its Use in Outlier Analysis and Clustering

                   Rajar...
Outline




      R-NN curves for diversity analysis
      Counting clusters with R-NN curves
      Density and domain app...
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
          ...
R-NN Curves for Diversity Analysis
                                   Linking R-NN Curves to Cluster Counts
              ...
R-NN Curves for Diversity Analysis
                                                            Linking R-NN Curves to Clus...
R-NN Curves for Diversity Analysis
                                                            Linking R-NN Curves to Clus...
R-NN Curves for Diversity Analysis
                                     Linking R-NN Curves to Cluster Counts
            ...
R-NN Curves for Diversity Analysis
                                     Linking R-NN Curves to Cluster Counts
            ...
R-NN Curves for Diversity Analysis
                                                               Linking R-NN Curves to C...
R-NN Curves for Diversity Analysis
                                                                                       ...
R-NN Curves for Diversity Analysis
                                      Linking R-NN Curves to Cluster Counts
           ...
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
          ...
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
          ...
R-NN Curves for Diversity Analysis
                                                       Linking R-NN Curves to Cluster C...
R-NN Curves for Diversity Analysis
                                                                                       ...
R-NN Curves for Diversity Analysis
                                                             Linking R-NN Curves to Clu...
R-NN Curves for Diversity Analysis
                                                             Linking R-NN Curves to Clu...
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
          ...
R-NN Curves for Diversity Analysis
                                                     Linking R-NN Curves to Cluster Cou...
R-NN Curves for Diversity Analysis
                                   Linking R-NN Curves to Cluster Counts
              ...
R-NN Curves for Diversity Analysis
                                                                                     Li...
R-NN Curves for Diversity Analysis
                                                                                     Li...
R-NN Curves for Diversity Analysis
                                   Linking R-NN Curves to Cluster Counts
              ...
R-NN Curves for Diversity Analysis
                                                              Linking R-NN Curves to Cl...
R-NN Curves for Diversity Analysis
                                                              Linking R-NN Curves to Cl...
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Upcoming SlideShare
Loading in …5
×

Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

743
-1

Published on

Published in: Education, Technology, Career
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
743
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
21
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

  1. 1. Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering Rajarshi Guha School of Informatics Indiana University 17th August, 2007 Cambridge, MA
  2. 2. Outline R-NN curves for diversity analysis Counting clusters with R-NN curves Density and domain applicability
  3. 3. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Nearest Neighbor Methods Traditional kNN methods are q simple, fast, intuitive q q q q q q q Applications in q q q q q q q regression & classification q q q q diversity analysis q q q q q q q q Can be misleading if the nearest q q q q q q q q q q q neighbor is far away q q q q q q q R-NN methods may be more q q q q q suitable
  4. 4. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Diversity Analysis Why is it Important? Compound acquisition Lead hopping Knowledge of the distribution of compounds in a descriptor space may improve predictive models
  5. 5. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Approaches to Diversity Analysis Cell based q q q Divide space into bins q q q q qq q q qq q q q qq q q q q q q q qq q q qq qq q q qq q qq q q q q qq q q q Compounds are mapped to bins q q q q q qqq q qq q q qqq q qq q q q q q q q q q q q qqq q q q q q qqq q q q q q q q q q q qq q q q q q q q q qq q q q q q q qq q q q q q q q q q q q qq q q q q q q Disadvantages q qq q q q q Not useful for high dimensional data Choosing the bin size can be tricky Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45 Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
  6. 6. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Approaches to Diversity Analysis q Distance based q q q q q q Considers distance between q q q q q compounds in a space q q q q q q q q q q q Generally requires pairwise distance q q q q calculation q q q q q q q q q q q q q Can be sped up by kD trees, MVP q q q q q trees etc. q q q q q Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
  7. 7. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  8. 8. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  9. 9. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Algorithm q q 50% q q Dmax ← max pairwise distance q q q q qq 30% for molecule in dataset do q q q q q q q q q q q q q q q q q q q R ← 0.01 × Dmax q q q q q q q qq q q q q q q q q qq q q q q q q 10% q q q q q q q q q q q 5% while R ≤ Dmax do q q q q qq qq q q q q q q qq q q q q qq q q q q q q q q q q q q qq q q q q q q q q Find NN’s within radius R q q q q q q q q q q q q q q q q q q q Increment R q q q q q q q q q q q q q q q q q end while q end for q Plot NN count vs. R Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772
  10. 10. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve qq qq qqqq qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqq q qqqqqq qqqqq q qqqq qqqq 250 q 250 qq qqq qq Number of Neighbors Number of Neighbors q q 200 200 qq q q q 150 150 qq q q q q 100 100 q qq q q 50 q 50 qq q qq qq q qqq qq qqqqqqqqqq qqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqq 0 qqq qq 0 0 20 40 60 80 100 0 20 40 60 80 100 R R Sparse Dense
  11. 11. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to Numbers Since R-NN curves are sigmoidal, fit them to the logistic equation 1 + me −R/τ NN = a · 1 + ne −R/τ m, n should characterize the curve Problems Two parameters Non-linear fitting is dependent on the starting point For some starting points, the fir does not converge and requires repetition
  12. 12. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  13. 13. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  14. 14. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing Multiple R-NN Curves Problem Visual inspection of curves is useful 3904 for a few molecules For larger datasets we need to 80 1861 summarize R-NN curves 60 Rmax S Solution 40 Plot Rmax(S) values for each molecule in the dataset 20 Points at the top of the plot are 0 1000 2000 3000 4000 located in the sparsest regions Serial Number Points at the bottom are located in the densest regions Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320
  15. 15. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Outliers qq q q 4000 q H2N HN N• Number of Neighbors O q HO O OH H2N O 3000 O O O HN H N N q H O H HN O N N H O O O O 2000 NH HN HO N q H N H HO O NH O H2N N H q O HN O •N N O O H 1000 NH NH q H2N S q S •N q HN q O NH2 O q qq OH NH qq qq O HN qq HN OH qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq HN 0 NH H O N O H S N O N NH 0 20 40 60 80 100 H O OH O O O O R HN NH NH O H2N N• NH2 HN 3904 •N NH NH2 O HO q qqqqqqqqqqqqqq qqqqqqqqqqqqqq 3904 4000 q q Number of Neighbors q 3000 q H HO O N S H2N N O O O OH N 2000 q NH O H OH N O HO O S H N N O NH NH2 O O q O N HO OH O N O O N 1000 N N q HO NH NH2 q OH HN O q H2N O q q q q qq O qqq qqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq H2N 0 0 20 40 60 80 100 R 1861 1861
  16. 16. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability How Can We Use It For Large Datasets? Breaking the O(n2 ) barrier Traditional NN detection has a time complexity of O(n2 ) Modern NN algorithms such as kD-trees have lower time complexity restricted to the exact NN problem Solution is to use approximate NN algorithms such as Locality Sensitive Hashing (LSH) Bentley, J.; Commun. ACM 1980, 23, 214–229 Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004 Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  17. 17. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability How Can We Use It For Large Datasets? −1.0 −1.5 log [Mean Query Time (sec)] Why LSH? −2.0 Theoretically sublinear −2.5 Shown to be 3 orders of −3.0 LSH (142 descriptors) LSH (20 descriptors) magnitude faster than kNN (142 descriptors) kNN (20 descriptors) −3.5 traditional kNN 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Very accurate (> 94%) Radius Comparison of NN detection speed on a 42,000 compound dataset using a 200 point query set Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  18. 18. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Alternatives? Why not use PCA? R-NN curves are fundamentally a form of dimension reduction 4 Principal Components Analysis is 2 also a form of dimension reduction Principal Component 2 0 Disadvantages −2 Eigendecomposition via SVD is −4 O(n3 ) −6 Difficult to visualize more than 2 or 0 5 10 Principal Component 1 3 PC’s at the same time We are no longer in the original descriptor space
  19. 19. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Clusters 251 84 300 300 250 250 200 200 15 150 150 100 100 251 50 50 10 0 0 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 300 300 250 250 5 84 200 200 150 150 137 100 100 0 50 50 0 0 −5 0 5 10 15 20 25 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed R-NN Curves R-NN curves are indicative of the number of clusters
  20. 20. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Clusters Counting the steps Approaches Essentially a curve matching Hausdorff / Fr´chet distance e problem - requires canonical curves All points will not be RMSE from distance matrix indicative of the number of Slope analysis clusters Not applicable for radially distributed clusters
  21. 21. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Their Slopes 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed first derivative of the R-NN Curves Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
  22. 22. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Their Slopes 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed first derivative of the R-NN Curves Identifying peaks identifies the number of clusters Automated picking can identify spurious peaks Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
  23. 23. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Slope Analysis of R-NN Curves RNN Curve 300 250 200 150 100 Procedure 50 for i in molecules do 0 0 20 40 60 80 100 D' Evaluate R-NN curve 8 F ← smoothed R-NN curve 6 Evaluate F 4 2 Smooth F Nroot,i ← no. of roots of F 0 0 20 40 60 80 100 end for D'' Ncluster = [max(Nroot ) + 1]/2 0.5 qq q q q q q q qqqq qqq q q q q q q q q qq q q q q q q q q q q q q q q q q 0.0 q q qqqqqqqq qqqqqqq q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q qq qq −0.5 qqq q 0 20 40 60 80 100
  24. 24. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Simulated Data Simulated 2D data using a Thomas point process Predicted k, followed by kmeans clustering using k Investigated similar values of k k ASW k ASW k ASW 4 0.61 3 0.44 4 0.64 3 0.74 2 0.65 3 0.48 5 0.70 4 0.47 5 0.56 ASW - average silhouette width, higher is better; k - number of clusters
  25. 25. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability A Mixed Dataset Dataset composition 277 DHFR inhibitors based on substituteed pyrimidinediamine and diaminopteridine scaffolds 277 molecules from the DIPPR project mainly simple hydrocarbons boiling point was modeled Evaluated 147 Molconn-Z descriptors, reduced to 24 We expect at least 3 clusters Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915 Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×