Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

on

  • 1,374 views

 

Statistics

Views

Total Views
1,374
Views on SlideShare
1,371
Embed Views
3

Actions

Likes
0
Downloads
16
Comments
0

1 Embed 3

http://www.slideshare.net 3

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering Presentation Transcript

  • 1. Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering Rajarshi Guha School of Informatics Indiana University 17th August, 2007 Cambridge, MA
  • 2. Outline R-NN curves for diversity analysis Counting clusters with R-NN curves Density and domain applicability
  • 3. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Nearest Neighbor Methods Traditional kNN methods are q simple, fast, intuitive q q q q q q q Applications in q q q q q q q regression & classification q q q q diversity analysis q q q q q q q q Can be misleading if the nearest q q q q q q q q q q q neighbor is far away q q q q q q q R-NN methods may be more q q q q q suitable
  • 4. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Diversity Analysis Why is it Important? Compound acquisition Lead hopping Knowledge of the distribution of compounds in a descriptor space may improve predictive models
  • 5. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Approaches to Diversity Analysis Cell based q q q Divide space into bins q q q q qq q q qq q q q qq q q q q q q q qq q q qq qq q q qq q qq q q q q qq q q q Compounds are mapped to bins q q q q q qqq q qq q q qqq q qq q q q q q q q q q q q qqq q q q q q qqq q q q q q q q q q q qq q q q q q q q q qq q q q q q q qq q q q q q q q q q q q qq q q q q q q Disadvantages q qq q q q q Not useful for high dimensional data Choosing the bin size can be tricky Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45 Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
  • 6. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Approaches to Diversity Analysis q Distance based q q q q q q Considers distance between q q q q q compounds in a space q q q q q q q q q q q Generally requires pairwise distance q q q q calculation q q q q q q q q q q q q q Can be sped up by kD trees, MVP q q q q q trees etc. q q q q q Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
  • 7. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  • 8. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  • 9. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Algorithm q q 50% q q Dmax ← max pairwise distance q q q q qq 30% for molecule in dataset do q q q q q q q q q q q q q q q q q q q R ← 0.01 × Dmax q q q q q q q qq q q q q q q q q qq q q q q q q 10% q q q q q q q q q q q 5% while R ≤ Dmax do q q q q qq qq q q q q q q qq q q q q qq q q q q q q q q q q q q qq q q q q q q q q Find NN’s within radius R q q q q q q q q q q q q q q q q q q q Increment R q q q q q q q q q q q q q q q q q end while q end for q Plot NN count vs. R Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772
  • 10. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve qq qq qqqq qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqq q qqqqqq qqqqq q qqqq qqqq 250 q 250 qq qqq qq Number of Neighbors Number of Neighbors q q 200 200 qq q q q 150 150 qq q q q q 100 100 q qq q q 50 q 50 qq q qq qq q qqq qq qqqqqqqqqq qqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqq 0 qqq qq 0 0 20 40 60 80 100 0 20 40 60 80 100 R R Sparse Dense
  • 11. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to Numbers Since R-NN curves are sigmoidal, fit them to the logistic equation 1 + me −R/τ NN = a · 1 + ne −R/τ m, n should characterize the curve Problems Two parameters Non-linear fitting is dependent on the starting point For some starting points, the fir does not converge and requires repetition
  • 12. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  • 13. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  • 14. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing Multiple R-NN Curves Problem Visual inspection of curves is useful 3904 for a few molecules For larger datasets we need to 80 1861 summarize R-NN curves 60 Rmax S Solution 40 Plot Rmax(S) values for each molecule in the dataset 20 Points at the top of the plot are 0 1000 2000 3000 4000 located in the sparsest regions Serial Number Points at the bottom are located in the densest regions Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320
  • 15. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Outliers qq q q 4000 q H2N HN N• Number of Neighbors O q HO O OH H2N O 3000 O O O HN H N N q H O H HN O N N H O O O O 2000 NH HN HO N q H N H HO O NH O H2N N H q O HN O •N N O O H 1000 NH NH q H2N S q S •N q HN q O NH2 O q qq OH NH qq qq O HN qq HN OH qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq HN 0 NH H O N O H S N O N NH 0 20 40 60 80 100 H O OH O O O O R HN NH NH O H2N N• NH2 HN 3904 •N NH NH2 O HO q qqqqqqqqqqqqqq qqqqqqqqqqqqqq 3904 4000 q q Number of Neighbors q 3000 q H HO O N S H2N N O O O OH N 2000 q NH O H OH N O HO O S H N N O NH NH2 O O q O N HO OH O N O O N 1000 N N q HO NH NH2 q OH HN O q H2N O q q q q qq O qqq qqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq H2N 0 0 20 40 60 80 100 R 1861 1861
  • 16. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability How Can We Use It For Large Datasets? Breaking the O(n2 ) barrier Traditional NN detection has a time complexity of O(n2 ) Modern NN algorithms such as kD-trees have lower time complexity restricted to the exact NN problem Solution is to use approximate NN algorithms such as Locality Sensitive Hashing (LSH) Bentley, J.; Commun. ACM 1980, 23, 214–229 Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004 Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  • 17. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability How Can We Use It For Large Datasets? −1.0 −1.5 log [Mean Query Time (sec)] Why LSH? −2.0 Theoretically sublinear −2.5 Shown to be 3 orders of −3.0 LSH (142 descriptors) LSH (20 descriptors) magnitude faster than kNN (142 descriptors) kNN (20 descriptors) −3.5 traditional kNN 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Very accurate (> 94%) Radius Comparison of NN detection speed on a 42,000 compound dataset using a 200 point query set Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  • 18. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Alternatives? Why not use PCA? R-NN curves are fundamentally a form of dimension reduction 4 Principal Components Analysis is 2 also a form of dimension reduction Principal Component 2 0 Disadvantages −2 Eigendecomposition via SVD is −4 O(n3 ) −6 Difficult to visualize more than 2 or 0 5 10 Principal Component 1 3 PC’s at the same time We are no longer in the original descriptor space
  • 19. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Clusters 251 84 300 300 250 250 200 200 15 150 150 100 100 251 50 50 10 0 0 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 300 300 250 250 5 84 200 200 150 150 137 100 100 0 50 50 0 0 −5 0 5 10 15 20 25 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed R-NN Curves R-NN curves are indicative of the number of clusters
  • 20. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Clusters Counting the steps Approaches Essentially a curve matching Hausdorff / Fr´chet distance e problem - requires canonical curves All points will not be RMSE from distance matrix indicative of the number of Slope analysis clusters Not applicable for radially distributed clusters
  • 21. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Their Slopes 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed first derivative of the R-NN Curves Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
  • 22. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Their Slopes 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed first derivative of the R-NN Curves Identifying peaks identifies the number of clusters Automated picking can identify spurious peaks Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
  • 23. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Slope Analysis of R-NN Curves RNN Curve 300 250 200 150 100 Procedure 50 for i in molecules do 0 0 20 40 60 80 100 D' Evaluate R-NN curve 8 F ← smoothed R-NN curve 6 Evaluate F 4 2 Smooth F Nroot,i ← no. of roots of F 0 0 20 40 60 80 100 end for D'' Ncluster = [max(Nroot ) + 1]/2 0.5 qq q q q q q q qqqq qqq q q q q q q q q qq q q q q q q q q q q q q q q q q 0.0 q q qqqqqqqq qqqqqqq q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q qq qq −0.5 qqq q 0 20 40 60 80 100
  • 24. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Simulated Data Simulated 2D data using a Thomas point process Predicted k, followed by kmeans clustering using k Investigated similar values of k k ASW k ASW k ASW 4 0.61 3 0.44 4 0.64 3 0.74 2 0.65 3 0.48 5 0.70 4 0.47 5 0.56 ASW - average silhouette width, higher is better; k - number of clusters
  • 25. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability A Mixed Dataset Dataset composition 277 DHFR inhibitors based on substituteed pyrimidinediamine and diaminopteridine scaffolds 277 molecules from the DIPPR project mainly simple hydrocarbons boiling point was modeled Evaluated 147 Molconn-Z descriptors, reduced to 24 We expect at least 3 clusters Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915 Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983
  • 26. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability A Mixed Dataset DHFR q 5 DIPPR q q qq q q qqq q q q qqq qq q qq q qq q q q qq qq qq qq q qq qq q qqq qq qqq q q q qq q q q q q q q qq q qq qqq q qq q q q qq q q qq qq qq q qq Desc. Set k ASW Purity q q q q qq qq qqq q qqq q q qq qq qq qq qq qqq q qq qq q qq q qq q qq q qqq 0 q qq q q q qqq q q qqq q qq q q qq q q qq q q qq q q q qq q q q qqqqq qq q q q q q q q q q qq qq q qq q q q qq q q q q qq q qq qq q q qq q q qq q q qq qqq q qq q 4 descriptors 2 0.71 0.63 qq q q qq q qq q PC2 q q q q q q −5 3 0.67 0.89 q q q 4 0.73 0.84 −10 6 descriptors 2 0.67 0.94 q 3 0.70 0.97 −10 −5 0 5 4 0.61 0.94 PC1 All 24 descriptors 2 0.29 0.96 PC plot indicates 3 main 3 0.33 0.96 clusters 4 0.23 0.90 In all cases, 3 clusters is optimal for both quality measures
  • 27. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Drawbacks Currently works well for well separated clusters But radial clusters will not be detected We now consider all the points in the dataset Inefficient Sampling experiments seem to indicate that we can make do with 45% of the points Could prioritize by considering points wth low Rmax(S) values Small changes in local density can mislead the algorithm But this is somewhat subjective
  • 28. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Domain Applicability for QSAR Models What is it? How similar should a new structure be to the training set to get a reliable prediction Approaches Descriptor based - distance from the centroid, leverage Structure based - fingerprint similarity Cluster based Stanforth, R.W. et al., QSAR. Comb. Sci., 2007, 26, 837–844 Netzeva, T.I. et al., Altern. Lab. Anim., 2005, 33, 155–173 Sheridan, R.P. et al., J. Chem. Inf. Comput. Sci., 2004, 44, 1912–1928
  • 29. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Domain Applicability for QSAR Models One Domain or Multiple Domains? It’s possible to encompass the entire training set into a single domain This is a very broad approach Modeling approaches based on neighborhoods essentially consider multiple, local, domains Even for a global model, some observations are better predicted than others Suggests that certain regions of a descriptor space are predicted better than others Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
  • 30. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Domain Applicability for QSAR Models One Domain or Multiple Domains? It’s possible to encompass the entire training set into a single domain This is a very broad approach Modeling approaches based on neighborhoods essentially consider multiple, local, domains Even for a global model, some observations are better predicted than others Suggests that certain regions of a descriptor space are predicted better than others Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
  • 31. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Local Descriptor Distributions PEOE_VSA.0.1 (TSET) PEOE_VSA.0.1 (Neighborhood) PEOE_VSA.0.1 (Neighborhood) 0.015 0.06 0.020 Density Density Density 0.010 0.04 0.010 0.005 0.02 0.000 0.000 0.00 For local regression models, 0 20 40 60 80 100 120 40 50 60 70 80 90 100 40 50 60 70 80 PEOE_VSA_NEG (TSET) PEOE_VSA_NEG (Neighborhood) PEOE_VSA_NEG (Neighborhood) the descriptor distribution in 0.020 0.06 0.008 0.015 Density Density Density 0.04 0.010 0.004 the neighborhood will affect 0.02 0.005 0.000 0.000 0.00 50 100 150 200 250 300 130 140 150 160 170 180 190 200 80 90 100 110 120 predictive ability SlogP (TSET) SlogP (Neighborhood) SlogP (Neighborhood) 1.2 0.5 0.4 1.0 0.4 The effect is more severe 0.8 0.3 Density Density Density 0.3 0.6 0.2 0.2 0.4 0.1 when methods such as kNN 0.1 0.2 0.0 0.0 0.0 0 1 2 3 4 5 6 1 2 3 4 2.0 2.5 3.0 are used which do not do SlogP_VSA1 (TSET) SlogP_VSA1 (Neighborhood) SlogP_VSA1 (Neighborhood) 0.020 0.10 0.04 0.015 0.08 0.03 any interpolation Density Density Density 0.06 0.010 0.02 0.04 0.005 0.01 0.02 Nearest neighbors may not 0.000 0.00 0.00 0 50 100 150 30 40 50 60 70 80 90 50 55 60 65 70 SMR_VSA4 (TSET) SMR_VSA4 (Neighborhood) SMR_VSA4 (Neighborhood) actually be nearby 0.04 0.12 0.04 0.03 0.03 0.08 Density Density Density 0.02 0.02 0.04 0.01 0.01 0.00 0.00 0.00 0 20 40 60 80 100 10 15 20 25 30 35 40 5 10 15 20 25 30 35 Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
  • 32. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Mapping Domain Applicability via Density Query molecules that occur in dense regions of of the training set descriptor space are expected to be well predicted Doesn’t penalize for being a little far from the centroid of the space
  • 33. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability An Aqeuous Solubility Dataset q q q q qq q qq qq q q q q q qq q qq q qq q q q qq q q qq q qq q q q q qq q q q qqqq q qq q q q q q q q q q q q qq q q 0 q q q q qq q q q q qq q q q q q qq qq q qq q q qq q q qqqq qq q q q q qq qq qqq q q q q qq q q qq q q q q qqqq qqq q q qq q q q qqqq qqqqq q q q q q q qq q q q qq q qqqqq q qq q q qq q q q q qq q q q q q qq q q qqqq qq q q qq q q q q qq q qq q q qqq q q q qq q q qq q q q q qq q qqq q q qq q qqqqqqqq qq q q qqq qq q qqq qq q q qq q qq qqqq q qq qq qq q q qq qqqq q q q qq q q qqq q q qq q qq q q q q qq qqqqqqqq q q q qq q q Predicted Aq. Solubility q q qqqqqq q qq qqqqqqqq q q q qq qq qqqqq qq q qq q qqq qqqq q q q qqq qqq qqqqq qq q q qq qqqq q qq q q qq q q q q qq qqqq q q q qq qqq qqq q qqq q q qq qq qqqqqqqq q q qq q qq q q q q q q qq q q q qq qq q qq q q qqqqqqqq Estimate Std. Error t value q q q qq qqqqqq qq qq q qqqq qqqq qq qq q q qqqq qq q q q q q q qqqqqq q q q qq q q qqqqq qq qq q qq q q q qqqq qq q qq q q qq q q q q q q q q qq q q qq q qq q q q q q q q qqqqqqqqq qq qq q q q qq q q q q q q qq qqqqq q q q q qqq q q q q q q qq q qq q q q q qq q q q q q q qq qqq q q q q q qq qq q q q q q qqq q q q qq q q qq q q qq q qqqqq q q q q qq qqqq q qqqqq qq q q q q qq q qqqqq q qqq qq q q q q q q qqq qqqqqqq q q q qq q q qq q qq q qqq qq q q qq q q q qq qq qq q q q qq qq q q qq q q qq q q qqq qq qqqq qq q q q q qq q q q qqqq q q q qq q q q qqqq q qqq qq q q q qqq q q qqq q q q qq qqq qq q qq qq q q (Intercept) -2.1103 0.0909 -23.20 q q q q qqq q q qq q qqq qqqq q q q q q q q qqqqq qq q qqq −5 q qq q q qq q q qq q q q qq qq q q qq q q q q q q qqq q q q q qqq q qq q q q q qq q qq qq qq qq q qq qq q q qq q q q PEOE RPC+1 1.8754 0.2437 7.70 q q q q q q qq q q q q q qq q q qq q q q q q qq q q q q q q q qq q q q q q qq q q PEOE RPC-1 3.4954 0.1738 20.11 q q q qq q q qq q q q qq q qq q q qq SlogP -0.8199 0.0131 -62.55 q q q q CASA 0.0008 0.0001 -9.83 −10 qq q RMSE = 0.7542 F (4, 1231) = 1993 −10 −5 0 Observed Aq. Solubility 100 q q 1236 compounds 80 q q q q q Evaluated 147 descriptors, q 60 q q Rmax(S) q q reduced to 61 q q q qq q q qq q q q q q q qq q q qq 40 q q qq q q q q q qq q q qq q q q qq q q q qqq q q qq q q qq qq q q q q qq q qq qq q qq Searched for a good subset qq qq q q qq q q q q qq q q qqq q q q qq q q q q q q q q q q q q q q q qq q qqq qq q q qq q q qq q q q q q q q q q q qqq qq q qq q q q q q qq qq q q q q q q qq q q qq q q qq qq q q qqq qq q q qq q q q q q q q q q q q q q q q q qq q q q q q q q q q q qqqqq q q q q q q qq 20 q q qq q q qq q qq q q q q q q q q q q qq q qqqq q q qq q q q q q q q q qq q q q q q qq qq q q q qqq q q q qq q q q q qqqq qq q q q qq q q q q q q q q q using a GA q q qqqqqq qq q qq qq q q q q q q qq q q q qq q q qqq qq qq q q qq q qq q q qq q q qqq q q q qq q q qq q q q q qq q q q q q qq q q q qqq q qq q qq qqqqqq q q q q q q qqq qq qqq qqqq q q q qq qq q qqq q qq qq q q qq q q q qq q qqq q qq q q qq qq q q q q q q q q q qqq q q q q q q qq q q q q qq qq qq qq q q q q q qqq q qqq q qqqq q q qqq qqqq qqqqq q q q qq q qq qq q qq qq qqq qqq qqq qqq qq q qq qq q q q qq qq qq q q qq qq q q qq qqq qqqq qq qqq qqq q q qq qq qq qq q qqqqq q q q qqqqqqqqq q q q qq q q q q qq q q q q qq q qq q q qq qq q q q qqq qq q q qqq q qq q qq q q q q q q q qq q q q q q q qq qq qq q qqq q q q q qqq qq q q qq q q q q q q q q q q qq qq q qq q q q q qq qqqq q q q q q qqqqqqq q q qq q q q q q q qq qq q q q q q q q q qq qq q qq qqqqq q q q qq q q q q q qq q q q q qqqq q q q qq q qq q q q q q q q qqqqq q qq q qqqqq q q q qqq q qqqqq qq q q qq q qqq q q q q q q qqqqqqqq qqq q qqq qqq q q q qq qqqqq qq q qq qqqq q q qq q qqq qqq qq q qq q q q qq q qqqqq q qqq q q qqq q qqq q q q qq qqq q q q qq q qq qq q qq q qq q q q qq q q q qq q q q q qq q q q qq q qq q qq qq q qq q 0 0 200 400 600 800 1000 1200 Huuskonen, J., J. Chem. Inf. Comput. Sci., 2000, 40, 773–777 Index
  • 34. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability An Aqueous Solubility Dataset 100 q q q q q q qq q qq qq q q q q q qq q qq q qq q q q qq q q qq q qq q q q qq q q q qqqq q qq 80 q q q q q q q q q q q q qq q q 0 q q q q qq q q q q qq q q q q q qq qq q qq q q qq q q qqqqq q q q q q q q qq qq qqqqq q q q qq qqqqq q q q qqq q q q q q q qq qqq qq q q q qq q qq q q q q q q qq q q q q q qq qq q qq qq q qq q q q q q qq q q qqqq qq q qq q qq q qq q q q q q q qqqq qq q q q q q qq q qq qq q q qqq q qq q qqqqqqqq qq q q qqq qq q qqq q q qq q q q qq qqq q q qq q qq qqqq q q q qqqqqqqqqq q q q q qq q q q q qqq q q q qqq qq qqq q qq q Predicted Aq. Solubility q q qq q q q q q qq qqq q q qq qqqqq qqq q qq q q q q q qq qq qqqqq qq qq qqqqq q q qq qq q q q qqq qqq qqqqq qq q q q qq qq qqq q q qq q q q q qq qqqq q q q qq qqq qqq q qqq q qq q qq qq q q q q qq q q q q q q qqq q q qq qqqqqqq q qq q q qqqqqqqq q q q qq qqqqqq qq qq q q q qqq q q qqqq qqqq q q qq q qq q q q q q q q q q qqqq qq q q q q q q q qqq q q q q q q q 60 q qq q q qq q q qq qq q q qq qqq qqq q q qqqq q qqq q q q q q qqqqqqqqq qq qq q qq qq q q q q q q q q q q qq qqqqq q q q q qq q qq q q q q q qq q q q q q qq qq qq q q q q q q q qqq qq qq qq q qq q qqqqq qq q Rmax(S) qqq q q q q q q qqqqq q q q q q qqqq q qqqq qq q q qq qq q q q q q q q qqq qqq qq q q q q q q q q qq q q q q q q q q qq qq q q q qq q qqqqqqqqq q q q q q qqqq qqq q q q q q q qqq qq q q q qqqqq qqq qq qqqq qq q qq q q q q q q q qqqq q qq q q q qqqq q qqq qq q qq q q q q qqq q q qqq q q q qq qqq qq q qqq q q q q qqq q q q q qq q qq q q qqq qqqq q q q q q q q q qqqqq qq qq q q −5 q qq q q qq q q qq q q q qq q q q q qq qq q q q qq q q q q qqq q q q q q qq q q q q q q q q q qq q qq qq qq qq q qq qq q q q q q 40 q qq q q q q q q q q q qq q q q q q q q q qq q q qq q q q qq q q q q q q q q q q q qq q q qqq q q q qq q qq q q q qq q qq q q q q q q q q q q q qq q q q q q q qq q q q qq q q q qq q qq q q q qq q q qq q qq q q q qq q qq q q qq q q qq q q q qq q q qq q q qq q qq q q q qq q q q q q q q q q q qq q q q q q q qq q qq q qq q q qq q qq q q qqq q q q qq q q q q q q q qq q q qq qq q q q qq q qqq q q q q q q q q q q qq q q q q q q qq q q q q q q qq q q qqqqq q q q q qqq q q q q 20 q q q q q qq qq q q q q q q q q q q q q qq q q q qq q q q qqq q q q q qq q qqq q q q q q q q qq q q q q q q q q q qqq qq q q q q q q q q q q q q qq q q qq q qq q q q q q qq qq qq qq q q q q qqq q q qq q q q q qq qq q qq q q qq q q qq q q qq q q qqq q qq q q qq q q qq q q qq q q q q q q q q q q q q q q q q q q qq q q q q q q q qq q q q qq q q q q qq q q q q q qq qq q q −10 qq q q q q q q qq q q qqqq q q qq qq qq q q q q q q qq q qq qq q q q q qq q q q q q q q q q qq q q q qq qq qq q q q q q q q q q q qq q qq qq q q q qq qq q q q q q q q qq q q q q qqqqq qq q qq q q q qq q qqqqqq q q q q q q q q q qq qq q qq q q q q q q qqq q q qq q qqqqqq q q q qqqq qq q q qqq q qq q qq qq q q qq qq q q q q q q qq qqq qqqq q q qq q q qqqq q q q q q qq qq qq q q q q q q q q q q q qq q q qq q q q q qq q q qq q q q q q q qq qqq qq q q q q q q q q qqq qq q q q q qq q q qq q q q q q qq q q q q q q qq q q qq q qq q q q q q qqqqqq q q q qq qqqqq qq qq qq q q qq qqqq qq q qq q q q qq q qqq qqq qq q q q q qqq q q q q q qqqqqqqqqqq qqqqq q qq q q q qq q q qqqqqqq qqqqqqq qqq q q qqqqqqqqq q qq qqq q qqqqqq q qq q qq q qq qqq qq qqq q q q q q qq q q q qqqqqqqqq qqqqq q q qq q qq qqqqq qq qqqqq q q q qq q q q q qq q q q qq qq q q q q q q q q q q qq q qq q qq q q 0 −10 −5 0 0 1 2 3 4 5 Observed Aq. Solubility Absolute Studentized Residual Isolated compounds can be predicted well The Rmax(S) values could be calculated more rigorously from a smoothed curve
  • 35. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Mapping Domain Applicability via Density Caveats The approach assumes that similar compounds will have similar activities This is not always true (activity cliffs) Correlation between density as measured by Rmax(S) and prediction accuracy needs more investigation Take into account descriptor importance via weighted Euclidean distances Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
  • 36. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Summary R-NN curves . . . Simple way to characterize spatial distributions and identify outliers Applicable to datasets of arbitrary dimensions and size, via approximate NN algorithms such as LSH Summarizing a dataset does not require user-defined parameters . . . Clustering Provides an approach to a priori identification of the number of clusters, avoiding trial and error Appears to be more reliable than the silhouette width Probably not useful for hierarchical clusterings
  • 37. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability