Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering - Presentation Transcript
Characterizing the Density of Chemical Spaces and
its Use in Outlier Analysis and Clustering
Rajarshi Guha
School of Informatics
Indiana University
17th August, 2007
Cambridge, MA
Outline
R-NN curves for diversity analysis
Counting clusters with R-NN curves
Density and domain applicability
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Nearest Neighbor Methods
Traditional kNN methods are q
simple, fast, intuitive q q
q
q
q
q
q
Applications in q
q
q
q
q q q
regression & classification q
q
q
q
diversity analysis q
q
q q q
q q
q
Can be misleading if the nearest
q q q
q
q q
q
q q
q q
neighbor is far away q
q
q
q
q
q
q
R-NN methods may be more q
q
q
q
q
suitable
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Diversity Analysis
Why is it Important?
Compound acquisition
Lead hopping
Knowledge of the distribution of compounds in a descriptor
space may improve predictive models
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Approaches to Diversity Analysis
Cell based q q
q
Divide space into bins q
q q
q
qq
q q
qq
q
q
q
qq
q
q
q
q
q q
q
qq
q
q
qq qq q
q
qq q
qq q q
q q
qq q q q
Compounds are mapped to bins
q q q q
q qqq q qq q
q qqq q qq q
q q q q
q q q q q q qqq q
q q
q q qqq q q q
q q q q q
q q
qq q q
q
q q q q q
qq q q q
q q q qq q
q q
q q
q q q
q
q
q
qq q q
q q q q
Disadvantages q
qq q
q
q
q
Not useful for high dimensional
data
Choosing the bin size can be tricky
Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45
Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Approaches to Diversity Analysis
q
Distance based q q
q
q
q
q
Considers distance between q
q
q
q
q
compounds in a space q q
q
q
q
q q
q q q q
Generally requires pairwise distance q
q
q
q
calculation q q q
q
q q
q
q q
q q
q q
Can be sped up by kD trees, MVP q
q
q
q
q
trees etc. q
q
q
q
q
Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
Observations
Consider a query point with a hypersphere, of radius R,
centered on it
For small R, the hypersphere will contain very few or no
neighbors
For larger R, the number of neighbors will increase
When R ≥ Dmax , the neighbor set is the whole dataset
The question is . . .
Does the variation of nearest neighbor count with radius allow us
to characterize the location of a query point in a dataset?
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
Observations
Consider a query point with a hypersphere, of radius R,
centered on it
For small R, the hypersphere will contain very few or no
neighbors
For larger R, the number of neighbors will increase
When R ≥ Dmax , the neighbor set is the whole dataset
The question is . . .
Does the variation of nearest neighbor count with radius allow us
to characterize the location of a query point in a dataset?
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
Algorithm q
q 50%
q q
Dmax ← max pairwise distance q q
q
q
qq
30%
for molecule in dataset do q q
q q
q
q q
q
q q
q
q
q
q
q
q
q q q
R ← 0.01 × Dmax
q q q
q
q q
q qq q q q q
q q q q
qq q
q
q q q
q 10%
q q q q q q
q
q q q
q 5%
while R ≤ Dmax do
q q q
q qq qq
q q q q q
q qq
q q q q
qq q
q q q
q q q
q q q
q
q qq q q q
q q
q q q
Find NN’s within radius R
q q
q q
q
q
q
q
q q q q
q q q
q
q q q
Increment R
q q
q q q
q q q
q
q q q q q
q
q
q
end while q
end for q
Plot NN count vs. R
Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
qq
qq
qqqq
qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqq
q qqqqqq
qqqqq
q
qqqq
qqqq
250
q
250
qq
qqq
qq
Number of Neighbors
Number of Neighbors
q q
200
200
qq
q
q
q
150
150
qq
q
q
q
q
100
100
q
qq
q
q
50
q
50
qq
q
qq
qq q
qqq
qq
qqqqqqqqqq
qqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqq
0
qqq
qq
0
0 20 40 60 80 100 0 20 40 60 80 100
R R
Sparse Dense
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to Numbers
Since R-NN curves are sigmoidal, fit them to the logistic
equation
1 + me −R/τ
NN = a ·
1 + ne −R/τ
m, n should characterize the curve
Problems
Two parameters
Non-linear fitting is dependent on the starting point
For some starting points, the fir does not converge and
requires repetition
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to a Number
Determine the value of R where the
lower tail transitions to the linear
portion of the curve S=0
S’’
Solution
Determine the slope at various
points on the curve S’
S=0
Find R for the first occurence of
the maximal slope (Rmax(S) )
Can be achieved using a finite
difference approach
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to a Number
Determine the value of R where the
lower tail transitions to the linear
portion of the curve S=0
S’’
Solution
Determine the slope at various
points on the curve S’
S=0
Find R for the first occurence of
the maximal slope (Rmax(S) )
Can be achieved using a finite
difference approach
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing Multiple R-NN Curves
Problem
Visual inspection of curves is useful
3904
for a few molecules
For larger datasets we need to
80
1861
summarize R-NN curves
60
Rmax S
Solution
40
Plot Rmax(S) values for each
molecule in the dataset
20
Points at the top of the plot are
0 1000 2000 3000 4000
located in the sparsest regions Serial Number
Points at the bottom are located in
the densest regions
Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Outliers
qq
q
q
4000
q H2N
HN N•
Number of Neighbors
O
q
HO
O OH
H2N
O
3000
O O
O HN H
N N
q H
O
H HN O
N N
H O
O O O
2000
NH
HN HO N
q H
N
H HO
O NH
O H2N
N
H
q O HN O
•N N O
O H
1000
NH NH
q
H2N S
q
S •N
q HN
q O NH2
O
q
qq
OH NH
qq
qq O HN
qq HN OH
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq HN
0
NH
H
O N O
H
S N O
N NH
0 20 40 60 80 100 H
O OH O
O
O
O
R HN
NH
NH
O
H2N
N• NH2
HN
3904 •N
NH
NH2
O
HO
q
qqqqqqqqqqqqqq
qqqqqqqqqqqqqq
3904
4000
q
q
Number of Neighbors
q
3000
q H
HO
O N S
H2N
N O
O
O OH N
2000
q NH O H
OH N O
HO O
S H
N N
O
NH NH2
O O
q O N
HO
OH O N
O O
N
1000
N
N
q
HO NH NH2
q OH HN
O
q
H2N O
q
q
q
q
qq O
qqq
qqq
qqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqq H2N
0
0 20 40 60 80 100
R
1861
1861
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
How Can We Use It For Large Datasets?
Breaking the O(n2 ) barrier
Traditional NN detection has a time complexity of O(n2 )
Modern NN algorithms such as kD-trees
have lower time complexity
restricted to the exact NN problem
Solution is to use approximate NN algorithms such as Locality
Sensitive Hashing (LSH)
Bentley, J.; Commun. ACM 1980, 23, 214–229
Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004
Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
How Can We Use It For Large Datasets?
−1.0
−1.5
log [Mean Query Time (sec)]
Why LSH?
−2.0
Theoretically sublinear
−2.5
Shown to be 3 orders of
−3.0
LSH (142 descriptors)
LSH (20 descriptors)
magnitude faster than kNN (142 descriptors)
kNN (20 descriptors)
−3.5
traditional kNN
0.02 0.03 0.04 0.05 0.06 0.07 0.08
Very accurate (> 94%) Radius
Comparison of NN detection speed
on a 42,000 compound dataset
using a 200 point query set
Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Alternatives?
Why not use PCA?
R-NN curves are fundamentally a
form of dimension reduction
4
Principal Components Analysis is
2
also a form of dimension reduction
Principal Component 2
0
Disadvantages
−2
Eigendecomposition via SVD is
−4
O(n3 )
−6
Difficult to visualize more than 2 or 0 5 10
Principal Component 1
3 PC’s at the same time
We are no longer in the original
descriptor space
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Clusters
Counting the steps Approaches
Essentially a curve matching Hausdorff / Fr´chet distance
e
problem - requires canonical curves
All points will not be RMSE from distance matrix
indicative of the number of Slope analysis
clusters
Not applicable for radially
distributed clusters
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Their Slopes
251 84 251 84
300
300
14
12
20
250
250
Slope of the RNN Curve
Slope of the RNN Curve
10
200
200
15
8
150
150
10
6
100
100
4
5
2
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
88 137 88 137
300
300
15
250
250
15
Slope of the RNN Curve
Slope of the RNN Curve
200
200
10
10
150
150
100
100
5
5
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
Smoothed first derivative of the R-NN Curves
Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Their Slopes
251 84 251 84
300
300
14
12
20
250
250
Slope of the RNN Curve
Slope of the RNN Curve
10
200
200
15
8
150
150
10
6
100
100
4
5
2
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
88 137 88 137
300
300
15
250
250
15
Slope of the RNN Curve
Slope of the RNN Curve
200
200
10
10
150
150
100
100
5
5
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
Smoothed first derivative of the R-NN Curves
Identifying peaks identifies the number of clusters
Automated picking can identify spurious peaks
Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Simulated Data
Simulated 2D data using a Thomas point process
Predicted k, followed by kmeans clustering using k
Investigated similar values of k
k ASW k ASW k ASW
4 0.61 3 0.44 4 0.64
3 0.74 2 0.65 3 0.48
5 0.70 4 0.47 5 0.56
ASW - average silhouette width, higher is better; k - number of clusters
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
A Mixed Dataset
Dataset composition
277 DHFR inhibitors based on
substituteed pyrimidinediamine and
diaminopteridine scaffolds
277 molecules from the DIPPR project
mainly simple hydrocarbons
boiling point was modeled
Evaluated 147 Molconn-Z descriptors,
reduced to 24
We expect at least 3 clusters
Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915
Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Drawbacks
Currently works well for well separated clusters
But radial clusters will not be detected
We now consider all the points in the dataset
Inefficient
Sampling experiments seem to indicate that we can make do
with 45% of the points
Could prioritize by considering points wth low Rmax(S) values
Small changes in local density can mislead the algorithm
But this is somewhat subjective
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Domain Applicability for QSAR Models
What is it?
How similar should a new structure be to the training set to
get a reliable prediction
Approaches
Descriptor based - distance from the centroid, leverage
Structure based - fingerprint similarity
Cluster based
Stanforth, R.W. et al., QSAR. Comb. Sci., 2007, 26, 837–844
Netzeva, T.I. et al., Altern. Lab. Anim., 2005, 33, 155–173
Sheridan, R.P. et al., J. Chem. Inf. Comput. Sci., 2004, 44, 1912–1928
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Domain Applicability for QSAR Models
One Domain or Multiple Domains?
It’s possible to encompass the entire training set into a single
domain
This is a very broad approach
Modeling approaches based on neighborhoods essentially
consider multiple, local, domains
Even for a global model, some observations are better
predicted than others
Suggests that certain regions of a descriptor space are
predicted better than others
Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Domain Applicability for QSAR Models
One Domain or Multiple Domains?
It’s possible to encompass the entire training set into a single
domain
This is a very broad approach
Modeling approaches based on neighborhoods essentially
consider multiple, local, domains
Even for a global model, some observations are better
predicted than others
Suggests that certain regions of a descriptor space are
predicted better than others
Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Local Descriptor Distributions
PEOE_VSA.0.1 (TSET) PEOE_VSA.0.1 (Neighborhood) PEOE_VSA.0.1 (Neighborhood)
0.015
0.06
0.020
Density
Density
Density
0.010
0.04
0.010
0.005
0.02
0.000
0.000
0.00
For local regression models,
0 20 40 60 80 100 120 40 50 60 70 80 90 100 40 50 60 70 80
PEOE_VSA_NEG (TSET) PEOE_VSA_NEG (Neighborhood) PEOE_VSA_NEG (Neighborhood)
the descriptor distribution in
0.020
0.06
0.008
0.015
Density
Density
Density
0.04
0.010
0.004
the neighborhood will affect
0.02
0.005
0.000
0.000
0.00
50 100 150 200 250 300 130 140 150 160 170 180 190 200 80 90 100 110 120
predictive ability SlogP (TSET) SlogP (Neighborhood) SlogP (Neighborhood)
1.2
0.5
0.4
1.0
0.4
The effect is more severe
0.8
0.3
Density
Density
Density
0.3
0.6
0.2
0.2
0.4
0.1
when methods such as kNN
0.1
0.2
0.0
0.0
0.0
0 1 2 3 4 5 6 1 2 3 4 2.0 2.5 3.0
are used which do not do SlogP_VSA1 (TSET) SlogP_VSA1 (Neighborhood) SlogP_VSA1 (Neighborhood)
0.020
0.10
0.04
0.015
0.08
0.03
any interpolation Density
Density
Density
0.06
0.010
0.02
0.04
0.005
0.01
0.02
Nearest neighbors may not
0.000
0.00
0.00
0 50 100 150 30 40 50 60 70 80 90 50 55 60 65 70
SMR_VSA4 (TSET) SMR_VSA4 (Neighborhood) SMR_VSA4 (Neighborhood)
actually be nearby
0.04
0.12
0.04
0.03
0.03
0.08
Density
Density
Density
0.02
0.02
0.04
0.01
0.01
0.00
0.00
0.00
0 20 40 60 80 100 10 15 20 25 30 35 40 5 10 15 20 25 30 35
Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Mapping Domain Applicability via Density
Query molecules that occur
in dense regions of of the
training set descriptor space
are expected to be well
predicted
Doesn’t penalize for being a
little far from the centroid of
the space
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Mapping Domain Applicability via Density
Caveats
The approach assumes that similar compounds will have
similar activities
This is not always true (activity cliffs)
Correlation between density as measured by Rmax(S) and
prediction accuracy needs more investigation
Take into account descriptor importance via weighted
Euclidean distances
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Summary
R-NN curves . . .
Simple way to characterize spatial distributions and identify
outliers
Applicable to datasets of arbitrary dimensions and size, via
approximate NN algorithms such as LSH
Summarizing a dataset does not require user-defined
parameters
. . . Clustering
Provides an approach to a priori identification of the number
of clusters, avoiding trial and error
Appears to be more reliable than the silhouette width
Probably not useful for hierarchical clusterings
R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
0 comments
Post a comment