Navigating Molecular Haystacks: Tools & Applications
Upcoming SlideShare
Loading in...5
×
 

Navigating Molecular Haystacks: Tools & Applications

on

  • 1,329 views

 

Statistics

Views

Total Views
1,329
Views on SlideShare
1,328
Embed Views
1

Actions

Likes
0
Downloads
20
Comments
0

1 Embed 1

http://www.slideshare.net 1

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Navigating Molecular Haystacks: Tools & Applications Navigating Molecular Haystacks: Tools & Applications Presentation Transcript

  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Navigating Molecular Haystacks: Tools & Applications Finding Interesting Molecules & Doing it Fast Rajarshi Guha Department of Chemistry Pennsylvania State University 20th April, 2006
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work QSAR Applications Artemisinin analogs PDGFR inhibitors Bleaching agents Linear & non-linear methods QSAR Methods Representative QSAR sets Model interpretation Model applicability
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Cheminformatics Automated QSAR pipeline Contributions to the CDK Cheminformatics webservices R packages and snippets
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Chemical Data Mining Approximate k-NN Local regression Outlier detection VS protocol
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Past Work Chemical Data Mining Approximate k-NN Local regression Outlier detection VS protocol
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Diversity Analysis Why is it Important? Compound acquisition Lead hopping Knowledge of the distribution of compounds in a descriptor space may improve predictive models
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Approaches to Diversity Analysis Cell based q q q Divide space into bins q q q q qq q q qq q q q qq q q q q q q q qq q q qq qq q q qq q qq q q q q qq q q q Compounds are mapped to bins q q q q q qqq q qq q q qqq q qq q q q q q q q q q q q qqq q q q q q qqq q q q q q q q q q q qq q q q q q q q q qq q q q q q q qq q q q q q q q q q q q qq q q q q q q Disadvantages q qq q q q q Not useful for high dimensional data Choosing the bin size can be tricky Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45 Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Approaches to Diversity Analysis q Distance based q q q q q q Considers distance between q q q q q compounds in a space q q q q q q q q q q q Generally requires pairwise distance q q q q calculation q q q q q q q q q q q q q Can be sped up by kD trees, MVP q q q q q trees etc. q q q q q Agrafiotis, D. K.; Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve Algorithm q q 50% q q Dmax ← max pairwise distance q q q q qq 30% for molecule in dataset do q q q q q q q q q q q q q q q q q q q R ← 0.01 × Dmax q q q q q q q qq q q q q q q q q qq q q q q q q 10% q q q q q q q q q q q 5% while R ≤ Dmax do q q q q qq qq q q q q q q qq q q q q qq q q q q q q q q q q q q qq q q q q q q q q Find NN’s within radius R q q q q q q q q q q q q q q q q q q q Increment R q q q q q q q q q q q q q q q q q end while q end for q Plot NN count vs. R Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.; J. Chem. Inf. Model., submitted
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Generating an R-NN Curve qq qq qqqq qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqq q qqqqqq qqqqq q qqqq qqqq 250 q 250 qq qqq qq Number of Neighbors Number of Neighbors q q 200 200 qq q q q 150 150 qq q q q q 100 100 q qq q q 50 q 50 qq q qq qq q qqq qq qqqqqqqqqq qqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqq 0 qqq qq 0 0 20 40 60 80 100 0 20 40 60 80 100 R R Sparse Dense
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to Numbers Since R-NN curves are sigmoidal, fit them to the logistic equation 1 + m e −R/τ NN = a · 1 + n e −R/τ m, n should characterize the curve Problems Two parameters Non-linear fitting is dependent on the starting point For some starting points, the fit does not converge and requires repetition
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Characterizing Multiple R-NN Curves Problem Visual inspection of curves is useful for a few molecules 80 For larger datasets we need to summarize R-NN curves 60 Rmax S 40 Solution Plot Rmax(S) values for each 20 molecule in the dataset Points at the top of the plot are 0 0 50 100 150 200 250 Serial Number located in the sparsest regions Points at the bottom are located in the densest regions
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves How Can We Use It For Large Datasets? Breaking the O(n2 ) barrier Traditional NN detection has a time complexity of O(n2 ) Modern NN algorithms such as kD-trees have lower time complexity restricted to the exact NN problem Solution is to use approximate NN algorithms such as Locality Sensitive Hashing (LSH) Bentley, J.; Commun. ACM 1980, 23, 214–229 Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004 Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves How Can We Use It For Large Datasets? −1.0 −1.5 log [Mean Query Time (sec)] Why LSH? −2.0 Theoretically sublinear −2.5 Shown to be 3 orders of −3.0 LSH (142 descriptors) LSH (20 descriptors) magnitude faster than kNN (142 descriptors) kNN (20 descriptors) −3.5 traditional kNN 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Very accurate (> 94%) Radius Comparison of NN detection speed on a 42,000 compound dataset using a 200 point query set Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Alternatives? Why not use PCA? R-NN curves are fundamentally a 4 form of dimension reduction 2 Principal Components Analysis is Principal Component 2 0 also a form of dimension reduction −2 Disadvantages −4 Eigendecomposition via SVD is −6 O(n3 ) 0 5 10 Principal Component 1 Difficult to visualize more than 2 or 3 PC’s at the same time
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Datasets Boiling point dataset 277 molecules Average: MW = 115, Stanimoto = 0.20 Calculated 214 descriptors, reduced to 64 Kazius-AMES dataset 4337 molecules Average MW = 240, Stanimoto = 0.21 Known to have a number of significant outliers Calculated 142 MOE descriptors, reduced to 45 Goll, E.; Jurs, P.; J. Chem. Inf. Comput. Sci. 1999, 39, 974–983 Kazius, J.; McGuire, R.; Bursi, R.; J. Med. Chem. 2005, 48, 312–320
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Choosing a Descriptor Space Boiling point dataset We had previously generated linear regression models using a GA to search for subsets Best model had 4 descriptors Kazius-AMES dataset No models were generated for this dataset Considered random descriptor subsets Used a 5-descriptor subset to represent the results The R-NN curve approach focuses on the distribution of molecules in a given descriptor space. Hence a good or random descriptor subset should be able to highlight the utility of the method
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Choosing a Descriptor Space Boiling point dataset We had previously generated linear regression models using a GA to search for subsets Best model had 4 descriptors Kazius-AMES dataset No models were generated for this dataset Considered random descriptor subsets Used a 5-descriptor subset to represent the results The R-NN curve approach focuses on the distribution of molecules in a given descriptor space. Hence a good or random descriptor subset should be able to highlight the utility of the method
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves R-NN Curves for the BP Dataset 6 18 21 22 24 qqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqq qqqqqqqqqqqqqqq qqqqqqq qqqqq qqqqqqqqqqqqqq qqqqqqq qqqqqq qqqqqqqqqqqqqq qqqqq qqqqqqqqqqqqqq qqqqqq qqqqqq qqqqqq qqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqq qqqqqqqqqqqqqqq qqqqqqqq qqqqqq qqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqq qqqqqqqq qqqqqq qqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqq qqqqq qqqqq qqqq qqqq qqqq qqq qqqqqq qqqqqq q qq qq qq q q q qq q q qq qq qq qq Number of Neighbors Number of Neighbors Number of Neighbors Number of Neighbors Number of Neighbors q q q q q q qq q qq qq q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q qq q q qq qq q q q qq qq q qq qq q q q qqq qq qq qqq qq qq qqq qqq qq qqq qqq qq qq qq qq qqq qqq qq qqq qqq qq R R R R R 37 42 85 90 111 qqqqqqqqqqqqqqqqq qqqqqqqq qqqqqqqq qqqqqqqqqqqqq qqqqqqqqqqqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqq qqqqqqqq qqqqqqqqqqqqqqqqqq qqqqqqqq qqqqqqqqq qqqqqqqqqqqqqqq qqqqqqqqqqqqqq qqqqqqqqqqqqqq qqqqqqqqqqqqq qqqqqqqqqqqqq qqqqqqq qqqqqq qqqqqq qqqqqqqqqqqqq qqqqqqqqqqqqqq qqqqqqqqqqqqq qqqqqq qqqqqq qqqqqq qqqqq qqqqqq qqqqqq qqqqq qqq qqqq qqq qqqq qqq qqq q q qq qq q qq qq q qq qq Number of Neighbors Number of Neighbors Number of Neighbors Number of Neighbors Number of Neighbors q q q q q qq qq q qq q q qq qq q q q q q q q q q q qq q q q q q q q q q q q q q q q qq q q q qq qq q q q q q q q q q q q q q q q qq q qq qqq qq q qq qq q q q q qq qq qq q qqq qq qq q qqq qq qq q qq qqq qq qqq qq qqq qq qqq qqq qqq qqq qq q q R R R R R 141 148 149 151 165 qqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqq qqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqq qqqq qqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqq qqqqqq qqqqqqqqq qqqqqqqqq qqqqqqqqq qqqqqq qqqqqq qqqqqq qqqqqqqqqqqqq qqqqqqqqqqqqq qqqqqqqqqqqq qqqqqq qqqqq qqqqq qqqqq qqqqqq qqqq qqqqqq qqqqq qqqq qqqqq qqqqqq qqq qqq qqq qqq qq qq qqq qq qq qq qq q q q q q q qq qq Number of Neighbors Number of Neighbors Number of Neighbors Number of Neighbors Number of Neighbors qq qq q qq q qq q qq q qq q q q q q q q q q q q q qq qq q q q q qq qq q qq qq q q q qq q qq q qq q q q q q qq qq q q q q q q q qq q qq qq qq qq qqq
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Outliers for the Kazius-AMES Dataset qq q q 4000 q Number of Neighbors q 3000 q 2000 q Defining outliers q 1000 q q R = 0.5 × Dmax qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qq qq qq qq q q q 0 0 20 40 60 80 100 NN count ≤ 10% of the dataset R 3904 Outliers detected q qqqqqqqqqqqqqq qqqqqqqqqqqqqq 4000 q q Number of Neighbors Molecule 3904 had 2 neighbors q 3000 q Molecule 1861 had 41 neighbors 2000 q q R = 0.4 × Dmax → 6 outliers 1000 q q q q q q Clearly, this process is subjective q qq qqq qqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq 0 0 20 40 60 80 100 R 1861
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Outlier Structures H2N O HN N• HO O OH H2N O O O O HN H N N H O H HN O N N H O O O O NH HN HO N H N H HO O NH O H2N N H O HN O •N N O O H NH NH H2N S H HO O N S •N S HN H2N O NH2 O N O O OH O NH O OH N HN NH O H HN HN OH OH N O NH HO O H S H O N O N N H O NH NH2 S N O O O N NH H OH O O N O O HO O OH O N O O O NH N NH N N HN HO NH NH2 O H2N OH HN O N• NH2 HN H2N O NH O O H2N •N HO NH2 3904 1861
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Summary of the Kazius-AMES Dataset 3904 & 1861 are immediately 3904 identifiable as outliers 80 Excluding these two, there are 1861 no significant outliers 60 Rmax S The dataset appears to be 40 relatively dense 20 One plot summarizes the location characteristics of all the molecules in 0 1000 2000 3000 4000 the dataset Serial Number
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves R-NN Curves & Clustering 251 84 300 300 250 250 200 200 15 150 150 100 100 251 50 50 10 0 0 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 300 300 250 250 5 84 200 200 150 150 137 100 100 0 50 50 0 0 −5 0 5 10 15 20 25 0 20 40 60 80 100 0 20 40 60 80 100
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves R-NN Curves & Clustering 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed with Friedmans Super Smoother Friedman, J.H.; Laboratory for Computational Statistics, Stanford University, Technical Report 5, 1984
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Conclusions Future Work Allow comparison over multiple datasets Automated detection of clustering Conclusions Applicable to arbitrary dimensions Can be easily extended to large datasets Summarizing a dataset does not require any user defined parameters Simple and intuitive approach to visualizing distribution of molecules in a descriptor space
  • Outlier Detection Using R-NN Curves Methods for Diversity Analysis Searching for HIV Integrase Inhibitors Generating & Summarizing R-NN Curves Summary Using R-NN Curves Conclusions Future Work Allow comparison over multiple datasets Automated detection of clustering Conclusions Applicable to arbitrary dimensions Can be easily extended to large datasets Summarizing a dataset does not require any user defined parameters Simple and intuitive approach to visualizing distribution of molecules in a descriptor space
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Background Most drugs for HIV treatment target the protease or reverse transcriptase enzymes Integrase is essential for the replication of HIV There are no FDA approved drugs that target HIV integrase 1BIS
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Previous Work Previous approaches Pharmacophore based models Docking Disadvantages Requires a reliable receptor structure Computationally intensive Does not necessarily lead to a diverse hit list Nair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447 Dayam, R; Sanchez, T.; Neamati, N.; J. Med. Chem., 2005, 48, 8009–8015 Deng, J. et al.; J. Med. Chem., 2005, 48, 1496–1505
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Goals High speed Be able to process large libraries rapidly Avoid docking till required Try and use connectivity information only Reliability Use a consensus approach for predictions Use molecular similarity to further prioritize predictions Novelty Try to obtain hits suitable for lead hopping Try to obtain a diverse set of hits
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Goals High speed Be able to process large libraries rapidly Avoid docking till required Try and use connectivity information only Reliability Use a consensus approach for predictions Use molecular similarity to further prioritize predictions Novelty Try to obtain hits suitable for lead hopping Try to obtain a diverse set of hits
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Goals High speed Be able to process large libraries rapidly Avoid docking till required Try and use connectivity information only Reliability Use a consensus approach for predictions Use molecular similarity to further prioritize predictions Novelty Try to obtain hits suitable for lead hopping Try to obtain a diverse set of hits
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Datasets & Tools Training data Software Curated dataset MOE for descriptors 529 inactives R for the VS protocol 395 actives MASS Binary valued activity randomForest fingerprint Vendor database 50,000 compounds 2D structures
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Overview of the Screening Protocol Training Descriptor Calculation Dataset and Reduction RF Model LDA Model Vendor DB Keep Predicted Keep Predicted Actives with Actives with Majority Vote > M Posterior > P Hit List Similarity filter Similarity filter Hit List Consensus Hit List
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Descriptor Calculations Restricted ourselves to topological & 2.5D descriptors 147 descriptors were evaluated Correlated and low variance descriptors were removed resulting in a reduced pool of 45 descriptors
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Predictive Models - LDA Why? Simple May be sufficient How? Use a GA to search for good descriptor subsets Used a 6-descriptor model Accuracy? Percent Correct On whole dataset 72 On TSET/PSET 72 / 71 With leave-10%-out 71
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Predictive Models - Random Forest Why? No feature selection required Does not overfit Useful for non-linearities in the dataset How? Number of trees = 500 Number of features sampled = 6 Accuracy Percent Correct On whole dataset 72 On TSET/PSET 75 / 70
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Consensus Predictions Vendor Database Goal Try and be sure that a predicted active is active LDA RF Predicted Active Consider compounds predicted active by Posterior & Majority both models Constrains LDA: consider compounds with posterior > P RF: consider compounds with Similarity Filter majority vote > M Apply similarity filter We now have 2 hit lists A final hit list is obtained from the intersection of the individual hitlists
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Consensus Predictions Vendor Database Goal Try and be sure that a predicted active is active LDA RF Predicted Active Consider compounds predicted active by Posterior & Majority both models Constrains LDA: consider compounds with posterior > P RF: consider compounds with Similarity Filter majority vote > M Apply similarity filter We now have 2 hit lists A final hit list is obtained from the intersection of the individual hitlists
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Similarity Filter Goal Predict actives from vendor database that are more similar to actives than inactives Evaluate 166 bit MACCS fingerprints Evaluate average Tanimoto similarity between Vendor compound and TSET actives (S1 ) Vendor compound and TSET inactives (S2 ) Select compounds where S1 − S2 ≥
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Similarity Filter Goal Predict actives from vendor database that are more similar to actives than inactives Evaluate 166 bit MACCS fingerprints Evaluate average Tanimoto similarity between Vendor compound and TSET actives (S1 ) Vendor compound and TSET inactives (S2 ) Select compounds where S1 − S2 ≥
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Suggesting Good Leads Can we prioritize further? 80 Look for TSET actives that are 60 outliers Rmax S These compounds may be good 40 starting points for further modifications 20 Evaluate similarity of hit list 0 200 400 600 800 compounds to these isolated TSET Compound Serial No. compounds The points marked red are Hit list compounds most similar to the most outlying the most isolated TSET active may compounds in the TSET and be good leads are also active Saeh, J.C.; Lyne, P.D.; Takasaki, B.K.; Cosgrove, D.A.; J. Chem. Inf. Comput. Sci., 2005, 45, 1122-1133 Guha, R.; Dutta, D.;Jurs, P.C.; Chen, T; J. Chem. Inf. Model., submitted
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Results LDA RF Summary of the hits Parameter settings 313 hits 57 hits P, M > 0.7 ≥ 0.01 34 hits Of the 34 hits, 7 had a similarity > 0.5 with the most outlying TSET active Average similarity = 0.64 We obtained 66 more hits Not significantly diverse from an ensemble of LDA models None were in common with those predicted by the pharmacophore models
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? The Search for β-diketones A recently published inhibitor exhibited a β-diketone moiety None of our hits had this moiety R [C,N] R O O Searched the vendor DB for structures with this feature and predicted them LDA model → 244 actives (posterior > 0.8) RF model → 4 actives (score > 0.85) The intersection set contained 4 compounds We missed these hits due to our similarity constraints Nair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? The Search for β-diketones A recently published inhibitor exhibited a β-diketone moiety None of our hits had this moiety R [C,N] R O O Searched the vendor DB for structures with this feature and predicted them LDA model → 244 actives (posterior > 0.8) RF model → 4 actives (score > 0.85) The intersection set contained 4 compounds We missed these hits due to our similarity constraints Nair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Similarity to the Published Compound O O OH OH N O O N Published O O F HN N O HO O O Predicted Nair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Future Work Reliability Consider similarity to known actives in terms of pharmacophores Dock our best hits (may not be conclusive) Improve performance with local methods Diversity Investigate distribution of vendor compounds in the descriptor space Cluster the vendor DB and predict representative members of the clusters Assay!
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Future Work Reliability Consider similarity to known actives in terms of pharmacophores Dock our best hits (may not be conclusive) Improve performance with local methods Diversity Investigate distribution of vendor compounds in the descriptor space Cluster the vendor DB and predict representative members of the clusters Assay!
  • Outlier Detection Using R-NN Curves Previous Work on HIV Integrase Searching for HIV Integrase Inhibitors A Tiered Virtual Screening Protocol Summary What Does the Pipeline Give Us? Future Work Reliability Consider similarity to known actives in terms of pharmacophores Dock our best hits (may not be conclusive) Improve performance with local methods Diversity Investigate distribution of vendor compounds in the descriptor space Cluster the vendor DB and predict representative members of the clusters Assay!
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Outline 1 Outlier Detection Using R-NN Curves Methods for Diversity Analysis Generating & Summarizing R-NN Curves Using R-NN Curves 2 Searching for HIV Integrase Inhibitors Previous Work on HIV Integrase A Tiered Virtual Screening Protocol What Does the Pipeline Give Us? 3 Summary
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Summary The outlier detection scheme provides a way to analyze spatial distributions of molecules in arbitrary descriptor spaces The method can be extended to large datasets using approximate NN algorithms The technique results in a single intuitive plot that can be used to easily identify outliers A consensus approach to predictive models allows us to increase confidence in activity predictions Prioritization based on similarity is intuitive, but may not work well with homogenous datasets The protocol appears to have identified possible leads which await confirmation
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Summary The outlier detection scheme provides a way to analyze spatial distributions of molecules in arbitrary descriptor spaces The method can be extended to large datasets using approximate NN algorithms The technique results in a single intuitive plot that can be used to easily identify outliers A consensus approach to predictive models allows us to increase confidence in activity predictions Prioritization based on similarity is intuitive, but may not work well with homogenous datasets The protocol appears to have identified possible leads which await confirmation
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Acknowledgements Prof. P. C. Jurs Dr. D. Dutta Prof. N. Neamati Dr. R. Dayam
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Why are R-NN Curves Sigmoidal? Let the neighbor density be ρ0 At some critical radius, r0 , the neighbor density changes to ρ1 , ρ1 >> ρ0 After a certain radius, R, the number of NN’s becomes constant, N For a radius r < r0 , consider a small change dr The number of NN’s on the strip is 2πρ0 dr which on 2 integration gives 2πρ0 r0
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Why are R-NN Curves Sigmoidal? Let the neighbor density be ρ0 At some critical radius, r0 , the neighbor density changes to ρ1 , ρ1 >> ρ0 After a certain radius, R, the number of NN’s becomes constant, N For a radius r < r0 , consider a small change dr The number of NN’s on the strip is 2πρ0 dr which on 2 integration gives 2πρ0 r0
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Why are R-NN Curves Sigmoidal? For a radius r0 < r < R, the number of NN’s is given by r 2 2πρ0 r0 + 2πr ρ1 dr r0 The number of nearest neighbors as a function of r is   0 if r ≤0 2πρ0 r 2  if 0 < r ≤ r0  NN(r ) =   2 2 2πρ0 r0 + 2πρ1 (r 2 − r0 ) if r0 < r < R N if r ≥R  From Zhang et al., the above function is an approximation of the logistic function Zhang, M.; Vassiliadis, S.; Delgado-Frias, J.; IEEE Trans. Comput., 1996, 45, 1045–1049
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Why are R-NN Curves Sigmoidal? For a radius r0 < r < R, the number of NN’s is given by r 2 2πρ0 r0 + 2πr ρ1 dr r0 The number of nearest neighbors as a function of r is   0 if r ≤0 2πρ0 r 2  if 0 < r ≤ r0  NN(r ) =   2 2 2πρ0 r0 + 2πρ1 (r 2 − r0 ) if r0 < r < R N if r ≥R  From Zhang et al., the above function is an approximation of the logistic function Zhang, M.; Vassiliadis, S.; Delgado-Frias, J.; IEEE Trans. Comput., 1996, 45, 1045–1049
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Why are R-NN Curves Sigmoidal? For a radius r0 < r < R, the number of NN’s is given by r 2 2πρ0 r0 + 2πr ρ1 dr r0 The number of nearest neighbors as a function of r is   0 if r ≤0 2πρ0 r 2  if 0 < r ≤ r0  NN(r ) =   2 2 2πρ0 r0 + 2πρ1 (r 2 − r0 ) if r0 < r < R N if r ≥R  From Zhang et al., the above function is an approximation of the logistic function Zhang, M.; Vassiliadis, S.; Delgado-Frias, J.; IEEE Trans. Comput., 1996, 45, 1045–1049
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary R-NN Curves & Clustering How Do We Characterize Curves?  RMSE  Generating a mean curve  Examine slopes  Piecewise linear approximation  Compare distance matrices  Shape matching Hausdorff distance Fr´chet distance e
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary R-NN Curves & Clustering 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed with Friedmans Super Smoother Friedman, J.H.; Laboratory for Computational Statistics, Stanford University, Technical Report 5, 1984
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Genetic Algorithms Features Stochastic optimization procedure Based on evolutionary principles Applicable to large search spaces Not always guaranteed to find the optimal solution Fitness is evaluated by the objective function
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Genetic Algorithms 1 12 14 34 35 Initialization Initially random chromosomes 8 12 23 29 34 Chromosome length is the size of the descriptor subset 1 27 33 42 43 Genes are the descriptors 5 22 28 35 41 Larger pool sizes allow for a wider search 12 17 21 32 48
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Genetic Algorithms Mutation 1 27 33 2 43 Mutation frequency is relatively low Mutations randomly change a single descriptor 1 27 33 15 43 Helps to get out of local minima
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Genetic Algorithms Crossover 1 27 33 42 43 Crossover results in swapping of genes 12 17 21 32 48 Crossover between fit individuals should lead to children with good aspects of both parents In single point crossover 12 17 21 42 43 Choose crosspoint Swap corresponding sections Results in two new children 1 27 33 32 48
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Random Forest Features Built in estimation of prediction accuracy Measure of descriptor importance Measure of similarity Random Forest vs. Decision Tree RF is faster because it searches fewer descriptors at each split A decision tree must be pruned via cross validation. RF trees are grown to full depth RF is much more accurate than a decision tree
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Random Forest Training N molecule Data bootstrap sample Grow tree to maximum size Store tree NO Is the Forest Big Enough? YES STOP
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary OOB Samples What is OOB? Due to bootstrap sampling, each tree uses ≈ 2/3 of the TSET The remaining 1/3 is the Out Of Bag sample The OOB sample can be used to estimate performance during training or measure descriptor importance Descriptor Importance Decision trees create explicit relationships between descriptors and predictions These relationships are hidden in a RF model A randomization procedure on OOB samples can be used to rank descriptors in importance
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Model Complexity Very simple models or very complex models may perform poorly The optimal performance and complexity are obtained from a trade off between bias and variance Mean Squared Error = Var (ˆ ) + Bias 2 (ˆ ), where y y 1 Var ∝ and Bias ∝ complexity complexity
  • Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Model Complexity http://www.ece.wisc.edu/∼nowak/ece901/lecture3.pdf