Prediction Of Bioactivity From Chemical Structure


Presentation for the Small Molecule Bioactivity Resources At The EBI training course 2010


  1. Prediction of bioactivity from chemical structure
     Small Molecule Bioactivity Resources at the EBI
     Jérémy Besnard [email_address]
  2. Myself
     • PhD student at the University of Dundee
       • Supervisor: Prof. Andrew Hopkins
       • Lab: medicinal informatics
     • Background
       • Chemistry degree with some biology
       • One industrial year at Pfizer in computational chemistry
  3. Prediction of bioactivity
     • Types of prediction
       • How active is a compound? A continuous model
       • Is the compound active, or not? A categorical model
     QSAR: Quantitative Structure-Activity Relationship
     Some slides are adapted from Richard Lewis's (Novartis) presentation at the University of Sheffield Practical Introduction to Chemoinformatics course (next in 2011)
  4. Example
     • Molecular weight = 360. Activity?
       • Linear regression: Activity = 0.01 × Molecular weight + 1.7 (R² = 0.900)
       • Activity = 5.3
     • Active?
       • Category: molecular weight > 260 = active
       • Active: yes

     Molecular weight:  180  220  250  290  340  380  450  500
     Activity (pIC50):  4.0  4.3  4.8  5.4  4.8  5.8  7.5  7.7
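As a check, the slide's fit can be reproduced with plain least squares in Python, using only the eight points from the table (no external libraries):

```python
# Least-squares fit of activity (pIC50) against molecular weight,
# using the eight data points from the slide.
mw = [180, 220, 250, 290, 340, 380, 450, 500]
pic50 = [4.0, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]

n = len(mw)
mean_x = sum(mw) / n
mean_y = sum(pic50) / n

# slope = covariance(x, y) / variance(x); intercept from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mw, pic50))
sxx = sum((x - mean_x) ** 2 for x in mw)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Coefficient of determination R^2
syy = sum((y - mean_y) ** 2 for y in pic50)
r2 = sxy ** 2 / (sxx * syy)

predicted = slope * 360 + intercept  # activity for the MW = 360 compound
print(f"Activity = {slope:.4f} x MW + {intercept:.2f}  (R2 = {r2:.3f})")
```

Note that the unrounded coefficients (about 0.0118 and 1.70) predict roughly 5.9 at MW = 360; the slide's 5.3 comes from using the rounded values 0.01 and 1.7.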
  5. QSAR
     • Activity = IC50, Ki, ratios, ...
     • Molecular descriptors
       • Topological (shape, size)
       • Physical and thermodynamic
       • Chemical features (substructures)
     • Activity = f(Molecular descriptors), with f derived by statistics
  6. The absolute basics
     • Activity + Representation + Method = QSAR
     • Activity = experimental data
     • Representation = description of the molecule
     • Method = statistical tool to use
       • Underlying principle: similar molecules should have similar activities
  7. Advantages of models
     • Fast and cheap
       • Virtual screening: the computer does the manipulation
         • Human: a day to a week
         • Computer: seconds to hours
     • Helps understand the science behind the observation
       • A tool to design compounds with a higher chance of being active
  8. Activity
     • It can be anything
       • Continuous: IC50, % inhibition, EC50, ratios, ...
       • Categorical: yes/no, low/medium/high
     • Better if
       • the data come from the same assay and conditions
       • the quality is good (you trust the experimental data)
     • For ADME endpoints
       • Many software solutions exist, yet these endpoints are not easy to predict!
         • Few experimental data points (and not very reliable)
         • In vivo phenomena
  9. Molecular descriptors
     • Many, many, many
     • Simple counts
       • Number of atoms, rings, hydrogen-bond donors and acceptors, molecular weight, ...
     • Physicochemical
       • Hydrophobicity, polarity: cLogP, polar surface area (PSA)
     • Shape: topological indices
       • Big, small, long, round
     • 2D fingerprints
       • Presence or absence of certain substructures
         • From a dictionary (e.g. MACCS: count of acids)
         • On the fly: look at the substructures present in the data
     • 3D: fingerprints, electrostatics, shape
  10. Fingerprints
     • Binary vector: a list of 0s and 1s
     • Dictionary-based: fixed size, each bit = one predefined group (acid, Cl, amide, 6-membered aromatic ring, ...)
     • Hashed: fragment the molecule and map each fragment to a bit position of the vector
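A hashed fingerprint can be sketched in a few lines of Python; the fragment labels below are illustrative placeholders, not a real fragmentation scheme:

```python
# Toy hashed fingerprint: each fragment (here just a string label) is hashed
# into one of n_bits positions, and that bit is set to 1.
import hashlib

def hashed_fingerprint(fragments, n_bits=64):
    bits = [0] * n_bits
    for frag in fragments:
        # Stable hash so the same fragment always maps to the same bit
        h = int(hashlib.md5(frag.encode()).hexdigest(), 16)
        bits[h % n_bits] = 1
    return bits

fp = hashed_fingerprint(["acid", "Cl", "amide", "aromatic-6-ring"])
print(sum(fp), "bits set out of", len(fp))
```

Because the vector is fixed-length while the number of possible fragments is not, two different fragments can collide on the same bit; this is the price paid for avoiding a dictionary.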
  11. Focus on fingerprints
     • Example: FCFP and ECFP
       • Internal to Accelrys software (Pipeline Pilot, used in the practicals)
       • Circular substructures
       • No dictionary
       • FCFP = functional-class circular fingerprint
         • Atom abstraction: function only (acid, aromatic, ...)
       • ECFP = extended-connectivity circular fingerprint
         • Atom and atom properties (aromatic C, aliphatic C)
  12. FCFP: initial atom codes
  13. Extending the initial atom codes
     • Fingerprint bits indicate the presence or absence of certain structural features
     • Fingerprints do not depend on a predefined set of substructural features
     [Figure: iterations 0, 1 and 2 around an O/N-containing fragment; each iteration adds bits that represent larger and larger structures]
  14. Generating the fingerprint
     • The iteration is repeated the desired number of times
       • Each iteration extends the diameter by two bonds
     • Codes from all iterations are collected
     • Duplicate bits may be removed
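The iterate-and-collect procedure can be sketched on a toy molecular graph; the atom codes and the use of Python's built-in `hash` below are simplified stand-ins for the real ECFP algorithm:

```python
# Sketch of the circular-fingerprint idea: atoms carry an initial code, and
# each iteration replaces an atom's code with a hash of its own code plus its
# (sorted) neighbours' codes, growing the described substructure by one bond
# in every direction (two bonds of diameter per iteration).
def circular_codes(atom_codes, neighbours, n_iterations=2):
    codes = set(atom_codes)                      # iteration 0 codes
    current = list(atom_codes)
    for _ in range(n_iterations):
        current = [
            hash((current[i], tuple(sorted(current[j] for j in neighbours[i]))))
            for i in range(len(current))
        ]
        codes.update(current)                    # collect codes from every iteration
    return codes                                 # duplicates removed by the set

# Toy "molecule": a chain C - N - C - O represented by indices 0 to 3
atom_codes = ["C", "N", "C", "O"]
neighbours = [[1], [0, 2], [1, 3], [2]]
print(len(circular_codes(atom_codes, neighbours)))
```

Each collected code would then be folded into a bit position of a fixed-length vector, exactly as in the hashed-fingerprint sketch.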
  15. Data sets
  16. Validity of a model
     • It is easy to introduce artefacts and "false correlations"
     The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy), Johnson, J. Chem. Inf. Model., 2008, 48 (1), pp 25-26
  17. Training and test sets
     • Build the model from the training set
     • Predict the test set
     • Also called leave-N-out validation, where N ranges from 1 compound to 50% of the dataset
     • Cross-validation: repeat the steps N times using complementary training and test sets
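A minimal sketch of the N-fold splitting step, in which each compound ends up in the test set exactly once:

```python
# N-fold cross-validation splits: every compound is used for testing exactly
# once, and the model would be rebuilt on the remaining compounds each time.
def cross_validation_folds(items, n_folds):
    for k in range(n_folds):
        test = items[k::n_folds]                 # every n-th compound
        train = [x for x in items if x not in test]
        yield train, test

compounds = list(range(10))
for train, test in cross_validation_folds(compounds, 5):
    print(len(train), "training,", len(test), "test")
```

In practice the split is usually randomised first, so that folds do not follow any ordering present in the dataset.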
  18. Space of the sets
     • The training set should cover the representation space evenly
  19. Training vs test sets
     • The test set should not be too dissimilar to the training set
       • Too similar = the quality of the model is overestimated
       • Too dissimilar = the prediction is too difficult
  20. Questions?
  21. Statistical methods
     • Ingredients so far: activity, molecular descriptors, training and test sets
     • Activity = f(Molecular descriptors), with f derived by statistics
  22. Categorical models
     • The focus is on a specific criterion:
       • Is the activity < 10 µM? (as in an HTS assay)
     • The data are not continuous
       • Soluble/insoluble
     • Try to find a rule (or set of rules) that splits the data into classes with the lowest rate of misclassification
       • Different coefficients measure the quality
       • (ref: Assessing the accuracy of prediction algorithms for classification: an overview, Baldi et al., Bioinformatics 2000, 16:412-424)
  23. Recursive partitioning
     • Uses decision trees
     • Rules are organised as a tree; each node = one rule
       • Cut-off: molecular weight < 450
       • Absence/presence of a group: acid group
     • Usually easy to interpret
     • Drawback: overfitting, i.e. a model too specific to the training data
  24. Example decision tree (21 actives, 24 inactives at the root):
      Molecular weight
        > 450 (2 actives, 10 inactives) → Polar surface area
                 > 100 → 0 actives, 10 inactives
                 ≤ 100 → 2 actives, 0 inactives
        ≤ 450 (19 actives, 14 inactives) → cLogP
                 > 4.2 → 0 actives, 7 inactives
                 ≤ 4.2 (19 actives, 7 inactives) → Acid group
                          Yes → 1 active, 5 inactives
                          No  → 18 actives, 2 inactives
      Example molecules traced down the tree: MW 178, PSA 37, LogP 3; MW 205, PSA 20, LogP 3
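The slide's tree translates directly into code. A sketch, taking each leaf's decision as the majority class of its training counts (argument names are illustrative):

```python
# The example tree written as nested rules. Each leaf comment gives the
# (actives, inactives) counts from the training data; a compound is predicted
# active when it reaches a majority-active leaf.
def predict_active(mw, psa, clogp, has_acid_group):
    if mw > 450:
        return psa <= 100   # PSA > 100: (0, 10) inactive; PSA <= 100: (2, 0) active
    if clogp > 4.2:
        return False        # (0, 7): inactive
    if has_acid_group:
        return False        # (1, 5): mostly inactive
    return True             # (18, 2): mostly active

print(predict_active(mw=500, psa=150, clogp=3.0, has_acid_group=False))  # False
print(predict_active(mw=400, psa=50, clogp=2.0, has_acid_group=False))   # True
```

This is exactly why recursive partitioning is easy to interpret: the fitted model is a short list of readable rules.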
  25. Substructural analysis
     • Idea: each fragment of the molecule makes a contribution to the activity, independent of the other fragments in the molecule
     • Fragments get a score for their activity, and a molecule's score is the sum of its fragments' scores
     • A simple fragment scoring function uses:
       Act_i = number of active compounds containing fragment i
       Inact_i = number of inactive compounds containing fragment i
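A sketch of this scheme in Python. The slide's exact formula did not survive the conversion, so the score used here is the simple fraction Act_i / (Act_i + Inact_i), one common choice that may differ from the original:

```python
# Substructural analysis: score each fragment by the fraction of
# fragment-containing compounds that are active, then score a molecule
# as the sum of its fragments' scores.
def fragment_scores(molecules, labels):
    """molecules: list of fragment sets; labels: 1 = active, 0 = inactive."""
    act, inact = {}, {}
    for frags, label in zip(molecules, labels):
        for f in frags:
            if label:
                act[f] = act.get(f, 0) + 1
            else:
                inact[f] = inact.get(f, 0) + 1
    all_frags = set(act) | set(inact)
    return {f: act.get(f, 0) / (act.get(f, 0) + inact.get(f, 0))
            for f in all_frags}

def molecule_score(fragments, scores):
    # Independence assumption: contributions simply add up
    return sum(scores.get(f, 0.0) for f in fragments)

scores = fragment_scores(
    [{"acid", "phenyl"}, {"amide", "phenyl"}, {"acid"}],
    [1, 1, 0],
)
print(scores["phenyl"], scores["acid"])  # 1.0 0.5
```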
  26. Naïve Bayesian classifiers
     • Related to substructural analysis (slight differences in how the weight sum is calculated, see ref)
     • Used with fingerprints
       • Each substructure (bit in the fingerprint) gets a weight
       • The fingerprint can be mixed with other properties
         • Properties are binned and each bin obtains a weight
     • Molecules are scored; the higher the score, the higher the chance of being in a specific category
     • Native implementation in Pipeline Pilot (practical)
     Ref: New Methods for Ligand-Based Virtual Screening: Use of Data Fusion and Machine Learning to Enhance the Effectiveness of Similarity Searching, Hert et al., J. Chem. Inf. Model., 2006, 46 (2), pp 462-470
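A sketch of a Laplacian-smoothed Bayesian scorer in the spirit of this approach; the smoothing constant `k` and the exact normalisation below are illustrative assumptions, not the Pipeline Pilot implementation:

```python
# Each feature's weight compares the smoothed hit rate among compounds
# containing that feature with the baseline hit rate of the whole set;
# smoothing shrinks rarely seen features towards the baseline.
from math import log

def bayes_weights(molecules, labels, k=1.0):
    """molecules: list of feature sets (fingerprint bits); labels: 1/0."""
    p_base = sum(labels) / len(labels)           # baseline hit rate
    totals, actives = {}, {}
    for feats, label in zip(molecules, labels):
        for f in feats:
            totals[f] = totals.get(f, 0) + 1
            actives[f] = actives.get(f, 0) + label
    return {
        f: log(((actives[f] + k * p_base) / (totals[f] + k)) / p_base)
        for f in totals
    }

def score(feats, weights):
    # Higher total weight = higher chance of being in the active category
    return sum(weights.get(f, 0.0) for f in feats)

weights = bayes_weights([{"a", "c"}, {"a"}, {"b", "c"}], [1, 1, 0])
```

Features seen mostly in actives get positive weights, features seen mostly in inactives get negative weights, and unseen features contribute nothing.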
  27. Validation
     • A collection of coefficients
     • Most common ones:
       • Specificity and sensitivity: ROC curve
       • Enrichment plot
  28. Specificity and sensitivity
     • TP = true positives, TN = true negatives, FP = false positives, FN = false negatives
     • Sensitivity = TP / (TP + FN)
     • Specificity = TN / (TN + FP)
     • Example:
       • If all compounds are predicted inactive: specificity = 1 (very good), sensitivity = 0 (very bad)
       • If all compounds are predicted active: specificity = 0 (very bad), sensitivity = 1 (very good)
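The two measures in code, including the slide's degenerate "predict everything inactive" example:

```python
# Sensitivity and specificity from the four confusion-matrix counts.
def sensitivity(tp, fn):
    return tp / (tp + fn)   # fraction of true actives that are found

def specificity(tn, fp):
    return tn / (tn + fp)   # fraction of true inactives correctly rejected

# Everything predicted inactive: no positives at all
print(specificity(tn=90, fp=0), sensitivity(tp=0, fn=10))  # 1.0 0.0
```

The example shows why neither measure is useful on its own: a trivial classifier maximises one while driving the other to zero.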
  29. ROC curve
     • Plot sensitivity versus 1 - specificity
     • Coefficient = area under the curve (AUC): 1 is ideal, 0.5 is random
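The AUC can be computed without drawing the curve, using an equivalent ranking interpretation:

```python
# The area under the ROC curve equals the probability that a randomly chosen
# active scores higher than a randomly chosen inactive (ties count one half).
def roc_auc(active_scores, inactive_scores):
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))

print(roc_auc([0.9, 0.8, 0.4], [0.3, 0.2, 0.1]))  # 1.0: perfect ranking
print(roc_auc([0.5, 0.5], [0.5, 0.5]))            # 0.5: random
```

The quadratic pairwise loop is fine for a sketch; for large screens one would sort once and use ranks instead.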
  30. Enrichment curve
     • In some studies the exact rank of the compounds is not that important: the idea is to select the top X per cent of the data
     • Use the model to select the top X compounds, aiming to capture most of the active molecules inside
     • Example reading: 40% of the actives are in the top 10%. The plot does not tell how many compounds this represents (it could be 40 actives and 10,000 inactives in the top 10%)
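Reading a point off the enrichment curve corresponds to the following computation (a sketch; `actives_in_top` is an illustrative name):

```python
# Fraction of all actives recovered in the top X% of the score-ranked list,
# i.e. the quantity plotted on an enrichment curve.
def actives_in_top(scores, labels, fraction):
    """scores: model scores; labels: 1 = active, 0 = inactive."""
    n_top = max(1, int(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    found = sum(label for _, label in ranked[:n_top])
    return found / sum(labels)

# 4 actives among 20 compounds; the model ranks 2 of them in the top 10%
scores = [0.9, 0.8] + [0.5] * 16 + [0.2, 0.1]
labels = [1, 1] + [0] * 16 + [1, 1]
print(actives_in_top(scores, labels, 0.10))  # 0.5: half the actives in the top 10%
```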
  31. Other methods
     • There are other statistical methods
     • There is no perfect method; the choice is project-dependent (and partly a "personal" choice)
     • Most common:
       • Forests of trees
       • Support vector machines
       • Neural networks
  32. Questions?
  33. Regression
     • Provides a value, with more information than yes or no
     • Usually a smaller dataset than for classification
     • Links activity to structure by an equation (from simple to complicated)
  34. Historical
     • First equation: Hansch, 1964
     • Links activity to the molecule's electronic characteristics and to its hydrophobicity:
       log(1/C) = k1 · logP + k2 · σ + k3
     • C is the concentration required to produce a response
     • logP is the octanol/water partition coefficient (ability to cross membranes)
     • σ is the Hammett substitution parameter (strength of the electron-withdrawing or -donating properties of the aromatic substituent)
     • It is a linear equation, later improved with a parabolic function in logP
     p-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure, Hansch et al., J. Am. Chem. Soc., 1964, 86 (8), pp 1616-1626
     Parabolic dependence of drug action upon lipophilic character as revealed by a study of hypnotics, Hansch et al., J. Med. Chem., 1968, 11 (1), pp 1-11
  35. Deriving a QSAR equation
     • The most common method is linear regression: y = a·x + b
     • In QSAR, x is usually a descriptor (e.g. logP)
     • Aim: minimise the sum of the squared differences between the predicted and the observed values
     • With more than one descriptor: y = a1·x1 + a2·x2 + ... + b
  36. Quality
     • The most common measure is the square of the correlation coefficient, R²
     • R² alone is not enough: always review the data, since visually very different datasets can give almost the same R²
  37. Cross-validation
     • Remove some of the values from the dataset, build a QSAR model from the remaining data, and apply this model to the removed values
     • The R² of cross-validation is written Q²; it represents the goodness of prediction (R² is the goodness of fit)
     • Q² should be lower than R², but not too much lower (otherwise the model was over-fit)
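A leave-one-out Q² sketch, reusing the molecular-weight example from earlier in the deck: each compound is predicted by a line fitted on the other seven.

```python
# Leave-one-out Q^2 for the simple MW -> pIC50 linear model.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def loo_q2(xs, ys):
    mean_y = sum(ys) / len(ys)
    press = 0.0                      # predictive residual sum of squares
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a * xs[i] + b)) ** 2
    return 1 - press / sum((y - mean_y) ** 2 for y in ys)

mw = [180, 220, 250, 290, 340, 380, 450, 500]
pic50 = [4.0, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]
q2 = loo_q2(mw, pic50)
print(round(q2, 2))  # a little below the fitted R^2 of 0.90, as expected
```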
  38. Designing a QSAR experiment
     • Find the smallest number of variables that explains as much of the data as possible
       • It is easy to calculate thousands of parameters with a computer in seconds
     • Rules of thumb
       • > 5 compounds for each descriptor
       • Check the descriptors: remove the invariant ones
       • Remove correlated factors (by deleting a descriptor, or using a data-reduction technique such as PCA)
     • Selection
       • Algorithms to select the most significant descriptors
         • Forward-stepping regression: start from one descriptor and add
         • Backward-stepping regression: start with all descriptors and remove
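The "remove correlated factors" step can be sketched as a greedy filter; the 0.9 threshold is an arbitrary illustrative choice:

```python
# Greedy filter: keep a descriptor only if its Pearson correlation with every
# descriptor already kept stays below the threshold.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def drop_correlated(descriptors, threshold=0.9):
    """descriptors: dict name -> list of values (one per compound)."""
    kept = {}
    for name, values in descriptors.items():
        if all(abs(pearson(values, v)) < threshold for v in kept.values()):
            kept[name] = values
    return list(kept)

descriptors = {
    "MW":    [180, 220, 250, 290, 340],
    "atoms": [13, 16, 18, 21, 25],    # nearly proportional to MW
    "cLogP": [1.2, 3.5, 0.8, 4.1, 2.0],
}
print(drop_correlated(descriptors))  # atom count dropped: it tracks MW
```

Invariant descriptors should be removed beforehand, since a zero-variance column makes the correlation undefined.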
  39. Regression algorithms
     • Multiple linear regression (see practical)
       • Easy to interpret
       • Problem with correlations between factors
     • Partial least squares (PLS)
       • Similar to PCA: reduces the factors (x_i) to new orthogonal "latent variables" (t_i)
       • Compared to PCA, adds a correlation between the observed data and the latent variables (y ~ a1·t1)
  40. Not limited
     • There are many regression algorithms, varying in
       • implementation
       • selection of factors
       • how best to judge a good model
     • Other methods
       • Gaussian processes
       • Molecular field analysis with partial least squares: CoMFA and derivatives, using 3D steric and electrostatic information
  41. Regression + category
     • A model can be a poor regression yet still a good classifier
  42. After
     • Once the models are built and you have an idea of their mathematical quality:
       • Look at the observed vs predicted plot
       • Try to understand the model
         • Do the descriptors make sense?
           • logP is important when modelling solubility
           • Why is a certain substructure so important?
  43. Outliers
     • What to do with outliers?
     • Prediction far from observed:
       • Are the compounds similar to the training set?
       • Are they outside your space of confidence?
     • Chemical similarity = activity similarity is not always true
       • There are activity cliffs (ref)
       • Interesting for SAR studies
     On Outliers and Activity Cliffs - Why QSAR Often Disappoints, Maggiora, J. Chem. Inf. Model., 2006, 46, 1535
     Structure-Activity Relationship Anatomy by Network-like Similarity Graphs and Local Structure-Activity Relationship Indices, Wawer et al., J. Med. Chem., 2008, 51 (19), pp 6075-6084
  44. A model is a model
     • It is not reality
     • It provides help for experimentation
       • Understand what happens
       • Reduce the number of experiments
       • It does not replace lab work
     • There is no single perfect model
       • It depends on the method, the data sets, the descriptors, the tuning parameters, ...
  45. Real correlation?
     • A decrease in marriages decreases the risk of death? Should we ban Church of England weddings?
     Why do we Sometimes get Nonsense-Correlations between Time-Series? A Study in Sampling and the Nature of Time-Series, Yule, Journal of the Royal Statistical Society, Vol. 89, No. 1 (Jan. 1926), pp 1-63
  46. Further: multiple targets
     • Large-scale models:
       • Prediction of multiple interactions at once
       • Need large databases
         • WOMBAT (literature), MDDR (patents)
         • ChEMBL
     • Identify side effects, or unknown beneficial effects
  47. Principle
     • SEA approach:
       • Similarity of a compound to a target's active ligands (similar to BLAST)
       • Website: http:// /
     • Multiple-category Bayesian model:
       • Each fingerprint bit gets a different weight for each target: the sum differs by target
     • Output:
       • A list of proteins ranked by probability of binding
  48. References
     • An Introduction to Chemoinformatics, A. Leach and V. Gillet
     • Sheffield course (next one in 2011) and conference
     • Pipeline Pilot documentation, and Cheminformatics analysis and learning in a data pipelining environment, Hassan et al., Molecular Diversity (2006) 10: 283-299
     • Multiple targets:
       • Predicting new molecular targets for known drugs, Keiser et al., Nature 462, 175-181 (12 November 2009)
       • Relating protein pharmacology by ligand chemistry, Keiser et al., Nat Biotech 25 (2), 197-206 (2007)
       • Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases, Nidhi et al., J. Chem. Inf. Model., 2006, 46 (3), pp 1124-1133
       • Global mapping of pharmacological space, Paolini et al., Nat Biotech 25 (7), 805-815 (2006)
  49. Questions
  50. Practicals: using Pipeline Pilot, regression and classification