QSAR and drug design ppt


The presentation describes molecular descriptors and machine learning terms, and how they are useful in cheminformatics.

Published in: Education, Technology

  • [1] Stenberg, P., Luthman, K., Ellens, H., Lee, C.-P., Smith, P. L., Lago, A., Elliott, J. D., Artursson, P. Prediction of the intestinal absorption of endothelin receptor antagonists using three theoretical methods of increasing complexity. Pharm. Res. 1999, 16, 1520–1526.
  • [2] Winiwarter, S., Bonham, N. M., Ax, F., Hallberg, A., Lennernäs, H., Karlen, A. Correlation of human jejunal permeability (in vivo) of drugs with experimentally and theoretically derived parameters. A multivariate data analysis approach. J. Med. Chem. 1998, 41, 4939–4949.
  • [3] Egan, W. J., Merz, K. M., Baldwin, J. J. Prediction of drug absorption using multivariate statistics. J. Med. Chem. 2000, 43, 3867–3877.
  • Lipophilicity may be modelled using simple physicochemical models. The partition coefficient P is a measure of lipophilicity and is usually determined experimentally by equilibrating a sample of the compound in an octanol/aqueous buffer mixture. The resulting emulsion is then separated. Once it has separated, the concentration of the drug in each layer is measured and the partition coefficient is calculated.
  • Don’t worry about exactly how this works. Calculations are almost always carried out by a computer program, and extra corrections are sometimes calculated and added to make the prediction especially good for a specific series of compounds.
  • As lipophilicity changes, so do many properties in addition to the strength of binding to the receptor. Some of these changes are desirable, others are not.
  • In the simplest case the question of molecular similarity is raised for two molecules. They are regarded as similar entities if either chemical/topological, pharmacological or biological properties match. The two structures on top are chemically similar to each other. This is reflected in their common sub-graph, or scaffold: they share 14 atoms. The two other structures at the bottom are less similar chemically (topologically) yet have the same pharmacological activity, namely they both are Angiotensin-Converting Enzyme (ACE) inhibitors.
  • Assessing the similarity or dissimilarity of two compounds is typically not done at the structural level (though that is possible, for instance by the size of the maximum common sub-graph). Instead, structures are encoded into (or represented by) a set of values that is numerically tractable. A wide variety of such encodings is available: molecular descriptors, molecular fingerprints and structural keys are all well-known approaches. Each can be represented by a multidimensional vector, and the similarity between the original structures can be expressed as the distance between the two vectors. The Euclidean distance is the most widely used dissimilarity function of this type. Another family of proximity measures treats the set of values as a series and calculates the ratio between matching and differing values at the same positions in the two series. A well-known example is the Tanimoto coefficient.
  • Topological chemical fingerprints encode structural properties of the chemical graph as a sequence of bits. The encoding (hashing) is not reversible, so two different structures can have the same fingerprint (which makes the name somewhat misleading). If a certain feature is present in the structure, for instance a C-O-H pattern, then specific bits in the series are always set; however, these very same bits can also be set by many other structural features. These properties make the hashed fingerprint suitable for structural comparison, substructure search and similarity search. Various encoding schemes are available; one takes all walks up to a given maximum length in the chemical graph as the patterns to be encoded. For each such structural pattern certain bits are ‘turned on’ in the fingerprint. Once a bit is turned on by a certain feature it remains 1; other features cannot cancel it out.
  • This example illustrates how a 10 bits long topological chemical fingerprint is created for a simple chain structure. In this example all walks up to 3 steps are considered, and 2 bits are set for each pattern.
  • Two β2 adrenoceptor agonist molecules, and their 64 bit hashed chemical fingerprints. These fingerprints clearly reflect the high structural similarity between the two compounds: there are only two bits that differ.
  • Bits in the hashed binary fingerprint cannot be interpreted: there is no way to infer properties of the original structure from its fingerprint. There are ways to construct fingerprints other than hashing. An example is the pharmacophore fingerprint, in which each pharmacophore point pair is associated with a histogram bar. Such a fingerprint is not binary, yet it allows fast comparison of the pharmacophores of chemical structures. The construction of such fingerprints is fairly straightforward. First, each atom is labeled with its pharmacophore type (e.g. hydrogen bond donor, hydrophobic, anionic etc.). Then the shortest path between each pair of points is calculated. Then a histogram is assigned to each pair of pharmacophore types (e.g. acceptor-acceptor, acceptor-donor, acceptor-hydrophobic etc.). Each histogram has the same predefined number of bins, which correspond to the topological distances considered (e.g. 1, 2, etc., up to 10). Each bin stores the number of pairs of the associated pharmacophore types lying at the given topological distance.
  • The pharmacophores of these two structures are the same (these are both ACE inhibitors). A topological cross-correlation pharmacophore fingerprint is constructed from the structures and the mapped pharmacophoric point types. (In this example only three different pharmacophore point types were considered: acceptor, donor and hydrophobic.) Each point type pair is counted in a corresponding histogram bin depending on their topological distance. (Topological distances from 1 to 6 bonds apart were considered.) The two histograms visibly represent the pharmacophoric similarity of the two compounds, especially specific (or hard) pharmacophore points (hydrogen bond acceptor and donor) related histograms (AA, DA) show significant similarity. Bars in the histograms of the second structure are higher due to the larger size of the corresponding molecular graph.
  • Equipped with fingerprints and dissimilarity metrics, virtual screening becomes straightforward. Structures, both the query and those in the target library, are transformed into fingerprints. The query fingerprint is compared against each target fingerprint using a dissimilarity metric. If the dissimilarity value obtained is below a predefined threshold, the corresponding structure is a database hit (or virtual hit).
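The screening loop described in these notes can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the presentation: the function names `tanimoto` and `screen`, the molecule names, the three 10-bit fingerprints and the 0.3 threshold are all hypothetical, and dissimilarity is assumed to be 1 minus the Tanimoto coefficient.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length bit strings such as '0101'."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))    # bits set in both
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))   # bits set in at least one
    return both / either if either else 1.0


def screen(query, library, max_dissimilarity):
    """Return names of library entries whose dissimilarity to the query is below the threshold."""
    hits = []
    for name, fingerprint in library.items():
        dissimilarity = 1.0 - tanimoto(query, fingerprint)  # assumed dissimilarity metric
        if dissimilarity < max_dissimilarity:
            hits.append(name)
    return hits


# Hypothetical 10-bit fingerprints, for illustration only.
library = {
    "mol_a": "1111011110",
    "mol_b": "1111011100",
    "mol_c": "0000100001",
}
hits = screen("1111011110", library, max_dissimilarity=0.3)
print(hits)
```

A lower threshold makes the screen stricter; with real libraries the fingerprints would come from a cheminformatics toolkit rather than being typed in by hand.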

    1. 1. Molecular Descriptors and Virtual Screening using Datamining approach Abhik Seal OSDD Cheminformatics
    2. 2. Aim of Cheminformatics Project <ul><li>To screen molecules that interact with potential TB targets using classifiers. </li></ul><ul><li>Dock the selected molecules with the targets to further screen them for leads. </li></ul><ul><li>Use cheminformatics techniques such as QSAR, 3D-QSAR and ADMET prediction to look for potential leads, and design drugs from the leads by building combinatorial libraries. </li></ul>
    3. 3. Tuberculosis <ul><li>Obstacles for Drug Design </li></ul><ul><li>The HIV epidemic has dramatically increased the risk of developing active TB. </li></ul><ul><li>Increasing emergence of multi-drug resistant TB (MDR-TB) </li></ul><ul><li>Emergence of extensively drug-resistant (XDR) TB strains </li></ul><ul><li>XDR-TB is characterized by resistance to at least the two first-line drugs rifampicin and isoniazid and additionally to a fluoroquinolone and an injectable drug (kanamycin) </li></ul><ul><li>Existing TB drugs are therefore only able to target actively growing bacteria through the inhibition of cell processes such as cell wall biogenesis and DNA replication. </li></ul><ul><li>TB chemotherapy is characterized by an efficient bactericidal activity but an extremely weak sterilizing activity, i.e. an inability to kill slowly growing and slowly metabolizing strains. </li></ul>
    4. 4. Drugs Currently in Development: expected timelines towards approval of candidate drugs currently in the clinical stage of development (Sources: Global TB Alliance Annual Report 2004-2005; Stop TB Partnership Working Group on New Drugs for TB, Strategic Plan 2006-2015)
    5. 5. Commonly Used TB drugs and Targets
    6. 6. Main Properties of Anti TB drugs
    7. 7. QSAR and Drug Design [Diagram: compounds + biological activity, passed through QSAR, give new compounds with improved biological activity]
    8. 8. What is QSAR? <ul><li>QSAR is a mathematical relationship between a biological activity of a molecular system and its geometric and chemical characteristics. </li></ul><ul><li>A general formula for a quantitative structure-activity relationship (QSAR) can be given by the following: </li></ul><ul><li>activity = f (molecular or fragmental properties) </li></ul><ul><li>QSAR attempts to find consistent relationships between biological activity and molecular properties, so that these “rules” can be used to evaluate the activity of new compounds. </li></ul>
    9. 9. Molecule Properties <ul><li>SPC : Structure Property Correlation </li></ul>MOLECULE STRUCTURE. INTRINSIC PROPERTIES: molar volume, connectivity indices, charge distribution, molecular weight, polar surface area, ... CHEMICAL PROPERTIES: pKa, log P, solubility, stability. BIOLOGICAL PROPERTIES: activity, toxicity, biotransformation, pharmacokinetics.
    10. 10. Molecule Descriptors <ul><li>Molecular descriptors are numerical values that characterize properties of molecules. </li></ul><ul><li>The descriptors fall into four classes: </li></ul><ul><li>a) Topological </li></ul><ul><li>b) Geometrical </li></ul><ul><li>c) Electronic </li></ul><ul><li>d) Hybrid or 3D Descriptors </li></ul>
    11. 11. Classification of Descriptors <ul><li>Topological Descriptors </li></ul><ul><li>Topological descriptors are derived directly from the connection table representation of the structure and include: </li></ul><ul><li>a) Atom and bond counts </li></ul><ul><li>b) Substructure counts </li></ul><ul><li>c) Molecular connectivity indices (Wiener index, Randic index, Chi index) </li></ul><ul><li>d) Kappa indices </li></ul><ul><li>e) Path descriptors </li></ul><ul><li>f) Distance-sum connectivity </li></ul><ul><li>g) Molecular symmetry </li></ul>
    12. 12. Geometrical Descriptors <ul><li>Geometrical descriptors are derived from the three-dimensional representations and include: </li></ul><ul><li>a) principal moments of inertia, </li></ul><ul><li>b) molecular volume, </li></ul><ul><li>c) solvent-accessible surface area, </li></ul><ul><li>d) charged partial surface area, </li></ul><ul><li>e) molecular surface area </li></ul>
    13. 13. Electronic Descriptors <ul><li>Electronic descriptors characterize the molecular structures with quantities such as: </li></ul><ul><li>dipole moment, </li></ul><ul><li>quadrupole moment, </li></ul><ul><li>polarizability, </li></ul><ul><li>HOMO and LUMO energies, </li></ul><ul><li>dielectric energy, </li></ul><ul><li>molar refractivity </li></ul>
    14. 14. Hybrid and 3D Descriptors <ul><li>geometric atom pairs and topological torsions </li></ul><ul><li>spatial autocorrelation vectors </li></ul><ul><li>WHIM indices </li></ul><ul><li>BCUTs </li></ul><ul><li>GETAWAY descriptors </li></ul><ul><li>Topomers </li></ul><ul><li>pharmacophore fingerprints </li></ul><ul><li>EVA descriptors </li></ul><ul><li>Descriptors of molecular field </li></ul>
    15. 15. Limit Of Descriptors <ul><li>The data set should contain at least 5 times as many compounds as descriptors in the QSAR. </li></ul><ul><li>The reason is that too few compounds relative to the number of descriptors will give a falsely high correlation: </li></ul><ul><li>2 points exactly determine a line. </li></ul><ul><li>3 points exactly determine a plane (etc.) </li></ul><ul><li>A data set of drug candidates that is similar in size to the number of descriptors gives a meaningless correlation. </li></ul>
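The point that "2 points exactly determine a line" can be checked numerically. The sketch below is illustrative only: `enough_compounds` and `r_squared_of_line` are hypothetical helper names, and the two data points are arbitrary made-up values, not measurements. It fits a one-descriptor least-squares line to two points and shows that r² comes out as 1 regardless of what the values mean.

```python
def enough_compounds(n_compounds, n_descriptors, ratio=5):
    """Rule of thumb from the slide: at least `ratio` compounds per descriptor."""
    return n_compounds >= ratio * n_descriptors


def r_squared_of_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares and return r^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Two arbitrary (made-up) points: one descriptor, two "compounds".
two_point_r2 = r_squared_of_line([1.0, 4.0], [2.5, -0.7])
print(two_point_r2)               # a perfect fit that carries no information
print(enough_compounds(20, 4))    # 20 compounds for 4 descriptors
print(enough_compounds(12, 4))    # 12 compounds for 4 descriptors
```

The same effect appears with more descriptors: a model with as many free parameters as data points can always reproduce the training activities, which is why the compound-to-descriptor ratio matters.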
    16. 16. Tools To Calculate Molecular Descriptors (Freely Available) <ul><li>CDK tool: http://rguha.net/code/java/cdkdesc.html </li></ul><ul><li>PowerMV: http://nisla05.niss.org/PowerMV/?q=PowerMV/ </li></ul><ul><li>Mold2: http://www.fda.gov/ScienceResearch/BioinformaticsTools/Mold2/default.htm </li></ul><ul><li>PaDEL-Descriptor: http://www.downv.com/Windows/install-PaDEL-Descriptor-10439915.htm </li></ul>
    17. 17. ADMET Descriptors to Screen Molecules
    18. 18. Bioavailability <ul><li>The bioavailability of a compound depends on the following factors: </li></ul>Bioavailability: Absorption, Liver Metabolism, Permeability, Gut-wall Metabolism, Transporters, Lipophilicity, Solubility, Flexibility, Hydrogen Bonding, Molecular Size/Shape
    19. 19. PREDICTION OF ADMET PROPERTIES <ul><li>Requirements for a drug: </li></ul><ul><ul><li>Must bind tightly to the biological target in vivo </li></ul></ul><ul><ul><li>Must pass through one or more physiological barriers (cell membrane or blood-brain barrier) </li></ul></ul><ul><ul><li>Must remain long enough to take effect </li></ul></ul><ul><ul><li>Must be removed from the body by metabolism, excretion, or other means </li></ul></ul><ul><li>ADMET: Absorption, Distribution, Metabolism, Excretion (Elimination), Toxicity </li></ul>
    20. 20. Lipinski Rule of Five(Oral Drug Properties) <ul><li>Poor absorption or permeation is more likely when: </li></ul><ul><ul><li>MW > 500 </li></ul></ul><ul><ul><li>Log P >5 </li></ul></ul><ul><ul><li>More than 5 H-bond donors (sum of OH and NH groups) </li></ul></ul><ul><ul><li>More than 10 H-bond acceptors (sum of N and O atoms) </li></ul></ul>
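The four criteria on this slide translate directly into a small filter. The sketch below assumes the four properties have already been computed by some descriptor tool; the function name `rule_of_five_violations` and the two property sets are hypothetical and do not correspond to real compounds.

```python
def rule_of_five_violations(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Count how many of the four Rule of Five criteria a compound violates."""
    checks = [
        mol_weight > 500,        # MW > 500
        log_p > 5,               # Log P > 5
        h_bond_donors > 5,       # more than 5 H-bond donors (OH and NH groups)
        h_bond_acceptors > 10,   # more than 10 H-bond acceptors (N and O atoms)
    ]
    return sum(checks)


# Made-up property values for two hypothetical compounds.
small_polar = rule_of_five_violations(350.4, 2.1, 2, 5)
large_greasy = rule_of_five_violations(720.0, 6.3, 6, 12)
print(small_polar, large_greasy)
```

How many violations to tolerate before discarding a compound is a project decision; the slide only states that poor absorption or permeation becomes more likely as the criteria are exceeded.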
    21. 21. Polar Surface Area <ul><li>Defined as the amount of molecular (van der Waals) surface arising from polar atoms (nitrogen and oxygen atoms together with their attached hydrogens) </li></ul><ul><li>PSA seems to optimally encode those drug properties which play an important role in membrane penetration: molecular polarity, H-bonding features and also solubility. </li></ul><ul><li>It provides excellent correlations with transport properties of drugs. (PSA is used in the prediction of oral absorption, brain penetration, intestinal absorption and Caco-2 permeability.) </li></ul><ul><li>It has also been effectively used to characterize drug likeness during virtual screening and combinatorial library design. </li></ul><ul><li>The calculation of PSA, however, is rather time-consuming because of the necessity to generate a reasonable 3D molecular geometry and to calculate the surface itself. </li></ul><ul><li>Peter Ertl introduced an extremely rapid method to obtain the PSA descriptor simply from the sum of contributions of polar fragments in a molecule, without the necessity to generate its three-dimensional (3D) geometry. </li></ul>
    22. 22. PSA In Intestinal absorption <ul><li>Intestinal absorption is usually expressed as fraction absorbed (FA), the percentage of the initial dose appearing in the portal vein. </li></ul><ul><li>A PSA model was built for β-adrenoreceptor antagonists [1]. An excellent sigmoidal relationship between PSA and FA after oral administration was obtained. Similar sigmoidal relationships can also be obtained for the topological PSA (TPSA). </li></ul><ul><li>These results suggest that drugs with a PSA < 60 Å² are almost completely (more than 90%) absorbed, whereas drugs with a PSA > 140 Å² are absorbed to less than 10%. This conclusion was later confirmed by the correct classification of a set of endothelin receptor antagonists as having either low, intermediate or high permeability. </li></ul><ul><li>PSA was also shown to play an important role in explaining human in vivo jejunum permeability [2]. A model based on PSA and LogP for the prediction of drug absorption was developed for 199 well absorbed and 35 poorly absorbed compounds [3]. </li></ul>
    23. 23. PSA In Blood-brain barrier (BBB) penetration <ul><li>Drugs that act on the CNS need to be able to cross the BBB in order to reach their target, while minimal BBB penetration is required for other drugs to prevent CNS side effects. </li></ul><ul><li>A common measure of BBB penetration is the ratio of the drug concentrations in the brain and the blood, which is expressed as log(Cbrain/Cblood). </li></ul><ul><li>Van de Waterbeemd and Kansy were probably the first to correlate the PSA of a series of CNS drugs with their membrane transport. They obtained a fair correlation of brain uptake with single-conformer PSA and molecular volume descriptors. </li></ul><ul><li>Clark et al. derived a model for 55 compounds using TPSA and LogP </li></ul><ul><li>LogBB = 0.516 - 0.115 * TPSA </li></ul><ul><li>n = 55, r² = 0.686, r = 0.828, σ = 0.42 </li></ul><ul><li>TPSA in combination with ClogP </li></ul><ul><li>LogBB = 0.070 - 0.014 * TPSA + 0.169 * ClogP </li></ul><ul><li>n = 55, r² = 0.787, r = 0.887, σ = 0.35 </li></ul><ul><li>The great majority of orally administered CNS drugs have a PSA < 70 Å². An analysis of non-CNS compounds suggested that these have a PSA < 120 Å². </li></ul><ul><li>Thus, the majority of orally absorbed compounds that do not penetrate the CNS have PSA values between 70 and 120 Å². </li></ul>
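As a worked illustration of the two-descriptor model on this slide, the sketch below evaluates LogBB from TPSA and ClogP. The coefficients are taken from the slide at face value and have not been checked against the original publication; the function names and the input values (TPSA = 60 Å², ClogP = 2) are hypothetical.

```python
def log_bb(tpsa, clogp):
    """LogBB from the slide's two-descriptor model (coefficients as quoted there)."""
    return 0.070 - 0.014 * tpsa + 0.169 * clogp


def brain_to_blood_ratio(log_bb_value):
    """Convert LogBB = log10(Cbrain / Cblood) back to a concentration ratio."""
    return 10 ** log_bb_value


# Made-up descriptor values for a hypothetical compound.
example_log_bb = log_bb(tpsa=60.0, clogp=2.0)
example_ratio = brain_to_blood_ratio(example_log_bb)
print(round(example_log_bb, 3), round(example_ratio, 2))
```

Given the reported σ of 0.35 log units, a prediction like this is best read as a rough ranking aid rather than a precise concentration ratio.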
    24. 24. Partition coefficients. 1-Octanol is the most frequently used lipid phase in pharmaceutical research. This is because: <ul><li>It has a polar and a non-polar region (like a membrane phospholipid) </li></ul><ul><li>P o/w is fairly easy to measure </li></ul><ul><li>P o/w often correlates well with many biological properties </li></ul><ul><li>It can be predicted fairly accurately using computational models </li></ul>The partition coefficient P (usually expressed as log10 P or logP) describes the equilibrium X(aqueous) ⇌ X(octanol) and is defined as: P = [X]octanol / [X]aqueous. P is a measure of the relative affinity of a molecule for the lipid and aqueous phases in the absence of ionisation.
    25. 25. Calculation of logP. LogP for a molecule can be calculated from a sum of fragmental or atom-based terms plus various corrections: logP = Σ fragments + Σ corrections. Example (clogP for Windows output): PHENYLBUTAZONE, C: 3.16, M: 3.16
        Class      | Type   | Log(P) contribution description       | Value
        FRAGMENT   | # 1    | 3,5-pyrazolidinedione                 | -3.240
        ISOLATING  | CARBON | 5 Aliphatic isolating carbon(s)       |  0.975
        ISOLATING  | CARBON | 12 Aromatic isolating carbon(s)       |  1.560
        EXFRAGMENT | BRANCH | 1 chain and 0 cluster branch(es)      | -0.130
        EXFRAGMENT | HYDROG | 20 H(s) on isolating carbons          |  4.540
        EXFRAGMENT | BONDS  | 3 chain and 2 alicyclic (net)         | -0.540
        RESULT     | 2.11   | All fragments measured                | clogP 3.165
    26. 26. logP. What else does logP affect? Binding to enzyme / receptor; aqueous solubility; binding to P450 metabolising enzymes; absorption through membrane; binding to blood / tissue proteins (less drug free to act); binding to hERG heart ion channel (cardiotoxicity risk). So log P needs to be optimised.
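The phenylbutazone output on this slide is a worked instance of "sum of fragments plus corrections", and the arithmetic can be reproduced directly. The sketch below only adds up the six contribution values listed on the slide; the variable names and the shortened row labels are mine, and nothing here reimplements the CLOGP algorithm itself.

```python
# Contribution values copied from the phenylbutazone example on the slide.
contributions = [
    ("3,5-pyrazolidinedione fragment", -3.240),
    ("5 aliphatic isolating carbons", 0.975),
    ("12 aromatic isolating carbons", 1.560),
    ("1 chain and 0 cluster branches", -0.130),
    ("20 hydrogens on isolating carbons", 4.540),
    ("3 chain and 2 alicyclic (net) bonds", -0.540),
]

# logP = sum of fragment terms + sum of correction terms
clogp = sum(value for _, value in contributions)
print(round(clogp, 3))
```

The total agrees with the clogP value of 3.165 reported in the RESULT row of the slide.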
    27. 27. ADMET Descriptors Calculation Tools <ul><li>PreADMET http://preadmet.bmdrc.org/ </li></ul><ul><li>Molecular Descriptors Calculation - 1081 diverse molecular descriptors </li></ul><ul><li>Drug-Likeness Prediction - Lipinski rule, lead-like rule, Drug DB like rule </li></ul><ul><li>ADME Prediction - Caco-2, MDCK, BBB, HIA, plasma protein binding and skin permeability data </li></ul><ul><li>Toxicity Prediction - Ames test and rodent carcinogenicity assay </li></ul><ul><li>SPARC Online Calculator http://ibmlc2.chem.uga.edu/sparc/ </li></ul><ul><li>SPARC on-line calculator for prediction of pKa, solubility, polarizability, and other properties; a search in the database of experimental pKa values is also available </li></ul><ul><li>Daylight Chemical Information Systems </li></ul><ul><li>www.daylight.com/daycgi/clogp </li></ul><ul><li>Calculation of log P by the CLOGP algorithm from BioByte; also access to the LOGPSTAR database of experimental log P data. </li></ul>
    28. 28. ADMET Tools Continued <ul><li>Molinspiration Cheminformatics www.molinspiration.com/services/index. </li></ul><ul><li>Calculation of molecular properties relevant to drug design and QSAR, including log P, polar surface area, Rule of Five parameters, and drug-likeness index </li></ul><ul><li>Pirika - www.pirika.com </li></ul><ul><li>Calculation of various types of molecular properties, including boiling point, vapor pressure, and solubility; web demo restricted to only aliphatic molecules </li></ul><ul><li>Actelion - www.actelion.com/page/property_explorer </li></ul><ul><li>Calculation of molecular weight, logP, solubility, drug-score and toxicity risk. </li></ul><ul><li>Virtual Computational Chemistry Laboratory www.vcclab.org </li></ul><ul><li>Prediction of log P and water solubility based on associative neural networks as well as other parameters; comparison of various prediction methods </li></ul>
    29. 29. Virtual Screening
    30. 30. Ways to Assess Structures from a Virtual Screening Experiment <ul><li>Use a previously derived mathematical model that predicts the biological activity of each structure </li></ul><ul><li>Run substructure queries to eliminate molecules with undesirable functionality </li></ul><ul><li>Use a docking program to ID structures predicted to bind strongly to the active site of a protein (if target structure is known) </li></ul><ul><li>Filters remove structures not wanted in a succession of screening methods </li></ul>
    31. 31. Main Classes of Virtual Screening Methods <ul><li>Depend on the amount of structural and bioactivity data available </li></ul><ul><ul><li>One active molecule known: perform similarity search (ligand-based virtual screening) </li></ul></ul><ul><ul><li>Several active molecules known: try to ID a common 3D pharmacophore, then do a 3D database search </li></ul></ul><ul><ul><li>Reasonable number of active and inactive structures known: train a machine learning technique </li></ul></ul><ul><ul><li>3D structure of the protein known: use protein-ligand docking </li></ul></ul>
    32. 32. STRUCTURE-BASED VIRTUAL SCREENING <ul><li>Protein-Ligand Docking </li></ul><ul><ul><li>Aims to predict 3D structures when a molecule “docks” to a protein </li></ul></ul><ul><ul><ul><li>Need a way to explore the space of possible protein-ligand geometries (poses) </li></ul></ul></ul><ul><ul><ul><li>Need to score the ligand poses such that the score reflects the binding affinity of the ligand </li></ul></ul></ul><ul><ul><ul><li>Need to score or rank the poses to ID the most likely binding mode and assign a priority to the molecules </li></ul></ul></ul><ul><ul><li>Problem: involves many degrees of freedom (rotation, conformation) and solvent effects </li></ul></ul><ul><li>Conformations of ligands in complexes often have very similar geometries to minimum-energy conformations of the isolated ligand </li></ul>
    33. 33. Protein-Ligand Docking Methods <ul><li>Modern methods explore orientational and conformational degrees of freedom at the same time </li></ul><ul><ul><li>Monte Carlo algorithms (change the conformation of the ligand or subject the molecule to a translation or rotation within the binding site) </li></ul></ul><ul><ul><li>Genetic algorithms </li></ul></ul><ul><ul><li>Incremental construction approaches </li></ul></ul>
    34. 34. Distinguish “Docking” and “Scoring” <ul><li>Docking involves the prediction of the binding mode of individual molecules </li></ul><ul><ul><li>Goal: ID the orientation closest in geometry to the observed X-ray structure </li></ul></ul><ul><li>Scoring ranks the ligands using some function related to the free energy of association of the two units </li></ul><ul><ul><li>The DOCK function looks at atom pairs between 2.3 and 3.5 Angstroms apart </li></ul></ul><ul><ul><li>A pair-wise linear potential looks at attractive and repulsive regions, taking into account steric and hydrogen bonding interactions (e.g. MolDock) </li></ul></ul>
    35. 35. Structure-Based Virtual Screening: Other Aspects <ul><li>Computationally intensive and complex </li></ul><ul><li>Multitude of possible parameters figure into docking programs </li></ul><ul><li>Docking programs require 3D conformation as the starting point or require partial atomic charges for protein and ligand </li></ul><ul><li>X-Ray Crystallographic studies don’t include hydrogens, but most docking programs require them. </li></ul>
    36. 36. Ligand Based Virtual Screening <ul><li>The ligand-based approach mainly uses pharmacophore maps and QSAR to identify or modify a lead in the absence of a known three-dimensional structure of the receptor. It is necessary to have experimental affinities and molecular properties of a set of active compounds for which the chemical structures are known. </li></ul><ul><li>a) PHARMACOPHORE: A pharmacophore is an explicit geometric hypothesis of the critical features of a ligand. Standard features include H-bond donors and acceptors, charged groups, and hydrophobic patterns. The hypothesis can be used to screen databases for compounds and to refine existing leads. </li></ul><ul><li>For a geometric alignment of the functional groups of the leads, it is necessary to specify the conformations that individual compounds adopt in their bound state. </li></ul><ul><li>Since the simple presence of a pharmacophoric fingerprint is not sufficient for predicting activity, inactive compounds possessing the required pharmacophoric features must also be considered. </li></ul><ul><li>By comparing the volume of the active and the inactive compounds, a common volume can be constructed in order to approximate the shape of the (unknown) receptor site, to further refine the pharmacophore model and to screen out additional compounds. </li></ul>
    37. 37. Pharmacophore Modelling Workflow [Diagram labels: 3D compound structures; feature analysis; set of conformers; align to template; compare; validation; pharmacophore; application]
    38. 38. Continued <ul><li>b) QSAR: The goal of QSAR studies is to predict the activity of new compounds based solely on their chemical structure. The underlying assumption is that the biological activity can be attributed to incremental contributions of the molecular fragments determining the biological activity. This assumption is called the linear free energy principle. Information about the strength of interactions is captured for each compound by, for example, steric, electronic, and hydrophobic descriptors. </li></ul>
    39. 39. Molecular similarity and searching. What is it? Chemical, pharmacological or biological properties of two compounds match. The more common features, the higher the similarity between two molecules. [Figure: two pairs of molecules, labelled Chemical and Pharmacophore] The two structures on top are chemically similar to each other. This is reflected in their common sub-graph, or scaffold: they share 14 atoms. The two structures above are less similar chemically (topologically) yet have the same pharmacological activity, namely they both are Angiotensin-Converting Enzyme (ACE) inhibitors.
    40. 40. Molecular similarity. How to calculate it? <ul><li>Quantitative assessment of similarity/dissimilarity of structures </li></ul><ul><li>need a numerically tractable form </li></ul><ul><li>molecular descriptors, fingerprints, structural keys </li></ul>Sequences/vectors of bits, or numeric values, that can be compared by distance functions and similarity metrics. E = Euclidean distance, T = Tanimoto index.
    41. 41. Molecular descriptors: a) chemical fingerprint <ul><li>hashed binary fingerprint </li></ul><ul><li>encodes topological properties of the chemical graph: connectivity, edge label (bond type), node label (atom type) </li></ul><ul><li>allows the comparison of two molecules with respect to their chemical structure </li></ul><ul><li>Construction </li></ul><ul><li>find all 0, 1, …, n step walks in the chemical graph </li></ul><ul><li>generate a bit array for each walk with a given number of bits set </li></ul><ul><li>merge the bit arrays with the logical OR operation </li></ul>
    42. 42. Molecular descriptors Example 1: chemical fingerprint. Example: CH3 – CH2 – OH, walks from the first carbon atom. This example illustrates how a 10-bit topological chemical fingerprint is created for a simple chain structure. In this example all walks of up to 3 steps are considered, and 2 bits are set for each pattern.
        length | walk          | bit array
        0      | C             | 1010000000
        1      | C – H         | 0001010000
        1      | C – C         | 0001000100
        2      | C – C – H     | 0001000010
        2      | C – C – O     | 0100010000
        3      | C – C – O – H | 0000011000
        Merged bit arrays for the first carbon atom: 1111011110
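The merge step in this example is a bitwise OR over the per-walk bit arrays, which is easy to verify. In the sketch below the six bit arrays are copied from the slide; the function name `merge_bit_arrays` and the text keys used for the walks are illustrative. How a walk is hashed to its two bit positions is not specified on the slide, so that part is not modelled here.

```python
# Walks from the first carbon of CH3-CH2-OH and their bit arrays, as listed on the slide.
walk_bits = {
    "C": "1010000000",
    "C-H": "0001010000",
    "C-C": "0001000100",
    "C-C-H": "0001000010",
    "C-C-O": "0100010000",
    "C-C-O-H": "0000011000",
}


def merge_bit_arrays(bit_arrays):
    """Combine equal-length bit strings with logical OR; a set bit is never cleared."""
    arrays = list(bit_arrays)
    width = len(arrays[0])
    merged = 0
    for bits in arrays:
        if len(bits) != width:
            raise ValueError("all bit arrays must have the same length")
        merged |= int(bits, 2)
    # Zero-pad back to the original fingerprint width.
    return format(merged, "0{}b".format(width))


fingerprint = merge_bit_arrays(walk_bits.values())
print(fingerprint)
```

The result matches the merged array given on the slide. Several walks share bit positions (for instance position 4), which is why the fingerprint cannot be decoded back into its patterns.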
    43. 43. Molecular Similarity Example 1: chemical fingerprint 01000101000101000100000000011010100110101000000 1 0100000000100000 01000101000101000100000000011010100110101000000 0 0100000000100000
    44. 44. Molecular descriptors Example 2: pharmacophore fingerprint <ul><li>encodes pharmacophore properties of molecules as frequency counts of pharmacophore point pairs at a given topological distance </li></ul><ul><li>allows the comparison of two molecules with respect to their pharmacophore </li></ul><ul><li>Construction </li></ul><ul><li>map pharmacophore point types to atoms </li></ul><ul><li>calculate the length of the shortest path between each pair of atoms </li></ul><ul><li>assign a histogram to every pair of pharmacophore point types and count the frequency of the pair with respect to its distance </li></ul>
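The three construction steps can be sketched on a toy molecular graph. Everything specific below is an assumption made for illustration: the five-atom chain, the assignment of donor (D), hydrophobic (H) and acceptor (A) types, the function names, and the choice to store the histograms as a single counter keyed by (type pair, distance).

```python
from collections import Counter, deque


def shortest_path_lengths(adjacency, start):
    """Breadth-first search: topological distance (in bonds) from start to every atom."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return distance


def pharmacophore_fingerprint(adjacency, point_types, max_distance=6):
    """Count pharmacophore point type pairs per topological distance."""
    counts = Counter()
    typed_atoms = sorted(point_types)
    for i, first in enumerate(typed_atoms):
        distance = shortest_path_lengths(adjacency, first)
        for second in typed_atoms[i + 1:]:
            d = distance.get(second)
            if d is not None and 1 <= d <= max_distance:
                # Sort the two type letters so that "DA" and "AD" share one histogram.
                pair = "".join(sorted(point_types[first] + point_types[second]))
                counts[(pair, d)] += 1
    return counts


# Hypothetical 5-atom chain 0-1-2-3-4; atoms 1 and 3 carry no pharmacophore type.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
point_types = {0: "D", 2: "H", 4: "A"}
fp = pharmacophore_fingerprint(adjacency, point_types)
print(sorted(fp.items()))
```

Two such counters can then be compared bin by bin, for example with a Euclidean distance over the union of their keys.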
    45. 45. Molecular descriptors Example 2: pharmacophore fingerprint Pharmacophore point type based coloring of atoms: acceptor , donor , hydrophobic , none.
    46. 46. Virtual screening using fingerprints [Figure: an individual query structure and the target structures are converted into 64-bit fingerprints; the query fingerprint is compared with the target fingerprints by proximity, and the closest targets are returned as hits]
    47. 47. Hypothesis Fingerprints Advantages Disadvantages <ul><li>strict conditions for hits if actives are fairly similar </li></ul><ul><li>false results with asymmetric metrics </li></ul><ul><li>misses common features of highly diverse sets </li></ul><ul><li>very sensitive to one missing feature </li></ul><ul><li>captures common features of more diverse active sets </li></ul><ul><li>less selective if actives are very similar </li></ul><ul><li>captures common features of more diverse active sets </li></ul><ul><li>specific treatment of the absence of a feature </li></ul><ul><li>less sensitive to outliers </li></ul><ul><li>less selective if actives are very similar </li></ul>
    48. 48. SUMMARY <ul><li>Virtual screening methods are central to many cheminformatics problems in: </li></ul><ul><ul><li>Design </li></ul></ul><ul><ul><li>Selection </li></ul></ul><ul><ul><li>Analysis </li></ul></ul><ul><li>Increasing numbers of molecules can be evaluated using these techniques </li></ul><ul><li>Reliability and accuracy remain as problems in docking and predicting ADMET properties </li></ul><ul><li>Need much more reliable and consistent experimental data </li></ul>
    49. 49. Datamining and Machine Learning Approaches to Virtual Screening
    50. 50. Idea of Datamining <ul><li>Data mining is the discovery of patterns in data, for example: </li></ul><ul><li>a) a hunter looks for patterns in animal migration behavior. </li></ul><ul><li>b) farmers seek patterns in crop growth. </li></ul><ul><li>c) politicians seek patterns in voters' opinions. </li></ul><ul><li>d) chemists seek patterns in compound structures. </li></ul><ul><li>The patterns that are discovered must be meaningful and lead to some advantage. </li></ul><ul><li>The process must be automatic or semiautomatic. </li></ul>
    51. 51. Canonical learning Problems <ul><li>Supervised Learning : given examples of inputs and corresponding desired outputs, predict outputs on future inputs. </li></ul><ul><li>a) Classification </li></ul><ul><li>b) Regression </li></ul><ul><li>c) Time series prediction </li></ul><ul><li>Unsupervised Learning : given only inputs, automatically discover representations, features, structure, etc. </li></ul><ul><li>a) Clustering </li></ul><ul><li>b) Outlier detection </li></ul><ul><li>c) Compression </li></ul>
    52. 52. Datamining Methods <ul><li>Substructural Analysis </li></ul><ul><li>Each substructural fragment is assumed to make a contribution to activity irrespective of the other fragments in the molecule. The idea is to derive a weight for each fragment that reflects its tendency to occur in active or inactive molecules. The sum of the weights of a molecule's fragments gives its score, which allows a new set of structures to be ranked in decreasing probability of activity. </li></ul><ul><li>The weight is calculated using the equation: </li></ul><ul><li>where act(i) is the number of active molecules that contain the i-th fragment and inact(i) is the number of inactive molecules that contain the i-th fragment. </li></ul>
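A minimal sketch of substructural analysis follows. The weight equation itself did not survive in this transcript, so the code assumes one simple scheme that is consistent with the definitions of act(i) and inact(i): w(i) = act(i) / (act(i) + inact(i)). The fragment names and the training set are hypothetical.

```python
from collections import Counter


def fragment_weights(training):
    """Derive a weight per fragment from (fragments, is_active) pairs.

    Assumed weighting scheme: w(i) = act(i) / (act(i) + inact(i)).
    """
    act, inact = Counter(), Counter()
    for fragments, is_active in training:
        # Count each fragment once per molecule.
        (act if is_active else inact).update(set(fragments))
    return {f: act[f] / (act[f] + inact[f]) for f in set(act) | set(inact)}


def score(fragments, weights):
    """Score a molecule as the sum of the weights of its fragments."""
    return sum(weights.get(f, 0.0) for f in set(fragments))


def rank(candidates, weights):
    """Order candidate names by decreasing score."""
    return sorted(candidates, key=lambda name: score(candidates[name], weights), reverse=True)


# Hypothetical training data: fragment sets with an active/inactive label.
training_set = [
    ({"COOH", "phenyl"}, True),
    ({"COOH"}, True),
    ({"phenyl", "NO2"}, False),
    ({"NO2"}, False),
]
weights = fragment_weights(training_set)
new_molecules = {"x": {"NO2"}, "y": {"COOH", "phenyl"}, "z": {"phenyl"}}
print(rank(new_molecules, weights))
```

Fragments never seen in training contribute zero here, which is a design choice of this sketch rather than something the slides specify.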
    53. 53. Discriminant algorithms <ul><li>The aim of discriminant analysis is to separate the molecules into their constituent classes. </li></ul><ul><li>The simplest case is linear discriminant analysis with two activity classes and two descriptors, which aims to find a straight line that separates the data such that the maximum number of compounds is correctly classified. </li></ul><ul><li>If more than two descriptors are used, the line becomes a hyperplane. </li></ul><ul><li>The idea is to express a class as a linear combination of attributes. </li></ul><ul><li>X = w0 + w1 a1 + w2 a2 + w3 a3 + ... </li></ul><ul><li>X = class; a1, a2, ... = attributes; w1, w2, ... = weights </li></ul>
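To make the linear combination X = w0 + w1 a1 + w2 a2 + ... concrete, the sketch below finds a separating line for a tiny two-descriptor data set. It uses the perceptron update rule, which is a swapped-in technique chosen for brevity and not necessarily the fitting method the slides have in mind; the descriptor values and the +1/-1 class coding are assumptions.

```python
def discriminant(attributes, w0, weights):
    """Evaluate X = w0 + w1*a1 + w2*a2 + ... for one molecule."""
    return w0 + sum(w * a for w, a in zip(weights, attributes))


def train_perceptron(samples, labels, epochs=100, rate=1.0):
    """Fit w0 and the weights with the perceptron rule.

    labels are +1 (active) or -1 (inactive). Training stops early once
    every sample is on the correct side of the line.
    """
    w0 = 0.0
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        errors = 0
        for attributes, label in zip(samples, labels):
            if label * discriminant(attributes, w0, weights) <= 0:
                # Misclassified: move the line towards the correct side.
                w0 += rate * label
                weights = [w + rate * label * a for w, a in zip(weights, attributes)]
                errors += 1
        if errors == 0:
            break
    return w0, weights


def classify(attributes, w0, weights):
    """Assign +1 or -1 according to the sign of X."""
    return 1 if discriminant(attributes, w0, weights) > 0 else -1


# Hypothetical data: two descriptors per molecule.
samples = [(1.0, 1.0), (2.0, 1.5), (4.0, 4.5), (5.0, 4.0)]
labels = [-1, -1, 1, 1]
w0, weights = train_perceptron(samples, labels)
print(w0, weights, [classify(s, w0, weights) for s in samples])
```

The perceptron only converges when the classes are linearly separable; for overlapping classes a proper discriminant analysis or another classifier would be needed.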
    54. 54. Neural Networks(NN) <ul><li>The two neural network architectures most commonly used in chemistry are feed-forward networks and Kohonen networks. </li></ul><ul><li>The feed-forward NN is a supervised learning method, as it uses the values of the dependent variables to derive the model. The Kohonen network, or self-organizing map (SOM), is an unsupervised method. </li></ul><ul><li>The feed-forward NN contains layers of nodes, with connections between all pairs of nodes in adjacent layers. A key feature is the presence of hidden nodes, which together with the back-propagation algorithm makes the network applicable to many fields. </li></ul><ul><li>The neural network must first be trained with a set of inputs. Once trained, it can be used to predict values for new, unseen molecules. </li></ul>
    55. 55. Neural Networks Continued... <ul><li>The figure below shows a feed-forward network with three hidden nodes and one output. </li></ul><ul><li>A Kohonen NN consists of a rectangular array of nodes, each of which has an associated vector that corresponds to the input data (descriptor values). </li></ul><ul><li>The data are presented to the network one molecule at a time, and the distance between each node vector and the molecule vector is determined using a distance metric. The node with the minimum distance becomes the winning node. </li></ul>
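The search for the winning node in a Kohonen network can be sketched as follows. Euclidean distance is assumed as the distance metric (the slide does not name one), and the 2 x 2 grid and descriptor values are hypothetical.

```python
import math


def winning_node(grid, molecule):
    """Return ((row, col), distance) of the node vector closest to the molecule vector.

    grid is a rectangular list of rows; each node is a descriptor vector of the
    same length as the molecule vector. Euclidean distance is an assumption.
    """
    best_position, best_distance = None, math.inf
    for row_index, row in enumerate(grid):
        for col_index, node_vector in enumerate(row):
            distance = math.dist(node_vector, molecule)
            if distance < best_distance:
                best_position, best_distance = (row_index, col_index), distance
    return best_position, best_distance


# Hypothetical 2 x 2 map with two descriptors per node.
kohonen_grid = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
]
position, distance = winning_node(kohonen_grid, (0.9, 0.2))
print(position, round(distance, 3))
```

In a full self-organizing map the winning node and its neighbours would then be moved towards the molecule vector; that update step is not described on the slide and is left out here.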
    56. 56. Disadvantages of Neural Networks <ul><li>It is difficult to design the network: there is no simple way to choose the number of hidden layers and nodes that will best fit the data. </li></ul><ul><li>Another practical issue is overtraining. An overtrained NN gives excellent results on the training data but performs poorly on unseen data (test data), because the network has memorized the training data. </li></ul><ul><li>One way to address this problem is to divide the data into training and test sets and to monitor performance on the test set during training. Test-set performance typically improves until it reaches a plateau and then starts to decline; at that point the network has its maximum predictive ability. </li></ul>
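The stopping rule described above can be sketched independently of any particular network. The function below only inspects a list of test-set scores recorded after each training epoch; the `patience` parameter (how many epochs without improvement are tolerated) and the score values are assumptions made for illustration.

```python
def best_stopping_epoch(test_scores, patience=3):
    """Return (epoch, score) at which held-out performance peaked.

    test_scores[i] is the score after epoch i (higher is better). The scan
    stops once no improvement has been seen for `patience` epochs.
    """
    if not test_scores:
        raise ValueError("need at least one score")
    best_epoch, best_score = 0, test_scores[0]
    for epoch, score in enumerate(test_scores[1:], start=1):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # performance has plateaued or declined for long enough
    return best_epoch, best_score


# Hypothetical learning curve: rises, plateaus, then declines.
scores = [0.60, 0.68, 0.74, 0.78, 0.79, 0.79, 0.77, 0.74, 0.70]
print(best_stopping_epoch(scores))
```

In practice the set used to decide when to stop is often kept separate from the final test set, so that the reported performance is not biased by the stopping decision.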
    57. 57. DECISION TREES(DT) <ul><li>With a feed-forward NN it is generally not possible to determine why the network gives a particular result for a given input; because of the complex interconnections between nodes, one cannot determine which properties are important. </li></ul><ul><li>Decision trees consist of a set of rules that associate molecular descriptor values with the property of interest. </li></ul><ul><li>A DT is a tree whose nodes contain specific rules. Each rule may correspond to the presence or absence of a particular feature. </li></ul><ul><li>In a DT one starts at the root node and follows the edge appropriate to the first rule. This continues until a terminal node is reached, at which point the molecule can be assigned to the active or inactive class. </li></ul><ul><li>DT algorithms such as ID3, C4.5 and C5.0 use information theory to choose which criterion to apply at each step. </li></ul><ul><li>In random forests, a small subset of the descriptors is randomly selected at each node rather than using the full set. </li></ul>
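The information-theoretic choice of a rule can be illustrated with entropy and information gain, the criterion used by ID3 (C4.5 uses the related gain ratio). This is a sketch for presence/absence features only; the feature names and the four-molecule data set are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on a boolean feature."""
    remainder = 0.0
    for value in (True, False):
        subset = [label for row, label in zip(rows, labels) if row[feature] == value]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Pick the feature with the highest information gain."""
    return max(rows[0], key=lambda feature: information_gain(rows, labels, feature))


# Hypothetical molecules described by presence/absence of two features.
molecules = [
    {"has_COOH": True, "has_ring": True},
    {"has_COOH": True, "has_ring": False},
    {"has_COOH": False, "has_ring": True},
    {"has_COOH": False, "has_ring": False},
]
classes = ["active", "active", "inactive", "inactive"]
print(best_split(molecules, classes))
```

A tree builder would apply `best_split` at the root, partition the molecules on the chosen feature, and repeat on each partition until the terminal nodes are pure or no features remain.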
    58. 58. Support Vector Machines(SVM) <ul><li>Support vector machines select a small number of critical boundary instances, called support vectors, from each class and build a linear discriminant function that separates them as widely as possible. </li></ul><ul><li>Molecules in the test set are mapped to the same feature space, and their activity is predicted according to which side of the hyperplane they fall on. </li></ul><ul><li>The distance to the boundary can be used to assign a confidence level to the prediction: the greater the distance, the higher the confidence. </li></ul><ul><li>The output of the SVM is given by f(x) = sign(g(x)), where g(x) = w^T x + b, w is a vector and b is a scalar. </li></ul><ul><li>A linear SVM can be applied only when the active and inactive compounds can be divided by a straight line (hyperplane) in the feature space. </li></ul>
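Given an already trained linear SVM, prediction and the distance-based confidence can be sketched as below. The values of w and b are hypothetical (no training is performed here), and using |g(x)| / ||w|| as the distance to the hyperplane is the standard geometric definition rather than something stated on the slide.

```python
import math


def g(x, w, b):
    """Linear decision value g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """f(x) = sign(g(x)): +1 for one side of the hyperplane, -1 for the other."""
    return 1 if g(x, w, b) >= 0 else -1


def distance_to_boundary(x, w, b):
    """Geometric distance from x to the hyperplane, usable as a confidence measure."""
    return abs(g(x, w, b)) / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical trained parameters and two test molecules.
w, b = (3.0, 4.0), -5.0
for molecule in [(3.0, 4.0), (1.0, 0.0)]:
    print(molecule, predict(molecule, w, b), distance_to_boundary(molecule, w, b))
```

The first molecule lies far from the boundary and the second close to it, so under the slide's reasoning the first prediction would be trusted more.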
    59. 59. SVM continued.... <ul><li>When the data cannot be separated linearly, kernel functions are used to map the data into a higher-dimensional space. </li></ul><ul><li>The output of the SVM is given by f(x) = sign(g(x)), and g(x) is given by </li></ul><ul><li>  </li></ul><ul><li>where K is the so-called kernel function, the suffix k denotes the k-th support vector, and m stands for the number of support vectors. </li></ul><ul><li>The Gaussian and polynomial kernel functions are commonly used. </li></ul>
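The kernel form of g(x) is missing from this transcript. The sketch below assumes the standard dual form, g(x) = sum over k = 1..m of c_k K(x_k, x) + b, where x_k are the support vectors and each coefficient c_k combines the learned multiplier with the class label of that support vector. The symbols c_k and b, the kernel parameters, and all numeric values are assumptions for illustration.

```python
import math


def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0) ** degree."""
    return (sum(xi * zi for xi, zi in zip(x, z)) + coef0) ** degree


def g(x, support_vectors, coefficients, b, kernel):
    """Assumed dual-form decision value: sum_k c_k * K(x_k, x) + b."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefficients)) + b


def predict(x, support_vectors, coefficients, b, kernel):
    """f(x) = sign(g(x))."""
    return 1 if g(x, support_vectors, coefficients, b, kernel) >= 0 else -1


# Hypothetical model with m = 2 support vectors, one per class.
svs = [(0.0, 0.0), (2.0, 2.0)]
coeffs = [1.0, -1.0]
print(predict((0.1, 0.1), svs, coeffs, 0.0, gaussian_kernel))
print(predict((1.9, 2.0), svs, coeffs, 0.0, gaussian_kernel))
```

Swapping `gaussian_kernel` for `polynomial_kernel` changes the shape of the decision boundary without changing the prediction code, which is the practical appeal of kernels.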
    60. 60. Strengths and Weaknesses of SVM <ul><li>Strengths </li></ul><ul><li>Training is relatively easy </li></ul><ul><li>No local optima </li></ul><ul><li>It scales relatively well to high-dimensional data </li></ul><ul><li>The trade-off between classifier complexity and error can be controlled explicitly </li></ul><ul><li>Non-traditional data such as strings and trees can be used as input to an SVM, instead of feature vectors </li></ul><ul><li>Weaknesses </li></ul><ul><li>Need to choose a “good” kernel function. </li></ul>
    61. 61. Measuring Classifier Performance <ul><li>N= total number of instances in the dataset </li></ul><ul><li>TPj= Number of True Positives for class j </li></ul><ul><li>FPj = Number of False positives for class j </li></ul><ul><li>TNj= Number of True Negatives for class j </li></ul><ul><li>FNj= Number of False Negatives for class j </li></ul><ul><li>Accuracy = </li></ul><ul><li>Sensitivity/recall = </li></ul><ul><li>Specificity/precision = </li></ul>
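The formulas on this slide did not survive in the transcript. The sketch below uses the standard two-class definitions, which may differ in notation from the originals: accuracy = (TP + TN) / N, sensitivity (recall) = TP / (TP + FN), specificity = TN / (TN + FP) and precision = TP / (TP + FP). Note that specificity and precision are distinct quantities even though the slide lists them together. The label vectors are made up.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for the class treated as positive."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classifier_metrics(tp, fp, tn, fn):
    """Standard two-class performance measures from the confusion counts."""
    n = tp + fp + tn + fn
    return {
        "accuracy": _ratio(tp + tn, n),
        "sensitivity": _ratio(tp, tp + fn),   # also called recall
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
    }


# Hypothetical labels: 1 = active, 0 = inactive.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(classifier_metrics(*confusion_counts(actual, predicted)))
```

For a multi-class problem the same counts can be computed per class j by treating class j as positive and all other classes as negative.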
    62. 62. Types of Datamining learning Process in Weka <ul><li>Classification learning: the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples. </li></ul><ul><li>Association learning: any association among features is sought, not just ones that predict a particular class value. </li></ul><ul><li>Clustering: groups of examples that belong together are sought. </li></ul><ul><li>Numeric prediction: the outcome to be predicted is not a discrete class but a numeric quantity. </li></ul>
    63. 63. Classifier Algorithms in WEKA <ul><li>a) Bayes classifiers: AODE, BAYES NET, NAÏVE BAYES, NAÏVE BAYES MULTINOMIAL, NAÏVE BAYES UPDATABLE </li></ul><ul><li>b) Trees: ADTREE, ID3, J48, LMT, NBTREE, RANDOM FOREST, RANDOM TREE, REP TREE </li></ul><ul><li>c) Functions: LINEAR REGRESSION, LOGISTIC, MULTILAYER PERCEPTRON, RBF NETWORK, SIMPLE LINEAR REGRESSION, SIMPLE LOGISTIC, SMO, SMO REG. </li></ul><ul><li>d) Rules: CONJUNCTIVE RULE, DECISION TABLE, JRIP, M5RULES, NNGE, ONE R, PRISM, ZERO R </li></ul>
    64. 64. Summary <ul><li>Machine learning is mainly applied to ligand-based drug screening, where it is used to calculate the optimal distance between the feature vectors of active and inactive compounds. </li></ul><ul><li>A kernel is essentially a similarity function with certain mathematical properties, and it is possible to define kernel functions over all sorts of structures, for example sets, strings, trees, and probability distributions. </li></ul><ul><li>Interest in neural networks appears to have declined since the arrival of support vector machines, perhaps because the latter generally require fewer parameters to be tuned to achieve the same (or greater) accuracy. </li></ul>
    65. 65. THANK YOU