Oct 2011 ualr

Published in: Technology

  1. 1. Dr. David Wild Associate Director of Cheminformatics & Assistant Professor Indiana University School of Informatics http://djwild.info [email_address] Essential Cheminformatics UALR Chemistry Seminar Guest Lecture October 2011
  2. 2. Talk outline <ul><li>Some basic methods of chemoinformatics: representing chemical structures </li></ul><ul><li>Some more advanced methods and how they are used in drug discovery </li></ul><ul><li>Interesting current trends and directions </li></ul><ul><li>Shameless publicity for our research :-) </li></ul>
  3. 3. Learning objectives <ul><li>Understand how 2D and 3D chemical structures are represented on computer </li></ul><ul><li>Be able to generate a SMILES string for any simple chemical </li></ul><ul><li>Understand the part that Data Mining, Docking, and Virtual Screening play in modern drug design </li></ul><ul><li>Have some insights into the current trends and future directions of chemoinformatics </li></ul>
  4. 4. Historical ways of representing chemicals <ul><li>Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc. Identifies the compound, but gives little or no information about what it consists of </li></ul><ul><li>Chemical formula, e.g. C6H12O6. Specifies the type and quantity of the atoms in the compound, but not its structure (i.e. how the atoms are connected by bonds) </li></ul><ul><li>Systematic name, e.g. 1,2-dibromo-3-chloropropane. Identifies the atoms present and how they are connected by bonds. </li></ul>
  5. 5. 2D structure diagram <ul><li>Trivial name: </li></ul><ul><ul><li>tyrosine </li></ul></ul><ul><li>Systematic names: </li></ul><ul><ul><li>β-(p-hydroxyphenyl)alanine </li></ul></ul><ul><ul><li>α-amino-p-hydroxyhydrocinnamic acid </li></ul></ul>
  6. 6. Early computer representations <ul><li>How do we communicate structural information between humans and the computer? </li></ul><ul><ul><li>Line notations, e.g. Wiswesser Line Notation (and later SMILES) </li></ul></ul><ul><li>How do we represent the atoms and bonds in a molecule internally in a computer? </li></ul><ul><ul><li>Atom lookup and connection tables </li></ul></ul>
  7. 7. Linear notations <ul><li>Represent the atoms, bonds and connectivity of a molecule in a linear text string </li></ul><ul><li>Concise representation </li></ul><ul><li>Originally designed for manual command line entry into text-only systems </li></ul><ul><li>Now an excellent format for file and database storage (e.g. can be held in a spreadsheet cell, on one line of a text file, or in an Oracle database text field) </li></ul>
  8. 8. SMILES <ul><li>(One possible) SMILES for this structure is OC(=O)C(N)CC1=CC=C(O)C=C1 </li></ul><ul><li>Can identify any chemical structure </li></ul><ul><li>There can be several ways of writing the same structure in SMILES (although a system for generating canonical SMILES exists) </li></ul>
  9. 9. SMILES – Atoms & Bonds <ul><li>Atoms represented by their chemical symbol (C, N, S, O, Br, etc). Uppercase for aliphatic, lowercase for aromatic </li></ul><ul><li>Adjacent atoms implicitly single bonded, or = for double bond, or # for triple bond </li></ul><ul><li>Hydrogens usually implicit </li></ul>Example: propane, CCC
  10. 10. SMILES – Atoms & Bonds (same rules as slide 9). Example: 1-propanol, CCCO or OCCC
  11. 11. SMILES – Atoms & Bonds (same rules as slide 9). Example: propene, C=CC or CC=C
  12. 12. SMILES – Branching & Rings <ul><li>Parentheses represent branching </li></ul><ul><li>Ring closures represented by using numbers to signify attachment points </li></ul>Example: 2-propanol, CC(O)C
  13. 13. SMILES – Branching & Rings (same rules as slide 12). Example: cyclohexane, C1CCCCC1
  14. 14. SMILES – Branching & Rings (same rules as slide 12). Example: benzene, c1ccccc1
  15. 15. SMILES – Branching & Rings (same rules as slide 12). Example: chlorobenzene, c1cc(Cl)ccc1
  16. 16. SMILES – Acetaminophen (Tylenol). Acetaminophen: c1c(O)ccc(NC(=O)C)c1
  17. 17. SMILES – multiple ring structure. Indole: c1ccc2[nH]ccc2c1
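The SMILES rules on slides 9 to 17 can be turned into a small parser. The sketch below is illustrative only and is not part of the original talk: `parse_simple_smiles` is a hypothetical helper that handles just organic-subset atoms, the bond symbols -, = and #, parentheses, and single-digit ring closures. It ignores square brackets, charges and stereochemistry, and it records every implicit bond between two lowercase (aromatic) atoms as 1.5, which is a simplification.

```python
ORGANIC = {"B", "C", "N", "O", "P", "S", "F", "I", "Cl", "Br"}
AROMATIC = {"b", "c", "n", "o", "p", "s"}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_simple_smiles(smiles):
    """Return (atoms, bonds) for a restricted SMILES subset.

    atoms is a list of symbols; bonds is a list of (i, j, order) tuples
    using 0-based atom indices.
    """
    atoms, bonds = [], []
    prev = None          # atom that the next atom will bond to
    pending = None       # explicit bond order waiting for its second atom
    branch_stack = []    # atoms to return to when a branch closes
    open_rings = {}      # ring digit -> (atom index, bond order or None)

    def implicit_order(a, b):
        # Simplification: aromatic-aromatic implicit bonds count as 1.5.
        return 1.5 if atoms[a] in AROMATIC and atoms[b] in AROMATIC else 1

    i = 0
    while i < len(smiles):
        ch = smiles[i]
        two = smiles[i:i + 2]
        if two in ("Cl", "Br"):          # two-letter symbols first
            symbol, i = two, i + 2
        elif ch in ORGANIC or ch in AROMATIC:
            symbol, i = ch, i + 1
        else:
            symbol, i = None, i + 1

        if symbol is not None:
            atoms.append(symbol)
            cur = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, cur, pending or implicit_order(prev, cur)))
            prev, pending = cur, None
        elif ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:         # second occurrence closes the ring
                start, start_order = open_rings.pop(ch)
                order = pending or start_order or implicit_order(start, prev)
                bonds.append((start, prev, order))
            else:                        # first occurrence opens the ring
                open_rings[ch] = (prev, pending)
            pending = None
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


atoms, bonds = parse_simple_smiles("CC(O)C")   # 2-propanol from slide 12
print(atoms)   # ['C', 'C', 'O', 'C']
print(bonds)   # [(0, 1, 1), (1, 2, 1), (1, 3, 1)]
```

The output is already close to the atom lookup table and connection table described on the following slides. A real toolkit would also perceive aromaticity properly rather than inferring it from letter case.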
  18. 18. Other SMILES notes <ul><li>All hydrogen atoms are implicit unless declared otherwise </li></ul><ul><li>Atoms outside the organic subset (B, C, N, O, P, S, F, Cl, Br, I), explicit hydrogens and modified atoms need to be placed in square brackets, e.g. [Fe], [Xe] </li></ul><ul><li>Charged species indicated by a + or – (and square brackets), e.g. [Na+], [N+], [O-], [Ca++] </li></ul><ul><li>Unknown atoms can be represented by a * (but watch out for confusion with SMARTS!) </li></ul><ul><li>Stereochemistry can be indicated using @ and @@ </li></ul><ul><li>“Canonical SMILES” can be created </li></ul>
  19. 19. SMILES Homepage <ul><li>http://www.daylight.com/dayhtml/smiles/index.html </li></ul><ul><li>Official Syntax Guide </li></ul><ul><li>Tutorial </li></ul><ul><li>Examples </li></ul><ul><li>Resources </li></ul>
  20. 20. Internal computer representations <ul><li>Atom Lookup Table </li></ul><ul><ul><li>Stores the atomic names (and possibly other info such as valence, charge, etc.) of all atoms in the molecule </li></ul></ul><ul><li>Connection table </li></ul><ul><ul><li>Indicates which atoms are connected to which other atoms, and with what kind of bond </li></ul></ul>
  21. 21. Atom Lookup Table. Note that hydrogens are not normally stored explicitly. Their presence is inferred from the valence of the atom. Atom number and label: 1 C, 2 C, 3 C, 4 N, 5 C, 6 O, 7 C, 8 C, 9 C, 10 C, 11 O
  22. 22. Connection table. [Table: 11 × 11 connection matrix over atoms 1 to 11, with entries 0, 1 or 2] Atom number and label: 1 C, 2 C, 3 C, 4 N, 5 C, 6 O, 7 C, 8 C, 9 C, 10 C, 11 O
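As a minimal sketch of the two tables, the code below stores 2-propanol (CC(O)C from slide 12) as an atom lookup table plus a connection matrix, then infers the implicit hydrogens from valence. The names `atom_table`, `bond_list` and `DEFAULT_VALENCE` are made up for this illustration, and the valence table only holds for simple, uncharged organic molecules.

```python
# Atom lookup table: atom number -> element label (hydrogens left implicit).
atom_table = {1: "C", 2: "C", 3: "O", 4: "C"}

# Bonds of 2-propanol as (atom, atom, bond order).
bond_list = [(1, 2, 1), (2, 3, 1), (2, 4, 1)]

# Assumed normal valences for neutral atoms; not a general rule.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def connection_table(atom_table, bond_list):
    """Square matrix: entry [i][j] is the bond order between atoms i+1 and j+1."""
    n = len(atom_table)
    table = [[0] * n for _ in range(n)]
    for a, b, order in bond_list:
        table[a - 1][b - 1] = order
        table[b - 1][a - 1] = order   # the matrix is symmetric
    return table


def implicit_hydrogens(atom_table, table):
    """Hydrogen count per atom = normal valence minus the bond orders used."""
    return {
        num: DEFAULT_VALENCE[label] - sum(table[num - 1])
        for num, label in atom_table.items()
    }


table = connection_table(atom_table, bond_list)
for row in table:
    print(row)
print(implicit_hydrogens(atom_table, table))   # {1: 3, 2: 1, 3: 1, 4: 3}
```

The eight inferred hydrogens match the formula C3H8O, which is why the tables do not need to store them.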
  23. 23. Topological Graph Theory <ul><li>Branch of mathematics particularly applicable to the sciences and computer science </li></ul><ul><li>Study of “graphs” which consist of a set of “nodes” and a set of “edges” joining pairs of nodes </li></ul>
  24. 24. Structure Diagrams as Graphs <ul><li>2D structure diagrams are very like topological graphs </li></ul><ul><ul><li>atoms correspond to nodes </li></ul></ul><ul><ul><li>bonds correspond to edges </li></ul></ul><ul><li>Terminal hydrogen atoms are not normally shown as separate nodes (“implicit” hydrogens) </li></ul><ul><ul><li>reduces number of nodes by ~50% </li></ul></ul><ul><ul><li>“hydrogen count” information used to colour the neighbouring “heavy atom” </li></ul></ul><ul><ul><li>separate nodes sometimes used for “special” hydrogens </li></ul></ul><ul><ul><ul><li>deuterium, tritium </li></ul></ul></ul><ul><ul><ul><li>hydrogen bonded to more than one other atom </li></ul></ul></ul><ul><ul><ul><li>hydrogens attached to stereocentres </li></ul></ul></ul>
  25. 25. Why use graph theory? <ul><li>Established mathematical field </li></ul><ul><li>Large bank of existing algorithms </li></ul><ul><li>Unlike humans, computers aren’t very good at pattern recognition </li></ul>Images taken from lhinstrumental.com.ar, mathforum.org, gpb.org respectively
  26. 26. Subgraph Isomorphism: finding one graph within another <ul><li>For example, determine whether a particular structural fragment is present in a structure </li></ul><ul><li>The problem is NP-complete, so exhaustive algorithms take exponential time in the worst case </li></ul><ul><li>Ullmann algorithm </li></ul>
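To make the idea concrete, here is a plain backtracking search for a query fragment inside a target structure. It is a sketch of the exhaustive approach, not the Ullmann algorithm (which adds a refinement step to prune candidates early). The graph encoding (a list of atom labels plus a dict of neighbour sets) and the function name `find_substructure` are assumptions made for this example, and bond orders are ignored.

```python
def find_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return one mapping {query atom: target atom}, or None if no match.

    Labels are lists of element symbols; adjacency is a dict mapping each
    atom index to the set of its neighbours.
    """
    mapping = {}

    def extend(q):
        if q == len(q_labels):            # every query atom is placed
            return True
        for t in range(len(t_labels)):
            if t in mapping.values() or q_labels[q] != t_labels[t]:
                continue
            # Each already-placed query neighbour must map to a neighbour of t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]            # backtrack and try the next atom
        return False

    return dict(mapping) if extend(0) else None


# Target: 2-propanol, CC(O)C. Atoms 0..3 are C, C, O, C.
target_labels = ["C", "C", "O", "C"]
target_adj = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}

# Query: a carbon bonded to an oxygen.
query_labels = ["C", "O"]
query_adj = {0: {1}, 1: {0}}

print(find_substructure(query_labels, query_adj, target_labels, target_adj))
# {0: 1, 1: 2}
```

The search first tries the terminal carbon, fails to find an attached oxygen, and backtracks to the central carbon. In the worst case the number of partial mappings tried grows exponentially with the size of the query.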
  27. 27. Characterising 2D Structures with Fingerprints <ul><li>A “fingerprint” is made up of a set of descriptors for a molecule. Each descriptor describes (usually the presence or absence of) a particular 2D structural feature. </li></ul><ul><li>Most fingerprints are binary strings made up of zeros and ones </li></ul><ul><li>The fingerprint characterises the molecule, but doesn’t uniquely describe it. It is useful in many applications we will come to later, e.g. similarity, clustering, diversity. </li></ul>
  28. 28. Simple fingerprints (Structural Keys) <ul><li>A pre-defined dictionary of fragments is used, and each bit in a bitstring represents the presence or absence of that fragment in a particular compound </li></ul>[Figure: an example structure, the fragment dictionary, and the resulting bitstring … 1 0 0 1 1 0 1 1 1 0 1 1 …]
  29. 29. Hashed Fingerprints <ul><li>All possible fragments in a compound are generated </li></ul><ul><ul><li>All sequences of atoms from 2 - 7 atoms </li></ul></ul><ul><ul><li>Augmented atoms </li></ul></ul><ul><ul><li>Atom pairs </li></ul></ul><ul><li>Number of fragments represented can be huge </li></ul><ul><ul><li>100,000 for just the 2-7-length sequences for C, N, S, O, P, not considering bond types or generalizations </li></ul></ul><ul><li>Hashed onto a fixed number of bits (e.g. 1024) </li></ul><ul><li>Bits and fragments are not directly related </li></ul><ul><li>General - no pre-defined dictionary is required </li></ul>
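A rough sketch of the path-based part of this scheme follows. It enumerates every linear sequence of 2 to 7 atoms and hashes each onto a 1024-bit string. The helper names, the graph encoding, and the use of CRC32 as the hash are all choices made for this illustration, not the scheme used by any particular fingerprinting package. Bond types, augmented atoms and atom pairs are left out.

```python
import zlib


def linear_paths(labels, adj, max_atoms=7):
    """All linear atom sequences of 2..max_atoms atoms, as strings like 'C-C-O'."""
    paths = set()

    def walk(path):
        if len(path) >= 2:
            seq = tuple(labels[i] for i in path)
            # A path and its reverse are the same fragment: keep one spelling.
            paths.add("-".join(min(seq, seq[::-1])))
        if len(path) == max_atoms:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:           # no atom is visited twice
                walk(path + [nxt])

    for start in range(len(labels)):
        walk([start])
    return paths


def hashed_fingerprint(labels, adj, n_bits=1024, max_atoms=7):
    """Set one bit per fragment; different fragments may collide on a bit."""
    bits = [0] * n_bits
    for fragment in linear_paths(labels, adj, max_atoms):
        bits[zlib.crc32(fragment.encode()) % n_bits] = 1
    return bits


# 2-propanol, CC(O)C: atoms 0..3 are C, C, O, C.
labels = ["C", "C", "O", "C"]
adj = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}

print(sorted(linear_paths(labels, adj)))   # ['C-C', 'C-C-C', 'C-C-O', 'C-O']
fingerprint = hashed_fingerprint(labels, adj)
print(sum(fingerprint), "bits set out of", len(fingerprint))
```

Because the bit position comes from a hash, a set bit cannot be traced back to one fragment, which is the point made on the slide about bits and fragments not being directly related.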
  30. 30. Chemical Markup Language (CML) <ul><li>Variant of XML for internet communication of structures </li></ul><ul><li>See http://www.xml-cml.org or http://cml.sourceforge.net/ </li></ul>
  31. 31. Types of 2D searching <ul><li>Structure search </li></ul><ul><ul><li>“Is this structure in the database?” </li></ul></ul><ul><li>Substructure search </li></ul><ul><ul><li>“Find me all of the structures that contain this substructure” </li></ul></ul><ul><li>Similarity search </li></ul><ul><ul><li>“Find me all of the structures that are similar to this one” </li></ul></ul>
  32. 32. Structure search <ul><li>Looking for a particular structure in a database </li></ul><ul><ul><li>Searching proprietary databases or commercial databases </li></ul></ul><ul><li>e.g. is this structure in the database? </li></ul><ul><li>Mathematically, the connection table can be considered a graph, and this is a graph isomorphism problem (efficiently solvable in practice) </li></ul>
  33. 33. Substructure search <ul><li>Looking for all structures that contain one or more particular structural fragments </li></ul><ul><li>e.g. which structures contain a nitro group? </li></ul><ul><li>Mathematically, this is a subgraph isomorphism problem </li></ul><ul><li>Requires way of representing query fragment(s) </li></ul>
  34. 34. Similarity search <ul><li>Looking for all the structures in a database that are highly similar to a given structure </li></ul><ul><li>e.g. show me structures with a similarity greater than 0.7 to this molecule </li></ul><ul><li>Requires a way of measuring similarity </li></ul><ul><li>Solved using fingerprint representations and similarity coefficients </li></ul>
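One widely used similarity coefficient for binary fingerprints is the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below applies it with the 0.7 cut-off from the slide. The eight-bit fingerprints and compound names are invented toy data, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit lists."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


# Invented 8-bit fingerprints, for illustration only.
query = [1, 0, 0, 1, 1, 0, 1, 1]
database = {
    "compound_a": [1, 0, 0, 1, 1, 0, 0, 1],
    "compound_b": [0, 1, 1, 0, 0, 0, 0, 1],
}

for name, fp in database.items():
    print(name, round(tanimoto(query, fp), 3))
# compound_a 0.8
# compound_b 0.143

hits = [name for name, fp in database.items() if tanimoto(query, fp) > 0.7]
print(hits)   # ['compound_a']
```

A similarity search over a real database is the same loop run over every stored fingerprint, which is fast because each comparison is only bit counting.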
  35. 35. www.molinspiration.com/cgi-bin/search
  36. 36. Substructure search results
  37. 37. Similarity search results
  38. 38. Cluster analysis <ul><li>Refers to a group of statistical methods used for identifying groups (“clusters”) of similar items in a multi-dimensional space </li></ul><ul><li>Require a measure of distance or similarity between items </li></ul><ul><li>E.g. for a 2D space: </li></ul>
  39. 39. Cluster analysis applied to chemical information <ul><li>Three main uses: </li></ul><ul><ul><li>Grouping compounds into series, particularly helpful in analyzing large datasets (e.g. 1,000 series are easier to analyze than 50,000 arbitrary compounds) </li></ul></ul><ul><ul><li>Grouping structures which are likely to have similar biological activity, the premise being that if several compounds in a cluster are active, others are likely to be active too </li></ul></ul><ul><ul><li>Picking small sets of “representative compounds” from large datasets </li></ul></ul><ul><li>We already have measures of similarity and distance – Tanimoto and Euclidean </li></ul><ul><li>By incorporating these fingerprint-based methods, we can use standard cluster-analysis techniques for finding groups of similar structures in a dataset </li></ul>
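The slide does not name a specific clustering method, so the sketch below uses leader clustering as one simple possibility: each compound joins the first cluster whose leader it resembles closely enough, otherwise it starts a new cluster. Fingerprints are written as sets of "on" bit positions, and both the data and the 0.7 threshold are illustrative assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for fingerprints stored as sets of on-bit positions."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def leader_cluster(fingerprints, threshold=0.7):
    """Group fingerprint indices; the first index in each cluster is its leader."""
    clusters = []
    for i, fp in enumerate(fingerprints):
        for cluster in clusters:
            leader = fingerprints[cluster[0]]
            if tanimoto(leader, fp) >= threshold:
                cluster.append(i)
                break
        else:                     # no existing leader was similar enough
            clusters.append([i])
    return clusters


# Invented fingerprints: two obvious families of compounds.
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {10, 11, 12},
    {10, 11, 12, 13},
    {1, 2, 3},
]

clusters = leader_cluster(fingerprints)
print(clusters)                                   # [[0, 1, 4], [2, 3]]
print([cluster[0] for cluster in clusters])       # leaders: [0, 2]
```

The leaders double as a small set of representative compounds, the third use listed above. The result depends on input order, which is one reason other methods are often preferred for careful analysis.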
  40. 40. Diversity Analysis <ul><li>Arose in the late 1990s in response to the following needs: </li></ul><ul><ul><li>There was much interest as to how well the corporate collections held by pharmas “covered” possible chemistry / drug space </li></ul></ul><ul><ul><li>Combinatorial Chemistry experiments were producing many new compounds, and people wanted to know if these compounds added anything new to their corporate collections, i.e. if they made the datasets more diverse, or just replicated what was already in there </li></ul></ul><ul><ul><li>Libraries of thousands of compounds became available for purchase – are they worth the money? </li></ul></ul>
  41. 41. “Descriptor Space” <ul><li>If you choose a descriptor set (e.g. of n fingerprint bits), the “descriptor space” represents the space created if you plot each of the descriptors as a separate dimension </li></ul><ul><li>E.g. if we just had two descriptors (mol.wt. and LogP), our descriptor space would be: </li></ul>[Figure: 2D plot with axes Mol Wt. and LogP]
  42. 42. “Descriptor Space” <ul><li>People began to talk about “Chemistry Space” and “Drug Space”: </li></ul><ul><ul><li>Chemistry space – if you made all the possible compounds that could theoretically be made, the chemistry space represents the regions of a multi-dimensional descriptor space (as defined by a given descriptor set) that would be occupied </li></ul></ul><ul><ul><li>Drug space – the regions of the chemistry space that would be inhabited by drug molecules </li></ul></ul><ul><li>So questions began to be asked such as “how much of chemistry space does our corporate collection cover?”; “how could we cover more?”; “what about drug space?” etc. </li></ul>
  43. 43. Simple descriptor space for corporate collection. Taken from Shemetulskis et al., Enhancing the diversity of a corporate database using chemical database clustering and analysis, Journal of Computer-Aided Molecular Design, 1995, 9, 407-416
  44. 44. Sources of 3D information <ul><li>X-ray Crystallography </li></ul><ul><li>NMR Spectroscopy </li></ul><ul><li>Computer-generated 3D structures </li></ul><ul><li>X-ray and NMR methods apply to both small molecules and proteins </li></ul>
  45. 45. Experimental 3D Databases – Cambridge Structural Database <ul><li>Experimental X-ray structures for 261,000 structures (Jan 2004) </li></ul><ul><li>Various tools for searching the database (some available free) </li></ul><ul><li>More info at: </li></ul><ul><ul><li>http://www.ccdc.cam.ac.uk/ </li></ul></ul>
  46. 46. 3D representation on computer <ul><li>The Coordinate Table is an extension of the atom table which lists coordinates of atoms in 3D space relative to a defined origin </li></ul><ul><li>The Distance Matrix gives distances (in ångströms) between all pairs of atoms. Its main use is in comparison of 3D structures. It can be derived from the coordinate table. </li></ul><ul><li>These are usually stored in addition to a connection table. </li></ul>
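Deriving the distance matrix from the coordinate table is a direct application of the Euclidean distance formula. The sketch below uses the first three rows of the coordinate table on slide 47; the function name `distance_matrix` is just a label for this example, and `math.dist` requires Python 3.8 or later.

```python
import math

# First three atoms of the coordinate table on slide 47 (x, y, z in ångströms).
coords = [
    (-1.8920, -0.9920, -1.5760),   # atom 1, C
    (-1.3680, -2.1480, -0.9880),   # atom 2, C
    (-0.0760, -2.1440, -0.4640),   # atom 3, C
]


def distance_matrix(coords):
    """Full symmetric matrix of Euclidean distances between all atom pairs."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


for row in distance_matrix(coords):
    print([round(d, 1) for d in row])
# [0.0, 1.4, 2.4]
# [1.4, 0.0, 1.4]
# [2.4, 1.4, 0.0]
```

These values agree with the first entries of the distance matrix on slide 48. The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be stored.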
  47. 47. Coordinate Table
      Atom  Label        X        Y        Z
      1     C      -1.8920  -0.9920  -1.5760
      2     C      -1.3680  -2.1480  -0.9880
      3     C      -0.0760  -2.1440  -0.4640
      4     C       0.7080  -0.9840  -0.5200
      5     C       0.2000  -0.1560  -1.1960
      6     C      -0.1080   0.1600  -1.6520
      7     O       2.0840  -1.0280   0.1040
      8     O       2.5320  -2.0320   0.6360
      9     C       2.8760   0.0240   0.1120
      10    O       0.7520   1.3320  -1.0840
      11    O       0.6680   2.0240   0.0320
      12    C       1.3000   3.0600   0.1520
      13    C      -0.2400   1.5760   1.4440
  48. 48. Distance Matrix (distances in Å, upper triangle only; figure callouts: 4.8 Å, 3.5 Å)
      Atom      2    3    4    5    6    7    8    9   10   11   12   13
      1       1.4  2.4  2.8  2.4  3.8  4.8  4.2  1.4  2.4  2.7  2.9  4.3
      2            1.4  2.4  2.8  4.3  5.1  5.0  2.4  3.7  3.9  4.2  5.6
      3                 1.4  2.4  3.8  4.2  4.8  2.8  4.2  4.7  4.9  6.4
      4                      1.4  2.5  2.8  3.6  2.4  3.7  4.7  4.6  6.1
      5                           1.5  2.4  2.3  1.4  2.3  3.7  3.5  4.8
      6                                1.3  1.2  2.5  2.8  4.4  3.9  5.0
      7                                     2.2  3.7  4.1  5.7  5.2  6.3
      8                                          2.8  2.5  4.2  3.5  4.3
      9                                               1.4  2.6  2.3  3.7
      10                                                   2.2  1.3  2.5
      11                                                        1.2  2.4
      12                                                             1.5
  49. 49. Other forms of 3D information <ul><li>Surface (van der Waals, Connolly, volume) </li></ul><ul><li>Properties projected onto surface (electrostatics, hydrophobics) </li></ul><ul><li>Fields (energy, force, electrostatic, steric, hydrophobic) </li></ul><ul><li>Atom-based properties (charge, hydrophobicity, etc) </li></ul>
  50. 50. Visualization - JMOL. http://jmol.sourceforge.net/
  51. 51. Accelrys Discovery Studio
  52. 52. Docking algorithms <ul><li>Require 3D atomic structure for protein, and 3D structure for compound (“ligand”) </li></ul><ul><li>May require initial rough positioning for the ligand </li></ul><ul><li>Will use an optimization method to try and find the best rotation and translation of the ligand in the protein, for optimal binding affinity </li></ul>
  53. 53. Genetic Algorithms <ul><li>Create a “population” of possible solutions, encoded as “chromosomes” </li></ul><ul><li>Use “fitness function” to score solutions </li></ul><ul><li>Good solutions are combined together (“crossover”) and altered (“mutation”) to provide new solutions </li></ul><ul><li>The process repeats until the population “converges” on a solution </li></ul>
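The four steps above can be sketched with a toy problem. In docking, a chromosome would encode the ligand's rotation, translation and conformation, and the fitness function would be a binding score; here, purely as a stand-in, chromosomes are bit lists and fitness is the number of 1 bits. All parameter values are arbitrary, the function name is hypothetical, and a fixed number of generations replaces a real convergence test.

```python
import random


def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    """Evolve bit-list chromosomes; return (best chromosome, best score per generation)."""
    rng = random.Random(seed)                       # seeded for repeatable runs
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)  # score and rank solutions
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]        # the fitter half may breed
        children = [population[0][:]]               # elitism: keep the best as is
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)          # one-point crossover
            child = mum[:cut] + dad[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]              # occasional bit-flip mutation
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = genetic_algorithm(sum)
print(history[0], "->", history[-1], "ones out of", len(best))
```

Because the best chromosome is always copied into the next generation unchanged, the best score never decreases from one generation to the next.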
  54. 54. Sample GOLD output <ul><li>GMP into RNaseT1 </li></ul>
  55. 55. Virtual Screening <ul><li>Using computational methods to predict which of a large set of compounds are likely to be active </li></ul><ul><li>Think of it as an “in silico” high throughput screen </li></ul><ul><li>Docking can be used to score molecules in 3D – but needs to be fast! </li></ul><ul><li>Cluster analysis – predict positive activity for compounds in the same cluster as known actives </li></ul><ul><li>Artificial Intelligence / Machine Learning methods can be used </li></ul>
  56. 56. Virtual Screening
  57. 57. More on Virtual Screening <ul><li>MolInspiration – free online virtual screening for GPCR / Kinase / Ion Channel inhibitors at http://www.molinspiration.com/services/vsinfo.html </li></ul>
  58. 58. Cheminformatics <ul><li>Application areas: Drug Discovery; Web; Biology; Green Chemistry </li></ul><ul><li>Applications shown: Data mining to find potential drugs; Analyzing experimental results; Improving activity of drugs; In silico toxicity testing; Cures for neglected diseases; Action of small molecules in cells; Chemical probes of protein function; Systems chemical biology; Chemistry search engines; Open Notebook Science; Mining online journals; Scientific information aggregation; Predicting toxicity (EU REACH); Safer & better pesticides; In silico biofuels research; Improving oil refinement </li></ul>
  59. 59. http://www.jcheminf.com/content/1/1/1
  60. 60. http://djwild.info
