Canonicalized systematic nomenclature in cheminformatics


Poster presented at the 229th National ACS Meeting in San Diego, 2005.


Canonicalized systematic nomenclature in chemoinformatics
And some new canonicalization tools from OpenEye
Jeremy J. Yang

Introduction

Canonicalization in chemoinformatics facilitates rigorous, unambiguous expression and handling of chemical data and knowledge. However, just as chemistry encompasses multiple levels of abstraction and modelling, no single canonicalization method is sufficient to solve all problems. This study reviews some existing canonicalization methodology and describes new methods implemented by the chemoinformatics library OEChem and other OpenEye tools.

Morgan demo and study

Fig 1: Morgan demo. Extended connectivity values and atom orders. Uses OEChem and Ogham. NCI Diversity set processed with no errors.

New: canonicalizing molfiles

Definition of canonicalization

A canonicalization algorithm must determine a single representation among many possible representations for an individual in its domain.

Benefits of canonicalization

•  testing equality of molecules
•  database search speed
•  rigorous informatics and thinking

N! (graph isomorphism is hard) – Morgan to the rescue

The Morgan algorithm [1] is the basis of most chemical canonicalization work since, and deserves careful study. In 1965 Harry L. Morgan published the algorithm already implemented at CAS for its compound registry system. This work, based on generic graph theory, comprises a theoretical solution to the problem of molecular canonicalization, and material validation of its efficacy.

More Morgan, and more

The Morgan algorithm was a huge step forward, but the basic algorithm has some shortcomings, in performance and comprehensiveness, which have been corrected by subsequent investigators. The resulting methods have been implemented and widely used in large scale database systems.
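The extended connectivity values mentioned under Fig 1 can be illustrated with a minimal sketch. It assumes a molecule given as a plain adjacency dictionary of heavy atoms and covers only the iterative refinement step; the atom numbering and tie-breaking that the full Morgan algorithm performs afterwards are not shown. The function name `extended_connectivity` is hypothetical and is not part of OEChem.

```python
from typing import Dict, List


def extended_connectivity(adjacency: Dict[int, List[int]]) -> Dict[int, int]:
    """Morgan-style extended connectivity (EC) values for a molecular graph.

    Each atom starts with its degree. On every pass an atom's value becomes
    the sum of its neighbours' values. Refinement stops when a pass no longer
    increases the number of distinct values, and the last assignment that
    did increase it is returned.
    """
    ec = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(ec.values()))
    while True:
        new_ec = {
            atom: sum(ec[nbr] for nbr in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:
            return ec
        ec, n_classes = new_ec, new_classes


# n-pentane as a chain of five carbons: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(extended_connectivity(pentane))
# → {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
```

Symmetry-equivalent atoms (0 and 4, 1 and 3) end up with equal values, which is why highly symmetric structures need further tie-breaking after this step.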
Some key contributions:

•  Morgan, 1965 → note to Harry: “You da man!” → CAS
•  Wipke & Dyott, 1974 → stereo-enhanced Morgan → MDL
•  Jochum & Gasteiger, 1977 → Morgan refinement → CACTVS
•  Shelley & Munk, 1977 → Morgan refinement
•  Weininger, 1988 → CANSMI canonical line notation → Daylight
•  Bradshaw, 1998 → parent compounds → GSK, Daylight
•  Delany & Sayle, 1999 → tautomers → OpenEye
•  InChI, 2004 → global canonical line notation

This study: canonical molecular descriptions, not descriptors

The study of graph theory and canonicalization applied to chemistry is extensive and diverse. Canonical descriptors which do not fully represent the model can be of great utility in statistical analyses but are not the focus of this nomenclature study.

Canonicalizing a connection table is not new and was discussed by Morgan [1] and others. But generating canonical forms of current standard formats is not widely done, for historical and practical reasons, despite the available benefits. Those benefits are increasingly accessible now that longer strings are more easily handled by existing computers. OEChem provides sufficient control to accomplish this task. However, the advantages of more terse canonical line notations remain.

Proposed algorithm:

The OpenEye chemoinformatics toolkit OEChem [12] employs an optimized Morgan-like canonical algorithm to generate canonical SMILES. In addition, the API provides a rich set of tools which can facilitate generation of canonical representations of many types, for many chemical and informational models, and for many standard file formats.

•  Remove non-structural data
•  Suppress hydrogens
•  Canonical atom order
•  Canonical bond order
•  Canonical Kekule bonding based on (selected) aromaticity model

•  OEChem::OECanonicalOrderAtoms()
•  OEChem::OECanonicalOrderBonds()
•  OEChem aromaticity models: OE, Daylight, Tripos, MDL, MMFF
•  OEChem: many file formats and flavors, low-level writers
•  QuacPac [13]: tautomers application and toolkit
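The "canonical atom order" and "canonical bond order" steps can be sketched without any toolkit. The example assumes a canonical rank for each atom has already been computed (for instance by a Morgan-like algorithm) and that the ranks are unique. The tuple-based connection table and the function name `reorder_table` are hypothetical illustrations; the sketch does not call the OEChem functions named above.

```python
from typing import Dict, List, Tuple

Atom = str                   # element symbol
Bond = Tuple[int, int, int]  # (atom index, atom index, bond order)


def reorder_table(
    atoms: List[Atom], bonds: List[Bond], rank: Dict[int, int]
) -> Tuple[List[Atom], List[Bond]]:
    """Renumber atoms by canonical rank, then sort the bond list.

    `rank` maps each original atom index to a unique canonical rank
    (assumed to come from a separate canonical ranking step).
    """
    # Original indices listed in canonical order
    order = sorted(range(len(atoms)), key=lambda i: rank[i])
    new_index = {old: new for new, old in enumerate(order)}
    new_atoms = [atoms[old] for old in order]
    # Store each bond with its lower atom index first, then sort all bonds
    new_bonds = sorted(
        (min(new_index[a], new_index[b]), max(new_index[a], new_index[b]), o)
        for a, b, o in bonds
    )
    return new_atoms, new_bonds


# Ethanol heavy atoms written in two different input orders.
# The ranks (terminal C = 0, middle C = 1, O = 2) are supplied by hand.
table_a = reorder_table(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)], {0: 0, 1: 1, 2: 2})
table_b = reorder_table(["O", "C", "C"], [(1, 0, 1), (2, 1, 1)], {0: 2, 1: 1, 2: 0})
print(table_a == table_b)
# → True
```

Once atoms and bonds are in a fixed order, writing the table through a deterministic file writer gives the same string for every input ordering, which is the property the molfile test relies on.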
Fig 2: Morgan slow due to symmetry.

Results: Using a test program, 1990 NCI Diversity set converted to canonical SDF files, exactly equal to SDF files converted via SMILES. Also done with MOL2 format. This test validates the ability of OEChem to canonicalize molfiles as strings.

Fig 3: Morgan fails

Aha! -- Chemo-taxonomy is a “stranded hierarchy”

•  subatomic → atoms → molecules
•  normal weight atoms → isotopes
•  Kekule molecule model → aromatic molecule models
•  non-stereo molecule → stereoisomers
•  single molecule → combinatorial libraries
•  single molecule → queries
•  small molecule → macromolecule + cofactors + ligands
•  single molecule → Markush structures
•  single molecule → tautomer set
•  single molecule → pKa states
•  single molecule → reactions
•  2D → 3D

There is a hierarchical relationship among some of these expansions while some are independent. For example, a combinatorial library may involve stereoisomeric or non-stereo individuals. For every combination of molecular representations, canonicalization could be advantageous for the reasons described. Hence the task of canonicalization is a multi-faceted one.

Dealing with reality: practical problems

1.  Existing formats (may often be):
    •  ambiguous – poorly defined spec or poor compliance
    •  un-rigorous – both syntax and semantics are important
    •  non-comprehensive – only organic, covalent, size limits
2.  Stereoisomer canonicalization remains difficult
    •  "relative stereo-centers"
3.  Differing valence assumptions and conventions
    •  implicit-valence and Hcount formats prone to mishandling
4.  Information content and model differences in existing formats
    •  cannot robustly convert if info must be inferred (e.g. bonds)
5.  Disagreement over correct chemistry
    •  e.g., valences, aromaticity
6.  Local versus global canonicalization
    •  Benefits of canonicalization are available locally or globally. Global canonicalization requires cooperation.
    •  Locality definition (time, place, software versions)

OpenEye canonicalization tools

New: canonical tautomers

Tautomers have the same formula (structural isomers), but may differ in proton and electron location, and formal bond order. Special cases: keto/enol, zwitterion, ring-chain. In the Delany/Sayle algorithm [8,13], hydrogen donors and acceptors are perceived, along with the number of free hydrogens. Donor and acceptor atoms are ordered canonically. At this stage all tautomerically equivalent inputs are represented identically. Hydrogen locations are then exhaustively enumerated. A simple ruleset for enumeration order can designate the first to be the canonical tautomer. Through additional rules, the likelihood can be increased that the canonical tautomer is a low-energy form.

Applications: registration (exact search), substructure searching, property prediction, similarity/clustering, protein-ligand analysis.

Failure to perceive tautomerism leads to different results for different valence models which really represent the same chemical entity.

Fig 4: example: tautomers listed separately in ACD98. The latter is the OE canonical form.

Results: The Maybridge 2003 database was analyzed by the OE program tautomers [13]. Of 71367 molecules, 97 have tautomers (47 pairs and one triplet). Additionally, 2381 were found to be non-unique molecules.

Fig 5: tautomer triplet from Maybridge 2003

New: canonical pKa states

The canonicalization of alternative pKa states is accomplished for many classes of molecules by the OpenEye program pkatyper [13]. This problem resembles tautomer canonicalization in many respects, and is an area of active research at OpenEye.

Conclusion

Rigorous and effective chemoinformatics systems require concepts and methods for canonicalization at multiple levels of chemical abstraction and organization. The current state of the art presents many theoretical and practical challenges. OpenEye tools can help.

References

1.  Harry L. Morgan, "Generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service", J. Chem. Doc. 1965, 5, 107.
2.  W. Todd Wipke, Thomas M. Dyott, "Stereochemically unique naming algorithm", J. Am. Chem. Soc. 1974, 96(15), 4834-4842.
3.  Clemens Jochum, Johann Gasteiger, "Canonical Numbering and Constitutional Symmetry", J. Chem. Inf. Comput. Sci. 1977, 17(2), 113-117.
4.  Craig A. Shelley, Morton E. Munk, "Computer Perception of Topological Symmetry", J. Chem. Inf. Comput. Sci. 1977, 17(2), 110-113.
5.  Craig A. Shelley, Morton E. Munk, "An Approach to the Assignment of Canonical Connection Tables and Topological Symmetry Perception", J. Chem. Inf. Comput. Sci. 1979, 19(4), 247-250.
6.  David Weininger, Arthur Weininger, Joseph L. Weininger, "SMILES 2: Algorithm for Generation of Unique SMILES Notation", J. Chem. Inf. Comput. Sci. 1989, 29(2), 97-101.
7.  "A beginner's guide to responsible parenting or knowing your roots", EuroMUG '98, Cambridge, UK, Oct 1998.
8.  Jack Delany, Roger Sayle, "Canonicalization and Enumeration of Tautomers", EuroMUG '99, Cambridge, UK, Oct 1999.
9.  Roger Sayle, Geoff Skillman, "Hooked on Protonics", 224th ACS National Meeting, Boston, Aug 2002.
10.  John Bradshaw, "Introduction to Chemical Info Systems", EuroMUG02, 24-26 September 2002, Cambridge, UK.
11.  "That InChI Feeling", Reactive Reports, Sep 2004 (issue 40).
12.  OEChem, OpenEye Scientific Software, 2002.
13.  QuacPac, OpenEye Scientific Software, 2004.

3600 Cerrillos Road, Suite 1107, Santa Fe, New Mexico 87507
505.473.7385