InChI for Large Molecules

Presented by Roger Sayle at the InChI for Large Molecules meeting, NCBI, Bethesda, MD, Monday 27th October 2014.

  1. InChI for large molecules: the NextMove Software perspective. Roger Sayle & Noel O’Boyle, NextMove Software, Cambridge, UK. InChI for Large Molecules meeting, NCBI, Bethesda, MD, Monday 27th October 2014.
  2. “This house believes…”
     • The most important distinction in life science informatics is between molecular and non-molecular (bio)chemistry, not between chemistry and biology.
     • Drawing fuzzy distinctions between “small molecules”, lipids, proteins, nucleic acids, peptides, oligosaccharides and terpenes is like asking how many colors there are in a rainbow (cf. the Sapir-Whorf hypothesis).
     • Schemes that encode these distinctions (such as HELM, ISO 11238 and even RasMol) break down when the (poorly defined) categories overlap.
  3. Peptide or not?
     • Valinomycin: cyclo[OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val]
  4. Saccharide or not?
     • D-Glucopyranose: D-gluco-hexopyranose
     • (2S)-2-methyloxane: (2S)-2-methyl-tetrahydropyran
  5. Saccharide or not?
     • D-Glucopyranose: D-gluco-hexopyranose
     • D-Quinovopyranose: 6-deoxy-Glucopyranose; 6-deoxy-D-gluco-hexopyranose
     • D-Paratopyranose: 3,6-dideoxy-Glucopyranose; 3,6-dideoxy-D-ribo-hexopyranose
     • D-Amicetopyranose: 2,3,6-trideoxy-Glucopyranose; 2,3,6-trideoxy-D-erythro-hexopyranose
     • (2S)-2-methyloxane: (2S)-2-methyl-tetrahydropyran
  6. The cutting edge of biosimilarity
     • The high prevalence of potentially life-threatening hypersensitivity reactions to the antibody cetuximab (Erbitux) in some US states has been traced to its glycosylation, which contains a Gal(α1-3)Gal epitope. Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for galactose-alpha-1,3-galactose”, New England Journal of Medicine, Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.
     • Similarly, human erythropoietin (EPO) alpha, beta, delta and omega share the same primary sequence but differ in their glycosylation patterns.
  7. Destructive suggestion…
     • Systems based upon monomer dictionaries (such as HELM and PDB) are notoriously difficult to maintain.
     • The limited number of monomers in proteinogenic peptides and natural nucleic acid sequences leads to a false sense of security: the belief that the set of monomers is finite.
     • In practice, the number of monomers and of post-translational and chemical modifications is infinite.
     • Allowing local custom definitions is even more difficult than standardizing monomer definitions via a central repository such as PDB.
  8. 48 hexopyranoses
  9. 264 deoxy-hexopyranoses
  10. 9540 substituted hexopyranoses (4 most common substituents)
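The slides give these counts without a derivation. As a hedged illustration only, one counting that happens to reproduce the figure of 48 is to enumerate the stereoisomers of aldohexopyranoses (five ring stereocentres, C1 to C5) together with ketohexopyranoses (four, C2 to C5). This partition is an assumption and may not be how the presenters arrived at the number. The counts of 264 and 9540 are not reproduced here. The helper name `count_stereoisomers` is hypothetical.

```python
from itertools import product


def count_stereoisomers(n_stereocentres: int) -> int:
    """Count every two-way assignment over n independent stereocentres."""
    return sum(1 for _ in product("ab", repeat=n_stereocentres))


# Assumption (not stated on the slide): the total combines
# aldohexopyranoses (stereocentres C1..C5) and
# ketohexopyranoses, i.e. hex-2-ulopyranoses (stereocentres C2..C5).
aldo = count_stereoisomers(5)
keto = count_stereoisomers(4)
total = aldo + keto

print(aldo, keto, total)
```

The point of the slides survives whatever the exact derivation is: each extra degree of freedom (deoxygenation, substitution) multiplies the number of distinct monomers a dictionary would need.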
  11. Constructive suggestion…
     • Ideally, a chemical identifier should be independent of the input representation or file format.
     • Duplicates between small molecules, peptides and proteins are best determined by a single identifier, preferably the existing InChI.
     • This is possible because increases in computer power and storage mean that cheminformatics toolkits can handle huge biopolymers on modern hardware.
  12. Proof-of-concept
     • I’ve previously reported on Tanimoto chemical search of the PDB (80K) represented as canonical SMILES (1Gb).
     • To test for duplicates and InChI key hash collisions, we attempted to generate InChI keys for UniProt.
     • The Open Babel source tree already contains patches to the InChI library that raise the official 1024-atom limit.
     • A few additional source changes also helped.
     • Ultimately, InChI keys could be generated for ~99.4% of the ~450K unique sequences in the Swiss-Prot division.
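The duplicate and collision test described on this slide can be sketched roughly as follows, assuming the InChI and its key have already been generated for each sequence by some toolkit. The function name, the record layout and the toy records are all hypothetical placeholders, not real InChIs or keys, and this is not the presenters' actual pipeline.

```python
from collections import defaultdict


def find_duplicates_and_collisions(records):
    """records: iterable of (name, inchi, inchikey) tuples.

    Returns (duplicates, collisions):
      duplicates: key -> sorted names that share one identical InChI
      collisions: key -> sorted distinct InChIs that hash to the same key
    """
    names_by_key = defaultdict(set)
    inchis_by_key = defaultdict(set)
    for name, inchi, key in records:
        names_by_key[key].add(name)
        inchis_by_key[key].add(inchi)

    duplicates, collisions = {}, {}
    for key, inchis in inchis_by_key.items():
        if len(inchis) > 1:
            # Same key, different full identifiers: a hash collision.
            collisions[key] = sorted(inchis)
        elif len(names_by_key[key]) > 1:
            # Same key, same full identifier: a genuine duplicate.
            duplicates[key] = sorted(names_by_key[key])
    return duplicates, collisions


# Hypothetical toy records (placeholder strings only).
records = [
    ("SEQ_A", "InChI=placeholder-1", "KEY-ONE"),
    ("SEQ_B", "InChI=placeholder-1", "KEY-ONE"),    # same structure twice
    ("SEQ_C", "InChI=placeholder-2", "KEY-TWO"),
    ("SEQ_D", "InChI=placeholder-3", "KEY-TWO"),    # different InChI, same key
    ("SEQ_E", "InChI=placeholder-4", "KEY-THREE"),
]
duplicates, collisions = find_duplicates_and_collisions(records)
print(duplicates)
print(collisions)
```

Comparing the full InChI strings behind each key is what separates a true duplicate from a collision; the key alone cannot tell the two apart.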
  13. Record-breaking InChI key
     • Sequence identifier: UTP10_KLULA
     • Sequence length: 1774 amino acids
     • Molecule size: 28509 atoms
     • InChI length: 119699 characters
     • InChI key: PHBRSEQMAKHFGD-ZBXWIJJNSA-N
     • InChI canonicalization time: 73.2s
     • Canonical SMILES length: 35408 characters
     • SMILES canonicalization time: 0.4s
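A small sketch putting the figures on this slide into proportion. It uses only the numbers quoted above and derives a few ratios. The regular expression encodes the usual InChIKey layout (14 letters, a hyphen, 8 letters plus a standard/non-standard flag and a version letter, a hyphen, one final letter). The variable names are illustrative.

```python
import re

# Figures quoted on the slide for UTP10_KLULA.
residues = 1774
atoms = 28509
inchi_chars = 119699
smiles_chars = 35408
inchi_seconds = 73.2
smiles_seconds = 0.4

atoms_per_residue = atoms / residues           # roughly 16
inchi_chars_per_atom = inchi_chars / atoms     # roughly 4.2
smiles_chars_per_atom = smiles_chars / atoms   # roughly 1.2
slowdown = inchi_seconds / smiles_seconds      # roughly 183x

# InChIKey layout check: 14 + 10 + 1 letters separated by hyphens.
INCHIKEY_SHAPE = re.compile(r"[A-Z]{14}-[A-Z]{8}[SN][A-Z]-[A-Z]")
key = "PHBRSEQMAKHFGD-ZBXWIJJNSA-N"
key_is_well_formed = INCHIKEY_SHAPE.fullmatch(key) is not None

print(f"{atoms_per_residue:.1f} atoms per residue")
print(f"{inchi_chars_per_atom:.1f} InChI chars per atom")
print(f"{smiles_chars_per_atom:.1f} SMILES chars per atom")
print(f"InChI canonicalization about {slowdown:.0f}x slower than SMILES")
print("key well formed:", key_is_well_formed)
```

However large the molecule, the key stays at 27 characters, which is why it remains usable as a lookup identifier even when the full InChI runs to over a hundred thousand characters.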
  14. Protein canonicalization time
  15. Protein canonicalization time
  16. Protein canonicalization time
  17. Conclusions
     • “InChI for large molecules” simply requires fixing the bugs in standard InChI.
  18. Acknowledgements
     • Lisa Sach-Peltason, Hoffmann-La Roche, Basel.
     • Joann Prescott-Roy, Novartis, Boston, MA.
     • Greg Landrum, Novartis, Basel, Switzerland.
     • Evan Bolton, NCBI PubChem project, Bethesda, MD.
  19. Competing interests statement
     • Diagram labels: PDB, IUPAC name, IUPAC condensed, SMILES, depictions, Sugar & Splice, common name, PLN, HELM
     • IUPAC condensed: L-Cys(1)-L-Tyr-L-Ile-L-Gln-L-Asp-L-Cys(1)-L-Pro-L-Leu-Gly-NH2
     • SMILES: [C@H]1(CCCN1C(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@@H](Cc2ccc(cc2)O)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(=O)O)C(=O)N1)N)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N
     • IUPAC name: L-cysteinyl-L-tyrosyl-L-isoleucyl-L-glutaminyl-L-alpha-aspartyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulfide
     • Common name: [5-L-aspartic acid]oxytocin
     • PLN: H-C(1)YIQDC(1)PLG-[NH2]
     • HELM: PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
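To make the relationship between two of these notations concrete, here is a minimal sketch that rewrites the PLN string shown on the slide into the IUPAC condensed form shown on the slide. It is a toy converter, not Sugar & Splice's method. The function name `pln_to_condensed` is hypothetical, and it handles only the subset of syntax visible in this one example: an `H-` N-terminus, one-letter L-residues with optional `(n)` bridge labels, and an optional `-[NH2]` C-terminal amide.

```python
import re

# One-letter to three-letter codes for the residues in this example only.
THREE_LETTER = {
    "C": "Cys", "Y": "Tyr", "I": "Ile", "Q": "Gln", "D": "Asp",
    "P": "Pro", "L": "Leu", "G": "Gly",
}

TOKEN = re.compile(r"([A-Z])(\(\d+\))?")


def pln_to_condensed(pln: str) -> str:
    """Convert e.g. 'H-C(1)YIQDC(1)PLG-[NH2]' to IUPAC condensed form.

    Assumption: only the syntax seen in the slide's example is supported.
    Without '-[NH2]' no C-terminal suffix is emitted.
    """
    if not pln.startswith("H-"):
        raise ValueError("expected an 'H-' N-terminus")
    body = pln[2:]
    amide = body.endswith("-[NH2]")
    if amide:
        body = body[: -len("-[NH2]")]

    parts = []
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None or m.group(1) not in THREE_LETTER:
            raise ValueError(f"unsupported token at position {pos}")
        name = THREE_LETTER[m.group(1)]
        prefix = "" if name == "Gly" else "L-"   # glycine is achiral
        parts.append(prefix + name + (m.group(2) or ""))
        pos = m.end()

    if amide:
        parts.append("NH2")
    return "-".join(parts)


condensed = pln_to_condensed("H-C(1)YIQDC(1)PLG-[NH2]")
print(condensed)
```

Even this toy converter needs a residue dictionary, which illustrates the maintenance burden raised earlier in the talk: any residue outside the table is rejected rather than guessed.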
  20. Peptide names imply architecture
     • Named peptides imply not only sequence but also N-terminal acetylation, C-terminal amidation and disulfide bridge topology.
     • Example named derivatives:
        • gastrin (14-17)
        • motilin amide
        • oxytocin free-acid
        • acetyl-oxytocin
        • deacetyl-abarelix
        • oxytocin reduced
        • endothelin-1 (1→3),(11→15)-bis(disulfide)