OPSIN: Taming the jungle of IUPAC chemical nomenclature

OPSIN (Open Parser for Systematic IUPAC Nomenclature) is an open-source, freely available program for converting chemical names, especially those that are systematic in nature, to chemical structures. The software is available as a Java library, as a command-line interface and as a web service (opsin.ch.cam.ac.uk). OPSIN accepts names that conform to either IUPAC or CAS nomenclature and can convert them to SMILES, InChI and CML (Chemical Markup Language).
OPSIN has grown from covering only simple general organic chemical nomenclature to the point of having competent coverage of all areas of organic chemical nomenclature. One of the most recent additions is comprehensive support for the nomenclature of carbohydrates. This brings support for dialdoses, diketoses, ketoaldoses, alditols, aldonic acids, uronic acids, aldaric acids, glycosides and oligosaccharides, in both the open-chain and cyclic forms, named systematically or from trivial sugar stems, with support for modification terms such as anhydro or deoxy.
OPSIN’s support for specialised and general organic nomenclature will be demonstrated through illustrative examples and accompanying performance metrics. We focus in particular on areas of nomenclature for which support was recently added and those that are complex to implement such as fused ring nomenclature.

Transcript

  • 1. OPSIN: Taming the jungle of IUPAC chemical nomenclature
      Daniel Lowe, Peter Murray-Rust, Robert C Glen
      8th September 2013, ACS, Indianapolis
      Example name: 4-[(19S,21R,26R,27S)-19,21-dihydroxy-27-methoxy-26-methylnonacosyl]phenyl 3,6-di-O-methyl-α-D-glucopyranosyl-(1→4)-2,3-di-O-methyl-α-L-rhamnopyranosyl-(1→2)-3-O-methyl-α-L-rhamnopyranoside
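  • 2. What is chemical name to structure?
      Example: (2S)-2-Aminobutan-1-ol, broken into its labelled parts:
      • (2S)- : stereochemistry
      • 2- : locant
      • Amino : substituent
      • but : alk
      • an : unsaturation
      • -1- : locant
      • ol : suffix
      [Figure: structure with an NH2 group and chain atoms numbered 1 to 4]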
  • 3. Uses of chemical name to structure
      • Identify documents by their chemical structures
      • Assist with structure viewing
      • Identify incorrect chemical names
      • Extract reagent structures, allowing reactions to be reconstructed from text
  • 4. [No transcribed text]
  • 5. Parsing
      • Over 4000 discrete morphemes form the program’s vocabulary (a morpheme is the smallest section of a word with meaning)
      • These are grouped into 140 classes, e.g.
          • unsaturator (‘ene’)
          • aminoAcidEndsInIne (‘tyros’)
          • simpleSubstituent (‘amino’)
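
As a rough illustration of the vocabulary idea on slide 5, the sketch below splits a name into morphemes by greedy longest match against a tiny hand-made vocabulary. It is not OPSIN's parser. The class `MorphemeSketch`, the method `tokenize` and the `alkaneStem` entry for ‘eth’ are assumptions made for this example; only the three class names and morphemes quoted on the slide come from the source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MorphemeSketch {
    // Toy vocabulary: morpheme -> class. The first three entries are from the slide;
    // "eth" / "alkaneStem" is a hypothetical addition so the example name can be split.
    static final Map<String, String> VOCAB = new LinkedHashMap<>();
    static {
        VOCAB.put("ene", "unsaturator");
        VOCAB.put("tyros", "aminoAcidEndsInIne");
        VOCAB.put("amino", "simpleSubstituent");
        VOCAB.put("eth", "alkaneStem");
    }

    // Returns "morpheme:class" tokens, or null if part of the name is not in the vocabulary.
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        String rest = name.toLowerCase();
        while (!rest.isEmpty()) {
            String best = null;
            // Pick the longest vocabulary entry that the remaining text starts with.
            for (String m : VOCAB.keySet()) {
                if (rest.startsWith(m) && (best == null || m.length() > best.length())) {
                    best = m;
                }
            }
            if (best == null) {
                return null; // unknown morpheme: give up rather than guess
            }
            tokens.add(best + ":" + VOCAB.get(best));
            rest = rest.substring(best.length());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [amino:simpleSubstituent, eth:alkaneStem, ene:unsaturator]
        System.out.println(tokenize("aminoethene"));
    }
}
```

Greedy matching is used here only for brevity. With a vocabulary of thousands of morphemes, some names can be split in more than one way, so a real parser needs a grammar or backtracking to choose between splits.
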
  • 6. Word rules (word rule: example)
      • acetal: Propanal dimethyl acetal
      • additionCompound: Carbon tetrachloride
      • acidHalideOrPseudoHalide: Cyanic chloride
      • amide: Nitrous amide
      • anhydride: Acetic anhydride
      • biochemicalEster: Adenosine 5'-triphosphate
      • carbonylDerivative: Propanone oxime
      • divalentFunctionalGroup: Diethyl ether
      • ester: Ethyl ethanoate
      • functionalClassEster: Acetic acid ethyl ester
      • functionGroupAsGroup: Cyanide
      • glycol: Ethylene glycol
      • glycolEther: Ethylene glycol monomethyl ether
      • hydrazide: Phosphoric hydrazide
      • monovalentFunctionalGroup: Ethyl alcohol
      • multiEster: Ethyl propyl methylphosphonate
      • oxide: Thiophene 1,1-dioxide
      • polymer: Poly(ethylene)
      • simple: Ethylbenzene
      • substituent: Chloro
  • 7. Supported chain nomenclature
      • Alkanes: dodectetractkiliane
      • Heteroatom hydrides: pentaphosphane
      • Heterogeneous heteroatom hydrides: disilazane
      • Trivial acids: butyric acid
  • 8. Supported ring nomenclature
      • Monocyclic spiro: dispiro[4.2.4.2]tetradecane
      • Hantzsch-Widman: 1,3,5-triazine
      • Fused ring: furo[3,2-b]thieno[2,3-e]pyridine
      • Ring assembly: 2,2':6',2''-terpyridyl
      • Von Baeyer: tricyclo[2.2.1.1^{2,5}]octane
      • Polycyclic spiro: spiro[piperidine-4,9'-xanthene]
  • 9. Structural assembly nomenclature
      • Conjunctive nomenclature: benzeneethanol
      • Substitutive nomenclature: 2,4,6-trinitrotoluene
      • Additive nomenclature: methylsulfonyl
      • Multiplicative nomenclature: 4,4'-methylenedioxydibenzoic acid
      • Functional class nomenclature: ethyl alcohol
  • 10. Structural modifications
      • Heteroatom replacement: 1-thia-4-aza-2,6-disilacyclohexane
      • Unsaturation: hexa-1,3-dien-5-yne
      • Hydro, dehydro, indicated hydrogen and added hydrogen: 2,7-dihydro-1H-azepine
      • Suffixes including infixed suffixes: methanedithioic acid
      • Functional replacement: 1-chloro-2,4-diimidotricarbonic acid
      • Lambda convention: 2λ⁶-trisulfane
  • 11. Bridges and stereochemistry
      • Bridges: 4a,8a-propanoquinoline
      • E/Z stereochemistry: (Z)-2-chloro-but-2-ene
      • Relative cis/trans stereochemistry: trans-2,6-dimethyl-2,6-dihydronaphthalene
      • R/S stereochemistry: (1R,3S)-3-amino-3-methylcyclohexanol
  • 12. Miscellaneous nomenclature
      • Groups with indeterminately positioned structural features: 1,3-xylene
      • Charge and oxidation numbers: methylmercury(1+) or methylmercury(II)
      • Subtractive nomenclature: 2-deoxy-ᴅ-ribose
      • “per-nomenclature”: perhydroanthracene, perchlorobenzene
  • 13. Polymer nomenclature
      • Structure-based polymer nomenclature: poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
  • 14. Domain specific nomenclature
      • Steroid nomenclature: 17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
      • Amino acid: ʟ-leucinamide
      • Oligopeptide: ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline
      • Cyclic peptide: cyclo(ᴅ-alanyl-ʟ-phenylalanyl)
      • Nucleotide nomenclature: guanylyl(3'-5')uridine 3'-monophosphate
  • 15. Carbohydrate nomenclature (acyclic)
      • Carbohydrates are defined using configurational prefixes that each specify the stereochemistry for between 1 and 4 stereocentres
      • Examples: ᴅ-gluco-hexose or ᴅ-glucose (preferred); ʟ-ribo-ᴅ-manno-nonose
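
To make the "between 1 and 4 stereocentres" point on slide 15 concrete, here is a minimal sketch that adds up how many stereocentres a string of configurational prefixes specifies. The prefix sizes are the standard carbohydrate ones (glycero: 1; erythro, threo: 2; ribo, arabino, xylo, lyxo: 3; allo, altro, gluco, manno, gulo, ido, galacto, talo: 4). The class and method names are hypothetical, ASCII D/L is assumed in place of the small capitals, and none of this reflects how OPSIN stores the prefixes internally.

```java
import java.util.HashMap;
import java.util.Map;

class ConfigPrefixSketch {
    // Configurational prefix -> number of stereocentres it specifies.
    static final Map<String, Integer> CENTRES = new HashMap<>();
    static {
        CENTRES.put("glycero", 1);
        for (String p : new String[] {"erythro", "threo"}) CENTRES.put(p, 2);
        for (String p : new String[] {"ribo", "arabino", "xylo", "lyxo"}) CENTRES.put(p, 3);
        for (String p : new String[] {"allo", "altro", "gluco", "manno",
                                      "gulo", "ido", "galacto", "talo"}) CENTRES.put(p, 4);
    }

    // Input is the prefix part of a name, e.g. "L-ribo-D-manno".
    // Parts that are not configurational prefixes (such as D and L) are skipped.
    static int stereocentresSpecified(String prefixes) {
        int total = 0;
        for (String part : prefixes.split("-")) {
            Integer n = CENTRES.get(part.toLowerCase());
            if (n != null) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // ribo (3) + manno (4) = 7, matching C2 to C8 of an open-chain nonose
        System.out.println(stereocentresSpecified("L-ribo-D-manno")); // prints 7
    }
}
```

The count for ʟ-ribo-ᴅ-manno-nonose is a useful sanity check: an open-chain aldose with nine carbons has seven stereocentres (C2 to C8), which is exactly what one three-centre and one four-centre prefix provide.
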
  • 16. Carbohydrate derivatives
      • These carbohydrate chains can then be algorithmically modified by suffixes
      • Examples: ᴅ-glucose, ᴅ-glucitol, ᴅ-glucaric acid, ᴅ-gluconic acid
  • 17. Carbohydrate nomenclature (cyclic)
      • ᴅ-glucose
      • α-ᴅ-glucopyranose
      • 2,7-anhydro-D-glycero-β-D-galacto-oct-2-ulopyranosonic acid
  • 18. Carbohydrate nomenclature (oligosaccharides)
      • β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside
      • β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose
  • 19. Fused ring nomenclature
      • All fused ring nomenclature is processed algorithmically, e.g. even benzofuran is constructed from benzene and furan rather than being treated as a trivial name
      • For example: benzo[b]cycloocta[jk]fluorene
      [Figure: the fused ring system, with rings labelled by size: 8, 6, 6, 6, 5]
  • 20. Fused ring nomenclature (numbering)
      • Transform to an idealised grid aligned along the longest row of rings
      • Apply quadrant rules, e.g. favour most rings in upper right quadrant
      [Figure: candidate grid layouts of the 8, 6, 6, 6, 5 ring system]
  • 21. Fused ring nomenclature (numbering)
      • Atoms numbered in ascending order from upper rightmost ring
      • Peripheral numbering rules used to choose grid layout that gives the best numbering
      [Figure: numbered ring system, with rings labelled by size: 6, 6, 6, 5, 8]
  • 22. Beyond IUPAC: CAS index name un-inversion (CAS index name → IUPAC name)
      • benzene, ethyl- → ethylbenzene
      • Disulfide, bis(2-chloroethyl) → Bis(2-chloroethyl) disulfide
      • Benzoic acid, 4,4'-methylenebis[2-chloro- → 4,4'-Methylenebis[2-chlorobenzoic acid]
      • Phosphoric acid, ethyl dimethyl ester → ethyl dimethyl phosphate
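
The first three rows of the slide 22 table can be reproduced with a very small amount of string handling, sketched below. This is a simplified illustration under stated assumptions, not OPSIN's un-inversion code: the class `CasUninversionSketch` and method `uninvert` are hypothetical, only a single ", " separator is handled, results are lower-cased for the parent part, and the ester row is deliberately reported as unsupported because it needs the acid to be renamed (phosphoric acid to phosphate).

```java
class CasUninversionSketch {
    // Returns an un-inverted name, or null for cases this sketch does not cover.
    static String uninvert(String casName) {
        // Split on comma + space so locant commas such as "4,4'" are left alone.
        int comma = casName.indexOf(", ");
        if (comma < 0) {
            return casName; // nothing to un-invert
        }
        String parent = casName.substring(0, comma).toLowerCase();
        String rest = casName.substring(comma + 2);
        if (rest.endsWith(" ester")) {
            return null; // would need acid-to-ester renaming; not handled here
        }
        if (rest.endsWith("-")) {
            // Substituent prefix: drop the trailing hyphen and attach the parent directly.
            StringBuilder name = new StringBuilder(rest.substring(0, rest.length() - 1));
            name.append(parent);
            // Close any square brackets that the inversion left open.
            int open = 0;
            for (char c : rest.toCharArray()) {
                if (c == '[') open++;
                else if (c == ']') open--;
            }
            for (int i = 0; i < open; i++) {
                name.append(']');
            }
            return name.toString();
        }
        // Otherwise the parent becomes a separate functional class word.
        return rest + " " + parent;
    }

    public static void main(String[] args) {
        System.out.println(uninvert("benzene, ethyl-"));               // prints ethylbenzene
        System.out.println(uninvert("Disulfide, bis(2-chloroethyl)")); // prints bis(2-chloroethyl) disulfide
    }
}
```

Real CAS index names can carry further comma-separated parts, such as stereochemistry or salt information, which this sketch does not attempt to handle.
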
  • 23. Beyond IUPAC: Correcting missing spaces
      • tert-butylacetate → tert-butyl acetate
      • tert-butyl-4-vinylperbenzoate: No locant and perbenzoate has more than one non-degenerate hydrogen
      • diethylcarbonate: Has no substitutable hydrogen
      • Ethylacetate: non-ester would be butanoate or butyrate!
  • 24. Performance on machine-generated names
      [Chart: percentage of names (0% to 100%) for each name source: ACD/Name 12.02, ChemBioDraw 13, Lexichem 2.1, Marvin 6.0.2. Legend: No Result, Constitutional Discrepancy, Stereochemical Discrepancy, Correctly Interpreted]
      30,000 structures randomly selected from PubChem were used as input to machine-generate names
  • 25. Performance on unique names from US patent headings
  • 26. What’s not supported
      • Parsing of generic chemical names
          • E.g. 2- or 3- alkylsubstitutedbenzofurans
      • Advanced inorganic nomenclature, e.g. coordinate bonding
      • Some natural product nomenclature
      • Advanced stereochemistry, e.g. pseudo-asymmetric stereocentres, axial stereochemistry etc.
  • 27. Usage
      • Batch conversion on the command line:
          java -jar opsin-1.5.0-jar-with-dependencies.jar -osmi input.txt output.smi
      • RESTful web service (opsin.ch.cam.ac.uk)
      • Java API:
          import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
          // Obtain the shared parser instance, then convert a name to SMILES
          NameToStructure nts = NameToStructure.getInstance();
          String chemicalName = "acetonitrile";
          String smiles = nts.parseToSmiles(chemicalName);
  • 28. Who is using OPSIN?
      • Commercial software
      • Cinfony (interface to Python)
      • Many text mining efforts
      • Workflows
      • Web services
  • 29. Conclusions
      • OPSIN combines high recall, precision and speed of execution
      • Recent work has significantly improved coverage of biochemical nomenclature
      Visit opsin.ch.cam.ac.uk to try it out and download!
  • 30. OPSIN: Taming the jungle of IUPAC chemical nomenclature
      daniel@nextmovesoftware.com
      For more information see:
      • Chemical Name to Structure: OPSIN, an Open Source Solution. J. Chem. Inf. Model., 2011, 51 (3), pp 739–753
      • Extraction of chemical structures and reactions from the literature (https://www.repository.cam.ac.uk/handle/1810/244727)
      Acknowledgements: Albina Asadulina, Rich Apodaca, Peter Corbett, Roger Sayle
      Funding
