OPSIN: Taming the jungle of IUPAC chemical nomenclature
Daniel Lowe, Peter Murray-Rust, Robert C Glen
8th September 2013

OPSIN (Open Parser for Systematic IUPAC Nomenclature) is an open source freely available program for converting chemical names, especially those that are systematic in nature, to chemical structures. The software is available as a Java library, command-line interface and as a web service (opsin.ch.cam.ac.uk). OPSIN accepts names that conform to either IUPAC or CAS nomenclature and can convert them to SMILES, InChI and CML (Chemical Markup Language).
OPSIN has grown from covering only simple general organic chemical nomenclature to having competent coverage of all areas of organic chemical nomenclature. One of the most recent additions is comprehensive support for the nomenclature of carbohydrates. This brings support for dialdoses, diketoses, ketoaldoses, alditols, aldonic acids, uronic acids, aldaric acids, glycosides and oligosaccharides, in both the open-chain and cyclic forms, named systematically or from trivial sugar stems, with support for modification terms such as anhydro or deoxy.
OPSIN’s support for specialised and general organic nomenclature will be demonstrated through illustrative examples and accompanying performance metrics. We focus in particular on areas of nomenclature for which support was recently added and those that are complex to implement such as fused ring nomenclature.

  1. OPSIN: Taming the jungle of IUPAC chemical nomenclature
     Daniel Lowe, Peter Murray-Rust, Robert C Glen
     8th September 2013, Indianapolis, ACS
     Example name shown on the slide: 4-[(19S,21R,26R,27S)-19,21-dihydroxy-27-methoxy-26-methylnonacosyl]phenyl 3,6-di-O-methyl-α-D-glucopyranosyl-(1→4)-2,3-di-O-methyl-α-L-rhamnopyranosyl-(1→2)-3-O-methyl-α-L-rhamnopyranoside
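  2. What is chemical name to structure?
     Example: (2S)-2-Aminobutan-1-ol, split into its components:
     • (2S)- : stereochemistry
     • 2- : locant
     • Amino : substituent
     • but : alk
     • an : unsaturation
     • 1- : locant
     • ol : suffix
     (Structure drawing: an NH2 group on a chain with atoms numbered 1 to 4.)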
  3. Uses of chemical name to structure
     • Identify documents by their chemical structures
     • Assist with structure viewing
     • Identify incorrect chemical names
     • Extract reagent structures, allowing reactions to be reconstructed from text
  4. (no text on this slide)
  5. Parsing
     • Over 4000 discrete morphemes form the program’s vocabulary (a morpheme is the smallest section of a word with meaning)
     • These are grouped into 140 classes, e.g.
       • unsaturator (‘ene’)
       • aminoAcidEndsInIne (‘tyros’)
       • simpleSubstituent (‘amino’)
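To make the idea of a morpheme vocabulary grouped into classes concrete, here is a minimal Java sketch of a greedy, longest-match tokenizer. It is not OPSIN's parser. The three morpheme/class pairs `ene`, `tyros` and `amino` are taken from the slide; the entry `eth` with class `alkaneStem`, the class name `MorphemeTokenizerSketch` and the method `tokenize` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy tokenizer: an illustration of a morpheme vocabulary, not OPSIN's actual parser. */
final class MorphemeTokenizerSketch {

    // Morpheme -> class name. The first three entries come from the slide.
    private static final Map<String, String> VOCABULARY = new LinkedHashMap<>();
    static {
        VOCABULARY.put("ene", "unsaturator");
        VOCABULARY.put("tyros", "aminoAcidEndsInIne");
        VOCABULARY.put("amino", "simpleSubstituent");
        VOCABULARY.put("eth", "alkaneStem"); // hypothetical entry, added for the example
    }

    /** Splits a name into "morpheme:class" tokens, always taking the longest match. */
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        int position = 0;
        while (position < name.length()) {
            String best = null;
            for (String morpheme : VOCABULARY.keySet()) {
                boolean matches = name.startsWith(morpheme, position);
                if (matches && (best == null || morpheme.length() > best.length())) {
                    best = morpheme;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException(
                        "No known morpheme at index " + position + " of '" + name + "'");
            }
            tokens.add(best + ":" + VOCABULARY.get(best));
            position += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the tokens found in "aminoethene".
        System.out.println(tokenize("aminoethene"));
    }
}
```

A greedy scan is the simplest possible strategy and fails on names where a shorter match is needed to let the rest of the name parse; a real name parser needs a grammar that can resolve such ambiguities.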
  6. Word Rule                   Example
     acetal                      Propanal dimethyl acetal
     additionCompound            Carbon tetrachloride
     acidHalideOrPseudoHalide    Cyanic chloride
     amide                       Nitrous amide
     anhydride                   Acetic anhydride
     biochemicalEster            Adenosine 5'-triphosphate
     carbonylDerivative          Propanone oxime
     divalentFunctionalGroup     Diethyl ether
     ester                       Ethyl ethanoate
     functionalClassEster        Acetic acid ethyl ester
     functionGroupAsGroup        Cyanide
     glycol                      Ethylene glycol
     glycolEther                 Ethylene glycol monomethyl ether
     hydrazide                   Phosphoric hydrazide
     monovalentFunctionalGroup   Ethyl alcohol
     multiEster                  Ethyl propyl methylphosphonate
     oxide                       Thiophene 1,1-dioxide
     polymer                     Poly(ethylene)
     simple                      Ethylbenzene
     substituent                 Chloro
  7. Supported chain nomenclature
     • Alkanes: dodectetractkiliane
     • Heteroatom hydrides: pentaphosphane
     • Heterogeneous heteroatom hydrides: disilazane
     • Trivial acids: butyric acid
  8. Supported ring nomenclature
     • Monocyclic spiro: dispiro[4.2.4.2]tetradecane
     • Hantzsch-Widman: 1,3,5-triazine
     • Fused ring: furo[3,2-b]thieno[2,3-e]pyridine
     • Ring assembly: 2,2':6',2''-terpyridyl
     • Von Baeyer: tricyclo[2.2.1.1²,⁵]octane
     • Polycyclic spiro: spiro[piperidine-4,9'-xanthene]
  9. Structural assembly nomenclature
     • Conjunctive nomenclature: benzeneethanol
     • Substitutive nomenclature: 2,4,6-trinitrotoluene
     • Additive nomenclature: methylsulfonyl
     • Multiplicative nomenclature: 4,4'-methylenedioxydibenzoic acid
     • Functional class nomenclature: ethyl alcohol
  10. Structural modifications
      • Heteroatom replacement: 1-thia-4-aza-2,6-disilacyclohexane
      • Unsaturation: hexa-1,3-dien-5-yne
      • Hydro, dehydro, indicated hydrogen and added hydrogen: 2,7-dihydro-1H-azepine
      • Functional replacement: 1-chloro-2,4-diimidotricarbonic acid
      • Suffixes including infixed suffixes: methanedithioic acid
      • Lambda convention: 2λ⁶-trisulfane
  11. Bridges and stereochemistry
      • Bridges: 4a,8a-propanoquinoline
      • E/Z stereochemistry: (Z)-2-chloro-but-2-ene
      • Relative cis/trans stereochemistry: trans-2,6-dimethyl-2,6-dihydronaphthalene
      • R/S stereochemistry: (1R,3S)-3-amino-3-methylcyclohexanol
  12. Miscellaneous nomenclature
      • Groups with indeterminately positioned structural features: 1,3-xylene
      • Charge and oxidation numbers: methylmercury(1+) or methylmercury(II)
      • Subtractive nomenclature: 2-deoxy-ᴅ-ribose
      • “per-nomenclature”: perhydroanthracene, perchlorobenzene
  13. Polymer nomenclature
      • Structure-based polymer nomenclature: poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
  14. Domain specific nomenclature
      • Steroid nomenclature: 17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
      • Amino acid: ʟ-leucinamide
      • Oligopeptide: ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline
      • Cyclic peptide: cyclo(ᴅ-alanyl-ʟ-phenylalanyl)
      • Nucleotide nomenclature: guanylyl(3'-5')uridine 3'-monophosphate
  15. Carbohydrate nomenclature (acyclic)
      • Carbohydrates are defined using configurational prefixes that each specify the stereochemistry for between 1 and 4 stereocentres
      • Examples: ᴅ-gluco-hexose or ᴅ-glucose (preferred); ʟ-ribo-ᴅ-manno-nonose
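The statement that each configurational prefix fixes between 1 and 4 stereocentres can be checked with a small lookup. The Java sketch below uses the standard IUPAC carbohydrate prefixes (glycero for one centre, erythro and threo for two, and so on) and sums the centres specified by a systematic name. The class and method names are hypothetical, plain ASCII D/L descriptors are used instead of the small capitals on the slide, and trivial names such as glucose are deliberately not handled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy lookup of carbohydrate configurational prefixes; not taken from OPSIN's code. */
final class ConfigurationalPrefixSketch {

    // Prefix -> number of stereocentres whose configuration it specifies.
    private static final Map<String, Integer> STEREOCENTRES = new HashMap<>();
    static {
        register(1, "glycero");
        register(2, "erythro", "threo");
        register(3, "arabino", "lyxo", "ribo", "xylo");
        register(4, "allo", "altro", "galacto", "gluco", "gulo", "ido", "manno", "talo");
    }

    private static void register(int count, String... prefixes) {
        for (String prefix : prefixes) {
            STEREOCENTRES.put(prefix, count);
        }
    }

    /** Sums the stereocentres specified by every configurational prefix in a hyphenated name. */
    static int countSpecifiedStereocentres(String name) {
        int total = 0;
        for (String part : name.split("-")) {
            Integer count = STEREOCENTRES.get(part.toLowerCase(Locale.ROOT));
            if (count != null) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // ribo (3) + manno (4): a nonose has 7 stereocentres in its open-chain form.
        System.out.println(countSpecifiedStereocentres("L-ribo-D-manno-nonose"));
        System.out.println(countSpecifiedStereocentres("D-gluco-hexose"));
    }
}
```

The nonose example shows why prefixes have to be combined: no single prefix covers more than four centres, so longer chains are described by stacking several of them.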
  16. Carbohydrate derivatives
      • These carbohydrate chains can then be algorithmically modified by suffixes
      • Examples: ᴅ-glucose, ᴅ-glucitol, ᴅ-glucaric acid, ᴅ-gluconic acid
  17. Carbohydrate nomenclature (cyclic)
      • Examples: ᴅ-glucose; α-ᴅ-glucopyranose; 2,7-anhydro-D-glycero-β-D-galacto-oct-2-ulopyranosonic acid
  18. Carbohydrate nomenclature (oligosaccharides)
      • β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside
      • β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose
  19. Fused ring nomenclature
      • All fused ring nomenclature is processed algorithmically, e.g. even benzofuran is constructed from benzene and furan rather than being a trivial name
      • For example: benzo[b]cycloocta[jk]fluorene (diagram with rings labelled by size: 8, 6, 6, 6, 5)
  20. Fused ring nomenclature (numbering)
      • Transform to an idealised grid aligned along the longest row of rings
      • Apply quadrant rules, e.g. favour most rings in upper right quadrant
      (Diagrams: candidate grid orientations of the ring system, rings labelled by size 8, 6, 6, 6 and 5.)
  21. Fused ring nomenclature (numbering)
      • Atoms numbered in ascending order from upper rightmost ring
      • Peripheral numbering rules used to choose grid layout that gives the best numbering
      (Diagram: the ring system with rings labelled by size 6, 6, 6, 5 and 8.)
  22. Beyond IUPAC: CAS index name un-inversion
      CAS Index Name                               IUPAC name
      benzene, ethyl-                              ethylbenzene
      Disulfide, bis(2-chloroethyl)                Bis(2-chloroethyl) disulfide
      Benzoic acid, 4,4'-methylenebis[2-chloro-    4,4'-Methylenebis[2-chlorobenzoic acid]
      Phosphoric acid, ethyl dimethyl ester        ethyl dimethyl phosphate
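The first two rows of the table above follow a simple pattern that can be sketched in a few lines of Java. This is a deliberately naive heuristic, not OPSIN's un-inversion algorithm: it splits at the first comma, then either attaches the parent directly (when the substituent part ends in a hyphen) or appends it as a separate word. The class name and method name are hypothetical, and the sketch does not handle the bracketed or ester rows of the table.

```java
/** Naive CAS index name un-inversion for the two simplest patterns; an assumption-laden sketch. */
final class CasUninversionSketch {

    /** Returns the un-inverted name, or the input unchanged if it has no "parent, rest" form. */
    static String uninvert(String casIndexName) {
        int comma = casIndexName.indexOf(", ");
        if (comma <= 0) {
            return casIndexName; // nothing to un-invert
        }
        String parent = casIndexName.substring(0, comma);
        String rest = casIndexName.substring(comma + 2);
        // CAS index names capitalise the heading parent; lower-case it when moving it to the end.
        String lowerParent = Character.toLowerCase(parent.charAt(0)) + parent.substring(1);
        if (rest.endsWith("-")) {
            // Trailing hyphen: the substituent attaches directly to the parent.
            return rest.substring(0, rest.length() - 1) + lowerParent;
        }
        // No trailing hyphen: the parent becomes a separate word (functional class style).
        return rest + " " + lowerParent;
    }

    public static void main(String[] args) {
        System.out.println(uninvert("benzene, ethyl-"));
        System.out.println(uninvert("Disulfide, bis(2-chloroethyl)"));
    }
}
```

Names such as "Benzoic acid, 4,4'-methylenebis[2-chloro-" need bracket balancing, and "Phosphoric acid, ethyl dimethyl ester" needs the acid rewritten as its ester anion name, which is why a real implementation requires chemical knowledge rather than string splitting.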
  23. Beyond IUPAC: Correcting missing spaces
      • tert-butylacetate → tert-butyl acetate
      • tert-butyl-4-vinylperbenzoate: no locant and perbenzoate has more than one non-degenerate hydrogen
      • diethylcarbonate: has no substitutable hydrogen
      • Ethylacetate: non-ester would be butanoate or butyrate!
  24. Performance on machine-generated names
      (Chart with a 0% to 100% axis; the values are not recoverable from the transcript.)
      • Name sources: ACD/Name 12.02, ChemBioDraw13, Lexichem 2.1, Marvin 6.0.2
      • Outcome categories: No Result, Constitutional Discrepancy, Stereochemical Discrepancy, Correctly Interpreted
      • 30,000 structures randomly selected from PubChem were used as input to machine-generate names
  25. Performance on unique names from US patent headings
  26. What’s not supported
      • Parsing of generic chemical names, e.g. 2- or 3- alkylsubstitutedbenzofurans
      • Advanced inorganic nomenclature, e.g. coordinate bonding
      • Some natural product nomenclature
      • Advanced stereochemistry, e.g. pseudoasymmetric stereocentres, axial stereochemistry, etc.
  27. Usage
      • RESTful web service (opsin.ch.cam.ac.uk)
      • Java API:
            import uk.ac.cam.ch.wwmm.opsin.NameToStructure;

            // Obtain the shared parser instance, then convert a name to SMILES
            NameToStructure nts = NameToStructure.getInstance();
            String chemicalName = "acetonitrile";
            String smiles = nts.parseToSmiles(chemicalName);
      • Batch conversion on the command line:
            java -jar opsin-1.5.0-jar-with-dependencies.jar -osmi input.txt output.smi
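For the web service, a request URL has to carry the chemical name safely, since names routinely contain spaces, brackets and primes. The Java sketch below only builds such a URL and does not contact the server. The path layout `/opsin/<name>.<format>`, the `https` scheme and the `smi` format suffix are assumptions made for illustration, so check the service's own documentation before relying on them. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Builds a request URL for a name-to-structure web service; the path layout is an assumption. */
final class OpsinUrlSketch {

    private static final String HOST = "opsin.ch.cam.ac.uk"; // host named on the slide

    /** Percent-encodes characters that are illegal in a URL path, such as spaces and brackets. */
    static String buildUrl(String chemicalName, String format) {
        try {
            // The multi-argument URI constructor quotes illegal path characters for us.
            URI uri = new URI("https", HOST, "/opsin/" + chemicalName + "." + format, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URL for: " + chemicalName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("ethyl acetate", "smi"));
    }
}
```

Note that a name containing a forward slash would be treated as a path separator by this approach, so such names would need extra handling.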
  28. Who is using OPSIN?
      • Commercial software
      • Cinfony (interface to Python)
      • Many text mining efforts
      • Workflows
      • Web services
  29. Conclusions
      • OPSIN combines high recall, precision and speed of execution
      • Recent improvements have significantly improved coverage of biochemical nomenclature
      Visit opsin.ch.cam.ac.uk to try it out and download!
  30. OPSIN: Taming the jungle of IUPAC chemical nomenclature
      daniel@nextmovesoftware.com
      For more information see:
      • Chemical Name to Structure: OPSIN, an Open Source Solution. J. Chem. Inf. Model., 2011, 51 (3), pp 739–753
      • Extraction of chemical structures and reactions from the literature (https://www.repository.cam.ac.uk/handle/1810/244727)
      Acknowledgements: Albina Asadulina, Rich Apodaca, Peter Corbett, Roger Sayle
      Funding
