OPSIN (Open Parser for Systematic IUPAC Nomenclature) is an open-source, freely available program for converting chemical names, especially systematic ones, to chemical structures. The software is available as a Java library, a command-line interface and a web service (opsin.ch.cam.ac.uk). OPSIN accepts names that conform to either IUPAC or CAS nomenclature and can convert them to SMILES, InChI and CML (Chemical Markup Language).
OPSIN has grown from covering only simple general organic chemical nomenclature to the point of having competent coverage of all areas of organic chemical nomenclature. One of the most recent additions is comprehensive support for the nomenclature of carbohydrates. This brings support for dialdoses, diketoses, ketoaldoses, alditols, aldonic acids, uronic acids, aldaric acids, glycosides and oligosaccharides, in both the open-chain and cyclic forms, named systematically or from trivial sugar stems, with support for modification terms such as anhydro or deoxy.
OPSIN’s support for specialised and general organic nomenclature will be demonstrated through illustrative examples and accompanying performance metrics. We focus in particular on areas of nomenclature for which support was recently added and those that are complex to implement such as fused ring nomenclature.
OPSIN: Taming the jungle of IUPAC chemical nomenclature
1. 1
OPSIN
Taming the jungle of IUPAC
chemical nomenclature
Daniel Lowe, Peter Murray-Rust, Robert C Glen
8th September 2013
Indianapolis, ACS
4-[(19S,21R,26R,27S)-19,21-dihydroxy-27-methoxy-26-methylnonacosyl]phenyl 3,6-di-O-methyl-α-D-glucopyranosyl-(1→4)-2,3-di-O-methyl-α-L-rhamnopyranosyl-(1→2)-3-O-methyl-α-L-rhamnopyranoside
2. 2
What is chemical name to structure?
(2S)-2-Aminobutan-1-ol
(2S)- = stereochemistry; 2- = locant; amino = substituent; but = alk; an = unsaturation; 1- = locant; ol = suffix
[Figure: structure with atoms numbered 1 to 4 and an NH2 group]
3. 3
Uses of chemical name to structure
• Identify documents by their chemical structures
• Assist with structure viewing
• Identify incorrect chemical names
• Extract reagent structures, hence allowing reactions to be reconstructed from text
5. 5
Parsing
• Over 4000 discrete morphemes form the program’s vocabulary (a morpheme is the smallest section of a word with meaning)
• These are grouped into 140 classes, e.g.
• unsaturator (‘ene’)
• aminoAcidEndsInIne (‘tyros’)
• simpleSubstituent (‘amino’)
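The idea of splitting a name into classified morphemes can be sketched with a toy greedy longest-match tokenizer. This is only an illustration and not OPSIN's parser: the `MorphemeSketch` class, its five-entry vocabulary and the class names `alkaneStem` and `suffix` are assumptions made for the example (only `unsaturator` and `simpleSubstituent` appear on the slide), and locants and stereochemistry are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MorphemeSketch {
    // Tiny hypothetical vocabulary: morpheme -> class name.
    static final Map<String, String> VOCAB = new LinkedHashMap<>();
    static {
        VOCAB.put("amino", "simpleSubstituent");
        VOCAB.put("but", "alkaneStem");   // class name assumed for illustration
        VOCAB.put("an", "unsaturator");
        VOCAB.put("ene", "unsaturator");
        VOCAB.put("ol", "suffix");        // class name assumed for illustration
    }

    /** Greedy longest-match tokenisation; returns null if any part is unknown. */
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < name.length()) {
            String best = null;
            for (String morpheme : VOCAB.keySet()) {
                boolean matches = name.startsWith(morpheme, pos);
                if (matches && (best == null || morpheme.length() > best.length())) {
                    best = morpheme;
                }
            }
            if (best == null) {
                return null; // no vocabulary entry matches at this position
            }
            tokens.add(best + ":" + VOCAB.get(best));
            pos += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("aminobutanol"));
    }
}
```

A greedy scan like this can fail on names that need backtracking, which is one reason a real vocabulary of thousands of morphemes calls for a proper grammar.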
12. 12
Miscellaneous nomenclature
• Groups with indeterminately positioned structural features: 1,3-xylene
• Charge and oxidation numbers: methylmercury(1+) or methylmercury(II)
• Subtractive nomenclature: 2-deoxy-ᴅ-ribose
• “per-nomenclature”: perhydroanthracene, perchlorobenzene
15. 15
Carbohydrate nomenclature (acyclic)
• Carbohydrates are defined using configurational prefixes, each of which specifies the stereochemistry of between 1 and 4 stereocentres
• Examples: ᴅ-gluco-hexose, or ᴅ-glucose (preferred); ʟ-ribo-ᴅ-manno-nonose
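The way several configurational prefixes combine can be checked with simple arithmetic: an open-chain aldose with n carbons has n − 2 stereocentres, so the prefixes of a valid name must account for exactly that many. The sketch below is a hypothetical helper (`ConfigPrefixSketch` is not part of OPSIN), written with plain `D`/`L` instead of small capitals, and it assumes aldoses only.

```java
import java.util.Map;

class ConfigPrefixSketch {
    // Number of stereocentres defined by each configurational prefix.
    static final Map<String, Integer> CENTRES = Map.ofEntries(
        Map.entry("glycero", 1),
        Map.entry("erythro", 2), Map.entry("threo", 2),
        Map.entry("arabino", 3), Map.entry("lyxo", 3),
        Map.entry("ribo", 3), Map.entry("xylo", 3),
        Map.entry("allo", 4), Map.entry("altro", 4),
        Map.entry("galacto", 4), Map.entry("gluco", 4),
        Map.entry("gulo", 4), Map.entry("ido", 4),
        Map.entry("manno", 4), Map.entry("talo", 4));

    // Chain length implied by the stem of an aldose name.
    static final Map<String, Integer> CARBONS = Map.of(
        "tetrose", 4, "pentose", 5, "hexose", 6, "heptose", 7,
        "octose", 8, "nonose", 9, "decose", 10);

    /** Sums the stereocentres covered by every prefix in a hyphen-separated name. */
    static int centresFromPrefixes(String name) {
        int total = 0;
        for (String part : name.split("-")) {
            total += CENTRES.getOrDefault(part, 0); // D, L and the stem contribute 0
        }
        return total;
    }

    /** True if the prefixes cover exactly the n - 2 stereocentres of an n-carbon aldose. */
    static boolean isConsistent(String name) {
        String stem = name.substring(name.lastIndexOf('-') + 1);
        Integer carbons = CARBONS.get(stem);
        return carbons != null && centresFromPrefixes(name) == carbons - 2;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent("L-ribo-D-manno-nonose"));
    }
}
```

For ʟ-ribo-ᴅ-manno-nonose, ribo contributes 3 and manno contributes 4, matching the 7 stereocentres of a nine-carbon aldose.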
16. 16
Carbohydrate derivatives
• These carbohydrate chains can then be algorithmically modified by suffixes
• Examples: ᴅ-glucose, ᴅ-glucitol, ᴅ-gluconic acid, ᴅ-glucaric acid
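One way to picture suffix-driven modification is as a change to the two ends of the open-chain sugar. The sketch below maps each suffix to the groups at C1 and at the last carbon; `SugarSuffixSketch` and its string-based matching are assumptions for illustration, not OPSIN's implementation, and the uronic acid row is included only because the longer suffix must win over "onic acid".

```java
class SugarSuffixSketch {
    enum Derivative {
        ALDOSE("ose", "CHO", "CH2OH"),
        ALDITOL("itol", "CH2OH", "CH2OH"),
        ALDONIC_ACID("onic acid", "COOH", "CH2OH"),
        URONIC_ACID("uronic acid", "CHO", "COOH"),
        ALDARIC_ACID("aric acid", "COOH", "COOH");

        final String suffix;
        final String firstCarbon; // group at C1
        final String lastCarbon;  // group at the terminal carbon

        Derivative(String suffix, String firstCarbon, String lastCarbon) {
            this.suffix = suffix;
            this.firstCarbon = firstCarbon;
            this.lastCarbon = lastCarbon;
        }
    }

    /** Picks the derivative whose suffix is the longest one ending the name, or null. */
    static Derivative classify(String name) {
        Derivative best = null;
        for (Derivative d : Derivative.values()) {
            boolean matches = name.endsWith(d.suffix);
            if (matches && (best == null || d.suffix.length() > best.suffix.length())) {
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Derivative d = classify("D-glucaric acid");
        System.out.println(d + ": C1=" + d.firstCarbon + ", last C=" + d.lastCarbon);
    }
}
```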
19. 19
Fused ring nomenclature
• All fused ring nomenclature is processed algorithmically, e.g. even benzofuran is constructed from benzene and furan rather than being a trivial name
• For example: benzo[b]cycloocta[jk]fluorene
[Figure: ring system of benzo[b]cycloocta[jk]fluorene with rings labelled by size: 8, 6, 6, 6, 5]
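As a first step towards algorithmic construction, a fusion name has to be separated into its attached components, their bracketed fusion letters and the parent ring system. The sketch below does only that string-level split; `FusionNameSketch` is a hypothetical helper, it handles letter-only brackets such as `[jk]` (not locant forms like `[1,2-a]`), and it builds no structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FusionNameSketch {
    // A fusion prefix (e.g. "benzo") followed by bracketed bond letters (e.g. "[b]").
    private static final Pattern PREFIX = Pattern.compile("([a-z]+)\\[([a-z]+)\\]");

    /**
     * Splits a fusion name into "prefix@letters" entries followed by the parent.
     * A name without bracketed prefixes is returned as a single parent entry.
     */
    static List<String> split(String name) {
        List<String> parts = new ArrayList<>();
        Matcher m = PREFIX.matcher(name);
        int end = 0;
        // Accept only matches that start exactly where the previous one ended.
        while (m.find() && m.start() == end) {
            parts.add(m.group(1) + "@" + m.group(2));
            end = m.end();
        }
        parts.add(name.substring(end)); // whatever remains is the parent ring system
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("benzo[b]cycloocta[jk]fluorene"));
    }
}
```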
20. 20
Fused ring nomenclature (numbering)
• Transform to an idealised grid aligned along the longest row of rings
• Apply quadrant rules, e.g. favour most rings in upper right quadrant
[Figure: candidate grid orientations of the ring system, rings labelled by size]
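The quadrant rule can be illustrated by scoring mirror images of a grid layout and keeping the one with the most rings in the upper right quadrant. This is a deliberately reduced sketch: representing rings as integer grid coordinates relative to the centre of the longest row, considering only the four mirror images, and counting only rings strictly inside the quadrant are all assumptions of the example, and the full rules include further criteria that are not modelled here.

```java
class QuadrantSketch {
    /** Counts ring centres strictly inside the upper right quadrant (x > 0 and y > 0). */
    static int upperRightCount(int[][] rings) {
        int count = 0;
        for (int[] ring : rings) {
            if (ring[0] > 0 && ring[1] > 0) {
                count++;
            }
        }
        return count;
    }

    /** Returns a mirrored copy: flipX negates x coordinates, flipY negates y coordinates. */
    static int[][] mirror(int[][] rings, boolean flipX, boolean flipY) {
        int[][] out = new int[rings.length][2];
        for (int i = 0; i < rings.length; i++) {
            out[i][0] = flipX ? -rings[i][0] : rings[i][0];
            out[i][1] = flipY ? -rings[i][1] : rings[i][1];
        }
        return out;
    }

    /** Tries the four mirror images and keeps the one with most rings in the upper right. */
    static int[][] bestOrientation(int[][] rings) {
        int[][] best = rings;
        boolean[] flips = {false, true};
        for (boolean flipX : flips) {
            for (boolean flipY : flips) {
                int[][] candidate = mirror(rings, flipX, flipY);
                if (upperRightCount(candidate) > upperRightCount(best)) {
                    best = candidate;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three rings in the horizontal row plus two rings below and to the left.
        int[][] rings = {{-2, 0}, {0, 0}, {2, 0}, {-3, -1}, {-1, -1}};
        System.out.println(upperRightCount(bestOrientation(rings)));
    }
}
```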
21. 21
Fused ring nomenclature (numbering)
• Atoms numbered in ascending order from upper rightmost ring
• Peripheral numbering rules used to choose grid layout that gives the best numbering
[Figure: ring system on the grid, rings labelled by size: 6, 6, 6, 5, 8]
22. 22
Beyond IUPAC: CAS index name un-inversion
CAS Index Name | IUPAC name
benzene, ethyl- | ethylbenzene
Disulfide, bis(2-chloroethyl) | Bis(2-chloroethyl) disulfide
Benzoic acid, 4,4'-methylenebis[2-chloro- | 4,4'-Methylenebis[2-chlorobenzoic acid]
Phosphoric acid, ethyl dimethyl ester | ethyl dimethyl phosphate
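The simplest cases of un-inversion can be sketched as a string rearrangement around the first ", " separator. The helper below is hypothetical and far cruder than what OPSIN must do: it handles only the first two rows of the table (a trailing hyphen means the prefix attaches directly, otherwise a space is inserted) and does not close open brackets or rewrite ester names.

```java
import java.util.Locale;

class CasUninversionSketch {
    /** Moves the text after the first ", " in front of the heading parent. */
    static String uninvert(String casName) {
        int comma = casName.indexOf(", ");
        if (comma < 0) {
            return casName; // not an inverted name
        }
        String parent = casName.substring(0, comma).toLowerCase(Locale.ROOT);
        String rest = casName.substring(comma + 2);
        if (rest.endsWith("-")) {
            // Substituent prefix: drop the trailing hyphen and attach to the parent.
            return rest.substring(0, rest.length() - 1) + parent;
        }
        // Otherwise treat the parent as a separate functional class word.
        return rest + " " + parent;
    }

    public static void main(String[] args) {
        System.out.println(uninvert("benzene, ethyl-"));
        System.out.println(uninvert("Disulfide, bis(2-chloroethyl)"));
    }
}
```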
23. 23
Beyond IUPAC: Correcting missing spaces
• tert-butylacetate → tert-butyl acetate
• tert-butyl-4-vinylperbenzoate: no locant, and perbenzoate has more than one non-degenerate hydrogen
• diethylcarbonate: has no substitutable hydrogen
• Ethylacetate: non-ester would be butanoate or butyrate!
24. 24
Performance on machine-generated names
[Figure: stacked bar chart, 0% to 100%, with one bar each for ACD/Name 12.02 names, ChemBioDraw 13 names, Lexichem 2.1 names and Marvin 6.0.2 names. Categories: No Result, Constitutional Discrepancy, Stereochemical Discrepancy, Correctly Interpreted]
30,000 structures randomly selected from PubChem were used as input to machine-generate names
26. 26
What’s not supported
• Parsing of generic chemical names
• E.g. 2- or 3-alkyl-substituted benzofurans
• Advanced inorganic nomenclature, e.g. coordinate bonding
• Some natural product nomenclature
• Advanced stereochemistry, e.g. pseudoasymmetric stereocentres, axial stereochemistry etc.
27. 27
Usage
Java API:
import uk.ac.cam.ch.wwmm.opsin.NameToStructure;
NameToStructure nts = NameToStructure.getInstance();
String chemicalName = "acetonitrile";
String smiles = nts.parseToSmiles(chemicalName);
Batch conversion on the command line:
java -jar opsin-1.5.0-jar-with-dependencies.jar -osmi input.txt output.smi
RESTful web service (opsin.ch.cam.ac.uk)
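For the web service, a request URL might be assembled as in the sketch below. The path pattern `https://opsin.ch.cam.ac.uk/opsin/<name>.<format>` and the `smi` extension are assumptions that are not stated on the slide and should be checked against the service's own documentation; `OpsinUrlSketch` is a hypothetical helper that only builds the string and makes no network call. It needs Java 10 or later for the `URLEncoder.encode(String, Charset)` overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class OpsinUrlSketch {
    // Assumed base path of the web service; verify before relying on it.
    static final String BASE = "https://opsin.ch.cam.ac.uk/opsin/";

    /** Builds a request URL for a chemical name and an output format extension. */
    static String buildUrl(String chemicalName, String format) {
        // URLEncoder targets form encoding, so spaces become '+';
        // convert them to %20 for use inside a URL path.
        String encoded = URLEncoder.encode(chemicalName, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return BASE + encoded + "." + format;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("ethyl acetate", "smi"));
    }
}
```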
28. 28
Who is using OPSIN?
Commercial software
Cinfony (interface to Python)
Many text mining efforts
Workflows
Web services
29. 29
Conclusions
• OPSIN combines high recall, precision and speed of execution
• Recent work has significantly improved coverage of biochemical nomenclature
Visit opsin.ch.cam.ac.uk to try it out and download!
30. 30
daniel@nextmovesoftware.com
For more information see:
Chemical Name to Structure: OPSIN, an Open Source Solution
J. Chem. Inf. Model., 2011, 51 (3), pp 739–753
Extraction of chemical structures and reactions from the literature
(https://www.repository.cam.ac.uk/handle/1810/244727)
Acknowledgements
Albina Asadulina
Rich Apodaca
Peter Corbett
Roger Sayle
Funding