OPSIN: Taming the Jungle of IUPAC Chemical Nomenclature
Daniel Lowe, Robert C. Glen, Peter Murray-Rust
Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, UK
daniel@nextmovesoftware.com

INTRODUCTION
• OPSIN (Open Parser for Systematic IUPAC Nomenclature)[1] converts systematic chemical names, which are abundant in the journal and patent literature, to their corresponding structures.
• OPSIN is available as:
  ◦ A RESTful web service[2] (for interactive input or multiple requests)
  ◦ A Java library (available from Bitbucket[3])
  ◦ A command-line application for batch conversion
[Figure: chemical name → OPSIN → CML, SMILES or InChI]

CONCLUSIONS
• OPSIN combines high recall, precision and speed of execution. Recent improvements have significantly extended coverage of biochemical nomenclature, although gaps still remain.
• OPSIN is widely used in text-mining workflows, including those at AstraZeneca, Digital Science and the Royal Society of Chemistry.

REFERENCES
[1] Lowe, D. M.; Corbett, P. T.; Murray-Rust, P.; Glen, R. C. J. Chem. Inf. Model. 2011, 51, 739–753.
[2] http://opsin.ch.cam.ac.uk/
[3] http://bitbucket.org/dan2097/opsin
[4] Lowe, D. M. Ph.D. Thesis, University of Cambridge, 2012.

ACKNOWLEDGEMENTS
I would like to thank Boehringer Ingelheim for funding and NextMove Software for support and encouragement in implementing carbohydrate nomenclature.

SUPPORTED AND RECENTLY ADDED NOMENCLATURE
The following are some examples of what OPSIN supports (figure labels; the example names and structures are not reproduced):
• Amino acid nomenclature
• Heteroatom replacement
• Spiro ring system
• Von Baeyer ring system
• Functional class nomenclature
• Multiplicative nomenclature
• Alkyne
• Lambda convention
• Hantzsch-Widman ring
• Functional replacement
• Conjunctive nomenclature
• Polycyclic spiro fusion
• Cyclised chain
• Heteroatom chain
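Names of the kinds listed above can be submitted to the web service[2]. The following is a minimal sketch of how a request URL might be built. The base address comes from reference [2]; the `opsin/<name>.<extension>` path layout, the extension names and the helper `build_request_url` are assumptions made for illustration and are not stated on the poster.

```python
from urllib.parse import quote

# Assumption: base address from reference [2]; the "opsin/<name>.<ext>" path
# layout and the extensions below are assumed for illustration only.
OPSIN_BASE_URL = "http://opsin.ch.cam.ac.uk/opsin/"
ASSUMED_EXTENSIONS = {"smiles": "smi", "cml": "cml", "inchi": "inchi"}


def build_request_url(name: str, output: str = "smiles") -> str:
    """Return the URL that would request `output` for a chemical name."""
    extension = ASSUMED_EXTENSIONS[output]
    # Percent-encode every reserved character (commas, brackets, spaces,
    # slashes) so that locants and punctuation survive in the URL path.
    return f"{OPSIN_BASE_URL}{quote(name, safe='')}.{extension}"


if __name__ == "__main__":
    print(build_request_url("2,4,6-trinitrotoluene"))
    print(build_request_url("ethyl acetate", output="cml"))
```

The sketch only constructs the URL; fetching it and handling names that the service cannot interpret is left to the caller.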
Further nomenclature labels from the example figures:
• Stereochemistry
• Substitution
• Indicated hydrogen
• Fused ring system
• Ester formation
• Perhalogenation
• Anhydro (elimination of water)
• Relative cis/trans stereochemistry
• Simple bridging nomenclature
• α/β stereochemistry
• Open chain carbohydrate
• Ring assembly
• CAS index name
• Nucleotide nomenclature
• Systematic carbohydrate
• Glycosidic linkage

ALGORITHMS
Name-to-structure conversion requires many algorithms[4], e.g. stereochemistry perception and Cahn-Ingold-Prelog priority determination. One example is the surprisingly complex algorithm required to assign numbering to arbitrary fused ring systems:
• Transform to an idealised grid aligned along the longest row of rings
• Apply quadrant rules, e.g. favour the most rings in the upper right quadrant
• Apply peripheral numbering rules (lower locants for fusion carbon atoms in this case)
[Figure: an example fused ring system drawn in several orientations, with ring labels 5, 6 and 8. Atoms are numbered ascending in this order from the upper rightmost ring.]

PERFORMANCE
To evaluate performance on systematic names, 30,000 compounds were randomly selected from PubChem and converted to names using four different structure-to-name programs. OPSIN 1.5.0 was used to generate InChIs from these names, which were compared with the InChIs provided by PubChem. Where the InChIs were not identical, it was determined whether the layers that define the constitution of the molecule were identical. If they were, the result was classed as a “Stereochemical Discrepancy”; if they differed, it was classed as a “Constitutional Discrepancy”.
On the first two sets, most cases of stereochemical discrepancy arise from whether amino acid names without a D/L prefix are interpreted as the L- form or as unspecified.
[Figure: chart of OPSIN results (0% to 100%) on ACD/Name 12.02 names, ChemBioDraw 13 names, Lexichem 2.1 names and Marvin 6.0.2 names. Categories: No Result, Constitutional Discrepancy, Stereochemical Discrepancy, Correctly Interpreted.]
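The classification used in the evaluation above can be sketched as follows. This is an illustration, not the evaluation code itself: it assumes that the constitution is captured by every InChI layer other than the stereochemical ones (/b, /t, /m, /s), and the function names `without_stereo_layers` and `classify` are hypothetical. The alanine example mirrors the D/L prefix case mentioned above.

```python
from typing import Optional

# Assumption: these layer prefixes carry stereochemistry; all remaining
# layers are treated here as defining the constitution.
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def without_stereo_layers(inchi: str) -> str:
    """Return the InChI with its /b, /t, /m and /s layers removed."""
    version, formula, *layers = inchi.split("/")
    kept = [layer for layer in layers
            if not layer.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join([version, formula, *kept])


def classify(reference: str, generated: Optional[str]) -> str:
    """Compare an InChI generated from a name with the reference InChI."""
    if generated is None:
        return "No Result"
    if generated == reference:
        return "Correctly Interpreted"
    if without_stereo_layers(generated) == without_stereo_layers(reference):
        return "Stereochemical Discrepancy"
    return "Constitutional Discrepancy"


if __name__ == "__main__":
    l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
    glycine = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"
    print(classify(l_alanine, alanine))  # name read without the L- prefix
    print(classify(l_alanine, glycine))
    print(classify(l_alanine, None))
```

Comparing whole strings first keeps the common case cheap; the layer-stripping step only runs when the two InChIs differ.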
