Ubiquitous in published literature, chemical sketches combine molecular connection tables with additional drawing elements, free text, and embedded images. Unlike connection table formats that precisely capture chemistry for database entry, the primary purpose of a sketch format is to produce a high-quality image for conveying information to other chemists. The additional elements in a sketch may either extend or change the chemical interpretation, leading to incorrect structures if naively exported.
The variability and flexibility of sketching formats present several challenges, including:
– Non-chemical elements used to represent chemical features (e.g. arcs for a bond)
– Chemical elements used for non-chemical features (e.g. ethane for a line, cyclobutane for a table cell)
– Contracted labels that need expanding or correcting (e.g. HOOC-, O2N-, K2CO3)
– Distinction between generic labels and chemical elements (e.g. B, C, D, P, V, W, Y, Ac, Ar)
– Reagent abbreviations (e.g. HATU, COMU, THF, DMSO)
– Compound identifiers (e.g. “(II)”, “(IV)”, “(1a)”)
– Bond interpretation (e.g. wavy or dative bond used for attachment points)
– Free text overlaid on the structure (e.g. “( )n” for a link node)
– Assignment of reaction role
Being able to successfully read, interpret, and understand the chemist’s intention improves the quality and utility of the data for searching and analysis.
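The label-related challenges listed above can be illustrated with a minimal Python sketch. It is not the method used in this work: the function name `classify_label`, the category names, and the regular expression for compound identifiers are all assumptions made for illustration, and the lookup sets contain only the examples quoted in the list.

```python
import re

# Labels quoted in the text that can be read as element (or isotope) symbols
# but are also used as generic placeholders or abbreviations in sketches.
AMBIGUOUS_LABELS = {"B", "C", "D", "P", "V", "W", "Y", "Ac", "Ar"}

# Reagent abbreviations quoted in the text; a real table would be far larger.
REAGENT_ABBREVIATIONS = {"HATU", "COMU", "THF", "DMSO"}

# Assumed pattern for compound identifiers: a parenthesised Roman numeral,
# or a number with an optional lowercase letter suffix, e.g. (II) or (1a).
COMPOUND_ID = re.compile(r"^\((?:[IVXLC]+|\d+[a-z]?)\)$")


def classify_label(text):
    """Return a coarse, hypothetical category for a text label in a sketch."""
    label = text.strip()
    if COMPOUND_ID.match(label):
        return "compound_identifier"
    if label in REAGENT_ABBREVIATIONS:
        return "reagent_abbreviation"
    if label in AMBIGUOUS_LABELS:
        # Context (attached bonds, nearby definitions) is needed to decide
        # whether this is a real atom or a generic group.
        return "ambiguous_element_or_generic"
    return "unclassified"


for label in ["(II)", "(1a)", "THF", "Ar", "HOOC-"]:
    print(label, "->", classify_label(label))
```

A lookup like this only flags candidates. Contracted labels such as HOOC- fall through as unclassified here, and resolving an ambiguous label still depends on the surrounding drawing.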
Since 2001 the USPTO has redrawn chemical sketches with ChemDraw for all patent applications and grants. Totalling more than 24 million CDX files (over 10 million from grants and over 14 million from applications), this data set provides a large collection in which to identify conventions and rectify systematic problems. By identifying and resolving these problems, we have extracted a large, high-quality collection of fragments, molecules, reactions, and schemes. Generic structural features such as frequency, positional, substituent, and homology variation are also captured. The extracted chemical data is summarised and compared to that obtained through text-mining (NextMove Software’s LeadMine), image-to-structure (SureChEMBL), and direct export (SCRIPDB).
Sketch formats include ChemDraw (CDX/CDXML), ACD/ChemSketch (SK2), BioVia Draw (SKC), and MarvinSketch (MRV).