Sketchy Sketches: Hiding Chemistry in Plain Sight


Published on

Ubiquitous in published literature, chemical sketches[1] combine molecular connection tables with additional drawing elements, free text, and embedded images. Unlike connection table formats that precisely capture chemistry for database entry, the primary purpose of a sketch format is to produce a high quality image for conveying information to other chemists. The presence of additional elements present in a sketch may either extend or change the chemical interpretation leading to incorrect structures if naively exported.

The variability and flexibility of sketching formats presents several challenges including:
– Non-chemical elements used to represent chemical features (e.g. arcs for a bond)
– Chemical elements used for non-chemical features (e.g. ethane for a line, cyclobutane for a table cell)
– Contracted labels that need expanding or correcting (e.g. HOOC-, O2N-, K2CO3)
– Distinction between generic labels and chemical elements (e.g. B, C, D, P, V, W, Y, Ac, Ar)
– Reagent abbreviations (e.g. HATU, COMU, THF, DMSO)
– Compound identifiers (e.g. “(II)”, “(IV)”, “(1a)”)
– Bond interpretation (e.g. wavy or dative bond used for attachment points)
– Free text overlaid on the structure (e.g. “( )n” for a link node)
– Assignment of reaction role

Being able to successfully read, interpret, and understand the chemist’s intention improves the quality and utility of the data for searching and analysis.

Since 2001 the USPTO has redrawn chemical sketches with ChemDraw for all patent applications and grants. Totalling more than 24+ million CDX files (10+ million from grants, 14+ million from applications) this data set provides a large collection in which to identify conventions and rectify systematic problems. By identifying and resolving these problems, we have extracted a large high-quality collection of fragments, molecules, reactions, and schemes. Generic structural features such as frequency, positional, substituent, and homology variation are also captured. The extracted chemical data is summarised and compared to that obtained through text-mining (NextMove Software’s LeadMine), image-to-structure (SureChEMBL), and direct export (SCRIPDB).

[1] Sketch formats include ChemDraw (CDX/CDXML), ACD/ChemSketch (SK2), BioVia Draw (SKC), and MarvinSketch (MRV)

Published in: Software
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Sketchy Sketches: Hiding Chemistry in Plain Sight

  1. 1. US 2015/342913 C00015 This WorkSCHEMBL17244875 / CID 118501484 V denotes cycloalkyl, heterocyclyl, aryl or heteroaryl, W denotes , heterocyclyl or heteroaryl; Generic features are recognised and captured in the output format. This allows a large scale summary of the features over U.S. Patents since 2001. Using the first sketch from the claims of each patent, 200,900 generic molecules with high/medium confidence can be extracted. Their feature distribution is: John May, Daniel Lowe and Roger Sayle NextMove Software Ltd, Cambridge, UK. NextMove Software Limited Innovation Centre (Unit 23) Cambridge Science Park Milton Road, Cambridge UK CB4 0EY Introduction We would like to thank MineSoft Ltd for funding and George Papadatos for providing bulk downloads of structures from SureChEMBL for the Dec 2015 Patent Applications. Sketchy Sketches: Hiding Chemistry in Plain Sight Sketch formats, such as ChemDraw, are often preferred for publication as they provide more freedom to the author. This freedom comes at a cost and what is displayed to the user may not coincide with the stored chemical representation. This can lead to incorrect structures if naïvely exported. More than 24 million ChemDraw files are available from the U.S. Patents, using these files a high fidelity converter/interpreter has been developed that is able to extract and associate more chemistry at higher precision. Content Categorisation Acknowledgements Evaluation Examples of Improved Interpretation Using 2,528[3] U.S. Patent Applications published in Dec 2015 the structures extracted by different methods are compared: A categorisation scheme has been developed to organise and characterise diverse sketch content. Sketches are assigned a content type (1), level of detail (2), and a confidence that the interpretation is correct (3): Substituent is assigned if the entity contains an attachment point and are commonly found in Markush claims. NoConnectionTable is assigned to non-chemical sketches and when an extracted connection table is not believed to be real (e.g. chessboardanes[1]). 1. 2. Heifets, A and Jurisica, I. SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents. Nucl. Acids. Res. 2012 3. Papadatos, G et al. SureChEMBL: a large-scale, chemically annotated patent document database. Nucl. Acids. Res. 2015 4. All U.S. patents applications in Dec 2015 with chemical complex work units. Generic is assigned when substituent, positional, or frequency variation is present. This primarily includes Markush cores but also polymers. Specific is assigned when all atom labels have been resolved. An entity with unresolved labels and no variation are Unknown, the labels are usually partial, ambiguous, or incorrect condensed formula. US 2015/370940 C00005 NoConnectionTable US 2015/0344503 C00668 Molecule/Specific/High US 2007/129372 C00001 Molecule/Generic/High US 2015/366987 C00008 Substituent/Specific/High Substituent/Generic/High US 2015/368236 C00019US 2015/378257 C00118 Molecule/Generic/High US 2015/378260 C00003 Molecule/Unknown US 2015/366987 C00008 SCHEMBL6726025 / CID 69879541 This Work Attachment Points and Punctuation Striping. Attachment points may be misrepresented as tert-butyl, iso-propyl, or methyl groups. Punctuation can be used to separate structure lists, if appended to an atom label that label may not be understood. Condensed Formula. Formulas are reinterpreted with a more accurate algorithm. Some labels previously defined will be unrecognised (Molecule/Unknown) whilst others are corrected. This WorkUS 2015/376182 C00030 Substituent Labels. Markush definitions are recognised in the patent text and used to resolve ambiguities. W for Tungsten? V for Vanadium? US 2015/376182 [Claim 1] Document Bibliography This WorkUS 2004/101442 C00025 Molfile Contents Generic Feature Survey SCRIPDB[2] and SureChEMBL[3] include structures from directly converting bundled Molfiles. SureChEMBL also includes structures obtained from raster images using image- to-structure. The raw data is used in this evaluation, SureChEMBL filters structures before indexing in the public interface. Normalisation. Output from CDX and TXT are split into individual components (e.g. splitting reactions and salts) to make it comparable to IMG and MOL which had already been split. Only components with ≥12 bonds between heavy atoms are considered. File Content. The 2,528 patents contain 136,372 ChemDraw files with the categories: Files can contain multiple entities, the numbers reported here are by file rather than entity. By entity Molecule/Specific is 106,402, Reaction/Specific is 16,388. Specific 72,520 Generic 26,743 Unknown 2,045 Specific 9,674 Generic 4,774 Unknown 698 Specific 11,647 Generic 6,653 Unknown 344 Molecule Reaction Substituent Frequency Variation Positional Variation Substituent Variation 0 20000 40000 60000 80000 100000 120000 140000 160000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Number of Features NumPatents Overlap. Structures are compared using Canonical SMILES (OEChem) of the individual components on a per document basis. For CDX only content categorised as specific molecules and reactions is counted. Whole Document Claims Section Differences. The biggest contribution to differences between the CDX, IMG and MOL are misinterpreted substituents and condensed formulae (see. left panel). Some differences are a result of human input error. US 2015/380663 C00381 Conclusion Interpreting ChemDraw files shows closer agreement with chemistry mined from the patent text than image-to-structure and much closer agreement than Molfile conversion. Reading the ChemDraw files, rather than the derived Molfiles, allows extraction of complex concepts such as reaction schemes and generic structures. The categorisation helps identify problematic structures and highlights areas to be improved. A major advantage of the ChemDraw interpretation is speed, taking 5h on a single-core to process the back catalogue of 24+ million. SCHEMBL17366912 / CID 118602343 This Work Unique Component-Document Associations Detail: Specific|Generic|Unknown Type: Molecule|Reaction|Substituent|NoConnectionTable Confidence: High|Medium|Low 1 2 3 Not Found in TXT Also Found in TXT CDX 49,119 36,829 (42.8%) IMG 49,836 35,545 (41.6%) MOL 58,169 28,926 (33.2%) Not Found in TXT Also Found in TXT CDX 10,664 800 (6.9%) MOL 11,078 796 (6.7%) IMG 11,783 689 (5.5%) CDX IMG MOL TXT 0 25000 50000 75000 100000 125000 150000 175000 200000 225000 250000 Count (Whole Document) 0 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000 Count (Claims Section) Generic Mol/Rxn Specific Mol/Rxn Substituent Molecule 101,308 Reaction 15,146 Substituent 18,644 NoConnectionTable 1,274 PATENT XML CDX MOL TIF Exported by ChemDraw Name-to-Structure (NextMove's LeadMine) Image-to-Structure (SureChEMBL,CLiDE) Molfile Reader (SCRIPDB,SureChEMBL) ChemDraw Interpreter (This Work) LeadMine CDX IMG MOL TXT Patent and Data Bundle Available from USPTO