The document discusses techniques for interpreting chemical sketches found in documents to make the embedded chemistry searchable. It describes challenges in interpreting sketches, such as ambiguous symbols and representations of attachment points. The presentation evaluates an approach to extracting structures from chemical reaction sketches, substituents, and tables of variable compounds found in patents. Over 600,000 unique structures were extracted from US patent applications, many not found through other text or structure mining methods. Limitations in interpreting more complex sketches are also outlined.
Hiding Chemistry in Plain Sight: Interpreting Chemical Sketches from Text
1. 252nd ACS National Meeting, Philadelphia PA, USA 25th August 2016
Sketchy sketches: Hiding chemistry
in plain sight
Daniel Lowe, John May and Roger Sayle
NextMove Software
Cambridge, UK
Overview
• Motivation for mining sketches
• Tricky cases when interpreting sketches
• Combining text-mining with sketch
interpretation
Motivation
• The chemical matter discussed in a document
is often critical in determining if it is relevant
• Chemical sketches are not indexed by text-
mining
• If chemical sketches can be made “chemistry
searchable” this helps with:
– Identifying relevant documents
– Prior-art searching
What input should be used?
• Image-to-structure tools
(OSRA/CLiDE/Imago etc.) work with images
– OCR introduces errors on atom labels
– Crossing bonds present difficulties
– Chemistry is often “found” in non-chemical images
• Where the sketch is available in a “computer-
readable” format, can these issues be
avoided?
Sources of ChemDraw sketches
• United States patents (2001-present)
– Over 24 million ChemDraw files!
• Journal articles (albeit in most cases not
publicly accessible)
• Theses (albeit only if the original manuscript is
made available)
Ambiguous symbols
Symbol   Naïve interpretation   Possible meaning
Ac       Actinium               acetyl
Ar       Argon                  aryl
B        Boron                  Generic label
D        Deuterium              Generic label
P        Phosphorus             Generic label
Ra       Radium                 Generic label
Rb       Rubidium               Generic label
V        Vanadium               Generic label
W        Tungsten               Generic label
Y        Yttrium                Generic label
Ambiguous symbols-cont.
• Can disambiguate with text-mining:
– E.g. “B is aryl or heteroaryl”, “B is boron”
• Can disambiguate by connectivity, e.g. is an
yttrium atom with one bond likely?
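The two disambiguation routes above (a definition found in the text, then a connectivity check) can be sketched as follows. This is only an illustrative sketch: the function name `interpret_label`, the regular expression, and the set `UNLIKELY_WITH_ONE_BOND` are assumptions made for this example and are not taken from the presentation.

```python
import re

# Element names for the ambiguous symbols listed on the previous slide.
ELEMENT_NAMES = {
    "Ac": "actinium", "Ar": "argon", "B": "boron", "D": "deuterium",
    "P": "phosphorus", "Ra": "radium", "Rb": "rubidium",
    "V": "vanadium", "W": "tungsten", "Y": "yttrium",
}

# Assumption for this sketch: these elements are implausible when drawn
# with exactly one bond in an organic sketch.
UNLIKELY_WITH_ONE_BOND = {"Ac", "Ar", "Ra", "V", "W", "Y"}


def interpret_label(symbol, bond_count, text=""):
    """Return "element" or "generic label" for an atom label in a sketch."""
    if symbol not in ELEMENT_NAMES:
        # Only the ambiguous symbols are handled here; anything else is
        # passed through as an element.
        return "element"
    # 1) Text-mining: look for a definition such as "B is aryl or heteroaryl".
    match = re.search(rf"\b{re.escape(symbol)} is (?:an? )?(\w+)", text)
    if match:
        defined_as = match.group(1).lower()
        return "element" if defined_as == ELEMENT_NAMES[symbol] else "generic label"
    # 2) Connectivity: a metal or noble gas with a single bond is suspicious.
    if bond_count == 1 and symbol in UNLIKELY_WITH_ONE_BOND:
        return "generic label"
    return "element"
```

With these assumptions, `interpret_label("B", 1, "B is aryl or heteroaryl")` gives "generic label", while `interpret_label("B", 3, "B is boron")` gives "element".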
Attachment point representation
(Below: naïve interpretation)
tert-butyl
methyl
tert-butyl
methyl
Implicit Attachment point
representation
Unlabelled
methyl
Under-valent
atom
Sketch parser needs to be given a hint that
the sketch is a substituent definition!
Formula Interpretation
Input ChemDraw 15 This work
HATU
C4F9
H3PO4
CON(cHex)2 No result
III-2 No result
Categorisation
1) Sketch Type
Molecule
Reaction
Substituent
No connection table
2) Detail
Specific
Generic
Unknown
3) Confidence in interpretation
High
Medium
Low
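One way to hold the three categorisation axes in code is sketched below. The class and field names are hypothetical; only the axis values and the slash-separated labels (such as Molecule/Specific/High) come from the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SketchType(Enum):
    MOLECULE = "Molecule"
    REACTION = "Reaction"
    SUBSTITUENT = "Substituent"
    NO_CONNECTION_TABLE = "No connection table"


class Detail(Enum):
    SPECIFIC = "Specific"
    GENERIC = "Generic"
    UNKNOWN = "Unknown"


class Confidence(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass(frozen=True)
class Category:
    """A sketch category; detail and confidence may be absent."""
    sketch_type: SketchType
    detail: Optional[Detail] = None
    confidence: Optional[Confidence] = None

    def label(self) -> str:
        # Join only the axes that are set, e.g. "Molecule/Unknown".
        parts = [self.sketch_type, self.detail, self.confidence]
        return "/".join(p.value for p in parts if p is not None)
```

Making the last two axes optional is a design choice for this sketch, so that a sketch with no connection table, or one whose detail is unknown, can still be labelled.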
Examples of categorisation
Molecule/Specific/High Substituent/Specific/High
Examples of categorisation
Molecule/Generic/Low
Examples of categorisation
Molecule/Unknown
Formula uninterpretable so can’t know for sure
whether molecule is specific or generic!
Reaction/Specific/Medium
Two reactions
extracted
Examples of categorisation
No Connection Table
Repeated group detection
Electron Localisation
Some delocalised systems don’t
yield valid SMILES, so they are
converted to a localised system
Positional variation
Naïve export:
Association of R-groups with
ring atoms captured
Evaluation
(Dec 2015 US patent applications)
Molecule
Reaction
Substituent
Comparison with other approaches
Approach                        Not found by text-mining   Also found by text-mining
This work (parsing CDX files)   49,119                     36,829 (42.8%)
Image to structure*             49,836                     35,545 (41.6%)
ChemDraw exported Mol files*    58,169                     28,926 (33.2%)
*Results courtesy of the SureChEMBL database
Exemplified compound R-group tables
Approach
• Sketches are extracted to extended SMILES capturing:
– R-group labels
– Positional variation
– Repeat groups
• USPTO table markup precisely describes how tables should be
displayed but is weak on semantics
– Heuristics used to determine which lines are the same row
– Table caption disambiguated from table column headings
– Column widths used to determine columns
– Colspans detected
• Name to structure used to interpret chemical names/formulas as
R-groups; sketches interpreted as R-groups
• Structure assembled from core and R-groups
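The final assembly step can be illustrated with a deliberately simplified sketch. It assumes the core is written as a SMILES-like string with `[R1]`-style placeholders that are not the first atom, and that each substituent is a SMILES fragment whose first atom is the attachment point. The function name `assemble`, the placeholder syntax, and the tiny `NAME_TO_FRAGMENT` lookup (standing in for a real name-to-structure tool) are all hypothetical; this is not the extended SMILES representation used in the presentation.

```python
import re

# Hypothetical stand-in for a name-to-structure step: a few table cell
# values mapped to SMILES fragments, attachment atom first.
NAME_TO_FRAGMENT = {"methyl": "C", "OMe": "OC", "F": "F", "Cl": "Cl"}


def assemble(core, r_groups):
    """Fill [R1]-style placeholders in a core with substituent fragments."""
    if re.match(r"\[R\d+\]", core):
        # A leading placeholder would bond the fragment's last atom to the
        # core rather than its first, so this sketch refuses it.
        raise ValueError("placeholder must not be the first atom of the core")

    def substitute(match):
        label = match.group(1)
        if label not in r_groups:
            raise KeyError(f"no definition for {label}")
        fragment = r_groups[label]
        if any(ch.isdigit() or ch == "%" for ch in fragment):
            # Digits could be ring closures that clash with the core's own.
            raise ValueError("fragments containing digits are not handled")
        return fragment

    return re.sub(r"\[(R\d+)\]", substitute, core)


# One table row: R1 = OMe on a phenyl core.
row = {"R1": NAME_TO_FRAGMENT["OMe"]}
anisole = assemble("c1ccc([R1])cc1", row)  # "c1ccc(OC)cc1"
```

Plain string substitution is only safe under the stated restrictions. A real pipeline would work on connection tables and would also need to handle hydrogen, positional variation and repeat groups.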
Core variation
Positional variation
Incorrect formula
Substituents defined as sketches
Current results
• 2001 to June 2016 USPTO patent applications:
– 1.96 million potential table entries detected
– 1.13 million (57.9%) converted to specific
chemical structures
– 621 thousand unique chemical structures
Novelty of results (versus other pipelines)
Data type                             Unique compounds   Not found in text/sketches   Not found in text   Not found in sketches
Exemplified compound R-group tables   621,140            529,417 (85.2%)              541,974 (87.3%)     590,889 (95.1%)
Text                                  4,759,009          0%                           0%                  2,960,937 (62.2%)
Sketches                              4,479,113          0%                           2,681,041 (59.9%)   0%
Structural identity checks performed using StdInChI
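A novelty figure of this kind reduces to a set difference on structure identifiers. The sketch below assumes each pipeline's output has already been converted to a set of StdInChI strings. The function name `novelty` and the four toy compounds are illustrative only and are unrelated to the counts reported above.

```python
def novelty(source, *others):
    """Count identifiers in `source` that appear in none of `others`.

    Returns (count, percentage of `source`), percentage to one decimal.
    """
    if not source:
        return 0, 0.0
    seen = set().union(*others)
    novel = source - seen
    return len(novel), round(100 * len(novel) / len(source), 1)


# Toy data: StdInChI strings for methane, ethanol, benzene and water.
METHANE = "InChI=1S/CH4/h1H4"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
WATER = "InChI=1S/H2O/h1H2"

from_tables = {METHANE, ETHANOL, BENZENE, WATER}
from_text = {ETHANOL}
from_sketches = {BENZENE}

not_in_text_or_sketches = novelty(from_tables, from_text, from_sketches)  # (2, 50.0)
not_in_text = novelty(from_tables, from_text)                            # (3, 75.0)
```

Comparing canonical identifiers rather than the original SMILES or Mol files means two different drawings of the same compound count as one structure.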
Heavy atom count distribution
Novelty of results (versus PubChem)
Data type                             Unique compounds   Not in PubChem      Not in PubChem (SureChEMBL)
Exemplified compound R-group tables   621,140            496,831 (80.0%)     532,166 (85.7%)
Text                                  4,759,009          564,886 (11.9%)     911,976 (19.2%)
Sketches                              4,479,113          886,991 (19.8%)     1,179,229 (26.3%)
Structural identity checks performed using StdInChI
Current limitations
• Application of variable repeat groups
• Obscure ways of depicting attachment points
• R-groups defined in terms of other R-groups
• R-groups defined elsewhere in the document
• Positional variation R-groups representing multiple groups, e.g. “3,4-diCl”
• Formulas involving substituted rings, e.g. “4-ClPh”
• “Formulas” that mix systematic names with formulas, e.g. “4-OMe-phenyl”
• Algorithmic numbering of simple ring systems (for positional variation)
• Ditto marks
Current limitations
x implicitly 1?
Which is position 8?
Nested R-group
definition
Partially defined by
this text and the table
Conclusions
• Direct interpretation of ChemDraw files can provide
precision benefits over using ChemDraw exported
Mol files or optical structure recognition approaches
• Structures from R-group tables are not handled by
existing text-mining approaches (e.g. SureChEMBL)
• Extracting structures from R-group tables is
complementary to existing approaches
Acknowledgements
• George Papadatos
• Funding provided by:
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com