Chemical databases grow larger every year. Without investment in additional hardware or improved software, the time needed to search them grows too. As the number of pharmaceutical patents keeps rising, the chemical data associated with them is growing faster than hardware advances alone can keep up with.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information, such as protein targets and diseases, is also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
CINF 35: Structure searching for patent information: The need for speed
1. 256th ACS National Meeting, Boston, Aug 2018
Structure searching for patent information: The need for speed
John Mayfield, Noel O'Boyle, and Roger Sayle
NextMove Software
Cambridge, UK
3. Daniel M. Lowe. Extraction of chemical structures and reactions from the literature. Ph.D. Thesis, University of Cambridge, 2012
[0517] To 7-chloro-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid (Peakdale) (220 mg, 1.025 mmol) and (3,4-dimethoxyphenyl)boronic acid (187 mg, 1.025 mmol) in 1,4-dioxane (3 mL) and water (1.5 mL) was added sodium carbonate (435 mg, 4.10 mmol) and tetrakis(triphenylphosphine)palladium(0) (110 mg, 0.095 mmol). The reaction was heated in the microwave at 80° C. for 2 hours and at 100° C. for a further 2 hours. The solvent was removed and the residue was suspended in DMSO, filtered and purified by MDAP. Appropriate fractions were combined and the solvent removed to give 7-(3,4-dimethoxyphenyl)-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid (25 mg, 7%) as a yellow solid.
US 2016/16966 A1
Product Properties
7-(3,4-dimethoxyphenyl)-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid: 25 mg, 7% yield, yellow solid
Reactant Properties
7-chloro-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid: 220 mg, 1.025 mmol
(3,4-dimethoxyphenyl)boronic acid: 187 mg, 1.025 mmol
Agent Properties
1,4-dioxane: 3 mL
water: 1.5 mL
sodium carbonate: 435 mg, 4.10 mmol
tetrakis(triphenylphosphine)palladium(0): 110 mg, 0.095 mmol
DMSO
US 2016/16966 A1
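The extracted record above could be held in a small data structure along the following lines. This is only a sketch: the names `Role`, `Component` and `ExtractedReaction` are hypothetical and are not taken from the talk or from any NextMove product, and quantities are kept as plain text rather than parsed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical roles, mirroring the three headings used above. */
enum Role { REACTANT, AGENT, PRODUCT }

/** One named substance plus the quantities quoted for it (kept as text). */
final class Component {
    final Role role;
    final String name;
    final String quantities;

    Component(Role role, String name, String quantities) {
        this.role = role;
        this.name = name;
        this.quantities = quantities;
    }
}

/** A reaction extracted from a single patent paragraph. */
final class ExtractedReaction {
    final String patent;
    final String paragraph;
    final List<Component> components = new ArrayList<>();

    ExtractedReaction(String patent, String paragraph) {
        this.patent = patent;
        this.paragraph = paragraph;
    }

    ExtractedReaction add(Role role, String name, String quantities) {
        components.add(new Component(role, name, quantities));
        return this;
    }

    /** Names of all components with the given role, in insertion order. */
    List<String> names(Role role) {
        List<String> result = new ArrayList<>();
        for (Component c : components) {
            if (c.role == role) result.add(c.name);
        }
        return result;
    }

    /** The record shown above, typed in by hand. */
    static ExtractedReaction example() {
        return new ExtractedReaction("US 2016/16966 A1", "[0517]")
            .add(Role.REACTANT, "7-chloro-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid", "220 mg, 1.025 mmol")
            .add(Role.REACTANT, "(3,4-dimethoxyphenyl)boronic acid", "187 mg, 1.025 mmol")
            .add(Role.AGENT, "1,4-dioxane", "3 mL")
            .add(Role.AGENT, "water", "1.5 mL")
            .add(Role.AGENT, "sodium carbonate", "435 mg, 4.10 mmol")
            .add(Role.AGENT, "tetrakis(triphenylphosphine)palladium(0)", "110 mg, 0.095 mmol")
            .add(Role.AGENT, "DMSO", "")
            .add(Role.PRODUCT, "7-(3,4-dimethoxyphenyl)-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid", "25 mg, 7% yield, yellow solid");
    }
}
```

Calling `ExtractedReaction.example().names(Role.AGENT)` returns the five agent names in the order listed above.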
6. 252nd ACS National Meeting, Philadelphia PA, USA 25th August 2016
SKETCH PROCESSING
Re-interpretation of ChemDraw sketches
1. Correct systematic errors
2. Extract extra semantics (structure variation, reaction schemes)
3. Categorise output (is this something we can't interpret?)
[Figure: sketch US 2004/101442 C00025 shown three ways: Original Sketch, Default Interpretation (USPTO molfile), Our Interpretation]
John May, et al. Sketchy Sketches: Hiding Chemistry in Plain Sight. Seventh Joint Sheffield Conference on Cheminformatics. 2016
7. Example 26, US 09718816 B2
REACTION SCHEME SKETCHES
[Figure: reaction scheme sketch with its parts labelled Step 1, Step 2, Step 3, Step 4, etc.]
8. 252nd ACS National Meeting, Philadelphia PA, USA 25th August 2016
SKETCH CATEGORISATION
A category is assigned to each extracted sketch:
Molecule/Specific
Molecule/Generic
Reaction/Specific
Reaction/Generic
NoConnectionTable
Example sketch: C000001.CDX from US 7092578 B2, Table 1 ("Signaling adaptive-quantization matrices in JPEG using end-of-block codes")
10. 256th ACS National Meeting, Boston, Aug 2018
Mixtures and formulations (cocktails)
US 2001/2252 A1, "TOOTH WHITENING PREPARATIONS"
11. 252nd ACS National Meeting, Philadelphia PA, USA 25th August 2016
R group Tables
US 2016/0002208 A1
13. 252nd ACS National Meeting, Philadelphia PA, USA 25th August 2016
Chemical name translation
6-aminopyrimidine-2,4,5-triol
Chinese (Hanzi used for each morpheme): 6-氨基嘧啶-2,4,5-三醇 (ammonia radical pyrimidine three alcohol)
Japanese (phonetic translation to Katakana): 6-アミノピリミジン-2,4,5-トリオール (amino pyrimidine tri ol)
Korean (phonetic translation to Hangul): 6-아미노피리미딘-2,4,5-트리올 (amino pyrimidine tri ol)
[Figure: structure of 6-aminopyrimidine-2,4,5-triol]
15. 256th ACS National Meeting, Boston, Aug 2018
Rule-based text-mining SPEED
BioCreAtIvE V challenge evaluating text-mining and extraction systems.
Web service response time to annotate an abstract, evaluated for the CDR task.
Chih-Hsuan Wei et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford). 2016; 2016: baw032. PMC4799720
Efficient rule-based text-mining provides provenance for annotations and can mine the entire back-archive of US patents in ~24 hours on a single machine.
20. 256th ACS National Meeting, Boston, Aug 2018
Intelligent query box
Systematic Name Date Range Trivial Name
Yield Range Affiliation Reaction SMARTS
Disease Target Document Line Formula
SMILES InChI Author Protein Target Collection
Reaction Type (NameRxn) SMARTS Source
...and logical combinations thereof
22. 256th ACS National Meeting, Boston, Aug 2018
make/break REACTION SEARCH
Find: "7H-purine substructure product"
Find: "Synthesis of 7H-purine"
Requires fast substructure search to compute using the complement of two sets.
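One way to read the set operation above is that a pattern is "made" in a reaction when a product matches it and no reactant does. The sketch below shows that reading with `java.util.BitSet`, assuming each reaction has an integer id and that two substructure searches have already produced the hit sets. The class and method names are hypothetical, and this is not a description of how the authors' software computes it.

```java
import java.util.BitSet;

final class MakeBreakSearch {
    /**
     * Reactions in which the pattern is made: the product side matches
     * and the reactant side does not (product hits minus reactant hits).
     */
    static BitSet made(BitSet productHits, BitSet reactantHits) {
        BitSet result = (BitSet) productHits.clone(); // leave the inputs untouched
        result.andNot(reactantHits);
        return result;
    }

    /**
     * Reactions in which the pattern is broken: the reactant side matches
     * and the product side does not.
     */
    static BitSet broken(BitSet productHits, BitSet reactantHits) {
        BitSet result = (BitSet) reactantHits.clone();
        result.andNot(productHits);
        return result;
    }
}
```

Both hit sets have to be computed over the whole collection before they can be subtracted, which is why this kind of query depends on the substructure search itself being fast.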
23. 256th ACS National Meeting, Boston, Aug 2018
Cocktails: Mixtures and formulations
25. 256th ACS National Meeting, Boston, Aug 2018
ARTHOR - MOTIVATION
History in optimising search:
- R. Sayle, "1st-class SMARTS patterns", Daylight CIS, European UGM, EuroMUG 1997, Verona, Italy
- R. Sayle, "Improved SMILES Substructure Searching", Daylight CIS, European UGM, EuroMUG 2000, Cambridge, UK
- R. Sayle, "Efficient Matching of Chemical Subgraphs", 9th ICCS, Noordwijkerhout, The Netherlands, 9th June 2011
"A substructure search of indole against eMolecules (~7M at the time) takes 17 seconds" - 2014
Benchmark of 3.4K queries on 7M compounds from eMolecules
- John May and Roger Sayle, "Substructure Search Face-Off", CCNM, Cambridge, May 2015
26. 256th ACS National Meeting, Boston, Aug 2018
SUBSEARCH PERFORMANCE
Updated from: John May and Roger Sayle, Substructure Search Face-Off, Presented at CCNM, Cambridge, May 2015
https://www.slideshare.net/NextMoveSoftware/substructure-search-faceoff
[Figure: NumQueries (n), from 1 to 3341 with the 90% level marked, plotted against Time (ms) on a log scale from 1e+00 to 1e+07 (marks at 1s, 10s, 1m, 5m, 1h), one curve per tool. Tools: BioVia Direct, EPAM Bingo NoSQL, ChemAxon JCART, RDKit Cart, OpenChemLib, OB FastSearch, EPAM Bingo Cart, Sachem (16m50s). Other total times shown: 50m35s, 1h2m59s, 2h9m11s, 2h44m47s, 5h13m19s, 5h53m40s, 2d11h42m14s.]
27. 256th ACS National Meeting, Boston, Aug 2018
SUBSEARCH PERFORMANCE
[Figure: the benchmark plot from the previous slide (NumQueries (n) against Time (ms)) with three Arthor series added.]
Arthor (Brute force) 27m17s
Arthor 46s
Arthor (8 threads) 12s
28. 256th ACS National Meeting, Boston, Aug 2018
Substructure Optimisations
Ahead-of-time (AOT)
• Chemical records converted to pointer-free, memory-optimised data structure (~166 B per molecule)
• Path-based fingerprint computed and stored in inverted index
• Sensible ordering of results
Just-in-time (JIT)
• SMARTS traversal based on frequency statistics
• Atom/Bond expressions compiled and optimised using boolean algebra
• Fingerprint screening bit selection
29. 256th ACS National Meeting, Boston, Aug 2018
AOT: Storage order
Order by those most similar to the query and favour plain molecules.
[Figure: example structures CID 60795, CID 11669779, CID 11576259, CID 37888405]
31. 256th ACS National Meeting, Boston, Aug 2018
Storage order
Order by those most similar to the query and favour plain molecules.
Tanimoto can't be calculated ahead of time, but can be approximated.
Generate a hexadecimal key based on size and other properties favouring "plain" molecules and order by this.
000e000e01000a0004000065000000 CCC(C(=O)O)Oc1ccc(cc1)Cl CHEMBL23477
Key fields, in order: AtomCount, BondCount, PartCount, CarbonCount, CommonHeteroCount, AtomicNumberSum, RadicalCount, ChargeCount, IsotopeCount
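The example key can be reproduced by concatenating fixed-width hexadecimal fields. The field widths below (2, 2, 1, 2, 2, 3, 1, 1, 1 bytes) are an assumption, inferred by reading the CHEMBL23477 key back against its SMILES: 14 heavy atoms, 14 bonds, 1 part, 10 carbons, 4 hetero atoms (three O and one Cl) and an atomic number sum of 101. The class name `StorageKey` is hypothetical, and which elements count as "common" hetero atoms is not stated in the talk.

```java
final class StorageKey {
    /**
     * Builds a fixed-width hexadecimal sort key. Because every field is
     * zero-padded, comparing keys as strings matches comparing the fields
     * numerically from left to right, so smaller molecules sort first.
     * The field widths are inferred from a single example and may not
     * hold in general.
     */
    static String key(int atoms, int bonds, int parts, int carbons, int commonHetero,
                      int atomicNumberSum, int radicals, int charges, int isotopes) {
        return String.format("%04x%04x%02x%04x%04x%06x%02x%02x%02x",
                atoms, bonds, parts, carbons, commonHetero,
                atomicNumberSum, radicals, charges, isotopes);
    }
}
```

With the counts listed above, `StorageKey.key(14, 14, 1, 10, 4, 101, 0, 0, 0)` gives `000e000e01000a0004000065000000`, the key shown for CHEMBL23477.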
32. 256th ACS National Meeting, Boston, Aug 2018
JIT: Pattern Traversal
The same query can be traversed (and matched) in different orders. How much slower?
Best    BrCCCC
3.4x    CC(Br)CC
5.6x    CCCCBr

Best    n1ccc2c1cccc2
1.4x    c12c(ccn1)cccc2
2.3x    c12ccccc1ccn2
3.3x    c1cnc2ccccc12
3.3x    c12ccnc1cccc2
4.8x    c1c2ccccc2nc1
Before the query is matched it is rearranged to the "best" traversal order based on frequency statistics.
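A simple way to picture this reordering is a greedy walk that starts at the rarest query atom and then repeatedly adds the rarest atom bonded to something already visited. The sketch below implements that heuristic. It stands in for whatever statistics Arthor really uses, the class name `TraversalOrder` is hypothetical, and the frequency values in the usage note are invented for illustration.

```java
final class TraversalOrder {
    /**
     * @param frequency frequency[i] is an estimate of how common atom i's type
     *                  is in the database (lower means rarer)
     * @param adjacency adjacency[i] lists the atoms bonded to atom i
     * @return atom indices in the order they should be matched
     */
    static int[] order(double[] frequency, int[][] adjacency) {
        int n = frequency.length;
        boolean[] visited = new boolean[n];
        int[] order = new int[n];
        for (int k = 0; k < n; k++) {
            int best = -1;
            boolean bestConnected = false;
            for (int i = 0; i < n; i++) {
                if (visited[i]) continue;
                // the first atom may be anything; later atoms should touch the visited set
                boolean connected = k == 0 || touchesVisited(i, adjacency, visited);
                if (best < 0
                        || (connected && !bestConnected)
                        || (connected == bestConnected && frequency[i] < frequency[best])) {
                    best = i;
                    bestConnected = connected;
                }
            }
            visited[best] = true;
            order[k] = best;
        }
        return order;
    }

    private static boolean touchesVisited(int atom, int[][] adjacency, boolean[] visited) {
        for (int neighbour : adjacency[atom]) {
            if (visited[neighbour]) return true;
        }
        return false;
    }
}
```

For the query written as CCCCBr (atoms 0 to 3 carbon, atom 4 bromine), made-up frequencies of 0.75 for carbon and 0.005 for bromine give the order 4, 3, 2, 1, 0. That is the BrCCCC traversal listed as "Best" above.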
33. 256th ACS National Meeting, Boston, Aug 2018
SIMILARITY Optimisations
Ahead-of-time (AOT)
• Store binary fingerprints in buckets based on the cardinality of the fingerprint (the number of set bits, or pop(ulation) count)
• Stripe (or "transpose") fingerprints, reducing the memory reads for the JIT code
Just-in-time (JIT)
• Generate machine code to calculate the Tanimoto
34. 256th ACS National Meeting, Boston, Aug 2018
TANIMOTO CODE GEN
Tanimoto Calculation (Java, 64-bit words)
    // Counts the intersection and union bits word by word.
    double similarity(long[] q_fp, long[] db_fp) {
        int intersect = 0;
        int union = 0;
        for (int i = 0; i < q_fp.length; i++) {
            intersect += Long.bitCount(q_fp[i] & db_fp[i]);
            union += Long.bitCount(q_fp[i] | db_fp[i]);
        }
        return intersect / (double) union;
    }
Equivalent Tanimoto Calculating Union from Intersect
    // union = q_pop + db_pop - intersect, so only the intersection needs
    // counting when both pop counts are known in advance.
    double similarity(long[] q_fp, long[] db_fp, int q_pop, int db_pop) {
        int intersect = 0;
        for (int i = 0; i < q_fp.length; i++) {
            intersect += Long.bitCount(q_fp[i] & db_fp[i]);
        }
        return intersect / (double) (q_pop + db_pop - intersect);
    }
36. 256th ACS National Meeting, Boston, Aug 2018
TANIMOTO CODE GEN
For a given query (e.g. CHEMBL1906145) we can hard-code the fingerprint.
    int intersectChembl1906145(long[] db_fp) {
        int pop = 0;
        pop += Long.bitCount(0x0000000000000004L & db_fp[0]);
        pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
        pop += Long.bitCount(0x0000000000400020L & db_fp[2]);
        pop += Long.bitCount(0x0010000008000002L & db_fp[3]);
        pop += Long.bitCount(0x0160000000000200L & db_fp[4]);
        pop += Long.bitCount(0x00000800000a1000L & db_fp[5]);
        pop += Long.bitCount(0x1000000001580000L & db_fp[6]);
        pop += Long.bitCount(0x0800002000000000L & db_fp[7]);
        pop += Long.bitCount(0x0000000000000841L & db_fp[8]);
        pop += Long.bitCount(0x0000000006000100L & db_fp[9]);
        pop += Long.bitCount(0x0000280002002100L & db_fp[10]);
        pop += Long.bitCount(0x0100000000048000L & db_fp[11]);
        pop += Long.bitCount(0x0000002088000000L & db_fp[12]);
        pop += Long.bitCount(0x0000008000400000L & db_fp[13]);
        pop += Long.bitCount(0x0008000000000100L & db_fp[14]);
        pop += Long.bitCount(0x0000000000010180L & db_fp[15]);
        return pop;
    }
37. 256th ACS National Meeting, Boston, Aug 2018
TANIMOTO CODE GEN
bitCount on empty and singleton words (for CHEMBL1906145) can be eliminated.
    int intersectChembl1906145(long[] db_fp) {
        int pop = 0;
        pop += (db_fp[0] >> 2) & 0x1;
        // pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
        pop += Long.bitCount(0x0000000000400020L & db_fp[2]);
        pop += Long.bitCount(0x0010000008000002L & db_fp[3]);
        pop += Long.bitCount(0x0160000000000200L & db_fp[4]);
        pop += Long.bitCount(0x00000800000a1000L & db_fp[5]);
        pop += Long.bitCount(0x1000000001580000L & db_fp[6]);
        pop += Long.bitCount(0x0800002000000000L & db_fp[7]);
        pop += Long.bitCount(0x0000000000000841L & db_fp[8]);
        pop += Long.bitCount(0x0000000006000100L & db_fp[9]);
        pop += Long.bitCount(0x0000280002002100L & db_fp[10]);
        pop += Long.bitCount(0x0100000000048000L & db_fp[11]);
        pop += Long.bitCount(0x0000002088000000L & db_fp[12]);
        pop += Long.bitCount(0x0000008000400000L & db_fp[13]);
        pop += Long.bitCount(0x0008000000000100L & db_fp[14]);
        pop += Long.bitCount(0x0000000000010180L & db_fp[15]);
        return pop;
    }
38. 256th ACS National Meeting, Boston, Aug 2018
TANIMOTO CODE GEN
To optimise the remaining 64-bit words (numbered 2-15) we can derive a graph by connecting any two words that share a common bit.
[Figure: graph with one node per word, 2 to 15, and an edge between any two words that share a set bit]
39. 256th ACS National Meeting, Boston, Aug 2018
TANIMOTO CODE GEN
Colouring the graph (such that no two adjacent nodes share a colour) tells us how many pop counts we will need (the number of colours).
[Figure: the word graph, nodes 2 to 15, with each node coloured]
40. 256th ACS National Meeting, Boston, Aug 2018
TANIMOTO CODE GEN
We can combine bitCount on words of the same colour.
    int intersectChembl1906145(long[] db_fp) {
        int pop = 0;
        pop += (db_fp[0] >> 2) & 0x1;
        // pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
        pop += Long.bitCount((0x0000000000400020L & db_fp[2]) |
                             (0x00000800000a1000L & db_fp[5]) |
                             (0x0000000006000100L & db_fp[9]) |
                             (0x0010000008000002L & db_fp[3]) |
                             (0x0800002000000000L & db_fp[7]) |
                             (0x0160000000000200L & db_fp[4]));
        pop += Long.bitCount((0x1000000001580000L & db_fp[6]) |
                             (0x0000280002002100L & db_fp[10]) |
                             (0x0100000000048000L & db_fp[11]) |
                             (0x0000002088000000L & db_fp[12]) |
                             (0x0000000000000841L & db_fp[8]));
        pop += Long.bitCount((0x0000008000400000L & db_fp[13]) |
                             (0x0008000000000100L & db_fp[14]));
        pop += Long.bitCount(0x0000000000010180L & db_fp[15]);
        return pop;
    }
[Figure: the coloured word graph, nodes 2 to 15]
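The grouping step can be sketched in ordinary Java rather than generated machine code. The version below colours the words greedily: a word joins the first group whose accumulated mask shares no bit with it, which is the same as greedy colouring of the "shares a bit" graph. It then OR-s the masked words of each group and pop-counts once per group. The class name `MergedPopcount` is hypothetical, greedy colouring is an assumption (the talk does not say which colouring algorithm is used), and word 0 is taken to be the singleton `0x4` implied by `(db_fp[0] >> 2) & 0x1`.

```java
import java.util.ArrayList;
import java.util.List;

final class MergedPopcount {
    /** Query words for CHEMBL1906145 as listed above; word 0 is the singleton, word 1 is empty. */
    static final long[] QUERY = {
        0x0000000000000004L, 0x0000000000000000L, 0x0000000000400020L, 0x0010000008000002L,
        0x0160000000000200L, 0x00000800000a1000L, 0x1000000001580000L, 0x0800002000000000L,
        0x0000000000000841L, 0x0000000006000100L, 0x0000280002002100L, 0x0100000000048000L,
        0x0000002088000000L, 0x0000008000400000L, 0x0008000000000100L, 0x0000000000010180L
    };

    /** Greedy colouring: words placed in the same group have no set bit in common. */
    static List<List<Integer>> colour(long[] query) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Long> groupMasks = new ArrayList<>();
        for (int i = 0; i < query.length; i++) {
            if (query[i] == 0) continue; // empty words can never contribute
            int g = 0;
            while (g < groups.size() && (groupMasks.get(g) & query[i]) != 0) g++;
            if (g == groups.size()) {
                groups.add(new ArrayList<Integer>());
                groupMasks.add(0L);
            }
            groups.get(g).add(i);
            groupMasks.set(g, groupMasks.get(g) | query[i]);
        }
        return groups;
    }

    /** Intersection count using one bitCount per colour group. */
    static int intersect(long[] query, long[] db, List<List<Integer>> groups) {
        int pop = 0;
        for (List<Integer> group : groups) {
            long merged = 0;
            for (int i : group) merged |= query[i] & db[i]; // bits cannot collide within a group
            pop += Long.bitCount(merged);
        }
        return pop;
    }

    /** Reference implementation: one bitCount per word. */
    static int naiveIntersect(long[] query, long[] db) {
        int pop = 0;
        for (int i = 0; i < query.length; i++) pop += Long.bitCount(query[i] & db[i]);
        return pop;
    }
}
```

For this query the greedy pass needs four groups, the same number of `bitCount` calls on words 2-15 as in the hand-written version above. The exact grouping differs: word 0 is folded into the first group rather than handled as a shift-and-mask.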
41. 256th ACS National Meeting, Boston, Aug 2018
CONCLUSIONS
Speedy tools for structure searching
• Quick feedback from a search allows refinement if needed
• Enables different types of search (e.g. make/break)
Speedy tools for text-mining patents
• Assists in improvement of grammar and dictionaries
• Extract from all patents, not just a subset of IPC codes
Future Work
• Extract additional types of chemical data
• Advanced query features beyond SMARTS
42. 256th ACS National Meeting, Boston, Aug 2018
Acknowledgements
Yurii Moroz, Chemspace
Pat Walters, Relay Therapeutics
James Davidson, Vernalis
Mathew Swain, Vernalis
Daniel Lowe, Minesoft
Related Talks:
• R Sayle. Recent Advances in Chemical & Biological Search Systems: Evolution v Resolution. ICCS, May 2018
• J Mayfield. Pistachio: Search and Faceting of Large Reaction Databases. 254th ACS National Meeting, Aug 2017
• D Lowe. Sketchy sketches: Hiding chemistry in plain sight. 252nd ACS National Meeting, Aug 2016
Available at: https://www.slideshare.net/NextMoveSoftware
CINF 162: NextMove for Chemspace: Millisecond search in a database of 100 million structures. Thursday 10:25, Grand Ballroom A
CINF 170: Regioselectivity: An application of expert systems and ontologies to chemical (named) reaction analysis. Thursday 10:40, Lewis