CINF 35: Structure searching for patent information: The need for speed

256th ACS National Meeting, Boston, Aug 2018
John Mayfield, Noel O’Boyle, and Roger Sayle

NextMove Software
Cambridge, UK

Data Search Algorithms

Daniel M. Lowe. Extraction of chemical structures and reactions from the literature. Ph.D. Thesis,
University of Cambridge, 2012
[0517] To 7-chloro-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid (Peakdale) (220 mg, 1.025 mmol) and (3,4-dimethoxyphenyl)boronic acid (187 mg, 1.025 mmol) in 1,4-dioxane (3 mL) and water (1.5 mL) was added sodium carbonate (435 mg, 4.10 mmol) and tetrakis(triphenylphosphine)palladium(0) (110 mg, 0.095 mmol). The reaction was heated in the microwave at 80° C. for 2 hours and at 100° C. for a further 2 hours. The solvent was removed and the residue was suspended in DMSO, filtered and purified by MDAP. Appropriate fractions were combined and the solvent removed to give 7-(3,4-dimethoxyphenyl)-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid (25 mg, 7%) as a yellow solid.
US 2016/16966 A1

Extracted from paragraph [0517] of US 2016/16966 A1:
Product Properties
    7-(3,4-dimethoxyphenyl)-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid: 25 mg, 7% yield, yellow solid
Reactant Properties
    7-chloro-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid: 220 mg, 1.025 mmol
    (3,4-dimethoxyphenyl)boronic acid: 187 mg, 1.025 mmol
Agent Properties
    1,4-dioxane: 3 mL
    water: 1.5 mL
    sodium carbonate: 435 mg, 4.10 mmol
    tetrakis(triphenylphosphine)palladium(0): 110 mg, 0.095 mmol
    DMSO

252nd ACS National Meeting, Philadelphia PA, USA 25th August 2016
SKETCH PROCESSING
Re-interpretation of ChemDraw sketches
[Figure: US 2004/101442 C00025 shown three ways: Original Sketch, Default Interpretation (USPTO molfile), Our Interpretation]

1. Correct systematic errors

2. Extract extra semantics (structure variation, reaction schemes)

3. Categorise output (is this something we can’t interpret)
John May, et al. Sketchy Sketches: Hiding Chemistry in Plain Sight. Seventh Joint Sheffield Conference on
Cheminformatics. 2016

REACTION SCHEME SKETCHES
Example 26, US 09718816 B2
[Figure: a reaction scheme sketch split into Step 1, Step 2, Step 3, Step 4, etc.]

SKETCH CATEGORISATION
A category is assigned to each extracted sketch:
Molecule/Specific
Molecule/Generic
Reaction/Specific
Reaction/Generic
NoConnectionTable
Example: C000001.CDX from US 7092578 B2, Table 1, “Signaling adaptive-quantization matrices in JPEG using end-of-block codes”

mixtures and formulations
(cocktails)
US 2001/2252 A1
“TOOTH WHITENING PREPARATIONS”

R group Tables
US 2016/0002208 A1

Chemical name translation
6-aminopyrimidine-2,4,5-triol
Chinese (Hanzi used for each morpheme)
6-氨基嘧啶-2,4,5-三醇
ammonia radical pyrimidine three alcohol
Japanese (Phonetic translation to Katakana)
6-アミノピリミジン-2,4,5-トリオール
amino pyrimidine tri ol
Korean (Phonetic translation to Hangul)
6-아미노피리미딘-2,4,5-트리올
amino pyrimidine tri ol
[Figure: structure drawing of 6-aminopyrimidine-2,4,5-triol]

EXTRACTED CHEMICAL DATA GROWTH
[Chart: cumulative number of records (0 to 20M) by year (2000 to 2018) for four series: USPTO Exemplified Compounds, USPTO Reactions, EPO Reactions, USPTO Mixtures. End-of-series labels on the chart: ~22M, ~6M, ~1M.]

Rule-based text-mining SPEED
Chih-Hsuan Wei et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford). 2016; 2016: baw032. PMC4799720
The BioCreative V challenge evaluated text-mining and extraction systems.
[Chart: web service response time to annotate an abstract, as evaluated for the CDR task]

Efficient rule-based text-mining provides provenance for annotations and can mine the entire back-archive of US patents in ~24 hours on a single machine.

Arthor Demo Video

Intelligent query box
Systematic Name, Date Range, Trivial Name, Yield Range, Affiliation, Reaction SMARTS, Disease Target, Document, Line Formula, SMILES, InChI, Author, Protein Target, Collection, Reaction Type (NameRxn), SMARTS, Source
…and logical combinations thereof

Pistachio: Reactions

make/break REACTION SEARCH
Find: “7H-purine substructure product”
Find: “Synthesis of 7H-purine”
Requires fast substructure search, because the answer is computed as the complement (difference) of two hit sets.
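One way to read the make/break idea is as set arithmetic over two substructure hit sets: a reaction "makes" the query if the query matches a product but none of the reactants, and "breaks" it in the opposite case. The sketch below assumes each reaction has an integer ID and that the two hit sets have already been produced by substructure searches; the class and method names are hypothetical and the hit sets in `main` are made-up illustration data.

```java
import java.util.BitSet;

final class MakeBreakSearch {
    // Reactions that "make" the query: it matches a product but no reactant.
    static BitSet make(BitSet productHits, BitSet reactantHits) {
        BitSet made = (BitSet) productHits.clone();
        made.andNot(reactantHits); // set difference: products \ reactants
        return made;
    }

    // Reactions that "break" the query: it matches a reactant but no product.
    static BitSet brk(BitSet productHits, BitSet reactantHits) {
        BitSet broken = (BitSet) reactantHits.clone();
        broken.andNot(productHits); // set difference: reactants \ products
        return broken;
    }

    // Helper to build a hit set from reaction IDs (illustration only).
    static BitSet bits(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) b.set(id);
        return b;
    }

    public static void main(String[] args) {
        BitSet productHits = bits(1, 2, 5, 8);  // query found in a product
        BitSet reactantHits = bits(2, 3, 8);    // query found in a reactant
        System.out.println("make:  " + make(productHits, reactantHits));
        System.out.println("break: " + brk(productHits, reactantHits));
    }
}
```

Both hit sets need a full substructure search over the reaction collection, which is why this kind of query only becomes interactive when the underlying search is fast.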

Cocktails: Mixtures and formulations

ARTHOR - MOTIVATION
History in optimising search:

– R. Sayle, “1st-class SMARTS patterns”, Daylight CIS, European UGM, EuroMUG
1997, Verona, Italy
– R. Sayle, “Improved SMILES Substructure Searching”, Daylight CIS, European
UGM, EuroMUG 2000, Cambridge, UK.
– R. Sayle, “Efficient Matching of Chemical Subgraphs”, 9th ICCS,
Noordwijkerhout, The Netherlands, 9th June 2011.
“A substructure search of indole against eMolecules (~7M at the time)
takes 17 seconds” - 2014
Benchmark of 3.4K queries on 7M compounds from eMolecules
– John May and Roger Sayle, “Substructure Search Face-Off”, CCNM,
Cambridge, May 2015

SUBSEARCH PERFORMANCE
Updated from: John May and Roger Sayle, Substructure Search Face-Off, Presented at CCNM, Cambridge, May 2015
https://www.slideshare.net/NextMoveSoftware/substructure-search-faceoff
[Chart: number of queries completed (n, up to 3341, with a 90% line) against time (ms, log scale from 1e+00 to 1e+07, with guides at 1s, 10s, 1m, 5m and 1h), one curve per tool]
Tools: BioVia Direct, EPAM Bingo NoSQL, ChemAxon JCART, RDKit Cart, OpenChemLib, OB FastSearch, EPAM Bingo Cart
Times shown for these tools: 50m35s, 1h2m59s, 2h9m11s, 2h44m47s, 5h13m19s, 5h53m40s, 2d11h42m14s
Sachem 16m50s
Arthor (Brute force) 27m17s
Arthor 46s
Arthor (8 threads) 12s

Substructure Optimisations
Ahead-of-time (AOT)
• Chemical records converted to pointer-free memory optimised
data structure (~166B per molecule)

• Path-based fingerprint computed and stored in inverted index

• Sensible ordering of results

Just-in-time (JIT)
• SMARTS traversal based on frequency statistics

• Atom/Bond expressions compiled and optimised using
boolean algebra

• Fingerprint screening bit selection
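As an illustration of how an inverted fingerprint index and screening bit selection can fit together, the sketch below keeps one posting list per fingerprint bit and intersects the posting lists of the rarest query bits first. This is a minimal sketch under stated assumptions, not Arthor's implementation: the class name, the use of `java.util.BitSet` for posting lists, and the `maxBits` cut-off are all hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

final class InvertedIndexScreen {
    // postings.get(bit) holds the IDs of all records whose fingerprint has that bit set.
    private final List<BitSet> postings = new ArrayList<>();
    private int numRecords = 0;

    InvertedIndexScreen(int numBits) {
        for (int i = 0; i < numBits; i++) postings.add(new BitSet());
    }

    // Ahead-of-time: index a record's fingerprint (bit positions must be < numBits).
    int add(BitSet fingerprint) {
        int id = numRecords++;
        for (int bit = fingerprint.nextSetBit(0); bit >= 0; bit = fingerprint.nextSetBit(bit + 1))
            postings.get(bit).set(id);
        return id;
    }

    // Just-in-time: pick up to maxBits of the rarest query bits and intersect their postings.
    // The result is a superset of the true matches and still needs atom-by-atom verification.
    BitSet candidates(BitSet queryFp, int maxBits) {
        List<Integer> bits = new ArrayList<>();
        for (int bit = queryFp.nextSetBit(0); bit >= 0; bit = queryFp.nextSetBit(bit + 1))
            bits.add(bit);
        bits.sort(Comparator.comparingInt((Integer b) -> postings.get(b).cardinality()));
        BitSet result = new BitSet(numRecords);
        result.set(0, numRecords); // start with every record as a candidate
        for (int i = 0; i < Math.min(maxBits, bits.size()); i++)
            result.and(postings.get(bits.get(i)));
        return result;
    }

    static BitSet fp(int... bits) {
        BitSet b = new BitSet();
        for (int bit : bits) b.set(bit);
        return b;
    }

    public static void main(String[] args) {
        InvertedIndexScreen index = new InvertedIndexScreen(8);
        index.add(fp(1, 2, 3));
        index.add(fp(1, 3));
        index.add(fp(2, 4));
        System.out.println(index.candidates(fp(2, 3), 8));
    }
}
```

Testing only the rarest bits trades a slightly larger candidate set for fewer posting-list reads; whether that pays off depends on the bit statistics of the database.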

AOT: Storage order
Order by those most similar to the query and favour plain molecules.
CID 60795
CID 11669779
CID 11576259
CID 37888405


Storage order

Tanimoto can’t be calculated ahead of time, but can be approximated.
Generate a hexadecimal key based on size and other properties
favouring “plain” molecules and order by this.
000e000e01000a0004000065000000 CCC(C(=O)O)Oc1ccc(cc1)Cl CHEMBL23477
Key fields, in order: AtomCount, BondCount, PartCount, CarbonCount, CommonHeteroCount, AtomicNumberSum, RadicalCount, ChargeCount, IsotopeCount
JIT: Pattern Traversal
The same query can be traversed (and matched) in different orders.
How much slower?

Best BrCCCC
3.4x CC(Br)CC
5.6x CCCCBr
Best n1ccc2c1cccc2
1.4x c12c(ccn1)cccc2
2.3x c12ccccc1ccn2
3.3x c1cnc2ccccc12
3.3x c12ccnc1cccc2
4.8x c1c2ccccc2nc1
Before the query is matched it is rearranged to the “best” traversal
order based on frequency statistics
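A rarest-first traversal can be sketched as a greedy walk over the query graph: start at the atom whose element is least common in the database, then repeatedly extend to the rarest atom bonded to what has already been visited. The frequency table below is invented for illustration, and the class and method names are hypothetical; the slides do not say which statistics or tie-breaking rules Arthor uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TraversalOrder {
    // Hypothetical relative element frequencies; real values would be measured on the database.
    static final Map<String, Double> FREQ = Map.of("C", 0.75, "O", 0.12, "N", 0.10, "Br", 0.005);

    static double freq(String symbol) {
        return FREQ.getOrDefault(symbol, 0.001); // unknown elements are treated as rare
    }

    // atoms[i] is the element symbol of query atom i; bonds are pairs of atom indices.
    static List<Integer> order(String[] atoms, int[][] bonds) {
        int n = atoms.length;
        boolean[] visited = new boolean[n];
        boolean[] frontier = new boolean[n]; // atoms bonded to an already visited atom
        List<Integer> order = new ArrayList<>();
        while (order.size() < n) {
            boolean anyFrontier = false;
            for (int i = 0; i < n; i++) anyFrontier |= frontier[i] && !visited[i];
            int best = -1;
            for (int i = 0; i < n; i++) {
                // Stay connected to the visited part when possible; otherwise start anywhere.
                if (visited[i] || (anyFrontier && !frontier[i])) continue;
                if (best < 0 || freq(atoms[i]) < freq(atoms[best])) best = i;
            }
            visited[best] = true;
            order.add(best);
            for (int[] b : bonds) {
                if (b[0] == best) frontier[b[1]] = true;
                if (b[1] == best) frontier[b[0]] = true;
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // The query written as CCCCBr: atoms 0..3 are carbon, atom 4 is bromine.
        String[] atoms = {"C", "C", "C", "C", "Br"};
        int[][] bonds = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
        System.out.println(order(atoms, bonds)); // traversal starts at the bromine
    }
}
```

Starting at the bromine corresponds to the BrCCCC ordering that the timings above mark as best: a rare first atom rejects most candidate start positions immediately.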

SIMILARITY Optimisations
Ahead-of-time (AOT)
• Store binary fingerprints in buckets keyed by the cardinality of
the fingerprint, i.e. the number of set bits: the pop(ulation) count

• Stripe (or “transpose”) fingerprints reducing the memory reads
for the JIT code

Just-in-time (JIT)
• Generate machine code to calculate the Tanimoto
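One well-known benefit of bucketing by pop count is that whole buckets can be skipped: the Tanimoto of two fingerprints with a and b set bits can never exceed min(a, b) / max(a, b), so for a threshold t only buckets with pop counts between a·t and a/t need to be read. The sketch below illustrates that bound; it is an assumption that this is how the buckets are used here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PopcountBuckets {
    // buckets.get(p) holds all fingerprints with exactly p bits set.
    private final List<List<long[]>> buckets = new ArrayList<>();

    PopcountBuckets(int numBits) {
        for (int p = 0; p <= numBits; p++) buckets.add(new ArrayList<>());
    }

    static int pop(long[] fp) {
        int n = 0;
        for (long word : fp) n += Long.bitCount(word);
        return n;
    }

    void add(long[] fp) {
        buckets.get(pop(fp)).add(fp);
    }

    // Smallest and largest database pop count that can still reach the threshold
    // (threshold must be > 0). The small epsilon only ever widens the range, which is safe.
    static int minPop(int queryPop, double threshold) {
        return (int) Math.ceil(queryPop * threshold - 1e-9);
    }

    static int maxPop(int queryPop, double threshold) {
        return (int) Math.floor(queryPop / threshold + 1e-9);
    }

    // Count fingerprints with Tanimoto >= threshold, reading only the feasible buckets.
    int countHits(long[] query, double threshold) {
        int qPop = pop(query);
        int lo = Math.max(0, minPop(qPop, threshold));
        int hi = Math.min(buckets.size() - 1, maxPop(qPop, threshold));
        int hits = 0;
        for (int dbPop = lo; dbPop <= hi; dbPop++) {
            for (long[] fp : buckets.get(dbPop)) {
                int both = 0;
                for (int i = 0; i < query.length; i++) both += Long.bitCount(query[i] & fp[i]);
                int union = qPop + dbPop - both; // the bucket already tells us dbPop
                if (union > 0 && both >= threshold * union) hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PopcountBuckets db = new PopcountBuckets(64);
        db.add(new long[] {0b1111L});
        db.add(new long[] {0b0111L});
        db.add(new long[] {0b0001L});
        System.out.println(db.countHits(new long[] {0b1111L}, 0.75));
    }
}
```

A second benefit is visible in the inner loop: because every fingerprint in a bucket has the same pop count, the union can be derived from the intersection without a second bit count.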

TANIMOTO CODE GEN
Tanimoto Calculation (Java, 64-bit words)
double similarity(long[] q_fp, long[] db_fp) {
    int intersect = 0;
    int union = 0;
    for (int i = 0; i < q_fp.length; i++) {
        intersect += Long.bitCount(q_fp[i] & db_fp[i]);
        union += Long.bitCount(q_fp[i] | db_fp[i]);
    }
    return intersect / (double) union;
}
Equivalent Tanimoto Calculating Union from Intersect
double similarity(long[] q_fp, long[] db_fp, int q_pop, int db_pop) {
    int intersect = 0;
    for (int i = 0; i < q_fp.length; i++) {
        intersect += Long.bitCount(q_fp[i] & db_fp[i]);
    }
    // |A ∪ B| = |A| + |B| - |A ∩ B|, so only the intersection needs counting
    return intersect / (double) (q_pop + db_pop - intersect);
}

TANIMOTO CODE GEN
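Intersect Function
int intersect(long[] q_fp, long[] db_fp) {
    int pop = 0;
    for (int i = 0; i < q_fp.length; i++) {
        pop += Long.bitCount(q_fp[i] & db_fp[i]);
    }
    return pop;
}
Intersect Function Unrolled
int intersect(long[] q_fp, long[] db_fp) {
    int pop = 0;
    pop += Long.bitCount(q_fp[0] & db_fp[0]);
    // ... one such line for each remaining word of the fingerprint
    return pop;
}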

TANIMOTO CODE GEN
For a given query (e.g. CHEMBL1906145) we can hard-code the fingerprint.
int intersectChembl1906145(long[] db_fp) {
    int pop = 0;
    // ... (lines for the other words omitted)
    pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
    pop += Long.bitCount(0x00000800000a1000L & db_fp[5]);
    return pop;
}

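TANIMOTO CODE GEN
bitCount on empty and singleton words (for CHEMBL1906145) can be eliminated.
int intersectChembl1906145(long[] db_fp) {
    int pop = 0;
    // singleton word: only bit 2 is set in the query, so test that bit directly
    pop += (db_fp[0] >> 2) & 0x1;
    // empty word: contributes nothing, so the line is dropped
    // pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
    pop += Long.bitCount(0x00000800000a1000L & db_fp[5]);
    return pop;
}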

TANIMOTO CODE GEN
To optimise the remaining 64-bit words (numbered 2-15) we can derive a graph by
connecting any two words that share a common bit.
[Figure: graph with one node per fingerprint word, 2 to 15]

TANIMOTO CODE GEN
Colouring the graph (such that no two adjacent words share a colour) tells us how many pop
counts we will need (the number of colours).
[Figure: the same word graph, coloured]

TANIMOTO CODE GEN
We can combine bitCount on words of the same colour:
int intersectChembl1906145(long[] db_fp) {
    int pop = 0;
    pop += (db_fp[0] >> 2) & 0x1;
    // pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
    pop += Long.bitCount((0x0000000000400020L & db_fp[2]) |
                         (0x00000800000a1000L & db_fp[5]) |
                         (0x0000000006000100L & db_fp[9]) |
                         (0x0010000008000002L & db_fp[3]) |
                         (0x0800002000000000L & db_fp[7]) |
                         (0x0160000000000200L & db_fp[4]));
    pop += Long.bitCount((0x0000280002002100L & db_fp[10]) |
                         (0x0100000000048000L & db_fp[11]) |
                         (0x0000002088000000L & db_fp[12]) |
                         (0x0000000000000841L & db_fp[8]));
    pop += Long.bitCount((0x0008000000000100L & db_fp[14]));
    return pop;
}
[Figure: the coloured word graph, nodes 2 to 15]
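The colouring step can be sketched with a simple greedy algorithm: two query words conflict if their masks share a bit, and words of one colour have pairwise disjoint masks, so their masked database words can be OR-ed together and counted with a single bitCount. This is a minimal sketch under stated assumptions: greedy colouring is one possible choice (the slides do not name the algorithm), the names are hypothetical, and the masks are kept in arrays here, whereas generated code would embed them as constants.

```java
import java.util.ArrayList;
import java.util.List;

final class ColouredIntersect {
    // Greedy colouring of the query words. A word joins the first colour class whose
    // combined mask shares no bit with it. Empty words get colour -1 and are skipped.
    static int[] colour(long[] q) {
        int[] colour = new int[q.length];
        List<Long> classMask = new ArrayList<>(); // OR of all masks in each colour class
        for (int i = 0; i < q.length; i++) {
            if (q[i] == 0) { colour[i] = -1; continue; }
            int c = 0;
            while (c < classMask.size() && (classMask.get(c) & q[i]) != 0) c++;
            if (c == classMask.size()) classMask.add(0L);
            classMask.set(c, classMask.get(c) | q[i]);
            colour[i] = c;
        }
        return colour;
    }

    static int numColours(int[] colour) {
        int max = -1;
        for (int c : colour) max = Math.max(max, c);
        return max + 1;
    }

    // Intersection pop count using one bitCount per colour instead of one per word.
    static int intersect(long[] q, long[] db, int[] colour) {
        long[] merged = new long[numColours(colour)];
        for (int i = 0; i < q.length; i++)
            if (colour[i] >= 0) merged[colour[i]] |= q[i] & db[i];
        int pop = 0;
        for (long m : merged) pop += Long.bitCount(m);
        return pop;
    }

    public static void main(String[] args) {
        long[] q = {0b0011L, 0b1100L, 0b0110L};
        long[] db = {0b0101L, 0b1010L, 0b0110L};
        int[] colour = colour(q);
        System.out.println(numColours(colour) + " colours, intersect = " + intersect(q, db, colour));
    }
}
```

Because masks within a colour are disjoint, no set bit is lost or double-counted when the words are merged, so the result equals the word-by-word sum while issuing fewer pop count instructions.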

CONCLUSIONS
Speedy tools for structure searching
• Quick feedback from a search allows refinement if needed
• Enables different types of search (e.g. make/break)
Speedy tools for text-mining patents
• Assists in improvement of grammar and dictionaries
• Extract from all patents not just a subset of IPC codes
Future Work
• Extract additional types of chemical data
• Advanced query features beyond SMARTS

Acknowledgements
Yurii Moroz, Chemspace

Pat Walters, Relay Therapeutics

James Davidson, Vernalis

Mathew Swain, Vernalis
Daniel Lowe, Minesoft
Related Talks:

• R Sayle. Recent Advances in Chemical & Biological Search Systems: Evolution v
Resolution. ICCS, May 2018

• J Mayfield, Pistachio: Search and Faceting of Large Reaction Databases. 254th ACS
National Meeting, Aug 2017

• D Lowe. Sketchy sketches: Hiding chemistry in plain sight. 252nd ACS National
Meeting, Aug 2016
Available at: https://www.slideshare.net/NextMoveSoftware
CINF 162: NextMove for Chemspace: Millisecond search in a database of 100
million structures. Thursday 10:25, Grand Ballroom A

CINF 170: Regioselectivity: An application of expert systems and ontologies to
chemical (named) reaction analysis. Thursday 10:40, Lewis
