CINF 35: Structure searching for patent information: The need for speed

Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.

Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information, such as protein targets and diseases, is also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.

Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.

Published in: Science

  1. 256th ACS National Meeting, Boston, Aug 2018. Structure searching for patent information: The need for speed. John Mayfield, Noel O’Boyle, and Roger Sayle. NextMove Software, Cambridge, UK
  2. Data Search Algorithms
  3. Daniel M. Lowe. Extraction of chemical structures and reactions from the literature. Ph.D. Thesis, University of Cambridge, 2012.
     To 7-chloro-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid (Peakdale) (220 mg, 1.025 mmol) and (3,4-dimethoxyphenyl)boronic acid (187 mg, 1.025 mmol) in 1,4-dioxane (3 mL) and water (1.5 mL) was added sodium carbonate (435 mg, 4.10 mmol) and tetrakis(triphenylphosphine)palladium(0) (110 mg, 0.095 mmol). The reaction was heated in the microwave at 80° C. for 2 hours and at 100° C. for a further 2 hours. The solvent was removed and the residue was suspended in DMSO, filtered and purified by MDAP. Appropriate fractions were combined and the solvent removed to give 7-(3,4-dimethoxyphenyl)-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid (25 mg, 7%) as a yellow solid.
     [0517] US 2016/16966 A1
  5. Daniel M. Lowe. Extraction of chemical structures and reactions from the literature. Ph.D. Thesis, University of Cambridge, 2012. Data extracted from paragraph [0517] of US 2016/16966 A1 (the paragraph quoted on slide 3):
     Product Properties
       • 7-(3,4-dimethoxyphenyl)-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid: 25 mg, 7% yield, yellow solid
     Reactant Properties
       • 7-chloro-4-oxo-4,5-dihydrofuro[2,3-d]pyridazine-2-carboxylic acid: 220 mg, 1.025 mmol
       • (3,4-dimethoxyphenyl)boronic acid: 187 mg, 1.025 mmol
     Agent Properties
       • 1,4-dioxane: 3 mL
       • water: 1.5 mL
       • sodium carbonate: 435 mg, 4.10 mmol
       • tetrakis(triphenylphosphine)palladium(0): 110 mg, 0.095 mmol
       • DMSO
  6. 252nd ACS National Meeting, Philadelphia PA, USA, 25th August 2016. SKETCH PROCESSING: re-interpretation of ChemDraw sketches. Example US 2004/101442 C00025, shown as Original Sketch, Default Interpretation (USPTO molfile), and Our Interpretation.
     1. Correct systematic errors
     2. Extract extra semantics (structure variation, reaction schemes)
     3. Categorise output (is this something we can’t interpret?)
     John May, et al. Sketchy Sketches: Hiding Chemistry in Plain Sight. Seventh Joint Sheffield Conference on Cheminformatics. 2016
  7. Reaction SCHEME SKETCHES. Example 26, US 09718816 B2: Step 1, Step 2, Step 3, Step 4, etc. John May, et al. Sketchy Sketches: Hiding Chemistry in Plain Sight. Seventh Joint Sheffield Conference on Cheminformatics. 2016
  8. SKETCH CATEGORISATION. A category is assigned to each extracted sketch: Molecule/Specific, Molecule/Generic, Reaction/Specific, Reaction/Generic, NoConnectionTable. Example: US 7092578 B2, Table 1 (C000001.CDX), “Signaling adaptive-quantization matrices in JPEG using end-of-block codes”
  9. SKETCH CATEGORISATION. US 7092578 B2, Table 1 (C000001.CDX), “Signaling adaptive-quantization matrices in JPEG using end-of-block codes”
  10. Mixtures and formulations (cocktails). US 2001/2252 A1, “TOOTH WHITENING PREPARATIONS”
  11. R group Tables. US 2016/0002208 A1
  12. R group Tables. US 2016/0002208 A1
  13. Chemical name translation: 6-aminopyrimidine-2,4,5-triol
       • Chinese (Hanzi used for each morpheme): 6-氨基嘧啶-2,4,5-三醇 (ammonia radical, pyrimidine, three, alcohol)
       • Japanese (phonetic translation to Katakana): 6-アミノピリミジン-2,4,5-トリオール (amino, pyrimidine, tri, ol)
       • Korean (phonetic translation to Hangul): 6-아미노피리미딘-2,4,5-트리올 (amino, pyrimidine, tri, ol)
  14. EXTRACTED CHEMICAL DATA GROWTH. [Chart: cumulative number of records (axis 0 to 20M) by year, 2000 to 2018. Series: USPTO Exemplified Compounds, USPTO Reactions, EPO Reactions, USPTO Mixtures. Annotated totals: ~22M, ~6M, ~1M.]
  15. Rule-based text-mining SPEED. BioCreAtIvE V challenge evaluating text-mining and extraction systems: web service response time to annotate an abstract, evaluated for the CDR task. Chih-Hsuan Wei et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford). 2016; 2016: baw032. PMC4799720
  16. Rule-based text-mining SPEED. Efficient rule-based text-mining provides provenance for annotations and can mine the entire back-archive of US patents in ~24 hours on a single machine.
  17. Data Search Algorithms
  19. Arthor Demo Video
  20. Intelligent query box. Accepted inputs: Systematic Name, Date Range, Trivial Name, Yield Range, Affiliation, Reaction SMARTS, Disease Target, Document, Line Formula, SMILES, InChI, Author, Protein Target, Collection, Reaction Type (NameRxn), SMARTS, Source, …and logical combinations thereof
  21. Pistachio: Reactions
  22. make/break REACTION SEARCH. Find: “7H-purine substructure product”. Find: “Synthesis of 7H-purine”. The second query requires fast substructure search, because its result is computed as the complement of two sets.
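The set arithmetic behind a make/break query can be sketched as below. This is a minimal illustration, assuming that a substructure search has already produced two sets of reaction IDs (reactions whose products match the pattern, and reactions whose reactants match it). The slides do not say how Arthor or Pistachio represent these sets; `MakeBreakSketch` and its methods are hypothetical names, and `java.util.BitSet` is only a convenient stand-in.

```java
import java.util.BitSet;

final class MakeBreakSketch {
    // Reactions that "make" a substructure: it matches a product but no reactant.
    static BitSet made(BitSet productHits, BitSet reactantHits) {
        BitSet result = (BitSet) productHits.clone();
        result.andNot(reactantHits); // set difference: products minus reactants
        return result;
    }

    // Reactions that "break" a substructure: it matches a reactant but no product.
    static BitSet broken(BitSet productHits, BitSet reactantHits) {
        return made(reactantHits, productHits);
    }

    // Helper to build a set of reaction IDs.
    static BitSet of(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) set.set(id);
        return set;
    }

    public static void main(String[] args) {
        BitSet inProducts = of(1, 2, 5, 8);  // hypothetical hit lists
        BitSet inReactants = of(2, 3, 8);
        System.out.println(made(inProducts, inReactants));   // {1, 5}
        System.out.println(broken(inProducts, inReactants)); // {3}
    }
}
```

Both hit lists have to be computed for every query, which is why the slide ties this search type to fast substructure matching.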
  23. Cocktails: Mixtures and formulations
  24. Data Search Algorithms
  25. ARTHOR - MOTIVATION. History in optimising search:
       • R. Sayle, “1st-class SMARTS patterns”, Daylight CIS, European UGM, EuroMUG 1997, Verona, Italy
       • R. Sayle, “Improved SMILES Substructure Searching”, Daylight CIS, European UGM, EuroMUG 2000, Cambridge, UK
       • R. Sayle, “Efficient Matching of Chemical Subgraphs”, 9th ICCS, Noordwijkerhout, The Netherlands, 9th June 2011
     “A substructure search of indole against eMolecules (~7M at the time) takes 17 seconds” - 2014
     Benchmark of 3.4K queries on 7M compounds from eMolecules:
       • John May and Roger Sayle, “Substructure Search Face-Off”, CCNM, Cambridge, May 2015
  26. SUBSEARCH PERFORMANCE. Updated from: John May and Roger Sayle, Substructure Search Face-Off, presented at CCNM, Cambridge, May 2015. https://www.slideshare.net/NextMoveSoftware/substructure-search-faceoff
     [Chart: number of queries completed (n, 1 to 3341, with a 90% marker) against time (ms, 1e+00 to 1e+07, with guides at 1s, 10s, 1m, 5m, 1h). Series: BioVia Direct, EPAM Bingo NoSQL, ChemAxon JCART, RDKit Cart, OpenChemLib, OB FastSearch, EPAM Bingo Cart, Sachem. Total times labelled on the chart: 50m35s, 1h2m59s, 2h9m11s, 2h44m47s, 5h13m19s, 5h53m40s, 2d11h42m14s, 16m50s.]
  27. SUBSEARCH PERFORMANCE. The same chart with Arthor added: Arthor (Brute force) 27m17s; Arthor 46s; Arthor (8 threads) 12s.
  28. Substructure Optimisations
     Ahead-of-time (AOT)
       • Chemical records converted to a pointer-free, memory-optimised data structure (~166B per molecule)
       • Path-based fingerprint computed and stored in an inverted index
       • Sensible ordering of results
     Just-in-time (JIT)
       • SMARTS traversal based on frequency statistics
       • Atom/bond expressions compiled and optimised using Boolean algebra
       • Fingerprint screening bit selection
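As a rough illustration of how an inverted fingerprint index and screening bit selection can work together, the sketch below keeps one posting list per fingerprint bit and intersects only the rarest few bits set in the query. The slides do not describe Arthor's index layout, so everything here is an assumption: the `InvertedIndexSketch` class, the use of `BitSet` posting lists, and the `maxBits` cut-off are all hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

final class InvertedIndexSketch {
    private final BitSet[] postings; // postings[bit] = ids of molecules with that bit set
    private final int numMolecules;

    // Assumes every fingerprint only uses bits below numBits.
    InvertedIndexSketch(BitSet[] fingerprints, int numBits) {
        numMolecules = fingerprints.length;
        postings = new BitSet[numBits];
        for (int b = 0; b < numBits; b++) postings[b] = new BitSet(numMolecules);
        for (int id = 0; id < numMolecules; id++) {
            BitSet fp = fingerprints[id];
            for (int b = fp.nextSetBit(0); b >= 0; b = fp.nextSetBit(b + 1)) postings[b].set(id);
        }
    }

    // Molecules that set the maxBits rarest query bits. These are only candidates:
    // each one still needs a full substructure match.
    BitSet screen(BitSet queryFp, int maxBits) {
        List<Integer> bits = new ArrayList<>();
        for (int b = queryFp.nextSetBit(0); b >= 0; b = queryFp.nextSetBit(b + 1)) bits.add(b);
        bits.sort(Comparator.comparingInt(b -> postings[b].cardinality())); // rarest first
        BitSet candidates = new BitSet(numMolecules);
        candidates.set(0, numMolecules);
        for (int i = 0; i < Math.min(maxBits, bits.size()); i++) {
            candidates.and(postings[bits.get(i)]);
        }
        return candidates;
    }

    static BitSet fp(int... bits) {
        BitSet set = new BitSet();
        for (int b : bits) set.set(b);
        return set;
    }

    public static void main(String[] args) {
        BitSet[] db = {fp(0, 1, 2), fp(1, 2), fp(2, 3), fp(1, 2, 3)}; // toy fingerprints
        InvertedIndexSketch index = new InvertedIndexSketch(db, 4);
        System.out.println(index.screen(fp(2, 3), 1)); // {2, 3}
    }
}
```

In the toy data, bit 3 is rarer than bit 2, so checking bit 3 alone already narrows four molecules down to two.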
  29. AOT: Storage order. Order by those most similar to the query and favour plain molecules. Examples: CID 60795, CID 11669779, CID 11576259, CID 37888405
  30. AOT: Storage order. Order by those most similar to the query and favour plain molecules. CID 60795, CID 11669779, CID 11576259, CID 37888405 (listed twice on the slide)
  31. Storage order. Order by those most similar to the query and favour plain molecules. Tanimoto can’t be calculated ahead of time, but can be approximated: generate a hexadecimal key based on size and other properties favouring “plain” molecules and order by this.
     Example: CHEMBL23477, CCC(C(=O)O)Oc1ccc(cc1)Cl, key 000e000e01000a0004000065000000
     Key fields, in order: AtomCount, BondCount, PartCount, CarbonCount, CommonHeteroCount, AtomicNumberSum, RadicalCount, ChargeCount, IsotopeCount
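The key on this slide can be reproduced with simple fixed-width hexadecimal formatting. The sketch below is an assumption-laden reading of the single example: the field widths (4, 4, 2, 4, 4, 6, 2, 2, 2 hex digits) are inferred from the 30-digit key for CHEMBL23477, and treating chlorine as a "common" heteroatom is inferred from the count of 4. The class and method names are hypothetical, and the counts are passed in directly rather than derived from a parsed structure.

```java
final class StorageOrderKey {
    // Concatenates the counts as zero-padded hex so that plain string order
    // sorts small, "plain" molecules first. Field widths are inferred from
    // the one example on the slide, not from any published specification.
    static String key(int atoms, int bonds, int parts, int carbons, int commonHetero,
                      int atomicNumberSum, int radicals, int charges, int isotopes) {
        return String.format("%04x%04x%02x%04x%04x%06x%02x%02x%02x",
                atoms, bonds, parts, carbons, commonHetero,
                atomicNumberSum, radicals, charges, isotopes);
    }

    public static void main(String[] args) {
        // CCC(C(=O)O)Oc1ccc(cc1)Cl: 14 heavy atoms, 14 bonds, 1 part, 10 carbons,
        // 4 heteroatoms (3 O + 1 Cl), atomic number sum 10*6 + 3*8 + 17 = 101,
        // no radicals, charges or isotopes.
        System.out.println(key(14, 14, 1, 10, 4, 101, 0, 0, 0));
        // prints 000e000e01000a0004000065000000
    }
}
```

Because every field is zero-padded to a fixed width, comparing keys as strings is the same as comparing the fields in order, which makes the key usable directly as a sort key.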
  32. JIT: Pattern Traversal. The same query can be traversed (and matched) in different orders. How much slower?
     Bromobutane pattern:
       • Best: BrCCCC
       • 3.4x: CC(Br)CC
       • 5.6x: CCCCBr
     Indole pattern:
       • Best: n1ccc2c1cccc2
       • 1.4x: c12c(ccn1)cccc2
       • 2.3x: c12ccccc1ccn2
       • 3.3x: c1cnc2ccccc12
       • 3.3x: c12ccnc1cccc2
       • 4.8x: c1c2ccccc2nc1
     Before the query is matched, it is rearranged to the “best” traversal order based on frequency statistics.
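One simple way to pick a traversal order from frequency statistics is a greedy walk: start at the rarest atom, then keep extending to the rarest atom bonded to the part already visited. The sketch below illustrates that idea only. The slides do not specify which statistics Arthor uses or how it chooses among orders, so the per-element frequencies, the `TraversalOrderSketch` class, and the greedy rule are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TraversalOrderSketch {
    // Returns atom indices in visiting order. Ties go to the lower index.
    // Assumes a connected pattern and a frequency entry for every element.
    static List<Integer> order(String[] elements, int[][] bonds, Map<String, Double> freq) {
        int n = elements.length;
        boolean[] visited = new boolean[n];
        List<Integer> result = new ArrayList<>();
        while (result.size() < n) {
            int best = -1;
            for (int i = 0; i < n; i++) {
                if (visited[i]) continue;
                // After the first atom, only atoms bonded to a visited atom are eligible.
                if (!result.isEmpty() && !touchesVisited(i, bonds, visited)) continue;
                if (best < 0 || freq.get(elements[i]) < freq.get(elements[best])) best = i;
            }
            if (best < 0) break; // disconnected pattern: not handled in this sketch
            visited[best] = true;
            result.add(best);
        }
        return result;
    }

    static boolean touchesVisited(int atom, int[][] bonds, boolean[] visited) {
        for (int[] b : bonds) {
            if (b[0] == atom && visited[b[1]]) return true;
            if (b[1] == atom && visited[b[0]]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // CCCCBr written carbon-first; the frequencies are made-up illustrative values.
        String[] elements = {"C", "C", "C", "C", "Br"};
        int[][] bonds = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
        Map<String, Double> freq = Map.of("C", 0.75, "Br", 0.002);
        System.out.println(order(elements, bonds, freq)); // [4, 3, 2, 1, 0]
    }
}
```

For CCCCBr the walk starts at bromine, which corresponds to the BrCCCC order the slide reports as best: a rare first atom rejects most candidate starting positions immediately.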
  33. SIMILARITY Optimisations
     Ahead-of-time (AOT)
       • Store binary fingerprints in buckets based on the cardinality of the fingerprint, i.e. the number of set bits: pop(ulation) count
       • Stripe (or “transpose”) fingerprints, reducing the memory reads for the JIT code
     Just-in-time (JIT)
       • Generate machine code to calculate the Tanimoto
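One well-known benefit of bucketing by pop count is that whole buckets can be skipped for a thresholded search. Since the intersection can be at most the smaller of the two pop counts, the Tanimoto of a query with pop count q and a database fingerprint with pop count d is at most min(q, d) / max(q, d), so only buckets with q·t ≤ d ≤ q/t can reach a threshold t. The slides do not state that Arthor uses the buckets this way, so the sketch below is an assumption, and `PopCountBuckets` is a hypothetical name.

```java
final class PopCountBuckets {
    // Upper bound on the Tanimoto of two fingerprints given only their pop counts.
    static double upperBound(int queryPop, int dbPop) {
        if (queryPop == 0 || dbPop == 0) return 0.0;
        return Math.min(queryPop, dbPop) / (double) Math.max(queryPop, dbPop);
    }

    // Inclusive range {lo, hi} of database pop counts that can still reach the threshold.
    // Assumes 0 < threshold <= 1; maxPop is the fingerprint length in bits.
    static int[] bucketRange(int queryPop, double threshold, int maxPop) {
        int lo = (int) Math.ceil(queryPop * threshold);
        int hi = (int) Math.min(maxPop, Math.floor(queryPop / threshold));
        return new int[] {lo, hi};
    }

    public static void main(String[] args) {
        int[] range = bucketRange(40, 0.5, 1024);
        System.out.println(range[0] + " to " + range[1]); // 20 to 80
    }
}
```

For a query with 40 set bits and a threshold of 0.5, only the buckets for pop counts 20 to 80 need to be scanned.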
  34. TANIMOTO CODE GEN
     Tanimoto Calculation (Java, 64-bit words):

         double similarity(long[] q_fp, long[] db_fp) {
             int intersect = 0;
             int union = 0;
             for (int i = 0; i < q_fp.length; i++) {
                 intersect += Long.bitCount(q_fp[i] & db_fp[i]);
                 union += Long.bitCount(q_fp[i] | db_fp[i]);
             }
             return intersect / (double) union;
         }

     Equivalent Tanimoto Calculating Union from Intersect:

         double similarity(long[] q_fp, long[] db_fp, int q_pop, int db_pop) {
             int intersect = 0;
             for (int i = 0; i < q_fp.length; i++) {
                 intersect += Long.bitCount(q_fp[i] & db_fp[i]);
             }
             // |A or B| = |A| + |B| - |A and B|
             return intersect / (double) (q_pop + db_pop - intersect);
         }
  35. TANIMOTO CODE GEN
     Intersect Function:

         int intersect(long[] q_fp, long[] db_fp) {
             int pop = 0;
             for (int i = 0; i < q_fp.length; i++) {
                 pop += Long.bitCount(q_fp[i] & db_fp[i]);
             }
             return pop;
         }

     Intersect Function Unrolled (1024-bit fingerprint, sixteen 64-bit words):

         int intersect(long[] q_fp, long[] db_fp) {
             int pop = 0;
             pop += Long.bitCount(q_fp[0] & db_fp[0]);
             pop += Long.bitCount(q_fp[1] & db_fp[1]);
             pop += Long.bitCount(q_fp[2] & db_fp[2]);
             pop += Long.bitCount(q_fp[3] & db_fp[3]);
             pop += Long.bitCount(q_fp[4] & db_fp[4]);
             pop += Long.bitCount(q_fp[5] & db_fp[5]);
             pop += Long.bitCount(q_fp[6] & db_fp[6]);
             pop += Long.bitCount(q_fp[7] & db_fp[7]);
             pop += Long.bitCount(q_fp[8] & db_fp[8]);
             pop += Long.bitCount(q_fp[9] & db_fp[9]);
             pop += Long.bitCount(q_fp[10] & db_fp[10]);
             pop += Long.bitCount(q_fp[11] & db_fp[11]);
             pop += Long.bitCount(q_fp[12] & db_fp[12]);
             pop += Long.bitCount(q_fp[13] & db_fp[13]);
             pop += Long.bitCount(q_fp[14] & db_fp[14]);
             pop += Long.bitCount(q_fp[15] & db_fp[15]);
             return pop;
         }
  36. TANIMOTO CODE GEN. For a given query (e.g. CHEMBL1906145) we can hard code the fingerprint.

         int intersectChembl1906145(long[] db_fp) {
             int pop = 0;
             // word 0 has a single set bit (bit 2), as used on the next slide
             pop += Long.bitCount(0x0000000000000004L & db_fp[0]);
             pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
             pop += Long.bitCount(0x0000000000400020L & db_fp[2]);
             pop += Long.bitCount(0x0010000008000002L & db_fp[3]);
             pop += Long.bitCount(0x0160000000000200L & db_fp[4]);
             pop += Long.bitCount(0x00000800000a1000L & db_fp[5]);
             pop += Long.bitCount(0x1000000001580000L & db_fp[6]);
             pop += Long.bitCount(0x0800002000000000L & db_fp[7]);
             pop += Long.bitCount(0x0000000000000841L & db_fp[8]);
             pop += Long.bitCount(0x0000000006000100L & db_fp[9]);
             pop += Long.bitCount(0x0000280002002100L & db_fp[10]);
             pop += Long.bitCount(0x0100000000048000L & db_fp[11]);
             pop += Long.bitCount(0x0000002088000000L & db_fp[12]);
             pop += Long.bitCount(0x0000008000400000L & db_fp[13]);
             pop += Long.bitCount(0x0008000000000100L & db_fp[14]);
             pop += Long.bitCount(0x0000000000010180L & db_fp[15]);
             return pop;
         }
  37. TANIMOTO CODE GEN. bitCount on empty and singleton words (for CHEMBL1906145) can be eliminated.

         int intersectChembl1906145(long[] db_fp) {
             int pop = 0;
             pop += (db_fp[0] >> 2) & 0x1; // singleton word: test the one bit directly
             // pop += Long.bitCount(0x0000000000000000L & db_fp[1]); // empty word: always 0
             pop += Long.bitCount(0x0000000000400020L & db_fp[2]);
             pop += Long.bitCount(0x0010000008000002L & db_fp[3]);
             pop += Long.bitCount(0x0160000000000200L & db_fp[4]);
             pop += Long.bitCount(0x00000800000a1000L & db_fp[5]);
             pop += Long.bitCount(0x1000000001580000L & db_fp[6]);
             pop += Long.bitCount(0x0800002000000000L & db_fp[7]);
             pop += Long.bitCount(0x0000000000000841L & db_fp[8]);
             pop += Long.bitCount(0x0000000006000100L & db_fp[9]);
             pop += Long.bitCount(0x0000280002002100L & db_fp[10]);
             pop += Long.bitCount(0x0100000000048000L & db_fp[11]);
             pop += Long.bitCount(0x0000002088000000L & db_fp[12]);
             pop += Long.bitCount(0x0000008000400000L & db_fp[13]);
             pop += Long.bitCount(0x0008000000000100L & db_fp[14]);
             pop += Long.bitCount(0x0000000000010180L & db_fp[15]);
             return pop;
         }
  38. TANIMOTO CODE GEN. To optimise the remaining 64-bit words (numbered 2-15) we can derive a graph by connecting any two words that share a common bit. [Figure: graph whose nodes are words 2 to 15.]
  39. TANIMOTO CODE GEN. Colouring the graph (such that no two adjacent words have the same colour) tells us how many pop counts we will need (the number of colours). [Figure: the same graph, coloured.]
  40. TANIMOTO CODE GEN. We can combine bitCount on words of the same colour.

         int intersectChembl1906145(long[] db_fp) {
             int pop = 0;
             pop += (db_fp[0] >> 2) & 0x1;
             // pop += Long.bitCount(0x0000000000000000L & db_fp[1]);
             pop += Long.bitCount((0x0000000000400020L & db_fp[2]) |
                                  (0x00000800000a1000L & db_fp[5]) |
                                  (0x0000000006000100L & db_fp[9]) |
                                  (0x0010000008000002L & db_fp[3]) |
                                  (0x0800002000000000L & db_fp[7]) |
                                  (0x0160000000000200L & db_fp[4]));
             pop += Long.bitCount((0x1000000001580000L & db_fp[6]) |
                                  (0x0000280002002100L & db_fp[10]) |
                                  (0x0100000000048000L & db_fp[11]) |
                                  (0x0000002088000000L & db_fp[12]) |
                                  (0x0000000000000841L & db_fp[8]));
             pop += Long.bitCount((0x0000008000400000L & db_fp[13]) |
                                  (0x0008000000000100L & db_fp[14]));
             pop += Long.bitCount(0x0000000000010180L & db_fp[15]);
             return pop;
         }
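The grouping step can be sketched with a greedy colouring: each query word joins the first group whose accumulated bits it does not overlap, and each group then costs one bitCount. This is only one possible colouring strategy; the slides do not say which algorithm Arthor uses, and a greedy pass may produce a different grouping from the one shown on the slide. The class and method names are hypothetical, and for brevity singleton words are grouped like any other word rather than special-cased.

```java
import java.util.ArrayList;
import java.util.List;

final class WordColouringSketch {
    // Query words as shown on the slides; word 0 is the singleton (bit 2).
    static final long[] CHEMBL1906145 = {
        0x0000000000000004L, 0x0000000000000000L, 0x0000000000400020L, 0x0010000008000002L,
        0x0160000000000200L, 0x00000800000a1000L, 0x1000000001580000L, 0x0800002000000000L,
        0x0000000000000841L, 0x0000000006000100L, 0x0000280002002100L, 0x0100000000048000L,
        0x0000002088000000L, 0x0000008000400000L, 0x0008000000000100L, 0x0000000000010180L};

    // Greedily groups word indices so that no two words in a group share a set bit.
    static List<List<Integer>> colour(long[] query) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Long> used = new ArrayList<>(); // union of query bits in each group
        for (int i = 0; i < query.length; i++) {
            if (query[i] == 0) continue; // empty words contribute nothing
            int g = 0;
            while (g < groups.size() && (used.get(g) & query[i]) != 0) g++;
            if (g == groups.size()) {
                groups.add(new ArrayList<>());
                used.add(0L);
            }
            groups.get(g).add(i);
            used.set(g, used.get(g) | query[i]);
        }
        return groups;
    }

    // One bitCount per group: the masked words in a group are disjoint, so OR loses no bits.
    static int intersect(long[] query, long[] db, List<List<Integer>> groups) {
        int pop = 0;
        for (List<Integer> group : groups) {
            long merged = 0;
            for (int i : group) merged |= query[i] & db[i];
            pop += Long.bitCount(merged);
        }
        return pop;
    }

    // Reference implementation: one bitCount per word.
    static int naive(long[] query, long[] db) {
        int pop = 0;
        for (int i = 0; i < query.length; i++) pop += Long.bitCount(query[i] & db[i]);
        return pop;
    }

    public static void main(String[] args) {
        List<List<Integer>> groups = colour(CHEMBL1906145);
        System.out.println("bitCount calls: " + groups.size() + " instead of " + CHEMBL1906145.length);
        System.out.println(intersect(CHEMBL1906145, CHEMBL1906145, groups)
                == naive(CHEMBL1906145, CHEMBL1906145)); // true
    }
}
```

In a code generator the groups would be computed once per query and emitted as straight-line code like the slide's, rather than looped over at search time.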
  41. CONCLUSIONS
     Speedy tools for structure searching
       • Quick feedback from a search allows refinement if needed
       • Enables different types of search (e.g. make/break)
     Speedy tools for text-mining patents
       • Assists in improvement of grammar and dictionaries
       • Extract from all patents, not just a subset of IPC codes
     Future Work
       • Extract additional types of chemical data
       • Advanced query features beyond SMARTS
  42. Acknowledgements
       • Yurii Moroz, Chemspace
       • Pat Walters, Relay Therapeutics
       • James Davidson, Vernalis
       • Mathew Swain, Vernalis
       • Daniel Lowe, Minesoft
     Related Talks:
       • R Sayle. Recent Advances in Chemical & Biological Search Systems: Evolution v Resolution. ICCS, May 2018
       • J Mayfield. Pistachio: Search and Faceting of Large Reaction Databases. 254th ACS National Meeting, Aug 2017
       • D Lowe. Sketchy sketches: Hiding chemistry in plain sight. 252nd ACS National Meeting, Aug 2016
     Available at: https://www.slideshare.net/NextMoveSoftware
     CINF 162: NextMove for Chemspace: Millisecond search in a database of 100 million structures. Thursday 10:25, Grand Ballroom A
     CINF 170: Regioselectivity: An application of expert systems and ontologies to chemical (named) reaction analysis. Thursday 10:40, Lewis
