Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Chemistry and reactions from non-US patents

2,039 views

Published on

All US patents from 1976 onwards are freely available in computer-readable formats providing a large corpus for chemical text mining. Other patent offices are increasingly also offering their back-catalogues as XML, allowing chemical text mining to be performed in the same way as for recent US patents.
We investigate how much chemistry is found in non-US patents (compared to US patents) and, where the chemistry is present in publications from multiple patent offices, how long were the delays between these publications.
We show that non-US patents can be text mined for a large number of chemical reactions and analyse the overlap with reactions from US patents. Finally we use all the extracted chemical reactions to explore whether models for predicting reaction yield may be built from features such as the reaction type and its reaction conditions (as text mined from the patent text).

Published in: Technology
  • Be the first to comment

Chemistry and reactions from non-US patents

  1. 1. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Chemistry and reactions from non- US patents Daniel Lowe and Roger Sayle NextMove Software Cambridge, UK
  2. 2. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Topics • Coverage of USPTO vs EPO patents • Extraction of non-chemical-compound data • Can text-mining provide insights into reaction informatics
  3. 3. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 What am I missing by only using the USPTO data?
  4. 4. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 EPO and USPTO background • USPTO: 303k grants published in 2013 • EPO: 67k grants published in 2013 • EPO patents must be in one or more of English, French or German • USPTO documents are all in English
  5. 5. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Epo/USPTO Compounds (using StdInChI) 3,345,877 2,674,069 445,588 USPTO = 1976-2013 patent grants + 2001-2013 patent applications EPO = 1978-2013 patent grants and applications
  6. 6. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Epo/USPTO Reactions (using separately sorted StdInChIs for reactants/agents and products) 706,524 317,533 99,681 USPTO = 1976-2013 patent grants + 2001-2013 patent applications EPO = 1978-2013 patent grants and applications
  7. 7. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Effect of different Languages 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Numberofstructuresextracted English German French Translation performed by Lexichem v2014.Jun
  8. 8. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 First Disclosure Occurrence by Language Excluding compounds disclosed by USPTO (1976- 2013) patents All compounds
  9. 9. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Timeliness • Competitive intelligence requires up to date information • Methodology: – Consider compounds present in both EPO and USPTO date not disclosed by either prior to 2006 – Compare the earliest publication date for each compound
  10. 10. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 0 20000 40000 60000 80000 100000 120000 140000 160000 180000 US more than 5 years earlier US 3-5 years earlier US 2-3 years earlier US 1-2 years earlier US 6-12 months earlier US 3-6 months earlier US 1-3 months earlier US 1-4 weeks earlier Within a week US 1-4 weeks later US 1-3 months later US 3-6 months later US 6-12 months later US 1-2 years later US 2-3 years later US 3-5 years later US more than 5 years later Compounddisclosures Average lag between EPO/USPTO Compounds first disclosed between 2006-2013 Mean: US 540 days earlier
  11. 11. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 0 10000 20000 30000 40000 50000 60000 70000 US more than 5 years earlier US 3-5 years earlier US 2-3 years earlier US 1-2 years earlier US 6-12 months earlier US 3-6 months earlier US 1-3 months earlier US 1-4 weeks earlier Within a week US 1-4 weeks later US 1-3 months later US 3-6 months later US 6-12 months later US 1-2 years later US 2-3 years later US 3-5 years later US more than 5 years later Compounddisclosures Applications Vs Applications Mean: US 223 days earlier Compounds first disclosed between 2006-2013
  12. 12. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 0 20000 40000 60000 80000 100000 120000 140000 US more than 5 years earlier US 3-5 years earlier US 2-3 years earlier US 1-2 years earlier US 6-12 months earlier US 3-6 months earlier US 1-3 months earlier US 1-4 weeks earlier Within a week US 1-4 weeks later US 1-3 months later US 3-6 months later US 6-12 months later US 1-2 years later US 2-3 years later US 3-5 years later US more than 5 years later Compounddisclosures Grants vs Grants Compounds first disclosed between 2006-2013 Mean: US 287 days earlier
  13. 13. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
  14. 14. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Epo/US 1976-2013 and chembl19 overlap 187,486 1,155,034 6,278,048
  15. 15. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 Patents 4-5 years earlier Patents 3-4 years earlier Patents 2-3 years earlier Patents 1-2 years earlier Patents 0-1 years earlier Patents 0-1 years later Patents 1-2 years later Patents 2-3 years later Patents 3-5 years later Patents 4-5 years later Average lag between Patents/CHEMbl19 Compounds first disclosed between 2006-2013 Mean: Patents 345 days earlier
  16. 16. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Other data in patents • Genes/proteins • Reactions including role of reagents • Physical quantities • Spectra • Diseases • Organisms • Companies • …
  17. 17. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Gene/protein identification • Identify trends in drug target popularity • Count number of patents (in a year) mentioning a given gene or its gene products and map to the HGNC symbol • Many genes are referenced by short ambiguous terms hence great care is required to have good precision
  18. 18. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Gene/protein identification (Hepatocyte growth factor receptor)(Epidermal growth factor receptor) 0.00% 0.10% 0.20% 0.30% 0.40% 0.50% 0.60% 0.70% 0.80% 0.90% 0.00% 0.05% 0.10% 0.15% 0.20% 0.25% 0.30% 0.35% 1976197819801982198419861988199019921994199619982000200220042006200820102012 Percentageofprotein/genementionsinthatyear(ChEMBL) Percentageofprotein/genementionsinthatyear(Patents) EGFR 0.00% 0.10% 0.20% 0.30% 0.40% 0.50% 0.60% 0.70% 0.00% 0.02% 0.04% 0.06% 0.08% 0.10% 0.12% 0.14% 1976197819801982198419861988199019921994199619982000200220042006200820102012 Percentageofprotein/genementionsinthatyear(ChEMBL) Percentageofprotein/genementionsinthatyear(Patents) MET
  19. 19. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Gene/protein identification (Mammalian target of rapamycin) (Tissue plasminogen activator) 0.00% 0.05% 0.10% 0.15% 0.20% 0.25% 0.30% 0.35% 0.40% 0.45% 0.00% 0.02% 0.04% 0.06% 0.08% 0.10% 0.12% 0.14% 1976197819801982198419861988199019921994199619982000200220042006200820102012 Percentageofprotein/genementionsinthatyear(ChEMBL) Percentageofprotein/genementionsinthatyear(Patents) MTOR 0.00% 0.05% 0.10% 0.15% 0.20% 0.25% 0.30% 0.35% 0.40% 0.00% 0.05% 0.10% 0.15% 0.20% 0.25% 0.30% 0.35% 0.40% 0.45% 0.50% 1976 1978 1980 1982 1984 1986 19881990 1992 1994 1996 1998 2000 2002 2004 20062008 2010 2012 Percentageofprotein/genementionsinthatyear(ChEMBL) Percentageofprotein/genementionsinthatyear(Patents) PLAT
  20. 20. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014
  21. 21. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Melting/Boiling Point Extraction
  22. 22. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Melting/Boiling Point Extraction Over 99k so far!
  23. 23. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 CHEMICAL reactions
  24. 24. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Predicting yield • Features to consider: • 2D fingerprints (especially around the reaction centers) • Reaction type • Temperature • Time • Change in complexity e.g. chiral centres
  25. 25. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Data extraction • Can pull out yields, quantities, times and temperatures • Can sanity check text-mined yield by calculating from amounts (or even masses) 𝑚𝑎𝑠𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑(𝑔) 𝑚𝑜𝑙𝑎𝑟 𝑚𝑎𝑠𝑠 [𝑐𝑎𝑙𝑐 𝑓𝑟𝑜𝑚 𝑠𝑡𝑟𝑢𝑐𝑡𝑢𝑟𝑒](𝑔𝑚𝑜𝑙−1) = 𝑚𝑜𝑙𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑 𝑚𝑜𝑙𝑠 𝑜𝑓 𝑝𝑟𝑜𝑑𝑢𝑐𝑡 𝑚𝑜𝑙𝑠 𝑜𝑓 𝑙𝑖𝑚𝑖𝑡𝑖𝑛𝑔 𝑟𝑒𝑎𝑐𝑡𝑎𝑛𝑡 = %𝑦𝑖𝑒𝑙𝑑
  26. 26. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Outliers inevitable
  27. 27. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 sCale vs yield 2001-2013 US applications, Suzuki couplings
  28. 28. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Temperatures
  29. 29. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Identify Synthetic Routes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Intermediates 197702103114 56611 31403 17268 9230 5057 2701 1256 639 301 136 58 15 5 2 Terminal Products 385149149445 81837 47579 27670 16619 9320 5263 2511 1330 678 373 111 63 8 6 5 0 100000 200000 300000 400000 500000 600000 700000 Occurrences Number of steps Intermediates Terminal Products
  30. 30. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Number of steps vs Complexity ringComplexityScore = log10(nRingBridgeAtoms + 1) + log10(nSpiroAtoms + 1) stereoComplexityScore = log10(nStereoCenters + 1) macrocyclePenalty = log10(nMacrocycles + 1) sizePenalty = natoms1.005 − natoms ∑ Ertl & Schuffenhauer 2009 doi:10.1186/1758-2946-1-8
  31. 31. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Number of steps vs Complexity 0 5 10 15 20 25 30 35 40 45 50 1 2 3 4 5 6 7 8 9 10 11 Numberofproductheavyatoms Number of steps Average number of heavy atoms
  32. 32. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Yield vs Change in Complexity 2001-2013 US applications, all reactions
  33. 33. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Trends in Reaction Types 0.0% 1.0% 2.0% 3.0% 4.0% 5.0% 6.0% 7.0% 8.0% 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Suzukicouplingsasapercentageofreactionsinayear Classification performed using NameRXN
  34. 34. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Trends In Solvent Use 0.0% 5.0% 10.0% 15.0% 20.0% 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Percentageofreactionsinthatyear Tetrahydrofuran Dichloromethane Water Dimethylformamide Methanol Ethyl acetate Ethanol 1,4-Dioxane Toluene Acetonitrile Acetic acid Chloroform Acetone Benzene
  35. 35. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Are solvents getting greener? 1976 2013 Water (21%) Tetrahydrofuran (15%) Ethanol (11%) Dichloromethane (14%) Benzene (8%) Water (13%) Methanol (7%) Dimethylformamide (10%) Tetrahydrofuran (5%) Methanol (8%) Dichloromethane (4%) Ethyl acetate (7%) Dimethylformamide (4%) Ethanol (5%) Acetic acid (4%) 1,4-Dioxane (4%) Chloroform (3%) Toluene (3%) Acetone (3%) Acetonitrile (3%) Total for top 10: 71% 82%
  36. 36. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Conclusions • A significant amount of novel chemistry, from EPO patents, comes from the non-English patents • Compounds disclosed by both the USPTO and EPO are on average published earlier by the USPTO but for many compounds an EPO patent will be the earlier disclosure • Gene/protein identification can identify clear changes in patenting behaviour over time • Text mining provides the tools to answer many reaction informatics questions
  37. 37. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Acknowledgements • Funding provided by:
  38. 38. 248th ACS National Meeting, San Francisco CA, USA 10th August 2014 Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog daniel@nextmovesoftware.com

×