Chemistry and reactions from non-US patents

248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Chemistry and reactions from non-
US patents
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK

Topics
• Coverage of USPTO vs EPO patents
• Extraction of non-chemical-compound data
• Can text-mining provide insights into reaction
informatics

What am I missing by only using the
USPTO data?

EPO and USPTO background
• USPTO: 303k grants published in 2013
• EPO: 67k grants published in 2013
• EPO patents must be in one or more of
English, French or German
• USPTO documents are all in English

Epo/USPTO Compounds
(using StdInChI)
3,345,877
2,674,069
445,588
USPTO = 1976-2013 patent grants + 2001-2013 patent applications
EPO = 1978-2013 patent grants and applications

Epo/USPTO Reactions
(using separately sorted StdInChIs for reactants/agents and products)
706,524
317,533
99,681
USPTO = 1976-2013 patent grants + 2001-2013 patent applications
EPO = 1978-2013 patent grants and applications

Effect of different Languages
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
Numberofstructuresextracted
English
German
French
Translation performed by Lexichem v2014.Jun

First Disclosure Occurrence by
Language
Excluding compounds
disclosed by USPTO (1976-
2013) patents
All compounds

Timeliness
• Competitive intelligence requires up to date
information
• Methodology:
– Consider compounds present in both EPO and
USPTO date not disclosed by either prior to 2006
– Compare the earliest publication date for each
compound

0
20000
40000
60000
80000
100000
120000
140000
160000
180000
US more
than 5
years
earlier
US 3-5
years
earlier
US 2-3
years
earlier
US 1-2
years
earlier
US 6-12
months
earlier
US 3-6
months
earlier
US 1-3
months
earlier
US 1-4
weeks
earlier
Within a
week
US 1-4
weeks
later
US 1-3
months
later
US 3-6
months
later
US 6-12
months
later
US 1-2
years
later
US 2-3
years
later
US 3-5
years
later
US more
than 5
years
later
Compounddisclosures
Average lag between EPO/USPTO
Compounds first disclosed between 2006-2013
Mean: US 540 days earlier

0
10000
20000
30000
40000
50000
60000
70000
US more
than 5
years
earlier
US 3-5
years
earlier
US 2-3
years
earlier
US 1-2
years
earlier
US 6-12
months
earlier
US 3-6
months
earlier
US 1-3
months
earlier
US 1-4
weeks
earlier
Within a
week
US 1-4
weeks
later
US 1-3
months
later
US 3-6
months
later
US 6-12
months
later
US 1-2
years
later
US 2-3
years
later
US 3-5
years
later
US more
than 5
years
later
Compounddisclosures
Applications Vs Applications

0
20000
40000
60000
80000
100000
120000
140000
US more
than 5
years
earlier
US 3-5
years
earlier
US 2-3
years
earlier
US 1-2
years
earlier
US 6-12
months
earlier
US 3-6
months
earlier
US 1-3
months
earlier
US 1-4
weeks
earlier
Within a
week
US 1-4
weeks
later
US 1-3
months
later
US 3-6
months
later
US 6-12
months
later
US 1-2
years
later
US 2-3
years
later
US 3-5
years
later
US more
than 5
years
later
Compounddisclosures
Grants vs Grants

Epo/US 1976-2013 and chembl19
overlap
187,486
1,155,034
6,278,048

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Patents 4-5
years earlier
Patents 3-4
years earlier
Patents 2-3
years earlier
Patents 1-2
years earlier
Patents 0-1
years earlier
Patents 0-1
years later
Patents 1-2
years later
Patents 2-3
years later
Patents 3-5
years later
Patents 4-5
years later
Average lag between
Patents/CHEMbl19
Mean: Patents 345 days
earlier

Other data in patents
• Genes/proteins
• Reactions including role of reagents
• Physical quantities
• Spectra
• Diseases
• Organisms
• Companies
• …

Gene/protein identification
• Identify trends in drug target popularity
• Count number of patents (in a year)
mentioning a given gene or its gene products
and map to the HGNC symbol
• Many genes are referenced by short
ambiguous terms hence great care is required
to have good precision

(Hepatocyte growth factor receptor)(Epidermal growth factor receptor)
0.00%
0.10%
0.20%
0.30%
0.40%
0.50%
0.60%
0.70%
0.80%
0.90%
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
1976197819801982198419861988199019921994199619982000200220042006200820102012
Percentageofprotein/genementionsinthatyear(ChEMBL)
Percentageofprotein/genementionsinthatyear(Patents)
EGFR
0.00%
0.10%
0.20%
0.30%
0.40%
0.50%
0.60%
0.70%
0.00%
0.02%
0.04%
0.06%
0.08%
0.10%
0.12%
0.14%
1976197819801982198419861988199019921994199619982000200220042006200820102012
MET

(Mammalian target of rapamycin) (Tissue plasminogen activator)
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
0.40%
0.45%
0.00%
0.02%
0.04%
0.06%
0.08%
0.10%
0.12%
0.14%
1976197819801982198419861988199019921994199619982000200220042006200820102012
MTOR
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
0.40%
0.00%
0.05%
0.10%
0.15%
0.20%
0.25%
0.30%
0.35%
0.40%
0.45%
0.50%
1976 1978 1980 1982 1984 1986 19881990 1992 1994 1996 1998 2000 2002 2004 20062008 2010 2012
PLAT

Melting/Boiling Point Extraction

Melting/Boiling Point Extraction
Over 99k so far!

CHEMICAL reactions

Predicting yield
• Features to consider:
• 2D fingerprints (especially around the reaction
centers)
• Reaction type
• Temperature
• Time
• Change in complexity e.g. chiral centres

Data extraction
• Can pull out yields, quantities, times and
temperatures
• Can sanity check text-mined yield by
calculating from amounts (or even masses)
𝑚𝑎𝑠𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑(𝑔)
𝑚𝑜𝑙𝑎𝑟 𝑚𝑎𝑠𝑠 [𝑐𝑎𝑙𝑐 𝑓𝑟𝑜𝑚 𝑠𝑡𝑟𝑢𝑐𝑡𝑢𝑟𝑒](𝑔𝑚𝑜𝑙−1)
= 𝑚𝑜𝑙𝑠 𝑜𝑓 𝑐𝑜𝑚𝑝𝑜𝑢𝑛𝑑
𝑚𝑜𝑙𝑠 𝑜𝑓 𝑝𝑟𝑜𝑑𝑢𝑐𝑡
𝑚𝑜𝑙𝑠 𝑜𝑓 𝑙𝑖𝑚𝑖𝑡𝑖𝑛𝑔 𝑟𝑒𝑎𝑐𝑡𝑎𝑛𝑡
= %𝑦𝑖𝑒𝑙𝑑

Outliers inevitable

sCale vs yield
2001-2013 US applications, Suzuki couplings

Temperatures

Identify Synthetic Routes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Intermediates 197702103114 56611 31403 17268 9230 5057 2701 1256 639 301 136 58 15 5 2
Terminal Products 385149149445 81837 47579 27670 16619 9320 5263 2511 1330 678 373 111 63 8 6 5
0
100000
200000
300000
400000
500000
600000
700000
Occurrences
Number of steps
Intermediates
Terminal Products

Number of steps vs Complexity
ringComplexityScore = log10(nRingBridgeAtoms + 1) + log10(nSpiroAtoms + 1)
stereoComplexityScore = log10(nStereoCenters + 1)
macrocyclePenalty = log10(nMacrocycles + 1)
sizePenalty = natoms1.005 − natoms
∑ Ertl & Schuffenhauer 2009
doi:10.1186/1758-2946-1-8

Number of steps vs Complexity
0
5
10
15
20
25
30
35
40
45
50
1 2 3 4 5 6 7 8 9 10 11
Numberofproductheavyatoms
Number of steps
Average number of heavy atoms

Yield vs Change in Complexity
2001-2013 US applications, all reactions

Trends in Reaction Types
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
Suzukicouplingsasapercentageofreactionsinayear
Classification performed using NameRXN

Trends In Solvent Use
0.0%
5.0%
10.0%
15.0%
20.0%
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
Percentageofreactionsinthatyear
Tetrahydrofuran
Dichloromethane
Water
Dimethylformamide
Methanol
Ethyl acetate
Ethanol
1,4-Dioxane
Toluene
Acetonitrile
Acetic acid
Chloroform
Acetone
Benzene

Are solvents getting greener?
1976 2013
Water (21%) Tetrahydrofuran (15%)
Ethanol (11%) Dichloromethane (14%)
Benzene (8%) Water (13%)
Methanol (7%) Dimethylformamide (10%)
Tetrahydrofuran (5%) Methanol (8%)
Dichloromethane (4%) Ethyl acetate (7%)
Dimethylformamide (4%) Ethanol (5%)
Acetic acid (4%) 1,4-Dioxane (4%)
Chloroform (3%) Toluene (3%)
Acetone (3%) Acetonitrile (3%)
Total for top 10: 71% 82%

Conclusions
• A significant amount of novel chemistry, from EPO patents, comes
from the non-English patents
• Compounds disclosed by both the USPTO and EPO are on average
published earlier by the USPTO but for many compounds an EPO
patent will be the earlier disclosure
• Gene/protein identification can identify clear changes in patenting
behaviour over time
• Text mining provides the tools to answer many reaction
informatics questions

Acknowledgements
• Funding provided by:

Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com

Chemistry and reactions from non-US patents

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Viewers also liked

Viewers also liked (8)

Similar to Chemistry and reactions from non-US patents

Similar to Chemistry and reactions from non-US patents (20)

More from NextMove Software

More from NextMove Software (18)

Recently uploaded

Recently uploaded (20)

Chemistry and reactions from non-US patents