Automated Extraction of Reactions from the Patent Literature

Daniel Lowe
Unilever Centre for Molecular Science Informatics
University of Cambridge

Chemistry patent applications
• 100,000s of applications each year
[Figure: chemistry patent applications per year, 2000 to 2009; y-axis from 0 to 400,000]
Source: World Intellectual Property Indicators, 2011 edition

The idea
XML patents → Reaction Extraction System → Extracted Reactions

Steps involved
• Identifying experimental sections
• Identifying chemical entities
• Chemical name to structure conversion
• Associating chemical entities with quantities
• Assigning chemical roles
• Atom-atom mapping

Building on existing projects

Archetypal experimental section
• Section heading
• Section target compound
• Step identifier
• Step target compound
• Paragraph number
• Synthesis
• Workup
• Characterisation
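
The parts of an experimental section listed above can be mirrored in a small data model. This is a minimal sketch, assuming one record per section, step and paragraph; the class and field names are hypothetical and are not taken from the extraction system itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Paragraph:
    """One numbered paragraph, split into its three typical parts."""
    number: Optional[str] = None   # paragraph number as printed in the patent
    synthesis: str = ""
    workup: str = ""
    characterisation: str = ""


@dataclass
class Step:
    """A step within a section; it may name its own target compound."""
    identifier: Optional[str] = None
    target_compound: Optional[str] = None
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class ExperimentalSection:
    """A whole experimental section with its heading and target compound."""
    heading: str
    target_compound: Optional[str] = None
    steps: List[Step] = field(default_factory=list)


# Illustrative values only: the heading and step label are made up.
section = ExperimentalSection(
    heading="Example 1",
    target_compound="Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate",
    steps=[Step(identifier="Step 1",
                paragraphs=[Paragraph(synthesis="To a solution of ...")])],
)
print(section.steps[0].paragraphs[0].synthesis)
```

Keeping the step target separate from the section target lets a multi-step preparation record its intermediates.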

Jessop, D. M.; Adams, S. E.; Murray-Rust, P.
Mining Chemical Information from Open
Patents. Journal of Cheminformatics 2011, 3, 40.

ChemicalTagger
• Tags words of text
• Parses tags to identify phrases
• Generates an XML parse tree
– http://chemicaltagger.ch.cam.ac.uk/
– Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for
semantic text-mining in chemistry. J Cheminf 2011, 3, 17.

Tagging
• Regex tagger: tags keywords e.g. “yield”, “mL”
• OSCAR4 tagger: Finds names OSCAR4 believes to be chemical
e.g. “2-methylpyridine”
• OpenNLP: Tags parts of speech

Additional taggers:
• OPSIN tagger: Finds names OPSIN can parse
• Trivial chemical name tagger: Tags a few chemicals missed by
the other taggers and cases that are partially matched by
the regex tagger, e.g. Dess-Martin reagent
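
The regex tagger idea can be sketched in a few lines. This is only an illustration, assuming tokens are already split on whitespace: the pattern list is hypothetical and far smaller than the real ChemicalTagger rule set. The tag names NN-MASS, NN-AMOUNT, NN-EQ and CD follow the sample output in this deck, while NN-VOL and NN-YIELD are assumed names.

```python
import re

# Hypothetical keyword patterns, checked in order; first match wins.
TAG_PATTERNS = [
    ("NN-MASS", re.compile(r"^(?:mg|g|kg)$", re.IGNORECASE)),
    ("NN-VOL", re.compile(r"^(?:ml|l)$", re.IGNORECASE)),       # assumed tag name
    ("NN-AMOUNT", re.compile(r"^(?:mmol|mol)$", re.IGNORECASE)),
    ("NN-EQ", re.compile(r"^(?:eq|equiv)\.?$", re.IGNORECASE)),
    ("NN-YIELD", re.compile(r"^yields?$", re.IGNORECASE)),      # assumed tag name
    ("CD", re.compile(r"^\d+(?:\.\d+)?$")),                     # cardinal number
]


def regex_tag(tokens):
    """Return (token, tag) pairs; tag is None when no keyword pattern matches."""
    tagged = []
    for token in tokens:
        tag = next((name for name, pattern in TAG_PATTERNS
                    if pattern.match(token)), None)
        tagged.append((token, tag))
    return tagged


print(regex_tag("DCM 35 mL yield".split()))
```

Tokens left untagged here (such as "DCM") are the ones the chemical name taggers and the part-of-speech tagger would be expected to handle.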

Sample ChemicalTagger Output
<MOLECULE>
<OSCARCM>
<OSCAR-CM>methyl</OSCAR-CM>
<OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
</OSCARCM>
<QUANTITY>
<_-LRB->(</_-LRB->
<MASS>
<CD>606</CD>
<NN-MASS>mg</NN-MASS>
</MASS>
<COMMA>,</COMMA>
<AMOUNT>
<CD>2.1</CD>
<NN-AMOUNT>mmol</NN-AMOUNT>
</AMOUNT>
<COMMA>,</COMMA>
<EQUIVALENT>
<CD>1</CD>
<NN-EQ>eq</NN-EQ>
</EQUIVALENT>
<_-RRB->)</_-RRB->
</QUANTITY>
</MOLECULE>
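
A parse tree like the one above can be read back with a standard XML library. The following is a minimal sketch using Python's `xml.etree.ElementTree` on the sample output; the function name `molecule_summary` is hypothetical and this is not how the extraction system itself consumes the tree.

```python
import xml.etree.ElementTree as ET

# The sample ChemicalTagger output from this slide, as one string.
SAMPLE = """<MOLECULE>
<OSCARCM>
<OSCAR-CM>methyl</OSCAR-CM>
<OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
</OSCARCM>
<QUANTITY>
<_-LRB->(</_-LRB->
<MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
<COMMA>,</COMMA>
<AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
<COMMA>,</COMMA>
<EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
<_-RRB->)</_-RRB->
</QUANTITY>
</MOLECULE>"""


def molecule_summary(xml_text):
    """Return the chemical name and a {kind: (value, unit)} dict of quantities."""
    root = ET.fromstring(xml_text)
    # Join the name tokens tagged as chemical by OSCAR4.
    name = " ".join(node.text for node in root.iter("OSCAR-CM"))
    quantities = {}
    for kind in ("MASS", "AMOUNT", "EQUIVALENT"):
        node = root.find(f"QUANTITY/{kind}")
        if node is not None:
            value = float(node.findtext("CD"))
            unit = next(c.text for c in node if c.tag.startswith("NN-"))
            quantities[kind] = (value, unit)
    return name, quantities


name, quantities = molecule_summary(SAMPLE)
print(name, quantities)
```

Because the quantity node sits inside the molecule node, associating a chemical entity with its quantities reduces to reading siblings within one subtree.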

Phrase Identification

Quantity Identification

Section/Step Parsing

Pyridine, pyridines and pyridine rings

Entity                                Type
Pyridine                              Exact
The pyridine / Pyridine from step 1   DefiniteReference
Pyridines / A pyridine                ChemicalClass
Pyridine ring / Pyridyl               Fragment
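
The four entity types can be illustrated with a toy classifier. This is a deliberately naive sketch: the surface rules below are assumptions chosen to reproduce the pyridine examples on this slide, and they are not the rules used by the extraction system (a plural check on the last letter, for instance, would misfire on many real names).

```python
import re


def classify_entity(mention):
    """Assign one of the four entity types using simple surface cues."""
    text = mention.strip().lower()
    words = text.split()
    if not words:
        raise ValueError("empty mention")
    # Substituent names and explicit ring/group wording suggest a fragment.
    if words[-1] in {"ring", "group", "moiety"} or words[-1].endswith("yl"):
        return "Fragment"
    # A definite article or a back-reference points at a known compound.
    if words[0] == "the" or re.search(r"\bfrom (?:step|example)\b", text):
        return "DefiniteReference"
    # An indefinite article or a plural suggests a class of compounds.
    if words[0] in {"a", "an"} or words[-1].endswith("s"):
        return "ChemicalClass"
    return "Exact"


for mention in ["Pyridine", "The pyridine", "Pyridine from step 1",
                "Pyridines", "A pyridine", "Pyridine ring", "Pyridyl"]:
    print(mention, "->", classify_entity(mention))
```

The distinction matters downstream: only entities of type Exact can be resolved directly to a single structure.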

Section/Step Parsing

Workup phrase types: Concentrate, Degass,
Dry, Extract, Filter, Partition, Precipitate,
Purify, Recover, Remove, Wash, Quench
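
One simple way to use these phrase types is to treat the first workup phrase as the boundary between synthesis and workup. The sketch below assumes exactly that rule, which may be simpler than what the system does. The non-workup labels in the usage example (Dissolve, Add, Stir, Yield) are assumed names for illustration.

```python
# Workup phrase types as listed on this slide.
WORKUP_TYPES = {
    "Concentrate", "Degass", "Dry", "Extract", "Filter", "Partition",
    "Precipitate", "Purify", "Recover", "Remove", "Wash", "Quench",
}


def split_synthesis_workup(phrase_types):
    """Split an ordered list of phrase types at the first workup phrase."""
    for index, phrase_type in enumerate(phrase_types):
        if phrase_type in WORKUP_TYPES:
            return list(phrase_types[:index]), list(phrase_types[index:])
    # No workup phrase found: everything is treated as synthesis.
    return list(phrase_types), []


# Assumed phrase sequence for a typical preparation.
phrases = ["Dissolve", "Add", "Stir", "Remove", "Dissolve", "Wash", "Dry", "Filter", "Yield"]
synthesis, workup = split_synthesis_workup(phrases)
print(synthesis)
print(workup)
```

Chemicals mentioned after the boundary would then be candidates for workup reagents and solvents, not reactants.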

Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606
mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added
pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N
(540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred
at room temperature until all of the starting material was
consumed. The solvent was evaporated in vacuo and the
residue redissolved in ethyl acetate (10 ml), washed with
water (10 ml), saturated sodium hydrogen carbonate (10
ml), dried over sodium sulphate, filtered and evaporated to
yield the title compound as a white solid (690 mg, 1.8
mmol, 85%).

Graphical Output

CML output
• Reaction SMILES
• Quantities including yield are extracted
• SMILES and InChIs for every structure resolvable reagent/product
• Entity is classified as an exact compound, definite reference, chemical class or polymer

<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
<dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
<productList>
<product role="product">
<molecule id="m0">
<name dictRef="nameDict:unknown">title compound</name>
</molecule>
<amount units="unit:mmol">1.8</amount>
<amount units="unit:mg">690</amount>
<amount units="unit:percentYield">85.0</amount>
<identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
<identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
<dl:entityType>definiteReference</dl:entityType>
<dl:state>solid</dl:state>
</product>
</productList>
<reactantList>
<reactant role="reactant" count="1">
<molecule id="m1">
<name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
</molecule>
<amount units="unit:mmol">2.1</amount>
<amount units="unit:mg">606</amount>
<amount units="unit:eq">1.0</amount>
<identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
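
Downstream code can read amounts and structures from this kind of output with a namespace-aware XML parser. The sketch below works on a reduced, well-formed fragment built from the product entry above; it leaves out the `dl:` elements because their namespace URI is truncated on the slide, and the function name `product_amounts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Default CML namespace, as declared on the <reaction> element.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Reduced fragment for illustration; not the complete extractor output.
FRAGMENT = """<reaction xmlns="http://www.xml-cml.org/schema">
  <productList>
    <product role="product">
      <molecule id="m0"><name>title compound</name></molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles"
                  value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
    </product>
  </productList>
</reaction>"""


def product_amounts(cml_text):
    """Return a list of (smiles, {unit: value}) tuples, one per product."""
    root = ET.fromstring(cml_text)
    results = []
    for product in root.iterfind("cml:productList/cml:product", CML_NS):
        amounts = {
            amount.get("units").split(":", 1)[1]: float(amount.text)
            for amount in product.iterfind("cml:amount", CML_NS)
        }
        smiles = next(
            (ident.get("value")
             for ident in product.iterfind("cml:identifier", CML_NS)
             if ident.get("dictRef") == "cml:smiles"),
            None,
        )
        results.append((smiles, amounts))
    return results


print(product_amounts(FRAGMENT))
```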

Evaluation
• 2008-2011 USPTO patent applications classified as containing
organic chemistry → 65,034 documents

• 484,259 atom-mapped reactions extracted

• Adding the additional requirements that all the identified
product molecules were resolvable to structures and that all
reagents were believed to describe exact compounds
→ 424,621 reactions

• 100 of these were selected for manual evaluation of quality

Reactions found
[Figure: number of patents with a given number of extracted reactions; x-axis: number of extracted reactions (0 to 1000); y-axis: patents with given number of reactions (log scale, 1 to 100,000)]

Results
• 96% correctly identified the primary starting material and product
whilst not misidentifying reagents that could be confused with the
starting material

• Against the 495 expected chemical entities there were 61 false
positives and 16 false negatives

• Only 4 of the 321 reagents (with quantities) did not have these
quantities recognised and associated with the reagent

• Association of quantities/yields with products was less successful: only 48
of the 74 cases where such data were present were handled correctly
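
The counts above can be turned into standard rates. The sketch below assumes that true positives equal the expected entities minus the false negatives (495 - 16 = 479); the slide does not state precision or recall, so these figures are derived here and are not quoted from it.

```python
def entity_metrics(expected, false_positives, false_negatives):
    """Precision, recall and F1, assuming true positives = expected - false negatives."""
    true_positives = expected - false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / expected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = entity_metrics(expected=495, false_positives=61, false_negatives=16)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.887 0.968 0.926

# Quantity association rates from the same slide.
reagent_rate = (321 - 4) / 321   # reagents whose quantities were recognised
product_rate = 48 / 74           # products whose quantities/yields were handled
print(round(reagent_rate, 3), round(product_rate, 3))  # 0.988 0.649
```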

Use Cases
• Reaction searching

• Analysing trends in reactions over time

• Reaction outcome prediction

Example of reaction searching
C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)

6 reactions found in 5 patents
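
The query above follows the reactants>agents>products layout, with atom map numbers after a colon inside square brackets. The sketch below only splits such a string and collects the map numbers using the standard library. It does no substructure matching, which would need a cheminformatics toolkit, and the function names are hypothetical.

```python
import re

# Atom map number inside a bracket atom, e.g. the 1 in [CH:1].
ATOM_MAP = re.compile(r"\[[^\]]*?:(\d+)\]")


def parse_reaction(reaction):
    """Split 'reactants>agents>products' into three lists of molecule strings."""
    reactants, agents, products = reaction.split(">")

    def molecules(part):
        return [molecule for molecule in part.split(".") if molecule]

    return molecules(reactants), molecules(agents), molecules(products)


def mapped_atoms(molecule_list):
    """Collect the atom map numbers used in a list of molecule strings."""
    return {int(number) for molecule in molecule_list
            for number in ATOM_MAP.findall(molecule)}


query = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"
reactants, agents, products = parse_reaction(query)
print(reactants, agents, products)
print(mapped_atoms(reactants) == mapped_atoms(products))
```

Here the two mapped alkene carbons on the left reappear in the three-membered ring on the right, and the unmapped ring carbon comes from the ICI component.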

Name I20110224.tarUS20110046406A1-20110224.ZIP0066

Text from US 2011/0046406 A1

Most lexical variants

1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
EDCI hydrochloride
1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride
And 127 more!

675 chemicals had over 10 lexical variants!

Most common solvents

Known Limitations
• The first workup reagent is often erroneously classified as a
reactant

• Atom mapping produces mappings that are not necessarily
representative of reaction mechanism and occasionally
involve clearly incorrect atoms

• Conditions from analogous reactions are not resolved

• Temperature and time of reactions are not captured

Conclusions
• 424,621 exact atom-mapped reactions were
extracted from 4 years of USPTO patent
applications
• Evaluation indicates the reactions to be of
generally good quality, especially if the
misidentification of workup reagents as
reactants is not considered important
• All the code to extract reactions is open source:
https://bitbucket.org/dan2097/patent-reaction-extraction

Acknowledgements
Unilever centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson
Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov
SMARTS searching: Roger Sayle
Boehringer Ingelheim for funding

Any Questions?

Email: daniel@nextmovesoftware.com