We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full-text chemical literature. OSCAR4 is used to recognise chemical entities and resolve them to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. ChemicalTagger performs part-of-speech tagging, allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful, we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.
Automated Extraction of Reactions from the Patent Literature
1. Automated Extraction of Reactions from the Patent Literature
Daniel Lowe
Unilever Centre for Molecular Science Informatics
University of Cambridge
2. Chemistry patent applications
• 100,000s of applications each year
[Figure: chemistry patent applications per year, 2000 to 2009 (y-axis 0 to 400,000). Source: World Intellectual Property Indicators, 2011 edition]
4. The idea
XML patents → Reaction Extraction System → Extracted Reactions
5. Steps involved
• Identifying experimental sections
• Identifying chemical entities
• Chemical name to structure conversion
• Associating chemical entities with quantities
• Assigning chemical roles
• Atom-atom mapping
8. Jessop, D. M.; Adams, S. E.; Murray-Rust, P.
Mining Chemical Information from Open
Patents. Journal of Cheminformatics 2011, 3, 40.
9. ChemicalTagger
• Tags words of text
• Parses tags to identify phrases
• Generates an XML parse tree
– http://chemicaltagger.ch.cam.ac.uk/
– Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for
semantic text-mining in chemistry. J Cheminf 2011, 3, 17.
10. Tagging
• Regex tagger: tags keywords e.g. “yield”, “mL”
• OSCAR4 tagger: Finds names OSCAR4 believes to be chemical
e.g. “2-methylpyridine”
• OpenNLP: Tags parts of speech
Additional taggers:
• OPSIN tagger: Finds names OPSIN can parse
• Trivial chemical name tagger: Tags a few chemicals missed by
the other taggers and cases that are partially matched by
the regex tagger e.g. Dess-Martin reagent
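As a rough illustration of the regex tagger idea, the sketch below tags a few keywords and units in a tokenised sentence. The tag names and patterns are hypothetical and chosen only for this example; they are not ChemicalTagger's actual tag set or rules.

```python
import re

# Hypothetical keyword patterns for illustration; not ChemicalTagger's real rules.
KEYWORD_PATTERNS = [
    ("YIELD", re.compile(r"^yield(s|ed|ing)?$", re.IGNORECASE)),
    ("VOLUME_UNIT", re.compile(r"^(mL|ml|L)$")),
    ("MASS_UNIT", re.compile(r"^(mg|g|kg)$")),
    ("AMOUNT_UNIT", re.compile(r"^(mmol|mol)$")),
]

def regex_tag(tokens):
    """Return (token, tag) pairs; tokens matching no pattern get the tag None."""
    tagged = []
    for token in tokens:
        tag = next(
            (name for name, pattern in KEYWORD_PATTERNS if pattern.match(token)),
            None,
        )
        tagged.append((token, tag))
    return tagged

tokens = "evaporated to yield the title compound ( 690 mg )".split()
tagged = regex_tag(tokens)
```

Tokens left with `None` would be handed to the other taggers (chemical name and part-of-speech taggers) in a fuller pipeline.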
15. Pyridine, pyridines and pyridine rings
Entity                                  Type
Pyridine                                Exact
The pyridine / Pyridine from step 1     DefiniteReference
Pyridines / A pyridine                  ChemicalClass
Pyridine ring / Pyridyl                 Fragment
18. Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate
To a solution of methyl 4-(chlorosulfonyl)benzoate (606
mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added
pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N
(540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred
at room temperature until all of the starting material was
consumed. The solvent was evaporated in vacuo and the
residue redissolved in ethyl acetate (10 ml), washed with
water (10 ml), saturated sodium hydrogen carbonate (10
ml), dried over sodium sulphate, filtered and evaporated to
yield the title compound as a white solid (690 mg, 1.8
mmol, 85%).
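To make the step of associating chemical entities with quantities concrete, here is a minimal sketch that pulls the parenthesised quantities out of the first sentence of this example. It is a deliberately crude regex approach and an assumption of this note, not the pipeline's method: it attaches each quantity group to the single whitespace-delimited token before it, so a multi-word name such as "methyl 4-(chlorosulfonyl)benzoate" is only partly captured. In the real system the entity boundaries come from the chemical entity recognisers.

```python
import re

# A token followed by a parenthesised group that starts with a digit.
GROUP = re.compile(r"(\S+) \((\d[^()]*)\)")
# A number followed by one of a few units seen in the example.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mmol|mg|ml|eq|g|%)")

def extract_quantities(text):
    """Map the token preceding each quantity group to its {unit: value} pairs."""
    result = {}
    for match in GROUP.finditer(text):
        token, group = match.groups()
        result[token] = {
            unit: float(value) for value, unit in QUANTITY.findall(group)
        }
    return result

sentence = (
    "To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) "
    "in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) "
    "and Et3N (540 mg, 5.4 mmol, 2.5 eq)"
)
quantities = extract_quantities(sentence)
```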
20. CML output
• Reaction SMILES
• Quantities including yield are extracted
• SMILES and InChIs for every structure resolvable reagent/product
• Entity is classified as an exact compound, definite reference, chemical class or polymer
<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
<dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
<productList>
<product role="product">
<molecule id="m0">
<name dictRef="nameDict:unknown">title compound</name>
</molecule>
<amount units="unit:mmol">1.8</amount>
<amount units="unit:mg">690</amount>
<amount units="unit:percentYield">85.0</amount>
<identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
<identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
<dl:entityType>definiteReference</dl:entityType>
<dl:state>solid</dl:state>
</product>
</productList>
<reactantList>
<reactant role="reactant" count="1">
<molecule id="m1">
<name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
</molecule>
<amount units="unit:mmol">2.1</amount>
<amount units="unit:mg">606</amount>
<amount units="unit:eq">1.0</amount>
<identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
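As a sketch of how such output might be consumed, the snippet below reads the name, amounts and SMILES from a single reactant element using Python's standard library. The fragment is modelled on the slide, but the closing tags were added here because the slide output is truncated, and the `summarise` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Trimmed, well-formed fragment modelled on the slide; the closing tags are
# an addition for this example because the slide's XML is cut off.
FRAGMENT = """
<reactant xmlns="http://www.xml-cml.org/schema" role="reactant" count="1">
  <molecule id="m1">
    <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
  </molecule>
  <amount units="unit:mmol">2.1</amount>
  <amount units="unit:mg">606</amount>
  <amount units="unit:eq">1.0</amount>
  <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
</reactant>
"""

def summarise(reactant_xml):
    """Collect the name, the amounts keyed by unit, and the SMILES string."""
    root = ET.fromstring(reactant_xml.strip())
    name = root.find("cml:molecule/cml:name", CML_NS).text
    amounts = {
        amount.get("units").split(":", 1)[1]: float(amount.text)
        for amount in root.findall("cml:amount", CML_NS)
    }
    smiles = next(
        identifier.get("value")
        for identifier in root.findall("cml:identifier", CML_NS)
        if identifier.get("dictRef") == "cml:smiles"
    )
    return {"name": name, "amounts": amounts, "smiles": smiles}

summary = summarise(FRAGMENT)
```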
21. Evaluation
• 2008-2011 USPTO patent applications classified as containing organic chemistry: 65,034 documents
• 484,259 atom-mapped reactions extracted
• Adding the additional requirements that all the identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds: 424,621 reactions
• 100 of these were selected for manual evaluation of quality
22. Reactions found
[Figure: number of patents (log scale, 1 to 100,000) with a given number of extracted reactions (0 to 1000)]
23. Results
• 96% correctly identified the primary starting material and product
whilst not misidentifying reagents that could be confused with the
starting material
• As compared to the 495 expected chemical entities there were 61 false
positives and 16 false negatives
• Only 4 of the 321 reagents (with quantities) did not have these
quantities recognised and associated with the reagent
• Association of quantities/yields with products was less successful: only 48
of the 74 cases where such data was present were handled
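The counts above can be turned into the usual summary rates. The sketch below assumes that the 495 expected entities are the true positives plus the false negatives; that reading is an assumption of this note rather than something stated on the slide.

```python
def precision_recall_f1(expected, false_positives, false_negatives):
    """Derive precision, recall and F1, assuming expected = TP + FN."""
    true_positives = expected - false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / expected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Chemical entity recognition: 495 expected, 61 false positives, 16 false negatives.
precision, recall, f1 = precision_recall_f1(495, 61, 16)

# Quantity association rates taken directly from the counts on the slide.
reagent_quantity_rate = (321 - 4) / 321
product_quantity_rate = 48 / 74
```

Under that assumption, entity recognition comes out at roughly 0.89 precision and 0.97 recall, while quantity association succeeds for about 99% of reagents but only about 65% of products.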
24. Use Cases
• Reaction searching
• Analysing trends in reactions over time
• Reaction outcome prediction
25. Example of reaction searching
C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)
6 reactions found in 5 patents
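To show how the query above is structured, here is a small sketch that splits a reaction SMILES string into its reactant, agent and product parts and lists the atom-map numbers on each side. It only illustrates the notation using string handling; it is not the SMARTS-based search used for the result above, and the helper names are hypothetical.

```python
import re

# A bracket atom carrying an atom-map number, e.g. [CH:1] or [CH2:2].
ATOM_MAP = re.compile(r"\[[^\]]*?:(\d+)\]")

def split_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into three lists of components."""
    reactants, agents, products = reaction_smiles.split(">")

    def components(part):
        return part.split(".") if part else []

    return components(reactants), components(agents), components(products)

def mapped_atoms(smiles_list):
    """Return the set of atom-map numbers used in the given components."""
    return {int(n) for smiles in smiles_list for n in ATOM_MAP.findall(smiles)}

query = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"
reactants, agents, products = split_reaction(query)
shared_maps = mapped_atoms(reactants) & mapped_atoms(products)
```

Here the two mapped alkene carbons appear on both sides, which is what ties the double bond in the reactant to the cyclopropane ring in the product.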
29. Known Limitations
• The first workup reagent is often erroneously classified as a
reactant
• Atom mapping produces mappings that are not necessarily
representative of reaction mechanism and occasionally
involve clearly incorrect atoms
• Conditions from analogous reactions are not resolved
• Temperature/time for reactions to occur are not captured
30. Conclusions
• 424,621 exact atom-mapped reactions were
extracted from 4 years of USPTO patent
applications
• Evaluation indicates the reactions to be of
generally good quality especially if the
misidentification of workup reagents as
reactants is not considered important
• All the code to extract reactions is open source:
https://bitbucket.org/dan2097/patent-reaction-extraction
31. Acknowledgements
Unilever Centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson
Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov
SMARTS searching: Roger Sayle
Boehringer Ingelheim for funding
Manual abstraction of the precise details of reactions from this many documents would be expensive.
How can one get access to patents? Google Patents offers all USPTO patents from 2001 onwards as XML, including images and ChemDraw files. Older patents are available with just the text back to 1976, with OCRed text back to 1920, and back to 1790 if one does the OCR oneself.
This problem can be broken down into several sub-problems.
Fortunately we don’t have to start from scratch: many open source toolkits exist to help with these tasks, such as OPSIN (name to structure), OSCAR4 (chemical entity recognition) and ChemicalTagger (tagging and parsing of experimental chemistry text).
This is what a typical experimental section from a patent looks like. We need to identify such sections, link the heading with the paragraphs and preferably distinguish synthesis reagents from workup reagents.
Headings/paragraphs can be extracted directly from the XML. The classifier uses the probabilities of words being present in an experimental chemistry section versus a standard paragraph. The language in experimental sections is quite repetitive, so this works well. In some cases a heading may not be annotated as such in the XML; this can be detected in many cases and the heading processed as if it were a discrete element.
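One plausible reading of a classifier based on word-presence probabilities is a Bernoulli naive Bayes model. The sketch below is written under that assumption, with equal class priors and a tiny made-up training set; it is not the project's actual classifier or data.

```python
import math
from collections import Counter

def train(paragraphs_by_label):
    """Count, per label, in how many paragraphs each word is present."""
    model = {}
    for label, paragraphs in paragraphs_by_label.items():
        doc_freq = Counter()
        for paragraph in paragraphs:
            doc_freq.update(set(paragraph.lower().split()))
        model[label] = (doc_freq, len(paragraphs))
    return model

def log_score(model, label, paragraph, vocabulary):
    """Sum log-probabilities of each vocabulary word being present or absent."""
    doc_freq, n_docs = model[label]
    words = set(paragraph.lower().split())
    score = 0.0
    for word in vocabulary:
        p_present = (doc_freq[word] + 1) / (n_docs + 2)  # Laplace smoothing
        score += math.log(p_present if word in words else 1 - p_present)
    return score

def classify(model, paragraph):
    """Pick the label with the highest score (equal priors assumed)."""
    vocabulary = set().union(*(doc_freq for doc_freq, _ in model.values()))
    return max(model, key=lambda label: log_score(model, label, paragraph, vocabulary))

# Toy training data invented for this example.
model = train({
    "experimental": [
        "the mixture was stirred at room temperature and the solvent was evaporated",
        "the residue was washed with water dried and filtered to yield a white solid",
    ],
    "standard": [
        "the present invention relates to compounds useful as inhibitors",
        "in another embodiment the invention provides a pharmaceutical composition",
    ],
})
label = classify(model, "the solution was stirred and evaporated to yield a solid")
```

Because experimental text reuses a small set of verbs and nouns (stirred, evaporated, yield, solid), even this simple model separates the two kinds of paragraph on the toy data.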
This work relies heavily on ChemicalTagger, and significant improvements have been made to ChemicalTagger as part of this project to improve its performance and the range of concepts recognised. Hence a description of the system would not be complete without also explaining what ChemicalTagger does.
For this project we also use the following taggers. These tags can then be parsed to yield….
Quantities have been recognised, marked up and associated with a molecule. Where certain key words are identified, phrases can be identified….
A few phrase types are identified directly by the grammar, e.g. a chemical in a chemical is a dissolve phrase.
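The "chemical in a chemical" rule can be pictured as a pattern over the tag sequence. The sketch below assumes tokens have already been grouped and tagged; the tag names (`MOLECULE`, `IN`) and the output dictionary are hypothetical and do not reproduce ChemicalTagger's grammar.

```python
def find_dissolve_phrases(tagged_tokens):
    """Find MOLECULE 'in' MOLECULE runs in a list of (token, tag) pairs."""
    phrases = []
    for i in range(len(tagged_tokens) - 2):
        (solute, tag1), (word, tag2), (solvent, tag3) = tagged_tokens[i:i + 3]
        is_in = tag2 == "IN" and word.lower() == "in"
        if tag1 == "MOLECULE" and is_in and tag3 == "MOLECULE":
            phrases.append({"type": "Dissolve", "solute": solute, "solvent": solvent})
    return phrases

# Hand-tagged tokens for illustration; molecule names are treated as one token.
tagged = [
    ("methyl 4-(chlorosulfonyl)benzoate", "MOLECULE"),
    ("in", "IN"),
    ("DCM", "MOLECULE"),
    ("was", "VBD"),
    ("added", "VBN"),
    ("pentafluorophenol", "MOLECULE"),
]
phrases = find_dissolve_phrases(tagged)
```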
Will be associated with the identified compound. As you can see a compound doesn’t have to contain a chemical entity. (title compound as a white solid)
Uses a combination of textual clues and OPSIN’s classification
Phrases can be classified into workup by phrase type, e.g. extraction, purification. As the yielded compound and characterisation are often conjoined, rather than explicitly identifying the workup, compounds commonly associated with characterisation are marked up as false positives by regexes. A single paragraph may have multiple blocks of synthesis and workup. Structure-aware role assignment involves things like heuristically assigning known solvents and catalysts their roles, e.g. using lists of known solvents/catalysts and properties such as containing a transition metal.
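A minimal sketch of this kind of heuristic role assignment is shown below. The solvent list, the transition metal list, the SMILES substring test and the rule order are all assumptions made for illustration; the real system's lists and rules are not given in these notes.

```python
# Hypothetical lookup lists; the actual lists are not reproduced here.
KNOWN_SOLVENTS = {"dcm", "dichloromethane", "ethyl acetate", "thf", "water"}
TRANSITION_METALS = ("Pd", "Pt", "Rh", "Ru", "Ni", "Cu")

def assign_role(name, smiles, in_workup_phrase):
    """Assign a coarse role from phrase type, name lookup and structure."""
    if in_workup_phrase:
        return "workup"
    if name.lower() in KNOWN_SOLVENTS:
        return "solvent"
    # Crude structure check: a bracket atom for a listed transition metal.
    if any(f"[{metal}" in smiles for metal in TRANSITION_METALS):
        return "catalyst"
    return "reactant"

roles = {
    "DCM": assign_role("DCM", "ClCCl", False),
    "ethyl acetate": assign_role("ethyl acetate", "CCOC(C)=O", True),
    "palladium": assign_role("palladium", "[Pd]", False),
    "pentafluorophenol": assign_role(
        "pentafluorophenol", "Oc1c(F)c(F)c(F)c(F)c1F", False
    ),
}
```

Checking the phrase type first means that a known solvent used in an extraction step is still labelled as workup.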
Perform a sanity check on the reaction, e.g. that it has a product and at least 2 reagents. Attempt to find a mapping where all product atoms can be accounted for.
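The sanity check can be sketched as below. The element-count comparison is only a necessary pre-condition for a full mapping (the product cannot contain more atoms of an element than the reagents supply); it is not an atom-atom mapping and is an assumption of this note, not the project's algorithm. The product formula is taken from the InChI in the CML example; the reagent formulas were written out by hand for this illustration.

```python
import re
from collections import Counter

def element_counts(formula):
    """Parse a simple molecular formula such as 'C8H7ClO4S' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def passes_sanity_check(reagent_formulas, product_formulas):
    """Require a product, at least 2 reagents, and enough atoms of each element."""
    if len(product_formulas) < 1 or len(reagent_formulas) < 2:
        return False
    available = sum((element_counts(f) for f in reagent_formulas), Counter())
    needed = sum((element_counts(f) for f in product_formulas), Counter())
    return all(available[element] >= n for element, n in needed.items())

product = ["C14H7F5O5S"]                # from the InChI in the CML output
reagents = ["C8H7ClO4S", "C6HF5O"]      # sulfonyl chloride and pentafluorophenol
wrong_reagents = ["C8H7ClO4S", "C6H15N"]  # sulfonyl chloride and triethylamine

ok = passes_sanity_check(reagents, product)
not_ok = passes_sanity_check(wrong_reagents, product)
```

The second case fails because no reagent supplies fluorine, so no mapping could account for every product atom.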
Here is an example of an experimental section
Occasionally the system identifies as a reactant a compound that was mentioned only because the current reaction is performed in an analogous way to the reaction that produced it. False positives arise from workup reagents being classified as reactants and from clear errors. Product information is often not explicitly associated with the product.
A search for the Simmons–Smith reaction, converting a terminal allyl group to a cyclopropane group, found 6 hits in 5 patents.
It should be noted that nowhere in this text, or indeed in the whole patent, is the name of the reaction mentioned; this is quite common.
675 chemical entities had over 10 lexical variants
Top 10
This is due to the text typically just saying that the substance is added without further specification of its purpose