Automated Extraction of Reactions from the Patent Literature

We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full text chemical literature. OSCAR4 is used to recognise chemical entities and resolve to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. Chemical Tagger performs part of speech tagging allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature including journals and theses.
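
As a rough illustration of the atom-accounting step described above, the sketch below checks a necessary condition for a complete mapping: every non-hydrogen atom in the product must be available in the reactants. This is not the pipeline's atom-mapping code, which the abstract does not describe. The function names are hypothetical, and the molecular formulae were worked out by hand from the structures in the sulfonyl chloride example on slides 18 and 20.

```python
import re
from collections import Counter
from typing import Iterable


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C8H7ClO4S' (no brackets or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def unaccounted_heavy_atoms(product: str, reactants: Iterable[str]) -> Counter:
    """Return the product's non-hydrogen atoms that no reactant can supply."""
    available = Counter()
    for reactant in reactants:
        available += parse_formula(reactant)
    needed = parse_formula(product)
    needed.pop("H", None)  # hydrogens are ignored in this simplified check
    # Counter subtraction keeps only elements with a positive shortfall.
    return needed - available


# Hand-derived formulae (assumptions for illustration):
#   methyl 4-(chlorosulfonyl)benzoate  C8H7ClO4S
#   pentafluorophenol                  C6HF5O
#   product (from the InChI on slide 20) C14H7F5O5S
missing = unaccounted_heavy_atoms("C14H7F5O5S", ["C8H7ClO4S", "C6HF5O"])
print(dict(missing))
```

An empty result only shows that a full mapping is possible in principle; finding which reactant atom corresponds to which product atom is a separate and harder problem.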

Published in Technology
  • Manual abstraction of the precise details of reactions from this many documents would be expensive.
  • How can one get access to patents? Google Patents offers all USPTO patents from 2001 onwards as XML, including images and ChemDraw files. Older patents are available as text only back to 1976, as OCRed text back to 1920, and back to 1790 if one performs the OCR oneself.
  • This problem can be broken down into several sub-problems.
  • Fortunately we don’t have to start from scratch: many open source toolkits exist to help with these tasks, including OPSIN (name to structure), OSCAR4 (chemical entity recognition) and ChemicalTagger (tagging and parsing of experimental chemistry text).
  • This is what a typical experimental section from a patent looks like. We need to identify such sections, link the heading with the paragraphs and preferably distinguish synthesis reagents from workup reagents.
  • Headings and paragraphs can be extracted directly from the XML. The classifier uses the probabilities of words being present in an experimental chemistry section versus a standard paragraph. The language in experimental sections is quite repetitive, so this works well. In some cases a heading may not be annotated as such in the XML; this can often be detected, and the heading is then processed as if it were a discrete element.
  • This work relies heavily on ChemicalTagger, and significant improvements have been made to ChemicalTagger as part of this project to improve its performance and the range of concepts recognised. Hence a description of the system would not be complete without also explaining what ChemicalTagger does.
  • For this project we also use the following taggers. These tags can then be parsed to yield….
  • Quantities have been recognised, marked up and associated with a molecule. Where certain keywords are found, phrases can be identified….
  • A few phrase types are identified directly by the grammar, e.g. “a chemical in a chemical” is a dissolve phrase.
  • Quantities will be associated with the identified compound. As you can see, a compound doesn’t have to contain a chemical entity (e.g. “title compound as a white solid”).
  • Uses a combination of textual clues and OPSIN’s classification
  • Phrases can be classified as workup by phrase type, e.g. extraction, purification. Because the yielded compound and its characterisation are often conjoined, rather than identifying these explicitly, compounds commonly associated with characterisation are marked as false positives by regexes. A single paragraph may have multiple blocks of synthesis and workup. Structure-aware role assignment involves heuristically assigning roles such as solvent and catalyst, e.g. using lists of known solvents and catalysts and properties such as containing a transition metal.
  • Perform a sanity check on the reaction, e.g. that it has a product and at least 2 reagents. Attempt to find a mapping where all product atoms can be accounted for.
  • Here is an example of an experimental section
  • Occasionally the system identifies a compound as a reactant when it was mentioned only because the current reaction is performed analogously to the reaction that produced that compound. False positives arise from workup reagents being classified as reactants and from clear errors. Product information is often not explicitly associated with the product.
  • Simmons–Smith reaction for conversion of a terminal allyl group to a cyclopropane group found 6 hits in 5 patents.
  • It should be noted that nowhere in this text, or indeed in the whole patent, is the name of the reaction mentioned; this is quite common.
  • 675 chemical entities had over 10 lexical variants
  • Top 10
  • This is due to the text typically just saying that the substance is added without further specification of its purpose
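
The note on identifying experimental sections describes a classifier driven by how probable each word is in an experimental chemistry section versus a standard paragraph. A minimal sketch of that kind of classifier follows, assuming a multinomial naive Bayes formulation with add-one smoothing (the slides do not say which model or smoothing is actually used). The class name and the toy training paragraphs are invented for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case word and number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class WordProbabilityClassifier:
    """Toy naive Bayes classifier: experimental vs standard paragraphs."""

    def __init__(self, experimental, standard):
        self.counts = {"experimental": Counter(), "standard": Counter()}
        self.docs = {"experimental": len(experimental), "standard": len(standard)}
        for label, paragraphs in (("experimental", experimental), ("standard", standard)):
            for paragraph in paragraphs:
                self.counts[label].update(tokens(paragraph))
        self.vocab = set(self.counts["experimental"]) | set(self.counts["standard"])

    def log_score(self, text, label):
        # Log prior plus the sum of smoothed log word probabilities.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        total_words = sum(self.counts[label].values())
        for word in tokens(text):
            score += math.log((self.counts[label][word] + 1) / (total_words + len(self.vocab)))
        return score

    def is_experimental(self, text):
        return self.log_score(text, "experimental") > self.log_score(text, "standard")


# Invented training paragraphs; a real model would be trained on many patents.
clf = WordProbabilityClassifier(
    experimental=[
        "The mixture was stirred at room temperature and the solvent was evaporated in vacuo.",
        "The residue was washed with water, dried over sodium sulphate and filtered to yield a white solid.",
    ],
    standard=[
        "The present invention relates to compounds useful in the treatment of disease.",
        "In another embodiment the invention provides a pharmaceutical composition.",
    ],
)
print(clf.is_experimental("The solution was stirred, washed with water and evaporated to yield the title compound."))
```

The repetitive vocabulary of experimental text ("stirred", "washed", "evaporated") is what lets even a simple word-count model separate the two kinds of paragraph.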

Transcript

  • 1. Automated Extraction of Reactions from the Patent Literature. Daniel Lowe, Unilever Centre for Molecular Science Informatics, University of Cambridge 1
  • 2. Chemistry patent applications • 100,000s of applications each year [Chart: chemistry patent applications per year, 2000 to 2009. Source: World Intellectual Property Indicators, 2011 edition] 2
  • 3. 3
  • 4. The idea: XML patents → Reaction Extraction System → Extracted Reactions 4
  • 5. Steps involved • Identifying experimental sections • Identifying chemical entities • Chemical name to structure conversion • Associating chemical entities with quantities • Assigning chemical roles • Atom-atom mapping 5
  • 6. Building on existing projects 6
  • 7. Archetypal experimental section. Labels: Section heading; Section target compound; Step identifier; Step target compound; Paragraph number; Synthesis; Workup; Characterisation 7
  • 8. Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40. 8
  • 9. ChemicalTagger • Tags words of text • Parses tags to identify phrases • Generates XML parse tree – http://chemicaltagger.ch.cam.ac.uk/ – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminf 2011, 3, 17. 9
  • 10. Tagging • Regex tagger: tags keywords e.g. “yield”, “mL” • OSCAR4 tagger: finds names OSCAR4 believes to be chemical e.g. “2-methylpyridine” • OpenNLP: tags parts of speech. Additional taggers: • OPSIN tagger: finds names OPSIN can parse • Trivial chemical name tagger: tags a few chemicals missed by the other taggers and cases that are partially matched by the regex tagger e.g. Dess-Martin reagent 10
  • 11. Sample ChemicalTagger Output <MOLECULE> <OSCARCM> <OSCAR-CM>methyl</OSCAR-CM> <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM> </OSCARCM> <QUANTITY> <_-LRB->(</_-LRB-> <MASS> <CD>606</CD> <NN-MASS>mg</NN-MASS> </MASS> <COMMA>,</COMMA> <AMOUNT> <CD>2.1</CD> <NN-AMOUNT>mmol</NN-AMOUNT> </AMOUNT> <COMMA>,</COMMA> <EQUIVALENT> <CD>1</CD> <NN-EQ>eq</NN-EQ> </EQUIVALENT> <_-RRB->)</_-RRB-> </QUANTITY> </MOLECULE> 11
  • 12. Phrase Identification 12
  • 13. Quantity Identification 13
  • 14. Section/Step Parsing 14
  • 15. Pyridine, pyridines and pyridine rings. Entity → Type: Pyridine → Exact; The pyridine / Pyridine from step 1 → DefiniteReference; Pyridines / A pyridine → ChemicalClass; Pyridine ring / Pyridyl → Fragment 15
  • 16. Section/Step Parsing • Workup phrase types: Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench 16
  • 17. Atom-mapping 17
  • 18. Example: Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate. To a solution of methyl 4-(chlorosulfonyl)benzoate (606mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%). 18
  • 19. Graphical Output 19
  • 20. CML output. Slide callouts: Reaction SMILES; Quantities including yield are extracted; SMILES and InChIs for every structure-resolvable reagent/product; Entity is classified as an exact compound, definite reference, chemical class or polymer. XML (truncated on the slide): <reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-.. <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([.. <productList> <product role="product"> <molecule id="m0"> <name dictRef="nameDict:unknown">title compound</name> </molecule> <amount units="unit:mmol">1.8</amount> <amount units="unit:mg">690</amount> <amount units="unit:percentYield">85.0</amount> <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/> <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H.. <dl:entityType>definiteReference</dl:entityType> <dl:state>solid</dl:state> </product> </productList> <reactantList> <reactant role="reactant" count="1"> <molecule id="m1"> <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name> </molecule> <amount units="unit:mmol">2.1</amount> <amount units="unit:mg">606</amount> <amount units="unit:eq">1.0</amount> <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/> 20
  • 21. Evaluation • 2008-2011 USPTO patent applications classified as containing organic chemistry → 65,034 documents • 484,259 atom-mapped reactions extracted • Adding the additional requirements that all the identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds → 424,621 reactions • 100 of these were selected for manual evaluation of quality 21
  • 22. Reactions found [Chart: patents with given number of reactions (logarithmic axis, 1 to 100,000) against number of extracted reactions (0 to 1000)] 22
  • 23. Results • 96% correctly identified the primary starting material and product whilst not misidentifying reagents that could be confused with the starting material • As compared to the 495 expected chemical entities there were 61 false positives and 16 false negatives • Only 4 of the 321 reagents (with quantities) did not have these quantities recognised and associated with the reagent • Association of quantities/yields with products was less successful: 48 out of the 74 cases where such data was present were handled 23
  • 24. Use Cases • Reaction searching • Analysing trends in reactions over time • Reaction outcome prediction 24
  • 25. Example of reaction searching: C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1) • 6 reactions found in 5 patents 25
  • 26. Name I20110224.tar US20110046406A1-20110224.ZIP 0066 Text from US 2011/0046406 A1 26
  • 27. Most lexical variants (675 chemicals had over 10 lexical variants!): 1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride; EDCI hydrochloride; 1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride; N-ethyl-N-(3-dimethylamino-propyl)-carbodiimide hydrochloride; N-[3-(Dimethylamino) propyl]-N-ethylcarbodiimide hydrochloride; 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl; N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride; N-(3-dimethylaminopropyl)-N-ethylcarbodiimide hydrochloride; 1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride; 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl; 1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride; 1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride; N-(3-Dimethylamino-1-propyl)-N-ethylcarbodiimide hydrochloride; 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride; 1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride. And 127 more! 27
  • 28. Most common solvents 28
  • 29. Known Limitations • The first workup reagent is often erroneously classified as a reactant • Atom mapping produces mappings that are not necessarily representative of reaction mechanism and occasionally involve clearly incorrect atoms • Conditions from analogous reactions are not resolved • Temperature/time for reactions to occur not captured 29
  • 30. Conclusions • 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications • Evaluation indicates the reactions to be of generally good quality, especially if the misidentification of workup reagents as reactants is not considered important • All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction 30
  • 31. Acknowledgements • Unilever Centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson • Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov • SMARTS searching: Roger Sayle • Boehringer Ingelheim for funding 31
  • 32. Any Questions? Email: daniel@nextmovesoftware.com 32
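
The sample ChemicalTagger output on slide 11 nests each quantity (mass, amount, equivalents) inside the molecule it belongs to. A minimal sketch of reading that structure with Python's standard library is shown below. The XML is copied from the slide; the function name `molecule_summary` and the returned dictionary layout are assumptions for illustration, not part of ChemicalTagger or the extraction pipeline.

```python
import xml.etree.ElementTree as ET

# XML as shown on slide 11 (whitespace added for readability).
SAMPLE = """
<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>
"""


def molecule_summary(xml_text):
    """Return (chemical name, {quantity tag: (value, unit)}) for one MOLECULE."""
    root = ET.fromstring(xml_text)
    # The chemical name is split across several OSCAR-CM tokens.
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    for child in root.find("QUANTITY"):
        number = child.find("CD")
        if number is None:  # brackets and commas carry no value
            continue
        unit = [c for c in child if c.tag.startswith("NN-")][0].text
        quantities[child.tag] = (float(number.text), unit)
    return name, quantities


name, quantities = molecule_summary(SAMPLE)
print(name, quantities)
```

Because the quantities sit inside the MOLECULE element, associating a mass or amount with its compound reduces to walking the parse tree rather than guessing from word order.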