Automated Extraction of Reactions from the Patent Literature


We have created a pipeline of recently enhanced open source components for extracting chemical reactions from full-text chemical literature. OSCAR4 is used to recognise chemical entities and resolve them to structures where appropriate. OPSIN is used to resolve systematic chemical names to structures. ChemicalTagger performs part-of-speech tagging, allowing the interpretation of phrases in chemical syntheses. The final output is a semantic representation (chemical components and their roles, reaction conditions, actions including workup, yield and properties of the product). We then attempt to map all atoms in the product(s) to reactants. If successful, we also attempt to calculate the stoichiometry of the reaction. The system has been deployed on over 56,000 USPTO patents published since 2008. The level of recall is useful and most extracted reactions make chemical sense. The pipeline is generally applicable to reactions in chemical literature, including journals and theses.

Published in: Technology

  1. Automated Extraction of Reactions from the Patent Literature
     Daniel Lowe, Unilever Centre for Molecular Science Informatics, University of Cambridge
  2. Chemistry patent applications
     • 100,000s of applications each year
     [Chart: chemistry patent applications per year, 2000 to 2009, y-axis 0 to 400,000. Source: World Intellectual Property Indicators, 2011 edition]
  4. The idea
     XML patents → Reaction Extraction System → Extracted Reactions
  5. Steps involved
     • Identifying experimental sections
     • Identifying chemical entities
     • Chemical name to structure conversion
     • Associating chemical entities with quantities
     • Assigning chemical roles
     • Atom-atom mapping
  6. Building on existing projects
  7. Archetypal experimental section
     Labelled parts: section heading, section target compound, step identifier, step target compound, paragraph number, synthesis, workup, characterisation
  8. Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40.
  9. ChemicalTagger
     • Tags words of text
     • Parses tags to identify phrases
     • Generates an XML parse tree
       – http://chemicaltagger.ch.cam.ac.uk/
       – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminf 2011, 3, 17.
  10. Tagging
     • Regex tagger: tags keywords, e.g. “yield”, “mL”
     • OSCAR4 tagger: finds names OSCAR4 believes to be chemical, e.g. “2-methylpyridine”
     • OpenNLP: tags parts of speech
     Additional taggers:
     • OPSIN tagger: finds names OPSIN can parse
     • Trivial chemical name tagger: tags a few chemicals missed by the other taggers, and cases that are only partially matched by the regex tagger, e.g. Dess-Martin reagent
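The keyword-tagging idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not ChemicalTagger's actual rule set: the pattern list and the function name `regex_tag` are hypothetical, and only the tag names `CD`, `NN-MASS` and `NN-AMOUNT` are taken from the sample output shown on the next slide.

```python
import re

# Hypothetical keyword patterns for illustration only; the real rule set
# used by ChemicalTagger is far larger. NN-YIELD and NN-VOL are assumed names.
TAG_PATTERNS = [
    ("NN-YIELD", re.compile(r"^yield(s|ed)?$", re.IGNORECASE)),
    ("NN-VOL", re.compile(r"^(mL|ml|L)$")),
    ("NN-MASS", re.compile(r"^(mg|g|kg)$")),
    ("NN-AMOUNT", re.compile(r"^(mmol|mol)$")),
    ("CD", re.compile(r"^\d+(\.\d+)?$")),
]


def regex_tag(tokens):
    """Return (token, tag) pairs; tokens that match no pattern get None."""
    tagged = []
    for token in tokens:
        tag = next((name for name, pat in TAG_PATTERNS if pat.match(token)), None)
        tagged.append((token, tag))
    return tagged


print(regex_tag(["606", "mg", "of", "DCM"]))
```

Tokens left untagged here (such as "DCM") are the ones the chemical-name and part-of-speech taggers would be expected to handle.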
  11. Sample ChemicalTagger Output
     <MOLECULE>
       <OSCARCM>
         <OSCAR-CM>methyl</OSCAR-CM>
         <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
       </OSCARCM>
       <QUANTITY>
         <_-LRB->(</_-LRB->
         <MASS> <CD>606</CD> <NN-MASS>mg</NN-MASS> </MASS>
         <COMMA>,</COMMA>
         <AMOUNT> <CD>2.1</CD> <NN-AMOUNT>mmol</NN-AMOUNT> </AMOUNT>
         <COMMA>,</COMMA>
         <EQUIVALENT> <CD>1</CD> <NN-EQ>eq</NN-EQ> </EQUIVALENT>
         <_-RRB->)</_-RRB->
       </QUANTITY>
     </MOLECULE>
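A parse tree of this shape can be read with Python's standard `xml.etree.ElementTree`. The sketch below assumes the sample above is the complete document; the function name `read_molecule` and the returned dictionary layout are illustrative choices, not part of ChemicalTagger.

```python
import xml.etree.ElementTree as ET

# The sample parse tree from the slide, reproduced as a string.
SAMPLE = """<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>"""


def read_molecule(xml_text):
    """Return the chemical name and a {kind: (value, unit)} dict of quantities."""
    root = ET.fromstring(xml_text)
    # The name is split over several OSCAR-CM tokens; join them with spaces.
    name = " ".join(el.text for el in root.iter("OSCAR-CM"))
    quantities = {}
    for kind in ("MASS", "AMOUNT", "EQUIVALENT"):
        node = root.find(f"QUANTITY/{kind}")
        if node is not None:
            value = float(node.find("CD").text)
            unit = node[1].text  # second child element holds the unit token
            quantities[kind] = (value, unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
print(name, quantities)
```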
  12. Phrase Identification
  13. Quantity Identification
  14. Section/Step Parsing
  15. Pyridine, pyridines and pyridine rings
     Entity: Pyridine | The pyridine / Pyridine from step 1 | Pyridines / A pyridine | Pyridine ring / Pyridyl
     Type:   Exact    | DefiniteReference                   | ChemicalClass          | Fragment
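The four entity types can be illustrated with a toy classifier. The surface-form rules below are a deliberately naive assumption made for this example (the slides do not describe how the real system decides), and `classify_entity` is a hypothetical name. The sketch only reproduces the seven phrases in the table above and expects a non-empty phrase.

```python
import re


def classify_entity(phrase):
    """Toy heuristic mapping a chemical phrase to one of the four entity types.

    Assumption: these rules are illustrative only and are not the rules
    used by the reaction extraction system.
    """
    text = phrase.strip().lower()
    words = text.split()
    # Substructure words, or radical names ending in "yl", suggest a fragment.
    if words[-1] in {"ring", "group", "moiety"} or words[-1].endswith("yl"):
        return "Fragment"
    # A definite article or a back-reference points at a specific compound.
    if words[0] == "the" or re.search(r"\bfrom (step|example) \d+", text):
        return "DefiniteReference"
    # An indefinite article or a plural suggests a class of compounds.
    if words[0] in {"a", "an"} or words[-1].endswith("s"):
        return "ChemicalClass"
    return "Exact"


for phrase in ["Pyridine", "The pyridine", "Pyridine from step 1",
               "Pyridines", "A pyridine", "Pyridine ring", "Pyridyl"]:
    print(phrase, "->", classify_entity(phrase))
```

Real names break these rules quickly (an exact name such as "benzyl" ends in "yl", for instance), which is why the distinction is worth making explicit in the output.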
  16. Section/Step Parsing
     Workup phrase types: Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench
  17. Atom-mapping
  18. Example
     Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate
     To a solution of methyl 4-(chlorosulfonyl)benzoate (606mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
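As a worked check on this example, the parenthesised quantities can be pulled out with a regular expression and the yield recomputed from the molar amounts. This is a sketch under stated assumptions: the `QUANTITY` pattern and both function names are hypothetical, and the 1 eq starting material is assumed to be the limiting reagent.

```python
import re

# Matches a number followed by one of a few units seen in the example text.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mmol|eq|ml|%)")


def parse_quantities(parenthetical):
    """Map unit -> value for a string such as '(606mg, 2.1 mmol, 1 eq)'."""
    return {unit: float(value) for value, unit in QUANTITY.findall(parenthetical)}


def percent_yield(product_mmol, limiting_mmol):
    """Molar yield as a percentage of the limiting reagent."""
    return 100.0 * product_mmol / limiting_mmol


start = parse_quantities("(606mg, 2.1 mmol, 1 eq)")
product = parse_quantities("(690 mg, 1.8 mmol, 85%)")
calculated = percent_yield(product["mmol"], start["mmol"])
print(f"calculated {calculated:.1f}%, reported {product['%']:.0f}%")
```

The recomputed value (1.8 / 2.1, about 85.7%) agrees with the reported 85% to within the rounding of the stated amounts.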
  19. Graphical Output
  20. CML output
     Annotations on the slide:
     • Reaction SMILES
     • Quantities including yield are extracted
     • SMILES and InChIs for every structure-resolvable reagent/product
     • Entity is classified as an exact compound, definite reference, chemical class or polymer
     <reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
       <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
       <productList>
         <product role="product">
           <molecule id="m0">
             <name dictRef="nameDict:unknown">title compound</name>
           </molecule>
           <amount units="unit:mmol">1.8</amount>
           <amount units="unit:mg">690</amount>
           <amount units="unit:percentYield">85.0</amount>
           <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
           <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
           <dl:entityType>definiteReference</dl:entityType>
           <dl:state>solid</dl:state>
         </product>
       </productList>
       <reactantList>
         <reactant role="reactant" count="1">
           <molecule id="m1">
             <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
           </molecule>
           <amount units="unit:mmol">2.1</amount>
           <amount units="unit:mg">606</amount>
           <amount units="unit:eq">1.0</amount>
           <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
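Reading such output back is mostly a matter of handling the CML default namespace. The sketch below uses a trimmed, self-contained fragment modelled on the slide: the `dl:` elements are left out because their namespace URI is cut off in the source, and the function name `product_summary` is a hypothetical choice.

```python
import xml.etree.ElementTree as ET

CML = "{http://www.xml-cml.org/schema}"

# Trimmed fragment modelled on the slide; not the tool's full output.
FRAGMENT = """<reaction xmlns="http://www.xml-cml.org/schema">
  <productList>
    <product role="product">
      <molecule id="m0"><name>title compound</name></molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles"
                  value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
    </product>
  </productList>
</reaction>"""


def product_summary(xml_text):
    """Return ({unit: value}, smiles) for the first product in the reaction."""
    root = ET.fromstring(xml_text)
    product = root.find(f"{CML}productList/{CML}product")
    amounts = {
        amount.get("units").split(":", 1)[1]: float(amount.text)
        for amount in product.findall(f"{CML}amount")
    }
    smiles = next(
        ident.get("value")
        for ident in product.findall(f"{CML}identifier")
        if ident.get("dictRef") == "cml:smiles"
    )
    return amounts, smiles


amounts, smiles = product_summary(FRAGMENT)
print(amounts, smiles)
```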
  21. Evaluation
     • 2008-2011 USPTO patent applications classified as containing organic chemistry → 65,034 documents
     • 484,259 atom-mapped reactions extracted
     • Adding the requirements that all identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds → 424,621 reactions
     • 100 of these were selected for manual evaluation of quality
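The counts on this slide imply a few ratios that are quick to compute. The variable names below are illustrative; the three input numbers are the ones quoted on the slide.

```python
documents = 65_034           # USPTO applications classified as organic chemistry
atom_mapped = 484_259        # atom-mapped reactions extracted
exact_and_resolved = 424_621  # after the stricter structure/exact-compound filter

retained = exact_and_resolved / atom_mapped  # fraction surviving the filter
dropped = atom_mapped - exact_and_resolved   # reactions removed by the filter
per_document = atom_mapped / documents       # mean atom-mapped reactions per document

print(f"retained {retained:.1%}, dropped {dropped:,}, "
      f"{per_document:.2f} reactions per document")
```

The mean per document hides a very skewed distribution, as the histogram on the following slide suggests.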
  22. Reactions found
     [Histogram: patents with a given number of reactions (log scale, 1 to 100,000) against number of extracted reactions (0 to 1000)]
  23. Results
     • 96% correctly identified the primary starting material and product whilst not misidentifying reagents that could be confused with the starting material
     • Compared with the 495 expected chemical entities, there were 61 false positives and 16 false negatives
     • Only 4 of the 321 reagents (with quantities) did not have these quantities recognised and associated with the reagent
     • Association of quantities/yields with products was less successful: 48 of the 74 cases where such data were present were handled
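These counts can be turned into precision and recall. The sketch assumes that the 495 expected entities are the sum of true positives and false negatives, so that 495 - 16 = 479 entities were found correctly; the slide does not state this explicitly, and the variable names are illustrative.

```python
expected = 495        # chemical entities a human expected to be found
false_positives = 61
false_negatives = 16

# Assumption: expected = true positives + false negatives.
true_positives = expected - false_negatives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / expected
f1 = 2 * precision * recall / (precision + recall)

reagent_quantity_rate = (321 - 4) / 321  # reagent quantities associated
product_quantity_rate = 48 / 74          # product quantities/yields associated

print(f"precision {precision:.3f}, recall {recall:.3f}, F1 {f1:.3f}")
print(f"reagent quantities {reagent_quantity_rate:.1%}, "
      f"product quantities {product_quantity_rate:.1%}")
```

Under that assumption, entity recall (about 0.97) is higher than precision (about 0.89), and quantity association is far more reliable for reagents (about 99%) than for products (about 65%).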
  24. Use Cases
     • Reaction searching
     • Analysing trends in reactions over time
     • Reaction outcome prediction
  25. Example of reaction searching
     Query: C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)
     6 reactions found in 5 patents
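The query on this slide is written in reaction notation, with reactants and products separated by `>>` and atom-map numbers after a colon inside square brackets. The pure-Python sketch below only splits the string and lists the map numbers; it does not perform the substructure search itself, and the function names are hypothetical.

```python
import re

# An atom-map number is the integer after ':' inside a bracket atom, e.g. [CH:1].
ATOM_MAP = re.compile(r"\[[^\]]*:(\d+)\]")


def parse_reaction(rxn):
    """Split 'reactants>agents>products' into lists of component strings."""
    reactants, agents, products = rxn.split(">")

    def components(side):
        return [part for part in side.split(".") if part]

    return {
        "reactants": components(reactants),
        "agents": components(agents),
        "products": components(products),
    }


def mapped_atoms(components):
    """Sorted atom-map numbers found in a list of component strings."""
    return sorted(int(n) for part in components for n in ATOM_MAP.findall(part))


query = parse_reaction("C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)")
print(query)
print(mapped_atoms(query["reactants"]), mapped_atoms(query["products"]))
```

Here both sides carry map numbers 1 and 2, tying the two alkene carbons to two of the three ring atoms in the cyclopropane product; the unmapped `[CH2]` would then come from the second reactant, `ICI`, which reads as diiodomethane.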
  26. Name: I20110224.tar US20110046406A1-20110224.ZIP 0066
     Text from US 2011/0046406 A1
  27. Most lexical variants
     • 1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
     • EDCI hydrochloride
     • 1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
     • N-ethyl-N-(3-dimethylamino-propyl)-carbodiimide hydrochloride
     • N-[3-(Dimethylamino) propyl]-N-ethylcarbodiimide hydrochloride
     • 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
     • N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
     • N-(3-dimethylaminopropyl)-N-ethylcarbodiimide hydrochloride
     • 1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
     • 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
     • 1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
     • 1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
     • N-(3-Dimethylamino-1-propyl)-N-ethylcarbodiimide hydrochloride
     • 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
     • 1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride
     And 127 more!
     675 chemicals had over 10 lexical variants!
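Counting lexical variants amounts to grouping names by the structure they resolve to. In the sketch below the names come from the slide (plus "DCM" from the earlier example), but the structure keys are hypothetical placeholders for whatever canonical identifier name-to-structure conversion returns (an InChI, for instance), and the function names are illustrative.

```python
from collections import defaultdict

# (name, structure key) pairs. The keys "structure-A" and "structure-B" are
# hypothetical stand-ins for a canonical structure identifier.
RESOLVED = [
    ("EDCI hydrochloride", "structure-A"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "structure-A"),
    ("1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride", "structure-A"),
    ("DCM", "structure-B"),
]


def variants_by_structure(pairs):
    """Group the distinct names seen for each structure key."""
    groups = defaultdict(set)
    for name, key in pairs:
        groups[key].add(name)
    return groups


def with_more_than(groups, n):
    """Keep only structures that have more than n distinct names."""
    return {key: names for key, names in groups.items() if len(names) > n}


groups = variants_by_structure(RESOLVED)
print({key: len(names) for key, names in groups.items()})
print(sorted(with_more_than(groups, 2)))
```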
  28. Most common solvents
  29. Known Limitations
     • The first workup reagent is often erroneously classified as a reactant
     • Atom mapping produces mappings that are not necessarily representative of the reaction mechanism and occasionally involve clearly incorrect atoms
     • Conditions from analogous reactions are not resolved
     • The temperature and time of reactions are not captured
  30. Conclusions
     • 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications
     • Evaluation indicates the reactions to be of generally good quality, especially if the misidentification of workup reagents as reactants is not considered important
     • All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction
  31. Acknowledgements
     • Unilever centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson
     • Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov
     • SMARTS searching: Roger Sayle
     • Boehringer Ingelheim for funding
  32. Any Questions?
     Email: daniel@nextmovesoftware.com
