Automated Extraction of Reactions from the Patent Literature

Daniel Lowe
Unilever Centre for Molecular Science Informatics
University of Cambridge

                                                         1
Chemistry patent applications
• 100,000s of applications each year
[Chart: chemistry patent applications per year, 2000 to 2009; vertical axis from 0 to 400,000]
Source: World Intellectual Property Indicators, 2011 edition

                                                                                                                                               2
3
The idea
XML patents -> Reaction Extraction System -> Extracted Reactions

                      4
Steps involved
•   Identifying experimental sections
•   Identifying chemical entities
•   Chemical name to structure conversion
•   Associating chemical entities with quantities
•   Assigning chemical roles
•   Atom-atom mapping


                                                    5
Building on existing projects




                                6
Archetypal experimental section
[Figure: annotated example of an experimental section, with these parts labelled]
• Section heading
• Section target compound
• Step identifier
• Step target compound
• Paragraph number
• Synthesis
• Workup
• Characterisation




                                               7
Jessop, D. M.; Adams, S. E.; Murray-Rust, P.
Mining Chemical Information from Open
Patents. Journal of Cheminformatics 2011, 3, 40.




                                        8
ChemicalTagger
• Tags words of text

• Parses tags to identify phrases

• Generates an XML parse tree
   – http://chemicaltagger.ch.cam.ac.uk/
   – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for
     semantic text-mining in chemistry. J Cheminf 2011, 3, 17.




                                                                                        9
Tagging
•   Regex tagger: tags keywords e.g. “yield”, “mL”
•   OSCAR4 tagger: Finds names OSCAR4 believes to be chemical
    e.g. “2-methylpyridine”
•   OpenNLP: Tags parts of speech


Additional taggers:
• OPSIN tagger: Finds names OPSIN can parse
• Trivial chemical name tagger: Tags a few chemicals missed by
   the other taggers and cases that are partially matched by
   the regex tagger e.g. Dess-Martin reagent
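
As a rough illustration of the regex tagger idea, the sketch below tags numbers, units and the keyword "yield" in a list of tokens. The tag names CD, NN-MASS, NN-AMOUNT and NN-EQ are taken from the sample output in this deck; the patterns, the NN-VOL and NN-YIELD tags and the `regex_tag` function are assumptions made for illustration, not ChemicalTagger's actual rules.

```python
import re

# Illustrative (pattern, tag) rules. These are hypothetical and are not
# copied from ChemicalTagger's own rule set.
RULES = [
    (re.compile(r"^\d+(\.\d+)?$"), "CD"),               # cardinal number
    (re.compile(r"^(mg|g|kg)$"), "NN-MASS"),            # mass units
    (re.compile(r"^(mmol|mol)$"), "NN-AMOUNT"),         # amount units
    (re.compile(r"^(ml|mL|l|L)$"), "NN-VOL"),           # volume units (assumed tag)
    (re.compile(r"^(eq|equiv)\.?$"), "NN-EQ"),          # equivalents
    (re.compile(r"^yield(s|ed)?$", re.IGNORECASE), "NN-YIELD"),  # assumed tag
]


def regex_tag(tokens):
    """Return (token, tag) pairs; tokens matching no rule get the tag None."""
    tagged = []
    for token in tokens:
        tag = next((t for pattern, t in RULES if pattern.match(token)), None)
        tagged.append((token, tag))
    return tagged


tagged = regex_tag("606 mg , 2.1 mmol , 1 eq".split())
```

Tokens left untagged here (such as the commas) would be picked up by the other taggers in the chain, for example the part-of-speech tagger.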


                                                            10
Sample ChemicalTagger Output
     <MOLECULE>
       <OSCARCM>
         <OSCAR-CM>methyl</OSCAR-CM>
         <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
       </OSCARCM>
       <QUANTITY>
         <_-LRB->(</_-LRB->
         <MASS>
           <CD>606</CD>
           <NN-MASS>mg</NN-MASS>
         </MASS>
         <COMMA>,</COMMA>
         <AMOUNT>
           <CD>2.1</CD>
           <NN-AMOUNT>mmol</NN-AMOUNT>
         </AMOUNT>
         <COMMA>,</COMMA>
         <EQUIVALENT>
           <CD>1</CD>
           <NN-EQ>eq</NN-EQ>
         </EQUIVALENT>
         <_-RRB->)</_-RRB->
       </QUANTITY>
     </MOLECULE>
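
A parse tree like the one above can be read with any XML library. The following is a minimal sketch using Python's standard library that pulls out the chemical name and its quantities; the `read_molecule` helper and the shape of its return value are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The sample output from the slide, lightly compacted.
SAMPLE = """<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>"""


def read_molecule(xml_text):
    """Return (name, {quantity kind: (value, unit)}) for one MOLECULE element."""
    root = ET.fromstring(xml_text)
    # The name is split over several OSCAR-CM tokens; join them back up.
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    for kind in ("MASS", "AMOUNT", "EQUIVALENT"):
        node = root.find(f"QUANTITY/{kind}")
        if node is not None:
            value = float(node.find("CD").text)
            unit = node[1].text  # second child holds the unit token
            quantities[kind] = (value, unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
```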

                                                           11
Phrase Identification




                        12
Quantity Identification




                          13
Section/Step Parsing




                       14
Pyridine, pyridines and pyridine rings


 Entity                                   Type
 Pyridine                                 Exact
 The pyridine / Pyridine from step 1      DefiniteReference
 Pyridines / A pyridine                   ChemicalClass
 Pyridine ring / Pyridyl                  Fragment
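
The four entity types can be illustrated with a toy classifier. The rules below are deliberately crude surface heuristics chosen only to reproduce the pyridine examples on this slide; they are an assumption for illustration and not the rules the extraction system uses.

```python
import re


def classify_mention(mention):
    """Toy heuristic mapping a chemical mention to one of the four entity types."""
    text = mention.strip().lower()
    # Substructure wording: "... ring" or a substituent name ending in "yl".
    if text.endswith(" ring") or text.endswith("yl"):
        return "Fragment"
    # Back-references to a specific, previously described compound.
    if text.startswith("the ") or re.search(r"\bfrom step \d+", text):
        return "DefiniteReference"
    # Indefinite article or plural suggests a class of compounds.
    if re.match(r"(a|an) ", text) or text.endswith("s"):
        return "ChemicalClass"
    return "Exact"


examples = {
    "Pyridine": classify_mention("Pyridine"),
    "The pyridine": classify_mention("The pyridine"),
    "Pyridine from step 1": classify_mention("Pyridine from step 1"),
    "Pyridines": classify_mention("Pyridines"),
    "A pyridine": classify_mention("A pyridine"),
    "Pyridine ring": classify_mention("Pyridine ring"),
    "Pyridyl": classify_mention("Pyridyl"),
}
```

Rules this simple misfire easily (any name ending in "yl" or "s" is caught), which is why the distinction needs grammatical context from the parse tree rather than string endings alone.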




                                                                      15
Section/Step Parsing




Workup phrase types: Concentrate, Degass, Dry, Extract, Filter,
 Partition, Precipitate, Purify, Recover, Remove, Wash, Quench




                                                 16
Atom-mapping




               17
Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606
mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added
pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N
(540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred
at room temperature until all of the starting material was
consumed. The solvent was evaporated in vacuo and the
residue redissolved in ethyl acetate (10 ml), washed with
water (10 ml), saturated sodium hydrogen carbonate (10
ml), dried over sodium sulphate, filtered and evaporated to
yield the title compound as a white solid (690 mg, 1.8
mmol, 85%).
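
To make the quantity-association step concrete for this example, here is a minimal sketch that attaches the parenthesised quantities to the chemical name they follow. It assumes the entity names have already been found by the taggers; the `ENTITIES` list, the `QUANTITY` pattern and `attach_quantities` are hypothetical names, and the real system works on the parse tree rather than on raw text.

```python
import re

TEXT = (
    "To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) "
    "in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and "
    "Et3N (540 mg, 5.4 mmol, 2.5 eq)"
)

# Names assumed to have been identified already by the entity taggers.
ENTITIES = ["methyl 4-(chlorosulfonyl)benzoate", "DCM", "pentafluorophenol", "Et3N"]

# A number followed by one of the units that occur in this example.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mmol|ml|eq)\b")


def attach_quantities(text, entities):
    """Map each entity name to the quantities in the parenthetical after it."""
    result = {}
    for entity in entities:
        # A quantity group is taken to be a parenthetical directly after the name.
        match = re.search(re.escape(entity) + r"\s*\(([^()]*)\)", text)
        if match:
            result[entity] = {
                unit: float(value)
                for value, unit in QUANTITY.findall(match.group(1))
            }
    return result


associated = attach_quantities(TEXT, ENTITIES)
```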

                                                         18
Graphical Output




                   19
CML output
<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
 <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
 <productList>
  <product role="product">
   <molecule id="m0">
    <name dictRef="nameDict:unknown">title compound</name>
   </molecule>
   <amount units="unit:mmol">1.8</amount>
   <amount units="unit:mg">690</amount>
   <amount units="unit:percentYield">85.0</amount>
   <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
   <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
   <dl:entityType>definiteReference</dl:entityType>
   <dl:state>solid</dl:state>
  </product>
 </productList>
 <reactantList>
  <reactant role="reactant" count="1">
   <molecule id="m1">
    <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
   </molecule>
   <amount units="unit:mmol">2.1</amount>
   <amount units="unit:mg">606</amount>
   <amount units="unit:eq">1.0</amount>
   <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>

Annotations on the slide:
• Reaction SMILES
• Quantities including yield are extracted
• SMILES and InChIs for every structure-resolvable reagent/product
• Entity is classified as an exact compound, definite reference, chemical class or polymer
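
Downstream code can read this output with a namespace-aware XML parser. The sketch below parses a small, well-formed fragment modelled on the product element above; the fragment is trimmed for illustration (only the default CML namespace from the slide is declared) and `read_product` is a hypothetical helper, not part of the extraction code.

```python
import xml.etree.ElementTree as ET

# Default namespace as shown on the slide.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Trimmed, self-contained fragment modelled on the product element.
FRAGMENT = """<product xmlns="http://www.xml-cml.org/schema" role="product">
  <molecule id="m0"><name>title compound</name></molecule>
  <amount units="unit:mmol">1.8</amount>
  <amount units="unit:mg">690</amount>
  <amount units="unit:percentYield">85.0</amount>
  <identifier dictRef="cml:smiles"
              value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
</product>"""


def read_product(xml_text):
    """Return ({unit: value}, SMILES) for one CML product element."""
    product = ET.fromstring(xml_text)
    # "unit:mmol" -> "mmol"; the element text holds the numeric value.
    amounts = {
        amount.get("units").split(":", 1)[1]: float(amount.text)
        for amount in product.findall("cml:amount", CML_NS)
    }
    smiles = next(
        identifier.get("value")
        for identifier in product.findall("cml:identifier", CML_NS)
        if identifier.get("dictRef") == "cml:smiles"
    )
    return amounts, smiles


amounts, smiles = read_product(FRAGMENT)
```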



                                                                                                                                                  20
Evaluation
•   2008-2011 USPTO patent applications classified as containing
    organic chemistry: 65,034 documents

•   484,259 atom-mapped reactions extracted

•   Adding the additional requirements that all the identified
    product molecules were resolvable to structures and that all
    reagents were believed to describe exact compounds:
    424,621 reactions

•   100 of these were selected for manual evaluation of quality

                                                                  21
Reactions found
[Chart: patents with given number of reactions (log scale, 1 to 100,000) against number of extracted reactions (0 to 1000)]

                                                                                                            22
Results
•   96% correctly identified the primary starting material and product
    whilst not misidentifying reagents that could be confused with the
    starting material

•   As compared to the 495 expected chemical entities there were 61 false
    positives and 16 false negatives

•   Only 4 of the 321 reagents (with quantities) did not have these
    quantities recognised and associated with the reagent

•   Association of quantities/yields with products was less successful: 48
    of the 74 cases where such data were present were handled correctly
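
The counts above can be turned into the usual summary rates. This sketch assumes that the 495 "expected" entities are the manually identified ones and that the 16 false negatives are expected entities the system missed, so that true positives are 495 minus 16; that reading is an interpretation of the slide, not something it states.

```python
# Counts reported on the slide.
expected_entities = 495
false_positives = 61
false_negatives = 16

# Assumption: true positives are the expected entities that were not missed.
true_positives = expected_entities - false_negatives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / expected_entities
f1 = 2 * precision * recall / (precision + recall)

# Quantity association rates.
reagent_quantity_rate = (321 - 4) / 321   # reagents with quantities recognised
product_quantity_rate = 48 / 74           # products with quantities/yields handled

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"reagent quantities={reagent_quantity_rate:.3f} "
      f"product quantities={product_quantity_rate:.3f}")
```

Under that assumption, entity recognition comes out at roughly 89% precision and 97% recall, with quantity association at about 99% for reagents and 65% for products.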

                                                                             23
Use Cases
• Reaction searching

• Analysing trends in reactions over time

• Reaction outcome prediction




                                            24
Example of reaction searching
C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)




     6 reactions found in 5 patents
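
The query above reads as an alkene plus diiodomethane giving a cyclopropane, with atom maps 1 and 2 tying the alkene carbons to the ring. Actually running such a search needs a cheminformatics toolkit; the standard-library sketch below only shows how the query string is organised, splitting it into reactant, agent and product parts and collecting the atom-map numbers. The function names are hypothetical, and splitting on "." is a simplification.

```python
import re

QUERY = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"

# An atom-map number is the ":n" suffix inside a bracket atom, e.g. [CH:1].
MAP_NUMBER = re.compile(r"\[[^\]]*:(\d+)\]")


def split_side(side):
    """Split one side of a reaction into its dot-separated components."""
    return side.split(".") if side else []


def split_reaction(reaction):
    """Split 'reactants>agents>products' into three component lists."""
    reactants, agents, products = reaction.split(">")
    return split_side(reactants), split_side(agents), split_side(products)


def map_numbers(components):
    """Collect the atom-map numbers used in a list of components."""
    return {int(n) for part in components for n in MAP_NUMBER.findall(part)}


reactants, agents, products = split_reaction(QUERY)
shared_maps = map_numbers(reactants) & map_numbers(products)
```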


                                               25
Name I20110224.tarUS20110046406A1-20110224.ZIP0066




Text from US 2011/0046406 A1




                                                        26
Most lexical variants

1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
EDCI hydrochloride
1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride
And 127 more!

675 chemicals had over 10 lexical variants!
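
Counting lexical variants amounts to grouping the names as written by the structure they resolve to and counting the distinct strings in each group. The sketch below assumes each mention already carries a structure key; the keys "EDC.HCl" and "DCM" are placeholders standing in for a canonical identifier such as an InChI, and `lexical_variants` is a hypothetical helper.

```python
from collections import defaultdict

# (name as written, structure key) pairs. The keys are placeholders for a
# canonical structure identifier produced by name-to-structure conversion.
MENTIONS = [
    ("EDCI hydrochloride", "EDC.HCl"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "EDC.HCl"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl", "EDC.HCl"),
    ("EDCI hydrochloride", "EDC.HCl"),  # repeated mention, not a new variant
    ("DCM", "DCM"),
    ("dichloromethane", "DCM"),
]


def lexical_variants(mentions):
    """Count the distinct written forms seen for each structure key."""
    variants = defaultdict(set)
    for written_name, structure_key in mentions:
        variants[structure_key].add(written_name)
    return {key: len(names) for key, names in variants.items()}


variant_counts = lexical_variants(MENTIONS)
```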



                                                                                                      27
Most common solvents




                       28
Known Limitations
•   The first workup reagent is often erroneously classified as a
    reactant

•   Atom mapping produces mappings that are not necessarily
    representative of reaction mechanism and occasionally
    involve clearly incorrect atoms

•   Conditions from analogous reactions are not resolved

•   Temperature/time for reactions to occur not captured



                                                                    29
Conclusions
• 424,621 exact atom-mapped reactions were
  extracted from 4 years of USPTO patent
  applications
• Evaluation indicates the reactions to be of
  generally good quality especially if the
  misidentification of workup reagents as
  reactants is not considered important
• All the code to extract reactions is open source:
  https://bitbucket.org/dan2097/patent-reaction-extraction

                                                        30
Acknowledgements
Unilever Centre:
Robert Glen
Peter Murray-Rust
Lezan Hawizy
David Jessop
Matthew Grayson

Indigo toolkit:
Mikhail Rybalkin
Savelyev Alexander
Dmitry Pavlov

SMARTS searching:
Roger Sayle

Boehringer Ingelheim for funding



                                                        31
Any Questions?




Email: daniel@nextmovesoftware.com


                                     32

Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Automated Extraction of Reactions from the Patent Literature

  • 1. Automated Extraction of Reactions from the Patent Literature Daniel Lowe Unilever Centre for Molecular Science Informatics University of Cambridge 1
  • 2. Chemistry patent applications • 100,000s applications each year [Bar chart: chemistry patent applications per year, 2000 to 2009, y-axis 0 to 400,000. Source: World Intellectual Property Indicators, 2011 edition]
  • 3. 3
  • 4. The idea XML patents Reaction Extraction System Extracted Reactions 4
  • 5. Steps involved • Identifying experimental sections • Identifying chemical entities • Chemical name to structure conversion • Associating chemical entities with quantities • Assigning chemical roles • Atom-atom mapping 5
  • 6. Building on existing projects 6
  • 7. Archetypal experimental section Section heading Section target compound Step identifier Step target compound Paragraph number Synthesis Workup Characterisation 7
  • 8. Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40. 8
  • 9. ChemicalTagger • Tags words of text • Parses tags to identify phrases • Generate XML parse tree – http://chemicaltagger.ch.cam.ac.uk/ – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminf 2011, 3, 17. 9
  • 10. Tagging • Regex tagger: tags keywords e.g. “yield”, “mL” • OSCAR4 tagger: Finds names OSCAR4 believes to be chemical e.g. “2-methylpyridine” • OpenNLP: Tags parts of speech Additional taggers: • OPSIN tagger: Finds names OPSIN can parse • Trivial chemical name tagger: Tags a few chemicals missed by the other taggers and cases that are partially matched by the regex tagger e.g. Dess-martin reagent 10
  • 11. Sample ChemicalTagger Output
      <MOLECULE>
        <OSCARCM>
          <OSCAR-CM>methyl</OSCAR-CM>
          <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
        </OSCARCM>
        <QUANTITY>
          <_-LRB->(</_-LRB->
          <MASS> <CD>606</CD> <NN-MASS>mg</NN-MASS> </MASS>
          <COMMA>,</COMMA>
          <AMOUNT> <CD>2.1</CD> <NN-AMOUNT>mmol</NN-AMOUNT> </AMOUNT>
          <COMMA>,</COMMA>
          <EQUIVALENT> <CD>1</CD> <NN-EQ>eq</NN-EQ> </EQUIVALENT>
          <_-RRB->)</_-RRB->
        </QUANTITY>
      </MOLECULE>
  • 15. Pyridine, pyridines and pyridine rings
      Entity                                 Type
      Pyridine                               Exact
      The pyridine / Pyridine from step 1    DefiniteReference
      Pyridines / A pyridine                 ChemicalClass
      Pyridine ring / Pyridyl                Fragment
  • 16. Section/Step Parsing Workup phrase types : Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench 16
  • 18. Example Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%). 18
  • 20. CML output
      Slide annotations:
      – Reaction SMILES
      – Quantities including yield are extracted
      – SMILES and InChIs for every structure-resolvable reagent/product
      – Entity is classified as an exact compound, definite reference, chemical class or polymer
      <reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
        <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
        <productList>
          <product role="product">
            <molecule id="m0">
              <name dictRef="nameDict:unknown">title compound</name>
            </molecule>
            <amount units="unit:mmol">1.8</amount>
            <amount units="unit:mg">690</amount>
            <amount units="unit:percentYield">85.0</amount>
            <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
            <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
            <dl:entityType>definiteReference</dl:entityType>
            <dl:state>solid</dl:state>
          </product>
        </productList>
        <reactantList>
          <reactant role="reactant" count="1">
            <molecule id="m1">
              <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
            </molecule>
            <amount units="unit:mmol">2.1</amount>
            <amount units="unit:mg">606</amount>
            <amount units="unit:eq">1.0</amount>
            <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
  • 21. Evaluation • 2008-2011 USPTO patent applications classified as containing organic chemistry → 65,034 documents. • 484,259 atom-mapped reactions extracted • Adding the additional requirements that all the identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds → 424,621 reactions. • 100 of these were selected for manual evaluation of quality
  • 22. Reactions found [Chart: patents with given number of reactions (log scale, 1 to 100,000) against number of extracted reactions (0 to 1000)]
  • 23. Results • 96% correctly identified the primary starting material and product whilst not misidentifying reagents that could be confused with the starting material • As compared to the 495 expected chemical entities there were 61 false positives and 16 false negatives • Only 4 of the 321 reagents (with quantities) did not have these quantities recognised and associated with the reagent • Association of quantities/yields with products was less successful, 48 out of the 74 cases where such data was present were handled 23
  • 24. Use Cases • Reaction searching • Analysing trends in reactions over time • Reaction outcome prediction 24
  • 25. Example of reaction searching C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1) 6 reactions found in 5 patents 25
  • 27. Most lexical variants
      – 1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
      – EDCI hydrochloride
      – 1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
      – N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
      – N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
      – 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
      – N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
      – N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
      – 1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
      – 1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
      – 1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
      – 1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
      – N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
      – 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
      – 1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride
      And 127 more!
      675 chemicals had over 10 lexical variants!
  • 29. Known Limitations • The first workup reagent is often erroneously classified as a reactant • Atom mapping produces mappings that are not necessarily representative of reaction mechanism and occasionally involve clearly incorrect atoms • Conditions from analogous reactions are not resolved • Temperature/time for reactions to occur not captured 29
  • 30. Conclusions • 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications • Evaluation indicates the reactions to be of generally good quality especially if the misidentification of workup reagents as reactants is not considered important • All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction 30
  • 31. Acknowledgements
      – Unilever centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson
      – Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov
      – SMARTS searching: Roger Sayle
      – Boehringer Ingelheim for funding
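
As an illustration of how the sample ChemicalTagger output on slide 11 ties quantities to a chemical entity, the following minimal Python sketch reads that XML back into a name and a set of (value, unit) pairs. The function name `read_molecule` and the dictionary layout are assumptions made for this example only; they are not part of ChemicalTagger or of the reaction extraction code.

```python
import xml.etree.ElementTree as ET

# Sample ChemicalTagger output copied from slide 11 (whitespace added).
SAMPLE = """
<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>
"""


def read_molecule(xml_text):
    """Hypothetical helper: return (name, quantities) for one MOLECULE element."""
    root = ET.fromstring(xml_text.strip())
    # The chemical name is split over several OSCAR-CM tokens; rejoin them.
    name = " ".join(token.text for token in root.iter("OSCAR-CM"))
    quantities = {}
    for kind in ("MASS", "AMOUNT", "EQUIVALENT"):
        node = root.find(f"QUANTITY/{kind}")
        if node is None:
            continue
        value = float(node.find("CD").text)
        # The unit is the child element that is not the number (CD).
        unit = next(child.text for child in node if child.tag != "CD")
        quantities[kind.lower()] = (value, unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
print(name)
print(quantities)
```

Because the quantities sit inside the same MOLECULE element as the name, associating them with the entity here is a matter of walking one subtree; the harder cases described in the notes, where product information is not explicitly attached to the product, are not covered by this sketch.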

Editor's Notes

  1. Manual abstraction of the precise details of reactions from this many documents would be expensive.
  2. How can one get access to patents? Google Patents offers all USPTO patents from 2001 onwards as XML, including images and ChemDraw files. Older patents are available as text only back to 1976, as OCRed text back to 1920, and back to 1790 if one performs the OCR oneself.
  3. This problem can be broken down into several sub problems
  4. Fortunately we don’t have to start from scratch: many open source toolkits exist to help with these tasks. OPSIN (name to structure), OSCAR4 (chemical entity recognition) and ChemicalTagger (tagging and parsing of experimental chemistry text).
  5. This is what a typical experimental section from a patent looks like. We need to identify such sections, link the heading with the paragraphs and preferably distinguish synthesis reagents from workup reagents.
  6. Heading/paragraphs can be extracted directly from the XML. The classifier uses the probabilities of words being present in an experimental chemistry section versus a standard paragraph. The language in experimental sections is quite repetitive, so this works well. In some cases a heading may not be annotated as such in the XML; this can often be detected, and the heading is then processed as if it were a discrete element.
  7. This work relies heavily on ChemicalTagger, and significant improvements have been made to ChemicalTagger as part of this project to improve its performance and the range of concepts recognised. Hence a description of the system would not be complete without also explaining what ChemicalTagger does.
  8. For this project we also use the following taggers. These tags can then be parsed to yield….
  9. Quantities have been recognised, marked up and associated with a molecule. Where certain keywords are identified, phrases can be identified….
  10. A few phrase types are identified directly by the grammar, e.g. "a chemical in a chemical" is a dissolve phrase.
  11. Will be associated with the identified compound. As you can see a compound doesn’t have to contain a chemical entity. (title compound as a white solid)
  12. Uses a combination of textual clues and OPSIN’s classification
  13. Phrases can be classified into workup by phrase type e.g. extraction, purification. As the yielded compound and its characterisation are often conjoined, characterisation is not identified explicitly; instead, compounds commonly associated with characterisation are marked up as false positives by regexes. A single paragraph may have multiple blocks of synthesis and workup. Structure-aware role assignment heuristically labels known solvents as solvents and known catalysts as catalysts, e.g. using lists of known solvents/catalysts and properties such as the presence of a transition metal.
  14. Perform a sanity check on the reaction e.g. it has a product and at least 2 reagents. Attempt to find a mapping in which all product atoms can be accounted for.
  15. Here is an example of an experimental section
  16. Occasionally the system identifies as a reactant a compound that was mentioned only because the current reaction is performed in an analogous way to the reaction that produced that compound. False positives arise from workup reagents being classified as reactants and from clear errors. Product information is often not explicitly associated with the product.
  17. Simmons–Smith reaction for conversion of a terminal allyl group to a cyclopropane group found 6 hits in 5 patents.
  18. It should be noted that nowhere in this text, and indeed in the whole patent, is the name of the reaction mentioned; this is quite common.
  19. 675 chemical entities had over 10 lexical variants
  20. Top 10
  21. This is due to the text typically just saying that the substance is added, without further specification of its purpose.
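
Note 6 describes the experimental-section classifier only as using the probabilities of words being present in an experimental section versus a standard paragraph. A minimal sketch of one way to do that is a smoothed log-odds score over the distinct words of a paragraph, shown below in Python. The function names, the add-one smoothing, the zero threshold and the four toy training paragraphs are all assumptions for illustration; the actual classifier in the project may differ.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Distinct lower-cased alphabetic words in a paragraph."""
    return set(re.findall(r"[a-z]+", text.lower()))


def doc_freq(paragraphs):
    """Count, for each word, how many paragraphs contain it."""
    counts = Counter()
    for paragraph in paragraphs:
        counts.update(tokens(paragraph))
    return counts


def make_scorer(experimental, ordinary):
    """Build a scorer; a positive score suggests an experimental paragraph."""
    exp_df, ord_df = doc_freq(experimental), doc_freq(ordinary)
    n_exp, n_ord = len(experimental), len(ordinary)

    def score(paragraph):
        total = 0.0
        for word in tokens(paragraph):
            # Add-one smoothed probability that the word is present in each class.
            p_exp = (exp_df[word] + 1) / (n_exp + 2)
            p_ord = (ord_df[word] + 1) / (n_ord + 2)
            total += math.log(p_exp / p_ord)
        return total

    return score


# Toy training data, invented for this sketch (not taken from the patent corpus).
EXPERIMENTAL = [
    "The reaction mixture was stirred at room temperature and the solvent was evaporated in vacuo.",
    "The residue was dissolved in ethyl acetate, washed with water, dried and filtered to yield a white solid.",
]
ORDINARY = [
    "The present invention relates to compounds useful as inhibitors.",
    "In another embodiment the invention provides a pharmaceutical composition.",
]

score = make_scorer(EXPERIMENTAL, ORDINARY)
print(score("The mixture was stirred and the solvent evaporated to yield a solid.") > 0)
print(score("The invention provides compounds of the present embodiment.") < 0)
```

The repetitive vocabulary of experimental text ("stirred", "evaporated", "yield") is what makes such a simple word-presence score plausible; this sketch ignores absent words and class priors, which a fuller model would likely include.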