Chemical Text Mining for Current Awareness of Pharmaceutical Patents

Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK

 ACS National Meeting, Philadelphia, USA 19th August 2012
US patent applications by year
[Figure: patent applications per year, 2002 to 2012*; y-axis 0 to 35,000; series: Pharma and C07D]


*2012 includes patent applications published on or before 9th August 2012.
“Pharma” is defined as IPC codes C07*, A61K, A61P and A01N.

USPTO Bulk Downloads




Workflow




Green Book 1976-2000




SGML (2001 grants)

• Uses different tags to Red Book documents
• May contain unclosed tags:

<PCIT>
<DOC><DNUM><PDAT>5154857</PDAT></DNUM>
<DATE><PDAT>19921000</PDAT></DATE></DOC>
<PARTY-US>
<NAM><SNM><STEXT><PDAT>Goto et al.</PDAT></STEXT></SNM></NAM>
</PARTY-US>
<PNC><PDAT>25229963</PDAT></PNC></PCIT><CITED-BY-EXAMINER>



Red Book (2002 – present)
<heading id="h-0082" level="1">Example 30</heading>
<heading id="h-0083" level="1">Preparation of (E)-2-amino-N-(3-(2-(2,6,6-
trimethylcyclohex-1-enyl)vinyl)phenyl)acetamide</heading>
<p id="p-0883" num="1052"><chemistry id="CHEM-US-00235" num="00235">
<img id="EMI-C00235" he="20.07mm" wi="57.57mm" file="US20090170841A1-
20090702-C00235.TIF" img-content="chem" img-format="tif"/>
</chemistry>
</p>
<p id="p-0884" num="1053">(E)-2-amino-N-(3-(2-(2,6,6-trimethylcyclohex-1-
enyl)vinyl)phenyl)acetamide was prepared following the method used in
Example 15.</p>
<p id="p-0885" num="1054">Step 1: Coupling of Wittig reagent 24 with 3-
nitrobenzaldehyde gave 1-nitro-3-(2-(2,6,6-trimethylcyclohex-1-
enyl)vinyl)benzene as a light yellow oil. Yield (0.639 g, 95%), isomer ratio
4:1 ratio trans:cis.</p>
<p id="p-0886" num="1055">trans-isomer: <sup>1</sup>H NMR (300 MHz, CDCl<sub>3</sub>)
&#x3b4; 8.24 (t, J=1.9 Hz, 1H), 8.04 (m, 1H), 7.69 (d, J=7.7 Hz, 1H), 7.47 (t, J=8.0 Hz, 1H), 6.83 (dd,
J=16.3, 0.85 Hz, 1H), 6.40 (d, J=16.3 Hz, 1H), 2.06 (t, J=6.2 Hz, 2H), 1.81 (s, 3H), 1.65 (m, 2H), 1.52 (m,
2H), 1.08 (s, 6H);</p>


HTML generated from Red Book




Benefits of clean input

• Patent feed text:
Cis-2,3,6,7,12,12a-hexahydro-2-benzyl6-(4-methoxyphenyl)-pyrazino0x9b 2',1':6,1
!pyrido0x9b 3,4-b!indole-1,4-dione

(Callout on slide: new line in middle of name!)
• Extracted from USPTO source:
Cis-2,3,6,7,12,12a-hexahydro-2-benzyl6-(4-methoxyphenyl)-
pyrazino[2',1':6,1]pyrido[3,4-b]indole-1,4-dione


• LeadMine entity:
Cis-2,3,6,7,12,12a-hexahydro-2-benzyl-6-(4-methoxyphenyl)-
pyrazino[2',1':6,1]pyrido[3,4-b]indole-1,4-dione


LeadMine 2.0

• Dictionary and grammar based general entity
  recogniser


• Tokenization determined by the terms to be
  recognised



Default Dictionaries
Dictionary                             Example                   Size
Molecule                               benzoic acid              Infinite
Dictionary                             ranitidine                11,201
Registry #                             GW-409544                 Large but finite
CAS Number                             7732-18-5                 Large but finite
Element                                gold                      185
Fragment                               phenyl                    Infinite
Atom Fragment                          chloro                    11
Polymer                                polystyrene               74
Generic                                alkane                    362
Noise                                  formal                    16

LeadMine 2.0
• Dictionaries to be used are configurable e.g.
  protein targets, genes, diseases, reaction
  names etc.

• Matching speed is independent of dictionary
  size

• Any dictionary can be used with spelling
  correction to match lexically close entities

LeadMine 2.0 Configuration
#A company registry number for a compound
[dictionary]
  location CFDictR.cfx
  entityType R
  htmlColor #90b0ff
  caseSensitive false
  useSpellingCorrection false
#A molecule e.g. 2-methylpyridine
[dictionary]
  location CFDictM.cfx
  entityType M
  htmlColor violet
  enforceBracketing true
  caseSensitive false
  useSpellingCorrection true
  minimumCorrectedEntityLength 9
  maxCorrectionDistance 1

• Multiple dictionaries can map to the same entity type
• Spelling correction can be adjusted on a per-dictionary basis
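
As a rough illustration of how the minimumCorrectedEntityLength and maxCorrectionDistance settings above could interact, the sketch below accepts a misspelt token only if it is long enough and within the allowed number of edits of a dictionary term. This is an assumption about what the settings mean, not LeadMine's implementation: the class and method names are hypothetical, the distance is plain Levenshtein distance, and the linear scan over the terms does not reproduce LeadMine's matching speed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch; not part of the LeadMine API.
final class SpellingCorrectionSketch {
    // Values taken from the molecule dictionary configuration on the slide.
    static final int MINIMUM_CORRECTED_ENTITY_LENGTH = 9;
    static final int MAX_CORRECTION_DISTANCE = 1;

    /** Levenshtein distance: minimum number of insertions, deletions and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the matching term, or empty if neither an exact nor an allowed close match exists. */
    static Optional<String> correct(String token, List<String> lowerCaseTerms) {
        String lower = token.toLowerCase(Locale.ROOT); // mirrors "caseSensitive false"
        if (lowerCaseTerms.contains(lower)) {
            return Optional.of(lower);
        }
        if (lower.length() < MINIMUM_CORRECTED_ENTITY_LENGTH) {
            return Optional.empty(); // too short to correct safely
        }
        for (String term : lowerCaseTerms) {
            if (editDistance(lower, term) <= MAX_CORRECTION_DISTANCE) {
                return Optional.of(term);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("triethylamine", "methanol");
        System.out.println(correct("triethylamime", terms));
        System.out.println(correct("methanal", terms));
    }
}
```

In this sketch "triethylamime" is corrected to "triethylamine", while "methanal" is left alone: it is one edit away from "methanol" but shorter than the minimum length, which is the kind of case a length threshold is presumably meant to guard against.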
Building Dictionaries

• Uses Daciuk/Mihov’s algorithm to allow building
  dictionaries with millions of entities in linear time
• Extremely large dictionaries are often smaller when
  compiled than the original input
• 54 million synonyms from PubChem can be compiled
  to a dictionary of slightly less than 1 GB in 17 minutes
  and 20 seconds!
Daciuk, J.; Mihov, S.; Watson, B. W.; Watson, R. E. Incremental
construction of minimal acyclic finite-state automata.
Computational Linguistics 2000, 26, 3–16.
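
To make the idea of a compiled dictionary concrete, here is a minimal sketch of incremental construction of a minimal acyclic automaton from sorted input, in the spirit of the paper cited above. It is an illustration only, not LeadMine's dictionary compiler: the class name, the string-based state signature and the state counter are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; words must be supplied in sorted order.
final class MinimalAutomatonSketch {
    static final class State {
        final int id;
        final TreeMap<Character, State> edges = new TreeMap<>();
        boolean accepting;
        State(int id) { this.id = id; }
    }

    private int nextId = 0;
    private final State root = new State(nextId++);
    // Canonical states, keyed by accepting flag plus outgoing (label, target id) pairs.
    private final Map<String, State> register = new HashMap<>();

    static MinimalAutomatonSketch build(List<String> sortedWords) {
        MinimalAutomatonSketch a = new MinimalAutomatonSketch();
        String previous = "";
        for (String word : sortedWords) {
            if (word.compareTo(previous) < 0) {
                throw new IllegalArgumentException("input must be sorted");
            }
            // Follow the prefix shared with the words added so far.
            State state = a.root;
            int i = 0;
            while (i < word.length() && state.edges.containsKey(word.charAt(i))) {
                state = state.edges.get(word.charAt(i));
                i++;
            }
            // The previous word's suffix is now final, so it can be minimised.
            if (!state.edges.isEmpty()) {
                a.replaceOrRegister(state);
            }
            // Append the remaining characters as new states.
            for (; i < word.length(); i++) {
                State next = new State(a.nextId++);
                state.edges.put(word.charAt(i), next);
                state = next;
            }
            state.accepting = true;
            previous = word;
        }
        if (!a.root.edges.isEmpty()) {
            a.replaceOrRegister(a.root);
        }
        return a;
    }

    private void replaceOrRegister(State state) {
        Map.Entry<Character, State> last = state.edges.lastEntry();
        State child = last.getValue();
        if (!child.edges.isEmpty()) {
            replaceOrRegister(child);
        }
        String key = signature(child);
        State canonical = register.get(key);
        if (canonical == null) {
            register.put(key, child);                    // first state of its kind
        } else {
            state.edges.put(last.getKey(), canonical);   // share the equivalent state
        }
    }

    private static String signature(State s) {
        StringBuilder sb = new StringBuilder(s.accepting ? "1" : "0");
        for (Map.Entry<Character, State> e : s.edges.entrySet()) {
            sb.append(e.getKey()).append(e.getValue().id).append(';');
        }
        return sb.toString();
    }

    boolean contains(String word) {
        State s = root;
        for (int i = 0; i < word.length() && s != null; i++) {
            s = s.edges.get(word.charAt(i));
        }
        return s != null && s.accepting;
    }

    /** Number of states reachable from the root. */
    int countStates() {
        Set<State> seen = new HashSet<>();
        Deque<State> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            State s = stack.pop();
            if (seen.add(s)) {
                stack.addAll(s.edges.values());
            }
        }
        return seen.size();
    }
}
```

Built from "tap", "taps", "top" and "tops", this automaton has 5 states where a plain trie would need 8, because the shared endings are stored once. Sharing of this kind is one plausible reason a very large compiled dictionary can end up smaller than its input.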
Foreign Language Support

• Chinese and Japanese chemical names may be
  rapidly converted to English as a pre-
  processing step



“Translating IUPAC-like chemical nomenclature to and from
simplified Chinese” 9:10 am, Wednesday, Global Opportunities
in Chemical Information



Sample output (HTML)




Sample output (CSV)

"in",E,"M",1,0,"COC1=CC=C(CN2CCC(CC2)O)C=C1","1-(4-methoxybenzyl)piperidin-4-ol",
"in",E,"M",1,0,"BrCC1=CC=C(C=C1)OC","1-(bromomethyl)-4-methoxybenzene",
"in",E,"M",1,0,"OC1CCNCC1","4-hydroxypiperidine",
"in",E,"G",1,0,,"brine",
"in",E,"M",1,0,"CN(C=O)C","dimethylformamide",
"in",E,"M",1,0,"C(C)(=O)OCC","ethyl acetate",
"in",E,"M",1,0,"CO","methanol",
"in",E,"M",1,0,"C(Cl)Cl","methylene chloride",
"in",E,"G",1,0,,"silica gel",
"in",E,"M",1,0,"S(=O)(=O)([O-])[O-].[Na+].[Na+]","sodium sulfate",
"in",E,"M",1,0,"C(C)N(CC)CC","triethylamine",
"in",E,"N",1,0,"O","water",




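Rows like those in the CSV sample can be read with a small quote-aware splitter. The slides do not label the columns, so the sketch below assumes that the third field is the entity type code, the sixth a SMILES string (empty when no structure is available) and the seventh the entity text. The class, constant and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the CSV sample; column meanings are assumptions.
final class CsvOutputSketch {
    static final int TYPE_COLUMN = 2;   // assumed: entity type code, e.g. M, G, N
    static final int SMILES_COLUMN = 5; // assumed: SMILES, empty if no structure
    static final int NAME_COLUMN = 6;   // assumed: entity text

    /** Splits one CSV line, honouring double-quoted fields that may contain commas. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Maps entity text to SMILES for every row that carries a structure. */
    static Map<String, String> smilesByName(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            List<String> f = splitCsvLine(line);
            if (f.size() > NAME_COLUMN && !f.get(SMILES_COLUMN).isEmpty()) {
                result.put(f.get(NAME_COLUMN), f.get(SMILES_COLUMN));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two rows copied from the sample output.
        List<String> rows = Arrays.asList(
                "\"in\",E,\"M\",1,0,\"CO\",\"methanol\",",
                "\"in\",E,\"G\",1,0,,\"brine\",");
        System.out.println(smilesByName(rows));
    }
}
```

Quote handling matters here because systematic chemical names frequently contain commas, for example in locant lists.
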
BRAT (brat rapid annotation tool)
T1    M   25 44                   4-hydroxypiperidine
T2    M   78 95                   dimethylformamide
T3    M   121 153                 1-(bromomethyl)-4-methoxybenzene
T4    M   178 191                 triethylamine
T5    M   404 417                 ethyl acetate
T6    N   439 444                 water
T7    G   457 462                 brine
T8    M   495 509                 sodium sulfate
T9    G   658 668                 silica gel
T10   M   675 683                 methanol
T11   M   684 702                 methylene chloride
T12   M   714 747                 1-(4-methoxybenzyl)piperidin-4-ol
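
Each line above is a brat text-bound annotation: an identifier, an entity type, start and end character offsets (end exclusive), and the covered text. A minimal parsing sketch follows; the class and field names are hypothetical, and it tolerates any run of whitespace between fields since the tab separators of the standoff format are not preserved on the slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for brat text-bound ("T") annotation lines.
final class BratStandoffSketch {
    static final class TextBound {
        final String id;
        final String type;
        final int start;
        final int end;
        final String text;

        TextBound(String id, String type, int start, int end, String text) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // id, type, start offset, end offset, covered text (which may contain spaces)
    private static final Pattern LINE =
            Pattern.compile("^(T\\d+)\\s+(\\S+)\\s+(\\d+)\\s+(\\d+)\\s+(.+?)\\s*$");

    static TextBound parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a text-bound annotation: " + line);
        }
        return new TextBound(m.group(1), m.group(2),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)), m.group(5));
    }

    /** Sanity check: the span length should equal the length of the covered text. */
    static boolean spanMatchesText(TextBound t) {
        return t.end - t.start == t.text.length();
    }

    public static void main(String[] args) {
        TextBound t = parse("T5    M   404 417                 ethyl acetate");
        System.out.println(t.type + " " + t.text + " " + spanMatchesText(t));
    }
}
```

The span check is a cheap way to detect offsets that have drifted from the text, for instance after whitespace normalisation.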




BRAT (brat rapid annotation tool)




PatFetch




PatFetch




PatFetch (cont.)

• Recognises common USPTO grant/application
  number variants, e.g. 6356863, US 6356863,
  006356863, US 6,356,863 B1

• Allows all USPTO patent grants/applications to
  be accessed as text or HTML from simple URLs,
  e.g. patfetch/patents/6356863.html
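
One way to recognise the number variants listed above is to reduce each to a bare number before building the URL. The sketch below is an assumption about how such normalisation could work, not the tool's actual code: the class and method names are hypothetical, and only plain numeric grant or application numbers are handled.

```java
import java.util.Locale;

// Hypothetical normaliser for the number variants shown on the slide.
final class PatentNumberSketch {
    /** Reduces variants such as "US 6,356,863 B1" or "006356863" to "6356863". */
    static String normalize(String raw) {
        String s = raw.trim().toUpperCase(Locale.ROOT);
        s = s.replaceAll("[\\s,]+", "");       // drop spaces and thousands separators
        if (s.startsWith("US")) {
            s = s.substring(2);                // drop the country prefix
        }
        s = s.replaceAll("[A-Z]\\d?$", "");    // drop a trailing kind code such as B1 or A1
        s = s.replaceFirst("^0+", "");         // drop zero padding
        if (!s.matches("\\d+")) {
            throw new IllegalArgumentException("unrecognised patent number: " + raw);
        }
        return s;
    }

    /** Builds a relative URL in the form shown on the slide. */
    static String htmlUrl(String raw) {
        return "patfetch/patents/" + normalize(raw) + ".html";
    }

    public static void main(String[] args) {
        for (String v : new String[] {"6356863", "US 6356863", "006356863", "US 6,356,863 B1"}) {
            System.out.println(v + " -> " + htmlUrl(v));
        }
    }
}
```

All four variants from the slide map to the same URL. Numbers with letter prefixes, such as design or reissue patents, are rejected by this sketch rather than guessed at.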


“Macroscopic” analysis

• Having all patents available allows for analysis
  that spans the entire corpus rather than being
  limited to a single patent
• Example use cases
  – Identifying the key compounds in a patent
  – Finding the first instance of a molecule in the
    patent literature
  – Identifying patents containing novel chemistry

Filtering irrelevant patents

• Most irrelevant patents can be excluded by
  IPC codes. These are assigned by the USPTO to
  classify each patent.
• Typical pharmaceutical IPC codes
  – C07 (Organic Chemistry)
  – A61K (Preparations for medical, dental or toilet purposes)
  – A61P (Specific therapeutic activity of chemical compounds
    or medicinal preparations)
  – A01N (Preservation of bodies of humans or animals or
    plants or parts thereof)
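
A filter on the IPC codes listed above can be as simple as a prefix test. The following is a minimal sketch with hypothetical class and method names; it assumes each patent's IPC codes are available as strings such as "C07D 401/04".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical IPC-based relevance filter using the prefixes from the slide.
final class IpcFilterSketch {
    static final List<String> PHARMA_PREFIXES = Arrays.asList("C07", "A61K", "A61P", "A01N");

    /** True if a single IPC code falls under one of the pharmaceutical prefixes. */
    static boolean isPharma(String ipcCode) {
        String code = ipcCode.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        for (String prefix : PHARMA_PREFIXES) {
            if (code.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** A patent is kept if any of its IPC codes is pharmaceutical. */
    static boolean isRelevant(List<String> ipcCodes) {
        for (String code : ipcCodes) {
            if (isPharma(code)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRelevant(Arrays.asList("G06F 17/30", "C07D 401/04")));
        System.out.println(isRelevant(Arrays.asList("H01L 21/00")));
    }
}
```

Keeping a patent when any one of its codes matches errs on the side of inclusion, which suits a pre-filter applied before text mining.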
Finding the first mention of a compound
• The trivial name of a compound is often absent from
  the first patent that describes its synthesis
• Brand name Fabior, approved May 11, 2012
• Generic name Tazarotene
• First mentioned in US05023341




Finding the first mention of a compound
• Brand name Erivedge, approved January 30, 2012
• Generic name Vismodegib
• First mentioned in US20060063779A1
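
Because the earliest patent may use only a systematic name, a first-mention lookup is better keyed on structure than on name. The sketch below assumes each extracted mention has already been reduced to a canonical structure key (for example a canonical SMILES) together with the patent identifier and publication date. All names and the sample records in it are hypothetical placeholders, not data from the talk.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical first-mention index keyed on a canonical structure identifier.
final class FirstMentionSketch {
    static final class Mention {
        final String structureKey;
        final String patentId;
        final LocalDate published;

        Mention(String structureKey, String patentId, LocalDate published) {
            this.structureKey = structureKey;
            this.patentId = patentId;
            this.published = published;
        }
    }

    /** Keeps, for each structure, the mention with the earliest publication date. */
    static Map<String, Mention> firstMentions(List<Mention> mentions) {
        Map<String, Mention> earliest = new HashMap<>();
        for (Mention m : mentions) {
            earliest.merge(m.structureKey, m,
                    (known, candidate) -> candidate.published.isBefore(known.published) ? candidate : known);
        }
        return earliest;
    }

    public static void main(String[] args) {
        // Placeholder records for illustration only.
        List<Mention> mentions = Arrays.asList(
                new Mention("STRUCTURE-A", "DOC-2", LocalDate.of(2001, 5, 1)),
                new Mention("STRUCTURE-A", "DOC-1", LocalDate.of(1999, 3, 2)),
                new Mention("STRUCTURE-B", "DOC-2", LocalDate.of(2001, 5, 1)));
        System.out.println(firstMentions(mentions).get("STRUCTURE-A").patentId);
    }
}
```

A lookup of this kind only works across the whole corpus, since the earliest mention can sit in any patent.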




Novel compounds per patent




Novel compounds per patent




Awareness of novel scaffolds




Pitt, W. R.; Parry, D. M.; Perry, B. G.; Groom, C. R. Heteroaromatic rings of
the future. Journal of Medicinal Chemistry 2009, 52, 2952–2963.
Awareness of novel scaffolds




                                                       US46041346, 1983




Awareness of novel scaffolds




                                                 US20040038959A1




Rate of novel scaffold discovery
[Figure: novel scaffolds per year; y-axis 0 to 600; series: 6 atom ring fused to 5 atom ring, 6 atom ring fused to 6 atom ring]




*year not yet complete


Solvent Occurrences




Extracted from reactions in 2008-2011 USPTO patent applications


Conclusions

• Getting clean text from patents is an
  important starting point

• LeadMine offers a highly configurable
  environment for performing entity extraction

• Comprehensive coverage of the patent
  literature can assist in identifying the
  interesting aspects of a new patent
Acknowledgements

•   Sorel Muresan and Paul Hongxing Xie, AstraZeneca.
•   Nicko Goncharoff, SureChem/Digital Science.
•   Colin Batchelor, Royal Society of Chemistry.
•   Peter Loew and Heinz Saller, InfoChem.
•   Pat Walters, Vertex Pharmaceuticals.


• Thank you for your time.


Simple Java API

ExtractEngine engine = new ExtractEngine();
EntityCollector collector = engine.processString("text to analyse");
List<Entity> foundEntities = collector.getEntities();
for (Entity entity : foundEntities) {
  entity.getText();        // text of the recognised entity
  entity.getEntityType();  // entity type of the match
  entity.getBeg();         // beginning of the match
  entity.getEnd();         // end of the match
}


Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
flufftailshop
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 

Recently uploaded (20)

June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 

Chemical Text Mining for Current Awareness of Pharmaceutical Patents

  • 1. Chemical Text Mining for Current Awareness of Pharmaceutical Patents Daniel Lowe and Roger Sayle NextMove Software Cambridge, UK ACS National Meeting, Philadelphia, USA 19th August 2012
  • 2. US patent applications by year [Chart: patent applications per year, 2002 to 2012*, y-axis 0 to 35,000; series: Pharma and C07D.] *2012 includes patent applications published on or before 9th August 2012. “Pharma” is defined as IPC codes C07*, A61K, A61P and A01N.
  • 3. USPTO Bulk Downloads
  • 4. Workflow
  • 5. Green Book 1976-2000
  • 6. SGML (2001 grants)
      • Uses different tags to Red Book documents
      • May contain unclosed tags:

      <PCIT>
      <DOC><DNUM><PDAT>5154857</PDAT></DNUM>
      <DATE><PDAT>19921000</PDAT></DATE></DOC>
      <PARTY-US>
      <NAM><SNM><STEXT><PDAT>Goto et al.</PDAT></STEXT></SNM></NAM>
      </PARTY-US>
      <PNC><PDAT>25229963</PDAT></PNC></PCIT><CITED-BY-EXAMINER>
  • 7. Red Book (2002 – present)

      <heading id="h-0082" level="1">Example 30</heading>
      <heading id="h-0083" level="1">Preparation of (E)-2-amino-N-(3-(2-(2,6,6-trimethylcyclohex-1-enyl)vinyl)phenyl)acetamide</heading>
      <p id="p-0883" num="1052"><chemistry id="CHEM-US-00235" num="00235">
      <img id="EMI-C00235" he="20.07mm" wi="57.57mm" file="US20090170841A1-20090702-C00235.TIF" img-content="chem" img-format="tif"/>
      </chemistry>
      </p>
      <p id="p-0884" num="1053">(E)-2-amino-N-(3-(2-(2,6,6-trimethylcyclohex-1-enyl)vinyl)phenyl)acetamide was prepared following the method used in Example 15.</p>
      <p id="p-0885" num="1054">Step 1: Coupling of Wittig reagent 24 with 3-nitrobenzaldehyde gave 1-nitro-3-(2-(2,6,6-trimethylcyclohex-1-enyl)vinyl)benzene as a light yellow oil. Yield (0.639 g, 95%), isomer ratio 4:1 ratio trans:cis.</p>
      <p id="p-0886" num="1055">trans-isomer: <sup>1</sup>H NMR (300 MHz, CDCl<sub>3</sub>) &#x3b4; 8.24 (t, J=1.9 Hz, 1H), 8.04 (m, 1H), 7.69 (d, J=7.7 Hz, 1H), 7.47 (t, J=8.0 Hz, 1H), 6.83 (dd, J=16.3, 0.85 Hz, 1H), 6.40 (d, J=16.3 Hz, 1H), 2.06 (t, J=6.2 Hz, 2H), 1.81 (s, 3H), 1.65 (m, 2H), 1.52 (m, 2H), 1.08 (s, 6H);</p>
  • 8. HTML generated from Red Book
  • 9. Benefits of clean input
      • Patent feed text:
        Cis-2,3,6,7,12,12a-hexahydro-2-benzyl6-(4-methoxyphenyl)-pyrazino0x9b 2',1':6,1 !pyrido0x9b 3,4-b!indole-1,4-dione
        (Slide callout: "New line in middle of name!")
      • Extracted from USPTO source:
        Cis-2,3,6,7,12,12a-hexahydro-2-benzyl6-(4-methoxyphenyl)- pyrazino[2',1':6,1]pyrido[3,4-b]indole-1,4-dione
      • LeadMine entity:
        Cis-2,3,6,7,12,12a-hexahydro-2-benzyl-6-(4-methoxyphenyl)- pyrazino[2',1':6,1]pyrido[3,4-b]indole-1,4-dione
  • 10. LeadMine 2.0
      • Dictionary and grammar based general entity recogniser
      • Tokenization determined by the terms to be recognised
  • 11. Default Dictionaries

      Dictionary      Example        Size
      Molecule        benzoic acid   Infinite
      Dictionary      ranitidine     11,201
      Registry #      GW-409544      Large but finite
      CAS Number      7732-18-5      Large but finite
      Element         gold           185
      Fragment        phenyl         Infinite
      Atom Fragment   chloro         11
      Polymer         polystyrene    74
      Generic         alkane         362
      Noise           formal         16
  • 12. LeadMine 2.0
      • The dictionaries to be used are configurable, e.g. protein targets, genes, diseases, reaction names etc.
      • Matching speed is independent of dictionary size
      • Any dictionary can be used with spelling correction to match lexically close entities
  • 13. LeadMine 2.0 Configuration

      #A company registry number for a compound
      [dictionary]
      location CFDictR.cfx
      entityType R
      htmlColor #90b0ff
      caseSensitive false
      useSpellingCorrection false

      #A molecule e.g. 2-methylpyridine
      [dictionary]
      location CFDictM.cfx
      entityType M
      htmlColor violet
      enforceBracketing true
      caseSensitive false
      useSpellingCorrection true
      minimumCorrectedEntityLength 9
      maxCorrectionDistance 1

      Slide callouts:
      • Multiple dictionaries can map to the same entity type
      • Spelling correction can be adjusted on a per-dictionary basis
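One way to read the two spelling-correction settings on this slide is as thresholds on an edit-distance match. The sketch below is an illustration under that assumption and is not LeadMine's implementation: the class `CorrectionCheck`, its method names, the use of Levenshtein distance, and the decision to apply the minimum length to the dictionary term are all hypothetical. Only the values 9 and 1 come from the slide.

```java
import java.util.Locale;

class CorrectionCheck {
    // Values copied from the molecule [dictionary] block on the slide.
    static final int MIN_CORRECTED_ENTITY_LENGTH = 9;
    static final int MAX_CORRECTION_DISTANCE = 1;

    /** Levenshtein distance via two-row dynamic programming (assumed metric). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b[0..j) from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Accepts a correction only for sufficiently long dictionary terms
     * (assumption: the length limit applies to the corrected entity) that
     * lie within the maximum distance of the observed text.
     * Comparison ignores case, mirroring "caseSensitive false".
     */
    static boolean acceptsCorrection(String observed, String dictionaryTerm) {
        String o = observed.toLowerCase(Locale.ROOT);
        String t = dictionaryTerm.toLowerCase(Locale.ROOT);
        return t.length() >= MIN_CORRECTED_ENTITY_LENGTH
            && editDistance(o, t) <= MAX_CORRECTION_DISTANCE;
    }
}
```

Under these assumptions a single dropped letter in a long name such as "dimethylformamide" would be corrected, while a short name such as "methanol" would never be corrected, which limits false positives on short tokens.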
  • 14. Building Dictionaries
      • Uses the Daciuk/Mihov algorithm, which allows dictionaries with millions of entities to be built in linear time
      • Extremely large dictionaries are often smaller when compiled than the original input
      • 54 million synonyms from PubChem can be compiled into a dictionary of slightly under 1 GB in 17 minutes and 20 seconds!
      Daciuk, J.; Mihov, S.; Watson, B. W.; Watson, R. E. Incremental construction of minimal acyclic finite-state automata. Computational Linguistics 2000, 26, 3–16.
  • 15. Foreign Language Support
      • Chinese and Japanese chemical names may be rapidly converted to English as a pre-processing step
      “Translating IUPAC-like chemical nomenclature to and from simplified Chinese”, 9:10 am, Wednesday, Global Opportunities in Chemical Information
  • 16. Sample output (HTML)
  • 17. Sample output (CSV)

      "in",E,"M",1,0,"COC1=CC=C(CN2CCC(CC2)O)C=C1","1-(4-methoxybenzyl)piperidin-4-ol",
      "in",E,"M",1,0,"BrCC1=CC=C(C=C1)OC","1-(bromomethyl)-4-methoxybenzene",
      "in",E,"M",1,0,"OC1CCNCC1","4-hydroxypiperidine",
      "in",E,"G",1,0,,"brine",
      "in",E,"M",1,0,"CN(C=O)C","dimethylformamide",
      "in",E,"M",1,0,"C(C)(=O)OCC","ethyl acetate",
      "in",E,"M",1,0,"CO","methanol",
      "in",E,"M",1,0,"C(Cl)Cl","methylene chloride",
      "in",E,"G",1,0,,"silica gel",
      "in",E,"M",1,0,"S(=O)(=O)([O-])[O-].[Na+].[Na+]","sodium sulfate",
      "in",E,"M",1,0,"C(C)N(CC)CC","triethylamine",
      "in",E,"N",1,0,"O","water",
  • 18. BRAT (brat rapid annotation tool)

      T1 M 25 44 4-hydroxypiperidine
      T2 M 78 95 dimethylformamide
      T3 M 121 153 1-(bromomethyl)-4-methoxybenzene
      T4 M 178 191 triethylamine
      T5 M 404 417 ethyl acetate
      T6 N 439 444 water
      T7 G 457 462 brine
      T8 M 495 509 sodium sulfate
      T9 G 658 668 silica gel
      T10 M 675 683 methanol
      T11 M 684 702 methylene chloride
      T12 M 714 747 1-(4-methoxybenzyl)piperidin-4-ol
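The annotation lines on this slide follow brat's standoff layout: an ID, an entity type, start and end character offsets, and the covered text. A minimal sketch of reading such lines back in Java follows. The class and method names (`BratAnnotations`, `Span`, `parseLine`) are hypothetical, fields are split on any whitespace because the transcript has lost the original tabs, and discontinuous spans are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class BratAnnotations {
    /** One text-bound annotation: ID, entity type, offsets and covered text. */
    static final class Span {
        final String id;
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final String text;

        Span(String id, String type, int start, int end, String text) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
            this.text = text;
        }

        /** True when the offsets span exactly as many characters as the text. */
        boolean isConsistent() {
            return end - start == text.length();
        }
    }

    /** Parses a single line such as "T5 M 404 417 ethyl acetate". */
    static Span parseLine(String line) {
        // Limit of 5 keeps internal spaces of the covered text intact.
        String[] f = line.trim().split("\\s+", 5);
        if (f.length < 5 || !f[0].startsWith("T")) {
            throw new IllegalArgumentException("Not a text-bound annotation: " + line);
        }
        return new Span(f[0], f[1], Integer.parseInt(f[2]), Integer.parseInt(f[3]), f[4]);
    }

    /** Parses a block of annotation lines, skipping blank lines. */
    static List<Span> parse(String standoff) {
        List<Span> spans = new ArrayList<>();
        for (String line : standoff.split("\\R")) {
            if (!line.trim().isEmpty()) {
                spans.add(parseLine(line));
            }
        }
        return spans;
    }
}
```

The `isConsistent` check is a cheap sanity test: for every annotation on the slide, the end offset minus the start offset equals the length of the entity text.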
  • 19. BRAT (brat rapid annotation tool)
  • 20. PatFetch
  • 21. PatFetch
  • 22. PatFetch (cont.)
      • Recognises common USPTO grant/application number variants, e.g. 6356863 / US 6356863 / 006356863 / US 6,356,863 B1
      • Allows all USPTO patent grants and applications to be accessed as text or HTML from simple URLs, e.g. patfetch/patents/6356863.html
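The number variants listed on this slide can all be reduced to one canonical form before lookup. The following is a minimal sketch of such a normaliser, not PatFetch's own code: the class `PatentNumbers`, the method `normalize`, and the regular expression are assumptions, and only plain utility grant and application numbers are covered (design, reissue and other prefixed numbers are not).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatentNumbers {
    // Optional "US" prefix, digits with optional commas or spaces,
    // and an optional kind code such as B1 or A1.
    private static final Pattern VARIANT =
        Pattern.compile("^(?:US)?\\s*([0-9][0-9,\\s]*?)\\s*([A-Z][0-9]?)?$");

    /** Returns the bare number with separators, prefix, kind code and leading zeros removed. */
    static String normalize(String raw) {
        Matcher m = VARIANT.matcher(raw.trim().toUpperCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised patent number: " + raw);
        }
        String digits = m.group(1).replaceAll("[,\\s]", "");
        // Drop zero padding but keep at least one digit.
        return digits.replaceFirst("^0+(?=\\d)", "");
    }
}
```

With this sketch, each of the four variants on the slide maps to the same string, `6356863`, which is the form used in the example URL.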
  • 23. “Macroscopic” analysis
      • Having all patents available allows for analysis that spans the entire corpus rather than being limited to a single patent
      • Example use cases
          – Identifying the key compounds in a patent
          – Finding the first instance of a molecule in the patent literature
          – Identifying patents containing novel chemistry
  • 24. Filtering irrelevant patents
      • Most irrelevant patents can be excluded by IPC codes. These are assigned by the USPTO to classify each patent.
      • Typical pharmaceutical IPC codes
          – C07 (Organic Chemistry)
          – A61K (Preparations for medical, dental or toilet purposes)
          – A61P (Specific therapeutic activity of chemical compounds or medicinal preparations)
          – A01N (Preservation of bodies of humans or animals or plants or parts thereof)
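Filtering by these classes amounts to a prefix test on each IPC code attached to a patent. The sketch below assumes codes arrive as strings such as "C07D 401/12" and uses the four prefixes written with the digit zero (C07, A61K, A61P, A01N), matching the definition of "Pharma" given earlier in the deck. The class `PharmaIpcFilter` and its methods are hypothetical names for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PharmaIpcFilter {
    // Prefixes from the deck's definition of "Pharma": C07*, A61K, A61P and A01N.
    static final List<String> PHARMA_PREFIXES = Arrays.asList("C07", "A61K", "A61P", "A01N");

    /** True when a single IPC code falls under one of the pharma prefixes. */
    static boolean isPharmaCode(String ipcCode) {
        // Remove spacing so "C07D 401/12" and "C07D401/12" compare alike.
        String code = ipcCode.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        for (String prefix : PHARMA_PREFIXES) {
            if (code.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** A patent is kept when at least one of its IPC codes is a pharma code. */
    static boolean isRelevant(List<String> ipcCodes) {
        for (String code : ipcCodes) {
            if (isPharmaCode(code)) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping a patent when any one of its codes matches, rather than requiring all of them to match, is a design choice made here so that patents classified under both chemistry and another field are not dropped.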
  • 25. Finding the first mention of a compound
      • The trivial name of a compound often won’t be present in the first patent that describes its synthesis
      • Brand name Fabior, approved May 11, 2012
      • Generic name Tazarotene
      • First mentioned in US05023341
  • 26. Finding the first mention of a compound
      • Brand name Erivedge, approved January 30, 2012
      • Generic name Vismodegib
      • First mentioned in US20060063779A1
  • 27. Novel compounds per patent
  • 28. Novel compounds per patent
  • 29. Awareness of novel scaffolds
      Pitt, W. R.; Parry, D. M.; Perry, B. G.; Groom, C. R. Heteroaromatic rings of the future. Journal of Medicinal Chemistry 2009, 52, 2952–2963.
  • 30. Awareness of novel scaffolds US46041346, 1983
  • 31. Awareness of novel scaffolds US20040038959A1
  • 32. Rate of novel scaffold discovery [Chart: novel scaffolds per year, y-axis 0 to 600; series: 6 atom ring fused to 5 atom ring, and 6 atom ring fused to 6 atom ring.] *year not yet complete
  • 33. Solvent Occurrences Extracted from reactions in 2008-2011 USPTO patent applications
  • 34. Conclusions
      • Getting clean text from patents is an important starting point
      • LeadMine offers a highly configurable environment for performing entity extraction
      • Comprehensive coverage of the patent literature can assist in identifying the interesting aspects of a new patent
  • 35. Acknowledgements
      • Sorel Muresan and Paul Hongxing Xie, AstraZeneca.
      • Nicko Goncheroff, SureChem/Digital Science.
      • Colin Batchelor, Royal Society of Chemistry.
      • Peter Loew and Heinz Saller, InfoChem.
      • Pat Walters, Vertex Pharmaceuticals.
      • Thank you for your time.
  • 37. Simple Java API

      // Create the engine and run it over a piece of text
      ExtractEngine engine = new ExtractEngine();
      EntityCollector collector = engine.processString("text to analyse");

      // Each recognised entity exposes its text, its type and its begin/end positions
      List<Entity> foundEntities = collector.getEntities();
      for (Entity entity : foundEntities) {
          entity.getText();
          entity.getEntityType();
          entity.getBeg();
          entity.getEnd();
      }