SlideShare a Scribd company logo
Patent Cheminformatics

Identification of key compounds in patents


3rd Strasbourg Summer School on Chemoinformatics
Strasbourg, France, 25-29 June 2012


Dr. Sorel Muresan

AstraZeneca R&D Mölndal
Outlook



Patents, brief into

Sources for accessing full text patents

Compound extraction from patents

Key compound prediction
What is a patent?

A patent   application is an agreement between
inventor and state, allowing an inventor a monopoly
over their invention for a limited time. In the EU,
applicants are required to disclose their inventions
in a manner sufficiently clear and complete for them to be
carried out by a person skilled in the art. In the United
States, inventors are additionally required to include the
‘best mode’ of making or practicing the invention.
Patents are very interesting documents

 Three reasons why life scientists should read patents more
   frequently

      1. Some information appears earlier in patents than in scientific
         journals

      2. Patents may contain sound data that never appear in the
         literature.

      3. Patents are a source of hard-to-get information from
         commercial suppliers.




Seeber, F. Nat. Protocols 2007
Patents as pharmaceutical data source
Complementary between journals and patents
 “In certain fields, new advances are disclosed in patents long before
 they are published in peer-reviewed journals.”         Grubb. W.P.

                 “Novel Cannabinoid-1 Receptor Inverse Agonist for the Treatment of Obesity”
                                                         modulates

                                                                     CNR1


  Patent application    Patent publication            Journal publication
      Nov 2002              Mar 2004                      Dec 2006


                  ~18 months                 2.5 years
  USPTO patents (PN: US20040058820)                                            (PMID: 17181138)
Unique chemistry from patents
   Data from AstraZeneca’s Chemistry Connect




               ~6% of compound structures exemplified in patents
               were also published in journal articles


Muresan, S. et al. DDT 2011
Southan, C. et al J.Cheminfo. 2009
Anatomy of a patent

   Front page - contains a wealth of information about
     the patent

   Detailed description of the invention - the heart of a
     patent application. It generally describes one or more
     preferred embodiments of the invention in enough detail
     to enable someone of ordinary skill in the art to make or
     use the invention without having to resort to undue
     experimentation

   Claims - the most important part of the patent application.
     Define the scope of patent protection afforded to the
     owner of a patent.

C.P. Miller, M.J. Evans, The Chemist's Companion Guide to Patent Law, 2011
Searching for patents

Official public portals




Open full-text




Structure -> Patents & Patent -> Structure
Manually extracted SAR data (commercial)

• GOSTAR (GVKBIO Online Structure Activity Relationship Database) is a
  comprehensive database that captures explicit relationships between the three
  entities of publications, compounds and sequences.

• It includes 2.6 million compounds linked to 3,500 sequences with 12.5M SAR
  points extracted from 43,000 patents and 67,000 articles from 125 journals
Annotate patents manually
Brat – rapid annotation tool (http://brat.nlplab.org/index.html)
Annotate patents manually
Brat – rapid annotation tool (http://brat.nlplab.org/index.html)
Name-2-structure
OPSIN – http://opsin.ch.cam.ac.uk/
Chemical Named Entity Recognition (CNER)



                      N-Ethyl-p-menthane-3-carboxamide

                                         Name-to-Structure
                                         software



                       C(C)NC(=O)C1CC(CCC1C(C)C)C
                                     O


                                          N
                                          H




                     WS3 cooling agent (CAS 39711-79-0)
Compound extraction via chemical text mining
Traditional text mining pipeline




• Determining the start and end of IUPAClike names in free text is a tricky
  business.
• Chemical names can contain whitespace, hyphens, commas,
  parenthesis, brackets, braces, apostrophes, superscripts, greek
  characters, digits and periods.
• This is made harder still by typos, OCR errors, hyphenation, linefeeds,
  XML tags, line and page numbers and similar noise.
OCR Errors: Compound Names


• lH-ben zimidazole → 1H-benzimidazole

• 4- (2-ADAMANTYLCARBAM0YL) -5-TERT-BUTYL-
  PYRAZOL-1-YL] BENZOIC ACID →
  4-(2-adamantylcarbamoyl)-5-tert-butyl-pyrazol-1-
  yl]benzoic acid

• triphenylposhine → triphenylphosphine
Text Mining Conversion by Name Class
                                           ChemAxon 5.5
  Class   Category             Names       converts 60%
                                         n2s_1 (%) n2s_2 (%) n2s_3 (%)   None (%)
   M      Molecule           7,262,798       81.4       64.8      77.1        7.8
   D      Dictionary           26,876        38.1       45.1       3.5       38.5
   R      Registry number     304,064           0         0         0        100
   C      CAS number           47,815           0         0         0        100
   E      Element                 836        92.8     58.5      78.7          3.2
                                           NCI/CADD Chemical Identifier Resolver
   P      Fragment           2,663,677       72.9     58.9         0         19.8
                                                     converts 48%
   A      Atom fragment            96        90.6     36.5         0          6.3
   Y      Polymer                 295           0       44.1      22.7       36.9
   G      Generic                1263         2.6        6.3       0.5       91.9
   N      Noise                   104        32.7        24       19.2       52.9
          Total             10,307,824       76.3       61.0      54.3       14.1



Muresan, S. ChemAxon UGM 2011
Extraction compounds from US20100221398
Google Patents + Chemicalize.org (ChemAxon)
Extraction compounds from US20100221398
Google Patents + Chemicalize.org (ChemAxon)
What is a key compound?


Drug candidate

Compound(s) with optimal physicochemical properties

Compouns(s) with the most suitable pharmacokinetic
 profile

The most biologically active tool or probe
Key compounds from patents

”Old school” techniques
      Look for clues in text: “most preferred” wording
      in claims; crystal form info; scale of synthesis

Structural information alone
      Frequency of group (FOG) analysis of
      exemplified compounds

Structures and SAR data
      Work out SAR using biological data and
      structures
Exercise 2b: Predicting Key Example Compounds
Key compound prediction from patents




 Figure 2. Graphical image of patent example compounds in chemical space. Each gray circle represents an example
 compound. The black circle, square, and triangle represent key compound candidates.



     Theory: Chemists carry out extensive SAR around key compounds.
     Cluster examples and look for centres of densely populated regions


 Kazunari Hattori; Hiroaki Wakabayashi; Kenta Tamaki; J. Chem. Inf. Model. 2008, 48, 135-142.
 DOI: 10.1021/ci7002686
 Copyright © 2008 American Chemical Society
Key compound prediction from patents
  From WO1996025405 the earliest patent which claims it, can you
  work out the structure of Bextra (Valdecoxib), the Pfizer NSAID?




74 exemplified cmpds
R-group decomposition (Free-Wilson like)
FOG ranking
Key compound prediction from patents
Frequency of group (FOG) analysis


1) Extract compounds from patents
   • manual curation (GVKBIO)
   • text mining (SureChemOpen, OSCAR)

2) Define core
   • manually (e.g. from patent Markush)
   • automated core perception

3) R-group decomposition (ChemAxon’s JChem)

4) Rank compounds based on FOG (Spotfire, EXCEL)
WO1996025405 - Bextra




 Source     #compounds   Bextra   Bextra
                         exists   ranked
 GVKBIO     74           Y        1 (broad core)
                                  1 (narrow core)
 SureChem   501          Y        1 (broad core)
                                  1 (narrow core)
EP268956 - Aciphex




  Source     #compounds   Aciphex   Aciphex
                          exists    ranked
  GVKBIO     27           Y         2 (core1)
                                    1 (core2)
  SureChem   168          Y         1 (core1)
                                    1 (core2)
EP268956 - Aciphex
EP296749 - Aricept




  Source     #compounds   Aricept   Aricept
                          exists    ranked
  GVKBIO     76           Y         1

  SureChem   108          Y         1
EP296749 - Arimidex




  Source     #compounds   Arimidex   Arimidex
                          exists     ranked
  GVKBIO     70           Y          1

  SureChem   267          Y          1
WO1989006649 - Raxar




X is CH, CF, CCI, or N.


  Source             #compounds   Raxar    Raxar
                                  exists   ranked
  GVKBIO             102          Y        5

  SureChem           0            N        n/a
US3912743 – PAXIL




 Source    #compounds   Paxil    Paxil
                        exists   ranked
 GVKBIO    60           Y        54
US3912743 – PAXIL
Key compound prediction
 48 drug patents, with more than 10 compounds, including the drug


Method           Key compound   Key compound
                 as the first   within the first 5
                 ranked         ranked
                 compound       compounds

Cluster seed     11 (23%)       26 (54%)
analysis

Molecular Idol   5 (10%)        22 (46%)


Frequency of     11 (23%)       17 (35%)
group analysis
JCIM paper
http://dx.doi.org/10.1021/ci3001293
Summary



• Unique chemistry from patents (8% out of 55M parent
  structures in AstraZeneca’s Chemistry Connect)

• Public sources for full text patent access and open
  source software for chemical text mining

• Frequency of group analysis is a simple process to
  visualize and rank patent compounds
Acknowledgements

•   Jon Winter
•   Christian Tyrchan
•   Jonas Boström
•   Mats Eriksson
•   Paul Xie
•   Chris Southan

• GVKBIO
• IBM
• SureChem

• Next Move Software
• ChemAxon
• Unilever Centre for Molecular
  Science Informatics

More Related Content

Viewers also liked

presentation on NANO-TECH REGENERATIVE FUEL CELL
presentation on NANO-TECH REGENERATIVE FUEL CELLpresentation on NANO-TECH REGENERATIVE FUEL CELL
presentation on NANO-TECH REGENERATIVE FUEL CELL
cutejuhi
 
Fuel cell seminar
Fuel cell seminarFuel cell seminar
Fuel cell seminar
Bipin Gupta
 
Nano Tech Fuel Presentation
Nano Tech Fuel PresentationNano Tech Fuel Presentation
Nano Tech Fuel Presentation
CaptNano
 
Nano fuel cell
Nano fuel cellNano fuel cell
Nano fuel cell
Raj Kumar
 
Polyfuse
PolyfusePolyfuse
Fuel Cells
Fuel CellsFuel Cells
Fuel Cells
Ankit Saxena
 
Electric drives
Electric drivesElectric drives
Electric drives
Samsu Deen
 
PatSeer Introduction
PatSeer IntroductionPatSeer Introduction
PatSeer Introduction
Gridlogics
 
Fuel cells presentation
Fuel cells presentationFuel cells presentation
Fuel cells presentation
Vaibhav Chavan
 
Power electronic drives ppt
Power electronic drives pptPower electronic drives ppt
Power electronic drives ppt
Sai Manoj
 
Fuel cells
Fuel cellsFuel cells
Fuel cells
Santosh Damkondwar
 
Electrical ac & dc drives ppt
Electrical ac & dc drives pptElectrical ac & dc drives ppt
Electrical ac & dc drives ppt
chakri218
 
POWER ELECTRONIC DEVICES
POWER ELECTRONIC DEVICESPOWER ELECTRONIC DEVICES
POWER ELECTRONIC DEVICES
shazaliza
 
Electric drives
Electric drivesElectric drives
Electric drives
raj_e2004
 

Viewers also liked (14)

presentation on NANO-TECH REGENERATIVE FUEL CELL
presentation on NANO-TECH REGENERATIVE FUEL CELLpresentation on NANO-TECH REGENERATIVE FUEL CELL
presentation on NANO-TECH REGENERATIVE FUEL CELL
 
Fuel cell seminar
Fuel cell seminarFuel cell seminar
Fuel cell seminar
 
Nano Tech Fuel Presentation
Nano Tech Fuel PresentationNano Tech Fuel Presentation
Nano Tech Fuel Presentation
 
Nano fuel cell
Nano fuel cellNano fuel cell
Nano fuel cell
 
Polyfuse
PolyfusePolyfuse
Polyfuse
 
Fuel Cells
Fuel CellsFuel Cells
Fuel Cells
 
Electric drives
Electric drivesElectric drives
Electric drives
 
PatSeer Introduction
PatSeer IntroductionPatSeer Introduction
PatSeer Introduction
 
Fuel cells presentation
Fuel cells presentationFuel cells presentation
Fuel cells presentation
 
Power electronic drives ppt
Power electronic drives pptPower electronic drives ppt
Power electronic drives ppt
 
Fuel cells
Fuel cellsFuel cells
Fuel cells
 
Electrical ac & dc drives ppt
Electrical ac & dc drives pptElectrical ac & dc drives ppt
Electrical ac & dc drives ppt
 
POWER ELECTRONIC DEVICES
POWER ELECTRONIC DEVICESPOWER ELECTRONIC DEVICES
POWER ELECTRONIC DEVICES
 
Electric drives
Electric drivesElectric drives
Electric drives
 

Similar to Patent Cheminformatics: Identification of key compounds in patents

Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...
Chris Southan
 
SureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSSureChEMBL and Open PHACTS
SureChEMBL and Open PHACTS
George Papadatos
 
Integrating Patents with Research Data
Integrating Patents with Research DataIntegrating Patents with Research Data
Integrating Patents with Research Data
Chris Southan
 
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Hitesh Patel
 
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
NextMove Software
 
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
Dr. Haxel Consult
 
Introduction to Nanotechnology: Part 3
Introduction to Nanotechnology: Part 3Introduction to Nanotechnology: Part 3
Introduction to Nanotechnology: Part 3
glennfish
 
foglar book.pdf
foglar book.pdffoglar book.pdf
foglar book.pdf
BalqeesMustafa
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
Greg Landrum
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Kamel Mansouri
 
Fragment Based Drug Discovery
Fragment Based Drug DiscoveryFragment Based Drug Discovery
Fragment Based Drug Discovery
Anthony Coyne
 
Cheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural ProductsCheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural Products
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Tautomers - Advanced Databases for in-silico Screening?
Tautomers - Advanced Databases for in-silico Screening?Tautomers - Advanced Databases for in-silico Screening?
Tautomers - Advanced Databases for in-silico Screening?
Frank Oellien
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
Sean Ekins
 
Patent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTSPatent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTS
open_phacts
 
Virtual screening of chemicals for endocrine disrupting activity through CER...
Virtual screening of chemicals for endocrine disrupting activity through  CER...Virtual screening of chemicals for endocrine disrupting activity through  CER...
Virtual screening of chemicals for endocrine disrupting activity through CER...
Kamel Mansouri
 
Chemical abstracts ii
Chemical abstracts   iiChemical abstracts   ii
Chemical abstracts ii
ramiah valliappan
 
Polymer Deformulation of a Medical Device case study
Polymer Deformulation of a Medical Device case studyPolymer Deformulation of a Medical Device case study
Polymer Deformulation of a Medical Device case study
Jordi Labs
 
Checking, Curating And Qualifying Chemistry
Checking, Curating And Qualifying ChemistryChecking, Curating And Qualifying Chemistry

Similar to Patent Cheminformatics: Identification of key compounds in patents (20)

Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...Causes and consequences of automated extraction of patent-specified virtual d...
Causes and consequences of automated extraction of patent-specified virtual d...
 
SureChEMBL and Open PHACTS
SureChEMBL and Open PHACTSSureChEMBL and Open PHACTS
SureChEMBL and Open PHACTS
 
Integrating Patents with Research Data
Integrating Patents with Research DataIntegrating Patents with Research Data
Integrating Patents with Research Data
 
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
 
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
 
Introduction to Nanotechnology: Part 3
Introduction to Nanotechnology: Part 3Introduction to Nanotechnology: Part 3
Introduction to Nanotechnology: Part 3
 
foglar book.pdf
foglar book.pdffoglar book.pdf
foglar book.pdf
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
 
Fragment Based Drug Discovery
Fragment Based Drug DiscoveryFragment Based Drug Discovery
Fragment Based Drug Discovery
 
Cheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural ProductsCheminformatics and the Structure Elucidation of Natural Products
Cheminformatics and the Structure Elucidation of Natural Products
 
Tautomers - Advanced Databases for in-silico Screening?
Tautomers - Advanced Databases for in-silico Screening?Tautomers - Advanced Databases for in-silico Screening?
Tautomers - Advanced Databases for in-silico Screening?
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
 
Patent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTSPatent annotations: From SureChEMBL to Open PHACTS
Patent annotations: From SureChEMBL to Open PHACTS
 
Virtual screening of chemicals for endocrine disrupting activity through CER...
Virtual screening of chemicals for endocrine disrupting activity through  CER...Virtual screening of chemicals for endocrine disrupting activity through  CER...
Virtual screening of chemicals for endocrine disrupting activity through CER...
 
Chemical abstracts ii
Chemical abstracts   iiChemical abstracts   ii
Chemical abstracts ii
 
Polymer Deformulation of a Medical Device case study
Polymer Deformulation of a Medical Device case studyPolymer Deformulation of a Medical Device case study
Polymer Deformulation of a Medical Device case study
 
Checking, Curating And Qualifying Chemistry
Checking, Curating And Qualifying ChemistryChecking, Curating And Qualifying Chemistry
Checking, Curating And Qualifying Chemistry
 

More from Sorel Muresan

Narrow Range Ethoxylates - Highly targeted performance for more effective cle...
Narrow Range Ethoxylates - Highly targeted performance for more effective cle...Narrow Range Ethoxylates - Highly targeted performance for more effective cle...
Narrow Range Ethoxylates - Highly targeted performance for more effective cle...
Sorel Muresan
 
Multifunctional hydrotropes
Multifunctional hydrotropesMultifunctional hydrotropes
Multifunctional hydrotropes
Sorel Muresan
 
Thickening with cationic surfactants
Thickening with cationic surfactantsThickening with cationic surfactants
Thickening with cationic surfactants
Sorel Muresan
 
Challenges with CLP and solutions offered by AkzoNobel
Challenges with CLP and solutions offered by AkzoNobelChallenges with CLP and solutions offered by AkzoNobel
Challenges with CLP and solutions offered by AkzoNobel
Sorel Muresan
 
Berol® ENV226 Plus - a formulator’s delight
Berol® ENV226 Plus - a formulator’s delightBerol® ENV226 Plus - a formulator’s delight
Berol® ENV226 Plus - a formulator’s delight
Sorel Muresan
 
Chemistry Connect
Chemistry ConnectChemistry Connect
Chemistry Connect
Sorel Muresan
 
Getting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dotsGetting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dots
Sorel Muresan
 
Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...
Sorel Muresan
 

More from Sorel Muresan (8)

Narrow Range Ethoxylates - Highly targeted performance for more effective cle...
Narrow Range Ethoxylates - Highly targeted performance for more effective cle...Narrow Range Ethoxylates - Highly targeted performance for more effective cle...
Narrow Range Ethoxylates - Highly targeted performance for more effective cle...
 
Multifunctional hydrotropes
Multifunctional hydrotropesMultifunctional hydrotropes
Multifunctional hydrotropes
 
Thickening with cationic surfactants
Thickening with cationic surfactantsThickening with cationic surfactants
Thickening with cationic surfactants
 
Challenges with CLP and solutions offered by AkzoNobel
Challenges with CLP and solutions offered by AkzoNobelChallenges with CLP and solutions offered by AkzoNobel
Challenges with CLP and solutions offered by AkzoNobel
 
Berol® ENV226 Plus - a formulator’s delight
Berol® ENV226 Plus - a formulator’s delightBerol® ENV226 Plus - a formulator’s delight
Berol® ENV226 Plus - a formulator’s delight
 
Chemistry Connect
Chemistry ConnectChemistry Connect
Chemistry Connect
 
Getting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dotsGetting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dots
 
Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...
 

Recently uploaded

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 

Recently uploaded (20)

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 

Patent Cheminformatics: Identification of key compounds in patents

  • 1. Patent Cheminformatics Identification of key compounds in patents 3rd Strasbourg Summer School on Chemoinformatics Strasbourg, France, 25-29 June 2012 Dr. Sorel Muresan AstraZeneca R&D Mölndal
  • 2. Outlook Patents, brief into Sources for accessing full text patents Compound extraction from patents Key compound prediction
  • 3. What is a patent? A patent application is an agreement between inventor and state, allowing an inventor a monopoly over their invention for a limited time. In the EU, applicants are required to disclose their inventions in a manner sufficiently clear and complete for them to be carried out by a person skilled in the art. In the United States, inventors are additionally required to include the ‘best mode’ of making or practicing the invention.
  • 4. Patents are very interesting documents Three reasons why life scientists should read patents more frequently 1. Some information appears earlier in patents than in scientific journals 2. Patents may contain sound data that never appear in the literature. 3. Patents are a source of hard-to-get information from commercial suppliers. Seeber, F. Nat. Protocols 2007
  • 5. Patents as pharmaceutical data source Complementary between journals and patents “In certain fields, new advances are disclosed in patents long before they are published in peer-reviewed journals.” Grubb. W.P. “Novel Cannabinoid-1 Receptor Inverse Agonist for the Treatment of Obesity” modulates CNR1 Patent application Patent publication Journal publication Nov 2002 Mar 2004 Dec 2006 ~18 months 2.5 years USPTO patents (PN: US20040058820) (PMID: 17181138)
  • 6. Unique chemistry from patents Data from AstraZeneca’s Chemistry Connect ~6% of compound structures exemplified in patents were also published in journal articles Muresan, S. et al. DDT 2011 Southan, C. et al J.Cheminfo. 2009
  • 7. Anatomy of a patent Front page - contains a wealth of information about the patent Detailed description of the invention - the heart of a patent application. It generally describes one or more preferred embodiments of the invention in enough detail to enable someone of ordinary skill in the art to make or use the invention without having to resort to undue experimentation Claims - the most important part of the patent application. Define the scope of patent protection afforded to the owner of a patent. C.P. Miller, M.J. Evans, The Chemist's Companion Guide to Patent Law, 2011
  • 8.
  • 9.
  • 10.
  • 11. Searching for patents Official public portals Open full-text Structure -> Patents & Patent -> Structure
  • 12. Manually extracted SAR data (commercial) • GOSTAR (GVKBIO Online Structure Activity Relationship Database) is a comprehensive database that captures explicit relationships between the three entities of publications, compounds and sequences. • It includes 2.6 million compounds linked to 3,500 sequences with 12.5M SAR points extracted from 43,000 patents and 67,000 articles from 125 journals
  • 13. Annotate patents manually Brat – rapid annotation tool (http://brat.nlplab.org/index.html)
  • 14. Annotate patents manually Brat – rapid annotation tool (http://brat.nlplab.org/index.html)
  • 16. Chemical Named Entity Recognition (CNER) N-Ethyl-p-menthane-3-carboxamide Name-to-Structure software C(C)NC(=O)C1CC(CCC1C(C)C)C O N H WS3 cooling agent (CAS 39711-79-0)
  • 17. Compound extraction via chemical text mining Traditional text mining pipeline • Determining the start and end of IUPAClike names in free text is a tricky business. • Chemical names can contain whitespace, hyphens, commas, parenthesis, brackets, braces, apostrophes, superscripts, greek characters, digits and periods. • This is made harder still by typos, OCR errors, hyphenation, linefeeds, XML tags, line and page numbers and similar noise.
  • 18. OCR Errors: Compound Names • lH-ben zimidazole → 1H-benzimidazole • 4- (2-ADAMANTYLCARBAM0YL) -5-TERT-BUTYL- PYRAZOL-1-YL] BENZOIC ACID → 4-(2-adamantylcarbamoyl)-5-tert-butyl-pyrazol-1- yl]benzoic acid • triphenylposhine → triphenylphosphine
  • 19. Text Mining Conversion by Name Class ChemAxon 5.5 Class Category Names converts 60% n2s_1 (%) n2s_2 (%) n2s_3 (%) None (%) M Molecule 7,262,798 81.4 64.8 77.1 7.8 D Dictionary 26,876 38.1 45.1 3.5 38.5 R Registry number 304,064 0 0 0 100 C CAS number 47,815 0 0 0 100 E Element 836 92.8 58.5 78.7 3.2 NCI/CADD Chemical Identifier Resolver P Fragment 2,663,677 72.9 58.9 0 19.8 converts 48% A Atom fragment 96 90.6 36.5 0 6.3 Y Polymer 295 0 44.1 22.7 36.9 G Generic 1263 2.6 6.3 0.5 91.9 N Noise 104 32.7 24 19.2 52.9 Total 10,307,824 76.3 61.0 54.3 14.1 Muresan, S. ChemAxon UGM 2011
  • 20. Extraction compounds from US20100221398 Google Patents + Chemicalize.org (ChemAxon)
  • 21. Extraction compounds from US20100221398 Google Patents + Chemicalize.org (ChemAxon)
  • 22. What is a key compound? Drug candidate Compound(s) with optimal physicochemical properties Compouns(s) with the most suitable pharmacokinetic profile The most biologically active tool or probe
  • 23. Key compounds from patents ”Old school” techniques Look for clues in text: “most preferred” wording in claims; crystal form info; scale of synthesis Structural information alone Frequency of group (FOG) analysis of exemplified compounds Structures and SAR data Work out SAR using biological data and structures
  • 24. Exercise 2b: Predicting Key Example Compounds Key compound prediction from patents Figure 2. Graphical image of patent example compounds in chemical space. Each gray circle represents an example compound. The black circle, square, and triangle represent key compound candidates. Theory: Chemists carry out extensive SAR around key compounds. Cluster examples and look for centres of densely populated regions Kazunari Hattori; Hiroaki Wakabayashi; Kenta Tamaki; J. Chem. Inf. Model. 2008, 48, 135-142. DOI: 10.1021/ci7002686 Copyright © 2008 American Chemical Society
  • 25. Key compound prediction from patents From WO1996025405 the earliest patent which claims it, can you work out the structure of Bextra (Valdecoxib), the Pfizer NSAID? 74 exemplified cmpds
  • 28. Key compound prediction from patents Frequency of group (FOG) analysis 1) Extract compounds from patents • manual curation (GVKBIO) • text mining (SureChemOpen, OSCAR) 2) Define core • manually (e.g. from patent Markush) • automated core perception 3) R-group decomposition (ChemAxon’s JChem) 4) Rank compounds based on FOG (Spotfire, EXCEL)
  • 29. WO1996025405 - Bextra Source #compounds Bextra Bextra exists ranked GVKBIO 74 Y 1 (broad core) 1 (narrow core) SureChem 501 Y 1 (broad core) 1 (narrow core)
  • 30. EP268956 - Aciphex Source #compounds Aciphex Aciphex exists ranked GVKBIO 27 Y 2 (core1) 1 (core2) SureChem 168 Y 1 (core1) 1 (core2)
  • 32. EP296749 - Aricept Source #compounds Aricept Aricept exists ranked GVKBIO 76 Y 1 SureChem 108 Y 1
  • 33. EP296749 - Arimidex Source #compounds Arimidex Arimidex exists ranked GVKBIO 70 Y 1 SureChem 267 Y 1
  • 34. WO1989006649 - Raxar X is CH, CF, CCI, or N. Source #compounds Raxar Raxar exists ranked GVKBIO 102 Y 5 SureChem 0 N n/a
  • 35. US3912743 – PAXIL Source #compounds Paxil Paxil exists ranked GVKBIO 60 Y 54
  • 37. Key compound prediction 48 drug patents, with more than 10 compounds, including the drug Method Key compound Key compound as the first within the first 5 ranked ranked compound compounds Cluster seed 11 (23%) 26 (54%) analysis Molecular Idol 5 (10%) 22 (46%) Frequency of 11 (23%) 17 (35%) group analysis
  • 39. Summary • Unique chemistry from patents (8% out of 55M parent structures in AstraZeneca’s Chemistry Connect) • Public sources for full text patent access and open source software for chemical text mining • Frequency of group analysis is a simple process to visualize and rank patent compounds
  • 40. Acknowledgements • Jon Winter • Christian Tyrchan • Jonas Boström • Mats Eriksson • Paul Xie • Chris Southan • GVKBIO • IBM • SureChem • Next Move Software • ChemAxon • Unilever Centre for Molecular Science Informatics