Integrating patent chemistry with public and private non-patent research resources

Nicko Goncharoff
Andrew Hinton, PhD
Christopher Southan, PhD

ACS Fall 2012, 19 August

SureChem Data Collection

Database of automatically mined structure data from text and images

• 20M annotated US, EP, WO full text records and Japan patent abstracts
• 12M unique chemical structures
• MEDLINE – 19M abstracts (coming Q4)

ª  Free resource for researchers!         ª  Professional search needs!
ª  Enables linking to public and          ª  Data export, alerts, patent family
    proprietary content                        search, chemical relevance filters…!




                           ª  API or Data Feed access to
                               chemistry & full text!
                           ª  Integrate with internal
                               databases & workflows
Chemistry Mining Workflow

Public Patent Chemistry Landscape

Current Patent Sources in PubChem

[Bar chart: numbers of SIDs by patent source]
  EPO (Sling)        10 K
  Chemicalize.org    280 K
  IBM                2.3 M
  Thomson Pharma     3.7 M

Patent & Literature Sources in PubChem: The Big Three

[Venn diagram of three sources]
  Thomson Pharma, patents and literature    3,756,283    41% lead-like
  ChEMBL + PubMed + Journals                  918,077    45% lead-like
  IBM, pre-2000 patents                     2,369,481    32% lead-like

  Region counts shown in the diagram: 3,291,940; 281,920; 515,745; 52,975; 129,448; 67,437; 2,113,169

SureChem to Deposit All Structures* into PubChem - 2012

• 1976 to present
• Deposition of structures only
• View related patents in SureChemOpen
• *Some filtering of common chemistry likely

SureChem and IBM in PubChem (2 Example Patents)

US583593, Inhibitors of squalene synthetase and protein farnesyltransferase, Abbott
  SureChem Total: 776    IBM Total: 527
  SureChem only: 478     Shared: 298     IBM only: 229

WO-1994018188-A1, 4-hydroxy-benzopyran-2-ones and 4-hydroxy-cycloalkyl[b]pyran-2-ones HIV protease inhibitors, Upjohn
  SureChem Total: 832    IBM Total: 239
  SureChem only: 686     Shared: 146     IBM only: 93

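The three numbers in each diagram above are consistent with a two-set partition: structures found only in SureChem, in both sources, and only in IBM. A minimal Python sketch of that comparison follows, assuming each source has been reduced to a set of structure identifiers. The function name `compare_sources` and the toy identifiers are hypothetical and do not come from the slides.

```python
def compare_sources(a, b):
    """Partition two collections of structure identifiers into
    (only in a, shared, only in b) counts."""
    a, b = set(a), set(b)
    return len(a - b), len(a & b), len(b - a)


# Toy identifiers (hypothetical); real input might be InChIKeys per patent.
surechem = {"S1", "S2", "S3", "X1", "X2"}
ibm = {"X1", "X2", "I1"}

only_sc, shared, only_ibm = compare_sources(surechem, ibm)
print(only_sc, shared, only_ibm)  # 3 2 1

# The slide's counts fit the same shape: region counts sum to the totals.
assert 478 + 298 == 776 and 298 + 229 == 527   # first example patent
assert 686 + 146 == 832 and 146 + 93 == 239    # WO-1994018188-A1
```

Comparing on a canonical identifier rather than on drawn structures is a design assumption here; the slides do not state how the overlap was computed.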
Identifying Relevant Chemistry - IC50

US-20120035195-A1, BACE2, Hoffmann-La Roche

Structures with IC50 Values

US-20120035195-A1 (PDF, SureChemOpen, Excel)

Search IC50 Structures in PubChem

SureChem Unique Contribution

[Venn diagram: SureChem 79; PubChem (Thomson Pharma, Chemicalize) 96]

Stage                              No. of Structures
Available from SureChem (SC)       1848
Pre-Exist in PubChem               669
Pre-Exist – not from IC50 table    573
Pre-Exist – from IC50 table        96 (12 from TP + 84 via chemicalize.org)
Unique-SC with IC50                79
Unique-SC – beyond IC50 table      1100

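The rows of the table reconcile by simple addition and subtraction, which the following Python sketch checks. The variable names are illustrative; the input numbers come from the slide, and the two derived figures at the end are computed here and are not stated on the slide.

```python
# Counts taken from the table for US-20120035195-A1.
available_sc = 1848          # structures available from SureChem
pre_exist = 669              # already present in PubChem
pre_exist_not_ic50 = 573
pre_exist_ic50 = 96          # 12 from TP + 84 via chemicalize.org
unique_sc_ic50 = 79
unique_sc_beyond = 1100

# Pre-existing structures split into IC50-table and non-IC50-table parts.
assert pre_exist_not_ic50 + pre_exist_ic50 == pre_exist
assert 12 + 84 == pre_exist_ic50

# Everything not pre-existing is unique to SureChem.
unique_sc = available_sc - pre_exist
assert unique_sc == unique_sc_ic50 + unique_sc_beyond

# Derived figures (computed here, not stated on the slide).
ic50_total = pre_exist_ic50 + unique_sc_ic50
print(unique_sc, ic50_total)                  # 1179 175
print(f"{unique_sc_ic50 / ic50_total:.0%}")   # 45%
```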
Identifying Relevant Chemistry

Patent US-20120035195-A1

http://opentox.informatik.uni-freiburg.de/ches-mapper/

SureChem Chemical Relevance Filtering

• Frequency counts of chemicals within patents
• Additional molecular property filtering, i.e. Lipinski descriptors
• Natural Language Processing-based indexing of Exemplified Compounds

Automated indexing of Exemplified Compounds in text

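The first two filters above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SureChem's implementation: descriptor values are assumed to be precomputed, the `Compound` fields and the `max_patents` cutoff are hypothetical, and "frequency" is read here as the number of patents a compound appears in, so that ubiquitous chemicals such as solvents are dropped. The thresholds are the standard Lipinski rule-of-five limits.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    # Hypothetical record; descriptor values are assumed to be precomputed.
    name: str
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def passes_lipinski(c, max_violations=1):
    """Rule of five: MW <= 500, logP <= 5, H-bond donors <= 5, acceptors <= 10."""
    violations = sum([
        c.mol_weight > 500,
        c.logp > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])
    return violations <= max_violations


def patent_frequency(patents):
    """Count how many patents each compound name appears in."""
    return Counter(name for names in patents.values() for name in set(names))


def relevant(compounds, patents, max_patents):
    """Keep compounds that are not ubiquitous and that pass the property filter."""
    freq = patent_frequency(patents)
    return [c.name for c in compounds
            if freq[c.name] <= max_patents and passes_lipinski(c)]


# Toy data (hypothetical values chosen for illustration only).
compounds = [
    Compound("solvent", 32.0, -0.8, 1, 1),
    Compound("lead", 420.0, 3.1, 2, 6),
    Compound("greasy", 780.0, 7.5, 1, 12),
]
patents = {
    "P1": ["solvent", "lead"],
    "P2": ["solvent", "greasy"],
    "P3": ["solvent"],
}
print(relevant(compounds, patents, max_patents=2))  # ['lead']
```

In the toy data, "solvent" is removed by the frequency cutoff and "greasy" by the property filter. In practice the descriptors would come from a cheminformatics toolkit rather than being entered by hand.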
Conclusion

SureChem deposition into PubChem will:

  – Significantly expand public patent chemistry scope
  – Contribute unique and timely MedChem-relevant data
  – Enable open drug discovery and chemical biology
  – Advance progress toward a more open, federated chemical information network
