Exploring SAR between Patents and PubChem

767 views

Published on

2012 ChemAxon users meeting

Published in: Health & Medicine, Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
767
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
12
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Exploring SAR between Patents and PubChem

  1. 1. Using Chemicalize.org with Other OpenResources to Extract SAR from Patents and Explore Intersects in PubChem Christopher Southan ChrisDS Consulting, Göteborg, Sweden, Prepared for the ChemAxon UGM, May 2012, version 2nd May [1]
  2. 2. Key Relationships in Patents and Papers MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK Document Assay Result Compound Target Discerning and mapping these 2011 http://www.ncbi.nlm.nih.gov/pubmed/21569515 relatioshionships from documents is crucial and demanding Chemicalize.org is a 2010 http://www.citeulike.org/user/cdsouthan/article/8637426 significant advance in open chemistry extraction2012 http://www.slideshare.net/cdsouthan/southan-bio-it2012patents [2]
  3. 3. Practical Utilities• Name-to-struc (n>s) for selected or batch conversions from patents, papers, abstracts, web pages and other sources• Intersect different content at identity or similarity level• Molecular properties and bulk download• Extracted structures archived, searchable and sharable• Similarity display of analogue series from a document• Bulk upload to PubChem for intersects and triage• Result display in JChem for Excel• Can iterate with OPSIN for IUPAC fixes [3]
  4. 4. Chemicalize.org Exploitation Challenges• Specific retrieval of patent or other source (e.g. target recall)• Working different sources (e.g. CiteXplore/espace/Scibite for retrieval, Google for cross-checks, WIPO for images and tables, Freepatentsonline for deeper queries)• Eyeballing original documents for relevant sections• Locating exemplified drug-relevant/lead-like structures with data links• For many patents examples >> activity data links > potent structures• Selecting best sources/family members for optimal IUPAC extraction quality (e.g. US pats and FPO)• Filtering novel structures from common chemistry• Need to be PubChem cogniscant for effective triage• For a variety of reasons some documents have low extraction rates• Tricks and work-rounds enhance exploitation [4]
  5. 5. Target Recall: CiteExplore• Title only ”DPPIV” Medline = 37 Patents = 31• Title + abstract ”DPPIV” Medline = 402 Patents = 144• Title + abstract ”dipeptidyl peptidase” Medline = 4,838 Patents = 1,520• Title + abstract ” inhibitor” Medline = 772,053 Patents = 124,516• Title + abstract ” diabetes” Medline = 431,299 Patents = 36,792• Title + abstract ”DPPIV OR dipeptidyl peptidase AND inhibitor AND diabetes” Medline = 1,105 Patents = 604CiteXplore is restricted to EBI patent abstracts so you can get higher recall atfull-text sources such as SureChemOpen, EPO/espace, WIPO and FPO(but not search Medline in parallel) [5]
  6. 6. Target Alerts: SciBite US2012040982 DPPIV Boeringer Ingelheim Feb 2012 [6]
  7. 7. Slicing and Dicing US2012040982 (I)• Chemicalize converted 1,390 structures from the FreePatentsOnline (FPO) URL• From the 497 examples 486 converted• Need to scan the document and iterate with scroll bar to spot lead-like structures [7]
  8. 8. Slicing and Dicing US2012040982 (II)• OPSIN picks up some of what chemicalize misses (e.g. 389 above) but not all• OPSIN error reports may help fix a series for Chemicalize (e.g 1 vs. L)• Practically more important if that example has potent activity [8]
  9. 9. Slicing and Dicing US2012040982 (III)• Similarity display clearly picks out the lead-like analog series (top)• Select via FPO text > example list only, > Word > PDF > chemicalize upload > SDF download 486 structures (bottom)• However, from the partial descriptions these may include prophetics• Also download 28 claimed examples via PDF [9]
  10. 10. Slicing and Dicing US2012040982 (IV)• Can locate an SAR table with 11 point IC50s• But.... only 9 examples below 100 nM, example 25 is 56 nM• The designation of series 1 and 2 obfuscates their example identity [10]
  11. 11. PubChem Triage of Chemicalize Output (I)• Example 25 SMILES > neither an exact match nor tautomer – thus novel• Repeat search at 95% Tanimoto > 289 neighbors > cluster• Closest PubChem analog > ChemSpider > SureChem > Novo Nordisk DPPIV patent from 2005 [11]
  12. 12. PubChem Triage of Chemicalize Output (II)• Total extraction from US2012040982 > 1,390 SDFs > 1387 uploaded > 7 “failed”• 493 exact matches (= preexisting PubChem CIDs)• 486 example-only SDFs > upload > 21 exact-match CIDs• 34 claims-only give 9 exact-match CIDs, primary sources were:• 5 from ChEMBL from a Boeringer Ingleheim 2007 Publication• 7 from Thomson Pharma• 2 from ChemSpider with SureChem links to Boeringer Ingleheim patents• Thus 461 examples chemicalized from US2012040982 are “novel” structures• However, cannot check enatiomeric or tautomeric inexact matches from PubChem interface (only for existing CIDs) [12]
  13. 13. PubChem Triage of Chemicalize Output (III)• Chemicalize examples-plus-claims US201204098 = 29 CIDs (search 36 above)• Thomson Pharma/Discovery gate intersect is ~ Derwent WPI (search 31)• This matched 20 from the 29 (search 36), presumably DWPI extractions• ChEMBL (7) matched 6 from 29 (i.e. extracted from papers)• SLING matched 8 from 29 (i.e. extracted from EPO patents)• It was thus possible to intersect the chemicalize extractions from this patent with four independent primary sources in PubChem from patents and publications [13]
  14. 14. Patent ”Walking” from Chemicalize similarity results (I)• The similarity results from one example gave 1734 matches out to Tanimoto 0.5, extending ”beyond” the example space of US2012040982.• Scrolling these shows at Tanimoto 0.6, with shared substructures in blue, connect to a different older patent US7772226, also for DPPIV, from Eisai [14]
  15. 15. Patent ”Walking” from Chemicalize similarity results (II)• US7772226 from FPO converted 1127 (i.e. more than the 992 from PatBase)• 680 matched PubChem CIDs• Example 228 CC#CCN1C(=NC2=C1C(=O)NC(OC1=CC=C(C=C1)C(=O)OC(=O)C(F)(F)F)=N2)N1CCNCC1 had a 12 nM IC50 for DPPIV• Can even ”walk” to a third DPPIV patent WO2007071738 from Novartis [15]
  16. 16. Extracting from CiteXplore ChEMBL • CiteXplore lists ChEMBL IUPACs and IDs • Can chemicalize all ChEMBL structures from from one paper • Difficult to ID these in ChEMBL • Upload 8 structures to PubChem • 7 match ChEMBL IDs • Only one matches the 29 from US2012040982 • Thus paper probably from mutiple patents [16]
  17. 17. Mining PubMed Central Full-text Papers (I)• Only a few examples converted direct• So > wordpad > direct chemicalize (iterate) > web page (Google sites)• Download > Upload to JChem for Excel• Add in IC50 values from paper [17]
  18. 18. Mining PubMed Central Full-text Papers (II) • Add the SAR data from the paper into the structure table • These had no exact matches in PubChem [18]
  19. 19. Chemicalizing the DrugBank Entry for DPPIV 41 conversions of inhbitors, many are PDB ligands [19]
  20. 20. Can Even Extract Catalogues that have no SMILES or InChIs.... Tocris DPPIV inhibitor > chemicalize > PubChem > 6 analogs [20]
  21. 21. Conclusions• Chemicalize.org is powerful, flexible and free, as in beer....• Significantly enables small-scale roll-your-own patent mining• Ditto for journal article/abstract mining (e.g. for papers not captured in ChEMBL)• You still need perspicacity to discern SAR details• Complementary to commercial patent databases populated by manual extraction (e.g. you can extract more structures)• Commercial automated patent extraction databases typically combine ChemAxon n>s with other algorithms (e.g. http://www.chemaxon.com/library/benchmarking- chemaxon%E2%80%99s-name-to-structure-batch-tool-on-patent-text/)• While they thus out-perform chemicalize, it is still very useful for intersecting journal articles or other sources against any databases• Significant novel content (w.r.t. public databases) is accumulating via ”default crowdsourcing” in the chemicalize archive which becomes an important cross- check source and can be ”walked” between documents• Combined with OPSIN and OSRA structures from most sources are extractable• Synergies with sources such as PubChem, PubMed Central, ChEMBL and SureChemOpen will advance academic drug discovery and chemical biology [21]
  22. 22. Questions WelcomeChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htmMobile: +46(0)702-530710Skype: cdsouthanEmail: cdsouthan – at - hotmail.comTwitter: http://twitter.com/#!/cdsouthanBlog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)LinkedIN: http://www.linkedin.com/in/cdsouthanWebsite: http://www.cdsouthan.info/CDS_prof.htmPublications: http://www.citeulike.org/user/cdsouthan/publications/order/yearCitations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=enPresentations: http://www.slideshare.net/cdsouthan [22]

×