Mining Drug Targets, Structures and Activity Data


Presentation at BioIT 2012

Published in: Technology

  1. Mining Drug Targets, Structures and Activity Data Using Open Full-Text Patent Sources and Web Tools. Christopher Southan, ChrisDS Consulting, Göteborg, Sweden. Prepared for BioIT, Boston, April 2012, Track 11, Open Source Solutions, Wednesday, 13:45
  2. Introduction
  4. The Good News: Patent Mining Utility
     • Novel bioactive chemical structures related to drug discovery exceed those in journals by at least five-fold.
     • Encompass academic, as well as commercial, global med. chem. output.
     • Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
     • ~70% of data initially patent-only, some never disclosed elsewhere.
     • Include synthetic descriptions and other useful enabling information.
     • Precede journal or meeting reports by ~1.5 to 5 years.
     • Can be complementary to papers (e.g. larger SAR matrix).
     • Intersect with papers at chemistry, target, disease, author and citation levels.
     • IP exploitable for Neglected Tropical Disease research becoming "open".
  5. The Bad News: Patent Mining Can be Tough
     • High-specificity retrieval of relevant documents is difficult
     • Massive chaff-to-wheat ratio in 100s of pages
     • Differences in layout, house style and data location
     • Markush permutation
     • Variability in IUPAC strings and image rendering
     • Use of non-standard gene/protein names
     • Obfuscation via:
       – Qualitative or binned assay results
       – Structure-to-data links non-obvious, patchy or absent
       – Less than 50% of titles include target names
       – The "hiding the lead and core structures" game
       – Blunderbuss disease and use exemplifications
       – Tense ambiguity (i.e. "could be" vs. "was" done)
     • Quality judgments are difficult
     • Patents cite papers and patents but few papers cite patents
     • Document redundancy of Kind codes, patent families and equivalents
     • Finding drug candidate first-filings is difficult
     • The PDF hamburger problem and OCR noise
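
The Kind-code redundancy mentioned above can be reduced before any text extraction is attempted. The following is a minimal sketch, assuming publication numbers follow a simple "country code, serial number, optional Kind code" layout; the regular expression and the function names are illustrative assumptions, not part of the presentation. It does not resolve patent families or cross-country equivalents, which need family data from a patent office.

```python
import re
from collections import defaultdict

# Assumed layout: two-letter country code, digits, optional Kind code
# (one letter plus an optional digit, e.g. A1 or B1).
PUB_NUMBER = re.compile(r"^([A-Z]{2})\s*(\d+)\s*([A-Z]\d?)?$")


def split_publication_number(raw):
    """Return (country, serial, kind) for a string such as 'EP2391601A1'."""
    cleaned = raw.upper().replace(",", "").replace("/", "").strip()
    match = PUB_NUMBER.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognised publication number: {raw!r}")
    country, serial, kind = match.groups()
    return country, serial, kind or ""


def group_by_document(numbers):
    """Group publication numbers that differ only by Kind code."""
    groups = defaultdict(list)
    for raw in numbers:
        country, serial, kind = split_publication_number(raw)
        groups[country + serial].append(kind)
    return dict(groups)


# EP2391601 is the example used later in the slides; the Kind codes
# attached to it here are illustrative only.
groups = group_by_document(["EP2391601A1", "EP 2391601 B1", "EP2391601"])
print(groups)
```

Grouping first means each document is text-mined once, with the Kind codes kept as a record of which publication stages were seen.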
  6. Reasons for Rolling-your-own Patent Chemistry and Data Extraction
     • Limited budget
     • You are likely to be a tacit super-curator by profession
     • Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
     • Combine automated outputs with manual triage
     • Develop a technical understanding and comparison of vendor offerings
     • Commercial databases cap the number of manually-extracted examples
     • Need SAR analogues for a few targets rather than many (e.g. mechanistic enzymology or systems chemical biology)
     • Only require data sampling across specific disease areas
     • Not overly concerned about false-negatives (i.e. don't need comprehensive prior-art check or scoping of claims)
     • Open tools operate on any text or web source, not just patents
     • You may already have commercial text mining capability
     • Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL, journals you subscribe to, PubMed and PMC)
     • You can slice-and-dice PubChem patent chemistry in ways complementary to commercial databases
  7. Open Sources and Tools Overview
     • Searching metadata, abstracts and text
       – Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
       – Open full-text: FreePatentsOnline, Google Patents, Google Scholar, et al.
     • Metadata, full-text and chemical structure search - SureChemOpen
     • Bulk name-to-structure conversion - ChemAxon Chemicalize
     • Individual name-to-structure - OPSIN
     • Conversion of images to structures - OSRA
     • Sketcher inputs – many options
     • Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
     • EPO patent number searching in PubChem
     • for cutting pages and for sections or tables
     • Utopia bioentity mark-up
     (those below not included in this presentation but relevant)
     • NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
     • Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
     • OSCAR/PatentEye, Murray-Rust group, Laconde et al, SCRIPDB, Juristica group
     (n.b. Google should give URLs for all these source and tool names)
  8. So What's in PubChem?
  9. PubChem Patent-derived Content: ~6 million
     • ~2.8 million from the Discovery Gate/Thomson Pharma intersect, mainly Derwent WPI pharmaceutical patents plus some journal extractions
     • ~5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
     • ~3.5 million of these are Lipinski-ROF compliant
     • ~40% of journal-extracted structures in ChEMBL have a match in the 5.1 million
     • ~70% of these are Lipinski-ROF compliant
     • ~90% of these have assay data
     • ~60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
     • ~2.3 million SureChem pre-2007 structures are also in there (but not selectable)
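
The "Lipinski-ROF compliant" counts above come from a filter on four computed descriptors. As a rough sketch of that kind of filter, assuming the commonly quoted rule-of-five thresholds and descriptor values that have already been computed elsewhere: the slide does not say whether one violation was tolerated, so `max_violations` is exposed as a parameter, and the class, function names and example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed descriptors for one structure (values supplied by the caller)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # calculated octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_ro5_compliant(d, max_violations=0):
    """Assumption: 'compliant' means at most max_violations thresholds exceeded."""
    return lipinski_violations(d) <= max_violations


# Hypothetical descriptor values, for illustration only.
small = Descriptors(mol_weight=350.4, logp=2.1, h_bond_donors=2, h_bond_acceptors=5)
large = Descriptors(mol_weight=720.9, logp=6.3, h_bond_donors=3, h_bond_acceptors=12)
print(is_ro5_compliant(small), lipinski_violations(large))
```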
  10. Chemistry > Patents in PubChem
  11. You found a CID, what are the Patent and Journal links? PubChem > BioAssay/ChEMBL > CiteExplore/PubMed/ > PDB
  12. Patent Links from SLING and IBM
  13. PubChem > SureChem > Patent > Structure > Data > Target
  14. Target-Centric Patent Searching
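
The slide follows these links through the PubChem web pages. For readers who prefer a scripted route, the sketch below builds request URLs for PubChem's PUG REST service, which is a swapped-in programmatic alternative and not something the presentation itself uses. The `xrefs/PatentID` and `property` path patterns are given as understood from the PUG REST documentation and should be checked against it; the helper names are hypothetical and no network call is made here.

```python
from urllib.parse import quote

# Base address of PubChem PUG REST (an assumption to verify against
# the current PubChem documentation before use).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_xref_url(cid, xref_type):
    """URL listing cross-references of one type (e.g. PatentID, PubMedID) for a CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/xrefs/{quote(xref_type)}/JSON"


def cid_property_url(cid, properties):
    """URL returning selected computed properties for a CID."""
    joined = ",".join(quote(p) for p in properties)
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{joined}/JSON"


# CID 46852300 is the analogue cited later in the slides.
patent_url = cid_xref_url(46852300, "PatentID")
pubmed_url = cid_xref_url(46852300, "PubMedID")
props_url = cid_property_url(46852300, ["MolecularFormula", "MolecularWeight"])
print(patent_url)
```

Keeping URL construction separate from fetching makes it easy to inspect or log the exact request before pointing any HTTP client at it.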
  15. Synonym Recall
      • Title only BACE1 = 8
      • Title + abstract BACE1 = 97
      • Title + abstract BACE2 = 29
      • Title + abstract BACE = 392
      • Title + abstract "Beta secretase" = 1056
      • Title + abstract memapsin = 87
      • Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin = 1383
      • Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR Memapsin AND inhibitors = 841
      • Same query to PubMed (this interface) = 1031
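
The synonym expansion above is easy to script so that the same term list can be reused across portals. This is a small sketch with hypothetical helper names; it wraps the OR group in parentheses so that the AND term applies to every synonym. That grouping is an assumption: the slide's query is written without parentheses, and how each portal resolves AND/OR precedence (or whether it accepts parentheses at all) is not stated.

```python
def quote_term(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(synonyms, required=()):
    """OR together target synonyms, then AND any required extra terms."""
    synonym_group = "(" + " OR ".join(quote_term(t) for t in synonyms) + ")"
    parts = [synonym_group] + [quote_term(t) for t in required]
    return " AND ".join(parts)


bace_synonyms = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
query = build_query(bace_synonyms, required=["inhibitors"])
print(query)
```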
  16. Target Query > Patent Retrieval from Espacenet
  17. Linking Examples to Data in the Patent
  18. Extracting Chemical Structures
  19. IUPAC-to-structure: OPSIN
      • Installable application
      • Also chemical dictionary conversions
      • Result: Example 31 structure is a 24 nM BACE1 inhibitor
  20. Image-to-structure: OSRA
      • Patchy results but fixable by editing and similarity iteration in PubChem
      • Also an installable application
      • Useful to cross-check between images and IUPACs
  21. Follow-up Searching
  22. Structure Search in PubChem
      • SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)
      • Often see stereo differences to the Derwent entry in PubChem
  23. PubChem Similarity "Walking"
      • 2D and 3D give different results
      • Can do multiple steps
      • Can "read" CID history
      • Possible to "walk" between patents
      • Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
  24. Direct Patent <> Chemistry
  25. SureChemOpen: Patent Retrieval
      • Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
      • Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but not bulk export)
  26. SureChemOpen, WIPO, OPSIN and PubChem. Result: 1 nM (?) BACE2 inhibitor with assay and synthesis details.
  27. SureChemOpen: Structure > Patent. Direct answers to: "which patents contain compounds similar to my query" and "show me all the compounds in these patents"
  28. Non-target Activity Data and Bulk Chemistry Extraction
  29. Malaria Query: CiteExplore > WIPO. Example 60, sub-200 nM potency, with solubility and clearance data
  30. Espacenet EP2391601 > ChemAxon
      • Description URL from Espacenet pasted into
      • Most of 74 examples converted
      • Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300) but no exact match
      • Claims section was a Markush description so no relevant structures were converted
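
The "95% Tanimoto" threshold above is a ratio of shared to total fingerprint features. The sketch below shows that calculation on toy feature sets; PubChem computes it on its own substructure fingerprints, so the integer sets, compound labels and helper names here are illustrative stand-ins only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def neighbours(query_bits, library, threshold=0.95):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints: each integer stands for one structural feature.
query = set(range(40))
library = {
    "exact": set(range(40)),                    # identical feature set
    "close_analogue": set(range(39)) | {100},   # one feature swapped
    "distant": set(range(20)),                  # half the features missing
}
hits = neighbours(query, library)
print(hits)
```

Swapping a single feature out of 40 gives 39 shared bits over 41 total, which is just above the 0.95 cut-off, so small structural changes can still fall inside a 95% neighbourhood while an exact match is absent.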
  31. EP2391601 > Chemicalize > PubChem (panels: Chemicalize similarity listing; PubChem Tanimoto sub-cluster)
      • EP2391601 description text > Chemicalize SDF download > PubChem Structure Search upload = 311 structures
      • Of these, 206 have PubChem exact matches
      • Of these, 176 have Thomson Pharma matches
      • The example cluster from the Thomson/Derwent extraction is ~15
      • The example cluster from Chemicalize is ~90
      • Ipso facto, Chemicalize extracted at least 70 novel structures
      • But only 10 examples were in the highest-potency bin
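
The counts on this slide can be laid out as simple arithmetic. The sketch below only re-derives differences and fractions from the numbers quoted above; the function and key names are hypothetical, and the two cluster sizes are the approximate values given on the slide.

```python
def summarise_overlap(extracted, exact_matches, thomson_matches,
                      derwent_cluster, chemicalize_cluster):
    """Derive differences and fractions from the counts quoted on the slide."""
    return {
        # Extracted structures with no exact PubChem match
        "no_exact_match": extracted - exact_matches,
        # Exact matches that come from sources other than Thomson Pharma
        "exact_not_thomson": exact_matches - thomson_matches,
        "exact_fraction": round(exact_matches / extracted, 2),
        "thomson_fraction_of_exact": round(thomson_matches / exact_matches, 2),
        # Extra cluster members found via Chemicalize (approximate inputs)
        "cluster_gain": chemicalize_cluster - derwent_cluster,
    }


summary = summarise_overlap(
    extracted=311, exact_matches=206, thomson_matches=176,
    derwent_cluster=15, chemicalize_cluster=90,
)
print(summary)
```

The cluster difference of about 75 is consistent with the slide's "at least 70 novel structures", allowing for the approximate cluster sizes.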
  32. Tips and Tricks
  33. Tables and Recalcitrant IUPACs
      PDF > Find tables > Snip image > Online OCR > WordPad > Chemicalize / OPSIN / OSRA
      • Iterative fixing of OCR errors (e.g. 1 vs l)
      • Cross-check Mw in the document
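
Both tips above lend themselves to a little scripting. The sketch below is one possible approach under stated assumptions: it enumerates spellings of an OCR'd name with confusable characters swapped (the 1/l pair is from the slide, the 0/O pair is an added assumption), and it checks a molecular formula against a molecular weight reported in the document. The atomic-mass table is deliberately partial, the formula parser handles only flat formulas without brackets, and all names and the aspirin example are illustrative.

```python
import re
from itertools import product

# Characters that OCR commonly confuses. 1/l is from the slide;
# 0/O is an additional assumed pair.
CONFUSABLE = {"1": "1l", "l": "l1", "0": "0O", "O": "O0"}

# Partial table of standard atomic masses (Da); extend as needed.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
               "S": 32.06, "F": 18.998, "Cl": 35.45, "Br": 79.904}


def ocr_variants(text, limit=64):
    """Enumerate candidate spellings, original first, up to a limit."""
    options = [CONFUSABLE.get(ch, ch) for ch in text]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break
    return variants


def molecular_weight(formula):
    """Molecular weight of a flat formula such as 'C9H8O4' (no brackets)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def matches_reported_mw(formula, reported, tolerance=0.5):
    """True if the calculated weight is within tolerance of the reported one."""
    return abs(molecular_weight(formula) - reported) <= tolerance


# Illustrative inputs, not taken from the patent.
candidates = ocr_variants("2-methy1")
aspirin_ok = matches_reported_mw("C9H8O4", reported=180.2)
print(candidates, aspirin_ok)
```

Each candidate spelling would then be passed to a name-to-structure tool, keeping only conversions whose formula agrees with the Mw printed in the table.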
  34. Utopia Mark-up of Patent Introduction. Bioentity mark-up (green) via EMBL Reflect with rich call-out options
  35. Tips for Joining Everything up
      • SureChemOpen is continuing to back-fill and add features.
      • Check the Chemicalize archive (~0.5 million) for unique content.
      • Between Chemicalize, OSRA, OPSIN and sketching you can extract most things (e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki pages, blog posts and MeSH IUPACs).
      • Check PubChem "same connectivity" for tautomer forms in different CIDs.
      • Check PubChem "similar" compounds for analogues even if you cannot track back to a patent number.
      • Most PDB ligands published by companies have a patent analogue series.
      • Espacenet text chemicalizes well but FreePatentsOnline can be better.
      • Google Scholar tracks patent citations.
      • Full-text is good but don't forget to eyeball the original PDF.
      • You can "walk" between patents by 2D/3D clusters, inventors or citations.
      • Less-common author/inventor names may track a journal paper back to a patent.
      • CiteExplore includes selectable ChEMBL structure links.
      • Check ChEMBL structures for SureChem links via ChemSpider.
      • On a good day you can paste OCR table data into Excel.
      • You can set SciBitely patent keyword alerts and see posts on Twitter.
  36. Conclusions
      • Roll-your-own patent mining can take you a long way.
      • Complementary to commercial databases.
      • Target-centric recall and specificity are reasonable.
      • Published patents are indexed and open text-extracted within weeks.
      • You need perspicacity to dig out SAR details.
      • Can cherry-pick examples by potency or collate whole series.
      • Establishing intersects between journal articles and patents is valuable.
      • Exemplified structures typically cover a broader range of analogue space and SAR data than papers.
      • You can "walk" between patents via citation and chemistry clustering.
      • PubChem already contains over 6 million patent-derived structures with more depositions and links expected.
      • The increased public surfacing of chemical structures and bioactivity data from patents will expedite medicinal chemistry, tropical disease research and chemical biology.
  37. Questions Welcome
      ChrisDS Consulting: +46(0)702-530710
      Skype: cdsouthan
      Email: cdsouthan - at - hotmail.com
      Twitter:!/cdsouthan
      Blog: (includes postings on patent themes)
      LinkedIn: