
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open resources


Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where the disclosed compounds and associated data not only exceed those published in papers by several-fold and surface years earlier but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.

This tutorial will:

Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands-on exercises using open source antimalarial research as examples

The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.

Published in: Internet
  1. 1. Digging bioactive chemistry out of patents using open resources
     • While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where the disclosed compounds and associated data not only exceed those published in papers by several-fold and surface years earlier but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including over 20 million structures now in PubChem.
  2. 2. Outline
     • Statistics of patent chemistry in various sources
     • Open resources, databases and tools
     • Target identification
     • Bioactivity and SAR extraction
     • Connecting these relationships to papers
     • Medicinal chemistry patent mining
     • Exercises using antimalarial research as examples
     • Complementarity with commercial resources
     • Competitive Intelligence
     N.b. not in scope just now: web services, APIs, RDF or SAR modelling per se.
     This is a suggested list that can be extended to related topics attendees would like to cover (at least if within the cognisance of the presenter!)
  3. 3. Preamble
  4. 4. Biog
     Chris Southan joined the IUPHAR/BPS Guide to Pharmacology database curation team as Senior Cheminformatician in 2013. Previously he was a Drug Discovery Consultant at TW2Informatics in Göteborg, Sweden, working on patent informatics. Prior to this he was a contractor for AstraZeneca Knowledge Engineering (2009-2011), working on Chemistry Connect and Pharma Connect. Earlier positions include the ELIXIR Database Provider Survey for the EBI (2008-9), Principal Scientist and Bioinformatics Team Leader at AstraZeneca (2004-7), and senior bioinformatics positions at Oxford Glycosciences (2002-3), Gemini Genomics (2001) and SmithKline Beecham (1987–2000). He has a PhD from the University of Munich, an M.Sc. in Virology from Reading University and a B.Sc. Hons. in Biochemistry from Dundee University.
     Further information: LinkedIn; IUPHAR/BPS Guide to PHARMACOLOGY; Publications: PubMed; ORCID iD 0000-0001-9580-0446; Blog: Bio < > Chem; Presentations: Slideshare; Twitter: TW2Informatics
  5. 5. Audience assumptions
     • Some familiarity with SAR distillation from the literature
     • Many of you could extract examples from a patent by hand
     • Database cognisance, including PubMed and PubChem (SID, CID)
     • More interest in recent than historical SAR
     • Not obsessively concerned with false-negatives (i.e. missed data)
     • Not greatly perturbed by the fuzziness of public sources (that you might grumble about for commercial ones)
     • Familiar with the mess of patent families and Kind codes
     • Familiar with protein names and identifiers
     • Familiar with obfuscation that can confound SAR extraction
     • Focused on Med Chem for human diseases
     • Most of this tutorial could apply across other domains (e.g. IPC code A01N for pesticides and herbicides)
     • No boundaries between Drug Discovery and Chemical Biology
     • Aware academic Drug Discovery is accelerating relative to commercial
  6. 6. References (I)
     Chapter in: Samuel Chackalamannil, Rotella and Ward (eds.), Comprehensive Medicinal Chemistry III, vol. 3, pp. 464–487. Oxford: Elsevier. ISBN: 9780128032008
  7. 7. References (II)
  8. 8. Core assumption: can we believe patent SAR results?
     • We know the data has value but it is difficult to assess quality extrinsically
     • As for other domains, Med Chem has an experimental reproducibility crisis
     • This reflects equivocality w.r.t. antibodies, cell lines and chemistry (e.g. supplier purity and probes vs PAINS)
     • For patents, high-replicate error ranges are rarely included
     • Re-synthesis fidelity is also rarely reported (ever?)
     • Cf. “Dispensing processes impact apparent biological activity as determined by computational and statistical analyses” (PMID 23658723)
     • We could hope that internal relative SAR across a series is more consistent than externally comparative absolute numbers
     • We know some inventor teams are world-class, well-cited medicinal chemists, but can we assess the less famous?
     • The same QC considerations apply to papers
     • ChEMBL surfaces the worryingly wide IC50/Ki/Kd ranges on nominally the same assays from different papers
     • We can also intersect some patent and paper values
     • Is the internal consistency of patent-derived SAR models a useful QC?
  9. 9. Introductory example
     • 138 detailed descriptions of the series
     • WO2013083991 SureChEMBL - PubChem
     • IC50 cross-reactivity data from no fewer than five cell-based enzyme assays
     • Human NMT1 (P30419), human NMT2 (O60551), Plasmodium vivax (A5K1A2), Plasmodium falciparum (Q8ILW6) and Leishmania donovani (Q8ILW6)
     • myristoyltransferase-patent-and-pdb.html
  10. 10. Stats
  11. 11. So how much useful SAR is in the patent corpus?
     • Definition for SAR: Bioactivity assay "A" (e.g. for an enzyme) with a quantitative result "R" (e.g. an IC50) for a compound "C" (defined chemical structure) as an activity modulator (e.g. inhibition) of protein target "P" (also for cellular targets, e.g. anti-infectives)
     • A useful shorthand for this mapping is “D-A-R-C-P”
     • Excelra (ex GVKBIO) provides a good statistical starting point
     • bioactive-chemistry-from-patents-and-papers
     • April 2017 numbers were 1.34 mill cpds from 112K papers and 3.35 mill from 71K patents, with 0.18 million overlap
     • From the earlier PMID 24204758, 12 cpds/paper and 46/patent
     • Human protein targets: 3383 in the former, 2431 in the latter, 3882 combined with 546 patent-only
     • The Excelra absolute activity numbers are dependent on their capping rules for binned data (e.g. IC50 between 10 and 100 nM)
     • Binned data still useful for modelling
     • Where are the enzyme activators?
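As a rough illustration of the D-A-R-C-P shorthand, one row per measurement could be held in a small record type. This is a minimal sketch, not a schema used by any of the sources: the class and field names are hypothetical, "D" is read here as the source document, and the example row mixes the patent number and UniProt accession from the introductory example with placeholder values. The helper simply re-derives per-document averages from the April 2017 counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SarRecord:
    """One D-A-R-C-P row. Field names are illustrative, not a standard schema."""
    document: str  # D: source patent or paper (assumed meaning of "D")
    assay: str     # A: assay description
    result: str    # R: quantitative result, kept as text so binned values fit
    compound: str  # C: defined chemical structure (e.g. SMILES or InChIKey)
    protein: str   # P: protein target (e.g. UniProt accession)


def per_document(compounds: float, documents: float) -> float:
    """Average number of compounds per document."""
    return compounds / documents


# Placeholder row: only the patent number and accession come from the slides.
row = SarRecord(
    document="WO2013083991",
    assay="enzyme inhibition (placeholder)",
    result="IC50 10-100 nM (placeholder bin)",
    compound="example-1 (placeholder)",
    protein="P30419",
)

# April 2017 Excelra counts quoted on the slide.
per_paper = per_document(1.34e6, 112e3)
per_patent = per_document(3.35e6, 71e3)
print(f"{per_paper:.0f} compounds/paper, {per_patent:.0f} compounds/patent")
```

The April 2017 counts give roughly 12 compounds per paper and 47 per patent, close to the 12 and 46 quoted from the earlier PMID 24204758.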
  12. 12. Independent estimates of SAR total
     • WIPO PATENTSCOPE A61 and C07 PCTs = 93,253
     • Not all have SAR data from novel composition of matter first-filings
     • Many will be “secondary” filings (e.g. synthesis and/or crystallisation)
     • Generic companies file many of these for de-risked cpds
     • Some first-filings for a chemotype series may not have any activity data disclosed (stats unknown)
     • We can thus assume that extractable SAR from med chem patents in the last five years may be in only 30,000-50,000 documents
     • Guesstimate: ~50K patents ~ 3.50 million bioactive structures (cf. Excelra 3.35 million)
     • Asian patents under-represented? (i.e. are we missing unique structures & SAR?)
  13. 13. BindingDB public SAR curation: useful benchmark extraction stats
     • Patents: 1,879
     • Binding measurements: 199,588
     • Compounds: 132,170
     • Target proteins: 1,225
     • Assays: 2,668
     • Average number of targets per patent: 1.95
     • Usually primary plus a specificity paralogue cross-screen
     • ~70 compounds/patent
     • ~100 affinity measurements/patent
     Data courtesy of Tiqing Liu and Michael Gilson, Oct 2017
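The per-patent averages follow directly from the BindingDB counts, and they offer one way to sanity-check the earlier ~50K patents ~ 3.50 million structures guesstimate. The sketch below is only arithmetic on the quoted numbers; scaling the BindingDB average up to 50,000 patents, with no structures shared between patents, is an assumption made here for illustration and not necessarily how the guesstimate was derived.

```python
# BindingDB patent curation counts (Oct 2017), as quoted on the slide.
PATENTS = 1_879
MEASUREMENTS = 199_588
COMPOUNDS = 132_170

compounds_per_patent = COMPOUNDS / PATENTS
measurements_per_patent = MEASUREMENTS / PATENTS

# Assumption: ~50K SAR-bearing patents carry the same average number of
# compounds, and no structure is shared between patents.
SAR_PATENTS = 50_000
estimated_structures = SAR_PATENTS * compounds_per_patent

print(f"{compounds_per_patent:.0f} compounds/patent")
print(f"{measurements_per_patent:.0f} measurements/patent")
print(f"{estimated_structures / 1e6:.2f} million structures")
```

On these numbers the average is about 70 compounds and 106 measurements per patent, and the scale-up lands near 3.5 million structures.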
  14. 14. Patent chemistry stats inside PubChem
  15. 15. The three major CNER sources inside PubChem
     Venn figure; counts (Oct 2017) are CIDs in millions
     • Source totals: IBM = 10.7, SCRIPDB = 4.0, SureChEMBL = 17.6
     • Segment counts: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
     • Union = 21.7
     • 3-way = 2.4
     • 3 + 2-way = 8.1
     • Unique = 13.5
     Raises questions about corroboration vs divergence
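The seven segment counts from the Venn figure can be reconciled with the source totals. The sketch below uses one assignment of segments to sources that reproduces the quoted totals to within rounding. That assignment is inferred here from the arithmetic, not read off the original figure, so treat it as an assumption.

```python
# Segment -> count in millions of CIDs. The mapping of each count to a
# segment is an inference from the quoted totals, not taken from the figure.
SEGMENTS = {
    frozenset({"IBM"}): 2.9,
    frozenset({"SureChEMBL"}): 10.1,
    frozenset({"SCRIPDB"}): 0.5,
    frozenset({"IBM", "SureChEMBL"}): 4.7,
    frozenset({"IBM", "SCRIPDB"}): 0.6,
    frozenset({"SureChEMBL", "SCRIPDB"}): 0.4,
    frozenset({"IBM", "SureChEMBL", "SCRIPDB"}): 2.4,
}


def source_total(name: str) -> float:
    """Sum every segment that includes the named source."""
    return sum(count for members, count in SEGMENTS.items() if name in members)


union = sum(SEGMENTS.values())
unique = sum(c for members, c in SEGMENTS.items() if len(members) == 1)
shared = sum(c for members, c in SEGMENTS.items() if len(members) >= 2)

for name in ("IBM", "SCRIPDB", "SureChEMBL"):
    print(name, round(source_total(name), 1))
print("union", round(union, 1), "unique", round(unique, 1), "shared", round(shared, 1))
```

Under this reading, the "Unique = 13.5" and "3 + 2-way = 8.1" figures fall out as the sums of the single-source and multi-source segments respectively; the small differences from the quoted totals (about 0.1) look like rounding.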
  16. 16. The chemistry stats: wheat vs chaff
     • If we accept a certain proportion of binned data as useful, the max SAR we could expect to align is ~3 to 4 mill structures
     • But how can we select these from the 22 million (and climbing) in PubChem?
     • The easiest way is to come in from the literature with clean structures
     • This can expand the SAR around a target anywhere from 2 to 10-fold
     • But we have an unknown statistic: what proportion of good patent SAR sets, including for novel targets, never get into a paper? (examples anyone?)
     • Excelra have some relevant stats on this – does anyone else?
  17. 17. Sources
  18. 18. Sources offer a broad spectrum of utilities
     • Connecting to patents via structures from papers
     • Connecting via targets and/or diseases from papers
     • Proximity “walking”: doc <> doc, target <> target, struc <> struc
     • Finding patents via metadata (e.g. assignee, target and date)
     • Viewing chemistry content in the document
     • Establishing if the document has useful SAR
     • Finding which sources have extracted chemistry
     • Mapping the structures to the activity values
     • DIY extraction of structures not yet in a source (e.g. images and/or IUPAC strings)
     • Collating an SAR table
     • Best to get familiar with in-depth functionality of a few sources
     • Many roads lead to Rome, so it is difficult to know which is most efficient
     • I certainly have not tried all those of probable utility
  19. 19. Source: BindingDB target-mapped SAR extraction
  20. 20. BindingDB
     • Pre-cooked expert curation
     • Modest but steady growth
     • Easy to browse list
     • Structures > PubChem and subsumed > ChEMBL
     • Targets mapped to UniProt even for titles with no target
     • Many search features, some unique (i.e. different to ChEMBL)
     • Novel targets from patents and unique journal selection
     • Download full SAR sets: example no. > structure > activity > target
     • Lag time in PubChem indexing
     • No anti-infective whole organism targets
     • US publications some years behind the WO first pubs
     • Dependent on CWU structures that are not all correct
  21. 21. Source: WIPO PATENTSCOPE
  22. 22. WIPO PATENTSCOPE
     • Comprehensive and up to date
     • Instant metrics (yellow highlight) as you toggle search parameters
     • Sign in for saved searches
     • Useful instant graphics on result lists
     • Search reports can “walk” you to other relevant filings
     • Pithy examiner comments (almost) amusing
     • Limited text search fields
     • In-line table images, pros and cons
     • Slow image loading
     • Inventor/applicant conflation
  23. 23. The WIPO “gift horse”
     • ~7 million strucs, WO and US from 1978
     • WIPO collab w. InfoChem and NextMove
     • False-negatives (i.e. examples missed)
     • Not yet in PubChem
     • Limited utility for SAR mining so far
  24. 24. Source: EPO Espacenet
     I prefer WIPO as a search portal but Espacenet is useful for INPADOC families
  25. 25. Source: SureChEMBL
  26. 26. SureChEMBL
     • For SAR extraction the best first-stop-shop (after BindingDB)
     • Chemistry indexed a week or less from publication date
     • Family-wide structure downloads
     • Powerful combination of filters and search functionality
     • Multiple source x-refs including PubChem and ChEMBL
     • Can correct IUPAC failures and paste out example blocks
     • Usual caveats of CNER (but hey, 18 mill structures for free)
     • Extraction confounded by dense image tables
     • WIPOs less well extracted than USPTOs (but OCR not their fault)
     • Overhead of futile common chemistry extraction
     • Slow image load times and structure step-through
     • Need to watch PubChem load dates (via SIDs)
     • The feature that never appeared :(
  27. 27. Source: PubChem
  28. 28. PubChem
     • Mother of all searchable portals with 22 mill patent compounds
     • SureChEMBL, ChEMBL, BindingDB and IBM are in it
     • Massive feature set including Entrez
     • Patent and PubMed connectivity via structure
     • Very useful Identifier Exchange Service for set mapping
     • Can upload SD files (e.g. from Chemicalize or SciFinder)
     • Transparent and navigable chemistry rules (e.g. “same connectivity”)
     • Slice ‘n dice full Boolean search history
     • Extensive filter options
     • Direct Venn from CID lists < 10,000
     • Similar compound clustering > isolate an SAR series
     • Can “walk” through chemical neighbourhood > cluster > cluster, patent hop by chemotype (target neutral)
     • Navigation can be daunting
     • Some large sources should be kicked out IMHO
     • Interface queries often time out
  29. 29. New search interface includes patents
  30. 30. Source: ChEMBL
  31. 31. ChEMBL
     • Gateway to chemistry manually extracted from journals
     • 0.39 mill structures mapped across to SureChEMBL
     • This gives direct journal < > patent connectivity
     • Powerful query, filtration, browsing and target indexing
     • Release 23 has 1.02 mill structures and assay data from 67,722 papers
     • Circular subsumption of 0.5 mill structures from confirmed PubChem Bioassays
     • Integrates the BindingDB patent curation (but sync lag)
     • Indexed in PubChem BioAssay
     • Target-linked entries subsumed into BindingDB
     • Linked to EPMC
     • Good for paper <> patent
     • Not linked to PubMed
     • Up to 2 year lag for papers
     • Selective journal capture
     Venn figure labels: 0.39 mill; ChEMBL 1.34 mill; SureChEMBL 17.23 mill
  32. 32. Source: Europe PubMed Central
  33. 33. Europe PubMed Central
     • Fully featured literature search functionality
     • Big plus is the (HAS_CHEMBL:y) select for chemistry
     • Gives query > paper > ChEMBL chemistry > SureChEMBL and/or > PubChem
     • Bioentity mark-up from other sources
     • De facto two-stop shop with PubMed, which has different functionality
     • Warning: their patent abstracts have not been updated since 2012
  34. 34. Source: PubMed
  35. 35. PubMed
     • Largest entry point to connect Med Chem papers < > patents
     • Entities disclosed, i.e. target protein IDs, affiliations, chemical structure
     • Power of Entrez, including MeSH
     • PubMed > PDB (via MMD) good for CID of ligands > patent
     • Can connect inventors with unusual names
     • However, papers are typically ~2 years behind “fresh” patents
     • May find enough SAR for popular targets not to bother with patents
     • But (unless the paper is in ChEMBL) you may have to DIY extract entities including chemistry
     • Patents good at citing papers (US mandated to be thorough)
     • However, many authors avoid citing their patents
     • Connect into literature via targets and diseases and thence > patents
     • JFTR disease searching in patent text largely useless (titles maybe)
     • Patent reviews valuable but tend to be in hard-to-get journals
  36. 36. Patent review articles: doing the groundwork for you
  37. 37. Unusual linking
  38. 38. PubMed > PubChem > Guide to Pharmacology BACE2 page
  39. 39. Source: Open Google
     Date cutting, e.g. by one year, actually works
  40. 40. Source: Google Patents
  41. 41. Targets
  42. 42. Status of human targets from open sources (as UniProt x-refs)
     Figure panels: Oct 2016, Oct 2017
     • Most have chemistry > target via papers (thus can search patents)
     • Outer limit of data-supported druggable proteome
     • Some patent only in BindingDB
  43. 43. Patent retrieval by target names: not so easy
  44. 44. Patent retrieval by target names
     In: Lecture Notes in Bioinformatics (ISBN 978-3-642-15119-4), P Lambrix and G Kemp (Eds.), Springer Verlag, pp 106-121, 2010
  45. 45. Classification of target names in titles
     AWK
     Gene and protein names can be noisy and inconsistently used by applicants, but HGNC approved symbol usage seems to be improving
  46. 46. The three levels of title
  47. 47. Tools
  48. 48. Utility of tools
     • Can re-run IUPACs and images where automated conversion failed
     • Synergies of gap filling from working between the original document, the SureChEMBL output and the OPSIN and OSRA tools
     • Can run on PubMed abstracts, individually or in bulk
     • Can isolate the example series of structures that have the SAR
     • Useful for extraction from papers not in ChEMBL
     • May be necessary to convert between formats, e.g. for uploading to PubChem
  49. 49. Simple PubMed search
  50. 50. Venny
     • Excellent for set comparisons of any strings, even lists of over 10,000
     • E.g. CIDs, InChIKeys or UniProt IDs
     • It automatically de-duplicates
     • Download complete intersects and diffs from any segment of the Venn
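For readers who prefer to script the same kind of comparison, the segments of a three-way Venn can be reproduced with plain set operations. This is a minimal sketch and not how Venny itself is implemented; the function name and segment labels are hypothetical, and the identifiers in the usage lines are made-up integers.

```python
def venn3(a, b, c):
    """Return the seven segments of a 3-way Venn for any hashable identifiers.

    Converting to sets de-duplicates each list, as Venny does on input.
    """
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b_only": (a & b) - c,
        "a_c_only": (a & c) - b,
        "b_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


# Made-up identifier lists; note the duplicate "2" in the first list.
segments = venn3([1, 2, 2, 3, 4], [2, 3, 5], [3, 4, 6])
for name, members in segments.items():
    print(name, sorted(members))
```

The same function works unchanged on CIDs, InChIKeys or UniProt IDs, since it only relies on equality of the strings or numbers passed in.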
  51. 51. PubChem structure search
  52. 52. PubChem Identifier Exchange
  53. 53. OpenBabel
     Format conversions, e.g. SciFinder SDF to InChIKey
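Once a structure file has been converted to InChIKeys, a quick check on the output can catch malformed lines before they go into an identifier mapping. The sketch below assumes the keys are already available as text (the conversion itself is not shown) and groups them by their first 14-letter block, which is commonly used as a proxy for "same connectivity". The function name is hypothetical and the keys in the usage lines are synthetic strings, not real compounds.

```python
import re
from collections import defaultdict

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def group_by_first_block(keys):
    """Split keys into {first block: set of full keys} plus a list of rejects."""
    groups = defaultdict(set)
    rejected = []
    for key in keys:
        key = key.strip()
        if INCHIKEY.fullmatch(key):
            groups[key[:14]].add(key)
        else:
            rejected.append(key)
    return dict(groups), rejected


# Synthetic keys for illustration only; none corresponds to a real structure.
SYNTHETIC_KEYS = [
    "A" * 14 + "-UHFFFAOYSA-N",
    "A" * 14 + "-BBBBBBBBSA-N",
    "B" * 14 + "-UHFFFAOYSA-N",
    "not-an-inchikey",
]
groups, rejected = group_by_first_block(SYNTHETIC_KEYS)
print(len(groups), "first-block groups;", len(rejected), "rejected")
```

Grouping on the first block is a deliberate simplification: it merges stereoisomers and isotopic variants, which may or may not be what a given SAR comparison needs.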
  54. 54. Example of coverage from US9181236
     • 173 BindingDB CIDs curated from PubChem via US9181236
     • 405 substances SDF from SciFinder > OpenBabel > 391 IK > 362 CIDs
     • 1657 rows > 834 SureChEMBL IDs > 664 CIDs
     • 3-way Venn of CIDs
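The attrition along each route can be expressed as the percentage retained at every step. The sketch below is only arithmetic on the counts quoted for US9181236; the step labels and function name are hypothetical, and no claim is made about why records drop out at each stage.

```python
# Counts as quoted on the slide; labels are paraphrased.
SCIFINDER = [("substances in SDF", 405), ("InChIKeys", 391), ("CIDs", 362)]
SURECHEMBL = [("rows", 1657), ("SureChEMBL IDs", 834), ("CIDs", 664)]


def retention(steps):
    """Return (label, count, percent of the starting count) for each step."""
    start = steps[0][1]
    return [(label, count, round(100 * count / start, 1)) for label, count in steps]


for route_name, route in (("SciFinder", SCIFINDER), ("SureChEMBL", SURECHEMBL)):
    for label, count, percent in retention(route):
        print(f"{route_name}: {label} = {count} ({percent}%)")
```

On these numbers about 89% of the SciFinder substances end up as CIDs, against about 40% of the SureChEMBL rows, although the two starting points are not equivalent (substances vs rows).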
  55. 55. from ChemAxon
  56. 56. Chemicalize Google patent webpage result
  57. 57. OPSIN for IUPAC names
     • Conversion of compound 19 from WO2016096979 after fixing OCR errors
     • N-r3-r(26',3i?)-5-amino-2-methyl-3-(trifluoromethyl)-3,4- dihvdropyrrol-2-yl1- 4-fluoro-phenyl1-5-chloro-pyridine-2-carboxamide.
     • Good for iterative correction via error flagging, which Chemicalize does not offer
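The kind of manual OCR repair described here can be recorded as an ordered list of substitutions, so the same fixes can be replayed on other names from the same document. This is a hedged sketch: the substitution table is one reading of what the OCR output probably stood for (square brackets misread as "r" and "1", italic stereo labels misread as "6'" and "i?") and should be confirmed against the PDF. Whether the repaired string converts cleanly in OPSIN is not tested here.

```python
# OCR output as shown on the slide, including wrap spaces and trailing period.
RAW_NAME = (
    "N-r3-r(26',3i?)-5-amino-2-methyl-3-(trifluoromethyl)-3,4- "
    "dihvdropyrrol-2-yl1- 4-fluoro-phenyl1-5-chloro-pyridine-2-carboxamide."
)

# Assumed corrections, applied in order; each one is a guess to be verified.
OCR_FIXES = [
    ("- ", "-"),             # line-wrap space after a hyphen
    ("N-r3-r(", "N-[3-[("),  # "[" misread as "r"
    ("26'", "2S"),           # italic stereo label misread
    ("3i?", "3R"),           # italic stereo label misread
    ("dihvdro", "dihydro"),  # "y" misread as "v"
    ("yl1-", "yl]-"),        # "]" misread as "1"
]


def apply_fixes(name, fixes=OCR_FIXES):
    """Apply each (bad, good) substitution in order and drop a trailing period."""
    for bad, good in fixes:
        name = name.replace(bad, good)
    return name.rstrip(".")


def brackets_balanced(name):
    """Check that round and square brackets open and close in matching order."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in name:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


fixed_name = apply_fixes(RAW_NAME)
print(fixed_name)
print("balanced:", brackets_balanced(fixed_name))
```

Note that a bracket check alone would not have flagged the raw string, because both square brackets were lost together; a parser that reports where it fails is what makes the iterative correction practical.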
  58. 58. Result examples
  59. 59. Getting SAR out the hard way
  60. 60. Collating the hard way
     • Three versions of the SAR table from WO2016096979
     • On the left is the original from page 64 of the PDF
     • In the centre is the corresponding section of the SureChEMBL mark-up
     • The right-hand panel is an Excel paste-across of the centre section
     • But you have to complete it by pasting in SMILES of the structures from the previous page
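The final pasting step amounts to joining two tables on the example number: one holding structures, the other holding activity values. A minimal sketch of that join is below; the function name is hypothetical, and the SMILES strings and activity values are placeholders, not data from WO2016096979.

```python
def collate(structures, activities):
    """Join {example no.: SMILES} with {example no.: activity value}.

    Returns the merged rows plus the example numbers that have an activity
    value but no structure yet, so gaps are visible rather than silently lost.
    """
    rows = []
    missing = []
    for example, value in sorted(activities.items()):
        smiles = structures.get(example)
        if smiles is None:
            missing.append(example)
        else:
            rows.append((example, smiles, value))
    return rows, missing


# Placeholder inputs for illustration only.
structures = {1: "CCO", 2: "c1ccccc1"}
activities = {1: 12.0, 2: 340.0, 3: 5.5}

rows, missing = collate(structures, activities)
for example, smiles, value in rows:
    print(example, smiles, value)
print("no structure yet for examples:", missing)
```

Keeping the unmatched example numbers in a separate list makes it easy to go back to the document for the structures that automated extraction missed.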
  61. 61. Getting SAR out the easy way, via BindingDB
  62. 62. So what's next?
  63. 63. Wish list
     Yup, we can dig a lot of SAR out of patents. But wouldn’t it be nice if...
     • Clarivate re-instated the Derwent patent chemistry feed to PubChem
     • There were open standard SAR modelling tools (with AI, natch), maybe Knime? (table in > model out)
     • These might show large patent SAR sets better than from papers
     • Someone indexed full text patents by gene name counts inside the description section (SureChEMBL for OpenPhacts?)
     • SureChEMBL would finally bring in their document section stats
     • Someone ran the SureChEMBL engine on full-text papers and PubMed abstracts
     • Europe PubMed Central updated their EPO C07/A61 patent abstracts from 2012
     • We could paste large text chunks > Chemicalize but not run out of points
     • Patents could be more like good papers…
  64. 64. Could the future be automatic?
  65. 65. Impressive results
  66. 66. That’s it for now