ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent-extracted structures in PubChem

Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
As of August 2017, the major automated patent chemistry extractions (in ascending size: NextMove, SCRIPDB, IBM and SureChEMBL) are submitters for 21.5 million CIDs out of the PubChem total of 93.8 million. The following aspects will be expanded in this presentation.

Advantages: a) while the relative coverage between open and commercial sources is difficult to determine (PMID 26457120), it is clear that the majority of patent-exemplified structures of medicinal chemistry interest (i.e. from C07 plus A61) are now in PubChem; b) this allows most first filings of lead series and clinical candidates to be tracked; c) the PubChem toolbox has query, analysis, clustering and linking features that are difficult to match in commercial sources; d) many structures can be associated with bioactivity data; e) connections between manually curated papers and patents can be made via the 0.48 million CID intersects with ChEMBL.

However, looking more closely also indicates disadvantages: a) extraction coverage is compromised by dense image tables and poor OCR quality of WO documents; b) SureChEMBL is the only major open pipeline continuously running in situ, but it has a PubChem updating lag; c) automated extraction generates structural “noise” that degrades chemistry quality; d) PubChem patent document metadata indexing is patchy (although better for SureChEMBL in situ); e) nothing in the records indicates IP status; f) continual re-extraction of common chemistry results in over-mapping (e.g. 126,949 patents for aspirin and 14,294 for atorvastatin); g) authentic compounds are contaminated with spurious mixtures and never-made virtuals, including thousands of deuterated drugs; h) linking between assay data and targets is still a manual exercise.

However, all things considered, the PubChem patent “big bang” presents users with the best of both worlds (PMID 26194581). Academics or smaller enterprises who cannot afford commercial solutions can now patent mine extensively. Even for those with commercial subscriptions, PubChem has become an essential complementary source for the analysis of patent chemistry and associated bio entities such as diseases and drug targets.


  1. www.guidetopharmacology.org Looking at the gift horse: pros and cons of patent-extracted structures in PubChem. Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh. ICIC Heidelberg, Monday 23rd Oct 2017 https://www.slideshare.net/secret/v4A5eUTuYvT28X 22 million
  2. Abstract (will be skipped for the presentation)
  3. Outline
     • History of patent chemistry feeds to PubChem
     • Relative source contributions
     • Caveats with automated extraction
     • Source intersects
     • Fragmentation
     • Source extraction comparisons
     • Circularity for virtuals
     • Mixtures
     • Lag times
     • Conclusions
     • References
     • Workshop alert
  4. Chemical Named Entity Recognition (CNER)
     • Automated process: documents in > structures out
     • SureChEMBL pipeline shown above, other sources similar
     • Name-to-struc (n2s) by look-up and/or IUPAC translation, image-to-struc (i2s) and mol files from USPTO Complex Work Units (CWUs)
     • Indexing usually added, e.g. abstract, descriptions, claims
     • As well as patents, IBM run PubMed abstracts and PMC
  5. History of patent chemistry feeds into PubChem
     • 2006 Thomson (now Clarivate) Pharma, manual extractions from patents and papers, 4.3 million (but ceased Jan 2016)
     • 2011 IBM phase 1 Chemical Named Entity Recognition (CNER), 2.5 million
     • SLING Consortium EPO extraction, 0.1 million
     • 2012 SCRIPDB, CNER + Complex Work Units (CWU), 4.0 million
     • 2013 SureChem, CNER + image, 9.0 million
     • 2014 BindingDB manual activity curation, 0.13 million
     • 2015 (CNER + images + CWU)
     • SureChEMBL 13.0 million
     • IBM phase 2, 7.0 million
     • NextMove Software 1.4 million synthesis mapping
     • 2016 SureChEMBL 15.8 million
     • 2017 IBM phase 3, 6.0 million
  6. 2011 “fizzle” > 2015 “big bang”
  7. Pro: Oct 2017, from 93.89 million PubChem CIDs
  8. Pro: PubChem indexes IPC splits
     Con: document indexing is USPTO dominated (i.e. early WOs missed)
     Con: Entrez can't handle the joins
  9. Con: Mw plots reveal CNER fragmentation
     • ChEMBL + Thomson Pharma = 5.6 million (manual extraction)
     • Patent CNER = 21.8 million
  10. Con: those “Chessbordanes” still hanging around…
  11. Pros & cons arising from intersects and filters
  12. Intersects and diffs for major CNER sources. Pro: corroboration, Con: divergence
      • Counts (Oct 2017) are CIDs in millions
      • IBM = 10.7, SCRIPDB = 4.0, SureChEMBL = 17.6
      • Venn region labels: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
      • Union = 21.7, 3-way = 2.4, 3 + 2-way = 8.1, Unique = 13.5
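The summary figures on the Venn slide can be cross-checked against the seven region counts. The sketch below is only an arithmetic check, not the author's workflow. The assignment of each region value to a particular source combination is an assumption, inferred because it is the one that reproduces the stated source totals to within rounding.

```python
# Seven Venn regions (millions of CIDs, Oct 2017) as quoted on the slide.
# ASSUMPTION: which value belongs to which source combination is inferred
# from the per-source totals; the slide transcript does not state it.
regions = {
    ("IBM",): 2.9,
    ("SureChEMBL",): 10.1,
    ("SCRIPDB",): 0.5,
    ("IBM", "SureChEMBL"): 4.7,
    ("IBM", "SCRIPDB"): 0.6,
    ("SCRIPDB", "SureChEMBL"): 0.4,
    ("IBM", "SCRIPDB", "SureChEMBL"): 2.4,
}


def summarise(regions):
    """Derive union, unique, shared and per-source totals from region counts."""
    per_source = {}
    for sources, count in regions.items():
        for source in sources:
            per_source[source] = per_source.get(source, 0.0) + count
    return {
        "union": sum(regions.values()),
        "unique": sum(c for s, c in regions.items() if len(s) == 1),
        "shared": sum(c for s, c in regions.items() if len(s) >= 2),
        "three_way": sum(c for s, c in regions.items() if len(s) == 3),
        "per_source": per_source,
    }


summary = summarise(regions)
for name, value in summary.items():
    print(name, value)
```

Under this assignment the unique and shared totals match the slide (13.5 and 8.1), while the union and two of the source totals differ from the quoted values by 0.1, which is consistent with rounding of the individual regions.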
  13. Con: circular extraction of virtual enumerations
      • 1511 codeine records, mainly 563 deuterations from Auspex US7872013 > 3-source multiplexing
      • 652 InChIKey inner layer records via 266 stereos of vorapaxar via Schering US20080085923 > 4-source multiplexing in UniChem
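One way to see how many records pile up on a single parent structure is to group InChIKeys by their first block, which hashes the connectivity layer, so stereoisomers and isotopic (e.g. deuterated) variants of the same skeleton share it. The following is a minimal sketch of that grouping. The keys in the example are made-up placeholders with the right shape, not real InChIKeys for codeine or vorapaxar.

```python
from collections import defaultdict


def group_by_connectivity(inchikeys):
    """Group full InChIKeys by their first (14-character) connectivity block."""
    groups = defaultdict(list)
    for key in inchikeys:
        skeleton = key.split("-")[0]
        groups[skeleton].append(key)
    return dict(groups)


# HYPOTHETICAL placeholder keys: three variants of one skeleton, one of another.
example_keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "AAAAAAAAAAAAAA-DDDDDDDDSA-N",
    "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
]

groups = group_by_connectivity(example_keys)
for skeleton, members in groups.items():
    print(skeleton, len(members))
```

Skeletons with unusually large groups are candidates for the kind of enumerated virtual sets described on the slide, although a large group alone does not prove that the members were never made.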
  14. Pro: good coverage, con: not complete
      • Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
      • Concluded: “50–66 % of the relevant content from the latter was also found in the former”
      • Equivalent comparisons in the latest PubChem would record a higher overlap
      • Probability of completely missing a recently exemplified series is getting lower
      • Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, Senger et al., J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL) http://www.ncbi.nlm.nih.gov/pubmed/26457120
  15. Examining extraction selectivity for same patent
  16. Coverage from US9181236. Pro: convergence, Con: divergence
      • 173 BindingDB CIDs curated from PubChem via US9181236
      • 405 substances SDF from SciFinder, OpenBabel > 391 InChIKeys (IK) > 362 CIDs
      • 1657 rows > 834 SureChEMBL IDs > 664 CIDs
      • 3-way Venn of CIDs
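Once each source has been reduced to a set of CIDs for the same patent, the 3-way Venn and a coverage fraction follow from plain set operations. The sketch below assumes the CIDs are already in hand. The function names and the toy CID values are hypothetical and are not the actual sets for US9181236.

```python
def venn3(a, b, c):
    """Return the seven disjoint regions of a three-set Venn diagram."""
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "ab_only": (a & b) - c,
        "ac_only": (a & c) - b,
        "bc_only": (b & c) - a,
        "abc": a & b & c,
    }


def coverage(reference, candidate):
    """Fraction of the reference set that is also present in the candidate set."""
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


# HYPOTHETICAL toy CID sets standing in for the three sources on the slide.
bindingdb = {1, 2, 3, 4}
scifinder = {2, 3, 4, 5, 6}
surechembl = {3, 4, 6, 7, 8}

venn = venn3(bindingdb, scifinder, surechembl)
for region, cids in venn.items():
    print(region, sorted(cids))
print("SureChEMBL coverage of BindingDB:", coverage(bindingdb, surechembl))
```

The same `coverage` calculation is the kind of figure behind an open-versus-commercial comparison, with the commercial set as the reference.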
  17. Con: the common chemistry problem
      • Spurious patent < > compound indexing: aspirin = 131,410, atorvastatin = 14,968, ethanol = 72,027
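A simple filter on patent counts is one way to separate over-mapped common chemistry from compounds with a small, specific patent footprint. This is a sketch only: the cut-off value and the `example_lead` entry are hypothetical, and a count threshold is an illustration rather than a method described in the talk.

```python
# Patent counts per compound; the first three are quoted on the slide (Oct 2017).
patent_counts = {
    "aspirin": 131_410,
    "atorvastatin": 14_968,
    "ethanol": 72_027,
    "example_lead": 12,  # HYPOTHETICAL compound, added only for contrast
}

# HYPOTHETICAL cut-off; a sensible value depends on the question being asked.
MAX_PATENTS = 10_000


def split_by_patent_count(counts, max_patents):
    """Split compounds into (specific, over_mapped) by their patent count."""
    specific = {n: c for n, c in counts.items() if c <= max_patents}
    over_mapped = {n: c for n, c in counts.items() if c > max_patents}
    return specific, over_mapped


specific, over_mapped = split_by_patent_count(patent_counts, MAX_PATENTS)
print("specific:", sorted(specific))
print("over-mapped:", sorted(over_mapped))
```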
  18. Con: the mixtures problem
  19. Con: no open automated SAR extraction
      Pro: DIY manual SAR extraction aligned to PubChem structures
      Pro: ~2K patents have target-mapped BindingDB curated SAR
      • SAR table from WO2016096979, Janssen BACE1 inhibitors
      • Left to right: page from the PDF, SureChEMBL mark-up and Excel paste-across
  20. Con: lag in SureChEMBL > PubChem synch times
      • Internal UniChem load at EBI, 10 Oct = 18691416
      • PubChem submission, 07 Oct = 17687607
      • Latest in situ entries below for 12 Oct
      • Extraction in SureChEMBL within a week or less of pub date
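The size of the synchronisation gap follows directly from the two counts on the slide. The short calculation below assumes that both figures count SureChEMBL structures on a comparable basis, which the slide implies but does not state.

```python
# Counts as quoted on the slide (October 2017).
unichem_count = 18_691_416   # internal UniChem load at EBI, 10 Oct
pubchem_count = 17_687_607   # PubChem submission, 07 Oct

# ASSUMPTION: both numbers count the same kind of record.
backlog = unichem_count - pubchem_count
backlog_pct = 100 * backlog / unichem_count

print(f"Structures not yet in PubChem: {backlog:,}")
print(f"Share of the UniChem load: {backlog_pct:.2f}%")
```

On these figures, roughly one million structures, a little over 5% of the UniChem load, had not yet reached PubChem.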
  21. Con: IBM CNER > 80% of all PubChem < > PMID links
      • IBM extracts PubMed abstracts as well as patents
      • PubChem < > structures to PMID
      • Automated associations swamp out expert-curated assignments
      • Specificity/accuracy is equivocal
  22. Conclusions
      • For the PubChem patent chemistry “Big Bang” the pros massively outweigh the cons (i.e. it’s not a bad horse …)
      • Contributors are to be congratulated, as is PubChem for wrangling them
      • However, it is important to look closely at the gift horse…
      • Users need to understand CNER quirks, pitfalls and confounding artefacts
      • PubChem slicing and filtering can partially ameliorate these
      • Activity-to-target mapping for SAR extraction is still a pinch point
      • Open extraction is a crucial comparator for commercial efforts
      • Those without commercial sources are well enabled for patent mining
      • Those with commercial sources can synergise with open searching
  23. Info
      • http://cdsouthan.blogspot.com/ (many posts have the tag “patents”)
      • http://www.ncbi.nlm.nih.gov/pubmed/26194581
      • http://www.guidetopharmacology.org/
      • http://www.sciencedirect.com/science/article/pii/B9780124095472138144
  24. Questions? (but wait … there’s more, a Tuesday tutorial)
