www.guidetopharmacology.org
The Open Patent Chemistry “Big Bang”:
Implications, Opportunities and Caveats
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
http://www.slideshare.net/cdsouthan/the-open-patent-chemistry-big-bang-implications-opportunities-and-caveats
Prepared for
Outline
• Big Bang in PubChem
• Balancing IP against bioactivity mining
• Relative source coverage
• Comparing Mwts
• Activity gap
• Unique content
• Mixtures
• CWUs
• Virtuals of various types
• Orthogonal paper
• Conclusions
• References
History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP), manual extractions from patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1, chemical named entity recognition (CNER), 2.5 mil
• 2011 - SLING Consortium, EPO extraction, 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 mil
• 2013 - SureChem, CNER + images, 9.0 mil
• 2014 - BindingDB, USPTO assay extraction (CWU), 0.07 mil
• 2015 - (CNER + images + CWU)
• SureChEMBL, 13.0 mil
• IBM phase 2, 7.0 mil
• NextMove Software, 1.4 mil, synthesis mapping
“Big Bang” of CNER PubChem source submissions (SIDs)
Chart annotations: IBM II + SureChEMBL + NM (NextMove); IBM I; SCRIPDB
Current PubChem patent chemistry
• 31.7 mil patent-extracted structures (Oct 2015)
• = 20% of 158 mil total Substance Identifiers (SIDs)
• CIDs with patent SIDs = 17.8 from total of 60.8 mil = 30%
• 2.8 million patent document numbers indexed
• * TRP count is estimated, and TRP is “half-open” (i.e. structures and dates are open but document links require a Cortellis subscription)
SID counts in mil
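The percentages on this slide follow directly from the quoted counts. A minimal sketch of the arithmetic, using only the figures above (the variable names are illustrative, not from the original):

```python
# Counts quoted on the slide, in millions (Oct 2015).
patent_sids = 31.7   # patent-extracted structures (SIDs)
total_sids = 158.0   # all PubChem SIDs
patent_cids = 17.8   # CIDs that have at least one patent SID
total_cids = 60.8    # all PubChem CIDs

# Patent-derived share at the substance and compound levels.
sid_share = 100 * patent_sids / total_sids
cid_share = 100 * patent_cids / total_cids

print(f"Patent share of SIDs: {sid_share:.1f}%")
print(f"Patent share of CIDs: {cid_share:.1f}%")
```

With these rounded inputs the CID share comes out slightly under the 30% quoted, so the slide figure is best read as an approximation.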
Opportunities from the Big Bang:
balancing the IP vs SAR utility split
IP assessment
• De facto crucial prior art
• Differential coverage as an adjunct to
commercial sources
• Facilitates IP mining for those who
cannot afford commercial offerings
• PubChem content is chemistry from
patents, not patented chemistry
• CNER is brainless compared to expert
IP-relevance selection
• Claim extraction generally poor
• CNER-extracted chemistry artefacts can
confound assessments (e.g. virtuals)
• Dense image tables still a coverage gap
• Major sources currently static in
PubChem (except SureChEMBL & TRP)
• Asian chemistry shortfall
• The “common chemistry” problem
Bioactivity data-mining
• Circa 5x more SAR than the literature
• Chemistry > data via PubChem patent number indexing > free full-text
• Patent families collapse to < 100K
C07D primary documents
• Advanced query options in
SureChEMBL including SciBite
bioentity mark-up
• Challenge of judging scientific quality
• Synthesis extraction (NextMove)
• Valuable intersects with papers and
targets via ChEMBL
• Easy intersecting with DIY chemistry
extraction from any document
• Only ~ 5 mil structures potentially
linkable to bioactivity data
• Thus ~ 12 million have marginal utility
• Drug structure multiplexing problem
Major PubChem CNER patent sources at the compound level:
structural corroboration but also divergence
Venn diagram; counts are Compound Identifiers (CIDs) in millions, with a union of 17.8
SCRIPDB = 4.0 (SID:CID 1.5)
IBM = 7.9 (SID:CID 1.2)
SureChEMBL = 14.6 (SID:CID 1.0)
Segment counts (region labels not recoverable from the extraction): 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95
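The seven Venn segments should add up to the stated union, and the per-source totals give a rough measure of how much the sources corroborate each other. A minimal sketch of that check, assuming only the numbers shown on the slide (which segment belongs to which region is not assumed here):

```python
# Segment counts (millions of CIDs) as they appear on the Venn diagram.
segments = [0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]

# Per-source CID totals and the union quoted on the slide.
source_totals = {"SCRIPDB": 4.0, "IBM": 7.9, "SureChEMBL": 14.6}
stated_union = 17.8

# Disjoint Venn regions sum to the union of the three sets.
union_from_segments = sum(segments)

# Each CID is counted once per source that contains it, so the ratio
# of summed source totals to the union is the mean number of sources per CID.
sources_per_cid = sum(source_totals.values()) / stated_union

print(f"Union from segments: {union_from_segments:.2f} mil")
print(f"Mean sources per CID: {sources_per_cid:.2f}")
```

A mean well above 1 reflects the corroboration between sources, while the large single-source segments reflect the divergence.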
Patent CNER vs manual bioactivity sources in PubChem:
structural corroboration but also divergence
Venn diagram; counts are CIDs in millions
SCRIPDB + IBM + SureChEMBL = 17.8
Thomson (Reuters) Pharma = 4.3
ChEMBL = 1.4
Segment counts (region labels not recoverable from the extraction): 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55
Mw plots indicate the CNER fragmentation problem
The bioactivity-gap:
majority of patent chemistry has no linked data
1.8 mil CNER CIDs
Compare with a bioactivity-focussed source, e.g. Guide to PHARMACOLOGY (GtoPdb): 6037 CIDs
Patent-unique structures : a mixed blessing
Patent-picking: vendors listing probable non-stock structures
Has been reduced since the recent deprecation of 20 million Angene SIDs
CNER whitespace problem: mixtures from WO2010053438
US6589997: missing punctuation > CNER fails and mixtures
NextMove
SureChEMBL (have now fixed this document)
Mixture extractions: more problematic than useful
N.b. PubChem ameliorates the issue by splitting all SID/CID mixtures to component CIDs while maintaining the back-mapping
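The idea of splitting mixtures while keeping the back-mapping can be sketched with SMILES strings, where a "." separates disconnected components. This is an illustration only, not PubChem's actual pipeline: the record identifiers and the `split_mixtures` helper are hypothetical, and the plain string split does no structure canonicalisation, which a real workflow would need a cheminformatics toolkit for.

```python
from collections import defaultdict


def split_mixtures(records):
    """Split dot-disconnected SMILES into components.

    Returns a mapping from each component to the set of parent
    record IDs it came from (the back-mapping).
    """
    component_to_parents = defaultdict(set)
    for record_id, smiles in records.items():
        for component in smiles.split("."):
            if component:  # skip empty fragments from stray dots
                component_to_parents[component].add(record_id)
    return component_to_parents


# Hypothetical records: one single compound and one two-component mixture.
records = {
    "rec1": "CCO",
    "rec2": "CCO.CC(=O)O",
}

mapping = split_mixtures(records)
for component, parents in sorted(mapping.items()):
    print(component, "<-", sorted(parents))
```

Here the ethanol component maps back to both records, so a search on the single compound also surfaces the mixture it was extracted in.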
CWU chemistry: from the sublime…
To the ridiculous…. “Chessbordane” CWU virtuals
C362H422
Virtuals II: stereo enumerations from US 20080085923
260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and Discovery Gate
Virtuals III: deuterated enumerations from US20080045558
986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB
Very virtual: d100 dalbavancin
Submitted to PubChem by Thomson Pharma (only) on 16th of March 2009
Recent orthogonal analysis of Big Bang impact
• Compares SureChEMBL and IBM with SciFinder and Reaxys for a small
patent set (i.e. open vs commercial)
• Concludes: “50–66 % of the relevant content from the latter was also found in the former”
• Equivalent comparisons executed in PubChem, along the lines presented
here, would record a higher overlap
• This would be via contributions from the other three open sources and
mixture splitting
• Note the update schedule for SureChEMBL in PubChem will be quarterly, but new patent chemistry surfaces in SureChEMBL at the EBI within 2-4 days and is refreshed in the EBI UniChem resource ~ monthly
Managing expectations: assessment of chemistry databases generated by
automated extraction of chemical structures from patents, Senger, et al. J.
Cheminf. 2015, 7:49 doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
Conclusions
• The “Big Bang” value massively outweighs the caveats
• All sources contributing to open patent chemistry are to be congratulated,
and PubChem for wrangling them
• PubChem slice-and-dice functionality is informative for comparing sources
• Bioactivity mining is extensively enabled but still challenging
• IP assessment also not straightforward but playing field has levelled
• But we do need to look the gift horse in the mouth
• Important to resolve and understand quirks, artefacts and pitfalls
• PubChem filters can partially ameliorate some of these
• Between open and commercial we are approaching the best of both worlds
• It will be interesting to see where we go from here
References and questions please
http://cdsouthan.blogspot.com/ 19 posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581 http://www.ncbi.nlm.nih.gov/pubmed/23506624
(with PubMed Commons data link)
N.b. from the aspect of reproducibility, anyone needing technical tips to reproduce or
extend the PubChem queries used for these slides is welcome to contact me
www.ncbi.nlm.nih.gov/pubmed/25415348
ACS “Deuterogate” slides http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
//nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037
