Causes and consequences of automated extraction of patent-specified virtual deuterated drugs

www.guidetopharmacology.org
Deuterogate: Causes and consequences of
automated extraction of patent-specified virtual
deuterated drugs feeding into PubChem
Christopher Southan
IUPHAR/BPS Guide to PHARMACOLOGY, Center for Integrative
Physiology, University of Edinburgh
ACS Boston CINF session: Enabling Machines to "Read" the
Chemical Literature: Techniques
http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs

Abstract
The strategy of deuterating drugs to improve clinical profiles via the kinetic isotope effect has
been known for over 50 years. However, recent development candidates have been predicated
on a surge of opportunistic patent filings between 2008 and 2011. These filings present particular
challenges for automated chemical named entity recognition (CNER), which are investigated
in this work by comparing sources of the 80K deuterated compounds in PubChem. Of
these, 45K originate from the patent CNER submissions of SCRIPDB, IBM and SureChEMBL,
plus 23K from Thomson Pharma via manual expert curation (MEXC). For CNER there are
three options: image extraction, recognition of [2H] in IUPAC text forms, or Complex Work Unit
(CWU) molfiles obtained from the USPTO. For images, conversion to structures using OSRA
with explicit H and D positions failed. Tests with chemicalize.org and OPSIN established that
the text “deuterio” did convert. The SureChEMBL pipeline also handles the “dx” prefix
(e.g. methyl-d3). These tests, combined with inspection of SureChEMBL export records, confirmed that
deuteration feeding into PubChem from patents was predominantly image-only derived. It was
also clear that CWUs had provided the majority of these via molfiles. However, despite
conceptually similar CNER pipelines, the three CNER sources showed divergent capture.
Importantly, inspection of patents from the three major applicants in the deuteration IP Gold
Rush indicated little reduction to practice. The unexpected consequence is that most of the
~25K derivatives in PubChem of ~500 established drugs are virtual (i.e. the structures do not
exist). This Achilles heel of CNER will be discussed, since it presents database users with a
dilemma: virtual swamping but possible IP significance on the one hand, versus the
permanent absence of linked bioactivity data on the other.

Dalbavancin
FDA approved May 2014

US20090062182: Deuterium-enriched dalbavancin

OSRA: fails on explicit “D-” image-to-structure conversion

The extraction problem for deuts
• The majority of patents are image-only, so no name-to-structure conversion is possible
• IUPAC specification of “deutero” and “deuterio” is rare, but
OPSIN, SureChEMBL and chemicalize.org will do the
name-to-structure conversion
• Thomson (Derwent) and SciFinder draw them in manually
for conversion
• SureChEMBL, SCRIPDB and IBM use the Complex Work
Units from the USPTO
• These include the molfiles drawn by the contractors and
are the major source of deuteration in PubChem
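
The text-based route in the bullets above can be illustrated with a minimal sketch. It flags deuteration in a chemical name (the “deutero”/“deuterio” forms and a “-d3” style label) and counts explicit [2H] atoms in a SMILES string. The regular expressions and function names are assumptions made for illustration only; they are not the rules used by OPSIN, SureChEMBL or chemicalize.org, and they do nothing for image-only patents.

```python
import re

# Illustrative patterns only (assumptions, not any pipeline's actual rules).
NAME_PATTERNS = [
    re.compile(r"deuteri?o", re.IGNORECASE),  # "deuterio" or "deutero"
    re.compile(r"-d\d+\b"),                   # lowercase "-d3" style label
]
SMILES_DEUTERIUM = re.compile(r"\[2H\]")      # explicit deuterium atom


def name_mentions_deuterium(name: str) -> bool:
    """Return True if the name contains a recognised deuteration token."""
    return any(pattern.search(name) for pattern in NAME_PATTERNS)


def count_deuterium_in_smiles(smiles: str) -> int:
    """Count explicit [2H] atoms written in a SMILES string."""
    return len(SMILES_DEUTERIUM.findall(smiles))


if __name__ == "__main__":
    for text in ("methyl-d3", "1,1,2-trideuterioethanol", "codeine"):
        print(text, name_mentions_deuterium(text))
    print(count_deuterium_in_smiles("[2H]C([2H])([2H])O"))
```

The “-d3” pattern is deliberately case-sensitive so that labels such as “vitamin-D3” are not flagged; a production pipeline would need a far more careful grammar.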

Codeine: the enumeration record from US20080045558
The left panel shows a section from one of approximately 55 pages of images.
The right panel shows the first three examples from the 520-member intersect between the
992 CIDs retrieved via the patent number and the 551 from “Same,
Connectivity” for codeine (CID 5284371), ranked by Mw.
Thomson Pharma extracted only three examples from this patent.
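
The intersect described in this caption can be sketched as a plain set operation followed by a sort. The CIDs and molecular weights below are toy values, the function name is hypothetical, and ascending order is an assumption, since the slide only says “ranked by Mw”.

```python
from typing import Dict, List, Set


def rank_shared_cids(
    patent_cids: Set[int],
    connectivity_cids: Set[int],
    mol_weight: Dict[int, float],
) -> List[int]:
    """Intersect two CID sets and rank the shared CIDs by molecular weight."""
    shared = patent_cids & connectivity_cids
    # Ascending Mw is an assumption; the CID breaks ties deterministically.
    return sorted(shared, key=lambda cid: (mol_weight[cid], cid))


# Toy data standing in for the patent-number and "Same, Connectivity" hits.
patent_cids = {101, 102, 103, 104}
connectivity_cids = {102, 103, 104, 105}
mol_weight = {101: 310.0, 102: 300.4, 103: 302.4, 104: 301.4, 105: 299.4}

ranked = rank_shared_cids(patent_cids, connectivity_cids, mol_weight)
print(ranked)
```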

SureChEMBL indexing
The first structure in the list, SCHEMBL12905541, corresponds to CID
237918906, which has merged the SureChEMBL SID 237918906 with
SCRIPDB SID 141460523.

Source divergence in deuteration capture
Thomson Pharma (TRP), SCRIPDB (SCR) and SureChEMBL (SCH) have an approximate
three-way split, with the union of 64195 covering 81% of PubChem deuteration
(77882 in March 2015).
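
As a rough illustration of how such a split could be measured, the sketch below takes one set of CIDs per source and reports the size of the union, its coverage of a given total, and the number of CIDs exclusive to each source. The data are toy values and the function name is hypothetical; the actual counts on the slide came from PubChem queries.

```python
from typing import Dict, Set


def capture_summary(sources: Dict[str, Set[int]], total: int) -> dict:
    """Summarise union size, coverage of `total`, and per-source exclusives."""
    union = set().union(*sources.values())
    exclusive = {}
    for name, cids in sources.items():
        # CIDs seen in every other source combined.
        others = set().union(
            *(other for other_name, other in sources.items() if other_name != name)
        )
        exclusive[name] = len(cids - others)
    return {
        "union": len(union),
        "coverage": len(union) / total,
        "exclusive": exclusive,
    }


# Toy CID sets for the three sources; not real PubChem data.
toy_sources = {"TRP": {1, 2, 3}, "SCR": {3, 4, 5}, "SCH": {5, 6, 7}}
summary = capture_summary(toy_sources, total=10)
print(summary)
```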

Propagation: UniChem indexing

Deuteration over time: patent surge in Thomson Pharma
TRP deuteration in PubChem on a per-year basis (left-hand vertical axis and hatched
bars), with patent publication dates taken from the USPTO for Auspex, Concert and
Protia combined (right-hand vertical axis and solid lines with triangles).

Picking off drug structures

SciFinder results indicate invention by consortium
• SciFinder facilitated certain queries orthogonal to PubChem (e.g.
assignee query for substances)
• 19841 isotopic substances were derived from 165 Auspex patents
• Concert 6766 from 189
• Protia 1959 from 252
• Remarkably, the substance union query gave 28076 with an intersect
of only 30 as deuteration reagents
• This means the assignees somehow contrived to divide up ~600
drug filings (i.e. to avoid each other's claims)

Consequences and problems of virtual deuteration
• Classic case of unintended consequences
• Confounding drug analogue searching
• Breaking the PubChem unofficial rule of extant-only compounds
• Extant and virtual structures cannot be computationally separated
• Secondary submitters cause intra-PubChem proliferation
• Persistence as no-data entries
• Proliferation between open databases
• Both commercial sources of patent chemistry and source
aggregation projects within pharmaceutical companies will be
affected
• Annotation can be confounded (e.g. the attribution of biological
study in SciFinder)
• Equivocal IP situation
