www.guidetopharmacology.org
Deuterogate: Causes and consequences of
automated extraction of patent-specified virtual
deuterated drugs feeding into PubChem
Christopher Southan
IUPHAR/BPS Guide to PHARMACOLOGY, Center for Integrative
Physiology, University of Edinburgh
ACS Boston CINF session: Enabling Machines to "Read" the
Chemical Literature: Techniques
http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
Abstract
The strategy of deuterating drugs to improve clinical profiles via the kinetic isotope effect has
been known for over 50 years. However, recent development candidates have been predicated
on a surge of opportunistic patent filings between 2008 and 2011. For automated chemical
named entity recognition (CNER) these present particular challenges. These are investigated
in this work by comparing sources of the 80K deuterated compounds inside PubChem. Of
these, 45K originate from the patent CNER submissions of SCRIPDB, IBM and SureChEMBL
plus 23K from Thomson Pharma via manual expert curation (MEXC). For CNER there are
three options: image extraction, recognition of [2H] in IUPAC text forms, or Complex Work Unit
(CWU) molfiles obtained from the USPTO. For images, conversions to structures using OSRA
with explicit H and D positions failed. Tests with chemicalize.org and OPSIN established that
the text “deuterio” did convert. The SureChEMBL pipeline also handles the “dx” prefix (e.g.
methyl-d3). These tests, combined with inspection of SureChEMBL export records, confirmed that
deuteration feeding into PubChem from patents was predominantly image-only derived. It was
also clear that CWUs had provided the majority of these via molfiles. However, despite
conceptually similar CNER pipelines, the three CNER sources showed divergent capture.
Importantly, inspection of patents from the three major applicants in the deuteration IP Gold
Rush indicated little reduction to practice. The unexpected consequence is that most of the
~25K derivatives in PubChem of ~500 established drugs are virtual (i.e. the structures do not
exist). This Achilles heel of CNER will be discussed, since it presents database users with a
dilemma: virtual swamping but possible IP significance on the one hand, versus the
permanent absence of linked bioactivity data on the other.
Introduction
Dalbavancin
FDA approved May 2014
SciFinder extraction
US20090062182: Deuterium-enriched dalbavancin
Protia portfolio
OSRA: fails on explicit “D-” image > structure conversion
The extraction problem for deuts
• Majority of patents are image-only so no conversion
• IUPAC specification of “deutero” and “deuterio” is rare but
OPSIN, SureChEMBL and chemicalize.org will do the
name-to-structure conversion
• Thomson (Derwent) and SciFinder draw them in manually
for conversion
• SureChEMBL, SCRIPDB and IBM use the Complex Work
Units from the USPTO
• These include the molfiles drawn by the contractors and
are the major source of deuteration in PubChem
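The two text-level signals mentioned in this deck (an explicit [2H] atom in a structure string, and “deuterio”/“deutero” or a “dx” label such as methyl-d3 in a name) can be illustrated with a minimal sketch. This is not the SureChEMBL, OPSIN or chemicalize.org logic; the function names and regular expressions below are assumptions made only for illustration, and real pipelines handle many more notations.

```python
import re

# Assumption: deuterium is written as an explicit isotopic hydrogen atom,
# "[2H]", optionally carrying a charge, in a SMILES-like string.
_SMILES_D = re.compile(r"\[2H[+-]?\d*\]")

# Assumption: names flag deuteration either with "deuterio"/"deutero"
# or with a lowercase, hyphen-attached "dN" label such as "methyl-d3".
_NAME_WORD = re.compile(r"deuteri?o", re.IGNORECASE)
_NAME_DN = re.compile(r"-d\d+(?![A-Za-z0-9])")  # case-sensitive on purpose


def count_deuterium_smiles(smiles: str) -> int:
    """Return the number of explicit [2H] atoms in a SMILES-like string."""
    return len(_SMILES_D.findall(smiles))


def name_mentions_deuterium(name: str) -> bool:
    """Return True if a chemical name carries one of the assumed markers."""
    return bool(_NAME_WORD.search(name) or _NAME_DN.search(name))


if __name__ == "__main__":
    print(count_deuterium_smiles("[2H]C([2H])([2H])O"))  # trideuteromethanol
    print(name_mentions_deuterium("methyl-d3"))
    print(name_mentions_deuterium("codeine"))
```

A check like this only flags candidate records. It says nothing about whether the flagged structure was ever made, which is the distinction this deck argues cannot be drawn computationally.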
Codeine: the enumeration record from US20080045558
Left panel shows a section from one of approximately 55 pages of images.
Right panel shows the first three examples from the 520-member intersect between the
992 CIDs retrieved via the patent number and the 551 from “Same,
Connectivity” for codeine (CID 5284371), ranked by molecular weight.
Thomson Pharma extracted only three examples from this patent.
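The counts in this caption can be put in proportion with a small worked calculation. The sketch below uses only the three reported numbers (992 CIDs by patent number, 551 by “Same, Connectivity”, 520 in common); the helper name `overlap_summary` is hypothetical and the shares are simple arithmetic, not figures taken from the slide.

```python
def overlap_summary(n_a: int, n_b: int, n_both: int) -> dict:
    """Summarize the overlap of two result lists from their sizes alone."""
    if n_both > min(n_a, n_b):
        raise ValueError("intersect cannot exceed either list")
    return {
        "union": n_a + n_b - n_both,   # inclusion-exclusion for two sets
        "share_of_a": n_both / n_a,    # fraction of list A found in B
        "share_of_b": n_both / n_b,    # fraction of list B found in A
    }


# A = CIDs retrieved via the patent number, B = codeine "Same, Connectivity".
codeine = overlap_summary(n_a=992, n_b=551, n_both=520)

if __name__ == "__main__":
    print(codeine["union"])
    print(f"{codeine['share_of_a']:.1%} of the patent CIDs share codeine connectivity")
    print(f"{codeine['share_of_b']:.1%} of codeine-connectivity CIDs map to this patent")
```

On these numbers, roughly 94% of the codeine-connectivity CIDs trace back to this single patent.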
SureChEMBL indexing
First structure in the list SCHEMBL12905541 corresponds to CID
237918906 which has merged the SureChEMBL SID 237918906 with
SCRIPDB SID 141460523.
Deuterated source splits
Source divergence in deuteration capture
Thomson Pharma (TRP), SCRIPDB (SCR) and SureChEMBL (SCH) have an approximate
three-way split, with the union of 64195 covering 81% of PubChem deuteration
(77882 in March 2015).
Propagation: UniChem indexing
Deuteration over time: patent surge in Thomson Pharma
TRP deuteration in PubChem on a per-year basis (left vertical axis and hatched
bars) with patent publication dates taken from the USPTO for Auspex, Concert and
Protia combined (right-hand vertical axis and solid lines with triangles).
Picking off drug structures
SciFinder results indicate invention by consortium
• SciFinder facilitated certain queries orthogonal to PubChem (e.g.
assignee query for substances)
• 19841 isotopic substances were derived from 165 Auspex patents
• 6766 were derived from 189 Concert patents
• 1959 were derived from 252 Protia patents
• Remarkably, the substance union query gave 28076 with an intersect
of only 30 as deuteration reagents
• This means the assignees somehow contrived to divide up ~600
drug filings (i.e. to avoid each other's claims)
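The arithmetic behind these bullets can be checked with a short sketch. It uses only the counts reported above; the variable and function names are illustrative, and the "surplus" is simply the summed per-assignee counts minus the union. How that surplus relates to the reported intersect of 30 is not stated on the slide, so no attempt is made to reconcile the two.

```python
# Counts as reported on the slide: (isotopic substances, patents) per assignee.
REPORTED = {
    "Auspex": (19841, 165),
    "Concert": (6766, 189),
    "Protia": (1959, 252),
}
REPORTED_UNION = 28076  # substance union query result from the slide


def summarize(counts: dict, union: int) -> dict:
    """Total the per-assignee counts and compare them with the union."""
    total_substances = sum(s for s, _ in counts.values())
    total_patents = sum(p for _, p in counts.values())
    return {
        "total_substances": total_substances,
        "total_patents": total_patents,
        # Substances counted for more than one assignee inflate the sum
        # above the union; a small surplus means little shared chemistry.
        "surplus": total_substances - union,
        "per_patent": {name: s / p for name, (s, p) in counts.items()},
    }


summary = summarize(REPORTED, REPORTED_UNION)

if __name__ == "__main__":
    print(summary["total_patents"], summary["total_substances"], summary["surplus"])
    for name, avg in summary["per_patent"].items():
        print(f"{name}: {avg:.1f} substances per patent")
```

The patent counts sum to 606, consistent with the "~600 drug filings" figure, and the surplus of 490 is under 2% of the summed substance counts, which is the near-disjointness the slide remarks on.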
Consequences and problems of virtual deuteration
• Classic case of unintended consequences
• Confounding drug analogue searching
• Breaking the PubChem unofficial rule of extant-only compounds
• Extant and virtual structures cannot be computationally separated
• Secondary submitters cause intra-PubChem proliferation
• Persistence as no-data entries
• Proliferation between open databases
• Both commercial sources of patent chemistry and source
aggregation projects within pharmaceutical companies will be
affected
• Annotation can be confounded (e.g. the attribution of biological
study in SciFinder)
• Equivocal IP situation

 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 

Causes and consequences of automated extraction of patent-specified virtual deuterated drugs

  • 1. www.guidetopharmacology.org Deuterogate: Causes and consequences of automated extraction of patent-specified virtual deuterated drugs feeding into PubChem. Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Center for Integrative Physiology, University of Edinburgh. ACS Boston CINF session: Enabling Machines to "Read" the Chemical Literature: Techniques. http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
  • 2. Abstract: The strategy of deuterating drugs to improve clinical profiles via the kinetic isotope effect has been known for over 50 years. However, recent development candidates have been predicated on a surge of opportunistic patent filings between 2008 and 2011. For automated chemical named entity recognition (CNER) these present particular challenges. These are investigated in this work by comparing sources of the 80K deuterated compounds inside PubChem. Of these, 45K originate from the patent CNER submissions of SCRIPDB, IBM and SureChEMBL, plus 23K from Thomson Pharma via manual expert curation (MEXC). For CNER there are three options: image extraction, recognition of [2H] in IUPAC text forms, or Complex Work Unit (CWU) molfiles obtained from the USPTO. For images, conversions to structures using OSRA with explicit H and D positions failed. Tests with chemicalize.org and OPSIN established that text “deuterio” did convert. The SureChEMBL pipeline also handles the “dx” prefix (e.g. methyl-d3). These tests, combined with inspection of SureChEMBL export records, confirmed that deuteration feeding into PubChem from patents was predominantly image-only derived. It was also clear that CWUs had provided the majority of these via molfiles. However, despite conceptually similar CNER pipelines, the three CNER sources showed divergent capture. Importantly, inspection of patents from the three major applicants in the deuteration IP Gold Rush indicated little reduction to practice. The unexpected consequence is that most of the ~25K derivatives in PubChem of ~500 established drugs are virtual (i.e. the structures do not exist). This Achilles heel of CNER will be discussed, since it presents database users with a dilemma between virtual swamping but possible IP significance on the one hand, versus the permanent absence of linked bioactivity data on the other.
  • 8. OSRA: fails on explicit “D-” image > struct
  • 9. The extraction problem for deuts • Majority of patents are image-only so no conversion • IUPAC specification of “deutero” and “deuterio” is rare but OPSIN, SureChEMBL and chemicalize.org will do the name-to-structure conversion • Thomson (Derwent) and SciFinder draw them in manually for conversion • SureChEMBL, SCRIPDB and IBM use the Complex Work Units from the USPTO • These include the molfiles drawn by the contractors and are the major source of deuteration in PubChem
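The text-based recognition routes mentioned on this slide (the “deuterio” name form, the “dx” style such as methyl-d3, and explicit [2H] atoms) can be illustrated with a minimal Python sketch. The regular expressions and function names below are hypothetical and chosen only for illustration; they are not the patterns used by OPSIN, SureChEMBL or chemicalize.org, and they only flag candidate strings rather than convert names to structures.

```python
import re

# Hypothetical patterns for illustration only (not taken from any real pipeline).
DEUTERIO = re.compile(r"deuteri?o", re.IGNORECASE)  # matches "deuterio" and "deutero"
D_SUFFIX = re.compile(r"-d(\d+)\b")                 # matches "-d3" as in "methyl-d3"
ISOTOPE_2H = re.compile(r"\[2H\]")                  # explicit deuterium atom in a SMILES string


def name_mentions_deuterium(name: str) -> bool:
    """Return True if a chemical name uses a 'deuterio'/'deutero' or '-dN' form."""
    return bool(DEUTERIO.search(name) or D_SUFFIX.search(name))


def count_deuterium_atoms(smiles: str) -> int:
    """Count explicit [2H] atoms written in a SMILES string."""
    return len(ISOTOPE_2H.findall(smiles))


if __name__ == "__main__":
    for name in ("1,1,1-trideuterio-N-methylmethanamine", "methyl-d3 amine", "codeine"):
        print(name, name_mentions_deuterium(name))
    print(count_deuterium_atoms("[2H]C([2H])([2H])O"))
```

A filter of this kind can only find deuteration that is spelled out in text or in a structure string, which is why it would miss the image-only patents described above.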
  • 10. Codeine: the enumeration record from US20080045558. Left panel shows a section from one of approximately 55 pages of images. Right panel shows the first three examples from the 520 intersect between the 992 CIDs retrieved via the patent number and the 551 from “Same, Connectivity” for codeine (CID 5284371), ranked by Mw. Thomson Pharma only extracted three examples from this patent.
  • 11. SureChEMBL indexing. The first structure in the list, SCHEMBL12905541, corresponds to CID 237918906, which has merged the SureChEMBL SID 237918906 with SCRIPDB SID 141460523.
  • 13. Source divergence in deuteration capture. TRP (Thomson Pharma), SCR (SCRIPDB) and SCH (SureChEMBL) have an approximate three-way split, with the union of 64195 covering 81% of PubChem deuteration (77882 as of March 2015).
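A source comparison of this kind reduces to set operations on compound identifiers. The sketch below is a minimal Python illustration using small made-up CID sets; the function names and the data are assumptions, not the actual PubChem queries or counts behind the slide.

```python
def three_way_split(a: set, b: set, c: set) -> dict:
    """Summarise how three identifier sets divide into unique and shared members."""
    union = a | b | c
    only_a = a - b - c
    only_b = b - a - c
    only_c = c - a - b
    return {
        "only_a": len(only_a),
        "only_b": len(only_b),
        "only_c": len(only_c),
        # Members found in at least two of the three sets.
        "shared": len(union) - len(only_a) - len(only_b) - len(only_c),
        "union": len(union),
    }


def coverage(union_size: int, total: int) -> float:
    """Fraction of a reference total that the union accounts for."""
    return union_size / total


if __name__ == "__main__":
    # Toy CID sets standing in for three sources (hypothetical values).
    trp = {1, 2, 3, 4}
    scr = {3, 4, 5, 6}
    sch = {4, 6, 7, 8, 9}
    summary = three_way_split(trp, scr, sch)
    print(summary)
    # Hypothetical reference total of 12 deuterated compounds.
    print(f"{coverage(summary['union'], 12):.0%}")
```

Keeping the unique and shared counts separate makes it visible when sources with similar pipelines capture largely different compounds.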
  • 15. Deuteration over time: patent surge in Thomson Pharma. TRP deuteration in PubChem on a per-year basis (left vertical axis and hatched bars), with patent publication dates taken from the USPTO for Auspex, Concert and Protia combined (right-hand vertical axis and solid lines with triangles).
  • 16. Picking off drug structures
  • 17. SciFinder results indicate invention by consortium • SciFinder facilitated certain queries orthogonal to PubChem (e.g. assignee query for substances) • 19841 isotopic substances were derived from 165 Auspex patents • Concert 6766 from 189 • Protia 1959 from 252 • Remarkably, the substance union query gave 28076 with an intersect of only 30 as deuteration reagents • This means the assignees somehow contrived to divide up ~600 drug filings (i.e. to avoid each other's claims)
  • 18. Consequences and problems of virtual deuteration • Classic case of unintended consequences • Confounding drug analogue searching • Breaking the PubChem unofficial rule of extant-only compounds • Extant and virtual structures cannot be computationally separated • Secondary submitters cause intra-PubChem proliferation • Persistence as no-data entries • Proliferation between open databases • Both commercial sources of patent chemistry and source aggregation projects within pharmaceutical companies will be affected • Annotation can be confounded (e.g. the attribution of biological study in SciFinder) • Equivocal IP situation