Curatorial data wrangling for the Guide to PHARMACOLOGY

www.guidetopharmacology.org
Virtues and vicissitudes of curatorial data wrangling:
the Guide to PHARMACOLOGY experience
Christopher Southan, Adam J. Pawson, Joanna L. Sharman, Elena
Faccenda, Simon Harding, Jamie Davis, IUPHAR/BPS Guide to
PHARMACOLOGY, Centre for Integrative Physiology, University of
Edinburgh
Sun, Mar 13, 2016 CINF 6: Tomayto vs. Tomahto: Overcoming
Incompatibilities in Scientific Data, 8:30 AM - 12:00 PM
http://www.slideshare.net/cdsouthan/curatorial-data-wrangling-for-the-guide-to-pharmacolgy

Abstract (will be skipped for presenting)
A wide range of valuable databases, both academic and commercial, use the curation
model to extract and standardise selected result sets from the literature. This is the
classic unstructured-to-structured transformation, predominantly of target binding data
(e.g. IC50, Ki or Kd) between ligands and targets. Since 2009 the Guide to
PHARMACOLOGY (GtoPdb) has curated quantitative interactions between 1300
protein targets and 6000 ligands, covering a substantial proportion of the druggable
proteome. The team thus has considerable expertise in the challenges of
standardisation. This needs to be applied not only to the primary literature
but also to other databases that have extracted data. The wide range of compatibility
and other issues associated with selecting new content for GtoPdb will be outlined.
These will include the problem of equivocal measurement units as well as the
non-standard chemical IDs and protein IDs used by authors. The issue of data gaps will
also be expanded on. The presentation will conclude with an assessment of new
initiatives from several publishers for authors to mark-up their chemical compounds
before publication.

Outline
• Introducing our dataset
• Standardising entities
• Protein differences
• Chemistry
• The units case for BIA 10-2474
• Author mark-up

GtoPdb content
(Figure panels: Human targets; Ligands)

Standardising entities:
a major challenge of expert document extraction
• Judgment vs author primacy (e.g. do we transform/round units or fix errors?)
• Which document is it?
  • DOI vs PubMed vs Europe PubMed Central
  • Primary vs secondary literature
  • Patent number format (WO- vs WO), kind codes (A1, B2), patent families
  • Slide sets: is the URL persistent?
• Is the ligand chemically definable?
  • Small molecule (PubChem CID, isomeric SMILES, InChIKey)
  • A provenanced name-to-structure mapping for the company code name(s)
  • Peptide (IUPAC, SMILES/InChI, sequence, HELM, Swiss-Prot feature line)
  • Protein (clinical antibody, IMGT sequence, patent sequence, UniParc ID)
• Can the target be given a protein ID(s)?
  • And the species was?
  • Protein name/synonym-to-ID-to-sequence
  • Was an alternative splice form or sequence variant specified?
  • Should a complex be specified? (NC-IUPHAR complex, ChEMBL group, name, e.g. gamma secretase)
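
As an illustration of the patent number format point above, the sketch below splits a published patent number into country code, serial number and kind code so that "WO-" and "WO" styles compare equal. The function names and the regular expression are hypothetical, written for this example only. It does not attempt to resolve patent families, which need a lookup against a patent database.

```python
import re

# Hypothetical pattern: two-letter country code, digits, optional kind code,
# with optional hyphens or spaces between the parts.
PATENT_RE = re.compile(r"^([A-Z]{2})[-\s]?(\d+)[-\s]?([A-Z]\d?)?$")


def normalise_patent_number(raw):
    """Split a patent number string into country, serial and kind code."""
    match = PATENT_RE.match(raw.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised patent number: {raw!r}")
    country, serial, kind = match.groups()
    return {"country": country, "serial": serial, "kind": kind}


def same_publication(a, b):
    """True when two strings name the same publication, ignoring kind codes."""
    x, y = normalise_patent_number(a), normalise_patent_number(b)
    return (x["country"], x["serial"]) == (y["country"], y["serial"])


print(normalise_patent_number("WO-2010074588-A1"))
print(same_publication("WO2010074588", "WO-2010074588-A1"))
```

Formats with slashes or other separators are rejected rather than guessed at, which leaves the decision with the curator.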

Protein ID and name standardisation:
philosophical and technical discordance between the pipelines
The 7-way concordance shrinks the proteome by 9%
UniProt + human = 151,569
UniProt + human + Swiss-Prot = 20,198
+ neXtProt = 20,040
+ HGNC = 19,836
+ Ensembl = 18,933
+ CCDS = 18,286
+ Entrez Gene ID = 18,245
+ RefSeq = 18,244
+ Evidence at protein level = 14,065
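
The shrinkage in the counts above can be checked with a few lines of arithmetic. This is a minimal sketch that takes the Swiss-Prot human set (20,198) as the baseline, which is an assumption made for illustration. On that baseline the seven-way intersection (18,244) is a loss of between 9 and 10%.

```python
# Counts copied from the slide. Using the Swiss-Prot set as the baseline
# is an assumption for this illustration.
steps = [
    ("UniProt + human + Swiss-Prot", 20198),
    ("+ neXtProt", 20040),
    ("+ HGNC", 19836),
    ("+ Ensembl", 18933),
    ("+ CCDS", 18286),
    ("+ Entrez Gene ID", 18245),
    ("+ RefSeq", 18244),
]


def cumulative_loss(steps):
    """Return (label, count, percent lost relative to the first step)."""
    baseline = steps[0][1]
    return [
        (label, count, 100.0 * (baseline - count) / baseline)
        for label, count in steps
    ]


for label, count, loss in cumulative_loss(steps):
    print(f"{label:<32}{count:>8,}{loss:>7.1f}%")
```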

Protein standardisation:
transatlantic sequence differences
UniProtKB - P56817 vs NP_036236.1

Protein standardisation:
transatlantic (and intra-Europe) naming and splice variants
• Hinxton (HGNC) approved (gene) name: beta-site APP-cleaving enzyme 1
• Geneva (Swiss-Prot) recommended (protein) name: Beta-secretase 1
• Bethesda (RefSeq) NP_036236: beta-secretase 1 isoform A preproprotein
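
One way to reconcile the three names above is a synonym table that resolves each to a single protein ID. The sketch below uses the names from this slide and the UniProtKB accession P56817 from the previous one. The table and function are hypothetical and only illustrate the name/synonym-to-ID step.

```python
# Names copied from the slide; the lookup structure itself is hypothetical.
SYNONYMS = {
    # Hinxton (HGNC) approved gene name
    "beta-site app-cleaving enzyme 1": "P56817",
    # Geneva (Swiss-Prot) recommended protein name
    "beta-secretase 1": "P56817",
    # Bethesda (RefSeq) NP_036236 name
    "beta-secretase 1 isoform a preproprotein": "P56817",
}


def resolve_protein_name(name):
    """Map an author-supplied name to an accession, or None if unknown."""
    key = " ".join(name.lower().split())  # fold case and collapse whitespace
    return SYNONYMS.get(key)


print(resolve_protein_name("Beta-secretase 1"))
print(resolve_protein_name("gamma secretase"))
```

Returning None for an unknown name, rather than guessing, leaves the case for manual curation.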

Making chemical curation easy:
Amgen giveth in their papers (IUPAC name in the title)

Making chemical curation difficult:
Amgen taketh away in their patents…

Measurement units: comparing with BIA 10-2474
• Suddenly (from Jan 15th 2016) there was a serious need for
standardised comparative affinity values for FAAH inhibitors
• BIA 10-2474 in vitro results in WO2010074588 are 27.8% inhibition
at 100 nM in rat brain extracts
• The J&J paper reports enzyme inactivation but lists IC50 values
• Pfizer report kinact/Ki and IC50 values of 40,300 M⁻¹ s⁻¹ and
7.2 nM for PF-04457845
• V158866 from Vernalis: no structure (blinded), no papers, no affinity
data, no Phase II results
• Absence of normalised comparators for 10-2474 SAR compromises
in silico modeling and prediction studies
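
To show what a normalised comparator could look like, the sketch below converts a concentration-based value such as an IC50 into its negative log10 molar form (pIC50). The unit table and function name are assumptions for illustration, not a description of the GtoPdb pipeline. A single-point result such as 27.8% inhibition at 100 nM, or a kinact/Ki rate constant, is not a concentration and cannot be put through this conversion, which is the gap the bullets above describe.

```python
import math

# Molar multipliers for common concentration units (illustrative subset).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_negative_log(value, unit):
    """Convert a concentration (e.g. IC50, Ki, Kd) to -log10 of its molar value."""
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"Unsupported unit: {unit!r}")
    if value <= 0:
        raise ValueError("Concentration must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


# IC50 of 7.2 nM reported for PF-04457845
print(round(to_negative_log(7.2, "nM"), 2))
```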

Measurement units:
Authors/inventors should read off their own affinity values
rather than us having to

Will author mark-up overcome tomayto vs. tomahto?
• No – but it would help a lot
• Authors are not always as au fait with their own entities as
biocurators are
• Needs strong publisher/editor mandate and referee buy-in
• Automatic mark-up is useful for triage but can produce a dog's breakfast
of false positives
• Initiatives are underway from Nature Chem. Biol., ACS J. Med. Chem., Wiley
Brit. J. Pharmacol. and others (see
http://cdsouthan.blogspot.com/2015/08/joining-chemistry-between-journals-and.html)
• Need to accelerate in crucial areas (e.g. Zika and Dengue virus)
• Useful not only if authors cite their SAR patents but also if they push back
against their attorneys to prevent obfuscation of data relationships

Author mark-up example: J Med Chem

Author/curator collaboration model: Brit J Pharmacol

References, acknowledgments and questions
http://www.ncbi.nlm.nih.gov/pubmed/24234439
http://www.ncbi.nlm.nih.gov/pubmed/23159359
GtoPdb FAQ section (explains curation guidelines): http://www.guidetopharmacology.org/faq.jsp
BIA 10-2474 story: http://cdsouthan.blogspot.com/2016/01/the-unfortunate-case-of-bia-10-2474.html
