Curatorial data wrangling for the Guide to PHARMACOLOGY
1. www.guidetopharmacology.org
Virtues and vicissitudes of curatorial data wrangling:
the Guide to PHARMACOLOGY experience
Christopher Southan, Adam J. Pawson, Joanna L. Sharman, Elena
Faccenda, Simon Harding, Jamie Davis, IUPHAR/BPS Guide to
PHARMACOLOGY, Centre for Integrative Physiology, University of
Edinburgh
Sun, Mar 13, 2016, 8:30 AM - 12:00 PM. CINF 6: Tomayto vs. Tomahto:
Overcoming Incompatibilities in Scientific Data
http://www.slideshare.net/cdsouthan/curatorial-data-wrangling-for-the-guide-to-pharmacolgy
2. Abstract (will be skipped for presenting)
A wide range of valuable databases, both academic and commercial, use the curation
model to extract and standardise selected result sets from the literature. This is the
classic unstructured-to-structured transformation, predominantly of target binding data
(e.g. IC50, Ki or Kd) between ligands and targets. Since 2009 the Guide to
PHARMACOLOGY database (GtoPdb) has curated quantitative interactions between 1300
protein targets and 6000 ligands, covering a substantial proportion of the druggable
proteome. The team thus has considerable expertise in the challenges of
standardisation. This needs to be applied not only to the primary literature
but also to the other databases that have extracted data from it. The wide range of
compatibility and other issues associated with selecting new content for GtoPdb will
be outlined. These will include the problem of equivocal measurement units as well as
the non-standard chemical and protein IDs used by authors. The issue of data gaps will
also be expanded on. The presentation will conclude with an assessment of new
initiatives from several publishers for authors to mark-up their chemical compounds
before publication.
3. Outline
• Introducing our dataset
• Standardising entities
• Protein differences
• Chemistry
• The units case for BIA 10-2474
• Author mark-up
5. Standardising entities:
a major challenge of expert document extraction
• Judgment vs author primacy (e.g. do we transform/round units or fix errors?)
• Which document is it?
• Doi vs PubMed vs European PubMed Central
• Primary vs secondary literature
• Patent number format (WO- vs WO), kind codes (A1, B2), patent families
• Slide sets – is url persistent?
• Is the ligand chemically definable?
• Small molecule (PubChem CID, isomeric SMILES, InChIKey)
• A provenanced name-to-structure mapping for the company code name(s)
• Peptide (IUPAC, SMILES/InChI, sequence, HELM, Swiss-Prot feature line)
• Protein (clinical antibody, IMGT sequence, patent sequence, UniParc ID)
• Can the target be given a protein ID(s)?
• And the species was?
• Protein name/synonym-to-ID-to-sequence
• Was an alternative splice form or sequence variant specified?
• Should a complex be specified? (NC-IUPHAR complex, ChEMBL group, name
e.g. gamma secretase)
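The patent number point above (WO- vs WO, kind codes) can be illustrated with a small normaliser. This is only a sketch under stated assumptions: the regular expression, the `PatentNumber` tuple and the function names are hypothetical, not part of any GtoPdb pipeline, and real-world formats (e.g. WO numbers written with a slash) would need extra handling.

```python
import re
from typing import NamedTuple, Optional


class PatentNumber(NamedTuple):
    office: str           # two-letter office code, e.g. "WO"
    serial: str           # digits exactly as published
    kind: Optional[str]   # kind code such as "A1" or "B2", if present


# Assumed pattern: office code, optional separator, digits,
# optional separator, optional kind code (a letter plus an optional digit).
_PATENT_RE = re.compile(r"^([A-Z]{2})[-\s]?(\d+)[-\s]?([A-Z]\d?)?$")


def parse_patent_number(raw: str) -> PatentNumber:
    """Split a patent number string into office, serial and kind code."""
    match = _PATENT_RE.match(raw.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised patent number: {raw!r}")
    office, serial, kind = match.groups()
    return PatentNumber(office, serial, kind)


def canonical_key(raw: str) -> str:
    """Return a key that ignores separators and kind codes."""
    parsed = parse_patent_number(raw)
    return f"{parsed.office}{parsed.serial}"


if __name__ == "__main__":
    for text in ("WO2010074588", "WO-2010074588-A1", "wo 2010074588 a1"):
        print(text, "->", canonical_key(text), parse_patent_number(text).kind)
```

Dropping the kind code from the key is a design choice for de-duplication only; the kind code is kept in the parsed tuple because A and B publications of the same number can differ in content.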
6. Protein ID and name standardisation:
philosophical and technical discordance between the pipelines
The 7-way concordance shrinks the proteome by 9%
UniProt + human = 151,569
UniProt + human + Swiss-Prot = 20,198
+ neXtProt = 20,040
+ HGNC = 19,836
+ Ensembl = 18,933
+ CCDS = 18,286
+ Entrez Gene ID = 18,245
+ RefSeq = 18,244
+ Evidence at protein level = 14,065
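The arithmetic behind the concordance figures can be checked with a few lines of Python. The counts are copied from the slide; the assumption here is that the "7-way" concordance runs from the Swiss-Prot row to the RefSeq row, with the UniProt total and the protein-level evidence row left out. The function names are illustrative only.

```python
# Counts copied from the slide; each row adds one more required cross-reference.
FILTERS = [
    ("UniProt + human + Swiss-Prot", 20198),
    ("+ neXtProt", 20040),
    ("+ HGNC", 19836),
    ("+ Ensembl", 18933),
    ("+ CCDS", 18286),
    ("+ Entrez Gene ID", 18245),
    ("+ RefSeq", 18244),
]


def step_losses(rows):
    """Return (label, count, entries lost versus the previous row)."""
    result = []
    previous = None
    for label, count in rows:
        lost = 0 if previous is None else previous - count
        result.append((label, count, lost))
        previous = count
    return result


def percent_shrink(rows):
    """Percentage lost between the first and last rows."""
    first, last = rows[0][1], rows[-1][1]
    return 100.0 * (first - last) / first


if __name__ == "__main__":
    for label, count, lost in step_losses(FILTERS):
        print(f"{label:32s} {count:>7,d}  lost {lost:>4d}")
    print(f"overall shrink: {percent_shrink(FILTERS):.1f}%")
```

On these numbers the largest single drop (903 entries) comes from requiring an Ensembl cross-reference, and the overall loss is a little under 10%, in line with the 9% quoted above.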
11. Measurement units: comparing with BIA-10-2474
• Suddenly (from Jan 15th 2016) there was serious need for
standardised comparative affinity values for FAAH inhibitors
• BIA 10-2474 in vitro results in WO2010074588 are 27.8% inhibition
at 100 nM in rat brain extracts
• J&J paper reports inactivation of the enzyme but lists an IC50
• Pfizer reports kinact/Ki and IC50 values of 40300 M^-1 s^-1 and
7.2 nM for PF-04457845
• V158866 from Vernalis; no structure (blinded) no papers, no affinity
data, no Phase II results
• Absence of normalised comparators for 10-2474 SAR compromises
in silico modeling and prediction studies
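A minimal sketch of why these results resist normalisation: only a true concentration-type value (IC50, Ki, Kd) converts cleanly to a negative-log scale such as pIC50, while a single-point percent inhibition or a kinact/Ki rate constant does not. The record layout, field names and function names below are assumptions made for illustration; the values are those quoted on the slide.

```python
import math

# Multipliers from common concentration units to molar.
_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

# Measurement types that are concentrations and so can be put on a p-scale.
_CONVERTIBLE = {"IC50", "Ki", "Kd"}


def to_p_value(value: float, unit: str) -> float:
    """Convert a concentration to its negative log10 molar value."""
    if unit not in _TO_MOLAR:
        raise ValueError(f"unsupported unit: {unit!r}")
    if value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value * _TO_MOLAR[unit])


def normalise(record: dict) -> dict:
    """Return a copy of the record with a 'p_value' (None if not comparable)."""
    out = dict(record)
    if record["type"] in _CONVERTIBLE:
        out["p_value"] = to_p_value(record["value"], record["unit"])
    else:
        out["p_value"] = None  # no assumption-free conversion exists
    return out


if __name__ == "__main__":
    records = [
        {"ligand": "PF-04457845", "type": "IC50", "value": 7.2, "unit": "nM"},
        {"ligand": "PF-04457845", "type": "kinact/Ki", "value": 40300,
         "unit": "M^-1 s^-1"},
        {"ligand": "BIA 10-2474", "type": "% inhibition at 100 nM",
         "value": 27.8, "unit": "%"},
    ]
    for rec in records:
        print(normalise(rec))
```

Returning `None` instead of an estimated value is deliberate: any conversion of the percent inhibition figure would require assumptions about the inhibition model that the source documents do not support.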
13. Will author mark-up overcome tomayto vs. tomahto?
• No – but it would help a lot
• Authors are not always as au fait with their own entities as
biocurators are
• Needs strong publisher/editor mandate and referee buy-in
• Automatic mark-up is useful for triage but can produce a dog's breakfast
of false positives
• Initiatives underway from Nature Chem. Biol., ACS J. Med. Chem., Wiley
Brit. J. Pharmacol. and others (see
http://cdsouthan.blogspot.com/2015/08/joining-chemistry-between-journals-and.html)
• Need to accelerate in crucial areas (e.g. Zika and Dengue virus)
• Useful not only if authors cite their SAR patents but also if they push back
against their attorneys to prevent data relationship obfuscation