Will the real proteins please stand up

Poster for BPS Pharmacology 2018, London

Will the real pharmacologically significant proteins please stand up?

Christopher Southan, Simon D. Harding, Elena Faccenda, Adam J. Pawson and Jamie A. Davies
IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Discovery Brain Sciences, University of Edinburgh, UK

1. Introduction

Even in their more contemplative moments, probably few pharmacologists cogitate on “so how many human proteins actually exist?” Nevertheless, on a practical level their engagement with names and identifiers (IDs) for pharmacological protein targets and disease mechanistic components is intense and includes navigating between databases and the literature. This work addresses three important aspects of protein equivocality that pharmacologists may be less aware of but that we encounter head-on during curation of the IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) [1, 2]. These are:

  1. Variability in canonical protein counts, from 19,198 in the HUGO Gene Nomenclature Committee (HGNC) up to 21,341 in GeneCards, indicating a surprising annotation discordance for at least 10% of the human proteome.
  2. Uncertainty over the existence of alternatively spliced (AS) proteins. While Ensembl predicts over 100,000 AS mRNAs, verification of these by proteomics is 30-fold lower than expected, implying that the majority do not exist in vivo [3].
  3. Evidence that some canonical Swiss-Prot (SP) entries are not the major isoform.

2. Methods

Using UniProt we ascertained the 4-way intersect between SP protein IDs, HGNC Gene Symbols, Ensembl genes and NCBI Gene IDs. The four sets were selected using cross-reference queries from the UniProt interface. We then accessed our internal protein statistics, including the total human UniProt IDs that we had curated into GtoPdb and those for which we had annotated data-supported and pharmacologically relevant ligand interactions. These were compared to the 4-way sequence set. We also counted proteins for which UniProt had curated splice forms using the query “Alternative splicing (KW-0025)”. We then compared these with our ligand interaction set. We also inspected one splice form that has been annotated in GtoPdb and checked the information in SP. To address the isoform abundance question we queried the Annotation of principal and alternative splice isoforms (APPRIS) database to check targets [4].

3. Divergence of protein identifiers

In the Venn diagram on the right, the 4-way intersect shows that these four major global pipelines concur for fewer than 19,000 protein-coding genes. Most divergent is the 829 SP-only set. Inspection established that many of these are categorised as pseudogenes by HGNC [5]. This surprising result includes some missing genomic cross-mappings inside SP. However, the consensus is close to the HGNC count of 19,118 (note that Ensembl and NCBI reciprocally cross-map, hence the empty sections).

4. Comparing the consensus with GtoPdb

Our next step was to compare the 4-way set from the comparison above (blue) with a) all the human proteins we have entered in GtoPdb (yellow) and b) those proteins that have a curated interaction (mostly quantitative) against one or more of the 9405 ligands (green). The results were generally as expected in confirming that the majority of our proteins are within the 4-way set (i.e. solidly supported). However, the analysis was valuable in detecting minor anomalies (represented in the segments of 5, 6 and 23). These are being followed up, but a major factor is that some of these are missing GeneID cross-references in Swiss-Prot (i.e. are blue false negatives).

5. Protein alternative splicing

It is difficult to find papers with solid data showing AS affecting proteins for which we have curated ligand interactions and which may thus exhibit differential pharmacology. Many publications indicate that AS transcription a) is widespread, b) affects the majority of the mammalian proteome and c) is likely to be functionally important in various biological contexts (e.g. tumours and brain tissue), even if the mechanisms are unclear. Notwithstanding, there are major uncertainties in proving the existence of AS proteins since they are difficult to verify in vivo.

We approached this question by counting our interaction proteins with AS sequence variants annotated in Swiss-Prot. The results of this are shown on the right. The yellow circle indicates that 52% of human SP has at least one AS protein sequence annotated. This rises slightly to 54% in our interaction set (blue). Importantly, AS in SP is target-class specific, rising to 70% for kinases but only 14% for GPCRs (since many are single-exon genes). Note that Ensembl predicts considerably more potential AS sequences than SP curates.

In GtoPdb we only assign quantitative and differentially specific AS-ligand interactions if the papers meet our curatorial stringency. We also need evidence that data-supported differential binding has pharmacological significance. This is challenging for many reasons that cannot be expanded on here (but we would be pleased to discuss). Consequently, we have only one AS entry: the interaction between protein target 2903 (claudin18) and antibody ligand 9209 (below, together with the AS first exon).

The specific case of claudin18, and its extrapolation to other AS proteins in GtoPdb, raises the question as to which sequence may be quantitatively dominant (i.e. the principal isoform in vivo). However, there are inherent challenges in quantifying AS-specific peptides by mass-spec proteomics or in estimating surrogate relative abundances from transcription data. We thus chose the APPRIS database, which uses a range of computational methods and coverage scores to select the most likely principal isoform. In this case the two SP isoforms scored equally.

6. Discussion points

  • In addition to AS touched on here, additional sources of protein equivocality and heterogeneity include alternative initiations and post-translational modifications.
  • The multiplexing of these from a canonical set of ~19,000 proteins (still without a consensus) is predicted to run into the millions.
  • The significance of this for pharmacology, systems biology and drug discovery is acknowledged to be high, but getting solid experimental data is difficult.
  • GtoPdb users are welcome to alert us to potentially curatable papers on differential ligand interactions related to any forms of protein heterogeneity (@GuidetoPHARM).

We especially thank all contributors, collaborators and NC-IUPHAR members.

7. References

  1. Harding SD, et al. (2018) Nucl. Acids Res. 46 (Database Issue): D1091-D1106.
  2. Southan C, et al. (2018) ACS Omega 3(7), PMID: 30087946.
  3. Rodriguez JM, et al. (2018) Nucl. Acids Res. 46 (Database Issue): D213-D217.
  4. Tress ML, et al. (2017) Trends Biochem Sci. 42(2): 98-110.
  5. Southan C (2017) F1000Res. 7;6:448.
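As an illustration of the set comparisons described in the Methods, the following is a minimal sketch in Python, not the authors' actual pipeline. It assumes each cross-reference source has already been reduced to a set of Swiss-Prot accessions carrying that cross-reference. The accessions ("A1" to "A5"), the dictionary keys and the function names are all hypothetical placeholders. The last helper reproduces the arithmetic behind the "at least 10%" discordance using the two counts quoted in the Introduction.

```python
from typing import Dict, Set


def four_way_intersect(xrefs: Dict[str, Set[str]]) -> Set[str]:
    """Return the accessions that appear in every cross-reference set."""
    if not xrefs:
        return set()
    return set.intersection(*xrefs.values())


def only_in(name: str, xrefs: Dict[str, Set[str]]) -> Set[str]:
    """Return the accessions found in the named set and in no other set."""
    others: Set[str] = set()
    for key, accessions in xrefs.items():
        if key != name:
            others |= accessions
    return xrefs[name] - others


def percent_spread(low: int, high: int) -> float:
    """Difference between two counts as a percentage of the lower count."""
    return (high - low) / low * 100.0


# Toy data only: hypothetical accessions, not real UniProt entries.
xrefs = {
    "swissprot": {"A1", "A2", "A3", "A4", "A5"},
    "hgnc": {"A1", "A2", "A3"},
    "ensembl": {"A1", "A2", "A4"},
    "ncbi_gene": {"A1", "A2", "A4"},
}

consensus = four_way_intersect(xrefs)   # supported by all four sources
sp_only = only_in("swissprot", xrefs)   # analogous to the SP-only segment

# Hypothetical curated target set compared against the consensus.
gtopdb_targets = {"A1", "A3"}
outside_consensus = gtopdb_targets - consensus

# Counts quoted in the Introduction: HGNC 19,198 and GeneCards 21,341.
spread = percent_spread(19198, 21341)

print(sorted(consensus), sorted(sp_only), sorted(outside_consensus))
print(f"{spread:.1f}% spread between canonical counts")
```

Plain sets keep each Venn segment a one-line expression. With real data, the sets would be populated from exported UniProt query results rather than typed in by hand.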