The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number



Presented at: In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot Fortaleza, Brazil, August 2006

Published in: Technology


  1. 1. The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number. In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza, Brazil, August 2006. Christopher Southan, Molecular Pharmacology, AstraZeneca R&D, Mölndal
  2. 2. Presentation Outline <ul><li>The importance of gene number </li></ul><ul><li>Gene definition and detection </li></ul><ul><li>Genome inflation arguments </li></ul><ul><li>Post-completion changes in model eukaryotes </li></ul><ul><li>Ensembl pipeline numbers </li></ul><ul><li>The smORF question </li></ul><ul><li>Completed chromosomes </li></ul><ul><li>International Protein Index </li></ul><ul><li>Novel gene skimming </li></ul><ul><li>Updates </li></ul><ul><li>Conclusions </li></ul>
  3. 3. So Who Cares About Human Protein Coding Gene Number? <ul><li>Central to evolutionary questions of gene number expansion vs. protein diversity from alternative splicing and post-translational modifications </li></ul><ul><li>Mammalian gene totals expected to be similar but clade-specific genes may be important for speciation </li></ul><ul><li>Accurate ORF delineation essential for genetic association studies and transcript profiling </li></ul><ul><li>MS-based proteomics needs a complete ORFome for the peptide and protein identification search space </li></ul><ul><li>For Pharma and Biotech the numbers set finite limits for potential drug targets and therapeutic proteins </li></ul><ul><li>The Swiss-Prot Human Proteomics Initiative (HPI) team </li></ul>
  4. 4. Definitions <ul><li>The basal (unspliced) protein-coding gene number: “transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation” </li></ul><ul><li>However, the Guidelines for Human Gene Nomenclature define a gene as: “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology” </li></ul><ul><li>The increasing complexity of the transcriptome makes the wider definition of “gene” more difficult, e.g. microRNA and antisense RNA </li></ul>
  5. 5. Identifying Protein Coding Genes <ul><li>In silico </li></ul><ul><li>Detection of protein identity in genomic DNA </li></ul><ul><li>Gene prediction with protein similarity support </li></ul><ul><li>Matches with ESTs that include ORFs and/or splice sites </li></ul><ul><li>Cross-species comparisons for orthologous exon detection </li></ul><ul><li>Presence of gene anatomy features e.g. CpG islands, promoters, transcription start sites, polyadenylation signals </li></ul><ul><li>Absence of pseudogene disablements or repeat elements </li></ul><ul><li>In vitro </li></ul><ul><li>Cloning of predicted genes </li></ul><ul><li>Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation </li></ul><ul><li>Loss-of-function approaches </li></ul><ul><li>High-throughput transcript sampling by EST, MPSS or SAGE tags </li></ul><ul><li>Heterologous expression of cDNAs </li></ul><ul><li>Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing </li></ul>
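One of the in silico criteria listed above, the presence of an ORF in a transcript sequence, can be illustrated with a minimal sketch. The function name `find_orfs`, the default 100-codon threshold and the toy sequence are illustrative assumptions, not part of the original presentation, and the scan covers the forward strand only. Real annotation pipelines combine this kind of signal with the other lines of evidence on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon. min_codons counts coding codons (ATG included, stop excluded).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy transcript: ATG, five alanine codons, then a stop codon.
toy_transcript = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
short_orfs = find_orfs(toy_transcript, min_codons=3)
long_orfs = find_orfs(toy_transcript)  # default threshold of 100 codons
```

With the default threshold the toy ORF is rejected, which mirrors why very short ORFs are hard to separate from chance occurrences on sequence evidence alone.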
  6. 6. Historical Arguments and Estimates for High Gene Numbers <ul><li>Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates </li></ul><ul><li>Gene prediction programs have a significant false-negative rate </li></ul><ul><li>The Ensembl gene annotation pipeline is conservative </li></ul><ul><li>Mammalian protein and transcript coverage is incomplete </li></ul><ul><li>Chromosome annotation teams find more genes than automated pipelines </li></ul><ul><li>Selective transcript skimming experiments have revealed new genes </li></ul><ul><li>Extensive mammalian genomic sequence conservation outside known exons </li></ul><ul><li>Postulated large numbers of undetected small proteins (“smORFs” or “dark matter”) </li></ul><ul><li>EST clustering and commercial “gene inflation” claims </li></ul>Figure panels: Genesweep 2000; Literature estimates
  7. 7. Model Eukaryotes: No Significant Post-Completion Gene Increases <ul><li>S. pombe: 3% increase since 2002 </li></ul><ul><li>S. cerevisiae: 8% decrease since 1997 </li></ul><ul><li>C. elegans: 5% increase since 1998 </li></ul><ul><li>D. melanogaster: 0.2% increase since 2001 </li></ul><ul><li>Little increase in spite of global functional genomics focus </li></ul>
  8. 8. Human Transcripts: Post-genomic mRNA Growth in UniGene <ul><li>Rapid growth in redundant mRNA </li></ul><ul><li>But slow growth in clustered set ~ 9,000 over 2 years with plateau ~ 28,000 </li></ul><ul><li>Includes splice variants and some spurious ORFs </li></ul>
  9. 9. Ensembl Human Gene Number <ul><li>Only 22,218 genes, a decrease of 1,826 over 4 years </li></ul><ul><li>Knowns: from 90% to 95% </li></ul><ul><li>Novel genes: from 12,398 to 2,263 </li></ul><ul><li>Exons-per-gene: from 6.5 to 9.6 </li></ul><ul><li>Alternative splicing: from 3,669 to 8,078 </li></ul>
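The first bullet implies a starting total that the slide does not state. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the derived starting total is an inference from the two quoted figures rather than a number taken from an Ensembl release.

```python
current_genes = 22_218   # Ensembl total quoted on the slide
decrease = 1_826         # quoted decrease over 4 years

# Implied total four years earlier (inferred, not quoted in the source).
earlier_genes = current_genes + decrease

# Relative change, expressed against the earlier total.
relative_decrease_pct = round(100 * decrease / earlier_genes, 1)

print(earlier_genes)          # 24044
print(relative_decrease_pct)  # 7.6
```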
  10. 10. Addressing the smORF Question: Protein Size Distributions in Human SPTr <ul><li>Pre Oct-01: 6.3% > 100aa </li></ul><ul><li>Post Oct-01: 5.5% > 100aa </li></ul><ul><li>“Novel” in title: 3.4% > 100aa </li></ul>
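A size-distribution comparison of this kind reduces to counting database entries on either side of a 100-residue cutoff. A minimal sketch, assuming FASTA-formatted input: the function names `fasta_lengths` and `fraction_below` and the toy records are hypothetical, and the value it produces is unrelated to the percentages on the slide.

```python
def fasta_lengths(text):
    """Map each FASTA header (without the leading '>') to its sequence length."""
    lengths = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths


def fraction_below(lengths, cutoff=100):
    """Fraction of sequences strictly shorter than cutoff residues."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths.values() if n < cutoff) / len(lengths)


# Hypothetical toy records of 50, 150, 99 and 100 residues.
toy_fasta = (
    ">p1\n" + "M" * 50 + "\n"
    ">p2\n" + "M" * 80 + "\n" + "K" * 70 + "\n"
    ">p3\n" + "M" * 99 + "\n"
    ">p4\n" + "M" * 100 + "\n"
)
toy_lengths = fasta_lengths(toy_fasta)
toy_fraction = fraction_below(toy_lengths)
```

Running the same count on database snapshots taken before and after a given date would give the kind of comparison the slide summarises.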
  11. 11. Summarising the smORF Question <ul><li>The “triple postulate”, i.e. a combination of gene prediction failure, no homology and absence of transcription data, seems unlikely </li></ul><ul><li>No database evidence for increased smORF discovery in mammals </li></ul><ul><li>The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals </li></ul><ul><li>Although small proteins evolve more rapidly there is no precedent for complete loss of ortholog similarity signal </li></ul><ul><li>Those much shorter than 100 residues will fall below the threshold necessary to fold into the domain structures necessary for biological function </li></ul><ul><li>No evidence for de novo gene “invention” in higher eukaryotes </li></ul>
  12. 12. Release History of the International Protein Index: Only Slow Increases in the Non-redundant Protein Sets 56537 Entries
  13. 13. Experimental Transcript Skimming as Evidence for High Protein Numbers <ul><li>Exon arrays (Dunham et al. 1999) </li></ul><ul><li>Gene arrays (Penn et al. 2000) </li></ul><ul><li>RT-PCR (Das et al. 2001) </li></ul><ul><li>SAGE tags (Saha et al. 2002, Chen et al. 2002) </li></ul><ul><li>Oligo tiling of chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004) </li></ul><ul><li>A full-length ORF with the features of gene anatomy must be submitted to the public databases before the discovery of novel proteins can be claimed; none of these publications submitted any </li></ul><ul><li>There is increasing evidence for significant amounts of non-ORF transcription in human and mouse </li></ul>
  14. 14. Gene Numbers for Individual Completed Chromosomes <ul><li>Averaging the completed chromosomes exceeds Ensembl genes by ~12% </li></ul><ul><li>Extrapolates to ~25,000 genes without “novel transcripts” or “putatives” </li></ul><ul><li>Extrapolates to ~28,000 genes without “putatives” </li></ul><ul><li>Extrapolates to ~31,000 genes with “putatives” </li></ul><ul><li>The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7) </li></ul><ul><li>Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers </li></ul><ul><li>Future status of novel transcripts and putative genes unclear; most will be non-coding </li></ul>
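The first extrapolation can be checked against the Ensembl total quoted earlier in the talk. A minimal sketch, assuming the baseline is the 22,218-gene Ensembl figure (the slide does not say which release was used); the implied excess for the two larger estimates is derived here for illustration and is not stated in the source.

```python
ensembl_genes = 22_218  # assumption: the Ensembl total quoted earlier in the talk
excess = 0.12           # "~12%" from the first bullet

# Genome-wide extrapolation from the averaged chromosome excess.
extrapolated = round(ensembl_genes * (1 + excess))
print(extrapolated)  # 24884, consistent with the quoted ~25,000

# Excess over the same baseline implied by the larger estimates (derived values).
quoted_totals = {
    "without putatives": 28_000,
    "with putatives": 31_000,
}
implied_excess_pct = {
    label: round(100 * (total / ensembl_genes - 1), 1)
    for label, total in quoted_totals.items()
}
print(implied_excess_pct)
```

Under this assumed baseline, counting novel transcripts and putatives would correspond to roughly 26% and 40% more genes than Ensembl, which shows how strongly the totals depend on the grade of gene definition.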
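  15. 15. Disappearing Novelty <ul><li>EMBL hum cds March 2003 = 1491 </li></ul><ul><li>Plus “novel” = 159 </li></ul><ul><li>Plus PubMed 2003 = 120 </li></ul><ul><li>Novel in title = 11 </li></ul><ul><li>Previous cds = 8 </li></ul><ul><li>Novel genes = 2 </li></ul><ul><li>Now both in RefSeq and Ens 18.34 </li></ul>
      <table>
      <tr><th>Earlier sequence</th><th>EST</th><th>NCBI 31</th><th>Ens 31</th><th>Accession and date</th><th>Name</th></tr>
      <tr><td>AK091256 15-JUL-2002</td><td>√</td><td>NP_060123</td><td>ENSG00000141627</td><td>BK000950 26-FEB-2003</td><td>Dymeclin</td></tr>
      <tr><td>AF063599 02-JAN-2001</td><td>√</td><td>XP_166119</td><td>ENSG00000075407</td><td>AY184389 28-JAN-2003</td><td>Zinc finger ZZaPK</td></tr>
      <tr><td>BC030506 20-MAY-2002</td><td>√</td><td>NP_689998</td><td>ENSG00000174500</td><td>AF521911 14-JAN-2003</td><td>HGAL-IL4 inducible</td></tr>
      <tr><td>L15344 25-MAY-1995</td><td>√</td><td>NP_787048</td><td>ENSG00000084652</td><td>AF516206 17-FEB-2003</td><td>Taxilin</td></tr>
      <tr><td>_</td><td>_</td><td>_</td><td>_</td><td>AF512521 12-JAN-2003</td><td>Ligand-gated channel subunit</td></tr>
      <tr><td>AK092564 15-JUL-2002</td><td>√</td><td>_</td><td>ENSG00000163126</td><td>AF492401 18-JAN-2003</td><td>Diabetes related ankyrin repeat</td></tr>
      <tr><td>BC026194 09-APR-2002</td><td>√</td><td>NP_786887</td><td>ENSG00000164304</td><td>AF414185 27-FEB-2003</td><td>CAGE-1</td></tr>
      <tr><td>AB051553 07-FEB-2001</td><td>√</td><td>XP_049218</td><td>ENSG00000133812</td><td>AY234241 20-MAR-2003</td><td>SBF2</td></tr>
      <tr><td>_</td><td>√</td><td>_</td><td>_</td><td>AY191416 22-JAN-2003</td><td>Zygote arrest-1</td></tr>
      <tr><td>BC041376 24-DEC-2002</td><td>√</td><td>XP_171208</td><td>ENSG00000172159</td><td>AY1377 03-MAR-2003</td><td>4.10 gene</td></tr>
      <tr><td>_</td><td>√</td><td>XP_072027</td><td>_</td><td>AY101377 01-MAR-2003</td><td>Testicular OR-4 associated</td></tr>
      </table>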
  16. 16. Human Proteome Sampling by MS/MS Identification: A Paucity of Novel Genes <ul><li>3778 from plasma (Muthusamy et al. 2005) </li></ul><ul><li>2486 from liver cells (Yan et al. 2006) </li></ul><ul><li>615 from human heart mitochondria (Taylor et al. 2003) </li></ul><ul><li>500 from breast cancer cell membranes (Adams et al. 2003) </li></ul><ul><li>491 from microsomal fractions (Han et al. 2001) </li></ul><ul><li>311 from the spliceosome (Rappsilber et al. 2002) </li></ul><ul><li>No verifiable data on gene prediction confirmation </li></ul><ul><li>One novel gene was reported from a genome-only peptide match by Kuster et al. in 2001, but it appeared from a high-throughput project later in the same year (Tr Q96DA0) </li></ul><ul><li>While there is no evidence of novel protein discovery, there is a caveat on the available search space </li></ul>
  17. 17. Conclusions <ul><li>The model eukaryotes have shown no significant post-genomic rises in gene number </li></ul><ul><li>The Ensembl gene number has been essentially flat since 2001 </li></ul><ul><li>A set of ~2,000 predicted genes still eludes experimental verification, or may not be real </li></ul><ul><li>Putative genes from curated chromosomes could raise protein numbers but the status of this class of transcripts is in doubt </li></ul><ul><li>Early over-estimates explicable by non-ORF transcription </li></ul><ul><li>Post-genomic transcript coverage is predominantly re-sampling known genes </li></ul><ul><li>Database submissions of novel human genes have slowed to a trickle </li></ul><ul><li>No evidence for large numbers of cryptic smORFs </li></ul><ul><li>Proteomics has not revealed new proteins </li></ul><ul><li>Maybe Swiss-Prot can pop the champagne corks when HPI hits 20,000? </li></ul>
  18. 18. Updates <ul><li>October 2004: Nature paper on the finished human genome, “20-25,000 protein-coding genes” </li></ul><ul><li>December 2005: Nature paper, “The dog gene count (19,300) is substantially lower than the 22,000-gene models in the current human gene catalogue (EnsEMBL build 26). For many predicted human genes, we find no convincing evidence of a corresponding dog gene. Much of the excess in the human gene count is attributable to spurious gene predictions in the human genome (M. Clamp, personal communication).” </li></ul><ul><li>March 2006: Ensembl 23,701 </li></ul><ul><li>June 2006: Swiss-Prot HPI 14,445 </li></ul>
  19. 19. Acknowledgments and Reference <ul><li>Paul Kersey of the EBI for IPI figures </li></ul><ul><li>Lucas Wagner of the NCBI for the retrospective UniGene data </li></ul><ul><li>Numerous other people at NCBI, EBI, Swiss-Prot and Sanger Centre who graciously answered queries on their data collections </li></ul><ul><li>The Oxford Glycosciences Proteome Discovery Team </li></ul><ul><li>Southan C. Has the Yo-yo stopped? An assessment of human protein-coding gene number (2004) Proteomics (6):1712-26. PMID: 15174140 </li></ul>