The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number




Presented at: In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza, Brazil, August 2006




Presentation Transcript

  • The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number. In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza, Brazil, August 2006. Christopher Southan, Molecular Pharmacology, AstraZeneca R&D, Mölndal
  • Presentation Outline
    • The importance of gene number
    • Gene definition and detection
    • Genome inflation arguments
    • Post-completion changes in model eukaryotes
    • Ensembl pipeline numbers
    • The smORF question
    • Completed chromosomes
    • International Protein Index
    • Novel gene skimming
    • Updates
    • Conclusions
  • So Who Cares About Human Protein Coding Gene Number?
    • Central to evolutionary questions of gene number expansion vs. protein diversity from alternative splicing and post-translational modifications
    • Mammalian gene totals expected to be similar but clade-specific genes may be important for speciation
    • Accurate ORF delineation is essential for genetic association studies and transcript profiling
    • MS-based proteomics needs a complete ORFome for the peptide and protein identification search space
    • For Pharma and Biotech the numbers set finite limits for potential drug targets and therapeutic proteins
    • The Swiss-Prot Human Proteomics Initiative (HPI) team
  • Definitions
    • The basal (unspliced) protein-coding gene number: “transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation”
    • However, the Guidelines for Human Gene Nomenclature define a gene as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology"
    • The increasing complexity of the transcriptome makes the wider definition of “gene” more difficult, e.g. micro and antisense RNA
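As an illustration of the basal definition above, the sketch below collapses transcripts into loci by merging those that overlap on the same chromosome and strand. The `Transcript` tuple, the `count_basal_loci` function and all coordinates are hypothetical, and coordinate overlap is used only as a stand-in for "overlapping sequence identity"; real annotation pipelines rely on additional evidence.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def count_basal_loci(transcripts):
    """Count loci after merging overlapping transcripts per chromosome and strand."""
    by_key = defaultdict(list)
    for t in transcripts:
        by_key[(t.chrom, t.strand)].append((t.start, t.end))

    loci = 0
    for intervals in by_key.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end:
                loci += 1                            # no overlap: a new locus begins
                current_end = end
            else:
                current_end = max(current_end, end)  # overlap: extend the current locus
    return loci


# Hypothetical data: two splice variants of one gene, one antisense
# transcript on the opposite strand, and one separate gene.
example = [
    Transcript("geneA_v1", "chr1", "+", 100, 900),
    Transcript("geneA_v2", "chr1", "+", 400, 1200),
    Transcript("geneB", "chr1", "-", 500, 800),
    Transcript("geneC", "chr1", "+", 5000, 6000),
]
print(count_basal_loci(example))
```

In this toy set the two splice variants collapse into one locus, while the antisense transcript counts separately because it lies on the opposite strand.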
  • Identifying Protein Coding Genes
    • In silico
    • Detection of protein identity in genomic DNA
    • Gene prediction with protein similarity support
    • Matches with ESTs that include ORFs and/or splice sites
    • Cross-species comparisons for orthologous exon detection
    • Presence of gene anatomy features e.g. CpG islands, promoters, transcription start sites, polyadenylation signals
    • Absence of pseudogene disablements or repeat elements
    • In vitro
    • Cloning of predicted genes
    • Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
    • Loss-of-function approaches
    • High-throughput transcript sampling by EST, MPSS or SAGE tags
    • Heterologous expression of cDNAs
    • Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing
  • Historical Arguments and Estimates for High Gene Numbers
    • Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates
    • Gene prediction programs have a significant false-negative rate
    • The Ensembl gene annotation pipeline is conservative
    • Mammalian protein and transcript coverage is incomplete
    • Chromosome annotation teams find more genes than automated pipelines
    • Selective transcript skimming experiments have revealed new genes
    • Extensive mammalian genomic sequence conservation outside known exons
    • Postulated large numbers of undetected small proteins (“smORFs” or “dark matter”)
    • EST clustering and commercial “gene inflation” claims
    [Figure: Genesweep 2000 estimates; literature estimates]
  • Model Eukaryotes: No Significant Post-Completion Gene Increases
    • S.pombe: 3% increase since 2002
    • S.cerevisiae: 8% decrease since 1997
    • C.elegans: 5% increase since 1998
    • D.melanogaster: 0.2% increase since 2001
    • Little increase in spite of global functional genomics focus
  • Human Transcripts: Post-genomic mRNA Growth in UniGene
    • Rapid growth in redundant mRNA
    • But slow growth in the clustered set: ~9,000 over 2 years, with a plateau at ~28,000
    • Includes splice variants and some spurious ORFs
  • Ensembl Human Gene Number
    • Only 22,218 genes, a decrease of 1,826 over 4 years
    • Knowns: up from 90% to 95%
    • Novel genes: down from 12,398 to 2,263
    • Exons per gene: up from 6.5 to 9.6
    • Alternative splicing: up from 3,669 to 8,078
  • Addressing the smORF Question: Protein Size Distributions in Human SPTr
    • Pre Oct-01: 6.3% > 100aa
    • Post Oct-01: 5.5% > 100aa
    • “Novel” in title: 3.4% > 100aa
  • Summarising the smORF Question
    • The “triple postulate”, i.e. a combination of gene prediction failure, no homology and absence of transcription data, seems unlikely
    • No database evidence for increased smORF discovery in mammals
    • The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
    • Although small proteins evolve more rapidly, there is no precedent for complete loss of ortholog similarity signal
    • Proteins much shorter than 100 residues fall below the threshold needed to fold into the domain structures required for biological function
    • No evidence for de-novo gene “invention” in higher eukaryotes
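The size argument depends on how many database entries fall on the short side of a roughly 100-residue cutoff. A minimal sketch of that tally follows, assuming protein sequences are available as FASTA text; the function names, the toy sequences and the strict `< 100` comparison are illustrative assumptions, not the method used for the slide figures.

```python
def parse_fasta_lengths(fasta_text):
    """Return the sequence length of each FASTA record, in file order."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0                # a header line starts a new record
        elif current is not None:
            current += len(line)       # sequence may wrap over several lines
    if current is not None:
        lengths.append(current)
    return lengths


def fraction_below(lengths, threshold=100):
    """Fraction of sequences strictly shorter than `threshold` residues."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < threshold) / len(lengths)


# Toy input (hypothetical sequences): lengths 5, 120 and 150 residues.
toy = (
    ">p1\nMKTAY\n"
    ">p2\n" + "A" * 60 + "\n" + "A" * 60 + "\n"
    ">p3\n" + "G" * 150 + "\n"
)
toy_lengths = parse_fasta_lengths(toy)
print(toy_lengths, f"{fraction_below(toy_lengths):.1%}")
```

Running the same tally on database releases from before and after a given date would give the kind of comparison summarised in the size-distribution slide.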
  • Release History of the International Protein Index: Only Slow Increases in the Non-redundant Protein Sets 56537 Entries
  • Experimental Transcript Skimming as Evidence for High Protein Numbers
    • Exon arrays (Dunham et al. 1999)
    • Gene arrays (Penn et al. 2000)
    • RT-PCR (Das et al. 2001)
    • SAGE tags (Saha et al. 2002, Chen et al. 2002)
    • Oligo tiling of chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004)
    • A full-length ORF with the features of gene anatomy must be submitted to the public databases before the discovery of novel proteins can be claimed; none of these publications submitted any
    • There is increasing evidence for significant amounts of non-ORF transcription in human and mouse
  • Gene Numbers for Individual Completed Chromosomes
    • Averaged across the completed chromosomes, annotated gene counts exceed Ensembl's by ~12%
    • Extrapolates to ~25,000 genes without “novel transcripts” or “putatives”
    • Extrapolates to ~28,000 genes without “putatives”
    • Extrapolates to ~31,000 genes with “putatives”
    • The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7)
    • Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
    • Future status of novel transcripts and putative genes unclear – most will be non-coding
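The extrapolation on this slide can be checked with simple arithmetic. The sketch below assumes the baseline is the 22,218 Ensembl genes quoted on the Ensembl slide; the deck does not state which release was used, so that baseline and the function names are assumptions. The excess fractions for the ~28,000 and ~31,000 totals are back-calculated here and are not source figures.

```python
ENSEMBL_GENES = 22_218  # assumed baseline, taken from the Ensembl slide


def extrapolate(baseline, excess_fraction):
    """Genome-wide total if curated chromosomes exceed the baseline by this fraction."""
    return round(baseline * (1 + excess_fraction))


def implied_excess(baseline, total):
    """Excess over the baseline implied by a quoted genome-wide total."""
    return total / baseline - 1


# ~12% excess without novel transcripts or putatives
print(extrapolate(ENSEMBL_GENES, 0.12))

# Excess implied by the two larger totals on the slide
for total in (28_000, 31_000):
    print(total, f"{implied_excess(ENSEMBL_GENES, total):.0%}")
```

Under this assumed baseline, a 12% excess gives a total just under 25,000, in line with the first extrapolation on the slide.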
  • Disappearing Novelty
    • EMBL hum cds March 2003 = 1491
    • Plus “novel” = 159
    • Plus PubMed 2003 = 120
    • Novel in title = 11
    • Previous cds = 8
    • Novel genes = 2
    • Now both in RefSeq and Ens 18.34

    Name | Accession and date | Ens 31 | NCBI 31 | EST | Earlier sequence
    Dymeclin | BK000950 26-FEB-2003 | ENSG00000141627 | NP_060123 | √ | AK091256 15-JUL-2002
    Zinc finger ZZaPK | AY184389 28-JAN-2003 | ENSG00000075407 | XP_166119 | √ | AF063599 02-JAN-2001
    HGAL-IL4 inducible | AF521911 14-JAN-2003 | ENSG00000174500 | NP_689998 | √ | BC030506 20-MAY-2002
    Taxilin | AF516206 17-FEB-2003 | ENSG00000084652 | NP_787048 | √ | L15344 25-MAY-1995
    Ligand-gated channel subunit | AF512521 12-JAN-2003 | _ | _ | _ | _
    Diabetes related ankyrin repeat | AF492401 18-JAN-2003 | ENSG00000163126 | _ | √ | AK092564 15-JUL-2002
    CAGE-1 | AF414185 27-FEB-2003 | ENSG00000164304 | NP_786887 | √ | BC026194 09-APR-2002
    SBF2 | AY234241 20-MAR-2003 | ENSG00000133812 | XP_049218 | √ | AB051553 07-FEB-2001
    Zygote arrest-1 | AY191416 22-JAN-2003 | _ | _ | √ | _
    4.10 gene | AY1377 03-MAR-2003 | ENSG00000172159 | XP_171208 | √ | BC041376 24-DEC-2002
    Testicular OR-4 associated | AY101377 01-MAR-2003 | _ | XP_072027 | √ | _
  • Human Proteome Sampling by MS/MS Identification: A Paucity of Novel Genes
    • 3778 from plasma (Muthusamy et al. 2005)
    • 2486 from liver cells (Yan et al. 2006)
    • 615 from the human heart mitochondria (Taylor et al. 2003)
    • 500 from breast cancer cell membranes (Adams et al. 2003)
    • 491 from microsomal fractions (Han et al. 2001)
    • 311 from the spliceosome (Rappsilber et al. 2002)
    • No verifiable data on gene prediction confirmation
    • One novel gene was reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
    • While there is no evidence of novel protein discovery, there is a caveat on the available search space
  • Conclusions
    • The model eukaryotes have shown no significant post-genomic rises in gene number
    • The Ensembl gene number has been essentially flat since 2001
    • There is a set of ~2,000 predicted genes that still elude experimental verification and may not be real
    • Putative genes from curated chromosomes could raise protein numbers but the status of this class of transcripts is in doubt
    • Early over-estimates explicable by non-ORF transcription
    • Post-genomic transcript coverage is predominantly re-sampling known genes
    • Database submissions of novel human genes have slowed to a trickle
    • No evidence for large numbers of cryptic smORFs
    • Proteomics has not revealed new proteins
    • Maybe Swiss-Prot can pop the champagne corks when HPI hits 20,000?
  • Updates
    • October 2004 Nature paper on finished human genome “20-25,000 protein-coding genes”
    • December 2005 Nature paper “The dog gene count (19,300) is substantially lower than the 22,000-gene models in the current human gene catalogue (EnsEMBL build 26). For many predicted human genes, we find no convincing evidence of a corresponding dog gene. Much of the excess in the human gene count is attributable to spurious gene predictions in the human genome (M. Clamp, personal communication).”
    • March 2006 Ensembl 23,701
    • June 2006 Swiss-Prot HPI 14,445
  • Acknowledgments and Reference
    • Paul Kersey of the EBI for IPI figures
    • Lucas Wagner of the NCBI for the retrospective UniGene data
    • Numerous other people at NCBI, EBI, Swiss-Prot and Sanger Centre who graciously answered queries on their data collections
    • The Oxford Glycosciences Proteome Discovery Team
    • Southan C. Has the yo-yo stopped? An assessment of human protein-coding gene number (2004) Proteomics 4(6):1712-26. PMID: 15174140