The Yoyo Has Stopped:  Reviewing the Evidence for a Low Basal Human Protein Number


Presented at: In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot Fortaleza, Brazil, August 2006

Published in: Technology
  • 1. The Yoyo Has Stopped: Reviewing the Evidence for a Low Basal Human Protein Number. In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza, Brazil, August 2006. Christopher Southan, Molecular Pharmacology, AstraZeneca R&D, Mölndal
  • 2. Presentation Outline
    • The importance of gene number
    • Gene definition and detection
    • Genome inflation arguments
    • Post-completion changes in model eukaryotes
    • Ensembl pipeline numbers
    • The smORF question
    • Completed chromosomes
    • International Protein Index
    • Novel gene skimming
    • Updates
    • Conclusions
  • 3. So Who Cares About Human Protein Coding Gene Number?
    • Central to evolutionary questions of gene number expansion vs. protein diversity from alternative splicing and post-translational modifications
    • Mammalian gene totals expected to be similar but clade-specific genes may be important for speciation
    • Accurate ORF delineation essential for genetic association studies and transcript profiling
    • MS-based proteomics needs a complete ORFome for the peptide and protein identification search space
    • For Pharma and Biotech the numbers set finite limits for potential drug targets and therapeutic proteins
    • The Swiss-Prot Human Proteomics Initiative (HPI) team
  • 4. Definitions
    • The basal (unspliced) protein-coding gene number: “transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation”
    • However, the Guidelines for Human Gene Nomenclature define a gene as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology"
    • The increasing complexity of the transcriptome makes the wider definition of “gene” more difficult, e.g. micro and antisense RNA
  • 5. Identifying Protein Coding Genes
    • In silico
    • Detection of protein identity in genomic DNA
    • Gene prediction with protein similarity support
    • Matches with ESTs that include ORFs and/or splice sites
    • Cross-species comparisons for orthologous exon detection
    • Presence of gene anatomy features e.g. CpG islands, promoters, transcription start sites, polyadenylation signals
    • Absence of pseudogene disablements or repeat elements
    • In vitro
    • Cloning of predicted genes
    • Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
    • Loss-of-function approaches
    • High-throughput transcript sampling by EST, MPSS or SAGE tags
    • Heterologous expression of cDNAs
    • Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing
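
The in silico criteria above rest on being able to delineate an ORF in nucleotide sequence. As a minimal sketch only, and not the method used by Ensembl or any other pipeline named in these slides, the following Python scans the three forward reading frames for ATG-to-stop ORFs. The function name `find_orfs` and the default cut-off of 100 codons are illustrative assumptions; real gene finders also handle the reverse strand, splicing and similarity evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    `start` is the index of the A in ATG, `end` is one past the stop codon,
    and `min_codons` counts codons from ATG up to but excluding the stop.
    The 100-codon default is an illustrative assumption.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk this frame one codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=3))
```

With the default threshold the toy sequence yields no ORF, which illustrates why a length cut-off on its own would miss any genuinely short protein.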
  • 6. Historical Arguments and Estimates for High Gene Numbers
    • Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates
    • Gene prediction programs have a significant false-negative rate
    • The Ensembl gene annotation pipeline is conservative
    • Mammalian protein and transcript coverage is incomplete
    • Chromosome annotation teams find more genes than automated pipelines
    • Selective transcript skimming experiments have revealed new genes
    • Extensive mammalian genomic sequence conservation outside known exons
    • Postulated large numbers of undetected small proteins (“smORFs” or “dark matter”)
    • EST clustering and commercial “gene inflation” claims
    • [Figure: Genesweep 2000 and literature estimates]
  • 7. Model Eukaryotes: No Significant Post-Completion Gene Increases
    • S.pombe: 3% increase since 2002
    • S.cerevisiae: 8% decrease since 1997
    • C.elegans: 5% increase since 1998
    • D.melanogaster: 0.2% increase since 2001
    • Little increase in spite of global functional genomics focus
  • 8. Human Transcripts: Post-genomic mRNA Growth in UniGene
    • Rapid growth in redundant mRNA
    • But slow growth in clustered set ~ 9,000 over 2 years with plateau ~ 28,000
    • Includes splice variants and some spurious ORFs
  • 9. Ensembl Human Gene Number
    • Only 22,218 genes, a decrease of 1,826 over 4 years
    • Knowns: from 90% to 95%
    • Novel genes: from 12,398 to 2,263
    • Exons-per-gene: from 6.5 to 9.6
    • Alternative splicing: from 3,669 to 8,078
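
The size of these shifts is easier to compare as percentages. The sketch below uses only the figures on this slide; the helper name `pct_change` is hypothetical, and the earlier gene total is not stated on the slide but inferred by adding the reported decrease back onto the current count.

```python
def pct_change(old, new):
    """Percentage change from old to new, rounded to one decimal place."""
    return round(100.0 * (new - old) / old, 1)

current_genes = 22218
decrease = 1826
# Assumption: the count four years earlier is the current count plus the decrease.
earlier_genes = current_genes + decrease

changes = {
    "total genes": pct_change(earlier_genes, current_genes),
    "novel genes": pct_change(12398, 2263),
    "exons per gene": pct_change(6.5, 9.6),
    "alternative splicing": pct_change(3669, 8078),
}

if __name__ == "__main__":
    for label, pct in changes.items():
        print(f"{label}: {pct:+.1f}%")
```

On these numbers the total falls by under a tenth while the “novel” class shrinks by more than four fifths, so most of the change is reclassification rather than loss.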
  • 10. Addressing the smORF Question: Protein Size Distributions in Human SPTr
    • Pre Oct-01: 6.3% > 100aa
    • Post Oct-01: 5.5% > 100aa
    • “Novel” in title: 3.4% > 100aa
  • 11. Summarising the smORF Question
    • The “triple postulate”, i.e. a combination of gene prediction failure, no homology and absence of transcription data, seems unlikely
    • No database evidence for increased smORF discovery in mammals
    • The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
    • Although small proteins evolve more rapidly, there is no precedent for complete loss of ortholog similarity signal
    • Those much shorter than 100 residues will fall below the size threshold needed to fold into the domain structures required for biological function
    • No evidence for de-novo gene “invention” in higher eukaryotes
  • 12. Release History of the International Protein Index: Only Slow Increases in the Non-redundant Protein Sets
    • 56,537 entries
  • 13. Experimental Transcript Skimming as Evidence for High Protein Numbers
    • Exon arrays (Dunham et al. 1999)
    • Gene arrays (Penn et al. 2000)
    • RT-PCR (Das et al. 2001)
    • SAGE-tags (Saha et al. 2002, Chen et al. 2002)
    • Oligo tiling of chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004)
    • Necessary to submit a full length ORF with the features of gene anatomy to the public databases before the discovery of novel proteins can be claimed – none of these publications submitted any
    • There is increasing evidence for significant amounts of non-ORF transcription in human and mouse
  • 14. Gene Numbers for Individual Completed Chromosomes
    • Averaging the completed chromosomes exceeds Ensembl genes by ~12%
    • Extrapolates to ~ 25,000 genes without “novel transcripts” or “putatives”
    • Extrapolates to ~ 28,000 genes without “putatives”
    • Extrapolates to ~ 31,000 genes with “putatives” and repeat elements
    • The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7)
    • Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
    • Future status of novel transcripts and putative genes unclear – most will be non-coding
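
The extrapolations on this slide can be checked with simple arithmetic. The sketch below assumes the baseline is the 22,218 Ensembl gene count quoted on slide 9; the slides do not state which Ensembl release the extrapolation actually used, so treat the baseline and both helper names as assumptions.

```python
def extrapolate(pipeline_genes, excess_fraction):
    """Scale a pipeline gene count by the fractional excess seen on curated chromosomes."""
    return round(pipeline_genes * (1.0 + excess_fraction))

def implied_excess(pipeline_genes, target_genes):
    """Percentage by which a target gene count exceeds the pipeline count."""
    return round(100.0 * (target_genes / pipeline_genes - 1.0), 1)

# Assumption: baseline taken from the Ensembl figure on slide 9.
ENSEMBL_GENES = 22218

if __name__ == "__main__":
    print("12% excess gives", extrapolate(ENSEMBL_GENES, 0.12), "genes")
    for target in (25000, 28000, 31000):
        print(target, "implies an excess of", implied_excess(ENSEMBL_GENES, target), "%")
```

Under that assumption a 12% excess lands close to the ~25,000 figure, while the ~28,000 and ~31,000 totals require much larger excesses, which depend on how the “novel transcript” and “putative” classes are counted.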
  • 15. Disappearing Novelty
    • EMBL hum cds March 2003 = 1491
    • Plus “novel” = 159
    • Plus PubMed 2003 = 120
    • Novel in title = 11
    • Previous cds = 8
    • Novel genes = 2
    • Now both in RefSeq and Ens 18.34
    Name | Accession and date | Ens 31 | NCBI 31 | EST | Earlier sequence
    Dymeclin | BK000950 26-FEB-2003 | ENSG00000141627 | NP_060123 | √ | AK091256 15-JUL-2002
    Zinc finger ZZaPK | AY184389 28-JAN-2003 | ENSG00000075407 | XP_166119 | √ | AF063599 02-JAN-2001
    HGAL-IL4 inducible | AF521911 14-JAN-2003 | ENSG00000174500 | NP_689998 | √ | BC030506 20-MAY-2002
    Taxilin | AF516206 17-FEB-2003 | ENSG00000084652 | NP_787048 | √ | L15344 25-MAY-1995
    Ligand-gated channel subunit | AF512521 12-JAN-2003 | _ | _ | _ | _
    Diabetes related ankyrin repeat | AF492401 18-JAN-2003 | ENSG00000163126 | _ | √ | AK092564 15-JUL-2002
    CAGE-1 | AF414185 27-FEB-2003 | ENSG00000164304 | NP_786887 | √ | BC026194 09-APR-2002
    SBF2 | AY234241 20-MAR-2003 | ENSG00000133812 | XP_049218 | √ | AB051553 07-FEB-2001
    Zygote arrest-1 | AY191416 22-JAN-2003 | _ | _ | √ | _
    4.10 gene | AY1377 03-MAR-2003 | ENSG00000172159 | XP_171208 | √ | BC041376 24-DEC-2002
    Testicular OR-4 associated | AY101377 01-MAR-2003 | _ | XP_072027 | √ | _
  • 16. Human Proteome Sampling by MS/MS Identification: A Paucity of Novel Genes
    • 3,778 from plasma (Muthusamy et al. 2005)
    • 2,486 from liver cells (Yan et al. 2006)
    • 615 from human heart mitochondria (Taylor et al. 2003)
    • 500 from breast cancer cell membranes (Adams et al. 2003)
    • 491 from microsomal fractions (Han et al. 2001)
    • 311 from the spliceosome (Rappsilber et al. 2002)
    • No verifiable data on gene prediction confirmation
    • One novel gene was reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
    • While there is no evidence of novel protein discovery, there is a caveat on the available search space
  • 17. Conclusions
    • The model eukaryotes have shown no significant post-genomic rises in gene number
    • The Ensembl gene number has been essentially flat since 2001
    • A set of ~2,000 predicted genes still eludes experimental verification and may not be real
    • Putative genes from curated chromosomes could raise protein numbers but the status of this class of transcripts is in doubt
    • Early over-estimates explicable by non-ORF transcription
    • Post-genomic transcript coverage is predominantly re-sampling known genes
    • Database submissions of novel human genes have slowed to a trickle
    • No evidence for large numbers of cryptic smORFs
    • Proteomics has not revealed new proteins
    • Maybe Swiss-Prot can pop the champagne corks when HPI hits 20,000?
  • 18. Updates
    • October 2004 Nature paper on finished human genome “20-25,000 protein-coding genes”
    • December 2005 Nature paper “The dog gene count (19,300) is substantially lower than the 22,000-gene models in the current human gene catalogue (EnsEMBL build 26). For many predicted human genes, we find no convincing evidence of a corresponding dog gene. Much of the excess in the human gene count is attributable to spurious gene predictions in the human genome (M. Clamp, personal communication).”
    • March 2006 Ensembl 23,701
    • June 2006 Swiss-Prot HPI 14,445
  • 19. Acknowledgments and Reference
    • Paul Kersey of the EBI for IPI figures
    • Lucas Wagner of the NCBI for the retrospective UniGene data
    • Numerous other people at NCBI, EBI, Swiss-Prot and Sanger Centre who graciously answered queries on their data collections
    • The Oxford Glycosciences Proteome Discovery Team
    • Southan C. Has the yo-yo stopped? An assessment of human protein-coding gene number (2004) Proteomics 4(6):1712-26. PMID: 15174140