http://lectures.gersteinlab.org/ppt/Gencode-winter08-20090121-pseudogenes/Gencode-winter08-20090121-pseudogenes
Yale Pseudogene Analysis as part of GENCODE Project
Sanger Center 2009.01.20, 11:20-11:40
Date Given: 01/21/2009

Published in: Education, Technology

Transcript

  • 1. Yale Pseudogene Analysis as part of GENCODE Project
    • Sanger Center, 2009.01.20, 11:20-11:40
    • Mark B Gerstein, Yale
    • (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
    • Illustration from Gerstein & Zheng (2006), Sci Am.
  • 2. Overview
    • Outline
      • Flow
      • Additional Pseudogene Annotation
      • Pseudogene Sets
    • Regular conference calls
      • Sanger / Havana: Jenn, Adam
      • UCSC: Rachel, Mark
      • Discuss difficult pseudogene cases, pipeline improvements and additional pseudogene annotation
    • Questions for you:
      • Consistent labels?
      • Pfam/Ensembl usage?
      • Transcribed set?
      • Consensus set?
  • 3. Overall Flow: Pipeline Runs, Coherent Sets, Annotation, Transfer to Sanger
    • Overall Approach
      • Overall Pipeline runs at Yale and UCSC, yielding raw pseudogenes
      • Extraction of coherent subsets for further analysis and annotation
      • Passing to Sanger for detailed manual analysis and curation
      • Incorporation into final GENCODE annotation
      • Pipeline modification
    • Chronology of Sets
      • Unitary pseudogenes
      • Ribosomal Protein pseudogenes
        • (Hard then easy)
      • Transcribed pseudogenes annotated earlier
      • Strong overlap consensus pseudogenes
      • Pseudogenes associated with SDs
      • GAPDH pseudogenes
  • 4. Additional Annotation on Pseudogenes: Consistent Labeling (Ontology)
  • 5. Upper Ontology and Domain Ontology (diagram) [Lam et al., NAR DB Issue (in press, '09)]
    • Upper ontology nodes: Gene, mRNA, Protein, Protein family, Pseudogene, Pseudogene family, Sequence Region, Segmental Duplication, Syntenic Region, etc.
    • Relations: duplicated_from, mutated_from, retrotranscribed_from, transcribed_into, translated_into, has_parent_protein, has_pseudogene_family, derives_from, contained_in, type_of, is_a
    • Domain ontology, pseudogene types: Duplicated, Unitary, Transcribed, Regulatory, Processed, …
  • 6. Domain Ontology (diagram) [Lam et al., NAR DB Issue (in press, '09)]
    • Diagram labels, in slide order: Pseudogene Duplicated Unitary Transcribed Regulatory Processed Classified Type Semi Processed Duplicated Processed Polymorphic Feature Evidence Sequence Homology Intra-Genome Homology Cross-Genome Homology Recognition Feature Disablement Premature Stop Codon Regulatory Element Lost Frameshift Pseudo-PolyA Tail Nucleus Mitochondria Origin
    • Relation types: has a, is a
    • Note on diagram: *Unprocessed Proposed HAVANA
  • 7. Collections to provide further annotation
    • Example identifiers: OTTHUMG00000000445, ENSP00000384933.frag.301873, BC067222.1-28, urn:lsid:pseudogene.org:9606.Pseudogene:27570, urn:lsid:pseudogene.org:9606.Pseudogene:27579, urn:lsid:pseudogene.org:9606.Pseudogene:29
    • Collection labels: Human Pseudogenes, HAVANA, Yale, UCSC, Ribosomal Protein, Transcribed
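The pseudogene.org identifiers on this slide follow the general LSID URN layout (`urn:lsid:authority:namespace:object`). A minimal sketch of splitting one into its parts, assuming the namespace always has the form `<NCBI taxonomy id>.<entity type>` as in the examples shown (9606 is human). The function and field names are hypothetical and are not part of any pseudogene.org API.

```python
from typing import NamedTuple


class PseudogeneLsid(NamedTuple):
    authority: str    # e.g. "pseudogene.org"
    taxon_id: int     # NCBI taxonomy id, e.g. 9606 for human
    entity_type: str  # e.g. "Pseudogene"
    object_id: int    # numeric id within the namespace


def parse_lsid(lsid: str) -> PseudogeneLsid:
    """Split 'urn:lsid:<authority>:<taxon>.<type>:<id>' into typed fields."""
    parts = lsid.split(":")
    if len(parts) != 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a 5-part LSID: {lsid!r}")
    authority, namespace, obj = parts[2], parts[3], parts[4]
    # Assumption: the namespace is '<taxon id>.<entity type>'.
    taxon, _, entity_type = namespace.partition(".")
    return PseudogeneLsid(authority, int(taxon), entity_type, int(obj))


example = parse_lsid("urn:lsid:pseudogene.org:9606.Pseudogene:27570")
short_example = parse_lsid("urn:lsid:pseudogene.org:9606.Pseudogene:29")
```

Typed fields make it straightforward to join annotations from different collections on the same pseudogene ID.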
  • 8. Additional Annotation on Pseudogenes: Families & Histories
  • 9. tables.pseudogene.org: Flat Files, Table Browser, DAS, UCSC Genome Browser
    • 12 eukaryotic species
      • Human, mouse, rat, chimp…
      • 100,052 pseudogenes
    • 64 prokaryotic species
      • 6,412 pseudogenes
    • 28,237 human pseudogenes
      • 13+ unique human sets
    SVN, single pgene query [Karro et al., NAR ('07)]
  • 10. Pseudofam Database
    • Data Sources
      • Ensembl, Pfam, BioMart
      • Pseudopipe
    • Highlighted Features
      • Browse families
        • Statistics
          • Enrichment
          • P-value, etc
      • Search families
        • Pfam ID/Acc
        • Ensembl ID
        • Pseudogene ID
      • Correlate families
        • Genes
        • Pseudogenes
        • Parents
    [Lam et al., NAR DB Issue (in press, '09)]
  • 11. Pseudofam Construction
    • Data Generation
      • Identify pseudogenes by proteins
      • Map parent proteins to protein families
      • Assign pseudogenes to their parent families
      • Align the pseudogenes in pseudogene families
      • Calculate the key statistics and organize the data into database
    • Alignment
      • Align pseudogene to parent
      • Transfer alignment from Pfam
      • Combine and adjust the alignments to build the pseudofam alignment
    [Lam et al., NAR DB Issue (in press, '09)]
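The data-generation steps above (pseudogene to parent protein, parent protein to protein family, pseudogene to family) can be sketched as a join between two lookup tables. This is an illustrative sketch only: the table shapes, the function name, and all identifiers are made-up placeholders, not the actual Pseudofam schema or code.

```python
from collections import defaultdict


def assign_pseudogenes_to_families(parent_of, families_of):
    """Group pseudogenes by the protein families of their parent proteins.

    parent_of:   {pseudogene_id: parent_protein_id}
    families_of: {protein_id: [family_id, ...]}
    Pseudogenes whose parent has no family assignment are left out.
    """
    families = defaultdict(set)
    for pgene, parent in parent_of.items():
        for family in families_of.get(parent, ()):
            families[family].add(pgene)
    return dict(families)


# Hypothetical toy data.
parent_of = {"pg1": "protA", "pg2": "protA", "pg3": "protB", "pg4": "protC"}
families_of = {"protA": ["PF_X"], "protB": ["PF_X", "PF_Y"]}
pseudofams = assign_pseudogenes_to_families(parent_of, families_of)
```

A parent with several family assignments places its pseudogenes in each of them, which is one possible reading of the "assign pseudogenes to their parent families" step.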
  • 12. Pseudofam Statistics: Enrichment of pseudogenes within a family ("Living vs Dead")
    • Panels: Total (10 Eukaryotes), Human
    [Lam et al., NAR DB Issue (in press, '09)]
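The slides mention enrichment and a P-value but do not say how either is computed. One common way to quantify "living vs dead" for a family is a pseudogene-to-gene ratio relative to the genome-wide ratio, with a one-sided hypergeometric tail as the P-value. The sketch below assumes that formulation; it is not necessarily what Pseudofam uses, and the counts are toy numbers.

```python
from math import comb  # Python 3.8+


def enrichment_ratio(fam_pgenes, fam_genes, all_pgenes, all_genes):
    """(pseudogenes per gene in the family) / (pseudogenes per gene overall)."""
    return (fam_pgenes / fam_genes) / (all_pgenes / all_genes)


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are 'successes' (here: pseudogenes)."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


# Toy numbers: a family with 20 pseudogenes and 10 genes,
# in a genome with 100 pseudogenes and 200 genes.
ratio = enrichment_ratio(20, 10, 100, 200)
p_value = hypergeom_upper_tail(k=20, K=100, n=30, N=300)
```

Here the "population" is all genes plus pseudogenes, and the "draw" is the membership of one family.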
  • 13. Pseudogene Set #2: RP pseudogenes
  • 14. Human Ribosomal Proteins (RP) [Zhang et al., Genome Research, 2002]
    • 79 functional RP genes
    Nakao A, Yoshihama M, Kenmochi N: RPG: the Ribosomal Protein Gene database. Nucleic Acids Res 2004, 32:D168-170.
  • 15. Human Ribosomal Proteins (RP) [Zhang et al., Genome Research, 2002]
    • 79 functional RP genes
    • ~2000 RP pseudogenes
  • 16. Number of RP pseudogenes (identified by pipeline) [Balasubramanian et al., Genome Biol. ('09)]
    • RP pseudogenes constitute the largest family of pseudogenes
    • Almost all are processed; there are ~90 clearly duplicated ones in the human genome
    • Counts by organism:
      • Organism | Processed | Fragments | Low confidence
      • Human | 1822 | 218 | 212
      • Chimp | 1462 | 219 | 160
      • Mouse | 2092 | 326 | 413
      • Rat | 2848 | 343 | 450
  • 17. The number of processed pseudogenes for each human ribosomal protein appears unrelated to expression level or to the corresponding number in mouse [Balasubramanian et al., Genome Biol. ('09)]
  • 18. Using Synteny to Improve Annotation of RP pgenes
    • Synteny is derived from local gene orthology
    [Balasubramanian et al., Genome Biol. ('09)]
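The slide says only that synteny is derived from local gene orthology. One simple reading is that two pseudogenes in different species are syntenic when the genes flanking one are orthologs of the genes flanking the other. The sketch below assumes that criterion. The function name, input shapes, and gene names are hypothetical, and the published method may use a wider gene neighbourhood.

```python
def is_syntenic(flank_a, flank_b, orthologs):
    """Decide whether two pseudogenes share an orthologous gene neighbourhood.

    flank_a, flank_b: (upstream_gene, downstream_gene) around each pseudogene
    orthologs:        set of (gene_in_species_a, gene_in_species_b) pairs
    """
    up_a, down_a = flank_a
    up_b, down_b = flank_b
    same_order = (up_a, up_b) in orthologs and (down_a, down_b) in orthologs
    # Also accept the neighbourhood when the region is inverted in species B.
    inverted = (up_a, down_b) in orthologs and (down_a, up_b) in orthologs
    return same_order or inverted


# Toy orthology table between a "human" and a "mouse" locus.
orthologs = {("GENE1_h", "Gene1_m"), ("GENE2_h", "Gene2_m")}
direct = is_syntenic(("GENE1_h", "GENE2_h"), ("Gene1_m", "Gene2_m"), orthologs)
flipped = is_syntenic(("GENE1_h", "GENE2_h"), ("Gene2_m", "Gene1_m"), orthologs)
unrelated = is_syntenic(("GENE1_h", "GENE2_h"), ("Gene1_m", "Gene3_m"), orthologs)
```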
  • 19. Syntenic processed pseudogenes [Balasubramanian et al., Genome Biol. ('09)]
    • Species1-Species2 | Number of syntenic pgenes
    • Human-chimp | 1282
    • Human-mouse | 6
    • Human-rat | 11
    • Rat-mouse | 394
  • 20. A human-mouse syntenic pgene, likely to be coding [Balasubramanian et al., Genome Biol. ('09)]
    Nakao A, Yoshihama M, Kenmochi N: RPG: the Ribosomal Protein Gene database. Nucleic Acids Res 2004, 32:D168-170.
  • 21. Set #3: Transcribed Pseudogenes Annotated Earlier
  • 22. Harrison '05 set of Transcribed Pseudogenes
    • 233 Transcribed from ~8000 Processed Pseudogenes
    • Evidence for Transcription
      • 8% Refseq mRNAs
      • 32% Unigene consensus sequences
      • 72% dbEST expressed sequence tags
      • 32% Oligonucleotide microarray data (extra support)
    • Highly decayed
      • Fraction with Ka/Ks ≥ 0.5 is 54%
    Harrison et al. (2005) NAR
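The "fraction with Ka/Ks ≥ 0.5" figure is a simple threshold count over per-pseudogene Ka/Ks estimates. A minimal sketch, assuming the Ka/Ks values have already been computed elsewhere; the function name and the toy values are illustrative only.

```python
def fraction_at_or_above(ka_ks_values, threshold=0.5):
    """Fraction of Ka/Ks estimates that are >= threshold."""
    values = list(ka_ks_values)
    if not values:
        raise ValueError("no Ka/Ks values supplied")
    return sum(v >= threshold for v in values) / len(values)


toy_ka_ks = [0.1, 0.5, 0.9, 0.3]  # made-up values, not from Harrison et al.
toy_fraction = fraction_at_or_above(toy_ka_ks)
```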
  • 23. Set #4: Strong Consensus Pseudogenes
  • 24. Consensus Pseudogenes
    • Yale & UCSC agreement
    • Yale, UCSC & HAVANA agreement
    • HAVANA disagreement
  • 25. Overlap
    • Initially, 50-nucleotide overlap
    • Why so low?
    • What about percentage overlap?
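The difference between a fixed 50-nucleotide cutoff and a percentage cutoff can be seen with a small interval calculation. This sketch assumes half-open (start, end) coordinates on the same chromosome and strand, and measures percentage overlap against the shorter of the two calls. Both conventions are assumptions, as are the function names and coordinates.

```python
def overlap_bp(a, b):
    """Number of shared bases between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_fraction(a, b):
    """Overlap as a fraction of the shorter of the two intervals."""
    shorter = min(a[1] - a[0], b[1] - b[0])
    return overlap_bp(a, b) / shorter if shorter > 0 else 0.0


def is_consensus(a, b, min_bp=50, min_fraction=0.0):
    """True when both the absolute and the fractional thresholds are met."""
    return overlap_bp(a, b) >= min_bp and overlap_fraction(a, b) >= min_fraction


# Hypothetical calls from two pipelines for roughly the same locus.
yale_call = (100, 1100)
ucsc_call = (1050, 2000)
shared = overlap_bp(yale_call, ucsc_call)
passes_bp_only = is_consensus(yale_call, ucsc_call)
passes_with_fraction = is_consensus(yale_call, ucsc_call, min_fraction=0.5)
```

Two calls that merely touch at their ends pass the 50-nucleotide rule but fail a 50% rule, which is the concern the slide raises.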
  • 29. High Quality Overlaps?
  • 31. gencode.gersteinlab.org/consensus
    • Django-based
    • Filter by genomic coordinates, source
    • Yale & UCSC Consensus track
    • Flagged pseudogenes
    • Verification status
  • 32. Set #5: SD pseudogenes
  • 33. Segmental Duplications (SDs)
    • SDs are regions of the genome with ≥90% sequence identity and ≥1 kb in length
    • Clues about pseudogene formation from pseudogenes located in SDs, for example, annotation of “duplicated-processed” pseudogenes
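The SD definition on this slide (at least 90% sequence identity over at least 1 kb) can be applied as a filter before asking which pseudogenes lie inside an SD. A minimal sketch, assuming 0-based half-open coordinates and requiring the pseudogene to be fully contained in the SD. The `Region` type, function names, and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int             # 0-based, half-open
    end: int
    identity: float = 1.0  # fractional identity to the paired copy (SD candidates only)


def qualifies_as_sd(region, min_identity=0.90, min_length=1000):
    """Apply the >=90% identity and >=1 kb length thresholds."""
    return region.identity >= min_identity and (region.end - region.start) >= min_length


def pseudogenes_in_sds(pseudogenes, candidates):
    """Names of pseudogenes fully contained in a qualifying SD."""
    sds = [r for r in candidates if qualifies_as_sd(r)]
    return [
        name
        for name, pg in pseudogenes.items()
        if any(sd.chrom == pg.chrom and sd.start <= pg.start and pg.end <= sd.end
               for sd in sds)
    ]


candidates = [
    Region("chr1", 0, 5000, 0.95),       # qualifies
    Region("chr1", 10000, 10500, 0.99),  # too short
    Region("chr2", 0, 3000, 0.85),       # identity too low
]
pseudogenes = {
    "pgA": Region("chr1", 100, 900),
    "pgB": Region("chr1", 10100, 10400),
    "pgC": Region("chr2", 100, 900),
}
in_sd = pseudogenes_in_sds(pseudogenes, candidates)
```

The nested scan is quadratic; for genome-scale sets an interval index would be the usual replacement.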
  • 34. Pseudogene families and Segmental Duplications (SDs)
    • SDs comprise ~5-6% of the human genome but contain ~17.8% of genes, 45.7% of duplicated pgenes and 21.6% of processed pgenes
    • The relative values of the correlation coefficients in the slide's plots are consistent with the observation that SDs contain more pgenes than parent genes
    [Lam et al., NAR DB Issue (in press, '09)]
  • 35. Summary
    • Sets
      • RP, transcribed '05, consensus, SD, ...
    • Additional Annotation
      • families & labels
  • 36. Credits
    • Yale: S Balasubramanian, P Cayting, H Lam, Y Liu, G Fang, N Carriero, R Robilotto, E Khurana
    • Alums: D Zheng, P Harrison, Z Zhang
    • Sanger: A Frankish, J Harrow
    • UCSC: R Harte, M Diekhans
  • 37. Extra
  • 38. Collections
    • Combination of multiple sets
      • Human50, UCSC, GAPDH, Transcribed, Unitary, etc.
    • Also can have attributes
    • View by set or as a whole
  • 39. Pseudofam & Pseudogene.org Integration
    • Diagram labels: PseudoPipe; svn; annotations via shared ID
  • 40. gencode.gersteinlab.org/consensus