BITS - Comparative genomics: gene family analysis

This is the second presentation of the BITS training on 'Comparative genomics'.

It reviews the different methods of investigating sequence homology at the gene family level.

Thanks to Klaas Vandepoele of the PSB department.

Published in: Technology

  1. Comparative genomics in eukaryotes: Gene family analysis
     Klaas Vandepoele, PhD, Professor, Ghent University
     Comparative & Integrative Genomics, VIB – Ghent University, Belgium
     http://www.bits.vib.be
  2. Workflow
  3. Applications of clustering the proteome(s)
     Gene families form the basis for the evolutionary (or phylogenetic) analysis of:
       • Detection of orthologs and paralogs
       • Gene duplication, family expansions, pseudogene formation and gene loss
       • Species taxonomies
       • Horizontal Gene Transfer (HGT)
       • Evolution of gene structure: introns; protein domain organisation & (re)arrangements
       • Base composition and codon usage
  4. I. Structural annotation: genome-wide versus family-wise
       • Rationale for family-wise annotation: every gene has its own sequence characteristics and different genes evolve at different rates, so using these family-specific characteristics to determine homologous gene models improves the overall structural annotation quality
       • Properties: a slow, nearly manual procedure; high-quality gene models that reveal novel biological findings
  5. Workflow of the family-wise annotation procedure
     [Flowchart; box labels: collecting experimental representatives; MSA of experimental representatives; HMMbuild; family HMM profile; BLAST; HMMsearch; species X proteome; putative homologs; EST/cDNA; ab initio gene prediction; protein motifs; correction of gene model; classification using phylogenetic trees; detailed characterization]
     http://hmmer.janelia.org/
  6. Experimental representatives
     [Figure labels: InterProScan; PFAM HMM logo; ClustalW + JalView]
  7. BLAST / HMMsearch
       1. Use a multiple sequence alignment to create an HMM profile
       2. Use the HMM profile to search for similar proteins
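
The two steps on the BLAST / HMMsearch slide can be sketched as HMMER command lines assembled from Python. This is a minimal sketch, not the presenter's pipeline: the file names (`cyclins.sto`, `speciesX_proteome.fasta`, `family.hmm`, `hits.tbl`) and the E-value cutoff are hypothetical, and the commands are only built and printed, so HMMER does not need to be installed to try it.

```python
import shlex


def hmmer_commands(msa_path, proteome_fasta, profile_path="family.hmm",
                   hits_path="hits.tbl", max_evalue=1e-5):
    """Return the command lines for the two steps on the slide."""
    # Step 1: build an HMM profile from the multiple sequence alignment.
    build = ["hmmbuild", profile_path, msa_path]
    # Step 2: search the proteome with the profile. --tblout writes a
    # per-sequence hit table; -E sets the reporting E-value threshold.
    search = ["hmmsearch", "--tblout", hits_path, "-E", str(max_evalue),
              profile_path, proteome_fasta]
    return build, search


# Hypothetical input files, used only for illustration.
build_cmd, search_cmd = hmmer_commands("cyclins.sto", "speciesX_proteome.fasta")
print(shlex.join(build_cmd))
print(shlex.join(search_cmd))
```

To actually run the commands (with HMMER on the PATH), each list can be passed to `subprocess.run(cmd, check=True)`.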
  8. Representatives + putative homologs
     [Screenshot: BioEdit Sequence Editor]
     The suffix "finalcds" indicates a gene model that was corrected relative to the original gene model generated by the ab initio gene prediction.
       • Multiple sequence alignments assist in the detection and correction of errors in the structural annotation (missed exon)
  9. Representatives + putative homologs
     The suffix "finalcds" indicates a gene model that was corrected relative to the original gene model generated by the ab initio gene prediction.
       • Multiple sequence alignments assist in the detection of errors in the structural annotation (false first exon)
  10. Examples of family-specific protein motifs
       • B-type cyclins have an HxKF signature
       • Cyclin destruction boxes (B1-type cyclin: R-[AV]LGDIGN)
  11. Examples of family-specific protein motifs
      [Figure labels: Arabidopsis; Rice]
       • D-type cyclins contain the LxCxE Rb-binding motif
       • Low conservation of phylogenetic signal at the primary sequence level
       • General rules are rarely general: exceptions (i.e. missing protein motifs) are frequent and might indicate functional divergence
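
Short signatures such as HxKF and LxCxE can be located with a regular expression in which `x` stands for any residue. The sketch below assumes that reading of `x`; the protein sequence is made up for illustration and is not a real cyclin. The destruction-box pattern is left out because the meaning of its `-` position is not spelled out on the slide.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'LxCxE' into a regex string ('x' = any residue)."""
    return "".join("." if ch == "x" else re.escape(ch) for ch in motif)


def find_motif(sequence, motif):
    """Return (0-based start, matched text) for every hit, overlapping ones included."""
    # A lookahead lets finditer report overlapping occurrences as well.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Made-up toy sequence, for illustration only.
toy_protein = "MSTLLCDEGHAKFRR"
print(find_motif(toy_protein, "LxCxE"))
print(find_motif(toy_protein, "HxKF"))
```

An empty result is informative too: as the slide notes, a missing motif in a family member may hint at functional divergence, or at an error in the gene model.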
  12. Classification using phylogenetic tree construction
      A- and B-type cyclins are mitotic cyclins; D-type cyclins are G1-specific; H-type cyclins regulate the activity of CDK-activating kinases.
       • The complexity of the cyclin gene family appears to be higher in plants than in mammals
       • Whether there is functional redundancy within A- and B-type cyclins or different regulation (and expression) of some cyclin subclasses remains to be analyzed
  13. Unraveling functional divergence using large-scale expression compendia
      [Figure axes: genes × plant tissues]
  14. Unraveling functional divergence using large-scale expression compendia
      [Figure axes: genes × plant tissues; legend: A-type cyclin, B-type cyclin, D-type cyclin; source: Genevestigator]
  15. II. Orthology & paralogy
       • A major goal of sequence analysis is evolutionary reconstruction. It is critical to distinguish between two principal types of homologous relationships, which differ in their evolutionary history and functional implications.
       • Orthologs are homologous genes that evolved through speciation (evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the two given species).
       • Paralogs are homologous genes that evolved through duplication within the same (perhaps ancestral) genome.
       • These definitions were first introduced by Fitch (1970).
  16. Orthology & paralogy inference
      [Figure: organism phylogeny (species tree) for species A, B and C next to gene phylogenies (panels a and b) for genes a1, a2, b1, b2, c1 and c2; labels: gene duplication, speciation, outparalogs, inparalogs]
  17. In- and outparalogy
      Source: Sonnhammer & Koonin, "Orthology, paralogy and proposed classification for paralog subtypes"
  18. Tree reconciliation
       • The automatic detection of speciation and duplication events using a species tree and a gene family tree
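
As an illustration of tree reconciliation, the sketch below maps every internal gene-tree node to the last common ancestor (LCA) of its species in the species tree and calls the node a duplication when a child maps to the same species-tree node. This simple LCA mapping is one common approach and is an assumption here, as are the nested-tuple tree encoding and the toy genes; the slides do not name a specific algorithm.

```python
def species_ancestors(species_tree):
    """Map each species-tree node to [itself, parent, ..., root]."""
    paths = {}

    def walk(node, above):
        paths[node] = [node] + above
        if isinstance(node, tuple):
            for child in node:
                walk(child, paths[node])

    walk(species_tree, [])
    return paths


def lca(paths, a, b):
    """Lowest species-tree node that is an ancestor of both a and b."""
    ancestors_of_b = set(paths[b])
    return next(node for node in paths[a] if node in ancestors_of_b)


def reconcile(gene_tree, gene_to_species, species_tree):
    """Label every internal gene-tree node 'speciation' or 'duplication'."""
    paths = species_ancestors(species_tree)
    events = {}

    def place(node):
        if not isinstance(node, tuple):          # leaf: a gene name
            return gene_to_species[node]
        child_maps = [place(child) for child in node]
        here = child_maps[0]
        for mapped in child_maps[1:]:
            here = lca(paths, here, mapped)
        # A child mapping to the same species node implies a duplication.
        events[node] = "duplication" if here in child_maps else "speciation"
        return here

    place(gene_tree)
    return events


def leaves(node):
    """Set of gene names below a gene-tree node."""
    if not isinstance(node, tuple):
        return {node}
    return set().union(*(leaves(child) for child in node))


def relationship(gene_tree, events, g1, g2):
    """Orthologs if the genes' LCA is a speciation node, paralogs otherwise."""
    node = gene_tree
    while True:
        deeper = [c for c in node if isinstance(c, tuple) and {g1, g2} <= leaves(c)]
        if not deeper:
            break
        node = deeper[0]
    return "orthologs" if events[node] == "speciation" else "paralogs"


# Toy example: a duplication in the common ancestor of species A, B and C.
species_tree = (("A", "B"), "C")
gene_tree = ((("a1", "b1"), "c1"), (("a2", "b2"), "c2"))
gene_to_species = {g: g[0].upper() for g in ["a1", "b1", "c1", "a2", "b2", "c2"]}

events = reconcile(gene_tree, gene_to_species, species_tree)
print(relationship(gene_tree, events, "a1", "b1"))
print(relationship(gene_tree, events, "a1", "a2"))
```

In this toy tree the root is the only duplication, so genes from different copies (for example a1 and a2, or a1 and b2) come out as paralogs, while genes within one copy are orthologs.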
  19. III. Types of proteome analysis
  20. The evolution of multi-domain proteins
  21. Interpreting the output of an all-against-all similarity search
      Metrics for sequence similarity:
       • E-value, bit score or percent identity
       • Alignment coverage
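
The metrics listed above can be read from BLAST tabular output (`-outfmt 6`), whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The sketch below parses one such line and adds query coverage. Query lengths are not among the default columns, so they are passed in separately; the example line, the protein names and the thresholds are illustrative assumptions, not values from the slides.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    evalue: float
    bitscore: float
    query_coverage: float


def parse_hit(line, query_lengths):
    """Parse one BLAST outfmt-6 line and compute the aligned fraction of the query."""
    f = line.rstrip("\n").split("\t")
    qstart, qend = int(f[6]), int(f[7])
    coverage = (abs(qend - qstart) + 1) / query_lengths[f[0]]
    return Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11]), coverage)


def keep(hit, max_evalue=1e-5, min_coverage=0.5):
    """Illustrative filter: significant E-value and enough of the query aligned."""
    return hit.evalue <= max_evalue and hit.query_coverage >= min_coverage


# Made-up hit: residues 11-210 of a 250-residue query are aligned.
line = "protA\tprotB\t62.50\t200\t75\t0\t11\t210\t5\t204\t3e-80\t250"
hit = parse_hit(line, {"protA": 250})
print(hit.query_coverage, keep(hit))
```

Filtering on coverage as well as E-value matters for multi-domain proteins, where a strong hit may cover only one shared domain.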
  22. Clustering of similar sequences
       • Proteins = vertices (nodes)
       • Sequence similarity relationships = edges
  23. Clustering of similar sequences
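
With proteins as nodes and similarity relationships as edges, the simplest clustering is to take the connected components of the graph (single linkage). The sketch below does this with a union-find structure. It is a baseline chosen for illustration, not the method of any tool named in these slides, and the protein identifiers are hypothetical.

```python
def cluster(edges):
    """Group proteins into connected components of the similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)           # merge the two components

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical similarity edges that passed some significance filter.
edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
print(cluster(edges))
```

Single linkage merges two families as soon as one protein links them, which is exactly what a shared domain in a multi-domain protein can cause; graph-clustering methods that weigh the edges are designed to be more robust to this.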
  24. Advanced methods for protein (orthology) clustering
       • Sequence similarity-based
           • COG (RBH) [Tatusov 1997]
           • InParanoid [Remm et al., 2001]
           • Tribe-MCL [Van Dongen 2000]
           • OrthoMCL [Li et al., 2003]
       • Phylogenetic tree-based
           • PhylomeDB [Huerta-Cepas et al., 2007]
           • Ensembl Compara [Vilella et al., 2008]
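
The reciprocal best hit (RBH) idea behind several similarity-based methods can be sketched for two proteomes: gene a and gene b are paired when b is the best hit of a and a is the best hit of b. The sketch covers only this pairwise step, not the full procedure of any tool listed above; the hit lists and bit scores are made up.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, bitscore). Return the best subject per query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up search results between proteome A and proteome B.
a_vs_b = [("a1", "b1", 300), ("a1", "b2", 120), ("a2", "b2", 210), ("a3", "b1", 90)]
b_vs_a = [("b1", "a1", 295), ("b2", "a2", 205), ("b2", "a1", 118)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a3 is left unpaired because its best hit b1 prefers a1. This is how RBH misses in-paralogs that arose after the speciation event.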
  25. Overview of methodologies
      [Figure labels: BBH; Inparanoid; COG; species overlap; reconciliation. Source: Gabaldon, 2008]
  26. IV. Resources
  27. Resources (bis)
       • Ensembl (Vertebrates)
       • Ensembl Genomes (Metazoa, Protists, Fungi, Plants & Bacteria)
       • OrthoMCLDB 5 (150 genomes)
       • YGOB (>15 Fungi)
  28. Hands-on
      Goal: identify and characterize gene family members encoding talin 2 (TLN2)
       1. Select a query gene
       2. Retrieve homologs/orthologs
       3. Create a multiple sequence alignment
       4. Identify conserved positions
       5. Create a phylogenetic tree and identify orthologous/paralogous genes
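
Step 4 of the hands-on (identifying conserved positions) can be approximated by scanning the columns of the alignment. The sketch below is a minimal version that reports columns where a single residue reaches a chosen fraction of the sequences. The three aligned sequences are made up and are not talin sequences, and the `min_fraction` parameter is an assumption for illustration.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=1.0):
    """Return (0-based column, residue) where one residue reaches min_fraction."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = [row[i] for row in alignment]
        residue, count = Counter(column).most_common(1)[0]
        # Gap characters never count as a conserved residue.
        if residue != "-" and count / len(alignment) >= min_fraction:
            conserved.append((i, residue))
    return conserved


# Made-up toy alignment ('-' marks a gap).
toy_alignment = ["MKL-CE", "MRLACE", "MKLGCE"]
print(conserved_columns(toy_alignment))
print(conserved_columns(toy_alignment, min_fraction=0.6))
```

Alignment viewers offer richer conservation scores that account for residue similarity; strict identity is only the simplest criterion.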
