Comparative genomics
in eukaryotes
Gene family analysis



  Klaas Vandepoele, PhD


Professor Ghent University
Comparative & Integrative Genomics
VIB – Ghent University, Belgium


                 http://www.bits.vib.be
Workflow




2
Applications of clustering the
        proteome(s)
       Gene families form the basis for the evolutionary
        (or phylogenetic) analysis of:
          Detection of orthologs and paralogs
          Gene duplication, family expansions,
           pseudogene formation and gene loss
          Species taxonomies
          Horizontal Gene Transfer (HGT)
          Evolution of gene structure
             • Introns
             • Protein domain organisation &
               (re)arrangements
          Base composition and codon usage

3
I. Structural annotation: genome-wide
        versus family-wise
       Rationale for family-wise annotation
           Every gene has its own sequence characteristics,
            and different genes evolve at different rates.
            Using these family-specific characteristics to
            determine homologous gene models improves the
            overall quality of the structural annotation.
       Properties:
           Slow and nearly manual procedure
           High-quality gene models that can reveal
            biologically novel findings

4
Workflow family-wise annotation
            procedure

  [Workflow diagram; boxes as labelled on the slide]
       • Collecting experimental representatives
       • MSA of experimental representatives
       • HMMbuild: family HMM profile
       • Ab initio gene prediction: species X proteome
       • BLAST / HMMsearch: putative homologs
       • Correction of gene model (EST/cDNA, protein motifs)
       • Classification using phylogenetic trees
       • Detailed characterization

  HMMER: http://hmmer.janelia.org/
5
Experimental representatives


InterProScan




Pfam HMM logo
     ClustalW + Jalview




6
BLAST / HMMsearch


    1. Use multiple sequence
       alignment to create HMM profile
    2. Use HMM profile to search for
       similar proteins
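
The two steps above can be illustrated with a deliberately simplified sketch. It is not HMMER: it builds a position-specific log-odds profile from an ungapped toy alignment (no insert or delete states) and slides it along each protein. The function names, the uniform background frequency, the pseudocount and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per column of an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_window_score(profile, sequence):
    """Best ungapped score of the profile over all windows of the sequence."""
    width = len(profile)
    scores = [
        sum(col.get(aa, 0.0) for col, aa in zip(profile, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores) if scores else float("-inf")

# Step 1: profile from a toy alignment of "experimental representatives".
toy_alignment = ["HAKF", "HSKF", "HTKF"]
profile = build_profile(toy_alignment)

# Step 2: rank the proteins of a toy proteome by their best profile score.
toy_proteome = {"p1": "MMHAKFLL", "p2": "MMGGGGLL"}
ranked = sorted(toy_proteome,
                key=lambda name: best_window_score(profile, toy_proteome[name]),
                reverse=True)
```

In practice the profile would be built and searched with HMMbuild and HMMsearch as named on the slide; the sketch only shows why a family-specific profile ranks family members above unrelated proteins.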




7
Representatives + putative homologs

                                                                        BioEdit Sequence Editor




The suffix finalcds indicates a gene model that was corrected relative to the original
gene model generated by the ab initio gene prediction


             Multiple sequence alignments assist in the detection and
              correction of errors in the structural annotation (missed exon)
8
Representatives + putative homologs




The suffix finalcds indicates a gene model that was corrected relative to the original
gene model generated by the ab initio gene prediction


             Multiple sequence alignments assist in the detection of errors
              in the structural annotation (false first exon)
9
Examples of family-specific protein
         motifs




        B-type cyclins have HxKF signature
        Cyclin destruction boxes (B1-type cyclin R-[AV]LGDIGN)

10
Examples of family-specific protein
         motifs
     (figure labels: Arabidopsis, Rice)




                      D-type cyclins contain LxCxE Rb-binding motif
                      Low conservation of phylogenetic signal at primary sequence level
                      General rules are rarely general: exceptions (i.e. missing protein
                       motifs) are frequent and might indicate functional divergence
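
The family-specific motifs named on these two slides can be screened for with regular expressions. The sketch below is a minimal illustration: it reads "x" as "any residue" and the dash in R-[AV]LGDIGN as a plain separator (both are assumptions), and the two test sequences are invented.

```python
import re

# Patterns transcribed from the slides; see the assumptions stated above.
MOTIFS = {
    "B-type HxKF signature": re.compile(r"H.KF"),
    "B1-type destruction box": re.compile(r"R[AV]LGDIGN"),
    "D-type LxCxE Rb-binding motif": re.compile(r"L.C.E"),
}

def scan_motifs(sequence):
    """Return {motif name: [0-based start positions]} for motifs that occur."""
    hits = {}
    for name, pattern in MOTIFS.items():
        positions = [match.start() for match in pattern.finditer(sequence)]
        if positions:
            hits[name] = positions
    return hits

toy_b_type = "MSRALGDIGNAAHAKFTT"  # invented sequence
toy_d_type = "MELLCTEAAA"          # invented sequence

b_hits = scan_motifs(toy_b_type)
d_hits = scan_motifs(toy_d_type)
```

A family member for which `scan_motifs` returns no hit is not necessarily misannotated; as the slide notes, missing motifs are frequent and might indicate functional divergence.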
11
Classification using phylogenetic
                tree construction

     [Phylogenetic tree figure] Clade annotations:
         • A- and B-type cyclins are mitotic cyclins
         • D-type cyclins are G1-specific
         • H-type cyclins regulate the activity of CDK-activating kinases


         • The complexity of the cyclin gene family appears to be higher in plants than in
         mammals
         • Whether there is functional redundancy within A- and B-type cyclins or different
         regulation (and expression) of some cyclin subclasses remains to be analyzed
12
Unraveling functional divergence using
     large-scale expression compendia

     [Figure: expression compendium; axes: Genes, Plant tissues]
13
Unraveling functional divergence using
     large-scale expression compendia

     [Figure: expression of A-type, B-type and D-type cyclins; axes: Genes, Plant tissues; source: Genevestigator]
14
II. Orthology & paralogy

        A major goal of sequence analysis is evolutionary
         reconstruction. It is critical to distinguish between two
         principal types of homologous relationships, which differ
         in their evolutionary history and functional implications.

        Orthologs are homologous genes that evolved through
         speciation (i.e. evolutionary counterparts derived from
         a single ancestral gene in the last common ancestor of
         the two species being compared).

        Paralogs are homologous genes that evolved through
         duplication within the same (perhaps ancestral) genome.

        These definitions were first introduced by Fitch (1970).
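
The definitions above translate directly into a rule on a gene tree: two genes are orthologs if their last common ancestor node is a speciation, and paralogs if it is a duplication. The sketch below applies that rule to a toy tree whose event labels are assigned by hand (an assumption; in practice they have to be inferred).

```python
# Toy gene tree as nested tuples: (event, left, right); leaves are gene names.
# Genes a1/a2 belong to species A, b1/b2 to species B (hypothetical names).
GENE_TREE = ("duplication",
             ("speciation", "a1", "b1"),
             ("speciation", "a2", "b2"))

def leaves(node):
    """List the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def classify_pairs(node, result=None):
    """Label each gene pair by the event at its last common ancestor node."""
    if result is None:
        result = {}
    if isinstance(node, str):
        return result
    event, left, right = node
    relation = "orthologs" if event == "speciation" else "paralogs"
    for x in leaves(left):
        for y in leaves(right):
            result[frozenset((x, y))] = relation
    classify_pairs(left, result)
    classify_pairs(right, result)
    return result

relations = classify_pairs(GENE_TREE)
```

Note that a1 and b2 come out as paralogs even though they sit in different species, because the duplication predates the speciation.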

15
Orthology & paralogy inference

     [Figure] Organism phylogeny (species tree): species A, B and C,
              separated by speciation events
     [Figure] Gene phylogenies, each containing a gene duplication:
              a) genes a1, a2, b1, b2, c1: Inparalogs
              b) genes a1, b1, c1, a2, b2, c2: Outparalogs
16
In- and outparalogy

     Source: Sonnhammer & Koonin, "Orthology, paralogy and proposed
     classification for paralog subtypes"
17
Tree reconciliation

        The automatic detection of speciation and duplication
         events using a species tree and gene family tree
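
One common way to automate this is LCA mapping: each gene-tree node is mapped to the last common ancestor, in the species tree, of the species found below it, and a node is called a duplication when it maps to the same species-tree node as one of its children. The sketch below is a minimal version under stated assumptions: a three-species tree ((A,B),C), binary gene trees written as nested tuples, and a hypothetical naming rule in which gene "a1" belongs to species "A". It is not the method of any particular tool.

```python
# Species tree ((A,B),C) stored as child -> parent (assumed toy topology).
SPECIES_PARENT = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": None}

def ancestors(species):
    """Path from a species-tree node up to the root, inclusive."""
    path = []
    while species is not None:
        path.append(species)
        species = SPECIES_PARENT[species]
    return path

def lca(x, y):
    """Last common ancestor of two species-tree nodes."""
    seen = set(ancestors(x))
    for node in ancestors(y):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def species_of(gene):
    return gene[0].upper()  # assumption: gene "a1" belongs to species "A"

def reconcile(node, events):
    """Map a gene-tree node to the species tree; record its event type."""
    if isinstance(node, str):
        return species_of(node)
    left, right = node
    m_left = reconcile(left, events)
    m_right = reconcile(right, events)
    mapping = lca(m_left, m_right)
    events[node] = "duplication" if mapping in (m_left, m_right) else "speciation"
    return mapping

# Duplication before the speciations (outparalog-like toy case).
old_dup = ((("a1", "b1"), "c1"), (("a2", "b2"), "c2"))
old_events = {}
reconcile(old_dup, old_events)

# Duplication after the A/B speciation (inparalog-like toy case).
recent_dup = (("a1", "a2"), "b1")
recent_events = {}
reconcile(recent_dup, recent_events)
```

The result depends on both trees being correct; an error in the gene tree topology typically shows up as spurious duplications followed by losses.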




18
III. Types of proteome analysis




19
The evolution of multi-domain
     proteins




20
Interpreting the output of an
       all-against-all similarity search




     Metrics for sequence similarity:
     • E-value, bit score or percent identity
     • Alignment coverage
21
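
As a sketch of how these metrics are used together, the function below filters hits from BLAST tabular output (the standard 12 columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Query lengths are not part of that default output, so they are passed in separately. The thresholds, protein names and hit lines are illustrative assumptions, not recommended values.

```python
def filter_hits(tabular_lines, query_lengths, max_evalue=1e-5, min_coverage=0.5):
    """Keep hits that pass both an E-value and a query-coverage threshold."""
    kept = []
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Fraction of the query covered by the alignment.
        coverage = (abs(qend - qstart) + 1) / query_lengths[query]
        if evalue <= max_evalue and coverage >= min_coverage:
            kept.append((query, subject, evalue, bitscore, round(coverage, 2)))
    return kept

# Invented hits for one query of length 100.
toy_hits = [
    "p1\tp2\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-50\t200",  # long, significant
    "p1\tp3\t40.0\t20\t12\t0\t1\t20\t5\t24\t1e-6\t30",       # significant, short
    "p1\tp4\t30.0\t90\t63\t0\t1\t90\t1\t90\t0.5\t25",        # long, not significant
]
kept = filter_hits(toy_hits, {"p1": 100})
```

The coverage filter matters for multi-domain proteins: a significant hit that covers only one shared domain says little about homology over the full length.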
Clustering of similar sequences




             Proteins = vertices ~ nodes
        Sequence similarity relationship = edges
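
The simplest clustering on such a graph is single linkage: every connected component becomes one cluster. The sketch below uses a small union-find structure; the protein names and edges are invented, and the edges are assumed to have been filtered by similarity thresholds beforehand.

```python
def cluster_proteins(edges):
    """Single-linkage clustering: connected components of the similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())

# Vertices are proteins, edges are similarity relationships (toy data).
toy_edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
families = cluster_proteins(toy_edges)
```

Single linkage is easy to reason about but can chain unrelated families together through a single spurious edge, which is one motivation for the more advanced methods listed on a later slide.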
22
Clustering of similar sequences




23
Advanced methods for protein
         (orthology) clustering
        Sequence similarity-based
            COG (RBH)         [Tatusov 1997]
            InParanoid        [Remm et al., 2001]
            Tribe-MCL         [Van Dongen 2000]
            OrthoMCL          [Li et al., 2003]

        Phylogenetic tree-based
            PhylomeDB         [Huerta-Cepas et al., 2007]
            Ensembl Compara   [Vilella et al., 2008]
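
The reciprocal best hit (RBH) idea behind the first similarity-based entry can be sketched in a few lines: two proteins from different species are paired when each is the other's top-scoring hit. The score dictionaries below are invented bit scores, and ties or inparalogs are not handled in this minimal version.

```python
def best_hits(scores):
    """scores: {(query, subject): bit score}; return the best subject per query."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Toy searches between species A (a1, a2) and species B (b1, b2).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 80, ("a2", "b1"): 150, ("a2", "b2"): 60}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 150, ("b2", "a1"): 80, ("b2", "a2"): 60}
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a2 and b2 remain unpaired, which illustrates a known limitation of plain RBH when lineage-specific duplications are present.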


24
Overview methodologies

     [Figure; source: Gabaldón, 2008] Methods shown: BBH, Inparanoid,
     COG, species overlap, reconciliation
25
IV. Resources




26
Resources (continued)

        Ensembl (Vertebrates)
        Ensembl Genomes (Metazoa, Protists,
         Fungi, Plants & Bacteria)

        OrthoMCLDB 5 (150 genomes)
        YGOB (>15 Fungi)




27
Hands-on

        Goal: identify and characterize gene family
         members encoding talin 2 (TLN2)

         1.   Select the query gene
         2.   Retrieve homologs/orthologs
         3.   Create a multiple sequence alignment
         4.   Identify conserved positions
         5.   Create a phylogenetic tree and identify
              orthologous and paralogous genes
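
Step 4 can be sketched as a column-wise scan of the alignment. The function below reports columns in which one residue reaches a chosen fraction of the sequences; the threshold name `min_fraction` and the toy alignment (not real TLN2 sequences) are assumptions for illustration.

```python
from collections import Counter

def conserved_columns(alignment, min_fraction=1.0):
    """Return (0-based column index, residue) for sufficiently conserved columns."""
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residue, count = Counter(column).most_common(1)[0]
        # Gap-dominated columns are never reported as conserved.
        if residue != "-" and count / len(column) >= min_fraction:
            conserved.append((index, residue))
    return conserved

# Invented aligned sequences of equal length; "-" marks a gap.
toy_alignment = ["MKV-LA", "MRVGLA", "MKVGIA"]
strict = conserved_columns(toy_alignment)
relaxed = conserved_columns(toy_alignment, min_fraction=0.6)
```

Alignment viewers offer the same information interactively; the sketch only makes explicit what "conserved" means for a given threshold.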



28
