Detection of genomic homology in eukaryotic genomes

i-ADHoRe 3.0--fast and sensitive detection of genomic homology in extremely large data sets.
Proost S, Fostier J, De Witte D, Dhoedt B, Demeester P, Van de Peer Y, Vandepoele K.
Nucleic Acids Res. 2012 Jan;40(2):e11.

Comparative genomics is a powerful means to gain insight into the evolutionary processes that shape the genomes of related species. As the number of sequenced genomes increases, the development of software to perform accurate cross-species analyses becomes indispensable. However, many implementations that have the ability to compare multiple genomes exhibit unfavorable computational and memory requirements, limiting the number of genomes that can be analyzed in one run. Here, we present a software package to unveil genomic homology based on the identification of conservation of gene content and gene order (collinearity), i-ADHoRe 3.0, and its application to eukaryotic genomes. The use of efficient algorithms and support for parallel computing enable the analysis of large-scale data sets. Unlike other tools, i-ADHoRe can process the Ensembl data set, containing 49 species, in 1 h. Furthermore, the profile search is more sensitive to detect degenerate genomic homology than chaining pairwise collinearity information based on transitive homology. From ultra-conserved collinear regions between mammals and birds, by integrating coexpression information and protein-protein interactions, we identified more than 400 regions in the human genome showing significant functional coherence. The different algorithmical improvements ensure that i-ADHoRe 3.0 will remain a powerful tool to study genome evolution.

Published in: Education
  • If the basic alignment procedure is stalled because of conflicts (i.e. no alignable set can be found), we want to determine which links are involved in these conflicts and which links are to be removed from G. To quantitatively investigate the number of conflicts that a link L_st is involved in, we want to assess to which degree s and t are connected through blocking paths. In graph theory, such problems can be addressed by solving the well-known maximum flow problem.
  • Mention 3 sub-sections: gene families, genome organization and Workbench
  • Remnants of WGD in all extant species
  • Gene loss Arabidopsis: Manual inspection identified 26 reliable nonredundant multiplicons of which, in seven cases, the Arabidopsis segments could, based on Ks, unambiguously be grouped in two pairs that originated during the youngest duplication. All analyzed multiplicons can be visualized through the PLAZA website using a link reported in Supplemental Table 6 online. Analyzing all different patterns of gene loss using 139 ancestral loci (see Supplemental Table 6 online) revealed that 3.6 times more genes have been retained after the youngest α than after the oldest β Arabidopsis-specific WGD (31.13 and 8.63% retention, respectively). Consequently, this massive amount of gene loss masks most traces of the oldest WGD and explains why, with only the Arabidopsis genome available, the existence and timing of an older β duplication was debated (Simillion et al., 2002; Blanc et al., 2003; Bowers et al., 2003).
  • Systematic evaluation of orthology and conserved co-expression using the ECC method for a set of 21 homologs (encoding ubiquitin-activating enzyme E1) from Arabidopsis, grape, Medicago, maize, poplar, rice and soybean (AT, VV, MT, ZM, PT, OS and GM prefixes, respectively). Groups of inparalogous genes are indicated using dashed vertical lines. Upper-left triangles denote the sequence-based orthologous relationship between the genes, with a darker shade of blue indicating a higher number of evidence types reported by the PLAZA 2.0 Integrative Orthology approach. The lower-right yellow triangles denote gene pairs with significant ECC scores (p-value < 0.05), white triangles represent gene pairs lacking a significant number of shared orthologs (p-value ≥ 0.05) and darker shades of yellow indicate a higher fraction of shared orthologs. Arced sections denote missing expression data for at least one of the genes. ECC scores are only computed between genes from different species.
  • Orthologous plant modules cover >43,000 unknown genes!
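The slide notes above mention that the degree to which two nodes s and t are connected through blocking paths can be assessed by solving a maximum flow problem. The following is a minimal sketch of that general idea, not the authors' implementation: it runs the textbook Edmonds-Karp algorithm on a small, hypothetical directed graph with unit capacities, where the resulting flow equals the number of edge-disjoint paths from s to t. The function name `max_flow` and the toy edge list are assumptions made for illustration only.

```python
from collections import defaultdict, deque


def max_flow(edges, source, sink):
    """Edmonds-Karp maximum flow for (u, v, capacity) triples of a directed graph."""
    if source == sink:
        raise ValueError("source and sink must differ")
    capacity = defaultdict(int)
    neighbours = defaultdict(set)
    for u, v, c in edges:
        capacity[(u, v)] += c
        neighbours[u].add(v)
        neighbours[v].add(u)  # reverse edge used by the residual graph

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in parent and capacity[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, capacity[(parent[v], v)])
            v = parent[v]

        # Push the bottleneck along the path and update residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            capacity[(u, v)] -= bottleneck
            capacity[(v, u)] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical toy graph: every edge has capacity 1.
toy_edges = [("s", "a", 1), ("s", "b", 1), ("a", "t", 1), ("b", "t", 1), ("a", "b", 1)]
print(max_flow(toy_edges, "s", "t"))
```

With unit capacities, a higher flow value between s and t means more independent paths connect them, which is one way to quantify how heavily a link is involved in conflicts.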

    1. 1. DETECTION OF GENOMIC HOMOLOGY IN EUKARYOTIC GENOMES Challenges & applications Klaas Vandepoele Strasbourg, October 16th 2012 Department of Plant Biotechnology and Bioinformatics, Ghent University Department of Plant Systems Biology, VIB http://twitter.com/plaza_genomics
    2. 2. http://bioinformatics.psb.ugent.be/cig/
    3. 3. KLAAS VANDEPOELE Klaas Vandepoele was appointed Tenure Track Professor at Ghent University in 2011 (MRP Nucleotides 2 Networks) within the Department of Plant Biotechnology and Bioinformatics (Ghent University - VIB). He is currently (co-)promoter of 3 PhD students. His scientific objectives are to extract biological knowledge from large-scale experimental data sets using data integration and comparative genomics. Through the development and application of various bioinformatics tools, including comparative sequence analysis, cross-species gene expression analysis, ChIP-Seq and cis-regulatory elements analysis, he tries to identify new aspects of genome biology, especially in the area of gene function prediction and gene regulation. Recently developed tools include ATCOECIS, a toolbox for co-expression and cis-regulatory element analysis, and PLAZA, a resource for plant comparative genomics (>1,800 visits per month coming from >85 different countries). During the last 10 years, Klaas Vandepoele published >50 papers in international peer-reviewed scientific journals, of which 80% in journals with an impact factor (IF) >5 and 33% in journals with an IF >10. His H-index is 25.
    4. 4. OVERVIEW • Cross-species genome analysis • Detection of genomic homology using i-ADHoRe 3.0 • Applications - Plant genomes & WGD - Vertebrate genomes • Conclusions
    5. 5. 1. CROSS-SPECIES GENOME ANALYSIS Alignment of homologous regions • Inter-genomic: aligning genomic sequences from different species • Intra-genomic: aligning genomic sequences from the same species Different levels of resolution • Comparative mapping (markers) • Synteny (~ gene content) • Colinearity (gene content + order conservation) • DNA-based alignments (base-to-base mapping)
    6. 6. COMPARATIVE SEQUENCE ANALYSIS • Genome conservation: transfer knowledge gained from model organisms to crops • Genome variation: understand how genomes change over time in order to identify evolutionary processes and constraints • Detection of new functional elements, both coding and non-coding (Figure labels: Ancestral genome, Contemporary species) Hardison, PLoS Biology 2003
    7. 7. HUMAN – MOUSE - RAT resolution
    8. 8. HUMAN – MOUSE ORTHOLOGOUS REGIONS Resolution: comparative mapping. Genome translocations associated with human-mouse speciation. Human; Mouse chr IV. www.ensembl.org
    9. 9. HUMAN GENOME BROWSER Resolution: conserved gene content & order. Human chr I; Mouse chr IV. Gene loss and insertions in orthologous segments since human-mouse speciation. EST/cDNA similarities; genome similarities; human gene model
    10. 10. HUMAN – MOUSE BASE-TO-BASE MAPPING resolution • Functional sequences (e.g. exons) evolve slower than non-functional ones (e.g. introns) due to natural selection against mutations in these regions • Consequently, functional elements, both coding and non-coding, are unusually well conserved in orthologous regions Blue: coding exons; GT donor; AG acceptor
    11. 11. 2. DETECTION OF GENOMIC HOMOLOGY International Chicken Genome Sequencing Consortium; A. thaliana – A. lyrata, Proost et al., 2011; Poplar - Tuskan et al., 2006
    12. 12. GENE COLINEARITY
    13. 13. MATRIX REPRESENTATION
    14. 14. MAP-BASED APPROACH • Represent chromosomes as sorted gene lists • Identify all homologous gene pairs between chromosomes (all-against-all BLASTP) • Score pairs of homologues in matrix • Statistical filtering (Figure: Gene Homology Matrix (GHM), Chromosome 1 vs. Chromosome 2) Vandepoele et al. (2002) Genome Research
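The map-based steps on this slide can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the i-ADHoRe algorithm: homology is taken from a hypothetical gene-to-family lookup (`family_of`) instead of all-against-all BLASTP, only gap-free diagonal runs are reported, and the statistical filtering step is omitted. All gene names, family labels and function names are invented for the example.

```python
def gene_homology_matrix(chrom_a, chrom_b, family_of):
    """Return the set of (i, j) cells where gene i of chrom_a and gene j of chrom_b are homologous."""
    return {
        (i, j)
        for i, gene_a in enumerate(chrom_a)
        for j, gene_b in enumerate(chrom_b)
        if family_of[gene_a] == family_of[gene_b]
    }


def diagonal_runs(cells, min_length=3):
    """Collect gap-free runs along diagonals (step +1) and anti-diagonals (step -1, inverted order)."""
    runs = []
    for step in (1, -1):
        for i, j in sorted(cells):
            if (i - 1, j - step) in cells:
                continue  # this cell continues a run that started earlier
            run = [(i, j)]
            while (run[-1][0] + 1, run[-1][1] + step) in cells:
                run.append((run[-1][0] + 1, run[-1][1] + step))
            if len(run) >= min_length:
                runs.append(run)
    return runs


# Hypothetical chromosomes as sorted gene lists, with invented family labels.
chrom_1 = ["a1", "a2", "a3", "a4", "a5"]
chrom_2 = ["b1", "b2", "b3", "b4"]
family_of = {
    "a1": "F1", "a2": "F2", "a3": "F3", "a4": "F4", "a5": "F9",
    "b1": "F8", "b2": "F2", "b3": "F3", "b4": "F4",
}

ghm = gene_homology_matrix(chrom_1, chrom_2, family_of)
print(diagonal_runs(ghm))
```

Real genomes contain gene loss, insertions and tandem duplicates, so a practical detector has to tolerate gaps along a diagonal and test each candidate region statistically, which this sketch deliberately leaves out.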
    15. 15. In an actual genome this becomes complex. Good statistical model to find biologically relevant regions
    16. 16. GENOMIC PROFILES pairwise multiple Simillion et al. (2004) Genome Research
    17. 17. GRAPH-BASED ALIGNMENT INCL. CONFLICT RESOLUTION Needleman-Wunsch Greedy graph-based Fostier, … & Vandepoele, Bioinformatics 2011
    18. 18. I-ADHORE ALGORITHM Proost, Fostier … & Vandepoele, NAR 2011
    19. 19. OUTPUT: MULTIPLE HOMOLOGOUS SEGMENTS - MULTIPLICON HS MM GG TN Mm2 Gg20 Hs20 Gg2 Mm18 Hs18 Tn15 Within and between species gene colinearity!
    20. 20. PROFILES OFFER IMPROVED SENSITIVITY TO DETECT DEGENERATE GENOMIC HOMOLOGY Sensitivity (#homologous segments) Proost, Fostier … & Vandepoele, NAR 2011
    21. 21. I-ADHORE 3.0 Speed & memory footprint MCScan: Tang et al. 2008 Cyntenator: Rödelsperger et al. 2010 Proost, Fostier … & Vandepoele, NAR 2011
    22. 22. 3. APPLICATIONS IN PLANTS INTEGRATION IN PLAZA 2.5 HTTP://BIOINFORMATICS.PSB.UGENT.BE/PLAZA PLANT COMPARATIVE GENOMICS PLATFORM 25 PLANT SPECIES BLAST PAIRS, GENE FAMILIES & I-ADHORE PRE-COMPUTED CONNECTED TO SEVERAL TOOLS TO VISUALIZE THE I-ADHORE DATA
    23. 23. Gene family analysis Genome analysis >20 tools available Proost et al., Plant Cell 2009; Van Bel et al., 2012
    24. 24. GENOME-WIDE COLINEARITY Z. mays WGDotplot O. sativa
    25. 25. MULTI-SPECIES COLINEARITY profile
    26. 26. Source: Y. Van de Peer
    27. 27. TRIPLICATED GENOME STRUCTURE VITIS
    28. 28. TRACES OF AN ANCIENT HEXAPLOIDIZATION IN VITIS
    29. 29. 1:4 COLINEARITY BETWEEN VITIS AND ARABIDOPSIS
    30. 30. RESOLVING A SERIES OF ANCIENT AND RECENT WGDS IN DICOTS Arabidopsis a Arabidopsis b Arabidopsis a Arabidopsis b Papaya Poplar a Poplar b Vitis << < > >>
    31. 31. INTERMEZZO – WGDS & THE QUEST FOR PLANT ORTHOLOGS • Tree-based orthologs (TROG) inferred using tree reconciliation • Orthologous gene families (ORTHO) inferred using OrthoMCL • Anchor points refer to gene-based colinearity between species • Best hit families (BHIF) inferred from Blast hits including inparalogs Van Bel et al., Plant Physiology 2012
    32. 32. COMPLEX GENE ORTHOLOGY RELATIONSHIPS IN PLANTS Query species: A. thaliana Target species
    33. 33. SORTING OUT PLANT (CO-)ORTHOLOGS USING EXPRESSION CONTEXT CONSERVATION Protein integrative orthology Expression Context Conservation scores (p-value < 0.05) Inparalogs (species-specific duplicates)
    34. 34. 3. APPLICATIONS IN VERTEBRATE GENOME EVOLUTION OVERVIEW OF THE ENSEMBL DATASET (RELEASE 57) RESOURCE FOR ANIMAL GENOMES (& OUTGROUPS) CONTAINS 49 SPECIES 832 666 PROTEIN CODING GENES 70 161 CHROMOSOMES/SCAFFOLDS HTTP://WWW.ENSEMBL.ORG/ HUBBARD ET AL., 2009
    35. 35. RESULTS • Runtime on 32 CPUs (4 nodes with 8 cores) ~ 4.5 hours (several months with previous version) • Memory usage ~ 4 GByte / core • Search results: - 237 292 multiplicons - 5 204 391 anchor points • Up to 46 colinear regions could be grouped into one large multiplicon - Unsurprisingly the largest cluster in these animal genomes was the well known Hox cluster - The Hox cluster has a highly conserved order as order is strongly linked with the development of the body plan
    36. 36. ENRICHED COEXPRESSION VS CONSERVED COLINEARITY CONSERVED COLINEAR REGIONS - IS THERE A LINK WITH FUNCTIONAL CLUSTERS? Human Chromosome 4. Bars indicate the number of species the region is conserved in. Dark regions indicate significant co-expression
    37. 37. BIOLOGICAL SIGNIFICANCE OF HIGHLY CONSERVED COLLINEAR REGIONS
    38. 38. 4. CONCLUSIONS (1) i-ADHoRe 3.0 - Algorithmic and technical improvements now allow the analysis of extremely large datasets (2) Application on plant species & the integration in the PLAZA comparative genomics resource (3) It is now possible to analyze all Ensembl genomes in a single run
    39. 39. ACKNOWLEDGEMENTS Sebastian Proost Jan Fostier Michiel Van Bel Yves Van de Peer Piet Demeester
    40. 40. ACKNOWLEDGEMENTS Further reading: Fostier J*, Proost S*, Dhoedt B, Saeys Y, Demeester P, Van de Peer Y, Vandepoele K (2011) A greedy, graph-based algorithm for the alignment of multiple homologous gene lists. Bioinformatics 27: 749-756. Proost S*, Fostier J*, De Witte D, Dhoedt B, Demeester P, Van de Peer Y, Vandepoele K (2012) i-ADHoRe 3.0--fast and sensitive detection of genomic homology in extremely large data sets. Nucleic Acids Res 40: e11. Van Bel M*, Proost S*, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K (2012) Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol 158: 590-600.
