Ortholog assignment


  1. Computational Prediction of Orthologs. Melvin Zhang, School of Computing, National University of Singapore. May 4, 2011
  2. A gene is a unit of heredity in a living organism
  3. One gene may encode multiple proteins
  4. Two genes are homologous if they descended from a common ancestral gene [1]. [1] With respect to a specific speciation event.
  5. Two genes are homologous if they descended from a common ancestral gene [1]. In practice, homology is determined using sequence alignment. Figure: A sequence alignment of two proteins. [1] With respect to a specific speciation event.
  6. Two genes are homologous if they descended from a common ancestral gene [1]. In practice, homology is determined using sequence alignment. Figure: A sequence alignment of two proteins. Have you seen phrases like “high homology”, “significant homology”, or “35% homology”? [1] With respect to a specific speciation event.
  7. Orthologs are due to speciation, paralogs are due to duplication. Figure: gene tree from the MRCA of G and H with speciation and duplication nodes; genes g (in G) and h, h (in H); relationships labeled main orthologs, orthologs, and paralogs.
  8. Orthologs maintain their function. Annotate genes with unknown functions.
  9. Orthologs maintain their function. Annotate genes with unknown functions. Infer protein-protein interactions.
  10. Orthologs are not one-to-one due to lineage-specific gene duplications. Main orthologs are orthologs that have retained their ancestral position [2]. Figure: gene tree from the MRCA of G and H with speciation and duplication nodes; genes g (in G) and h, h (in H); relationships labeled main orthologs, orthologs, and paralogs. [2] Burgetz et al., Evolutionary Bioinformatics 2006
  11. Problem of identifying main orthologs. Input: position and sequences of genes in 2 genomes. Output: for each gene in their common ancestor, find its direct descendant in G and H.
  12. Problem of identifying main orthologs. Input: position and sequences of genes in 2 genomes. Output: for each gene in their common ancestor, find its direct descendant in G and H. Complications: gene duplication, gene loss, horizontal gene transfer, gene fusion and fission.
  13. Three main approaches for finding orthologs: graph based, tree based, rearrangement based
  14. Bidirectional Best Hit and variants. Most popular approach. High level of functional relatedness [a]. Reciprocal smallest distance: use evolutionary distance estimates instead of BLAST scores. OMA stable pairs: introduce a tolerance interval and stable matching. [a] Altenhoff et al., PLoS CB 2009
  15. EnsemblCompara GeneTrees [3]. Figure: Species tree for 4 species on top, gene tree for gene A. Based on reconciliation of gene trees with the species tree. Steps: (1) partition genes into families and construct gene trees; (2) reconcile each gene tree with the species tree. [3] Vilella et al., Genome Res 2009
  16. MSOAR2 [4]. Figure: Rearrangement scenario between human and mouse. Steps: (1) partition genes into families and assign each family a unique symbol; (2) reconstruct the most parsimonious rearrangement scenario (inversion, translocation, fusion, fission, duplication); (3) extract the corresponding orthologs. [4] Fu et al., JCB 2007
  17. Can conserved gene neighborhood improve ortholog predictions?
  18. Human-mouse synteny blocks. Conserved synteny blocks between the human and mouse genomes, generated by the Cinteny web server [5]. [5] Sinha and Meller, BMC Bioinformatics 2007
  19. Local synteny criteria [6]. Figure: Local synteny: more than one unique match within +/- 3 genes. Homology is defined as BLASTP E-value < 1e-5. 94% of sampled inter-species pairs are identified as orthologs by both Inparanoid (based on BBH) and the local synteny criteria. [6] Jin Jun et al., BMC Genomics 2009
  20. Local synteny score (LC). Figure: gene g in G and gene h in H with matching edges between their neighboring genes. The local synteny score of g and h is 4 since there are 4 edges in the maximum matching.
  21. Smith-Waterman alignment score (SW)
  22. BBH-LS: bidirectional best hits based on a linear combination of SW and LC. Figure: gene g in G and gene h in H. sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)
  23. Human-Mouse-Rat dataset. Input: human, mouse, and rat genes downloaded from Ensembl. Benchmark: no “golden” benchmark for true orthology; assume that orthologs are assigned the same gene symbol.
  24. Tuning the BBH-LS method. sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)
  25. Results for various methods on Human-Mouse. Figure: TP: same gene symbols, FP: different gene symbols. More true positives and fewer false positives than MSOAR2.
  26. Results for various methods on Human-Rat. Figure: TP: same gene symbols, FP: different gene symbols.
  27. Results for various methods on Mouse-Rat. Figure: TP: same gene symbols, FP: different gene symbols.
  28. How local synteny helps. Figure: human chr 15 (CTSH, RASGRF1, ANKRD34C) and human chr 5 (MSH3, RASGRF2, CKMT2) against mouse chr 9 (ANKRD34C, RASGRF1, CTSH) and mouse chr 13 (CKMT2, RASGRF2, MSH3); edge labels in extraction order: sw = 2466, sw = 2003, ls = 5, ls = 5, sw = 5265, ls = 1. Bold edges are the pairings from BBH-LS, thin edges are the pairings from BBH. BBH paired RASGRF2 (human) with RASGRF1 (mouse) because of the high SW score; BBH-LS corrects this using LC.
  29. Summary: Identifying main orthologs. Figure: gene tree from the MRCA of G and H with speciation and duplication nodes; genes g (in G) and h, h (in H); relationships labeled main orthologs, orthologs, and paralogs. For each gene in their common ancestor, find its direct descendant in G and H.
  30. Summary: Three approaches: graph based, tree based, rearrangement based
  31. BBH-LS: bidirectional best hits based on a linear combination of SW and LC
  32. BBH-LS: bidirectional best hits based on a linear combination of SW and LC. Figure: gene g in G and gene h in H.
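
The BBH-LS idea from slides 20 to 22 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. It makes several assumptions the slides do not state: each genome is a single ordered list of gene identifiers (chromosomes are ignored), the window of +/- w genes includes the gene itself, SW and LC have already been scaled to comparable ranges, and ties are broken by first occurrence. The names `max_matching_size`, `local_synteny`, `combine`, and `bidirectional_best_hits` are hypothetical, and the scores in the demo are made up.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (gene in G, gene in H)


def max_matching_size(adj: Dict[str, Set[str]]) -> int:
    """Size of a maximum bipartite matching, found with augmenting paths."""
    owner: Dict[str, str] = {}  # gene in H -> gene in G it is matched to

    def augment(u: str, seen: Set[str]) -> bool:
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current owner can be re-matched.
            if v not in owner or augment(owner[v], seen):
                owner[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in adj)


def local_synteny(g: str, h: str, order_g: List[str], order_h: List[str],
                  homologs: Set[Pair], w: int = 3) -> int:
    """LC(g, h): maximum matching between homologous genes near g and near h.

    Assumption: the window spans w genes on each side and includes g and h.
    """
    i, j = order_g.index(g), order_h.index(h)
    window_g = order_g[max(0, i - w): i + w + 1]
    window_h = order_h[max(0, j - w): j + w + 1]
    adj = {a: {b for b in window_h if (a, b) in homologs} for a in window_g}
    return max_matching_size(adj)


def combine(sw: Dict[Pair, float], lc: Dict[Pair, float],
            f: float) -> Dict[Pair, float]:
    """sim(g, h) = (1 - f) * SW(g, h) + f * LC(g, h); inputs assumed pre-scaled."""
    return {p: (1 - f) * sw[p] + f * lc.get(p, 0.0) for p in sw}


def bidirectional_best_hits(sim: Dict[Pair, float]) -> Set[Pair]:
    """Pairs (g, h) where h is the best hit of g and g is the best hit of h."""
    best_for_g: Dict[str, str] = {}
    best_for_h: Dict[str, str] = {}
    for (g, h), s in sim.items():
        if g not in best_for_g or s > sim[(g, best_for_g[g])]:
            best_for_g[g] = h
        if h not in best_for_h or s > sim[(best_for_h[h], h)]:
            best_for_h[h] = g
    return {(g, h) for g, h in best_for_g.items() if best_for_h[h] == g}


if __name__ == "__main__":
    # Made-up, pre-scaled scores: the cross pair (g2, h1) has the highest SW
    # but the lowest LC, loosely echoing the situation on slide 28.
    sw = {("g1", "h1"): 0.5, ("g2", "h2"): 0.4, ("g2", "h1"): 1.0}
    lc = {("g1", "h1"): 1.0, ("g2", "h2"): 1.0, ("g2", "h1"): 0.2}
    print(sorted(bidirectional_best_hits(combine(sw, lc, f=0.0))))
    # prints [('g2', 'h1')]
    print(sorted(bidirectional_best_hits(combine(sw, lc, f=0.5))))
    # prints [('g1', 'h1'), ('g2', 'h2')]
```

With f = 0 the sketch reduces to plain BBH on SW and keeps only the cross pair. With f = 0.5 the neighborhood evidence outweighs the higher alignment score and both same-neighborhood pairs are recovered. How SW and LC are actually scaled before combining, and which value of f was chosen, is not given in the slides.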
