6. Two genes are homologous if they descended from a common ancestral gene [1]
In practice, homology is determined using sequence alignment.
Figure: A sequence alignment of two proteins
Have you seen phrases like “high homology”, “significant homology”, or “35% homology”?
[1] with respect to a specific speciation event
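Percentages quoted as "homology" normally measure sequence identity within an alignment, whereas homology itself is a yes/no statement about common ancestry. A minimal sketch of how percent identity might be computed from two already-aligned sequences follows. The function name, the toy sequences, and the choice to skip columns containing a gap are assumptions for illustration, not part of the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gap columns are excluded from the denominator
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment (hypothetical sequences): 9 gap-free columns, 8 of them identical.
seq_a = "MKT-AYIAKQR"
seq_b = "MKTLAYLAK-R"
identity = percent_identity(seq_a, seq_b)
```

Other conventions divide by the full alignment length or by the shorter sequence, so a reported percentage is only meaningful together with its denominator.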
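7. Orthologs are due to speciation, paralogs are due to duplication
Figure: Gene tree below the MRCA of G and H with one speciation and one duplication event. Genome G holds gene g and genome H holds two copies of h; the tree marks which pairs are main orthologs, orthologs, and paralogs.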
9. Orthologs maintain their function
Annotate genes with unknown functions.
Infer protein-protein interactions.
10. Orthologs are not one-to-one due to lineage-specific gene duplications
Main orthologs are orthologs that have retained their ancestral position. [2]
Figure: Gene tree below the MRCA of G and H (speciation, duplication; g in G, two copies of h in H) marking main orthologs, orthologs, and paralogs.
[2] Burgetz et al., Evolutionary Bioinformatics 2006
12. Problem of identifying main orthologs
Input: Positions and sequences of genes in two genomes, G and H
Output: For each gene in their common ancestor, its direct descendant in G and in H
Complications: gene duplication, gene loss, horizontal gene transfer, gene fusion and fission
13. Three main approaches for finding orthologs
Graph-based, tree-based, and rearrangement-based
14. Bidirectional Best Hit and variants
Bidirectional best hit (BBH): the most popular approach; gives a high level of functional relatedness. [a]
Reciprocal smallest distance: uses an evolutionary distance estimate instead of BLAST scores.
OMA stable pairs: introduces a tolerance interval and stable matching.
[a] Altenhoff et al., PLoS CB 2009
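The basic bidirectional best hit rule can be sketched in a few lines: a pair (g, h) is kept only when h is the best-scoring hit of g and g is the best-scoring hit of h. The score table, the gene names, and the tie-breaking rule (first seen wins) are assumptions for illustration; in practice the scores would come from a tool such as BLAST.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]


def bidirectional_best_hits(scores: Scores) -> List[Tuple[str, str]]:
    """Return pairs (g, h) where each gene is the other's highest-scoring hit."""
    best_for_g: Dict[str, str] = {}
    best_for_h: Dict[str, str] = {}
    for (g, h), score in scores.items():
        # Strict '>' means the first pair seen wins a tie (an arbitrary choice).
        if g not in best_for_g or score > scores[(g, best_for_g[g])]:
            best_for_g[g] = h
        if h not in best_for_h or score > scores[(best_for_h[h], h)]:
            best_for_h[h] = g
    return sorted((g, h) for g, h in best_for_g.items() if best_for_h[h] == g)


# Hypothetical similarity scores between genes of genome G and genome H.
toy_scores: Scores = {
    ("g1", "h1"): 900.0,
    ("g1", "h2"): 400.0,
    ("g2", "h1"): 950.0,  # h1 prefers g2, so g1 is left without a reciprocal hit
    ("g2", "h2"): 300.0,
    ("g3", "h3"): 700.0,
}
pairs = bidirectional_best_hits(toy_scores)
```

In this toy input g1's best hit is h1, but h1's best hit is g2, so g1 stays unpaired. Cases like this are where the variants above change the scoring or the matching rule.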
15. EnsemblCompara GeneTrees [3]
Figure: Species tree for 4 species on top, gene tree for gene A
Based on reconciliation of gene trees with the species tree.
1. Partition genes into families and construct gene trees
2. Reconcile each gene tree with the species tree
[3] Vilella et al., Genome Res 2009
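Step 2 can be illustrated with the textbook LCA reconciliation rule. This is a simplified sketch, not the EnsemblCompara implementation. Each gene-tree node is mapped to the lowest species-tree node that covers all species below it, and a node is called a duplication when it maps to the same species-tree node as one of its children. The species tree, the node names, and the gene tree are hypothetical.

```python
from typing import Dict, List, Tuple, Union

# A gene tree is either a leaf (the species the gene comes from) or a pair of subtrees.
GeneTree = Union[str, Tuple["GeneTree", "GeneTree"]]

# Hypothetical species tree for four species, stored as child -> parent.
SPECIES_PARENT: Dict[str, str] = {
    "mouse": "rodent",
    "rat": "rodent",
    "rodent": "mammal",
    "human": "mammal",
    "mammal": "root",
    "chicken": "root",
}


def ancestors(node: str) -> List[str]:
    """Path from a species-tree node up to the root, including the node itself."""
    path = [node]
    while path[-1] in SPECIES_PARENT:
        path.append(SPECIES_PARENT[path[-1]])
    return path


def lca(a: str, b: str) -> str:
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same species tree")


def reconcile(tree: GeneTree, events: List[Tuple[str, str]]) -> str:
    """Map a gene-tree node to the species tree; record one event per internal node (post-order)."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    m_left = reconcile(left, events)
    m_right = reconcile(right, events)
    mapped = lca(m_left, m_right)
    kind = "duplication" if mapped in (m_left, m_right) else "speciation"
    events.append((mapped, kind))
    return mapped


# Hypothetical gene tree in which two subtrees both contain a mouse gene.
gene_tree: GeneTree = ((("human", "mouse"), ("mouse", "rat")), "chicken")
events: List[Tuple[str, str]] = []
root_mapping = reconcile(gene_tree, events)
```

Gene pairs whose last common gene-tree node is labelled speciation are orthologs, and pairs that meet at a duplication node are paralogs.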
16. MSOAR2 [4]
Figure: Rearrangement scenario between human and mouse
1. Partition genes into families and assign a unique symbol
2. Reconstruct the most parsimonious rearrangement (inversion, translocation, fusion, fission, duplication)
3. Extract the corresponding orthologs
[4] Fu et al., JCB 2007
18. Human-mouse synteny blocks
Conserved synteny blocks between the human and mouse genomes, generated by the Cinteny web server [5]
[5] Sinha and Meller, BMC Bioinformatics 2007
19. Local synteny criteria [6]
Figure: Local synteny: more than one unique match within +/- 3 genes. Homology is defined as BLASTP E-value < 1e-5.
94% of sampled inter-species pairs are identified as orthologs by Inparanoid (based on BBH) and local synteny criteria.
[6] Jin Jun et al., BMC Genomics 2009
20. Local synteny score (LC)
Figure: Gene g in genome G and gene h in genome H, with matching edges between the two genomes.
The local synteny score of g and h is 4 since there are 4 edges in the maximum matching.
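A score of this kind can be sketched as the size of a maximum bipartite matching between the neighbourhoods of g and h. The sketch below uses the standard augmenting-path method. The window of 3 genes on each side, the inclusion of g and h themselves in the windows, and the toy gene orders and homology table are all assumptions for illustration, not details given on the slide.

```python
from typing import Dict, List, Set


def max_matching_size(edges: Dict[str, Set[str]]) -> int:
    """Size of a maximum bipartite matching, found with augmenting paths."""
    match_of_right: Dict[str, str] = {}

    def try_augment(left: str, visited: Set[str]) -> bool:
        for right in edges.get(left, set()):
            if right in visited:
                continue
            visited.add(right)
            # Take a free partner, or re-route the current owner of this partner.
            if right not in match_of_right or try_augment(match_of_right[right], visited):
                match_of_right[right] = left
                return True
        return False

    return sum(1 for left in edges if try_augment(left, set()))


def local_synteny_score(
    genome_g: List[str],
    genome_h: List[str],
    g: str,
    h: str,
    homologs: Dict[str, Set[str]],
    window: int = 3,  # assumed neighbourhood size
) -> int:
    """Maximum number of disjoint homologous pairs between the windows around g and h."""
    i, j = genome_g.index(g), genome_h.index(h)
    near_g = genome_g[max(0, i - window): i + window + 1]
    near_h = set(genome_h[max(0, j - window): j + window + 1])
    edges = {a: homologs.get(a, set()) & near_h for a in near_g}
    return max_matching_size(edges)


# Hypothetical gene orders; the region is inverted in genome H.
genome_g = ["a1", "a2", "g", "a3", "a4"]
genome_h = ["b4", "b3", "h", "b2", "b1"]
homologs = {"a1": {"b1"}, "a2": {"b2"}, "g": {"h"}, "a3": {"b2", "b3"}}
score = local_synteny_score(genome_g, genome_h, "g", "h", homologs)
```

A matching is used rather than a plain count of homologous pairs so that each neighbouring gene contributes at most once. Here a3 has two candidate partners but adds only one edge, giving a score of 4.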
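22. BBH-LS: bidirectional best hits based on a linear combination of SW and LC
Figure: The sequence alignment score SW(g, h) and the local synteny score LC(g, h) of genes g (genome G) and h (genome H) are added together.
sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)
Here f is the weight given to local synteny.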
23. Human-Mouse-Rat dataset
Input: Human, mouse, and rat genes downloaded from Ensembl.
Benchmark: There is no “golden” benchmark for true orthology, so we assume that orthologs are assigned the same gene symbol.
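Under this proxy, a predicted pair counts as a true positive when both genes carry the same symbol and as a false positive otherwise. A minimal sketch of that count follows. The gene identifiers, the case-insensitive comparison (human and mouse symbols often differ only in capitalisation), and the decision to skip genes without a symbol are assumptions, not details stated on the slides.

```python
from typing import Dict, Iterable, Tuple


def count_tp_fp(
    pairs: Iterable[Tuple[str, str]],
    symbols_g: Dict[str, str],
    symbols_h: Dict[str, str],
) -> Tuple[int, int]:
    """Count predicted pairs with matching (TP) and differing (FP) gene symbols."""
    tp = 0
    fp = 0
    for g, h in pairs:
        sym_g = symbols_g.get(g)
        sym_h = symbols_h.get(h)
        if sym_g is None or sym_h is None:
            continue  # assumption: pairs with an unnamed gene are not scored
        if sym_g.upper() == sym_h.upper():
            tp += 1
        else:
            fp += 1
    return tp, fp


# Hypothetical identifiers and predictions.
human_symbols = {"hs1": "RASGRF1", "hs2": "RASGRF2", "hs3": "CTSH"}
mouse_symbols = {"mm1": "Rasgrf1", "mm2": "Rasgrf2", "mm3": "Ctsh"}
predicted = [("hs1", "mm1"), ("hs2", "mm1"), ("hs3", "mm3")]
tp, fp = count_tp_fp(predicted, human_symbols, mouse_symbols)
```

Gene symbols are only a proxy for orthology, so counts like these are best read as a relative comparison between methods.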
24. Tuning the BBH-LS method
sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)
25. Results for various methods on Human-Mouse
Figure: TP: same gene symbols, FP: different gene symbols
BBH-LS finds more true positives and fewer false positives than MSOAR2.
26. Results for various methods on Human-Rat
Figure: TP: same gene symbols, FP: different gene symbols
27. Results for various methods on Mouse-Rat
Figure: TP: same gene symbols, FP: different gene symbols
28. How local synteny helps
Figure: Two syntenic regions, each compared between human and mouse.
Human chr 15 (CTSH, RASGRF1, ANKRD34C) and mouse chr 9 (ANKRD34C, RASGRF1, CTSH): sw = 2466, ls = 5.
Human chr 5 (MSH3, RASGRF2, CKMT2) and mouse chr 13 (CKMT2, RASGRF2, MSH3): sw = 2003, ls = 5.
Edge between the two regions: sw = 5265, ls = 1.
Bold edges are the pairings from BBH-LS; thin edges are the pairings from BBH.
BBH paired human RASGRF2 with mouse RASGRF1 because of their high SW score; BBH-LS corrects this by also taking LC into account.
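The numbers on this slide can be run through the combined similarity to see the correction happen. This is a sketch under stated assumptions. The assignment of each sw/ls pair to a gene pair is read off the figure layout, each score is divided by its maximum before combining (the slides do not say how SW and LC are put on a common scale), and f = 0.5 is an arbitrary choice. The human RASGRF1 / mouse RASGRF2 pair is not given on the slide and is left out.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# (human gene, mouse gene) -> score, as read from the slide (assignment assumed).
SW: Dict[Pair, float] = {
    ("human:RASGRF1", "mouse:RASGRF1"): 2466,
    ("human:RASGRF2", "mouse:RASGRF2"): 2003,
    ("human:RASGRF2", "mouse:RASGRF1"): 5265,
}
LC: Dict[Pair, float] = {
    ("human:RASGRF1", "mouse:RASGRF1"): 5,
    ("human:RASGRF2", "mouse:RASGRF2"): 5,
    ("human:RASGRF2", "mouse:RASGRF1"): 1,
}


def normalise(scores: Dict[Pair, float]) -> Dict[Pair, float]:
    """Scale scores to [0, 1] by dividing by the maximum (an assumed normalisation)."""
    top = max(scores.values())
    return {pair: value / top for pair, value in scores.items()}


def combined_similarity(sw: Dict[Pair, float], lc: Dict[Pair, float], f: float) -> Dict[Pair, float]:
    """sim = (1 - f) * SW + f * LC, computed on normalised scores."""
    sw_n, lc_n = normalise(sw), normalise(lc)
    return {pair: (1 - f) * sw_n[pair] + f * lc_n[pair] for pair in sw}


def bidirectional_best_hits(sim: Dict[Pair, float]) -> List[Pair]:
    """Pairs in which each gene is the other's highest-scoring partner."""
    best_a: Dict[str, str] = {}
    best_b: Dict[str, str] = {}
    for (a, b), score in sim.items():
        if a not in best_a or score > sim[(a, best_a[a])]:
            best_a[a] = b
        if b not in best_b or score > sim[(best_b[b], b)]:
            best_b[b] = a
    return sorted((a, b) for a, b in best_a.items() if best_b[b] == a)


bbh_pairs = bidirectional_best_hits(combined_similarity(SW, LC, f=0.0))     # sequence only
bbh_ls_pairs = bidirectional_best_hits(combined_similarity(SW, LC, f=0.5))  # with synteny
```

With f = 0 the only reciprocal pair is human RASGRF2 with mouse RASGRF1. With f = 0.5 the synteny term outweighs the higher alignment score, and the pairs become RASGRF1 with RASGRF1 and RASGRF2 with RASGRF2.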
29. Summary: Identifying main orthologs
Figure: Gene tree below the MRCA of G and H (speciation, duplication; g in G, two copies of h in H) marking main orthologs, orthologs, and paralogs.
For each gene in their common ancestor, find its direct
descendant in G and H