T coffee algorithm dissection

T-coffee:
A Method for Fast and Accurate Multiple Sequence Alignment
Chen, Gui
03/18/2015

Background & Motivation
Algorithm Illustration
Validation & Result

Why Do We Need Multiple Sequence Alignment?
Homology modeling
Phylogenetic reconstruction
Illustrating conserved and variable sites within a family
Constructing a profile or HMM to search databases for
distantly related members of the family

When constructing an MSA, we should in theory consider the evolutionary
and structural relationships within the family. However…
1. Specific expert knowledge, even when it is available, is hard to integrate into an algorithm
2. General empirical models of protein evolution do not work well when sequences are less than 30% identical
3. Mathematically sound methods are prohibitively demanding in computer resources
That is why heuristic methods are introduced.

A Brief Review of Previous Methods to Construct MSA
ClustalW: heuristic method
Approach: dynamic programming to build the guide tree, then progressive alignment
Comment: fast, but suffers from its greediness (once a gap, always a gap); uses no local alignment information; takes little reference from the other sequences in the set while constructing the MSA
MSA & DCA: mathematically sound
Approach: simultaneous alignment of all the sequences; Carrillo and Lipman algorithm, multidimensional dynamic programming
Comment: extremely CPU- and memory-intensive, and no better than other methods in terms of alignment performance; uses no local alignment information
Prrp & Muscle: iterative method
Approach: heuristic method combined with an iterative method
Comment: an interesting method, but neither significantly faster nor more accurate than ClustalW; uses no local alignment information; takes some reference from the other sequences in the set while constructing the MSA
Question: in theory, why is the iterative method not better than ClustalW? And why should local alignment information be used?

A Brief Review of Previous Methods to Construct MSA
Dialign2: heuristic method
Approach: simultaneous alignment of all sequences, but with a crude heuristic local alignment method
Comment: fast, but considers only local alignment information and only to some extent takes reference from all sequences in the set; aligns poorly in practice
…so the motivation is to build a method, T-Coffee, with all the merits listed below:
Heuristic method with practical computational time
Combine information from global alignment
Combine information from local alignment (local alignment is more sensitive to domains and motifs)
Take reference from the other sequences in the sequence set during alignment (especially when aligning conserved parts)
Note: the optimal alignment is not necessarily the biologically meaningful alignment.

*A Case Where the ClustalW Method Fails

An Overview of T-coffee Method
…so how does T-Coffee satisfy all the merits mentioned earlier?
Requirement: heuristic method with practical computational time
Solution: progressive alignment (code from ClustalW)
Requirement: combine information from global alignment
Solution: primary library from global alignment (code from ClustalW)
Requirement: combine information from local alignment
Solution: primary library from local alignment (code from Lalign in the FASTA package)
Requirement: take reference from the other sequences in the sequence set during progressive alignment
Solution: extension library derived from the primary library (refers to the intermediate-sequence method)
Three major steps in FASTA:
1. Build a hash table of k-tuples
2. Concatenate matched k-tuples
3. Extend them to obtain high-scoring segments
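The first two FASTA steps can be illustrated with a minimal Python sketch. It is not the FASTA or Lalign code: the function names, the default k = 2 and the toy sequences are assumptions made for illustration. It indexes the k-tuples of one sequence in a hash table and counts, for each diagonal, how many k-tuples the other sequence shares with it; a diagonal with many hits is a candidate segment to extend.

```python
from collections import defaultdict


def build_ktuple_index(seq, k):
    """Map every k-tuple (substring of length k) to its start positions in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k=2):
    """Count shared k-tuples per diagonal (target position - query position)."""
    index = build_ktuple_index(target, k)
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits[tpos - qpos] += 1
    return dict(hits)


hits = diagonal_hits("THEFATCAT", "GARFIELDTHEFATCAT")
best_diagonal = max(hits, key=hits.get)  # diagonal carrying the most k-tuple matches
```

Here the best diagonal is the offset at which the shorter toy sequence sits inside the longer one; step 3 would extend along that diagonal.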


Primary Library of Alignments (Global and Local)
Library: a set of pairwise alignments between all of the sequences to be aligned, stored as lists of weights that are specific to each sequence pair and each residue-position pair. A library can be stored as an N*N
lower (or upper) triangular matrix in which the main diagonal is ignored and each entry is a
weight list; in other words, a list of weighted pairwise constraints. The primary library of global
alignments for a sequence set is denoted AG and the primary library of local alignments is
denoted AL. A* refers to either AG or AL.
Suppose the sequence set A has size N (the number of sequences in the set);
the total number of sequence pairs is then N*(N-1)/2.
Ai denotes the ith sequence in A, so entry A*ij of matrix A* holds the information from
the pairwise alignment of Ai and Aj. To generate AG or AL, we first compute all possible pairwise
alignments using the global alignment method or the FASTA local segment matching method (Lalign).
*For local pairwise alignment, the ten top-scoring non-intersecting local alignments are taken
from each sequence pair by default. The number of segments actually obtained from a pair may well be
fewer than 10 (simply because there are not that many qualifying matched segments) and can be 0.
After the pairwise alignments, we derive a list of pairwise residue matches for each entry of A*. Xm
denotes the mth position of sequence Ai and Xn the nth position of sequence Aj, so a match in an entry is written (Xm, Xn)|A*ij.
Finally, each pairwise residue match in every entry of A* is assigned a weight
equal to the percent identity (P.I.) of the alignment of Ai and Aj from which the match
was derived: W[(Xm, Xn)|A*ij] = P.I.(A*ij). The weight is also referred to as a constraint.
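The weighting rule above can be sketched in Python as follows. This is an illustrative sketch, not the T-Coffee code: the function names are hypothetical, the input is assumed to be two gapped rows of one pairwise alignment, and percent identity is assumed to be computed over the columns in which both rows carry a residue.

```python
def percent_identity(row_i, row_j):
    """Percent identity over columns where both rows carry a residue (assumed definition)."""
    paired = [(a, b) for a, b in zip(row_i, row_j) if a != "-" and b != "-"]
    if not paired:
        return 0.0
    same = sum(1 for a, b in paired if a == b)
    return 100.0 * same / len(paired)


def library_entry(row_i, row_j):
    """Turn one gapped pairwise alignment into the entry {(m, n): weight}.

    m and n are 1-based residue positions in the ungapped sequences Ai and Aj.
    Every matched pair receives the same weight: the identity of the alignment.
    """
    weight = percent_identity(row_i, row_j)
    entry = {}
    m = n = 0
    for a, b in zip(row_i, row_j):
        if a != "-":
            m += 1
        if b != "-":
            n += 1
        if a != "-" and b != "-":
            entry[(m, n)] = weight
    return entry


entry = library_entry("AC-GT", "ACCGA")  # toy alignment, 3 of 4 paired columns identical
```

A full primary library would hold one such entry for each of the N*(N-1)/2 sequence pairs.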

Primary Library of Alignments (Global and Local)
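[Figure: the library drawn as an N*N triangular matrix with rows and columns A1 … AN; the entry for the pair (Ai, Aj) holds a list of W(Xm,Xn|A*ij).]
The library is a generalized list that contains key-list pairs (sequence pair to weight list) and key-value pairs (residue pair to weight).
[Figure: the global alignments and the local alignments each produce one such library.]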

Combination of the Libraries: Addition
The ClustalW and Lalign primary libraries are pooled by simple addition:
AGL = AG + AL
W[(Xm, Xn)|AGLij] = W[(Xm, Xn)|AGij] + W[(Xm, Xn)|ALij]
If W[(Xm, Xn)|A*ij] is not recorded in A*, it is taken to be 0.
Entry AGLij can then be regarded as a ‘sparse’ list of L(i)*L(j)
key-value pairs (most of the values are 0), where L(i) denotes the length (number of
residues) of sequence Ai.
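A minimal sketch of this addition in Python, assuming (for illustration only) that a library is a dictionary mapping a sequence-index pair (i, j) to a dictionary {(m, n): weight}; storing only the non-zero weights gives the sparse representation for free.

```python
def add_libraries(lib_g, lib_l):
    """Sum two libraries constraint by constraint; unrecorded constraints count as 0."""
    combined = {}
    for lib in (lib_g, lib_l):
        for pair, entry in lib.items():
            target = combined.setdefault(pair, {})
            for match, weight in entry.items():
                target[match] = target.get(match, 0) + weight
    return combined


# Toy libraries with made-up weights.
lib_g = {(0, 1): {(1, 1): 88, (2, 2): 88}}
lib_l = {(0, 1): {(2, 2): 100}, (0, 2): {(1, 3): 60}}
lib_gl = add_libraries(lib_g, lib_l)
```

The constraint (2, 2) of pair (0, 1) is present in both inputs, so its pooled weight is the sum of the two; the other constraints are copied unchanged.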

Library can be used as scoring scheme
Library A* can be regarded as a scoring scheme that is specific to each
sequence pair and each residue-position pair.
It is a secondary scoring scheme, derived from pairwise dynamic
programming alignments that use a substitution matrix as the
primary scoring scheme.

Extending the library: Background
Purpose: to take reference from the other sequences at each step of the progressive
alignment.
Previous solutions for this purpose:
Fitting a set of weighted constraints into a multiple alignment is a well-known
problem, formulated by Kececioglu as an instance of the “maximum weight
trace” problem, which is NP-complete. Two optimization strategies have been proposed:
1. Genetic algorithm: prohibitive computational time
2. Graph-theoretical method: not robust enough for all cases
In short, this problem is hard to solve well from a graph-theoretical point of view.
Solution proposed by this paper: a heuristic algorithm inspired by the
intermediate-sequence method, the triplet approach.

Extending the library: Triplet approach
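Example: extend the weight of the match between residue G of seqA and residue G of seqB, W(A(G), B(G)) = 88.
Consider seqC: look up W(A(G), C(?)) and W(B(G), C(?)).
Consider seqD: look up W(A(G), D(?)) and W(B(G), D(?)).
If both residues match the same residue of the intermediate sequence, that sequence contributes the minimum of the two weights (here min(77, 100) = 77); otherwise it contributes 0.
E[W(A(G), B(G))] = W(A(G), B(G)) + contributions = 88 + 77

Sometimes a better alignment is obtained
if the guide tree is not followed strictly.
That is why information from the other
sequences is used when two sequences are
aligned along the guide tree.

Iterative methods, e.g. MUSCLE, achieve this goal by
modifying the guide tree in a heuristic manner.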

Extending the library: Let’s code this process
Denote the library extension operator by AE. It is not a library that can
be added to A*; it is a function of A*: AE(A*) = A*E.
def AE(A*):
    for i = 1 .. N:
        for j = i+1 .. N:                        // every entry A*ij: C(2,N) pairs
            for m = 1 .. L(Ai):
                for n = 1 .. L(Aj):              // every candidate constraint in the entry: L^2
                    E = 0
                    for each Ak in A - {Ai, Aj}:
                        a = get_positions(m, i, k)
                        b = get_positions(n, j, k)
                        for each p in both a and b:   // residue Xp|Ak supports the match of Xm|Ai and Xn|Aj: 2L
                            e1 = W[(Xm, Xp)|A*ik]
                            e2 = W[(Xn, Xp)|A*jk]
                            E += min(e1, e2)          // extension weight
                    W[(Xm, Xn)|A*Eij] = W[(Xm, Xn)|A*ij] + E   // write to A*E, leave A* unchanged
def get_positions(m, i, k):
    a = empty set
    for p = 1 .. L(Ak):
        if W[(Xm, Xp)|A*ik] != 0:
            add p to a                           // positions in Ak matched to Xm|Ai: L(Ak)
    return a
Cost of this naive version: C(2,N) * L^2 * (N-2) * 2L = O(N^3 * L^3)
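The triplet extension can also be written as a short runnable Python sketch. It is an illustration under stated assumptions, not the T-Coffee implementation: a library is assumed to be a dictionary mapping a sequence-index pair (i, j), i < j, to {(m, n): weight}, and the function name is hypothetical. Instead of scanning every residue pair, it visits each intermediate residue once and lets it support every pair of residues matched to it, which avoids the loops over empty constraints.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(lib):
    """Triplet extension of a library {(i, j): {(m, n): weight}} with i < j."""
    # neighbours[(k, p)] holds every residue directly matched to residue p of sequence k.
    neighbours = defaultdict(dict)
    for (i, j), entry in lib.items():
        for (m, n), w in entry.items():
            neighbours[(i, m)][(j, n)] = w
            neighbours[(j, n)][(i, m)] = w

    # Start from the primary weights, copied so the input library stays unchanged.
    extended = {pair: dict(entry) for pair, entry in lib.items()}
    # Each intermediate residue supports every pair of residues matched to it.
    for matched in neighbours.values():
        for (res_a, w_a), (res_b, w_b) in combinations(sorted(matched.items()), 2):
            (i, m), (j, n) = res_a, res_b
            if i == j:
                continue  # two residues of the same sequence are never aligned together
            entry = extended.setdefault((i, j), {})
            entry[(m, n)] = entry.get((m, n), 0) + min(w_a, w_b)
    return extended


# Sequences A, B, C are indexed 0, 1, 2; weights are illustrative.
primary = {(0, 1): {(1, 1): 88}, (0, 2): {(1, 1): 77}, (1, 2): {(1, 1): 100}}
extended = extend_library(primary)
```

With these weights the A-B match is raised from 88 to 88 + min(77, 100) = 165, because sequence C supports it.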

Extending the library: Let’s formulate this process
AGLE = AE(AGL) = AE(AG + AL)
Notice that the distributive law does not hold for the operator AE.
That is to say, in general AGLE ≠ AE(AG) + AE(AL): the libraries must be pooled first and extended afterwards.
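A small worked case shows why the order matters. Consider a single intermediate residue, and let g1, g2 be the global weights and l1, l2 the local weights on the two legs of the triplet (these symbols are introduced here for illustration only). Extending the pooled library contributes the left-hand side below, while extending the two libraries separately and adding the results contributes the right-hand side:

```latex
\[
\min(g_1 + l_1,\; g_2 + l_2) \;\ge\; \min(g_1, g_2) + \min(l_1, l_2)
\]
% Example with (g_1, g_2) = (88, 0) and (l_1, l_2) = (0, 100):
\[
\min(88 + 0,\; 0 + 100) = 88
\qquad\text{but}\qquad
\min(88, 0) + \min(0, 100) = 0
\]
```

In the example, one leg is supported only by the global library and the other only by the local library, so the triplet counts only after pooling.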

Conclusion: Coffee Score Scheme
Given any pair of residues from any two sequences in the sequence set:
If weight = 0, the residue pair is not supported by the global alignments, the local alignments or the triplet extension
(in other words, the two residues will probably end up aligned against gaps).
If weight > 0, the weight reflects a combination of the similarity of the pair of
sequences (global) or sequence segments (local) from which the residue pair comes and the
consistency of the match with residues from the other sequences in the sequence set.
In other words, the weight reflects how strongly the residue pair is supported
by the direct local or global alignment it comes from and by the
indirect alignments that use all the other sequences as intermediate sequences.
The weight library can then be used as the Coffee scoring scheme for progressive alignment.
*When the Coffee scoring scheme is used for dynamic programming or progressive alignment, there is no need to set
additional gap-opening or gap-extension penalties, for two reasons:
1. The Coffee scoring scheme is a secondary scoring scheme generated by dynamic programming with a primary
scoring scheme, in which gap penalties have already been taken into account.
2. Although the local primary library does not reflect how matching a pair of residues introduces gaps
globally, if the match is also supported by a global alignment, the gap information is
reflected through that global alignment. Otherwise the match will not receive a high weight
unless it is supported by consistency with the other sequences. In this case too, a gap penalty is
not necessary.
In practice, gap penalty = 0.
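How a library entry can serve as the scoring scheme with a gap penalty of 0 is sketched below for the simplest case, two single sequences. This is an illustrative Python sketch, not the T-Coffee code: the function name and the toy weights are assumptions, and the weights are assumed to be non-negative.

```python
def align_with_library(len_i, len_j, entry):
    """Global dynamic programming that maximises the summed library weight.

    entry maps 1-based residue pairs (m, n) to a non-negative weight; gaps cost nothing.
    Returns the best total weight and the list of matched residue pairs.
    """
    # score[m][n] = best total weight for the first m residues of Ai and first n of Aj.
    score = [[0] * (len_j + 1) for _ in range(len_i + 1)]
    for m in range(1, len_i + 1):
        for n in range(1, len_j + 1):
            score[m][n] = max(
                score[m - 1][n - 1] + entry.get((m, n), 0),  # match Xm with Xn
                score[m - 1][n],                             # Xm aligned to a gap
                score[m][n - 1],                             # Xn aligned to a gap
            )
    # Trace back to recover the matched residue pairs.
    matches, m, n = [], len_i, len_j
    while m > 0 and n > 0:
        w = entry.get((m, n), 0)
        if w > 0 and score[m][n] == score[m - 1][n - 1] + w:
            matches.append((m, n))
            m, n = m - 1, n - 1
        elif score[m][n] == score[m - 1][n]:
            m -= 1
        else:
            n -= 1
    return score[len_i][len_j], matches[::-1]


# Toy entry: constraints (2, 3) and (3, 2) cross each other, so only one can be used.
total, matches = align_with_library(3, 3, {(1, 1): 165, (2, 3): 100, (3, 2): 90})
```

The two crossing constraints cannot both be satisfied, and the alignment keeps the heavier one without any gap parameter being involved.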

Progressive Alignment Strategy
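[Figure: aligning column n of one group (residues C, C, C, T) with column n of another group (residues T, T, T). Three averages of library weights are shown: average_1 (a1) over the residue pairs within the first column, average_2 (a2) over the residue pairs within the second column, and average_3 (a3) over the residue pairs between the two columns, with the label average_1’ = a1 + a2 + a3 (within, within, between). The formulas are not recoverable from the source.]
There is no need to align pairs of residues within an existing
alignment column; only the weights of matched pairs of residues
between the existing columns are considered (average_3).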
Test Cases Are from BaliBase: Why BaliBase?
Reliability: the MSAs in BaliBase result from manual structure comparison and are validated
using the structure-superposition algorithms SSAP and DALI
Comprehensiveness: the 141 MSA cases in BaliBase fall into 5 categories:
1. Groups with phylogenetically equidistant members
2. Groups with one orphan sequence and a group of close relatives
3. Groups with two distant subgroups
4. Groups in which some members have long terminal insertions
5. Groups in which some members have long internal insertions
Thus the cases are unlikely to be biased toward any specific
multiple-alignment method.

Validation method: Scoring Scheme and Multimethod Comparison
Scoring Scheme:
1. Column-wise comparison: a point is scored only when a whole column is aligned correctly
2. Sum-of-pairs (SP): weighted points are scored even when a column is only partially correct
Validation is carried out by comparing each calculated multiple alignment
with its counterpart in BaliBase.
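The two scoring schemes can be sketched in Python as follows. This is a simplified, unweighted illustration and not the BaliBase scoring program: function names are hypothetical, both alignments are assumed to be lists of gapped rows over the same sequences in the same order, and each score is expressed as a fraction of the reference.

```python
from itertools import combinations


def columns(alignment):
    """Represent each column as a tuple of 1-based residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for s, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                counters[s] += 1
                entry.append(counters[s])
        cols.append(tuple(entry))
    return cols


def aligned_pairs(cols):
    """Set of (seq_a, pos_a, seq_b, pos_b) residue pairs that share a column."""
    pairs = set()
    for col in cols:
        present = [(s, p) for s, p in enumerate(col) if p is not None]
        for (sa, pa), (sb, pb) in combinations(present, 2):
            pairs.add((sa, pa, sb, pb))
    return pairs


def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    ref_cols, test_cols = columns(reference), set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test alignment."""
    ref_pairs = aligned_pairs(columns(reference))
    test_pairs = aligned_pairs(columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


reference = ["AC-GT", "ACCGT", "A-CGT"]  # toy reference alignment
test = ["A-CGT", "ACCGT", "A-CGT"]       # toy calculated alignment
cs = column_score(test, reference)
sp = sum_of_pairs_score(test, reference)
```

The toy test alignment misplaces one residue of the first sequence: two reference columns are lost under the column-wise scheme, while the sum-of-pairs scheme still credits the pairs in those columns that remain correctly aligned.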
Multimethod Comparison:
Candidate Methods
1. Prrp
2. ClustalW
3. MSA & DCA (eliminated at the very beginning)
4. Dialign2
Statistical method: Wilcoxon matched-pairs signed-rank test, a non-parametric test
whose statistic is based on the sums of the ranks of the paired differences between two series of data
H0: no difference H1: a difference
If the P-value is large, H0 cannot be rejected.
Otherwise reject H0.
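The rank sums behind the test can be computed with a short Python sketch. The function name and the scores are made up for illustration and are not results from the paper; zero differences are dropped and tied absolute differences share the mean of their ranks, which is the usual convention.

```python
def signed_rank_statistic(scores_a, scores_b):
    """Return (W_plus, W_minus): rank sums of the positive and negative paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]  # drop zero differences
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    start = 0
    while start < len(ordered):
        end = start
        while end + 1 < len(ordered) and abs(ordered[end + 1]) == abs(ordered[start]):
            end += 1
        average_rank = (start + end) / 2 + 1  # tied values share the mean of their ranks
        for idx in range(start, end + 1):
            ranks[idx] = average_rank
        start = end + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical per-test-case scores of two methods (not data from the paper).
method_a = [90, 80, 70, 60, 85]
method_b = [85, 82, 70, 50, 80]
w_plus, w_minus = signed_rank_statistic(method_a, method_b)
```

A large gap between the two rank sums speaks against H0. Turning the statistic into a P-value requires a reference distribution, for example from published tables or from a statistics library such as SciPy's `scipy.stats.wilcoxon`.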

Result: Extension Library is Superior to Primary Library
Comparison of the three types of primary library:
1. ClustalW pairwise library (C) (extended to CE)
2. Lalign pairwise library (L) (extended to LE)
3. Pooled ClustalW and Lalign pairwise libraries (CL) (extended to CLE)
Result: CL > C , CL > L
Comparison of each extension library with its primary library
Result: CE > C , LE > L , CLE > CL
Comparison between the three types of extension library
Result: CLE > CE , CLE > LE
We can therefore conclude that CLE is the best library to use as a scoring scheme.

Result: T-Coffee Method is Superior to Other Methods
In the comparison with other methods, the two scoring schemes were applied separately, and for each
scoring scheme two kinds of test were run.
Column-wise
core region test: T-Coffee > Prrp > ClustalW
complete alignment test: T-Coffee > ClustalW > Prrp
Sum of pairs
core region test: T-Coffee > ClustalW > Prrp
complete alignment test: ?

Result: T-Coffee Does Not Outperform Other Methods in Every Specific Case
