The derivation of ungapped global protein alignment score distributions
Abstract 
For a gapless global protein sequence alignment of N residues, the score distribution
can be expressed as the sum of the probabilities of all multinomial outcomes that share the
same score. The score distribution depends on the sequence length, the substitution matrix
and the amino acid frequencies.
Introduction 
The alignment score distribution is the determining factor of the e-values of sequence
alignments. Given the importance of e-value calculation for protein alignments in biomedical
research, understanding the exact nature of score distributions is of great importance. Many
studies have sought the function that best fits the observed score distributions (refs).
Although there is a weak consensus that the best-performing function is an extreme value
distribution, no study has derived it from a theoretical examination of how alignment scores
arise. In this study, the formula for the probability distribution of alignment scores is
derived from a theoretical viewpoint, and the predicted score distributions are then tested
against the observed score distributions of randomly generated sequences. The predictions are
also compared with the score distributions of naturally occurring sequences, the best-fit
function is presented, and finally an approach to generalization is discussed.
The methods 
An examination of alignment score as multinomial event 
Aligned protein sequences are paired sequences with matches or mismatches assigned
between the two sequences. In terms of score calculation (not in a biological context), each
aligned position has no correlation with or dependency on its neighboring residues, so the
positions are completely independent in an ungapped alignment. Based on this, the pair-wise
alignment score of sequences with k matches in N residues was modeled as a multinomial event:
N independent trials, each with multiple possible outcomes, of which k are successes. A match
with a particular score (the matching score of each residue type) was treated as a successful
outcome and a mismatch as the failure outcome. Assume there are m residue types in a pair of
sequences of N residues. The sequences are aligned with x1 matches of residue type r1, which
occur with probability p1 and score s1 per match, through xm matches of residue type rm,
which occur with probability pm and score sm per match; mismatches occur with the residual
probability pf = 1 - sum{pi: i=1~m}. The probability P of this multinomial event (this
combination of residue types) and its score Sc are expressed as
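$$P = \frac{N!}{x_1!\,x_2!\cdots x_m!\,(N-k)!}\;p_1^{x_1}\,p_2^{x_2}\cdots p_m^{x_m}\,p_f^{\,N-k} = \frac{N!}{(N-k)!\,\prod_{i=1}^{m} x_i!}\;p_f^{\,N-k}\prod_{i=1}^{m} p_i^{x_i},\qquad k = \sum_{i=1}^{m} x_i,\quad p_f = 1-\sum_{i=1}^{m} p_i \tag{1}$$
$$S_c = x_1 s_1 + x_2 s_2 + \cdots + x_m s_m = \sum_{i=1}^{m} x_i s_i \tag{2}$$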
For a particular score Si, the total probability is the sum of the probabilities of all
combinations (x1, ..., xm) that are non-negative integer lattice points on the hyperplane
Si = x1s1 + x2s2 + … + xmsm.
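Equations 1 and 2 can be evaluated directly for a single combination of match counts. The following is a minimal Python sketch, not the in-house code used in this study; the function names and the numbers in the example (two match types with probabilities 0.1 and 0.2 and scores 4 and 5) are hypothetical and chosen only for illustration.

```python
from math import factorial


def combination_probability(n, counts, probs):
    """Equation 1: probability of one combination of matches in n residues.

    counts[i] is the number of matches of residue type i, probs[i] its match
    probability; the remaining n - k positions are mismatches with
    probability p_f = 1 - sum(probs).
    """
    k = sum(counts)
    if k > n:
        raise ValueError("more matches than residues")
    p_fail = 1.0 - sum(probs)
    # Multinomial coefficient N! / (x1! ... xm! (N-k)!), kept exact with integers.
    coeff = factorial(n) // factorial(n - k)
    for x in counts:
        coeff //= factorial(x)
    prob = float(coeff) * p_fail ** (n - k)
    for p, x in zip(probs, counts):
        prob *= p ** x
    return prob


def combination_score(counts, scores):
    """Equation 2: score of the combination (mismatches score 0 here)."""
    return sum(x * s for x, s in zip(counts, scores))


# Hypothetical example: 3 residues, one match of each of two types.
example_p = combination_probability(3, [1, 1], [0.1, 0.2])
example_s = combination_score([1, 1], [4, 5])
```

In this example the coefficient is 3!/(1!·1!·1!) = 6, so the probability is 6 × 0.1 × 0.2 × 0.7 and the score is 4 + 5.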
The prediction of alignment score distributions 
The matching probability was defined as the probability that a given amino acid is
observed as a “match” in the alignment result. Each amino acid has its own scores for match
and mismatch events in the substitution matrix (such as BLOSUM62; refs). In the
self-match-only case (success/failure model), the matching probability was therefore (1/20)^2
for synthetic sequences with even frequencies, and Fobs^2 for an amino acid observed with
frequency Fobs (amino acid frequency) in naturally occurring sequences. Because a
substitution matrix defines a matching/similarity score or a mismatch penalty for every pair
of residues, it was necessary to calculate the probabilities of all score types (score or
penalty; refs). The probability of occurrence F(i,j) of the amino acid pair i, j was defined
as F(i,j) = Fobs(i)·Fobs(j). The score types St were defined as the distinct score values
found in the substitution matrix (in this study, -4~9 and 11 from BLOSUM62), and Pt(k) for
score type k was the sum of the probabilities of all matrix elements with score St(k). In
this case, there is no success or failure. The calculation on the BLOSUM62 matrix gave
St={-4,-3,-2,-1,0,1,2,3,4,5,6,7,8,9,11} and Pt(k)={0.0415, 0.1806, 0.2335, 0.2275, 0.1429,
0.0666, 0.0373, 0.0104, 0.0287, 0.0160, 0.0111, 0.0031, 0.0005, 0.0002, 0.0001}, based on
amino acid frequencies from UniProt (2013). Note that this distribution of score-type
probabilities depends on the amino acid frequencies and affects the outcome. The combinations
of residue-type pairs and their occurrence probabilities can be replaced by these score types
and their occurrence probabilities. Thus, rather than 210 combinations of residue types, only
15 (BLOSUM62) to 20 (PAM250) score types need to be considered.
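The reduction from residue pairs to score types can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in this study: the three-letter alphabet, its frequencies and its matrix are hypothetical toy values (not BLOSUM62), and the function name is invented for the example.

```python
from collections import defaultdict


def score_type_probabilities(freqs, matrix):
    """Sum F(i,j) = Fobs(i) * Fobs(j) over all ordered residue pairs (i, j)
    that share the same substitution score, giving Pt for each score type."""
    pt = defaultdict(float)
    for a, fa in freqs.items():
        for b, fb in freqs.items():
            pt[matrix[a][b]] += fa * fb
    return dict(sorted(pt.items()))


# Hypothetical toy alphabet and symmetric matrix, for illustration only.
toy_freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
toy_matrix = {
    "A": {"A": 4, "B": -1, "C": 0},
    "B": {"A": -1, "B": 5, "C": -2},
    "C": {"A": 0, "B": -2, "C": 9},
}

pt = score_type_probabilities(toy_freqs, toy_matrix)
```

Because every ordered pair is counted once, the score-type probabilities sum to 1, as the BLOSUM62 values listed above also do.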
For pairs of aligned sequences, the score distribution is given by the following
equations with t score types. The occurrence probability p(N, xti, Pti: i=1~t) of an
alignment of N-residue sequences that contains xti positions of score type Sti, each
occurring with probability Pt(i), can be expressed as
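$$p(N, x_{ti}, P_{ti} : i = 1 \sim t) = \frac{N!}{x_{t1}!\,x_{t2}!\cdots x_{tt}!}\;p_{t1}^{x_{t1}}\,p_{t2}^{x_{t2}}\cdots p_{tt}^{x_{tt}},\qquad \sum_{i=1}^{t} x_{ti} = N \tag{3}$$
$$S_c = x_{t1} S_{t1} + x_{t2} S_{t2} + \cdots + x_{tt} S_{tt} \tag{4}$$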
By enumerating all combinations with score Sc and summing their probabilities, the alignment
score distribution for a particular parameter set (amino acid frequencies and substitution
matrix) can be calculated as Pc = sum{p(score is Sc)}.
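The enumeration described by equations 3 and 4 can be sketched as follows. This is a minimal Python illustration, not the in-house implementation; the function names are invented here, the three score types and their probabilities in the example are hypothetical, and all probabilities are assumed to be strictly positive.

```python
from collections import defaultdict
from math import exp, lgamma, log


def compositions(n, parts):
    """Yield every tuple of `parts` non-negative integers that sums to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest


def multinomial_probability(counts, probs):
    """Equation 3, evaluated in log space to avoid large factorials."""
    log_p = lgamma(sum(counts) + 1)
    for x, p in zip(counts, probs):
        log_p += x * log(p) - lgamma(x + 1)
    return exp(log_p)


def score_distribution(n, scores, probs):
    """Sum the probabilities of all combinations that share a score (equation 4)."""
    dist = defaultdict(float)
    for counts in compositions(n, len(scores)):
        total = sum(x * s for x, s in zip(counts, scores))
        dist[total] += multinomial_probability(counts, probs)
    return dict(sorted(dist.items()))


# Hypothetical example: 2 residues, three score types.
toy_dist = score_distribution(2, [-1, 0, 4], [0.5, 0.3, 0.2])
```

For the BLOSUM62 case, `scores` and `probs` would be the 15 values of St and Pt(k) listed in the methods; the number of tuples produced by `compositions` is what makes long sequences expensive.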
The examination of predictions from the derived probability formula against alignment
scores of randomly generated synthetic sequences
Protein sequences were generated randomly with (1) even frequencies for all 20 amino
acids or (2) the frequencies observed in real sequences (ref: UniProt 2013), in order to
address the obvious effect of sequence length and the hypothesis that amino acid frequency is
a potential factor of the distribution shape. Synthetic sequences were generated with varied
lengths (9/18/36/54/72/90/120/150/180/210/240/270) in order to obtain amino acid frequency
statistics that comply better with the UniProt statistics. Natural sequence fragments were
generated from the CATH 3.4 domain database (ref: CATH) by cutting domains into fragments of
the given lengths without overlap, in order to avoid the effect of overlapping sequences. A
total of 5000 sequences were randomly selected from the fragments and aligned pair-wise in
all combinations. The sequence alignments were performed with the Needleman-Wunsch algorithm,
using modified/unmodified BLOSUM62 or synthetic substitution matrices, on both natural and
synthetic random sequences. A second obvious hypothesis was that the substitution matrix is a
determining factor of the score distributions. The score distributions for each substitution
matrix were calculated by binning the alignment scores. The following section describes the
details of the calculations.
The scoring schemes were (1) a match score of 1; (2) evenly distributed match scores of
4~8 (4 residue types each) for the 20 commonly observed amino acids, without mismatch
penalties, to mimic the BLOSUM62 substitution matrix in a simplified manner; (3) a matrix
derived from the diagonal components of the BLOSUM62 substitution matrix, with neither
mismatch penalties nor scores in the rest of the matrix (match scores of 4/5/6/7/8/9/11 with
5/6/4/2/1/1/1 occurrences, respectively); and (4) BLOSUM62. Sequence alignments were
performed by in-house code with all matrices, using default (-10, -2) or very high (9999 for
both, to make the alignment gapless) gap opening and extension penalties. The scores were
binned with an interval width of 1 and assembled. The predictions of the score distributions
were performed by in-house code that enumerated all combinations of occurrences of the
multinomial events (all score types) under the given scoring scheme and summed the
probabilities of the combinations that shared the same score. The predicted probability
distributions were then compared with the distributions calculated from the pair-wise
alignment scores of synthetic/natural sequences. The occurrence probabilities for the
prediction were taken from the statistics of the UniProt 2013 dataset or, for naturally
occurring sequences, adjusted to the observed amino acid frequencies, because the observed
frequencies varied from set to set.
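The observed side of the comparison can be sketched as below: random sequences are drawn from given residue frequencies, scored position by position without gaps, and the scores are binned with a width of 1. This is a simplified Python illustration and differs from the procedure above in one labeled respect: it scores independently drawn pairs instead of aligning 5000 sequences in all combinations with Needleman-Wunsch. The toy alphabet, matrix and function names are hypothetical.

```python
import random
from collections import Counter


def random_sequence(length, freqs, rng):
    """Draw a sequence of independent residues with the given frequencies."""
    residues = list(freqs)
    weights = [freqs[r] for r in residues]
    return rng.choices(residues, weights=weights, k=length)


def ungapped_score(seq_a, seq_b, matrix):
    """Score of the gapless global alignment of two equal-length sequences."""
    return sum(matrix[a][b] for a, b in zip(seq_a, seq_b))


def observed_distribution(n_pairs, length, freqs, matrix, seed=0):
    """Relative frequency of each integer score over n_pairs random pairs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_pairs):
        a = random_sequence(length, freqs, rng)
        b = random_sequence(length, freqs, rng)
        counts[ungapped_score(a, b, matrix)] += 1
    return {s: c / n_pairs for s, c in sorted(counts.items())}


# Hypothetical toy alphabet and matrix, for illustration only.
toy_freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
toy_matrix = {
    "A": {"A": 4, "B": -1, "C": 0},
    "B": {"A": -1, "B": 5, "C": -2},
    "C": {"A": 0, "B": -2, "C": 9},
}
observed = observed_distribution(200, 5, toy_freqs, toy_matrix, seed=1)
```

With enough pairs, the resulting histogram can be compared score by score with the distribution predicted from the multinomial model.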
The fitting of alignment score distributions 
MATLAB code using “lsqnonlin” was used to fit the score distributions with
various statistical distribution functions (binomial/Poisson/gamma/extreme value distributions).
These functions were modified to allow an X-axis (score) shift and a peak-width adjustment.
The amplitude was also adjustable. R-squared values were calculated to evaluate the fits,
along with visual inspection for differences in skewness and/or systematic biases in the fit
residuals. The functions actually used to fit the data are listed below. The binomial and
Poisson functions were extended from discrete to continuous forms by replacing factorials
with gamma functions, owing to technical issues with discrete fitting procedures.
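Binomial distribution
$$f_B(x) = A\cdot\frac{\Gamma(N+1)}{\Gamma\!\left(\frac{x-x_0}{w}+1\right)\cdot\Gamma\!\left(N-\frac{x-x_0}{w}+1\right)}\;p^{\frac{x-x_0}{w}}\,(1-p)^{N-\frac{x-x_0}{w}}$$
Poisson distribution
$$f_P(x) = A\cdot\frac{\lambda^{\frac{x-x_0}{w}}}{\Gamma\!\left(\frac{x-x_0}{w}+1\right)}\;e^{-\lambda}$$
Gamma distribution
$$f_g(x) = A\cdot\frac{(x-x_0)^{k-1}}{\Gamma(k)\cdot\theta^{k}}\;e^{-\frac{x-x_0}{\theta}}$$
Extreme value distribution
$$f_e(x) = A\cdot e^{\left(\frac{x-x_0}{w}-e^{\frac{x-x_0}{w}}\right)}$$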
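The four fitting functions can be written with the standard-library gamma function, as in the following Python sketch. It is not the MATLAB code used in this study: the function and argument names are invented here, values outside each function's domain are assumed to be 0, and the least-squares optimizer itself is left out.

```python
from math import exp, lgamma, log


def binomial_fit(x, amplitude, n, p, w, x0):
    """Continuous binomial form; factorials are replaced by gamma functions."""
    u = (x - x0) / w
    if u < 0 or u > n:
        return 0.0
    log_f = (lgamma(n + 1) - lgamma(u + 1) - lgamma(n - u + 1)
             + u * log(p) + (n - u) * log(1 - p))
    return amplitude * exp(log_f)


def poisson_fit(x, amplitude, lam, w, x0):
    """Continuous Poisson form with shift x0 and width w."""
    u = (x - x0) / w
    if u < 0:
        return 0.0
    return amplitude * exp(u * log(lam) - lgamma(u + 1) - lam)


def gamma_fit(x, amplitude, k, theta, x0):
    """Gamma density shifted by x0 so that negative scores are allowed."""
    z = x - x0
    if z <= 0:
        return 0.0
    return amplitude * exp((k - 1) * log(z) - lgamma(k) - k * log(theta) - z / theta)


def extreme_value_fit(x, amplitude, w, x0):
    """Extreme value form exp(u - exp(u)) with u = (x - x0) / w."""
    u = (x - x0) / w
    if u > 700:  # exp(u) would overflow; the function value is effectively 0
        return 0.0
    return amplitude * exp(u - exp(u))


def r_squared(observed, fitted):
    """Coefficient of determination between observed and fitted values."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Any nonlinear least-squares routine can then minimize the residuals between these functions and the binned score distribution.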
The examination of parameter trajectories by sequence length changes 
MATLAB code for simple linear and exponential fits was written and used to obtain
length-dependent formulas for the parameters of the ungapped global alignment score
distributions. Score distributions were predicted for varied sequence lengths in order to
examine, qualitatively and quantitatively, how the parameters change with sequence length.
The best-fit model, the extended gamma distribution, has 3 or 4 parameters; each parameter
was fitted with a linear or exponential function, depending on the shape of its curve as a
function of sequence length, in order to examine whether the parameters are predictable from
the sequence length.
Results 
The derivation of score probability density formula 
Viewing an alignment result as a simple multinomial event, with a particular set of
residue types as outcomes (score/match), led to the well-known formula of the multinomial
distribution. This distribution by itself does not give a score distribution, because it
gives the probability of occurrence of a “combination” of residue types. To calculate the
distribution of scores, all conceivable combinations must be enumerated and the probabilities
of all combinations with the same score must be summed.
Independent of how the sequence alignment was performed, a pair-wise alignment score
can be interpreted as the outcome of simple multinomial events. A match with a particular
score type is an outcome (see the methods section). Thus, the alignment can be modeled as a
simple multinomial event with N trials (refs) and m outcomes (residue types). In the same
sense, multiple residue types with the same score type can be combined into a single entity,
because the score is determined by the combination of score types and not by their order in
the sequence. Thus, no matter how the sequences are varied or shuffled, as long as there are
“matches of 3 Ala, 1 Asn, 5 Leu, 1 Tyr and 1 Val”, the alignments are equivalent. This can
then be reduced to “9 matches of score 4, 1 match of score 6 and 1 match of score 7”. The
formulas for the probability of a given combination of residue types and for its score are
given by equations 1 and 2. With the points above, the calculation was further reduced by
converting residue types to score types, which transforms the formulas into equations 3 and
4, expressed in score types rather than residue types. Because the formulas for the score and
its probability of occurrence were derived theoretically, the distribution can be computed by
simply enumerating all combinations of score types for a given sequence length and summing
the probabilities of the combinations with the same score. Unfortunately, this calculation is
extremely expensive: the number of combinations grows rapidly with sequence length, so the
computation does not scale to longer sequences. With 15 score types, the number of
combinations follows the recurrence f(L)=(L+14)/L·f(L-1), where L is the sequence length. On
a workstation with a 2.66 GHz Xeon, the computation for 36 residues took over 9 days in a
single thread. Increasing the sequence length increases the load significantly, and
computation with reasonable resources within a reasonable time was out of reach. For a
particular score, the total probability is the sum of the probabilities of all non-negative
integer lattice points on the hyperplane expressed by equation 4. Unfortunately, deriving or
counting such points in closed form is beyond the scope of this study and would be a subject
of its own for mathematicians. It does not appear to be an easy task, and may simply not be
possible, for a biomedical researcher such as myself.
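The growth in the number of combinations can be checked with a short Python sketch. It assumes f(0) = 1 as the starting value of the recurrence, which corresponds to the count of combinations with repetition, C(L+14, 14), for 15 score types; the function names are invented for the example.

```python
from math import comb


def n_combinations(length, score_types=15):
    """Closed form: ways to distribute `length` positions over the score types."""
    return comb(length + score_types - 1, score_types - 1)


def n_combinations_recurrence(length, score_types=15):
    """Recurrence f(L) = (L + 14) / L * f(L - 1) for 15 score types, f(0) = 1."""
    count = 1
    for l in range(1, length + 1):
        count = count * (l + score_types - 1) // l
    return count


for length in (9, 18, 36):
    print(length, n_combinations(length))
```

The ratio of the counts for 36 and 9 residues is on the order of a million, which is in line with the difference between a sub-second and a multi-day enumeration.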
The validation of predictions
The score distribution can be calculated from the formulas derived in the methods section
(equations 1 and 2) as the sum of the probabilities of the residue combinations that give the
same score. Figure 1 compares the predicted distributions with those calculated from the
alignment scores of synthetic sequences. The case of evenly probable match scores of 4~8 is
shown in panel A. The two data sets matched almost completely, with R2 > 0.99996. Because
there is no penalty and the scores are between 4 and 8, the distribution is not bell-shaped
but has a more complex shape, particularly for the shorter sequences. This result strongly
indicates that a score distribution is not necessarily a bell-shaped curve that can be
expressed by a simple statistical distribution function. In this case, for simplicity, the
amino acid frequencies were the same (1/20) for all residue types and the substitution matrix
was 0 for all non-diagonal elements, with the scores [4,5,6,7,8] each assigned to 4 diagonal
positions. Scores of 1, 2 and 3 cannot be formed by any combination, so they appeared in
neither the prediction nor the actual alignment scores. Scores 4~7 were equally probable
because each arises from a single combination, whereas a score of 8 arises both from a single
score-8 match and from two score-4 matches and is thus more probable, and so on. As the
sequence length increased, the distributions became smoother, although some jaggedness
remained even at 120 residues. The distributions obtained with the synthetic substitution
matrix built from the BLOSUM62 diagonal are shown in panel B. With more complex score types
of differing occurrences (scores [4,5,6,7,8,9,11] with 5,6,4,2,1,1,1 occurrences), the
distributions were slightly smoother but basically the same as with the 4~8 matrix. The R2
values ranged from 0.9987 to 0.9999. Predictions with the BLOSUM62 matrix were
computationally expensive for longer sequences, because the execution time increases steeply
with the number of multinomial combinations, which is given by combinations with repetition
(15HN = 15+N-1CN). Thus, only sequence lengths of 9, 18 and 36 were predicted and are shown.
The results for ungapped alignments of synthetic sequences with uniform amino acid
frequencies are shown in panel C. The match is almost perfect, with R2 = 0.9996~0.9999 for
all lengths. As expected, the score distributions are length dependent, but not in a simple
manner. With the full BLOSUM62 matrix there are 15 types of multinomial events (match scores
of 4~9 and 11, similarity scores of 1~3 and mismatch penalties of -4~0), which smooth out the
distributions completely. Unfortunately, more score types mean more nested loops in the
calculation, so it was computationally very expensive to calculate the distributions of
longer sequences. The calculation time increases by a factor of roughly e^(length/2.28),
which makes it impractical (less than 1 s for a 9-residue sequence but over 9 days for 36
residues). Nevertheless, this examination confirms that the formula is correct and that the
substitution matrix determines the shape of the distributions. Panel D shows the results for
the same BLOSUM62 matrix with amino acid frequencies calculated from UniProt (release
2013_03) statistics. The predictions and the distributions of actual alignment scores of
synthetic sequences match very well (R2 > 0.99996). An interesting point is that the score
distributions are shifted toward higher scores. This may come from the difference in the
overall probability of positive scores between even and observed amino acid frequencies
(0.155 and 0.174, respectively). Thus, the hypothesis that amino acid frequency affects the
shape of the score distribution is supported.
Figure 2 shows a small discrepancy between the predictions and the score distributions
from alignments of naturally occurring sequences. This was expected, since the previous
section showed that differences in amino acid frequencies change the score distributions
(figure 1, panels C/D). The discrepancy was assumed to be caused by small differences between
the amino acid frequencies of the actual fragment datasets from the CATH 3.4 database and the
frequencies adopted from UniProt statistics. The amino acid frequencies of the actual
sequences used for the alignments were therefore calculated, and the predictions were
re-calculated with these adjusted frequencies. The frequencies deviated by 3.2~6.1% per
residue type (median change in frequency) between the CATH 3.4 derived sequence datasets and
the UniProt values. After adjusting the frequencies used for the predictions, the R2 values
of the fits went up from 0.9982/0.9975/0.9970 to 0.9999/0.9999/0.9997, and the obvious
systematic residuals were mostly eliminated. Thus, even relatively small differences in amino
acid frequencies affect the distribution at an easily detectable level.
The fitting of score distributions 
The score distributions were fitted with 4 families of statistical distribution
functions: (1) the binomial distribution, (2) the Poisson distribution, (3) the gamma
distribution and (4) the extreme value distribution. Because the sequence length strongly
affects the shape of the distributions, fittings were performed at 9/18/36/54 residues. As
shown in figure 1, the synthetic substitution matrices with neither mismatch penalties nor
similarity scores give jagged rather than bell-shaped distributions, so it is futile to look
for a function that fits them. Thus, only the smoothed-out distributions from BLOSUM62, with
or without gaps, were fitted with these functions. Figure 3 and Table 1 show the fitting
results of the 4 functions for pair-wise alignment score distributions of synthetic sequences
with different sequence lengths and amino acid frequencies. The gamma distribution was
clearly the best function for the BLOSUM62 results at all sequence lengths studied, with the
highest R2 and the smallest systematic residuals. The Poisson and binomial distributions also
performed well for longer sequences but not as well as the gamma distribution for short ones.
The extreme value distribution showed an obvious difference in skewness that could not be
compensated by its parameters and was the worst of the 4 functions. In the case of a
substitution matrix with 1 at the diagonal positions and 0 elsewhere (which represents
Bernoulli trials), the score distributions matched binomial distributions completely, as
expected (data not shown). However, as can be seen in figure 1A/B, the distributions from
substitution matrices with multiple score types were not simple convolutions of binomial
distributions. The binomial and Poisson distributions could describe the shape of the
distributions, but they required non-unity amplitude parameters to fit well. This does not
sit well with the nature of probability density (mass) functions, so from this point of view
as well, the gamma distribution is the best distribution for ungapped alignment score
distributions. These functions were also tested with naturally occurring sequences derived
from the CATH domain dataset (Figure 4). The pair-wise global sequence alignments were
computed in both ungapped and gapped modes. The results were the same as for the synthetic
sequence alignments. Despite the differences in amino acid frequencies that affected the
distribution shapes, all models tested (even, UniProt and CATH-derived amino acid
frequencies) resulted in the same best-fit distribution (gamma distribution). Therefore, it
is reasonable to conclude that the alignment score distributions of sequences aligned with
the BLOSUM62 matrix follow a gamma distribution. Fitting the predicted distributions, which
were deemed error free because their probabilities are calculated from the exact analytical
solution, with the gamma distribution resulted in R2 values of >0.99999 (length 12 or longer)
or 0.999900/0.999979 for lengths of 6 and 9, respectively.
The derivation of formulas for parameter prediction as functions of sequence length, 
evaluation of predictability by extrapolated parameters 
The previous sections showed that the multinomial event model predicts the score
distributions exactly for different substitution matrices, lengths and amino acid
frequencies. As already explained, however, enumerating the combinations and calculating the
probabilities of the scores is extremely expensive computationally, because the number of
combinations grows steeply with sequence length. Therefore, the predictive power of the model
was used to generate a series of distributions as a function of sequence length, and these
were fitted with the gamma distribution in order to predict its parameters. As shown in
Figure 5, the parameters behaved well in terms of predictability in the short sequence range.
The gamma distribution has two shape-determining parameters (k and θ); in this study, two
more parameters were introduced: x0, an x-shift to handle negative scores (avoiding complex
values or undefined gamma function values), and an amplitude, in case the amplitude needed
adjustment. The gamma distribution did not require amplitude adjustment, as the “best fit”
amplitudes for the varied sequence lengths were almost exactly 1.0. This is a good property
for the distribution function, because the fit is performed on a probability density
function, which should have an amplitude of exactly 1.0. Thus, in further fittings the
amplitude parameter was fixed to 1.0. The multinomial event model was used to predict the
score distributions for different sequence lengths. These distributions were then fitted with
the gamma distribution, starting from estimated initial parameters. This was an important
precaution: in multivariate fitting, the initial parameters strongly affect whether the fit
converges and whether it arrives at a “series” of fitting results that can be compared. If
the initial parameters are not given consistently, the fits are not guaranteed to fall into
comparable local minima, and may not converge at all. Figure 5 shows the results of the
distribution prediction and fitting, followed by the fitting of the gamma distribution
parameters (k/θ/x0) with a linear function ( f(x)=a+bx ) or an exponential function
( f(x)=a+b*power(x/c,d) ). The resulting fits were very good (>0.99999 for k and x0, and
0.99986 for θ). The parameters obtained from these distributions (sequence lengths of
6/9/12/15/18/21/24/27/30/33/36) were extrapolated to longer sequence lengths (54~270
residues), and distributions were calculated from the extrapolated parameters. These
calculated distributions (for sequences too long to be predicted by the exact formula within
reasonable time and resources) were then compared with actual alignment results for synthetic
sequences up to 270 residues long. Figure 6 compares the actual alignment score distributions
with the distributions calculated from the extrapolated parameters. These pairs of
distributions matched with R2 values of 0.99952~0.99998 for sequence lengths from 9 to 270.
To evaluate the extrapolated parameters against parameters fitted to the alignment score
distributions, the score distributions from the alignment results were fitted with gamma
distributions using the extrapolated parameters as initial values. The two sets of parameters
matched very well, with an average error of 0.15~0.38%.
Thus, if the amino acid frequencies of the database are known and the substitution
matrix is given, the distribution of alignment scores can be calculated with good accuracy as
a function of sequence length. This translates into a precise calculation of the e-values of
alignments even without the exact combinatorial enumeration and probability calculation.
Computationally, this is the biggest benefit: a regular workstation can calculate an e-value
precisely in a very short time (the gamma distribution parameters are given, so the
cumulative probability function is known).
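Once the gamma parameters for a given length are known, the probability of reaching or exceeding a score follows from the upper tail of the shifted gamma distribution. The following Python sketch evaluates that tail by simple midpoint integration using only the standard library; it is an illustration with invented function names, the amplitude is assumed to be fixed at 1.0, and a production implementation would more likely use a regularized incomplete gamma function.

```python
from math import exp, lgamma, log, sqrt


def shifted_gamma_pdf(x, k, theta, x0):
    """Gamma density with shape k and scale theta, shifted by x0."""
    z = x - x0
    if z <= 0:
        return 0.0
    return exp((k - 1) * log(z) - lgamma(k) - k * log(theta) - z / theta)


def tail_probability(score, k, theta, x0, steps=20000):
    """Approximate P(S >= score) by midpoint integration of the density.

    The integral is truncated 40 standard deviations above the mean,
    where the remaining probability mass is negligible.
    """
    lower = max(score, x0)
    upper = x0 + k * theta + 40.0 * sqrt(k) * theta
    if lower >= upper:
        return 0.0
    h = (upper - lower) / steps
    total = sum(shifted_gamma_pdf(lower + (i + 0.5) * h, k, theta, x0)
                for i in range(steps))
    return min(1.0, total * h)
```

For very small tail probabilities the fixed step count limits the precision, so the step count or the integration scheme would need to be adapted.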
Table 1: The R2 values for the fitting of alignment score distributions of synthetic sequences by
statistical distribution functions

Even A.A. frequencies
Sequence length   9        18       36       54       120
Gamma             0.9999   0.9999   0.9998   0.9998   0.9996
Extreme           0.9974   0.9921   0.9872   0.9910   0.9746
Poisson           0.9992   0.9999   0.9998   0.9997   0.9997
Binomial          0.9991   0.9998   0.9998   0.9997   0.9997

UniProt A.A. frequencies
Sequence length   9        18       36       54       120
Gamma             0.9998   0.9999   0.9999   0.9998   0.9997
Extreme           0.9950   0.9892   0.9852   0.9821   0.9749
Poisson           0.9997   0.9999   0.9999   0.9997   0.9997
Binomial          0.9996   0.9999   0.9999   0.9998   0.9997
Length Binomial Poisson Gamma Extreme 
6 0.9989 "#$%$& "#$$$$! "#$$'(! 
9 0.9999 "#$$$'! 1.0000 "#$$&' 
12 1.0000 1.0000 1.0000 "#$$)* 
15 1.0000 1.0000 1.0000 "#$%$$ 
18 1.0000 1.0000 1.0000 "#$%%' 
21 1.0000 1.0000 1.0000 "#$%%' 
24 1.0000 1.0000 1.0000 "#$%'(! 
27 1.0000 1.0000 1.0000 "#$%*$
Figure 1
The observed and predicted alignment score distributions
[Figure: four panels plotting probability against alignment score for observed and predicted
distributions. (A) 4~8 even and (B) BL62 diagonal, for sequence lengths L9, L18, L36, L54 and
L120; (C) BL62 even freq. and (D) BL62 UniProt freq., for L9, L18 and L36.]
Figure 2
The effect of amino acid frequencies on the distribution
[Figure: probability against alignment score for L9, L18 and L36, showing the observed
distribution, the prediction with UniProt frequencies, the prediction with adjusted
frequencies, and the residuals.]
Figure 3
The fitting of alignment score distributions of synthetic sequences
with UniProt A.A. frequencies by 4 statistical distribution functions
[Figure: four panels plotting probability against alignment score, each with the binomial,
Poisson, gamma and extreme value fits. The R2 values printed in the panels are those listed
for the A.A. freq rows of Table 1 at lengths 9, 18, 36 and 54.]
Figure 4
The fitting results on naturally occurring sequences
[Figure: R2 against sequence length (9, 18, 36, 54) for the gamma, extreme value, Poisson and
binomial fits, shown for ungapped and gapped alignments.]
Figure 5
The calculation of distribution formula
[Figure: predicted score distributions (probability against alignment score) for sequence
lengths 6 to 36 with their gamma distribution fits, and the fitted gamma parameters k, θ and
x0 plotted against sequence length. Fit formulas as printed in the panels:
f(x)=-3.557.x+2.310; f(x)=1.378+0.398.(-x/3.813)-1.199; f(x)=-4.114.x+2.447.]
Figure 6
Alignment derived and parameter fit predicted distributions
[Figure: probability against alignment score, observed and predicted distributions for
sequence lengths 9 to 270. R2 by sequence length: L=9, 0.99958; L=18, 0.99974; L=36, 0.99965;
L=54, 0.99965; L=72, 0.99972; L=90, 0.99963; L=120, 0.99960; L=150, 0.99952; L=180, 0.99962;
L=210, 0.99966; L=240, 0.99992; L=270, 0.99998.]

More Related Content

Similar to The derivation of ungapped global protein alignment score distributions - Part1

Seq alignment
Seq alignment Seq alignment
Seq alignment
Nagendrasahu6
 
The derivation of ungapped global protein alignment score distributions - Part1

  • 1. The derivation of ungapped global protein alignment score distributions
Abstract
For gapless global protein sequence alignment of N residues, the score distribution can be expressed as the sum of the probabilities of all multinomial distribution elements that share the same score. The score distribution depends on the sequence length, the substitution matrix and the amino acid frequencies.
Introduction
The alignment score distribution is an important determinant of the e-values of sequence alignments. Given the importance of expected-value calculations for protein alignment scores in biomedical research, understanding the exact nature of score distributions is of great importance. Many studies have sought the function that best fits the observed score distributions (refs). Although there is a weak consensus that the best-performing function is an extreme value distribution, no study has derived it from a theoretical examination of the alignment score distribution. In this study, the formula for the probability distribution of alignment scores is derived from a theoretical viewpoint, and the results are then used to compare theoretically predicted score distributions against the observed score distributions of randomly generated sequences. The score distributions of naturally occurring sequences are then compared with the predictions, the best-fit function is presented, and finally an approach to generalization is discussed.
The methods
An examination of alignment score as multinomial event
Aligned protein sequences are paired sequences with matches or mismatches assigned between the two sequences. For the purpose of score calculation (though not in a biological context), the amino acids within each sequence have no correlation with or dependency on neighboring residues, and the positions are therefore completely independent in an ungapped sequence alignment.
Based on this, the pair-wise alignment score of sequences with k matches in N residues was modeled as a multinomial event: k successes among N independent trials with multiple outcomes. A match with a particular score (the matching score of each residue type) was considered a successful outcome, and a mismatch was taken as a failure outcome within the N trials. Assume there are m residue types for a pair of sequences composed of N residues. The sequences are aligned with x_1 matches of residue type r_1, which has occurrence probability p_1 and score s_1 per match, through x_m matches of residue type r_m, which has occurrence probability p_m and score s_m per match, and with mismatches at the residual probability p_f = 1 - Σ{p_i : i = 1~m} of all matches. With k = x_1 + x_2 + … + x_m, the probability P of this multinomial event (this combination of residue types) and the score Sc are expressed as

$$P = \frac{N!}{x_1!\,x_2!\cdots x_m!\,(N-k)!}\;p_1^{x_1}\,p_2^{x_2}\cdots p_m^{x_m}\,p_f^{\,N-k} = \frac{N!\,p_f^{\,N-k}}{(N-k)!}\prod_{i=1}^{m}\frac{p_i^{x_i}}{x_i!},\qquad p_f = 1-\sum_{i=1}^{m}p_i \qquad (1)$$
  • 2. $$Sc = x_1 s_1 + x_2 s_2 + \cdots + x_m s_m = \sum_{i=1}^{m} x_i s_i \qquad (2)$$

For a particular score S_i, the total probability is the sum of the probabilities of all combinations that correspond to non-negative integer lattice points on the hyperplane S_i = x_1 s_1 + x_2 s_2 + … + x_m s_m.
The prediction of alignment score distributions
The matching probability was defined as the probability that a given amino acid is observed as a "match" in the alignment result. Each amino acid has its own scores for match and mismatch events in the substitution matrix (such as BLOSUM62; refs). Thus, in the self-match-only case (success/failure model), the matching probability was (1/20)^2 for synthetic sequences with even frequencies and F_obs^2 for each amino acid observed with frequency F_obs in naturally occurring sequences. Because the substitution matrix defines a matching/similarity score or a mismatch penalty for every pair of residues, it was necessary to calculate the probabilities of all score types (score or penalty; refs). The probability of occurrence F(i,j) of a pair of amino acids i, j was defined as F(i,j) = F_obs(i) · F_obs(j). The score types S_t were defined as the score values found in the substitution matrix (in this study, -4 to 9 and 11 from BLOSUM62), and P_t(k) for score type k was the sum of the probabilities of all matrix elements with score S_t(k). In this case there is no success or failure. The calculation on the BLOSUM62 matrix gave S_t = {-4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11} and P_t(k) = {0.0415, 0.1806, 0.2335, 0.2275, 0.1429, 0.0666, 0.0373, 0.0104, 0.0287, 0.0160, 0.0111, 0.0031, 0.0005, 0.0002, 0.0001}, based on amino acid frequencies from UniProt (2013). It should be noted that this occurrence probability distribution of score types depends on the amino acid frequencies and affects the outcome. The combinations of residue-type pairs and their occurrence probabilities can be replaced by these score types and their occurrence probabilities.
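The collapse from residue pairs to score types can be illustrated with a minimal Python sketch. It assumes that every ordered pair (i, j) of the full matrix contributes F_obs(i) · F_obs(j), as in the definition of F(i,j). The three-letter alphabet, its frequencies and the matrix values are hypothetical and chosen only to keep the example short; they are not the BLOSUM62 or UniProt values used in the study.

```python
from collections import defaultdict


def score_type_probabilities(freqs, matrix):
    """Sum F(i, j) = F_obs(i) * F_obs(j) over all ordered pairs sharing a score."""
    pt = defaultdict(float)
    for i, fi in freqs.items():
        for j, fj in freqs.items():
            pt[matrix[(i, j)]] += fi * fj
    # Return the score types in ascending order, like the S_t list in the text.
    return dict(sorted(pt.items()))


# Hypothetical alphabet, frequencies and symmetric matrix (illustration only).
toy_freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
toy_matrix = {
    ("A", "A"): 4, ("A", "B"): -1, ("A", "C"): 0,
    ("B", "A"): -1, ("B", "B"): 5, ("B", "C"): -1,
    ("C", "A"): 0, ("C", "B"): -1, ("C", "C"): 9,
}

toy_pt = score_type_probabilities(toy_freqs, toy_matrix)
for score, prob in toy_pt.items():
    print(f"score {score:>2}: {prob:.4f}")
```

Nine ordered residue pairs reduce to five score types here, and the probabilities still sum to 1, which is the same property the P_t(k) values listed above have.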
Thus, rather than dealing with 210 combinations of residue types, only 15 (BLOSUM62) to 20 (PAM250) score types need to be considered. For pairs of aligned sequences, the score distribution is given by the following equations with t score types. The occurrence probability p(N, x_ti, P_ti : i = 1~t) of an alignment of N-residue sequences that contains x_ti occurrences of score type S_ti, each with probability P_t(i), can be expressed as

$$p(N, x_{ti}, P_{ti} : i = 1 \sim t) = \frac{N!}{x_{t1}!\,x_{t2}!\cdots x_{tt}!}\;p_{t1}^{x_{t1}}\,p_{t2}^{x_{t2}}\cdots p_{tt}^{x_{tt}},\qquad \sum_{i=1}^{t} x_{ti} = N \qquad (3)$$

$$Sc = x_{t1} S_{t1} + x_{t2} S_{t2} + \cdots + x_{tt} S_{tt} \qquad (4)$$

By enumerating all combinations with score Sc and summing their probabilities, the alignment score distribution for a particular parameter set (amino acid frequencies and substitution matrix) can be calculated as Pc = Σ{p(score is Sc)}.
The examination of prediction by derived probability formula and scores with sequence alignments of randomly generated synthetic sequences
Protein sequences were generated randomly with (1) even frequencies for all 20 amino acids or (2) the frequencies observed in real sequences (ref: UniProt 2013), in order to address the obvious effect of sequence length and the hypothesis that amino acid frequency is a potential factor in the shape of the distribution. Synthetic sequences were generated with varied lengths (9/18/36/54/72/90/120/150/180/210/240/270) in order to obtain better compliance of the amino acid frequency statistics with the UniProt statistics. Natural sequence fragments were generated from the CATH
  • 3. 3.4 domain database (ref: CATH) and cut into fragments of the given lengths without overlap, in order to avoid the effect of overlapping sequences. A total of 5000 sequences were randomly selected from the fragments and aligned pair-wise in all combinations. The sequence alignments were performed with the Needleman-Wunsch algorithm, using modified or unmodified BLOSUM62 or synthetic substitution matrices, on both natural and synthetic random sequences. A second hypothesis was that the substitution matrix is a determining factor of the score distributions. The score distributions for each substitution matrix were calculated by binning the alignment scores. The following section describes the details of the calculations.
The scoring schemes were (1) a score of 1 for every match; (2) match scores of 4 to 8 distributed evenly over the 20 commonly observed amino acids (4 residues each), without mismatch penalties, to mimic the BLOSUM62 substitution matrix in a simplified manner; (3) a matrix derived from the diagonal components of the BLOSUM62 substitution matrix, with neither mismatch penalties nor scores in the rest of the matrix (match scores of 4/5/6/7/8/9/11 with 5/6/4/2/1/1/1 occurrences, respectively); and (4) BLOSUM62. Sequence alignments were performed by in-house code with all matrices, using default (-10, -2) or very high (9999 for both, to make the alignment gapless) gap opening and extension penalties. The scores were binned with an interval width of 1 and assembled.
The predictions of the score distributions were performed by in-house code that enumerated all combinations of occurrences of the multinomial events (all score types) for the given scoring scheme and summed the probabilities of the combinations that belonged to the same score. The predicted probability distributions were then compared with the distributions calculated from the pair-wise alignment scores of synthetic/natural sequences.
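The enumeration described above can be sketched in a few lines of Python. This is not the in-house code, which is not shown in the source, but a minimal illustration of equations 3 and 4 under the assumption that positions are independent. The helper names and the three toy score types are hypothetical.

```python
from collections import defaultdict
from math import comb, factorial, prod


def compositions(n, parts):
    """Yield every tuple of `parts` non-negative integers that sums to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest


def score_distribution(n, scores, probs):
    """Sum the multinomial probabilities (eq. 3) of all combinations per score (eq. 4)."""
    dist = defaultdict(float)
    for counts in compositions(n, len(scores)):
        coeff = factorial(n)
        for x in counts:
            coeff //= factorial(x)  # exact integer multinomial coefficient
        p = coeff * prod(q ** x for q, x in zip(probs, counts))
        dist[sum(x * s for x, s in zip(counts, scores))] += p
    return dict(sorted(dist.items()))


# Hypothetical score types and probabilities (illustration only).
toy_scores = (-1, 0, 4)
toy_probs = (0.5, 0.3, 0.2)
toy_n = 3

toy_dist = score_distribution(toy_n, toy_scores, toy_probs)
n_combinations = sum(1 for _ in compositions(toy_n, len(toy_scores)))
for score, prob in toy_dist.items():
    print(f"score {score:>2}: {prob:.4f}")
```

The number of combinations visited equals the number of combinations with repetition, `comb(n + t - 1, t - 1)` for t score types, so the loop becomes very large for 15 score types and longer sequences.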
The occurrence probabilities used for the predictions were taken from the statistics of the UniProt 2013 dataset or, in the case of naturally occurring sequences, adjusted to the observed amino acid frequencies, because the observed frequencies varied from set to set.
The fitting of alignment score distributions
MATLAB codes utilizing "lsqnonlin" were used to fit the score distributions with various statistical distribution functions (binomial/Poisson/gamma/extreme value distributions). These functions were modified to allow a shift along the X-axis (score) and an adjustment of the peak width. The amplitude was also adjustable. R-squared values were calculated to examine the fits, together with visual inspection for differences in skewness and/or systematic biases in the residuals of the fits. The actual functions used to fit the data are given below. The binomial and Poisson distributions were extended from discrete functions to continuous forms by replacing the factorials with gamma functions, because of technical issues with discrete fitting procedures.

Binomial distribution
$$f_B(x) = A\cdot\frac{\Gamma(N+1)}{\Gamma\!\left(\frac{x-x_0}{w}+1\right)\Gamma\!\left(N-\frac{x-x_0}{w}+1\right)}\;p^{\frac{x-x_0}{w}}\,(1-p)^{N-\frac{x-x_0}{w}}$$

Poisson distribution
$$f_P(x) = A\cdot\frac{\lambda^{\frac{x-x_0}{w}}}{\Gamma\!\left(\frac{x-x_0}{w}+1\right)}\;e^{-\lambda}$$

Gamma distribution
$$f_g(x) = A\cdot\frac{(x-x_0)^{k-1}}{\Gamma(k)\,\theta^{k}}\;e^{-\frac{x-x_0}{\theta}}$$

Extreme value distribution
$$f_e(x) = A\cdot e^{\left(\frac{x-x_0}{w}-e^{\frac{x-x_0}{w}}\right)}$$
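As an illustration of these four model functions, the sketch below writes them in Python using only the standard library. It is not the MATLAB fitting code from the study. The function names are hypothetical, and the R-squared definition used here (1 minus the ratio of residual to total sum of squares) is an assumption, since the source does not spell out how R-squared was computed.

```python
from math import comb, exp, lgamma, log


def continuous_binomial(x, n, p, x0=0.0, w=1.0, amp=1.0):
    """Binomial form with factorials replaced by gamma functions."""
    z = (x - x0) / w
    if z < 0 or z > n:
        return 0.0
    return amp * exp(lgamma(n + 1) - lgamma(z + 1) - lgamma(n - z + 1)
                     + z * log(p) + (n - z) * log(1 - p))


def continuous_poisson(x, lam, x0=0.0, w=1.0, amp=1.0):
    """Poisson form with the factorial replaced by a gamma function."""
    z = (x - x0) / w
    if z < 0:
        return 0.0
    return amp * exp(z * log(lam) - lam - lgamma(z + 1))


def shifted_gamma(x, k, theta, x0=0.0, amp=1.0):
    """Gamma density shifted by x0 so that negative scores can be handled."""
    z = x - x0
    if z <= 0:
        return 0.0
    return amp * exp((k - 1) * log(z) - z / theta - lgamma(k) - k * log(theta))


def extreme_value(x, x0=0.0, w=1.0, amp=1.0):
    """Extreme value form exp(z - exp(z)) with z = (x - x0) / w."""
    z = (x - x0) / w
    return amp * exp(z - exp(z))


def r_squared(observed, fitted):
    """Assumed definition: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# At integer arguments the continuous binomial form reduces to the discrete one.
exact_binomial = comb(10, 3) * 0.3 ** 3 * 0.7 ** 7
print(continuous_binomial(3, 10, 0.3), exact_binomial)
```

Working with `lgamma` and logarithms instead of `gamma` directly avoids overflow for larger N, which is a design choice of this sketch rather than something stated in the source.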
  • 4. The examination of parameter trajectories by sequence length changes
MATLAB codes for simple linear and exponential fits were written and used to obtain a length-dependent formula for calculating the probabilities of ungapped global alignment score distributions. Score distributions were predicted for varied sequence lengths in order to examine, qualitatively and quantitatively, how the fitted parameters change with sequence length. The best-fit model, the extended gamma distribution, has 3 or 4 parameters; each parameter was fitted with a linear or exponential function, depending on the shape of its curve as a function of sequence length, in order to examine whether the parameters are predictable from the sequence length.
Results
The derivation of score probability density formula
Viewing an alignment result as a simple multinomial event, with a particular set of residue types as outcomes (score/match), leads to the well-known formula of the multinomial distribution. This distribution by itself does not give a score distribution, because it gives the probability of occurrence of a "combination" of residue types. To calculate the distribution of scores, all conceivable combinations need to be enumerated and the probabilities of all combinations with the same score need to be summed. Regardless of how the sequence alignment is performed, a pair-wise alignment score can be interpreted as the result of a simple multinomial event. A match with a particular score type is an outcome (see the methods section). Thus, we can model it as a simple multinomial event with N trials (refs) and m outcomes (residue types). In the same sense, multiple residue types with the same score type can be combined into a single entity, because the score is determined by the combination of score types and not by the order of the residues in the sequences. In this sense, no matter how the sequences are varied or shuffled, as long as there are "matches of 3 Ala, 1 Asn, 5 Leu, 1 Tyr and 1 Val", they are equivalent.
This combination can then be reduced to "9 matches of score 4, 1 match of score 6 and 1 match of score 7". The formulas for calculating the probability and the score of a given combination of residue types are given by equations 1 and 2. Based on the points above, the calculation was reduced further by converting residue types to score types, which transforms the formulas into equations 3 and 4, expressed in score types rather than residue types. Because the formulas for the score and its probability of occurrence were derived theoretically, the score distribution can be obtained by simple computation: enumerate all combinations of score types for the given sequence length and sum the probabilities of the combinations with the same score. Unfortunately, this calculation is extremely expensive, because the number of combinations grows rapidly with sequence length, and the computation does not scale to longer sequences. There are 15 score types; therefore, the number of combinations is expressed by f(L) = (L+14)/L · f(L-1), where L is the sequence length. On a workstation with a 2.66 GHz Xeon, the computation took over 9 days in a single thread. An increase in sequence length results in a significant increase in load, and computation with reasonable resources within a reasonable time was out of reach. For a particular score, the total probability is the sum of the probabilities of all combinations that correspond to non-negative integer lattice points on the hyperplane expressed by equation 4. Unfortunately, the derivation or calculation of such points is beyond the scope of this study, and the subject would be a study of its own for mathematicians. It does not appear to be an easy task, and may simply not be possible, for a biomedical researcher such as myself.
The validation of predictions
  • 5. The score distribution can be calculated from the formulas derived in the methods section (equations 1 and 2) as the sum of the probabilities of the different residue combinations with the same score. Figure 1 compares the predicted distributions with those calculated from the alignment scores of synthetic sequences. The case of evenly distributed match scores of 4 to 8 is shown in panel A. The two data sets match almost completely, with R² > 0.99996. Because there is no penalty and the scores lie between 4 and 8, the distribution is not bell-shaped but has a more complex shape, particularly for the shorter sequences. This result strongly indicates that a score distribution is not necessarily a bell-shaped curve that can be expressed by a simple statistical distribution function. In this case, for simplicity, the amino acid frequencies were the same (1/20) for all residue types and the substitution matrix was 0 for all non-diagonal elements, with each of the scores [4, 5, 6, 7, 8] assigned to 4 diagonal positions. Scores of 1, 2 and 3 did not occur because no combination produces these values, so neither the prediction nor the actual alignment scores showed them. Scores 4 to 7 were equally probable because each arises from a single combination; a score of 8 arises from a single score-8 match as well as from two score-4 matches and is thus more probable, and so on. As the sequence length increased, the distributions became smoother, although some jaggedness remained even at 120 residues. The distributions obtained with the synthetic substitution matrix built from the BLOSUM62 diagonal are shown in panel B. With more complex score types with differing occurrences (scores [4, 5, 6, 7, 8, 9, 11] with 5, 6, 4, 2, 1, 1, 1 occurrences), the distributions were slightly smoother but basically the same as with the 4 to 8 matrix. The R² values ranged from 0.9987 to 0.9999. The use of the BLOSUM62 matrix was computationally expensive for predicting longer sequences, as the execution time increases steeply with the number of multinomial combinations
($_{15}H_N = {}_{15+N-1}C_N$, since the number of multinomial combinations is given by combinations with repetition). Thus, only sequence lengths of 9, 18 and 36 were predicted and are shown. The results for ungapped alignments of synthetic sequences with uniform amino acid frequencies are shown in panel C. The match is almost perfect, with R² = 0.9996 to 0.9999 for all lengths. As expected, the score distributions are length dependent, but not in a simple manner. With the full BLOSUM62 matrix there are 15 types of multinomial events (match scores of 4 to 9 and 11, similarity scores of 1 to 3 and mismatch penalties of -4 to 0), which completely smooths out the distributions. Unfortunately, more score types mean more nested loops in the calculation, which made it very computationally expensive to calculate the distributions of longer sequences. The calculation time increases by a factor of about e^(length/2.28), which makes the approach impractical (less than 1 s for a 9-residue sequence but over 9 days for 36 residues). Nevertheless, this examination confirms that the formula is correct and that the substitution matrix determines the shape of the distributions. Panel D shows the results for the same BLOSUM62 matrix with amino acid frequencies calculated from UniProt (release 2013_03) statistics. The predictions and the distributions of actual alignment scores of synthetic sequences match very well (R² > 0.99996). An interesting point is that the score distributions are shifted toward higher scores. This may come from the difference in the overall probability of positive scores between even and observed amino acid frequencies (0.155 and 0.174, respectively). Thus, the hypothesis that amino acid frequency affects the shape of the score distribution is supported. Figure 2 shows a small discrepancy between the predictions and the score distributions from alignments of naturally occurring sequences.
This was expected, because the previous section showed that differences in amino acid frequencies change the score distributions (Figure 1, panels C/D). The discrepancy was assumed to be caused by small differences between the amino acid frequencies of the actual fragment datasets from the CATH 3.4 database and the frequencies adopted from the UniProt statistics. The amino acid frequencies of the actual sequences used for the alignments were
  • 6. calculated and the predictions were re-calculated with these adjusted amino acid frequencies. The frequencies in the CATH 3.4 derived sequence datasets deviated from the UniProt values by 3.2 to 6.1% per residue type (median change in frequency). After adjusting the frequencies used for the predictions, the R² of the fits rose from 0.9982/0.9975/0.9970 to 0.9999/0.9999/0.9997 and the obvious systematic residuals were mostly eliminated. Thus, even relatively small differences in amino acid frequencies affect the distribution at an easily detectable level.
The fitting of score distributions
The score distributions were fitted with 4 families of statistical distribution functions: (1) binomial distribution, (2) Poisson distribution, (3) gamma distribution and (4) extreme value distribution. Because the sequence length strongly affects the shape of the distributions, the fits were performed for 9/18/36/54 residues. As shown in Figure 1, the synthetic substitution matrices with neither mismatch penalties nor similarity scores do not produce bell-shaped distributions but rather jagged shapes, so it is futile to look for a function that fits these distributions. Thus, only the smoothed-out distributions of BLOSUM62 with and without gaps were fitted with these functions. Figure 3 and Table 1 show the fitting results of the 4 functions for pair-wise alignment score distributions from synthetic sequences with different sequence lengths and amino acid frequencies. The gamma distribution was clearly the best function for the BLOSUM62 results at all sequence lengths studied, with the highest R² and the smallest systematic residuals. The Poisson and binomial distributions also performed well for longer sequences but not as well as the gamma distribution for short ones. The extreme value distribution showed an obvious difference in skewness that could not be compensated for by its parameters and was the worst of the 4 functions.
In the case of the substitution matrix with 1 at the diagonal positions and 0 elsewhere (which represents a Bernoulli trial), the score distributions matched binomial distributions completely, as expected (data not shown). But as can be seen in Figure 1A/B, the distributions from substitution matrices with multiple score types were not simple convolutions of binomial distributions. The binomial and Poisson distributions could explain the shape of the distributions; in order to fit well, however, they required non-unity amplitude parameters. This does not sit well with the nature of probability density (mass) functions; thus, from this point of view too, gamma distribution is the best distribution for ungapped alignment score distributions.
These functions were also tested with naturally occurring sequences derived from the CATH domain dataset (Figure 4). The pair-wise global sequence alignments were computed in both ungapped and gapped manner. The results were the same as for the synthetic sequence alignments. Despite the differences in amino acid frequencies that affected the distribution shapes, all models tested (even, UniProt and CATH derived amino acid frequencies) resulted in the same best-fit distribution (gamma distribution). Therefore, it is reasonable to conclude that the alignment score distributions of sequences aligned with the BLOSUM62 matrix follow gamma distribution. The predicted distributions, which were deemed to be error free because the probabilities are calculated precisely from the exact analytical solution, were fitted by gamma distribution with R2-values of >0.99999 (length 12 or longer) or 0.999900/0.999979 for lengths of 6 and 9, respectively.
The derivation of formulas for parameter prediction as functions of sequence length, evaluation of predictability by extrapolated parameters
The previous sections showed that the multinomial event model can predict exactly the score distributions for different substitution matrices, lengths and amino acid frequencies. As explained already, however, enumerating the combinations and calculating the probabilities for the scores is extremely computationally expensive, as the number of combinations increases roughly on the order of n!. Therefore, the predictive power of the model was used to generate a series of distributions as a function of sequence length, and these models were fitted with gamma distribution in order to predict the parameters. As shown in Figure 5, the parameters behaved predictably in the short sequence range. The gamma distribution has two shape-deciding parameters (k/θ), and in this study two more parameters were introduced: x0, an x-shift that handles negative scores (in order to avoid complex parameters or undefined gamma function values), and an amplitude, in case the amplitude needed adjusting. Gamma distribution did not require amplitude adjustment, as the “best fit” amplitudes for the varied sequence lengths were almost exactly 1.0. This is a good property for the distribution function, as the fitting is performed on a probability density function, which should have an amplitude of exactly 1.0. Thus, in further fittings, this amplitude parameter was fixed to 1.0.
The multinomial event model was utilized to predict the score distributions for different sequence lengths. These distribution models were then fitted by gamma distribution starting from estimated initial parameters. This was an important precaution, as in a multivariate fitting session the initial parameters are very important for the convergence of the fitting and for arriving at a “series” of fitting results that can be compared. If the initial parameters are not given consistently, the fits are not guaranteed to fall into comparable local minima, or may not converge at all. Figure 5 shows the results of distribution prediction and fitting, followed by the fitting of the gamma distribution parameters (k/θ/x0) by a linear function ( f(x)=a+bx ) or a power function ( f(x)=a+b*power(x/c,d) ). The resultant fits were very good (>0.99999 for k/x0 and 0.99986 for θ).
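The two parameter models named above, f(x)=a+bx and f(x)=a+b*power(x/c,d), can be sketched as follows. The coefficients are read from the partly garbled labels of Figure 5 and are an assumption. In particular, the label for k is taken to mean an intercept of -3.557 and a slope of 2.310, because the opposite reading would make k negative. The function names are illustrative, and `fit_line` is an ordinary closed-form least-squares fit, not necessarily the routine used in this study.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of f(x) = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def linear(x, a, b):
    """f(x) = a + b*x"""
    return a + b * x

def power_law(x, a, b, c, d):
    """f(x) = a + b * (x/c)^d"""
    return a + b * (x / c) ** d

# ASSUMPTION: coefficients as read from the Figure 5 labels (see lead-in).
K_COEF = (-3.557, 2.310)                     # k(L)     = a + b*L
THETA_COEF = (1.378, 0.398, 3.813, -1.199)   # theta(L) = a + b*(L/c)^d
X0_COEF = (2.447, -4.114)                    # x0(L)    = a + b*L

def gamma_parameters(length):
    """Extrapolated (k, theta, x0) for a given sequence length."""
    k = linear(length, *K_COEF)
    theta = power_law(length, *THETA_COEF)
    x0 = linear(length, *X0_COEF)
    return k, theta, x0

for length in (36, 120, 270):
    k, theta, x0 = gamma_parameters(length)
    mean_score = x0 + k * theta   # mean of the shifted gamma distribution
    print(length, round(k, 2), round(theta, 4), round(x0, 1), round(mean_score, 1))
```

A quick plausibility check on this reading of the labels is that the implied mean score, x0 + k*θ, stays negative and scales roughly linearly with sequence length.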
The parameters obtained from these distributions (sequence lengths of 6/9/12/15/18/21/24/27/30/33/36) were extrapolated to longer sequence lengths (54~270 residues), and distributions were calculated from the extrapolated parameters. These calculated distributions (for sequences too long to be predicted by the exact formula within reasonable time and resources) were then compared with actual alignment results for synthetic sequences up to 270 residues long. Figure 6 shows the comparison of the actual sequence alignment score distributions and the distributions calculated from the extrapolated parameters. These pairs of distributions matched with R2 values of 0.99952~0.99998 for sequence lengths from 9 to 270. In order to evaluate the extrapolated parameters against parameters fitted to the alignment score distributions, the score distributions from the alignment results were fitted with gamma distributions using the extrapolated parameters as initial values. They matched very well, with an error of 0.15~0.38% on average. Thus, if the database amino acid frequencies are known and the substitution matrix is given, the distribution of alignment scores can be calculated with good accuracy as a function of sequence length. This translates into precise calculation of the e-values of alignments even without exact combinatorial probability calculation and enumeration. Computationally, this is the biggest benefit, as a regular workstation can calculate the e-value precisely in a very short time (the gamma distribution parameters are given, thus the cumulative probability function is known).
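Once k, θ and x0 are known, the upper-tail probability of a score follows from the regularized incomplete gamma function. The sketch below is an illustration under stated assumptions, not the implementation used in this study: it uses the standard series and continued-fraction expansions of the incomplete gamma function, treats the score as continuous, and assumes the common convention that the e-value is the tail probability multiplied by the number of comparisons. The names `score_p_value`, `e_value` and `n_comparisons`, and the parameter values in the usage lines, are hypothetical.

```python
import math

def _lower_series(a, x):
    """Regularized lower incomplete gamma P(a, x) by series (use for x < a + 1)."""
    term = 1.0 / a
    total = term
    denom = a
    for _ in range(10000):
        denom += 1.0
        term *= x / denom
        total += term
        if abs(term) < abs(total) * 1e-16:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def _upper_cf(a, x):
    """Regularized upper incomplete gamma Q(a, x) by continued fraction (x >= a + 1)."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_sf(x, a):
    """Upper-tail probability Q(a, x) of a standard gamma variable with shape a."""
    if x <= 0.0:
        return 1.0
    if x < a + 1.0:
        return 1.0 - _lower_series(a, x)
    return _upper_cf(a, x)

def score_p_value(score, k, theta, x0):
    """P(S >= score) for a gamma distribution with shape k, scale theta, shift x0."""
    return gamma_sf((score - x0) / theta, k)

def e_value(p_value, n_comparisons):
    """ASSUMPTION: expected number of chance hits among n_comparisons alignments."""
    return p_value * n_comparisons

# Illustrative parameter values only (hypothetical, not taken from a fit).
p = score_p_value(10.0, k=79.6, theta=1.405, x0=-145.7)
print(p, e_value(p, n_comparisons=100000))
```

Only elementary arithmetic and `math.lgamma` are needed, so the evaluation takes a negligible amount of time compared with the combinatorial enumeration.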
Table 1: The R2-values for the fitting of alignment score distributions of synthetic sequences by statistical distribution functions

even        9        18       36       54       120
Gamma       0.9999   0.9999   0.9998   0.9998   0.9996
Extreme     0.9974   0.9921   0.9872   0.9910   0.9746
Poisson     0.9992   0.9999   0.9998   0.9997   0.9997
Binomial    0.9991   0.9998   0.9998   0.9997   0.9997

A.A. freq   9        18       36       54       120
Gamma       0.9998   0.9999   0.9999   0.9998   0.9997
Extreme     0.9950   0.9892   0.9852   0.9821   0.9749
Poisson     0.9997   0.9999   0.9999   0.9997   0.9997
Binomial    0.9996   0.9999   0.9999   0.9998   0.9997

Length   Binomial   Poisson       Gamma    Extreme
6        0.9989     [illegible]   0.9999   [illegible]
9        0.9999     [illegible]   1.0000   [illegible]
12       1.0000     1.0000        1.0000   [illegible]
15       1.0000     1.0000        1.0000   [illegible]
18       1.0000     1.0000        1.0000   [illegible]
21       1.0000     1.0000        1.0000   [illegible]
24       1.0000     1.0000        1.0000   [illegible]
27       1.0000     1.0000        1.0000   [illegible]
Figure 1 The observed and predicted alignment score distributions
Panels: (A) 4~8 even; (B) BL62 diagonal; (C) BL62 even freq.; (D) BL62 UniProt freq. Axes: alignment score (x), probability (y). Curves: observed and predicted score distributions for sequence lengths L9, L18, L36, L54 and L120.
Figure 2 The effect of amino acid frequencies to the distribution
Axes: alignment score (x), probability (y). Curves: observed distribution, prediction (UniProt), prediction (adjusted) and residuals for L9, L18 and L36.
Figure 3 The fitting of alignment score distributions of synthetic sequences with UniProt A.A. frequencies by 4 statistical distribution functions
Axes: alignment score (x), probability (y). Curves: binomial, Poisson, gamma and extreme value fits, one panel per sequence length. R2 values in the panels (Binomial/Poisson/Gamma/Extreme): L9 0.9996/0.9997/0.9998/0.9950; L18 0.9999/0.9999/0.9999/0.9892; L36 0.9999/0.9999/0.9999/0.9852; L54 0.9998/0.9997/0.9998/0.9821.
Figure 4 The fitting results on naturally occurring sequences
Axes: sequence length (x; 9, 18, 36, 54), R2 (y). Series: gamma, extreme value, Poisson and binomial, for ungapped and gapped alignments.
Figure 5 The calculation of distribution formula
Distribution panel: predicted score distributions (alignment score vs. probability) for sequence lengths 6 to 36, fitted with the gamma distribution f(x) = A (x - x0)^(k-1) exp(-(x - x0)/θ) / (Γ(k) θ^k).
Parameter panels (parameter vs. sequence length), with the fit labels as printed: k, f(x)=-3.557.x+2.310; θ, f(x)=1.378+0.398.(x/3.813)^-1.199; x0, f(x)=-4.114.x+2.447.
Figure 6 Alignment derived and parameter fit predicted distributions
Axes: alignment score (x), probability (y). Curves: observed and predicted distributions per sequence length. R2 by sequence length: L=9, 0.99958; L=18, 0.99974; L=36, 0.99965; L=54, 0.99965; L=72, 0.99972; L=90, 0.99963; L=120, 0.99960; L=150, 0.99952; L=180, 0.99962; L=210, 0.99966; L=240, 0.99992; L=270, 0.99998.