The document describes a theoretical derivation of the probability distribution of scores from ungapped global protein sequence alignments. It shows that the distribution can be expressed as the sum of probabilities from a multinomial distribution based on the number of matches between amino acid residues. Predicted distributions calculated from this formula closely match distributions observed from aligning synthetic protein sequences, validating the theoretical derivation. The score distribution shape depends on factors like sequence length, amino acid frequencies, and the substitution matrix used.
The derivation of ungapped global protein alignment score distributions - Part1
Abstract
For a gapless global protein sequence alignment of N residues, the score distribution can be
expressed as the sum of the probabilities of all multinomial outcomes that share the same
score. The score distribution itself depends on sequence length, the substitution matrix and
amino acid frequencies.
Introduction
The distribution of sequence alignment scores is an important determinant of the e-values
of sequence alignments. Given the importance of expected-value calculations for protein
alignment scores in biomedical research, understanding the exact nature of score
distributions is of great importance. Many studies have sought the function that best fits
the observed score distributions (refs). Although there is a weak consensus that the
best-performing function is an extreme value distribution, no study has derived the sequence
alignment score distribution theoretically. In this study, the formula for the probability
distribution of alignment scores is derived from a theoretical viewpoint, and the result is
then applied to compare theoretically predicted score distributions against observed score
distributions of randomly generated sequences. A comparison between the alignment score
distributions of naturally occurring sequences and the predictions is presented, followed by
the best-fit function, and finally an approach to generalization is discussed.
The methods
An examination of alignment score as multinomial event
Aligned protein sequences are paired sequences with matches and mismatches assigned
between the two sequences. In terms of score calculation (though not in a biological
context), the amino acids within each sequence have no correlation with or dependency on
neighboring residues, and are therefore completely independent in an ungapped sequence
alignment. On this basis, the pair-wise alignment score of sequences with k matches in N
residues was modeled as a multinomial event: k successes among N independent trials, each
with multiple possible outcomes. A match with a particular score (the matching score of each
residue type) was considered an outcome (success), and a mismatch was also taken as an
outcome (failure) within the N trials. Assume there are m residue types for a pair of
sequences composed of N residues. The sequences are aligned with x1 matches of residue type
r1, which has occurrence probability p1 and score s1 per match, through xm matches of residue
type rm, which has occurrence probability pm and score sm per match; a mismatch has the
residual probability pf = 1 - Σ{pi: i=1~m} left over from all matches. The probability P of
this multinomial event (this combination of residue types) and its score Sc are expressed as
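\[
P = \frac{N!}{x_1!\,x_2!\cdots x_m!\,(N-k)!}\; p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m}\, p_f^{\,N-k}
  = \frac{N!}{(N-k)!\,\prod_{i}^{m} x_i!}\; p_f^{\,N-k} \prod_{i}^{m} p_i^{x_i},
\qquad p_f = 1 - \sum_{i}^{m} p_i
\tag{1}
\]
where k = x1 + x2 + … + xm is the total number of matches, and
\[
S_c = x_1 s_1 + x_2 s_2 + \cdots + x_m s_m = \sum_{i}^{m} x_i s_i
\tag{2}
\]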
For a particular score Sc, the total probability is the sum of the probabilities of all
non-negative integer lattice points (x1, …, xm), with at most N matches in total, on the
hyperplane Sc = x1s1 + x2s2 + … + xmsm.
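The enumeration described by equations 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the in-house code used in the study: the function name, the two residue types and their probabilities and scores are all assumed toy values.

```python
from collections import defaultdict
from itertools import product
from math import factorial


def match_model_distribution(n, probs, scores):
    """Score distribution of the success/failure model (equations 1 and 2).

    n      -- number of aligned positions (N)
    probs  -- match probability p_i of each residue type
    scores -- score s_i per match of each residue type
    Returns {score: probability}.
    """
    p_fail = 1.0 - sum(probs)              # p_f, the mismatch probability
    dist = defaultdict(float)
    # Each tuple (x_1..x_m) with x_1 + ... + x_m <= N is one lattice point.
    for counts in product(range(n + 1), repeat=len(probs)):
        k = sum(counts)                    # total number of matches
        if k > n:
            continue
        denom = factorial(n - k)
        for x in counts:
            denom *= factorial(x)
        prob = (factorial(n) // denom) * p_fail ** (n - k)   # equation 1
        for x, p in zip(counts, probs):
            prob *= p ** x
        score = sum(x * s for x, s in zip(counts, scores))   # equation 2
        dist[score] += prob                # same score -> probabilities summed
    return dict(dist)


# Toy values (assumed for illustration): two residue types, three positions.
toy = match_model_distribution(3, probs=[0.1, 0.2], scores=[4, 5])
total = sum(toy.values())
```

The loop visits (N+1)^m candidate tuples and discards those with more than N matches, which is simple but wasteful; it is only meant to make the summation over lattice points concrete.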
The prediction of alignment score distributions
The matching probability was defined as the probability with which a given amino acid is
observed as a “match” in the alignment result. Each amino acid has its own scores for match
and mismatch events in a substitution matrix (such as BLOSUM62; refs). In the self-match-only
case (success/failure model), the matching probability was therefore (1/20)^2 for synthetic
sequences with even frequencies, and Fobs^2 for each amino acid observed with frequency Fobs
(amino acid frequency) in naturally occurring sequences. Because a substitution matrix defines
matching/similarity scores or mismatch penalties for every pair of residues, it was necessary
to calculate the probabilities of all score types (score or penalty; refs). The occurrence
probability F(i,j) for a pair of amino acids i, j was defined as F(i,j) = Fobs(i)·Fobs(j). The
score types St were defined as the score values found within the substitution matrix (in this
study, -4 to 9 and 11 from BLOSUM62), and Pt(k) for score type k was the sum of the
probabilities of all matrix elements with score St(k). In this case there is no success or
failure. The calculation on the BLOSUM62 matrix gave
St = {-4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11} and
Pt(k) = {0.0415, 0.1806, 0.2335, 0.2275, 0.1429, 0.0666, 0.0373, 0.0104, 0.0287, 0.0160,
0.0111, 0.0031, 0.0005, 0.0002, 0.0001}, based on amino acid frequencies from UniProt (2013).
It should be noted that this occurrence probability distribution of score types depends on
the amino acid frequencies and affects the outcome. The combinations of pairs of residue
types and their occurrence probabilities can be replaced by these score types and their
occurrence probabilities. Thus, rather than dealing with 210 combinations of residue types,
only 15 (BLOSUM62) to 20 (PAM250) score types need to be considered.
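The collapse from residue pairs to score types can be sketched as follows. The three-letter alphabet, its frequencies and the small symmetric matrix are hypothetical stand-ins chosen to keep the example short; they are not BLOSUM62 or UniProt values, and the sketch assumes that Pt(k) is accumulated over every ordered pair (i, j) of the full matrix.

```python
from collections import defaultdict


def score_type_probabilities(freqs, matrix):
    """Collapse residue pairs into score types.

    freqs  -- {residue: frequency}, frequencies summing to 1
    matrix -- {(a, b): score} for every ordered residue pair
    Returns (St, Pt): sorted score values and their occurrence probabilities.
    """
    totals = defaultdict(float)
    for a, fa in freqs.items():
        for b, fb in freqs.items():
            # F(i,j) = Fobs(i) * Fobs(j), accumulated per score value
            totals[matrix[(a, b)]] += fa * fb
    score_types = sorted(totals)
    return score_types, [totals[s] for s in score_types]


# Hypothetical three-letter alphabet and symmetric matrix (not BLOSUM62).
toy_freqs = {"A": 0.5, "L": 0.3, "W": 0.2}
toy_matrix = {
    ("A", "A"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("A", "L"): -1, ("L", "A"): -1,
    ("A", "W"): -3, ("W", "A"): -3,
    ("L", "W"): -2, ("W", "L"): -2,
}
St, Pt = score_type_probabilities(toy_freqs, toy_matrix)
```

In this toy case the two residues that share a match score of 4 end up in a single score type, which is the same reduction that takes BLOSUM62 from 210 residue-pair combinations to 15 score types.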
For pairs of aligned sequences, the sequence score distribution is given by the following
equations with t score types. The occurrence probability p(N, xti, Pti: i=1~t) of an alignment
of N-residue sequences that contains xti positions of score type Sti, each with probability
Pt(i), can be expressed as
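\[
p(N, x_{ti}, P_{ti} : i = 1 \sim t) = \frac{N!}{x_{t1}!\,x_{t2}!\cdots x_{tt}!}\;
p_{t1}^{x_{t1}}\, p_{t2}^{x_{t2}} \cdots p_{tt}^{x_{tt}},
\qquad \sum_{i}^{t} x_{ti} = N
\tag{3}
\]
\[
S_c = x_{t1} S_{t1} + x_{t2} S_{t2} + \cdots + x_{tt} S_{tt}
\tag{4}
\]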
By enumerating all combinations with score Sc and summing their probabilities, the alignment
score distribution for a particular parameter set (amino acid frequencies and substitution
matrix) can be calculated as Pc = Σ{p : score is Sc}.
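A minimal Python sketch of this enumeration over score types follows. It uses the St and Pt values quoted in the text for BLOSUM62; the function names and the recursive generator are illustrative assumptions, not the in-house implementation.

```python
from collections import defaultdict
from math import factorial

# Score types and probabilities quoted in the text (BLOSUM62, UniProt 2013).
ST = [-4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
PT = [0.0415, 0.1806, 0.2335, 0.2275, 0.1429, 0.0666, 0.0373, 0.0104,
      0.0287, 0.0160, 0.0111, 0.0031, 0.0005, 0.0002, 0.0001]


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def score_distribution(n, score_types=ST, probs=PT):
    """Sum the multinomial probabilities (equation 3) per total score (equation 4)."""
    dist = defaultdict(float)
    for counts in compositions(n, len(score_types)):
        denom = 1
        for x in counts:
            denom *= factorial(x)
        prob = factorial(n) // denom           # multinomial coefficient
        for x, p in zip(counts, probs):
            prob *= p ** x
        score = sum(x * s for x, s in zip(counts, score_types))
        dist[score] += prob
    return dict(sorted(dist.items()))


dist3 = score_distribution(3)   # N = 3 already needs 680 combinations
```

Only very short lengths are practical with this direct approach, which is consistent with the computational cost discussed in the results.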
Comparison of predictions from the derived probability formula with alignment scores of
randomly generated synthetic sequences
The protein sequences were generated randomly with (1) even frequencies for all 20
amino acids or (2) the frequencies observed in real sequences (ref: UniProt 2013), in order to
address the obvious effect of sequence length and the hypothesis that amino acid frequency is
a potential factor in the shape of the distribution. Synthetic sequences were generated with
varied lengths (9/18/36/54/72/90/120/150/180/210/240/270) in order to obtain better
compliance with the UniProt amino acid frequency statistics. Natural sequence fragments were
generated from the CATH 3.4 domain database (ref: CATH), which was fragmented into the given
lengths without overlap in order to avoid the effect of overlapping sequences. A total of 5000
sequences were randomly selected from the fragments and aligned pair-wise in all combinations.
The sequence alignments were performed with the Needleman-Wunsch algorithm using
modified/unmodified BLOSUM62 or synthetic substitution matrices, on both natural and synthetic
random sequences. A second apparent hypothesis was that the substitution matrix is a
determining factor for the score distributions. The score distributions for each substitution
matrix were calculated by binning the alignment scores. The following section describes the
details of the calculations.
The scoring schemes were (1) matching with a score of 1; (2) evenly distributed
match scores of 4~8 (4 residues each) for the 20 commonly observed amino acids, without
mismatch penalties, to mimic the BLOSUM62 substitution matrix in a simplified manner; (3) a
matrix derived from the diagonal components of the BLOSUM62 substitution matrix, with neither
mismatch penalties nor scores for the rest of the matrix (match scores of 4/5/6/7/8/9/11 with
5/6/4/2/1/1/1 occurrences, respectively); and (4) BLOSUM62. Sequence alignments were
performed by in-house code with all matrices, using default (-10, -2) or very high (9999 for
both, to make the alignment gapless) gap opening and extension penalties. The scores were
binned with an interval width of 1 and assembled. The predictions of the score distributions
were performed by in-house code that calculated all combinations of occurrences of multinomial
events (all score types) for a given scoring scheme; the probabilities of occurrences that
belonged to the same score were enumerated and summed. The predicted probability distributions
were then compared with the distributions calculated from the pair-wise alignment scores of
synthetic/natural sequences. The occurrence probabilities for the predictions were taken from
the statistics of the UniProt 2013 dataset, or adjusted to the observed amino acid frequencies
in the case of naturally occurring sequences, because the observed amino acid frequencies
varied from set to set.
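The observed side of this comparison can be sketched for scoring scheme (2) with even amino acid frequencies. The sketch assumes that, for equal-length sequences with prohibitive gap penalties, the global alignment reduces to a position-by-position comparison; the assignment of particular residues to the scores 4~8 is not stated in the text and is chosen arbitrarily here, as are the function names and the sample size.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: which residues receive which score is not stated in the text;
# here consecutive blocks of four letters get match scores 4, 5, 6, 7, 8.
MATCH_SCORE = {aa: 4 + i // 4 for i, aa in enumerate(AMINO_ACIDS)}


def random_sequence(length, rng):
    """Synthetic sequence with even frequencies for all 20 amino acids."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def ungapped_score(seq_a, seq_b):
    """Position-by-position score: match score on identity, 0 otherwise."""
    return sum(MATCH_SCORE[a] for a, b in zip(seq_a, seq_b) if a == b)


def observed_distribution(length, n_sequences, seed=0):
    """Score all sequence pairs and bin the scores with an interval width of 1."""
    rng = random.Random(seed)
    seqs = [random_sequence(length, rng) for _ in range(n_sequences)]
    counts = Counter(
        ungapped_score(seqs[i], seqs[j])
        for i in range(n_sequences)
        for j in range(i + 1, n_sequences)
    )
    n_pairs = n_sequences * (n_sequences - 1) // 2
    return {score: c / n_pairs for score, c in sorted(counts.items())}


observed = observed_distribution(length=9, n_sequences=200)
```

Because every match contributes at least 4 and mismatches contribute 0, scores of 1, 2 and 3 cannot occur under this scheme, whatever the random sample.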
The fitting of alignment score distributions
MATLAB code utilizing “lsqnonlin” was used to fit the score distributions with
various statistical distribution functions (binomial/Poisson/gamma/extreme value
distributions). These functions were modified to take into account a shift along the X-axis
(score) and a peak width adjustment. The amplitude was also adjustable. R-squared values were
calculated to examine the fits, along with visual inspection for differences in skewness
and/or systematic biases in the fitting residuals. The actual functions used to fit the data
are given below. The binomial and Poisson distributions were extended from discrete functions
to continuous forms by replacing factorials with gamma functions, because of technical issues
with discrete fitting procedures.
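Binomial distribution
\[
f_B(x) = A \cdot \frac{\Gamma(N+1)}
{\Gamma\!\left(\frac{x-x_0}{w}+1\right) \cdot \Gamma\!\left(N-\frac{x-x_0}{w}+1\right)}\;
p^{\frac{x-x_0}{w}}\,(1-p)^{N-\frac{x-x_0}{w}}
\]
Poisson distribution
\[
f_p(x) = A \cdot \frac{\lambda^{\frac{x-x_0}{w}}}{\Gamma\!\left(\frac{x-x_0}{w}+1\right)}\; e^{-\lambda}
\]
Gamma distribution
\[
f_g(x) = A \cdot \frac{(x-x_0)^{k-1}}{\Gamma(k) \cdot \theta^{k}}\; e^{-\frac{x-x_0}{\theta}}
\]
Extreme value distribution
\[
f_e(x) = A \cdot e^{\left(\frac{x-x_0}{w} - e^{\frac{x-x_0}{w}}\right)}
\]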
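The four fitting forms above can be written as plain Python functions, as a sketch of what a least-squares routine would evaluate. Only the standard library is used and the fitting step itself is omitted; the function and argument names are assumptions, with `amp`, `x0` and `w` standing for the amplitude A, the score shift and the width, and `theta` for the scale of the gamma form.

```python
import math


def binomial_form(x, amp, n, p, x0, w):
    """Continuous binomial: factorials replaced by gamma functions.

    Valid for 0 <= (x - x0) / w <= n and 0 < p < 1.
    """
    u = (x - x0) / w
    log_coeff = math.lgamma(n + 1) - math.lgamma(u + 1) - math.lgamma(n - u + 1)
    return amp * math.exp(log_coeff + u * math.log(p) + (n - u) * math.log(1 - p))


def poisson_form(x, amp, lam, x0, w):
    """Continuous Poisson; valid for (x - x0) / w >= 0 and lam > 0."""
    u = (x - x0) / w
    return amp * math.exp(u * math.log(lam) - math.lgamma(u + 1) - lam)


def gamma_form(x, amp, k, theta, x0):
    """Shifted gamma density; x <= x0 is treated as outside the support."""
    z = x - x0
    if z <= 0:
        return 0.0
    log_pdf = (k - 1) * math.log(z) - math.lgamma(k) - k * math.log(theta) - z / theta
    return amp * math.exp(log_pdf)


def extreme_value_form(x, amp, x0, w):
    """exp(u - exp(u)) with u = (x - x0) / w; overflows for very large u."""
    u = (x - x0) / w
    return amp * math.exp(u - math.exp(u))
```

Working in log space with `math.lgamma` avoids overflow in the gamma functions for longer sequences, which is a design choice of this sketch rather than something stated in the text.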
The examination of parameter trajectories with changing sequence length
MATLAB code for simple linear and exponential fits was written and used to obtain
a length-dependent formula for calculating the probabilities of ungapped global alignment
score distributions. Score distributions were predicted for varied sequence lengths in order
to examine the qualitative and quantitative trajectories of the parameters as the sequence
length changes. The best-fit model, an extended gamma distribution, has 3 or 4 parameters,
and each parameter was fitted with a linear or exponential function, depending on the shape
of its curve as a function of sequence length, in order to examine whether the parameters are
predictable as functions of sequence length.
Results
The derivation of score probability density formula
Viewing alignment results as a simple multinomial event with a particular set of residue
types as outcomes (score/match) led to the well-known formula of the multinomial
distribution. This distribution by itself does not give a score distribution, because it
gives the probability of the occurrence of a “combination” of residue types. In order to
calculate the distribution of scores, all conceivable combinations need to be enumerated and
all probabilities with the same score value need to be summed.
Regardless of how the sequence alignment is performed, a pair-wise alignment score
of sequences can be interpreted as a simple multinomial event. A match with a particular score
type is an outcome (see the methods section). Thus, we can model it as a simple multinomial
event with N trials (refs) and m outcomes (residue types). In the same sense, multiple residue
types with the same score type can be combined into a single entity, because it is the
combination of score types that determines the score, not the “sequences” themselves. Thus,
no matter how the sequences are varied or shuffled, as long as there are “matches of 3 Ala,
1 Asn, 5 Leu, 1 Tyr and 1 Val”, they are the same. This can then be reduced to “9 matches of
score 4, 1 match of score 6, and 1 match of score 7”. The formulas for the probability of a
given combination of residue types and for its score are given by equations 1 and 2.
Following the points above, the calculation was reduced further by converting residue types
to score types, which transformed the formulas into equations 3 and 4, expressed in score
types rather than residue types. Because the formulas for the score and its probability of
occurrence were derived theoretically, the enumeration of all combinations of score types for
a given sequence length and the summation of the probabilities of combinations with the same
score are possible by simple computation. Unfortunately, this calculation is extremely
computationally expensive: the number of combinations grows rapidly with sequence length, so
the computation does not scale to longer sequences. There are 15 score types; therefore, the
number of combinations is given by f(L) = (L+14)/L·f(L-1), where L is the sequence length. On
a workstation with a 2.66 GHz Xeon, the computation took over 9 days in a single thread. An
increase in sequence length results in a significant increase in load, and computation with
reasonable resources within a reasonable time was out of reach. For a particular score, the
total probability is the sum of the probabilities of all combinations of non-negative integer
lattice points on the hyperplane expressed by equation 4. Unfortunately, the derivation or
calculation of such points is beyond this study, and the subject would be a study of its own
for mathematicians. It appears not to be an easy task, and may simply not be possible, for a
biomedical researcher such as myself.
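The growth in the number of combinations can be checked directly from the recurrence f(L) = (L+14)/L·f(L-1). The sketch below assumes f(0) = 1 and compares the result with the closed form C(L+14, 14), the number of combinations with repetition of 15 score types over L positions; the function name is illustrative.

```python
from math import comb


def combination_count(length, score_types=15):
    """Number of multinomial combinations via f(L) = (L + t - 1) / L * f(L - 1)."""
    count = 1                                   # f(0): the empty alignment
    for l in range(1, length + 1):
        # Multiply before dividing so the intermediate value stays an integer.
        count = count * (l + score_types - 1) // l
    return count


for length in (9, 18, 36):
    # comb(L + 14, 14) is the closed form of the same recurrence for 15 types.
    assert combination_count(length) == comb(length + 14, 14)
    print(length, combination_count(length))
```

Each combination also requires a product of up to 15 powers and a multinomial coefficient, so the run time grows at least as fast as these counts.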
The validation of predictions
The score distribution can be calculated from the formula derived in the methods section
(equations 1 and 2) as the sum of the probabilities of all residue combinations with the
same score. Figure 1 shows the comparison of the predicted distributions and those calculated
from the alignment scores of synthetic sequences. The case of even occurrence probability for
match scores of 4~8 is shown in panel A. The two data sets matched almost completely, with
R2 > 0.99996. As there is no penalty and the scores are between 4 and 8, the distribution is
not bell-shaped but has a more complex shape, particularly for the shorter sequences. This
result strongly indicates that a score distribution is not necessarily a bell-shaped curve
that can be expressed by a simple statistical distribution function. In this case, for
simplicity, the amino acid frequencies were the same (1/20) for all residue types and the
substitution matrix was 0 for non-diagonal elements; the scores [4,5,6,7,8] were each
assigned to 4 diagonal positions. There were no scores of 1/2/3 because no combination
produces these values; thus, neither the prediction nor the actual alignment scores showed
them. Scores 4~7 were equally probable because each arises from a single combination, whereas
a score of 8 arises from a single 8 or from two 4s and is therefore more probable, and so on.
As the sequence length increased, the distributions became smoother, although some jaggedness
remained even with 120 residues. The distributions obtained with the synthetic substitution
matrix built from the BLOSUM62 diagonal are shown in panel B. With more complex score types
with differing occurrences (scores [4,5,6,7,8,9,11] with 5,6,4,2,1,1,1 occurrences), the
distributions were slightly smoother but basically the same as with the 4-8 matrix. The R2
values ranged from 0.9987 to 0.9999. The use of the BLOSUM62 matrix was computationally
expensive for predicting longer sequences, as the execution time increases steeply with
length (the number of multinomial combinations is given by combinations with repetition,
15HN = 15+N-1CN). Thus, only sequence lengths of 9, 18 and 36 were predicted and are shown.
The results for ungapped sequence alignments of synthetic sequences with uniform amino acid
frequencies are shown in panel C. The match is almost perfect, with R2 = 0.9996~0.9999 for
all lengths. As expected, the sequence score distributions are length dependent, but not in a
simple manner. With the full BLOSUM62 matrix there are 15 types of multinomial events (match
scores of 4~9 and 11, similarity scores of 1~3, and mismatch penalties of -4~0), which
completely smooths out the distributions. Unfortunately, more score types mean more nested
loops in the calculation, and it was very computationally expensive to calculate the
distributions of longer sequences. The calculation time increases by a factor of
e^(length/2.28), which makes it impractical (less than 1 s for a 9-residue sequence but over
9 days for 36 residues). Nevertheless, this examination confirms that the formula is correct
and that the substitution matrix determines the shape of the distributions. Panel D shows the
results for the same BLOSUM62 matrix with amino acid frequencies calculated from UniProt
(release 2013_03) statistics. The predictions and the distributions of actual alignment
scores of synthetic sequences match very well (R2 > 0.99996). An interesting point is that
the score distributions are shifted toward higher scores. This may come from the difference
in the overall matching probability for positive scores between even and observed amino acid
frequencies (0.155/0.174, respectively). Thus, the hypothesis that amino acid frequency
affects the shape of the score distribution is confirmed.
Figure 2 shows a small discrepancy between the predictions and the score distributions from
alignments of naturally occurring sequences. This was expected, as the previous section
showed that differences in amino acid frequencies change the score distributions (Figure 1,
panels C/D). The discrepancy was assumed to be caused by small differences between the amino
acid frequencies of the actual fragment datasets from the CATH 3.4 database and the
frequencies adopted from UniProt statistics. The amino acid frequencies of the actual
sequences used for the alignments were calculated, and the predictions were re-calculated
with these adjusted amino acid frequencies. There were 3.2~6.1% deviations in frequency per
residue type (as median values of the change in frequencies) between the CATH 3.4 derived
sequence datasets and the UniProt values. After adjusting the frequencies for the
predictions, the R2 of the fits rose from 0.9982/0.9975/0.9970 to 0.9999/0.9999/0.9997, and
the obvious systematic residuals were mostly eliminated. Thus, even relatively small
differences in amino acid frequencies affect the distribution at an easily detectable level.
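The text does not spell out how R2 was computed. One common definition, assumed here, is the coefficient of determination 1 - SSres/SStot evaluated over the binned scores, with the observed distribution as the reference. The sketch below uses hypothetical toy distributions only to exercise the function.

```python
def r_squared(observed, predicted):
    """Coefficient of determination between two binned distributions.

    Both arguments map score -> probability; a score missing from one
    distribution is treated as probability 0 there. The observed
    distribution must not be constant, otherwise SStot is zero.
    """
    scores = sorted(set(observed) | set(predicted))
    obs = [observed.get(s, 0.0) for s in scores]
    pred = [predicted.get(s, 0.0) for s in scores]
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy distributions, for illustration only.
toy_predicted = {0: 0.50, 4: 0.30, 8: 0.20}
toy_observed = {0: 0.48, 4: 0.32, 8: 0.20}
toy_r2 = r_squared(toy_observed, toy_predicted)
```

Under this definition a systematic shift between prediction and observation lowers R2 even when the two curves have the same shape, which matches the sensitivity to amino acid frequencies described above.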
The fitting of score distributions
The score distributions were fitted with 4 families of statistical distribution functions:
(1) the binomial distribution, (2) the Poisson distribution, (3) the gamma distribution and
(4) the extreme value distribution. As sequence length strongly affects the shape of the
distributions, fittings were performed at 9/18/36/54 residues. As shown in Figure 1, the
synthetic substitution matrices with neither mismatch penalties nor similarity scores give
jagged rather than bell-shaped distributions, and it is futile to look for a function that
can fit these distributions. Thus, only the smooth distributions of BLOSUM62 with/without
gaps were fitted with these functions. Figure 3 and Table 1 show the results of fitting the 4
functions to pair-wise alignment score distributions from synthetic sequences with different
sequence lengths and amino acid frequencies. Clearly, the gamma distribution was the best
function for the BLOSUM62 results at all sequence lengths studied, with the highest R2 and
the smallest systematic residuals. The Poisson and binomial distributions also performed well
for longer sequences but not as well as the gamma distribution for short ones. The extreme
value distribution showed an obvious difference in skewness that could not be compensated by
its parameters and was the worst of the 4 functions. In the case of a substitution matrix
with 1 at diagonal positions and 0 at all other positions (representing a Bernoulli trial),
the score distributions, as expected, matched binomial distributions completely (data not
shown). However, as can be seen in Figure 1A/B, the distributions from substitution matrices
with multiple score types were not simple convolutions of binomial distributions. The
binomial and Poisson distributions could explain the shape of the distributions; in order to
fit well, however, they required non-unity amplitude parameters. This does not sit well with
the nature of probability density (mass) functions; from this point of view as well, the
gamma distribution is the best distribution for ungapped alignment score distributions.
These functions were also tested with naturally occurring sequences derived from the CATH
domain dataset (Figure 4). The pair-wise global sequence alignments were computed in both
ungapped and gapped modes. The results were the same as for the synthetic sequence
alignments. Despite the differences in amino acid frequencies that affected the distribution
shapes, all models tested (even, UniProt and CATH derived amino acid frequencies) resulted in
the same best-fit distribution (the gamma distribution). Therefore, it is reasonable to
conclude that the alignment score distributions of sequences aligned with the BLOSUM62 matrix
follow a gamma distribution. Fitting the predicted distributions, which were deemed error
free because the probabilities are calculated precisely from the exact analytical solution,
with the gamma distribution resulted in R2 values of >0.99999 (length 12 or longer) or
0.999900/0.999979 for lengths of 6 and 9, respectively.
The derivation of formulas for parameter prediction as functions of sequence length, and
the evaluation of predictability with extrapolated parameters
In the previous sections, it was shown that the multinomial event model can exactly predict
the score distributions for different substitution matrices, lengths and amino acid
frequencies. As explained already, however, the enumeration of combinations and the
calculation of probabilities for scores are extremely computationally expensive, because the
number of combinations grows steeply with length. Therefore, we sought to use this predictive
power to generate a series of distributions as a function of sequence length and to fit these
models with the gamma distribution in order to predict the parameters. As shown in Figure 5,
the parameters in the short sequence range behaved well in terms of predictability. The gamma
distribution has two shape-determining parameters (k/θ), and in this study two more
parameters were introduced: x0, an x-shift to handle negative scores (in order to avoid
complex parameters or undefined gamma function values), and an amplitude, in case the
amplitude needed adjusting. The gamma distribution did not require amplitude adjustment, as
the “best fit” amplitudes for the varied sequence lengths were almost exactly 1.0. This is a
good property for the distribution function, because the fitting is performed on a
probability density function, which should have an amplitude of exactly 1.0. Thus, in further
fittings, the amplitude parameter was fixed at 1.0. The multinomial event model was used to
predict the score distributions for different sequence lengths. These distribution models
were then fitted with the gamma distribution starting from estimated initial parameters. This
was an important precaution: in a multivariate fitting session, the initial parameters are
very important for the convergence of the fit and for arriving at a “series” of fitting
results that can be compared. If the initial parameters are not given consistently, the fits
are not guaranteed to fall into comparable local minima, or may not converge at all. Figure 5
shows the results of the distribution prediction and fitting, followed by the fitting of the
gamma distribution parameters (k/θ/x0) with the linear ( f(x)=a+bx ) or exponential function
( f(x)=a+b*power(x/c,d) ). The resultant fits were very good (>0.99999 for k/x0 and 0.99986
for θ). The parameters obtained from these distributions (sequence lengths of
6/9/12/15/18/21/24/27/30/33/36) were extrapolated to longer sequence lengths (54~270
residues), and distributions were calculated from the extrapolated parameters. These
calculated distributions (for lengths too long to be predicted by the exact formula within
reasonable time and resources) were then compared with actual alignment results for synthetic
sequences up to 270 residues long. Figure 6 shows the comparison of the actual sequence
alignment score distributions and the distributions calculated from extrapolated parameters.
These pairs of distributions matched with R2 values of 0.99952~0.99998 for sequence lengths
from 9 to 270. In order to evaluate the extrapolated parameters against parameters fitted to
the alignment score distributions, the score distributions from alignment results were fitted
with gamma distributions using the extrapolated parameters as initial values. The two sets of
parameters matched very well, with an average error of 0.15~0.38%.
Thus, if the amino acid frequencies of the database are known and the substitution matrix is
given, the distribution of alignment scores can be calculated with good accuracy as a
function of sequence length. This translates into precise calculation of alignment e-values
even without the exact combinatorial probability calculation and enumeration.
Computationally, this is the biggest benefit, as a regular workstation can calculate the
e-value precisely in a very short time (the gamma distribution parameters are given, thus
the cumulative probability function is known).
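Once the parameters of the shifted gamma distribution are known, the probability of reaching at least a given score follows from the cumulative function. The sketch below evaluates the regularized lower incomplete gamma function by its power series using only the standard library. The parameter values are hypothetical, the function names are assumptions, and the conversion from this tail probability to an e-value is not specified in the text and is left out.

```python
import math


def regularized_lower_gamma(a, x, tol=1e-14, max_terms=10000):
    """P(a, x) by its power series; adequate for moderate x in a sketch."""
    if x <= 0:
        return 0.0
    term = 1.0
    total = 1.0
    for n in range(1, max_terms):
        term *= x / (a + n)        # x^n / ((a+1)(a+2)...(a+n))
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a + 1))


def tail_probability(score, k, theta, x0):
    """P(S >= score) for a gamma distribution shifted by x0 (amplitude 1.0)."""
    return 1.0 - regularized_lower_gamma(k, (score - x0) / theta)


# Hypothetical parameters, chosen only to exercise the function.
p_tail = tail_probability(score=40.0, k=20.0, theta=2.0, x0=-30.0)
```

Subtracting the cumulative value from 1 loses precision for very small tail probabilities, so a production implementation would evaluate the upper incomplete gamma function directly.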