METHODS AND FORMULAS HELP V2.1 InnerSoft STATS
Mean
The arithmetic mean is the sum of a collection of numbers divided by the number of numbers in the
collection.
Sample Variance
The estimator of population variance, also called the unbiased sample variance, is:
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
Source: http://en.wikipedia.org/wiki/Variance
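As a minimal illustration of the formula above (our own sketch, not InnerSoft STATS code; the function name is ours):

```python
def sample_variance(xs):
    """Unbiased sample variance: sum of squared deviations divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)
```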
Sample Kurtosis
The estimator of population kurtosis is:

G_2 = \frac{k_4}{k_2^2} = \frac{(n+1)\,n}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(s^2)^2} - 3\,\frac{(n-1)^2}{(n-2)(n-3)}

where s^2 is the unbiased sample variance.
The standard error of the sample kurtosis of a sample of size n from the normal distribution is:
K\ \mathrm{Std.\ Error} = \sqrt{\frac{4\,[\,6n(n-1)^2(n+1)\,]}{(n-3)(n-2)(n+1)(n+3)(n+5)}}
Source: http://en.wikipedia.org/wiki/Kurtosis#Estimators_of_population_kurtosis
Sample Skewness
Skewness of a population sample is estimated by the adjusted Fisher-Pearson standardized moment coefficient:
G = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3
where n is the sample size and s is the sample standard deviation.
The standard error of the skewness of a sample of size n from a normal distribution is:
G\ \mathrm{Std.\ Error} = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}}
Source: https://en.wikipedia.org/wiki/Skewness#Sample_skewness
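A hedged Python sketch of the two formulas above (function names are ours):

```python
import math

def sample_skewness(xs):
    """Adjusted Fisher-Pearson standardized moment coefficient G."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample std dev
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

def skewness_std_error(n):
    """Standard error of G for a sample of size n from a normal distribution."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
```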
Total Variance
Variance of the entire population is:
\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}
Source: http://en.wikipedia.org/wiki/Variance
Total Kurtosis
Kurtosis of the entire population is:
G_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4 / n}{\sigma^4} - 3

where n is the sample size and \sigma is the total standard deviation.
Source: http://en.wikipedia.org/wiki/Kurtosis
Total Skewness
Skewness of the entire population is:
G = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3 / n}{\sigma^3}

where n is the sample size and \sigma is the total standard deviation.
Source: https://en.wikipedia.org/wiki/Skewness
Quantiles of a population
ISSTATS uses the same method as R-7, the Excel CUARTIL.INC (QUARTILE.INC) function, SciPy (1,1), SPSS and Minitab. Q_p, the estimate for the k-th q-quantile, where p = k/q and h = (N-1)p + 1, is computed by linear interpolation of the modes of the order statistics for the uniform distribution on [0, 1]:

Q_p = x_{\lfloor h \rfloor} + (h - \lfloor h \rfloor)\,\left( x_{\lfloor h \rfloor + 1} - x_{\lfloor h \rfloor} \right)

When p = 1, use x_N.
Source: http://en.wikipedia.org/wiki/Quantile#Estimating_the_quantiles_of_a_population
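The R-7 interpolation can be sketched as follows (an illustrative helper, not ISSTATS code):

```python
import math

def quantile_r7(xs, p):
    """R-7 / QUARTILE.INC style quantile: h = (N-1)*p + 1, then linear
    interpolation between the order statistics around position h."""
    data = sorted(xs)
    n = len(data)
    if p >= 1.0:
        return data[-1]
    h = (n - 1) * p + 1          # 1-based fractional position
    lo = math.floor(h)
    frac = h - lo
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])
```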
MSSD (Mean of the squared successive differences)
It is calculated by summing the squared differences between consecutive observations and dividing by twice the number of differences:

MSSD = \frac{\sum_{i=1}^{n-1} (x_{i+1} - x_i)^2}{2(n - 1)}
The MSSD has the desirable property that one half the MSSD is an unbiased estimator of true variance.
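A minimal sketch of the computation (our own function name):

```python
def mssd(xs):
    """Mean of the squared successive differences, halved."""
    n = len(xs)
    return sum((xs[i + 1] - xs[i]) ** 2 for i in range(n - 1)) / (2 * (n - 1))
```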
Pearson Chi Square Test
The value of the test-statistic is
\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

Where
• \chi^2 is Pearson's cumulative test statistic, which asymptotically approaches a \chi^2 distribution with (r - 1)(c - 1) degrees of freedom.
• O_i is the number of observations of type i.
• E_i is the expected (theoretical) frequency of type i.
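The statistic above can be sketched in a few lines of Python (illustrative only):

```python
def pearson_chi2(observed, expected):
    """Pearson's cumulative test statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```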
Yates's Continuity Correction
The value of the test-statistic is
\chi^2 = \sum_{i=1}^{n} \frac{\left( \max\{0,\ |O_i - E_i| - 0.5\} \right)^2}{E_i}

When |O_i - E_i| - 0.5 is below zero, the term contributes zero. The effect of Yates' correction is to prevent overestimation of statistical significance for small data. This formula is chiefly used when at least one cell of the table has an expected count smaller than 5.
Likelihood Ratio G-Test
The value of the test-statistic is
G = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij} \ln\!\left( \frac{O_{ij}}{E_{ij}} \right)

where
• O_ij is the observed count in row i and column j
• E_ij is the expected count in row i and column j
G has an asymptotic \chi^2 distribution with (r - 1)(c - 1) degrees of freedom when the null hypothesis is true and n is large enough.
Mantel-Haenszel Chi-Square Test
The Mantel-Haenszel chi-square statistic tests the alternative hypothesis that there is a linear association
between the row variable and the column variable. Both variables must lie on an ordinal scale. The
Mantel-Haenszel chi-square statistic is computed as:

Q_{MH} = (n - 1)\, r^2

where r is the Pearson correlation between the row variable and the column variable and n is the sample size. Under the null hypothesis of no association, Q_MH has an asymptotic chi-square distribution with one degree of freedom.
Fisher's Exact Test
Fisher's exact test assumes that the row and column totals are fixed, and then uses the hypergeometric distribution to compute probabilities of possible tables conditional on the observed row and column totals. Fisher's exact test does not depend on any large-sample distribution assumptions, so it is appropriate even for small sample sizes and for sparse tables. This test is computed for 2×2 tables such as

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}

For efficient computing, the elements of the matrix A are reordered as

A' = \begin{pmatrix} a' & b' \\ c' & d' \end{pmatrix}

where a' is the cell of A with the minimum marginal totals (minimum row and column totals). The test result does not depend on the arrangement of the cells.
The left-sided p-value sums the probabilities of all tables whose first cell is equal to or smaller than a':

p_{left} = P(x \le a') = \sum_{k=0}^{a'} \frac{\binom{K}{k} \binom{N-K}{m-k}}{\binom{N}{m}}

where K = a' + b', m = a' + c' and N = a' + b' + c' + d'.
The right-sided p-value sums the probabilities of all tables whose first cell is equal to or larger than a':

p_{right} = P(x \ge a') = \sum_{k=a'}^{\min\{K,\ m\}} \frac{\binom{K}{k} \binom{N-K}{m-k}}{\binom{N}{m}}

with K, m and N as above.
Most statistical packages output, as the one-sided test result, the minimum of p_left and p_right. The Fisher two-tailed p-value for a table A is defined as the sum of the probabilities of all tables consistent with the marginals that are as likely as, or less likely than, the observed table.
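The one-sided sums can be sketched with the standard library's `math.comb`; this is an illustrative implementation of the hypergeometric sums, not the ISSTATS code:

```python
from math import comb

def fisher_exact_sides(a, b, c, d):
    """Left- and right-sided Fisher p-values for a 2x2 table [[a, b], [c, d]],
    conditioning on the observed margins (hypergeometric distribution)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    def pmf(k):
        return comb(row1, k) * comb(n - row1, col1 - k) / denom
    k_min = max(0, col1 - (c + d))       # smallest feasible first cell
    k_max = min(row1, col1)              # largest feasible first cell
    p_left = sum(pmf(k) for k in range(k_min, a + 1))
    p_right = sum(pmf(k) for k in range(a, k_max + 1))
    return p_left, p_right
```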
McNemar's Test
This test is computed for 2×2 tables such as

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}
The value of the test-statistic is
\chi^2 = \frac{(b - c)^2}{b + c}
The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom.
Edwards Continuity Correction
The value of the test-statistic is
\chi^2 = \frac{\left( \max\{0,\ |b - c| - 1\} \right)^2}{b + c}

When |b - c| - 1 is below zero, the statistic is zero.
The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom.
McNemar Exact Binomial
Assume that b < c. Let n = b + c, and let B(x; n, p) denote the binomial probability mass function.

\mathrm{Two\text{-}sided\ p\text{-}value} = 2 \cdot (\mathrm{one\text{-}sided\ p\text{-}value}) = 2 \sum_{x=0}^{b} B(x;\, n,\, 0.5) = 2 \sum_{x=0}^{b} \binom{n}{x}\, 0.5^{x}\, 0.5^{n-x} = \frac{2}{2^n} \sum_{x=0}^{b} \binom{n}{x}

If b = c, the exact p-value equals 1.0.
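The exact two-sided p-value can be sketched as (function name is ours; the cap at 1.0 guards the doubling):

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact binomial p-value for McNemar's test."""
    if b == c:
        return 1.0
    lo = min(b, c)
    n = b + c
    p = 2 * sum(comb(n, x) for x in range(lo + 1)) / 2 ** n
    return min(p, 1.0)
```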
Mid-P McNemar Test
Let n = b + c and assume that b < c.

\mathrm{Mid\text{-}P\ value} = 2 \sum_{x=0}^{b} B(x;\, n,\, 0.5) - B(b;\, n,\, 0.5) = \frac{2}{2^n} \sum_{x=0}^{b} \binom{n}{x} - \binom{n}{b} \frac{1}{2^n}

If b = c, the mid p-value is

1.0 - \frac{1}{2} \binom{n}{b} \frac{1}{2^n}
Bowker's Test of Symmetry
This test is computed for an m-by-m square matrix as:

BW = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}}

For large samples, BW has an asymptotic chi-square distribution with m(m - 1)/2 - R degrees of freedom under the null hypothesis of symmetry, where R is the number of off-diagonal cells with n_ij + n_ji = 0.
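A minimal sketch of the statistic and its degrees of freedom (our own helper):

```python
def bowker_statistic(table):
    """Bowker's test of symmetry for a square m-by-m table; returns (BW, df)."""
    m = len(table)
    bw, skipped = 0.0, 0
    for i in range(m - 1):
        for j in range(i + 1, m):
            s = table[i][j] + table[j][i]
            if s == 0:
                skipped += 1      # cells with n_ij + n_ji = 0 reduce the df
                continue
            bw += (table[i][j] - table[j][i]) ** 2 / s
    return bw, m * (m - 1) // 2 - skipped
```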
Risk Test
Consider the 2×2 table

                        Disease status
Risk Factor     Cohort = Present    Cohort = Absent
Present                a                   b
Absent                 c                   d
Odds ratio
The odds ratio (Risk Factor = Present / Risk Factor = Absent) is computed as:
OR = \frac{a/b}{c/d}

The distribution of the log odds ratio is approximately normal with:

\log(\widehat{OR}) \sim N(\log(OR),\ SE^2)

The standard error for the log odds ratio is approximately

SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}
The 95% confidence interval for the odds ratio is computed as
\left[ \exp(\log(OR) - z_{0.025} \cdot SE)\ ;\ \exp(\log(OR) + z_{0.025} \cdot SE) \right]
To test the hypothesis that the population odds ratio equals one, the two-sided p-value is computed as

p_{significance} = 2 \cdot P\!\left( z \le \frac{-|\log(OR)|}{SE} \right)
Source: https://en.wikipedia.org/wiki/Odds_ratio
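A hedged sketch of the point estimate and the log-scale interval (z = 1.96 is assumed for the 95% level; the function name is ours):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for [[a, b], [c, d]] with an approximate 95% CI on the log scale."""
    or_ = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(or_)
    return or_, (math.exp(log_or - z * se), math.exp(log_or + z * se))
```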
Relative Risk
The relative risk (for cohort Disease status = Present) is computed as
š š =
š
š + šā
š
š + šā
The distribution of the log relative risk is approximately normal with:
š ~ š(log(šš ) , š2
)
The standard error for the log relative risk is approximately

SE = \sqrt{\frac{1}{a} + \frac{1}{c} - \frac{1}{a+b} - \frac{1}{c+d}}
The 95% confidence interval for the relative risk is computed as
\left[ \exp(\log(RR) - z_{0.025} \cdot SE)\ ;\ \exp(\log(RR) + z_{0.025} \cdot SE) \right]
To test the hypothesis that the population relative risk equals one, the two-sided p-value is computed as

p_{significance} = 2 \cdot P\!\left( z \le \frac{-|\log(RR)|}{SE} \right)
The relative risk (for cohort Disease status = Absent) is computed as

RR = \frac{b/(a+b)}{d/(c+d)}
Epidemiology Risk
All the parameters are computed for cohort Disease status = Present.
Attributable risk represents how much the risk factor increases or decreases the risk of disease:

AR = \frac{a}{a+b} - \frac{c}{c+d}

If AR > 0 there is an increase of the risk. If AR < 0 there is a reduction of the risk.
Relative Attributable Risk
RAR = \frac{\dfrac{a}{a+b} - \dfrac{c}{c+d}}{\dfrac{c}{c+d}} = \frac{AR}{c/(c+d)}

Number Needed to Harm

NNH = \frac{1}{\dfrac{a}{a+b} - \dfrac{c}{c+d}} = \frac{1}{AR}
The number needed to harm (NNH) is an epidemiological measure that indicates how many patients on
average need to be exposed to a risk-factor over a specific period to cause harm in an average of one
patient who would not otherwise have been harmed.
A negative value is not presented as an NNH; rather, since the risk factor is not harmful, it is expressed as a number needed to treat (NNT), or number needed to avoid exposure to the risk factor.
Attributable risk per unit

ARU = \frac{RR - 1}{RR}

Preventive fraction

PF = 1 - RR
Etiologic fraction is the proportion of cases in which the exposure has played a causal role in disease development:

EF = \frac{I_e - I_u}{I_e}

where I_e = a/(a+b) is the incidence among the exposed and I_u = c/(c+d) the incidence among the unexposed.
Similar parameters are computed for cohort Disease status = Absent.
Source: https://en.wikipedia.org/wiki/Relative_risk
Cohen's Kappa Test
Given a k-by-k square matrix, which collects the scores of two raters who each classify N items into k mutually exclusive categories, the equation for Cohen's kappa coefficient is

\hat{\kappa} = \frac{p_o - p_e}{1 - p_e}

Where

p_o = \sum_{i=1}^{k} \frac{f_{ii}}{N} = \sum_{i=1}^{k} p_{ii} \qquad \mathrm{and} \qquad p_e = \sum_{i=1}^{k} p_{i\cdot}\, p_{\cdot i}

with

p_{ij} = \frac{f_{ij}}{N} \qquad p_{i\cdot} = \sum_{j=1}^{k} \frac{f_{ij}}{N} \qquad p_{\cdot j} = \sum_{i=1}^{k} \frac{f_{ij}}{N}
The asymptotic variance is computed by
var(\hat{\kappa}) = \frac{1}{N(1 - p_e)^4} \left\{ \sum_{i=1}^{k} p_{ii} \left[ (1 - p_e) - (p_{\cdot i} + p_{i\cdot})(1 - p_o) \right]^2 + (1 - p_o)^2 \sum_{i=1}^{k} \sum_{j=1,\, j \ne i}^{k} p_{ij} (p_{\cdot i} + p_{j\cdot})^2 - (p_o p_e - 2 p_e + p_o)^2 \right\}
The formula is given by Fleiss, Cohen, and Everitt (1969), and modified by Fleiss (1981). The asymptotic standard error is the square root of the value given above. This standard error and the standard normal distribution N(0,1) are used to compute confidence intervals:

\hat{\kappa} \pm z_{\alpha/2} \sqrt{var(\hat{\kappa})}

To compute an asymptotic test for the kappa coefficient, ISSTATS uses a standardized test statistic T which has an asymptotic standard normal distribution under the null hypothesis that kappa equals zero (H0: \kappa = 0). The standardized test statistic is computed as

T = \frac{\hat{\kappa}}{\sqrt{var_0(\hat{\kappa})}} \sim N(0, 1)

where the variance of the kappa coefficient under the null hypothesis is

var_0(\hat{\kappa}) = \frac{1}{N(1 - p_e)^2} \left\{ p_e + p_e^2 - \sum_{i=1}^{k} p_{\cdot i}\, p_{i\cdot}\, (p_{\cdot i} + p_{i\cdot}) \right\}

Refer to Fleiss (1981).
Source: https://v8doc.sas.com/sashtml/stat/chap28/sect26.htm
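The point estimate above can be sketched as follows (an illustrative helper for a table of raw counts; the variance terms are omitted):

```python
def cohens_kappa(table):
    """Cohen's kappa for a k-by-k agreement table of raw counts."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(row[i] * col[i] for i in range(k))
    return (p_o - p_e) / (1 - p_e)
```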
Nominal by Nominal Measures of Association
Contingency Coefficient
The contingency coefficient is a measure of association between two nominal variables, giving a value between 0 and 1.

C = \sqrt{\frac{\chi^2}{\chi^2 + N}}

Where
• \chi^2 is Pearson's cumulative test statistic.
• N is the total sample size.
The significance of C is assessed through \chi^2, which asymptotically approaches a \chi^2 distribution with (r - 1)(c - 1) degrees of freedom.
Standardized Contingency Coefficient
If X and Y have the same number of categories (r = c = k), then the maximum value for the contingency coefficient is calculated as:

C_{max} = \sqrt{\frac{k - 1}{k}}

If X and Y have a differing number of categories (r \ne c), then the maximum value for the contingency coefficient is calculated as

C_{max} = \sqrt[4]{\frac{(r - 1)(c - 1)}{r\, c}}

The standardized contingency coefficient is calculated as the ratio:

C_{standardized} = \frac{C}{C_{max}}

which varies between 0 and 1, with 0 indicating independence and 1 dependence.
Phi coefficient
The phi coefficient is a measure of association for two nominal variables.
\Phi = \sqrt{\frac{\chi^2}{N}}

Where
• \chi^2 is Pearson's cumulative test statistic.
• N is the total sample size.
The significance of \Phi is assessed through \chi^2, which asymptotically approaches a \chi^2 distribution with (r - 1)(c - 1) degrees of freedom.
Cramer's V
Cramer's V is a measure of association between two nominal variables, giving a value between 0 and +1
(inclusive).
V = \sqrt{\frac{\chi^2}{N \cdot \min\{r - 1,\ c - 1\}}}

Where
• \chi^2 is Pearson's cumulative test statistic.
• N is the total sample size.
The significance of V is assessed through \chi^2, which asymptotically approaches a \chi^2 distribution with (r - 1)(c - 1) degrees of freedom.
Tschuprow's T
Tschuprow's T is a measure of association between two nominal variables, giving a value between 0 and
1 (inclusive).
T = \sqrt{\frac{\chi^2}{N \sqrt{(r - 1)(c - 1)}}}
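The four chi-square-based measures above differ only in their normalization, which a short sketch makes explicit (function name is ours):

```python
import math

def nominal_measures(chi2, n, r, c):
    """Contingency C, Phi, Cramer's V and Tschuprow's T from the chi-square statistic."""
    cont = math.sqrt(chi2 / (chi2 + n))
    phi = math.sqrt(chi2 / n)
    v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))
    t = math.sqrt(chi2 / (n * math.sqrt((r - 1) * (c - 1))))
    return cont, phi, v, t
```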
Lambda
Asymmetric lambda, λ(C/R) or column-variable dependent, is interpreted as the probable improvement in predicting the column variable Y given knowledge of the row variable X. The range of asymmetric lambda is {0, 1}. Asymmetric lambda (C/R), or column-variable dependent, is computed as

\lambda(C/R) = \frac{\sum_i r_i - r}{N - r}

The asymptotic variance is

var(\lambda(C/R)) = \frac{N - \sum_i r_i}{(N - r)^3} \left( \sum_i r_i + r - 2 \sum_i (r_i \mid l_i = l) \right)

Where

r_i = \max_j \{n_{ij}\} \qquad r = \max_j \{n_{\cdot j}\} \qquad c_j = \max_i \{n_{ij}\} \qquad c = \max_i \{n_{i\cdot}\}

The values of l_i and l are determined as follows. Denote by l_i the unique value of j such that r_i = n_{ij}, and let l be the unique value of j such that r = n_{\cdot j}. Because of the uniqueness assumptions, ties in the frequencies or in the marginal totals must be broken in an arbitrary but consistent manner. In case of ties, l is defined as the smallest value of j such that r = n_{\cdot j}.
For those columns containing a cell (i, j) for which n_{ij} = r_i = c_j, cs_j records the row in which c_j is assumed to occur. Initially cs_j is set equal to -1 for all j. Beginning with i = 1, if there is at least one value j such that n_{ij} = r_i = c_j, and if cs_j = -1, then l_i is defined to be the smallest such value of j, and cs_j is set equal to i. Otherwise, if n_{il} = r_i, then l_i is defined to be equal to l. If neither condition is true, then l_i is taken to be the smallest value of j such that n_{ij} = r_i.
The asymptotic standard error is the square root of the asymptotic variance.
The formulas for lambda asymmetric Ī»(R/C) can be obtained by interchanging the indices.
\lambda(R/C) = \frac{\sum_j c_j - c}{N - c}

The symmetric lambda is the average of the two asymmetric lambdas, λ(C/R) and λ(R/C). Its range is {0, 1}. Lambda symmetric is computed as

\lambda = \frac{\sum_i r_i + \sum_j c_j - r - c}{2N - r - c}
The asymptotic variance is
var(\lambda) = \frac{1}{w^4} \left\{ wvy - 2w^2 \left[ N - \sum_i \sum_j (n_{ij} \mid j = l_i,\ i = k_j) \right] - 2v^2 (N - n_{kl}) \right\}

Where

w = 2N - r - c \qquad v = 2N - \sum_i r_i - \sum_j c_j

x = \sum_i (r_i \mid l_i = l) + \sum_j (c_j \mid k_j = k) + r_k + c_l \qquad y = 8N - w - v - 2x
The definitions of l and li are given in the previous section. The values k and kj are defined in a similar
way for lambda asymmetric (R/C).
Uncertainty Coefficient
The uncertainty coefficient U(C/R), or column-variable dependent U, measures the proportion of uncertainty (entropy) in the column variable Y that is explained by the row variable X. Its range is {0, 1}. The uncertainty coefficient is computed as

U(C/R) = \frac{v}{w} = \frac{H(X) + H(Y) - H(XY)}{H(Y)}

Where

H(X) = -\sum_i \frac{n_{i\cdot}}{N} \ln\!\left( \frac{n_{i\cdot}}{N} \right) \qquad H(Y) = -\sum_j \frac{n_{\cdot j}}{N} \ln\!\left( \frac{n_{\cdot j}}{N} \right) \qquad H(XY) = -\sum_i \sum_j \frac{n_{ij}}{N} \ln\!\left( \frac{n_{ij}}{N} \right)
The asymptotic variance is
var(U(C/R)) = \frac{1}{N^2 H(Y)^4} \sum_i \sum_j n_{ij} \left\{ H(Y) \ln\!\left( \frac{n_{ij}}{n_{i\cdot}} \right) + (H(X) - H(XY)) \ln\!\left( \frac{n_{\cdot j}}{N} \right) \right\}^2

The asymptotic standard error is the square root of the asymptotic variance.
The formulas for the uncertainty coefficient U(R/C) can be obtained by interchanging the indices.
The symmetric uncertainty coefficient is computed as
U = \frac{2\,[H(X) + H(Y) - H(XY)]}{H(X) + H(Y)}

The asymptotic variance is

var(U) = \frac{4 \sum_i \sum_j n_{ij} \left\{ H(XY) \ln\!\left( \frac{n_{i\cdot}\, n_{\cdot j}}{N^2} \right) - (H(X) + H(Y)) \ln\!\left( \frac{n_{ij}}{N} \right) \right\}^2}{N^2 \left( H(X) + H(Y) \right)^4}

The asymptotic standard error is the square root of the asymptotic variance.
Ordinal by Ordinal Measures of Association
Let n_ij denote the observed frequency in cell (i, j) of an I×J contingency table. Let N be the total frequency and

A_{ij} = \sum_{k<i} \sum_{l<j} n_{kl} + \sum_{k>i} \sum_{l>j} n_{kl}

D_{ij} = \sum_{k<i} \sum_{l>j} n_{kl} + \sum_{k>i} \sum_{l<j} n_{kl}

P = \sum_i \sum_j n_{ij} A_{ij} \qquad \mathrm{and} \qquad Q = \sum_i \sum_j n_{ij} D_{ij}
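The concordance sums can be sketched directly from these definitions (a brute-force illustration, not optimized production code):

```python
def concordance_counts(table):
    """P and Q from the A_ij / D_ij concordance sums of an I x J table."""
    I, J = len(table), len(table[0])
    P = Q = 0
    for i in range(I):
        for j in range(J):
            a = sum(table[k][l] for k in range(I) for l in range(J)
                    if (k < i and l < j) or (k > i and l > j))
            d = sum(table[k][l] for k in range(I) for l in range(J)
                    if (k < i and l > j) or (k > i and l < j))
            P += table[i][j] * a
            Q += table[i][j] * d
    return P, Q
```

For a perfectly concordant table, Q is 0 and gamma = (P - Q)/(P + Q) equals 1.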
Gamma Coefficient
The gamma (G) statistic is based only on the number of concordant and discordant pairs of observations.
It ignores tied pairs (that is, pairs of observations that have equal values of X or equal values of Y).
Gamma is appropriate only when both variables lie on an ordinal scale. The range of gamma is {-1, 1}. If
the row and column variables are independent, then gamma tends to be close to zero.
Gamma is estimated by

G = \frac{P - Q}{P + Q}
The asymptotic variance is
var(G) = \frac{16}{(P + Q)^4} \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} \left( Q A_{ij} - P D_{ij} \right)^2

The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that gamma equals zero is computed as
var_0(G) = \frac{4}{(P + Q)^2} \left( \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P - Q)^2}{N} \right)

Where d_ij = A_ij - D_ij.
The asymptotic standard error under the null hypothesis that gamma equals zero is the square root of this variance.
Kendall's tau-b
Kendall's tau-b is similar to gamma except that tau-b uses a correction for ties. Tau-b is appropriate only when both variables lie on an ordinal scale. The range of tau-b is {-1, 1}. Kendall's tau-b is estimated by

\tau_b = \frac{P - Q}{w}

Where

w_r = N^2 - \sum_i n_{i\cdot}^2 \qquad w_c = N^2 - \sum_j n_{\cdot j}^2 \qquad w = \sqrt{w_r\, w_c}
The asymptotic variance is
var(\tau_b) = \frac{1}{w^4} \left\{ \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} \left( 2w\, d_{ij} + \tau_b\, v_{ij} \right)^2 - N^3 \tau_b^2 \left( w_r + w_c \right)^2 \right\}

where

v_{ij} = n_{i\cdot}\, w_c + n_{\cdot j}\, w_r

The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that tau-b equals zero is computed as
var_0(\tau_b) = \frac{4}{w_r\, w_c} \left( \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P - Q)^2}{N} \right)

The asymptotic standard error under the null hypothesis that tau-b equals zero is the square root of this variance.
Stuart-Kendall's tau-c
Stuart-Kendall's tau-c makes an adjustment for table size in addition to a correction for ties. Tau-c is appropriate only when both variables lie on an ordinal scale. The range of tau-c is {-1, 1}. Stuart-Kendall's tau-c is estimated by

\tau_c = \frac{m (P - Q)}{N^2 (m - 1)}

Where m = min{I, J}.
The asymptotic variance is

var(\tau_c) = \frac{4 m^2}{N^4 (m - 1)^2} \left( \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P - Q)^2}{N} \right)

The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that tau-c equals zero is the same as the asymptotic variance.
Somers' D
Somers' D(C/R) and Somers' D(R/C) are asymmetric modifications of tau-b. C/R indicates that the row variable X is regarded as the independent variable and the column variable Y is regarded as dependent. Similarly, R/C indicates that the column variable Y is regarded as the independent variable and the row variable X is regarded as dependent. Somers' D differs from tau-b in that it uses a correction only for pairs that are tied on the independent variable. Somers' D is appropriate only when both variables lie on an ordinal scale. The range of Somers' D is {-1, 1}. Somers' D is computed as

D(C/R) = \frac{P - Q}{w_r}
The asymptotic variance is
var(D(C/R)) = \frac{4}{w_r^4} \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} \left[ w_r\, d_{ij} - (P - Q)(N - n_{i\cdot}) \right]^2

The asymptotic standard error is the square root of the asymptotic variance.
The variance under the null hypothesis that D(C/R) equals zero is computed as
var_0(D(C/R)) = \frac{4}{w_r^2} \left( \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P - Q)^2}{N} \right)

The asymptotic standard error under the null hypothesis that D(C/R) equals zero is the square root of this variance.
Formulas for Somers' D(R/C) are obtained by interchanging the indices.
The symmetric version of Somers' d is

d = \frac{P - Q}{(w_r + w_c)/2}

The standard error is

ASE(d) = \frac{2\, \sigma_{\tau_b}\, w}{w_r + w_c}

where \sigma_{\tau_b} is the asymptotic standard error of Kendall's tau-b.
The variance under the null hypothesis that d equals zero is computed as
var_0(d) = \frac{16}{(w_r + w_c)^2} \left( \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}\, d_{ij}^2 - \frac{(P - Q)^2}{N} \right)

The asymptotic standard error under the null hypothesis that d equals zero is the square root of this variance.
Confidence Bounds and One-Sided Tests
Suppose you are testing the null hypothesis H0: \theta \ge \theta_0 against the one-sided alternative H1: \theta < \theta_0. Rather than give a two-sided confidence interval for \theta, the more appropriate procedure is to give an upper confidence bound in this setting. This upper confidence bound has a direct relationship to the one-sided test, namely:
1. A level \alpha test of H0: \theta \ge \theta_0 against the one-sided alternative H1: \theta < \theta_0 rejects H0 exactly when the value \theta_0 is above the 1-\alpha upper confidence bound.
2. A level \alpha test of H0: \theta \le \theta_0 against the one-sided alternative H1: \theta > \theta_0 rejects H0 exactly when the value \theta_0 is below the 1-\alpha lower confidence bound.
ANOVA Test
SS_{Total} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2

SS_{Inter} = \sum_{i=1}^{k} n_i (\bar{y}_{i.} - \bar{y}_{..})^2

SS_{Intra} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2 = SS_{Total} - SS_{Inter}

DF_{Total} = N - 1 \qquad DF_{Inter} = k - 1 \qquad DF_{Intra} = N - k
MS_{Total} = \frac{SS_{Total}}{DF_{Total}} \qquad MS_{Inter} = \frac{SS_{Inter}}{DF_{Inter}} \qquad MS_{Intra} = \frac{SS_{Intra}}{DF_{Intra}} \qquad F = \frac{MS_{Inter}}{MS_{Intra}}
where
• F is the result of the test
• k is the number of different groups to which the sampled cases belong
• N = \sum_{i=1}^{k} n_i is the total sample size
• n_i is the number of cases in the i-th group
• y_ij is the value of the measured variable for the j-th case from the i-th group
• \bar{y}_{..} is the mean of all y_ij
• \bar{y}_{i.} is the mean of the y_ij for group i.
The test statistic has an F-distribution with DF Inter and DF Intra degrees of freedom. Thus the null hypothesis is rejected if F \ge F(1 - \alpha;\ k - 1,\ N - k).
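The decomposition above can be sketched in Python (an illustrative helper returning F and its degrees of freedom):

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic with its (DF Inter, DF Intra)."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ss_inter = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_intra = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    f = (ss_inter / (k - 1)) / (ss_intra / (N - k))
    return f, (k - 1, N - k)
```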
ANOVA Multiple Comparisons
Difference of Means
\bar{y}_i - \bar{y}_j

Standard Error of the Difference of Means Estimator

\mathrm{Std.\ Error} = \sqrt{ MS_{Intra} \cdot \left( \frac{1}{n_i} + \frac{1}{n_j} \right) }
Scheffé's Method
Confidence Interval for Difference of Means
CI(1 - \alpha) = \bar{y}_i - \bar{y}_j \pm \sqrt{ DF_{Inter} \cdot MS_{Intra} \cdot F(1 - \alpha;\ DF_{Inter},\ DF_{Intra}) \cdot \left( \frac{1}{n_i} + \frac{1}{n_j} \right) }
Source: http://en.wikipedia.org/wiki/Scheff%C3%A9%27s_method
Tukey's range test HSD
Confidence Interval for Difference of Means
CI(1 - \alpha) = \bar{y}_i - \bar{y}_j \pm q(1 - \alpha;\ k,\ DF_{Intra}) \sqrt{ \frac{MS_{Intra}}{2} \cdot \left( \frac{1}{n_i} + \frac{1}{n_j} \right) }

Where q is the studentized range distribution.
Source: https://en.wikipedia.org/wiki/Tukey%27s_range_test
Fisher's Method LSD
If the overall ANOVA test is not significant, you must not consider any result of the Fisher test, significant or not.
Confidence Interval for Difference of Means

CI(1 - \alpha) = \bar{y}_i - \bar{y}_j \pm t(1 - \alpha/2;\ DF_{Intra}) \sqrt{ MS_{Intra} \cdot \left( \frac{1}{n_i} + \frac{1}{n_j} \right) }

Where t is the Student's t distribution.
Bonferroni's Method
The family-wise significance level (FWER) is \alpha = 1 - Confidence Level. Thus any comparison flagged by ISSTATS as significant is based on a Bonferroni correction:

\alpha' = \frac{2\alpha}{k(k - 1)} \qquad m' = \frac{k(k - 1)}{2}

Where k is the number of groups and m' is the number of pairwise comparisons.
Confidence Interval for Difference of Means
CI(1 - \alpha) = \bar{y}_i - \bar{y}_j \pm t(1 - \alpha'/2;\ DF_{Intra}) \sqrt{ MS_{Intra} \cdot \left( \frac{1}{n_i} + \frac{1}{n_j} \right) }

Where t is the Student's t distribution.
Sidak's Method
The family-wise significance level (FWER) is \alpha = 1 - Confidence Level. So any comparison flagged by ISSTATS as significant is based on a Sidak correction:

\alpha' = 1 - (1 - \alpha)^{\frac{2}{k(k - 1)}} = 1 - e^{\frac{2 \log(1 - \alpha)}{k(k - 1)}}

Where k is the number of groups.
Confidence Interval for Difference of Means
CI(1 - \alpha) = \bar{y}_i - \bar{y}_j \pm t(1 - \alpha'/2;\ DF_{Intra}) \sqrt{ MS_{Intra} \cdot \left( \frac{1}{n_i} + \frac{1}{n_j} \right) }

Where t is the Student's t distribution.
Welch's Test for equality of means
The test statistic, F*, is defined as follows:

F^* = \frac{ \sum_{i=1}^{k} w_i (\bar{x}_i - \tilde{x})^2 / (k - 1) }{ 1 + \dfrac{2(k - 2)}{k^2 - 1} \sum_{i=1}^{k} h_i }

where
• F* is the result of the test
• k is the number of different groups to which the sampled cases belong
• n_i is the number of cases in the i-th group
• w_i = n_i / s_i^2
• W = \sum_{i=1}^{k} w_i
• \tilde{x} = \dfrac{\sum_{i=1}^{k} w_i \bar{x}_i}{W}
• h_i = \dfrac{(1 - w_i / W)^2}{n_i - 1}
The test statistic has approximately an F-distribution with k - 1 and df = \dfrac{k^2 - 1}{3 \sum_{i=1}^{k} h_i} degrees of freedom. Thus the null hypothesis is rejected if F^* \ge F(1 - \alpha;\ k - 1,\ df).
Brown-Forsythe Test for equality of means
The test statistic, F*, is defined as follows:

F^* = \frac{ \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x}_{..})^2 }{ \sum_{i=1}^{k} (1 - n_i / N)\, s_i^2 }

where
• F* is the result of the test
• k is the number of different groups to which the sampled cases belong
• n_i is the number of cases in the i-th group (sample size of group i)
• N = \sum_{i=1}^{k} n_i is the total sample size
• \bar{x}_{..} = \dfrac{\sum_{i=1}^{k} n_i \bar{x}_i}{N} is the overall mean.
The test statistic has approximately an F-distribution with k - 1 and df degrees of freedom, where df is obtained with the Satterthwaite (1941) approximation as

\frac{1}{df} = \sum_{i=1}^{k} \frac{c_i^2}{n_i - 1}

with

c_i = \frac{ (1 - n_i / N)\, s_i^2 }{ \sum_{j=1}^{k} (1 - n_j / N)\, s_j^2 }

Thus the null hypothesis is rejected if F^* \ge F(1 - \alpha;\ k - 1,\ df).
Homoscedasticity Tests
Levene's Test
The test statistic, F, is defined as follows:
F = \frac{N - k}{k - 1} \cdot \frac{ \sum_{i=1}^{k} n_i (\bar{Z}_{i.} - \bar{Z}_{..})^2 }{ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i.})^2 }

where
• F is the result of the test
• k is the number of different groups to which the sampled cases belong
• N = \sum_{i=1}^{k} n_i is the total sample size
• n_i is the number of cases in the i-th group
• Y_ij is the value of the measured variable for the j-th case from the i-th group
• Z_{ij} = |Y_{ij} - \bar{Y}_{i.}|, where \bar{Y}_{i.} is the mean of the i-th group
• \bar{Z}_{..} is the mean of all Z_ij
• \bar{Z}_{i.} is the mean of the Z_ij for group i.
The test statistic has an F-distribution with k - 1 and N - k degrees of freedom. Thus the null hypothesis is rejected if F \ge F(1 - \alpha;\ k - 1,\ N - k).
Source: http://en.wikipedia.org/wiki/Levene%27s_test
Brown-Forsythe Test for equality of variances
The test statistic, F, is defined as follows:

F = \frac{N - k}{k - 1} \cdot \frac{ \sum_{i=1}^{k} n_i (\bar{Z}_{i.} - \bar{Z}_{..})^2 }{ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i.})^2 }

where
• F is the result of the test
• k is the number of different groups to which the sampled cases belong
• N = \sum_{i=1}^{k} n_i is the total sample size
• n_i is the number of cases in the i-th group
• Y_ij is the value of the measured variable for the j-th case from the i-th group
• Z_{ij} = |Y_{ij} - \tilde{Y}_{i.}|, where \tilde{Y}_{i.} is the median of the i-th group
• \bar{Z}_{..} is the mean of all Z_ij
• \bar{Z}_{i.} is the mean of the Z_ij for group i.
The test statistic has an F-distribution with k - 1 and N - k degrees of freedom. Thus the null hypothesis is rejected if F \ge F(1 - \alpha;\ k - 1,\ N - k).
Source: http://en.wikipedia.org/wiki/Levene%27s_test
Bartlett's Test
Bartlett's test is used to test the null hypothesis H0 that all k population variances are equal against the alternative that at least two are different.
If there are k samples with sizes n_i and sample variances S_i^2, then Bartlett's test statistic is

\chi^2 = \frac{ (N - k) \ln(S_p^2) - \sum_{i=1}^{k} (n_i - 1) \ln(S_i^2) }{ 1 + \dfrac{1}{3(k - 1)} \left( \sum_{i=1}^{k} \dfrac{1}{n_i - 1} - \dfrac{1}{N - k} \right) }
where
• N = \sum_{i=1}^{k} n_i is the total sample size
• S_p^2 = \dfrac{\sum_{i=1}^{k} (n_i - 1) S_i^2}{N - k} is the pooled estimate for the variance.
The test statistic has approximately a chi-squared distribution with k - 1 degrees of freedom. Thus the null hypothesis is rejected if \chi^2 \ge \chi^2_{k-1}(1 - \alpha).
Source: http://en.wikipedia.org/wiki/Bartlett%27s_test
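The statistic can be sketched from the pieces above (an illustrative helper; it assumes every group has at least two observations and nonzero variance):

```python
import math

def bartlett_statistic(groups):
    """Bartlett's chi-square statistic for homogeneity of variances."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    def var(g):
        m = sum(g) / len(g)
        return sum((x - m) ** 2 for x in g) / (len(g) - 1)
    variances = [var(g) for g in groups]
    sp2 = sum((len(g) - 1) * v for g, v in zip(groups, variances)) / (N - k)
    num = (N - k) * math.log(sp2) - sum((len(g) - 1) * math.log(v)
                                        for g, v in zip(groups, variances))
    corr = 1 + (sum(1 / (len(g) - 1) for g in groups) - 1 / (N - k)) / (3 * (k - 1))
    return num / corr
```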
Bivariate Correlation Tests
Sample Covariance
S_{xy} = \frac{ \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) }{ N - 1 }

where N is the number of paired observations.
Source: http://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance
Sample Pearson Product-Moment Correlation Coefficient
r = \frac{1}{N - 1} \cdot \frac{ \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) }{ S_x\, S_y } = \frac{S_{xy}}{S_x\, S_y}

where S_x and S_y are the sample standard deviations of the paired sample (x_i, y_i), S_xy is the sample covariance and N is the total sample size.
Source: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample
Test for the Significance of the Pearson Product-Moment Correlation Coefficient
Test hypotheses are:
• H0: the sample values come from a population in which \rho = 0
• H1: the sample values come from a population in which \rho \ne 0
The test statistic is

t = \frac{ r \sqrt{N - 2} }{ \sqrt{1 - r^2} }

where
• N is the total sample size
• r is the sample Pearson product-moment correlation coefficient.
The test statistic has a Student's t distribution with N - 2 degrees of freedom.
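A minimal sketch of the coefficient and its t statistic (function names are ours; `pearson_t` must not be called with |r| = 1):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_t(r, n):
    """t statistic (df = n - 2) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
```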
Spearman Correlation Coefficient
For each of the variables X and Y separately, the observations are sorted into ascending order and replaced by their ranks. Identical values (rank ties or value duplicates) are assigned a rank equal to the average of their positions in the ascending order of the values. Each time t observations are tied (t > 1), the quantity t^3 - t is calculated and summed separately for each variable. These sums will be designated ST_x and ST_y.
For each of the N observations, the difference between the rank of X and rank of Y is computed as:

d_i = Rank(X_i) - Rank(Y_i)

If there are no ties in either sample, Spearman's rho (\rho) is calculated as

\rho = 1 - \frac{6 \sum_i d_i^2}{N (N^2 - 1)}
If there are any ties in either sample, Spearman's rho (\rho) is calculated as (Siegel, 1956):

\rho = \frac{ T_x + T_y - \sum_i d_i^2 }{ 2 \sqrt{T_x\, T_y} }

where

T_x = \frac{ N (N^2 - 1) - ST_x }{12} \qquad T_y = \frac{ N (N^2 - 1) - ST_y }{12}

If T_x or T_y is 0, the statistic is not computed.
Source:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2F
alg_nonpar_corr_spearman.htm
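The tie-corrected computation can be sketched as follows (our own helpers, following the Siegel form above):

```python
from collections import Counter

def average_ranks(xs):
    """Ranks 1..N; tied values get the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j + 2) / 2          # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho via the tie-corrected Siegel formula."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    def tie_sum(vals):
        return sum(t ** 3 - t for t in Counter(vals).values())
    tx = (n * (n * n - 1) - tie_sum(xs)) / 12
    ty = (n * (n * n - 1) - tie_sum(ys)) / 12
    return (tx + ty - d2) / (2 * (tx * ty) ** 0.5)
```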
Test for the Significance of the Spearman's Correlation Coefficient
Test hypotheses are:
• H0: the sample values come from a population in which \rho = 0
• H1: the sample values come from a population in which \rho \ne 0
The test statistic is

t = \frac{ \rho \sqrt{N - 2} }{ \sqrt{1 - \rho^2} }

The test statistic has a Student's t distribution with N - 2 degrees of freedom.
Kendall's Tau-b Correlation Coefficient
For each of the variables X and Y separately, the observations are sorted into ascending order and replaced by their ranks. In situations where t observations are tied, the average rank is assigned.
Each time t > 1, the following quantities are computed and summed over all groups of ties for each variable separately:

T_1 = \sum (t^2 - t) \qquad T_2 = \sum (t^2 - t)(t - 2) \qquad T_3 = \sum (t^2 - t)(2t + 5)
Each of the N cases is compared to the others to determine with how many cases its ranking of X and Y is concordant or discordant. The following procedure is used. For each distinct pair of cases (i, j), where i < j, the quantity

d_{ij} = [Rank(X_j) - Rank(X_i)]\,[Rank(Y_j) - Rank(Y_i)]

is computed. If the sign of this product is positive, the pair of observations (i, j) is concordant. If the sign is negative, the pair is discordant. The number of concordant pairs minus the number of discordant pairs is

S = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} sign(d_{ij})

where sign(d_ij) is defined as +1 or -1 depending on the sign of d_ij. Pairs in which d_ij = 0 are ignored in the computation of S.
If there are no ties in either sample, Kendall's tau (\tau) is computed as

\tau = \frac{2S}{N^2 - N}

If there are any ties in either sample, Kendall's tau (\tau) is computed as

\tau = \frac{2S}{ \sqrt{N^2 - N - T_{1x}}\ \sqrt{N^2 - N - T_{1y}} }

If the denominator is 0, the statistic is not computed.
Source: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Tau-b
Test for the Significance of the Kendall's Tau-b Correlation Coefficient
The variance of S is estimated by (Kendall, 1955):
Var(S) = [(N² − N)(2N + 5) − T3x − T3y] / 18 + (T2x·T2y) / [9(N² − N)(N − 2)] + (T1x·T1y) / [2(N² − N)]
The significance level is obtained using
Z = S / √Var(S)
which, under the null hypothesis, is approximately distributed as a standard normal when the variables are statistically independent.
Sources: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Significance_tests
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2F
alg_nonpar_corr_kendalls.htm
Parametric Value at Risk
Value at Risk of a single asset
Given the time series of daily return rates for an asset, let μ be the daily mean of the return rates and σ² their daily variance. Let P be the position, holding or investment in the asset.
One-day Expected Return is:
ER = P·μ
The Standard Deviation or Volatility is the square root of the Variance:
σ = √σ²
One-day Value at Risk is:
VaR(1−α) = −(μ + zα·σ)·P
where zα is the left-tail α quantile of the standard normal distribution.
Total Value at Risk for n trading days is:
VaR(1−α, n days) = VaR(1−α)·√n = −(μ + zα·σ)·P·√n
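A minimal sketch of these two formulas (plain Python, not ISSTATS code; the helper names are ours):

```python
import math

def one_day_var(mu, sigma, position, z_alpha):
    """One-day parametric VaR = -(mu + z_alpha*sigma)*P.
    z_alpha is the left-tail quantile, e.g. about -1.645 for alpha = 5%."""
    return -(mu + z_alpha * sigma) * position

def total_var(mu, sigma, position, z_alpha, n_days):
    """n-day VaR scales the one-day VaR by sqrt(n)."""
    return one_day_var(mu, sigma, position, z_alpha) * math.sqrt(n_days)
```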
Portfolio Value at Risk
Given the time series of daily return rates on different assets, let μi be the daily mean of the return rates for the i-th asset, σi² the daily variance of the return rates for the i-th asset, and σi the daily standard deviation (or volatility) of the return rates for the i-th asset. The covariance of the daily return rates of the i-th and j-th assets is σij. All parameters are unbiased estimates. Given the holdings, positions or investments in each of these assets, Pi:
The total position is
P = Σ_{i=1}^{n} Pi
The weighting of each position is
wi = Pi / P
The weighted mean of the portfolio is
μP = Σ_{i=1}^{n} wi·μi = (1/P)·Σ_{i=1}^{n} Pi·μi
One-day Expected Return of the portfolio is the weighted mean of the portfolio multiplied by the total position:
ER = P·μP = P·Σ_{i=1}^{n} wi·μi = Σ_{i=1}^{n} Pi·μi
The Portfolio Variance is
σP² = Wᵀ·M·W
where W = (w1, …, wn)ᵀ is the vector of weights and M is the covariance matrix. The i-th item in the diagonal of M is the daily variance of the return rates for the i-th asset. The items outside the diagonal are covariances.
Portfolio Variance can also be computed as:
σP² = (1/P²)·Xᵀ·M·X
where X = (P1, …, Pn)ᵀ is the vector of positions.
The Portfolio Standard Deviation or Portfolio Volatility is the square root of the Portfolio Variance:
σP = √σP²
One-day Value at Risk is:
VaR(1−α) = −(μP + zα·σP)·P
where zα is the left-tail α quantile of the standard normal distribution.
Total Value at Risk for n trading days is:
VaR(1−α, n days) = VaR(1−α)·√n = −(μP + zα·σP)·P·√n
VaR(1−α, n days) is the minimum potential loss that the portfolio can suffer in the α% worst cases over n days.
About the Signs: A positive value of VaR is an expected loss. A negative VaR would imply the portfolio
has a high probability of making a profit.
Source: http://www.jpmorgan.com/tss/General/Risk_Management/1159360877242
Remark: Some texts about VaR express the covariance as σij = σi·σj·ρij, where ρij is the correlation coefficient.
Remark: Sometimes VaR is taken to be the Portfolio Volatility multiplied by the position, because the expected return is assumed to be approximately zero. ISSTATS does NOT equate VaR with Portfolio Volatility and does NOT assume the expected return is zero.
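The matrix formulas above can be sketched with plain lists (not ISSTATS code; `portfolio_var` is a hypothetical helper):

```python
import math

def portfolio_var(mu, cov, positions, z_alpha):
    """Portfolio VaR = -(mu_P + z_alpha*sigma_P)*P.
    mu: per-asset daily means; cov: covariance matrix of daily returns;
    positions: currency positions; z_alpha: left-tail normal quantile."""
    P = sum(positions)
    w = [p / P for p in positions]                      # weights w_i = P_i/P
    mu_p = sum(wi * mi for wi, mi in zip(w, mu))        # weighted mean
    var_p = sum(w[i] * cov[i][j] * w[j]                 # W^T M W
                for i in range(len(w)) for j in range(len(w)))
    sigma_p = math.sqrt(var_p)
    return -(mu_p + z_alpha * sigma_p) * P
```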
Marginal Value at Risk
Marginal Value at Risk is the change in portfolio VaR resulting from a marginal change in the currency
(dollar, euroā¦) position in component i:
MVaRi = ∂VaR / ∂Pi
Assuming the linearity of the risk in the parametric approach, the vector of Marginal Value at Risk is
MVaR = −(μ + (zα / σP)·M·W) = −(μ + (zα / (P·σP))·M·X)
where μ = (μ1, …, μn)ᵀ is the vector of daily means, M is the covariance matrix, W is the vector of weights and X is the vector of positions.
Total Marginal Value at Risk for n trading days is:
MVaR(i, n days) = MVaRi·√n
Component Value at Risk
Component Value at Risk is a partition of the portfolio VaR that indicates the change of VaR if a given
component was deleted.
CVaRi = (∂VaR / ∂Pi)·Pi = MVaRi·Pi
Note that the sum of all component VaRs (CVaR) is the VaR of the entire portfolio:
VaR = Σ_{i=1}^{n} CVaRi = Σ_{i=1}^{n} (∂VaR / ∂Pi)·Pi = Σ_{i=1}^{n} MVaRi·Pi
Total Component Value at Risk for n trading days is:
CVaR(i, n days) = CVaRi·√n
Source: http://www.math.nus.edu.sg/~urops/Projects/valueatrisk.pdf
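A sketch tying the two definitions together (plain Python, not ISSTATS code; `marginal_and_component_var` is a hypothetical helper). The components should add up to the portfolio VaR:

```python
import math

def marginal_and_component_var(mu, cov, positions, z_alpha):
    """MVaR_i = -(mu_i + (z_alpha/sigma_P)*(M w)_i) and CVaR_i = MVaR_i*P_i,
    so that sum(CVaR_i) equals the portfolio VaR."""
    n = len(positions)
    P = sum(positions)
    w = [p / P for p in positions]
    sigma_p = math.sqrt(sum(w[i] * cov[i][j] * w[j]
                            for i in range(n) for j in range(n)))
    mw = [sum(cov[i][j] * w[j] for j in range(n)) for i in range(n)]  # M·W
    mvar = [-(mu[i] + z_alpha / sigma_p * mw[i]) for i in range(n)]
    cvar = [mvar[i] * positions[i] for i in range(n)]
    return mvar, cvar
```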
Incremental Value at Risk
Incremental VaR of a given position is the VaR of the portfolio with the given position minus the VaR of
the portfolio without the given position, which measures the change in VaR due to a new position on the
portfolio:
IVaR (a) = VaR (P) ā VaR (P - a)
Source:
http://www.jpmorgan.com/tss/General/Portfolio_Management_With_Incremental_VaR/1259104336084
Conditional Value at Risk, Expected Shortfall, Expected Tail Loss or Average Value at Risk
ES(1−α, 1 day) is the expected value of the loss of the portfolio in the α% worst cases in one day.
Under the Multivariate Normal Assumption, Expected Shortfall, also known as Expected Tail Loss (ETL), Conditional Value-at-Risk (CVaR), Average Value at Risk (AVaR) and Worst Conditional Expectation, is computed by
ES = −E(x | x < −VaR)·P = −[μ + ES(zα)·σ]·P = −[μ + E(z | z < zα)·σ]·P
   = −[μ + (1/α)·(∫_{−∞}^{zα} t·e^(−t²/2) dt / √(2π))·σ]·P = −(μ − (e^(−zα²/2) / (α·√(2π)))·σ)·P
where zĪ± is the left-tail Ī± quantile of the normal standard distribution.
About the Sign: Because VaR is reported by ISSTATS with a negative sign, as J.P. Morgan recommends, we take its original value to perform the calculations (−VaR = μ + zα·σ). Once the ES is computed, it is reported with a negative sign. That means a positive value of ES is an expected loss. On the other hand, a negative value of ES would imply the portfolio has a high probability of making a profit even in the worst cases.
Source: http://www.imes.boj.or.jp/english/publication/mes/2002/me20-1-3.pdf
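The closed form above can be sketched with the standard library's normal quantile (not ISSTATS code; `expected_shortfall` is a hypothetical helper):

```python
import math
from statistics import NormalDist

def expected_shortfall(mu, sigma, position, alpha):
    """ES = -(mu - exp(-z_a^2/2) / (alpha*sqrt(2*pi)) * sigma) * P
    under the normal assumption; alpha is the left-tail probability."""
    z_a = NormalDist().inv_cdf(alpha)  # left-tail alpha quantile
    tail = math.exp(-z_a ** 2 / 2) / (alpha * math.sqrt(2 * math.pi))
    return -(mu - tail * sigma) * position
```

For μ = 0, σ = 1, P = 1 and α = 5% this gives roughly 2.06, above the corresponding VaR of about 1.64, as expected for a tail average.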
Exponentially Weighted Moving Average (EWMA) Forecast
Given a series of k daily return rates {r1, …, rk} computed as Continuously Compounded Returns:
ri = ln(Pi / Pi−1)
Where r1 corresponds to the earliest date in the series, and rk corresponds to the latest or most recent date.
Suppose k > 50 and assume that the sample mean of daily returns is zero. The EWMA estimates the one-day variance for a given sequence of k returns as:
σ² = (1 − λ)·Σ_{i=0}^{k−1} λ^i·r²(k−i)
where 0 < λ < 1 is the decay factor.
The one-day volatility is:
σ = √σ²
For horizons greater than one day, the T-period (i.e., over T days) forecast of the volatility is:
σ(T days) = σ·√T
For two return series, assuming that both averages are zero, the EWMA estimate of the one-day covariance for a given sequence of k returns is given by
cov1,2 = σ1,2 = (1 − λ)·Σ_{i=0}^{k−1} λ^i·r1,(k−i)·r2,(k−i)
The corresponding one-day correlation forecast for the two returns is given by
ρ1,2 = cov1,2 / (σ1·σ2) = σ1,2 / (σ1·σ2)
For horizons greater than one day, the T-period (i.e., over T days) forecast of the covariance is:
cov1,2(T days) = σ1,2·T
Source: http://pascal.iseg.utl.pt/~aafonso/eif/rm/TD4ePt_2.pdf
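The EWMA sums above can be sketched as (plain Python, not ISSTATS code; λ = 0.94 is shown only as a common daily decay choice):

```python
def ewma_variance(returns, lam=0.94):
    """One-day EWMA variance; returns ordered earliest-to-latest,
    mean assumed zero. Weight lam**i on the i-th most recent return."""
    k = len(returns)
    return (1 - lam) * sum(lam ** i * returns[k - 1 - i] ** 2
                           for i in range(k))

def ewma_covariance(r1, r2, lam=0.94):
    """One-day EWMA covariance forecast for two return series."""
    k = len(r1)
    return (1 - lam) * sum(lam ** i * r1[k - 1 - i] * r2[k - 1 - i]
                           for i in range(k))
```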
Value at Risk of a single asset, Portfolio Value at Risk, Marginal Value at Risk, Component
Value at Risk, Incremental Value at Risk, Incremental Value at Risk by EWMA method.
See methods and formulas at Parametric Value at Risk.
Linear Regression
Given n equations for a regression model with p predictor variables, the i-th equation is
yi = β0 + β1·xi1 + β2·xi2 + … + βp·xip + εi
The n equations stacked together and written in vector form are
Y = X·β + ε
where Y = (y1, …, yn)ᵀ, β = (β0, β1, …, βp)ᵀ, ε = (ε1, …, εn)ᵀ and the i-th row of X is (1, xi1, …, xip).
X is here named the design matrix, of dimensions n-by-(p+1).
If the constant is not included, the i-th row of the design matrix X is (xi1, …, xip), so X has dimensions n-by-p.
The estimated value of the unknown parameter β is:
β̂ = (Xᵀ·X)⁻¹·Xᵀ·Y
Estimation can be carried out if, and only if, there is no perfect multicollinearity between the predictor
variables.
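The estimator above can be sketched without any linear-algebra library by solving the normal equations (Xᵀ·X)·β = Xᵀ·Y directly (plain Python, not ISSTATS code; `ols_beta` is a hypothetical helper):

```python
def ols_beta(X, y):
    """beta_hat = (X^T X)^{-1} X^T y via Gaussian elimination.
    X is a list of rows; include a leading 1 per row for the constant."""
    p = len(X[0])
    # Build the normal equations: a*beta = b
    a = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = a[r][col] / a[col][col]
            for c in range(col, p):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(a[r][c] * beta[c]
                              for c in range(r + 1, p))) / a[r][r]
    return beta
```

With perfect multicollinearity the pivot becomes zero and the elimination fails, mirroring the condition stated above.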
If the constant is not included, the parameters can also be estimated by
β̂i = Σ_{j=1}^{n} xji·yj / Σ_{j=1}^{n} xji²
The standardized coefficients are
β̂i(st) = β̂i·Sxi / Sy
where
• Sxi is the unbiased standard deviation of the i-th predictor variable
• Sy is the unbiased standard deviation of the response variable y
The estimate of the standard error of each coefficient is obtained by
se(β̂i) = √(MSE·[(XᵀX)⁻¹]ii)
where MSE is the mean squared error of the regression model.
It is known that
β̂i / se(β̂i) ~ t(n−p−1)
where
• p is the number of predictor variables
• n is the total number of observations (number of rows in the design matrix)
If the constant is not included, the degrees of freedom for the t statistics are n−p.
ANOVA for linear regression
If the constant is included.
Component   Sum of squares   Degrees of freedom   Mean of squares       F
Model       SSM              p                    MSM = SSM/p           MSM/MSE
Error       SSE              n−p−1                MSE = SSE/(n−p−1)
Total       SST              n−1                  MST = SST/(n−1)
Being
SSM = Σ_{i=1}^{n} (ŷi − ȳ)²
SSE = Σ_{i=1}^{n} (yi − ŷi)²
SST = Σ_{i=1}^{n} (yi − ȳ)²
Where
• p is the number of predictor variables
• n is the total number of observations (number of rows in the design matrix)
• SSE = sum of squared residuals
• MSE = mean squared error of the regression model
The test statistic has an F-distribution with p and (n−p−1) degrees of freedom. Thus the ANOVA null hypothesis is rejected if F ≥ F(1−α; p, n−p−1).
The coefficient of determination R² is defined as SSM/SST. It is output as a percentage.
The Adjusted R² is defined as 1 − MSE/MST. It is output as a percentage.
The square root of MSE is called the standard error of the regression, or standard error of the Estimate.
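The decomposition above (constant included) can be sketched as (plain Python, not ISSTATS code; `anova_decomposition` is a hypothetical helper taking fitted values from any OLS fit):

```python
def anova_decomposition(y, y_hat, p):
    """Return SSM, SSE, SST, the F statistic and R^2 for a regression
    with p predictors and an intercept, given fitted values y_hat."""
    n = len(y)
    y_bar = sum(y) / n
    ssm = sum((f - y_bar) ** 2 for f in y_hat)          # model sum of squares
    sse = sum((yi - f) ** 2 for yi, f in zip(y, y_hat)) # residual sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)            # total sum of squares
    msm, mse = ssm / p, sse / (n - p - 1)
    return ssm, sse, sst, msm / mse, ssm / sst
```

For an OLS fit with intercept, SSM + SSE = SST, so R² = SSM/SST lies between 0 and 1.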
If the constant is not included.
Component   Sum of squares   Degrees of freedom   Mean of squares     F
Model       SSM              p                    MSM = SSM/p         MSM/MSE
Error       SSE              n−p                  MSE = SSE/(n−p)
Total       SST              n                    SST/n
Being
SSM = Σ_{i=1}^{n} ŷi²
SSE = Σ_{i=1}^{n} (yi − ŷi)²
SST = Σ_{i=1}^{n} yi²
Unstandardized Predicted Values
The fitted values (or unstandardized predicted values) from the regression are
Ŷ = X·β̂ = X·(XᵀX)⁻¹·Xᵀ·Y = H·Y
where H is the projection matrix (also known as the hat matrix)
H = X·(XᵀX)⁻¹·Xᵀ
Standardized Predicted Values
Once the mean and unbiased standard deviation of the unstandardized predicted values are computed, we standardize the fitted values as
ŷi(st) = (ŷi − mean(ŷ)) / sŷ
When new predictions are included outside of the design matrix, they are standardized with the above values.
Prediction Intervals for Mean
Define the vector of given predictors as
Xh = (1, xh,1, xh,2, …, xh,p)ᵀ
We define the standard error of the fit at Xh by:
se(ŷh) = √(MSE·Xhᵀ·(XᵀX)⁻¹·Xh)
Then, the Confidence Interval for the Mean Response is
ŷh ± t(α/2; n−p−1)·se(ŷh)
where
• X is the design matrix
• ŷh is the "fitted value" or "predicted value" of the response when the predictor values are Xh
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables
Prediction Intervals for Individuals
Define the vector of given predictors as
Xh = (1, xh,1, xh,2, …, xh,p)ᵀ
We define the standard error of the prediction at Xh by:
se(ŷh) = √(MSE·[1 + Xhᵀ·(XᵀX)⁻¹·Xh])
Then, the Prediction Interval for individuals or new observations is
ŷh ± t(α/2; n−p−1)·se(ŷh)
where
• X is the design matrix
• ŷh is the "fitted value" or "predicted value" of the response when the predictor values are Xh
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables
Unstandardized Residuals
The Unstandardized Residual for the i-th data unit is defined as:
êi = yi − ŷi
In matrix notation:
ê = Y − Ŷ = Y − H·Y = (I − H)·Y
where H is the hat matrix and I is the n-by-n identity matrix.
Standardized Residuals
The Standardized Residual for the i-th data unit is defined as:
ês,i = êi / √MSE
where
• êi is the unstandardized residual for the i-th data unit
• MSE is the mean squared error of the regression model
Studentized Residuals (internally studentized residuals)
The leverage score for the i-th data unit is defined as:
hii = [H]ii
the i-th diagonal element of the projection matrix (also known as the hat matrix)
H = X·(XᵀX)⁻¹·Xᵀ
where X is the design matrix.
The Studentized Residual for the i-th data unit is defined as:
ti = êi / √(MSE·(1 − hii))
where
• êi is the unstandardized residual for the i-th data unit
• MSE is the mean squared error of the regression model
Source: https://en.wikipedia.org/wiki/Studentized_residual
Centered Leverage Values
The regular leverage score for the i-th data unit is defined as:
hii = [H]ii
the i-th diagonal element of the projection matrix (also known as the hat matrix)
H = X·(XᵀX)⁻¹·Xᵀ
where X is the design matrix.
The centered leverage value for the i-th data unit is defined as:
clvi = hii − 1/n
where n is the number of observations.
If the intercept is not included, then the centered leverage value for the i-th data unit is defined as:
clvi = hii
Source: https://en.wikipedia.org/wiki/Leverage_(statistics)
Mahalanobis Distance
The Mahalanobis Distance for the i-th data unit is defined as:
Di² = (n − 1)·(hii − 1/n) = (n − 1)·clvi
where
• hii is the i-th diagonal element of the projection matrix
• n is the number of observations
If the intercept is not included, the Mahalanobis Distance for the i-th data unit is defined as:
Di² = n·hii
Source: https://en.wikipedia.org/wiki/Mahalanobis_distance
Cookās Distance
The Cook's Distance for the i-th data unit is defined as:
Di = êi²·hii / [MSE·(p + 1)·(1 − hii)²]
where
• hii is the i-th diagonal element of the projection matrix
• p is the number of predictor variables
• êi is the unstandardized residual for the i-th data unit
• MSE is the mean squared error of the regression model
If the intercept is not included, the Cook's Distance for the i-th data unit is defined as:
Di = êi²·hii / [MSE·p·(1 − hii)²]
Source: https://en.wikipedia.org/wiki/Cook%27s_distance
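The diagnostics above (intercept included) can be sketched together, given residuals, the hat-matrix diagonal and the regression MSE (plain Python, not ISSTATS code; `diagnostics` is a hypothetical helper):

```python
import math

def diagnostics(e, h, mse, n, p):
    """For each data unit: studentized residual, centered leverage,
    Mahalanobis distance and Cook's distance (intercept included).
    e: residuals; h: hat-matrix diagonal; mse: regression MSE."""
    out = []
    for e_i, h_ii in zip(e, h):
        t_i = e_i / math.sqrt(mse * (1 - h_ii))         # studentized residual
        clv = h_ii - 1 / n                              # centered leverage
        d2 = (n - 1) * clv                              # Mahalanobis distance
        cook = e_i ** 2 * h_ii / (mse * (p + 1) * (1 - h_ii) ** 2)
        out.append((t_i, clv, d2, cook))
    return out
```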
Curve Estimation Models
Linear. Model whose equation is Y = b0 + (b1 * t). The series values are modeled as a linear
function of time.
Quadratic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2). The quadratic model can be
used to model a series that "takes off" or a series that dampens.
Cubic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3).
Quartic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4
* t**4).
Quintic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4
* t**4) + (b5 * t**5).
Sextic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 *
t**4) + (b5 * t**5) + (b6 * t**6).
Logarithmic. Model whose equation is Y = b0 + (b1 * ln(t)).
Inverse. Model whose equation is Y = b0 + (b1 / t).
Power. Model whose equation is Y = b0 * (t**b1) or ln(Y) = ln(b0) + (b1 * ln(t)).
Compound. Model whose equation is Y = b0 * (b1**t) or ln(Y) = ln(b0) + (ln(b1) * t).
S-curve. Model whose equation is Y = e**(b0 + (b1/t)) or ln(Y) = b0 + (b1/t).
Logistic. Model whose equation is Y = 1 / (1/u + (b0 * (b1**t))) or ln(1/y-1/u) = ln (b0) + (ln(b1)
* t) where u is the upper boundary value. After selecting Logistic, specify the upper boundary value to
use in the regression equation. The value must be a positive number that is greater than the largest
dependent variable value.
Growth. Model whose equation is Y = e**(b0 + (b1 * t)) or ln(Y) = b0 + (b1 * t).
Exponential. Model whose equation is Y = b0 * (e**(b1 * t)) or ln(Y) = ln(b0) + (b1 * t).
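Several of the models above are fitted by the log-linearization shown next to their equations. As a sketch (plain Python, not ISSTATS code; `fit_power_model` is a hypothetical helper), the Power model Y = b0·t**b1 reduces to a simple linear regression of ln(Y) on ln(t):

```python
import math

def fit_power_model(t, y):
    """Fit Y = b0 * t**b1 via ln(Y) = ln(b0) + b1*ln(t)."""
    lx = [math.log(v) for v in t]
    ly = [math.log(v) for v in y]
    n = len(t)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    b1 = sxy / sxx                      # slope on the log scale
    b0 = math.exp(my - b1 * mx)         # back-transform the intercept
    return b0, b1
```

The Compound, S-curve, Growth and Exponential models linearize the same way, each with its own transform of Y, t, or both.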