InnerSoft STATS
Methods and Formulas Help
Mean
The arithmetic mean is the sum of a collection of numbers divided by the number of numbers in the
collection.
Sample Variance
The estimator of population variance, also called the unbiased sample variance, is:
š‘†2
=
āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)2š‘›
š‘–=1
š‘› āˆ’ 1
Source: http://en.wikipedia.org/wiki/Variance
Sample Kurtosis
The estimator of population kurtosis is:

G_2 = \frac{k_4}{k_2^2} = \frac{(n+1)\,n}{(n-1)(n-2)(n-3)}\cdot\frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{k_2^2} - 3\,\frac{(n-1)^2}{(n-2)(n-3)}
The standard error of the sample kurtosis of a sample of size n from the normal distribution is:
š¾ š‘†š‘”š‘‘. šøš‘Ÿš‘Ÿš‘œš‘Ÿ = āˆš
4[6š‘›(š‘› āˆ’ 1)2(š‘› + 1)]
(š‘› āˆ’ 3)(š‘› āˆ’ 2)(š‘› + 1)(š‘› + 3)(š‘› + 5)
Source: http://en.wikipedia.org/wiki/Kurtosis#Estimators_of_population_kurtosis
Sample Skewness
Skewness of a population sample is estimated by the adjusted Fisherā€“Pearson standardized moment
coefficient:
šŗ =
š‘›
(š‘› āˆ’ 1)(š‘› āˆ’ 2)
āˆ‘ (
š‘„š‘– āˆ’ š‘„Ģ…
š‘ 
)
3š‘›
š‘–=1
where n is the sample size and s is the sample standard deviation.
The standard error of the skewness of a sample of size n from a normal distribution is:
šŗ š‘†š‘”š‘‘. šøš‘Ÿš‘Ÿš‘œš‘Ÿ = āˆš
6š‘›(š‘› āˆ’ 1)
(š‘› āˆ’ 2)(š‘› + 1)(š‘› + 3)
Source: https://en.wikipedia.org/wiki/Skewness#Sample_skewness
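A minimal Python sketch of these descriptive estimators (sample variance, skewness G, kurtosis G2 and their normal-theory standard errors), assuming NumPy; the function and variable names are illustrative, not part of ISSTATS.

import numpy as np

def descriptive_stats(x):
    """Sample variance, skewness and kurtosis with their standard errors."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    s2 = np.sum(d**2) / (n - 1)                    # unbiased sample variance
    s = np.sqrt(s2)
    g1 = n / ((n - 1) * (n - 2)) * np.sum((d / s)**3)
    se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    g2 = ((n + 1) * n / ((n - 1) * (n - 2) * (n - 3)) * np.sum(d**4) / s2**2
          - 3 * (n - 1)**2 / ((n - 2) * (n - 3)))
    se_kurt = np.sqrt(4 * (6 * n * (n - 1)**2 * (n + 1))
                      / ((n - 3) * (n - 2) * (n + 1) * (n + 3) * (n + 5)))
    return {"variance": s2, "skewness": g1, "se_skewness": se_skew,
            "kurtosis": g2, "se_kurtosis": se_kurt}

print(descriptive_stats([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]))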
Total Variance
Variance of the entire population is:
šœŽ2
=
āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)2š‘›
š‘–=1
š‘›
Source: http://en.wikipedia.org/wiki/Variance
Total Kurtosis
Kurtosis of the entire population is:
šŗ2 =
āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)4š‘›
š‘–=1
š‘›
šœŽ4
āˆ’ 3
where n is the sample size and Ļƒ is the total standard deviation.
Source: http://en.wikipedia.org/wiki/Kurtosis
Total Skewness
Skewness of the entire population is:
šŗ =
āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)3š‘›
š‘–=1
š‘›
šœŽ3
where n is the sample size and Ļƒ is the total standard deviation.
Source: https://en.wikipedia.org/wiki/Skewness
Quantiles of a population
ISSTATS uses the same method as R type 7, the Excel QUARTILE.INC (CUARTIL.INC) function, SciPy (1,1), SPSS and Minitab.
Q_p, the estimate for the k-th q-quantile, where p = k/q and h = (N - 1)p + 1, is computed by

Q_p = x_{\lfloor h \rfloor} + (h - \lfloor h \rfloor)\,(x_{\lfloor h \rfloor + 1} - x_{\lfloor h \rfloor})

that is, linear interpolation of the modes of the order statistics for the uniform distribution on [0, 1]. When p = 1, use x_N.
Source: http://en.wikipedia.org/wiki/Quantile#Estimating_the_quantiles_of_a_population
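A short sketch of this type-7 quantile rule, assuming NumPy; quantile_r7 is an illustrative helper name. NumPy's default percentile interpolation follows the same rule, so it is used here only as a cross-check.

import numpy as np

def quantile_r7(data, p):
    """h = (N-1)*p + 1 on the sorted data (1-based), then linear interpolation."""
    x = np.sort(np.asarray(data, dtype=float))
    N = x.size
    if p >= 1.0:
        return x[-1]                      # when p = 1, use x_N
    h = (N - 1) * p + 1                   # 1-based position
    lo = int(np.floor(h))
    frac = h - lo
    return x[lo - 1] + frac * (x[lo] - x[lo - 1])

data = [15, 20, 35, 40, 50]
print(quantile_r7(data, 0.25), np.percentile(data, 25))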
MSSD (Mean of the squared successive differences)
It is calculated by summing the squared differences between consecutive observations and dividing that sum by twice the number of differences.
š‘€š‘†š‘†š· =
āˆ‘ (š‘„š‘–+1 āˆ’ š‘„š‘–)2š‘›
š‘–=1
2(š‘› āˆ’ 1)
The MSSD has the desirable property of being an unbiased estimator of the true variance when successive observations are independent.
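A minimal sketch of the MSSD computation, assuming NumPy; the helper name is illustrative.

import numpy as np

def mssd(x):
    """Mean of the squared successive differences, with the factor 2 in the denominator."""
    x = np.asarray(x, dtype=float)
    d = np.diff(x)                        # x[i+1] - x[i]
    return np.sum(d**2) / (2 * (x.size - 1))

print(mssd([10.2, 10.5, 9.9, 10.1, 10.4]))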
Pearson Chi Square Test
The value of the test-statistic is
šœ’2
= āˆ‘
(š‘‚š‘– āˆ’ šøš‘–)2
šøš‘–
š‘›
š‘–=1
Where
ļ‚· šœ’2
is the Pearson's cumulative test statistic, which asymptotically approaches a šœ’2
distribution
with (r - 1)(c - 1) degrees of freedom.
ļ‚· š‘‚š‘– is the number of observations of type i.
ļ‚· šøš‘– is the expected (theoretical) frequency of type i
Yates's Continuity Correction
The value of the test-statistic is
šœ’2
= āˆ‘
(š‘šš‘Žš‘„{0, |š‘‚š‘– āˆ’ šøš‘–| āˆ’ 0.5})2
šøš‘–
š‘›
š‘–=1
When |š‘‚š‘– āˆ’ šøš‘–| āˆ’ 0.5 is below zero, the null value is computed. The effect of Yates' correction is to
prevent overestimation of statistical significance for small data. This formula is chiefly used when at least
one cell of the table has an expected count smaller than 5.
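A hedged Python sketch of the Pearson chi-square statistic on a contingency table, with the optional Yates correction above; it assumes NumPy/SciPy and illustrative names, and builds the expected counts from the margins.

import numpy as np
from scipy.stats import chi2

def pearson_chi2(table, yates=False):
    """Pearson chi-square test on an r x c table, optional Yates correction."""
    O = np.asarray(table, dtype=float)
    row = O.sum(axis=1, keepdims=True)
    col = O.sum(axis=0, keepdims=True)
    E = row @ col / O.sum()               # expected counts under independence
    diff = np.abs(O - E)
    if yates:
        diff = np.maximum(diff - 0.5, 0.0)
    stat = np.sum(diff**2 / E)
    df = (O.shape[0] - 1) * (O.shape[1] - 1)
    return stat, df, chi2.sf(stat, df)

print(pearson_chi2([[12, 5], [7, 16]], yates=True))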
Likelihood Ratio G-Test
The value of the test-statistic is
šŗ = 2 (āˆ‘ āˆ‘ š‘‚š‘–š‘— āˆ— š‘™š‘›(
š‘‚š‘–š‘—
šøš‘–š‘—
)
š‘
š‘—=1
š‘Ÿ
š‘–=1
)
where
• O_ij is the observed count in row i and column j.
• E_ij is the expected count in row i and column j.
G asymptotically follows a χ² distribution with (r - 1)(c - 1) degrees of freedom when the null hypothesis is true and n is large enough.
Mantel-Haenszel Chi-Square Test
The Mantel-Haenszel chi-square statistic tests the alternative hypothesis that there is a linear association
between the row variable and the column variable. Both variables must lie on an ordinal scale. The
Mantel-Haenszel chi-square statistic is computed as:
š‘„ š‘€š» = (š‘› āˆ’ 1)š‘Ÿ2
Where r is the Pearson correlation between the row variable and the column variable, n is the sample size.
Under the null hypothesis of no association, has an asymptotic chi-square distribution with one degree of
freedom.
Fisher's Exact Test
Fisherā€™s exact test assumes that the row and column totals are fixed, and then uses the hypergeometric
distribution to compute probabilities of possible tables conditional on the observed row and column totals.
Fisherā€™s exact test does not depend on any large-sample distribution assumptions, and so it is appropriate
even for small sample sizes and for sparse tables. This test is computed for 2×2 tables such as
š“ = (
š‘Ž š‘
š‘ š‘‘
)
For efficient computation, the elements of the matrix A are reordered as

A' = \begin{pmatrix} a' & b' \\ c' & d' \end{pmatrix}

where a' is the cell of A with the minimum marginals (minimum row and column totals). The test result does not depend on the arrangement of the cells.
The left-sided p-value sums the probabilities of all tables that have a cell value equal to or smaller than a'.

p_{left} = P(x \le a') = \sum_{i=0}^{a'} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}, \qquad K = a' + b',\quad N = a' + b' + c' + d',\quad n = a' + c'
The right-sided p-value sums the probabilities of all tables that have a cell value equal to or larger than a'.

p_{right} = P(x \ge a') = \sum_{i=a'}^{K} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
Most statistical packages output, as the one-sided test result, the minimum of p_left and p_right. The Fisher two-tailed p-value for a table A is defined as the sum of the probabilities of all tables, consistent with the marginals, that are no more likely than the observed table.
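A minimal sketch of these p-values for a 2×2 table, assuming only the Python standard library (math.comb); for simplicity it sums hypergeometric probabilities over the full support of the first cell rather than reordering the table, which gives the same results. Names are illustrative.

from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Left-, right- and two-sided Fisher exact p-values for a 2x2 table."""
    N, K, n = a + b + c + d, a + b, a + c
    denom = comb(N, n)
    def prob(i):                              # P(X = i) for the (1,1) cell
        return comb(K, i) * comb(N - K, n - i) / denom
    lo, hi = max(0, n - (N - K)), min(K, n)   # support of the cell count
    p_obs = prob(a)
    p_left = sum(prob(i) for i in range(lo, a + 1))
    p_right = sum(prob(i) for i in range(a, hi + 1))
    p_two = sum(prob(i) for i in range(lo, hi + 1) if prob(i) <= p_obs + 1e-12)
    return p_left, p_right, p_two

print(fisher_exact_2x2(1, 9, 11, 3))          # example table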
McNemar's Test
This test is computed for 2×2 tables such as

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}
The value of the test-statistic is
šœ’2
=
(š‘ āˆ’ š‘)2
š‘ + š‘
The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom.
Edwards Continuity Correction
The value of the test-statistic is
šœ’2
=
(š‘šš‘Žš‘„{0, |š‘ āˆ’ š‘| āˆ’ 1})2
š‘ + š‘
When |š‘ āˆ’ š‘| āˆ’ 1 is below zero, the statistic is zero.
The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom.
McNemar Exact Binomial
Assume that b < c. Let n = b + c, and let B(x, n, p) denote the binomial probability of x successes in n trials with success probability p.
\text{Two-sided } p\text{-value} = 2\cdot(\text{one-sided } p\text{-value}) = 2\sum_{x=0}^{b} B(x, n, 0.5) = 2\sum_{x=0}^{b}\binom{n}{x}0.5^{x}\,0.5^{n-x} = \frac{2}{2^{n}}\sum_{x=0}^{b}\binom{n}{x}
If b = c, the exact p-value equals 1.0.
Mid-P McNemar Test
Let n = b + c and assume that b < c.
\text{Mid-}P\text{ value} = 2\sum_{x=0}^{b} B(x, n, 0.5) - B(b, n, 0.5) = \frac{2}{2^{n}}\sum_{x=0}^{b}\binom{n}{x} - \frac{1}{2^{n}}\binom{n}{b}
If b = c, the mid p-value is 1.0 - \frac{1}{2}\binom{n}{b}\frac{1}{2^{n}}.
Bowkerā€™s Test of Symmetry
This test is computed for an m-by-m square table as:

BW = \sum_{i=2}^{m}\sum_{j=1}^{i-1}\frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}}
For large samples, BW has an asymptotic chi-square distribution with m(m - 1)/2 - R degrees of freedom under the null hypothesis of symmetry, where R is the number of off-diagonal cell pairs with n_ij + n_ji = 0.
Risk Test
Let the 2×2 table be

                               Disease status
Risk Factor        Cohort = Present    Cohort = Absent
Present                    a                   b
Absent                     c                   d
Odds ratio
The odds ratio (Risk Factor = Present / Risk Factor = Absent) is computed as:
š‘‚š‘… =
š‘Ž
š‘ā„
š‘
š‘‘ā„
The distribution of the log odds ratio is approximately normal with:
šœ’ ~ š‘(log(š‘‚š‘…) , šœŽ2
)
The standard error for the log odds ratio is approximately
š‘†šø = āˆš
1
š‘Ž
+
1
š‘
+
1
š‘
+
1
š‘‘
The 95% confidence interval for the odds ratio is computed as

\left[\exp\!\left(\log(OR) - z_{0.025}\,SE\right)\ ;\ \exp\!\left(\log(OR) + z_{0.025}\,SE\right)\right]
To test the hypothesis that the population odds ratio equals one, the two-sided p-value is computed as

\text{significance (2-sided)} = 2\cdot P\!\left(z \le \frac{-\left|\log(OR)\right|}{SE}\right)
Source: https://en.wikipedia.org/wiki/Odds_ratio
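A minimal sketch of the odds ratio, its Wald confidence interval on the log scale, and the two-sided p-value against OR = 1, assuming SciPy for the normal quantile and CDF; names are illustrative.

import math
from scipy.stats import norm

def odds_ratio_ci(a, b, c, d, conf=0.95):
    """Odds ratio, confidence interval and two-sided p-value against OR = 1."""
    or_hat = (a / b) / (c / d)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    z = norm.ppf(1 - (1 - conf) / 2)              # z_{0.025} for conf = 0.95
    lo = math.exp(math.log(or_hat) - z * se)
    hi = math.exp(math.log(or_hat) + z * se)
    p = 2 * norm.cdf(-abs(math.log(or_hat)) / se)
    return or_hat, (lo, hi), p

print(odds_ratio_ci(20, 80, 10, 90))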
Relative Risk
The relative risk (for cohort Disease status = Present) is computed as
š‘…š‘… =
š‘Ž
š‘Ž + š‘ā„
š‘
š‘ + š‘‘ā„
The distribution of the log relative risk is approximately normal with:

\log(\widehat{RR}) \sim N\!\left(\log(RR),\ \sigma^2\right)
The standard error for the log relative risk is approximately
š‘†šø = āˆš
1
š‘Ž
+
1
š‘
āˆ’
1
š‘Ž + š‘
āˆ’
1
š‘ + š‘‘
The 95% confidence interval for the relative risk is computed as

\left[\exp\!\left(\log(RR) - z_{0.025}\,SE\right)\ ;\ \exp\!\left(\log(RR) + z_{0.025}\,SE\right)\right]
To test the hypothesis that the population relative risk equals one, the two-sided p-value is computed as

\text{significance (2-sided)} = 2\cdot P\!\left(z \le \frac{-\left|\log(RR)\right|}{SE}\right)
The relative risk (for cohort Disease status = Absent) is computed as
š‘…š‘… =
š‘
š‘Ž + š‘ā„
š‘‘
š‘ + š‘‘ā„
Epidemiology Risk
All the parameters are computed for cohort Disease status = Present.
Attributable risk represents how much the risk factor increases or decreases the risk of disease:

AR = \frac{a}{a+b} - \frac{c}{c+d}

If AR > 0 there is an increase in risk; if AR < 0 there is a reduction in risk.
Relative Attributable Risk
š‘…š‘… =
š‘Ž
š‘Ž + š‘
āˆ’
š‘
š‘ + š‘‘
š‘
š‘ + š‘‘
=
š“š‘…
š‘
š‘ + š‘‘
Number Needed to Harm
š‘š‘š» =
1
š‘Ž
š‘Ž + š‘
āˆ’
š‘
š‘ + š‘‘
=
1
š“š‘…
The number needed to harm (NNH) is an epidemiological measure that indicates how many patients on
average need to be exposed to a risk-factor over a specific period to cause harm in an average of one
patient who would not otherwise have been harmed.
A negative number would not be presented as an NNH; rather, since the risk factor is not harmful, it is expressed as a number needed to treat (NNT), or the number of patients that must avoid exposure to the risk factor.
Attributable risk per unit
š“š‘…š‘ƒ =
š‘…š‘… āˆ’ 1
š‘…š‘…
Preventive fraction
š‘ƒš¹ = 1 āˆ’ š‘…š‘…
Etiologic fraction is the proportion of cases in which the exposure has played a causal role in disease
development.
šøš¹ =
š‘Ž āˆ’ š‘
š‘Ž
Similar parameters are computed for the cohort Disease status = Absent.
Source: https://en.wikipedia.org/wiki/Relative_risk
Cohen's Kappa Test
Given a k-by-k square matrix that collects the scores of two raters who each classify N items into k mutually exclusive categories, the equation for Cohen's kappa coefficient is

\hat{\kappa} = \frac{p_o - p_e}{1 - p_e}
Where
š‘ š‘œ = āˆ‘
š‘›š‘–š‘–
š‘
= āˆ‘ š‘š‘–š‘–
š‘˜
š‘–=1
š‘˜
š‘–=1
š‘Žš‘›š‘‘ š‘š‘’ = āˆ‘ š‘š‘–.
š‘.š‘–
š‘˜
š‘–=1
š‘¤ā„Žš‘’š‘Ÿš‘’ š‘š‘–š‘— =
š‘›š‘–š‘—
š‘
š‘Žš‘›š‘‘ š‘š‘–. = āˆ‘
š‘›š‘–š‘—
š‘
š‘˜
š‘—=1
š‘Žš‘›š‘‘ š‘.š‘— = āˆ‘
š‘›š‘–š‘—
š‘
š‘˜
š‘–=1
The asymptotic variance is computed by
š‘£š‘Žš‘Ÿ(š‘˜Ģ‚) =
1
š‘(1 āˆ’ š‘š‘’)4
{ āˆ‘ š‘š‘–š‘–[(1 āˆ’ š‘š‘’) āˆ’ (š‘.š‘– + š‘š‘–.)(1 āˆ’ š‘ š‘œ)]2
š‘˜
š‘–=1
+ (1 āˆ’ š‘0)2
āˆ‘ āˆ‘ š‘š‘–š‘—(š‘.š‘– + š‘š‘—.)2
š‘˜
š‘—=1,š‘—ā‰ š‘–
āˆ’ (š‘ š‘œ š‘š‘’ āˆ’ 2š‘š‘’ + š‘ š‘œ)2
š‘˜
š‘–=1
}
The formula is given by Fleiss, Cohen, and Everitt (1969) and modified by Fleiss (1981). The asymptotic standard error is the square root of the value given above. This standard error and the standard normal distribution N(0,1) are used to compute confidence intervals:

\hat{\kappa} \pm z_{\alpha/2}\sqrt{var(\hat{\kappa})}
To compute an asymptotic test for the kappa coefficient, ISSTATS uses a standardized test statistic T which has an asymptotic standard normal distribution under the null hypothesis that kappa equals zero (H0: κ = 0). The standardized test statistic is computed as

T = \frac{\hat{\kappa}}{\sqrt{var_0(\hat{\kappa})}} \approx N(0, 1)
Where the variance of the kappa coefficient under the null hypothesis is
š‘£š‘Žš‘Ÿ0(š‘˜Ģ‚) =
1
š‘(1 āˆ’ š‘š‘’)2
{ š‘š‘’ + š‘š‘’
2
āˆ’ āˆ‘ š‘.š‘– š‘š‘–.(š‘.š‘–+ š‘š‘–.)
š‘˜
š‘–=1
}
Refer to Fleiss (1981)
Source: https://v8doc.sas.com/sashtml/stat/chap28/sect26.htm
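A hedged Python sketch of kappa with the two variances above, assuming NumPy/SciPy and illustrative names; it follows the Fleiss-Cohen-Everitt expressions directly, so it is a reading aid rather than the ISSTATS implementation.

import numpy as np
from scipy.stats import norm

def cohen_kappa(table, conf=0.95):
    """Cohen's kappa, confidence interval, and p-value of the test kappa = 0."""
    n = np.asarray(table, dtype=float)
    N = n.sum()
    p = n / N
    r, c = p.sum(axis=1), p.sum(axis=0)           # row totals p_i., column totals p_.j
    po, pe = np.trace(p), np.sum(r * c)
    kappa = (po - pe) / (1 - pe)
    pii = np.diag(p)
    term1 = np.sum(pii * ((1 - pe) - (c + r) * (1 - po))**2)
    w = np.add.outer(c, r)                        # w[i, j] = p_.i + p_j.
    off = np.sum(p * w**2) - np.sum(pii * (c + r)**2)
    var = (term1 + (1 - po)**2 * off - (po*pe - 2*pe + po)**2) / (N * (1 - pe)**4)
    var0 = (pe + pe**2 - np.sum(c * r * (c + r))) / (N * (1 - pe)**2)
    z = norm.ppf(1 - (1 - conf) / 2)
    ci = (kappa - z * np.sqrt(var), kappa + z * np.sqrt(var))
    t = kappa / np.sqrt(var0)
    return kappa, ci, 2 * norm.sf(abs(t))

print(cohen_kappa([[20, 5], [10, 15]]))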
Nominal by Nominal Measures of Association
Contingency Coefficient
The contingency coefficient is a measure of association between two nominal variables, giving a value between 0 and 1.

C = \sqrt{\frac{\chi^2}{\chi^2 + N}}

Where
• χ² is Pearson's cumulative test statistic, which asymptotically approaches a χ² distribution with (r - 1)(c - 1) degrees of freedom.
• N is the total sample size.
Standardized Contingency Coefficient
If X and Y have the same number of categories (r = c), then the maximum value for the contingency
coefficient is calculated as:
š‘ š‘šš‘Žš‘„ = āˆš
š‘Ÿ āˆ’ 1
š‘Ÿ
If X and Y have a differing number of categories (r ā‰  c), then the maximum value for the contingency
coefficient is calculated as
š‘ š‘šš‘Žš‘„ = āˆš
(š‘Ÿ āˆ’ 1)(š‘ āˆ’ 1)
š‘Ÿ āˆ— š‘
4
The standardized contingency coefficient is calculated as the ratio:
š‘š‘†š‘”š‘Žš‘›š‘‘š‘Žš‘Ÿš‘‘š‘–š‘§š‘’š‘‘ =
š¶
š‘ š‘šš‘Žš‘„
which varies between 0 and 1 with 0 indicating independence and 1 dependence.
Phi coefficient
The phi coefficient is a measure of association for two nominal variables.
š›· = āˆš
šœ’2
š‘
Where
ļ‚· šœ’2
is the Pearson's cumulative test statistic.
ļ‚· N is the total sample size.
Phi asymptotically approaches a šœ’2
distribution with (r - 1)(c - 1) degrees of freedom.
Cramer's V
Cramer's V is a measure of association between two nominal variables, giving a value between 0 and +1
(inclusive).
š‘‰ = āˆš
šœ’2
š‘
ā„
š‘šš‘–š‘›{š‘Ÿ āˆ’ 1, š‘ āˆ’ 1}
Where
ļ‚· šœ’2
is the Pearson's cumulative test statistic.
ļ‚· N is the total sample size.
V asymptotically approaches a Ļ‡2
distribution with (r - 1)(c - 1) degrees of freedom.
Tschuprow's T
Tschuprow's T is a measure of association between two nominal variables, giving a value between 0 and
1 (inclusive).
š‘‡ = āˆš
šœ’2
š‘
ā„
āˆš(š‘Ÿ āˆ’ 1)(š‘ āˆ’ 1)
Lambda
Asymmetric lambda, Ī»(C/R) or column variable dependent, is interpreted as the probable improvement in
predicting the column variable Y given knowledge of the row variable X. The range of asymmetric
lambda is {0, 1}. Asymmetric lambda (C/R) or column variable dependent is computed as
šœ†(š¶/š‘…) =
āˆ‘ š‘Ÿš‘–š‘– āˆ’ š‘Ÿ
š‘ āˆ’ š‘Ÿ
The asymptotic variance is
š‘£š‘Žš‘Ÿ( šœ†(š¶/š‘…)) =
š‘ āˆ’ āˆ‘ š‘Ÿš‘–š‘–
( š‘Ÿ āˆ’ š‘)3
{ āˆ‘ š‘Ÿš‘–
š‘–
+ š‘Ÿ āˆ’ 2 āˆ‘(š‘Ÿš‘–|š‘™š‘– = š‘™)
š‘–
}
Where
š‘Ÿš‘– = max
š‘—
{š‘›š‘–š‘—} š‘Žš‘›š‘‘ š‘Ÿ = max
š‘—
{š‘Ÿ.š‘—} š‘Žš‘›š‘‘ š‘š‘— = max
š‘–
{š‘›š‘–š‘—} š‘Žš‘›š‘‘ š‘ = max
š‘–
{š‘›š‘–.}
The values of l_i and l are determined as follows. Denote by l_i the unique value of j such that r_i = n_ij, and let l be the unique value of j such that r = n_.j. Because of the uniqueness assumptions, ties in the frequencies or in the marginal totals must be broken in an arbitrary but consistent manner. In case of ties, l is defined as the smallest value of j such that r = n_.j.
For those columns containing a cell (i, j) for which n_ij = r_i = c_j, cs_j records the row in which c_j is assumed to occur. Initially cs_j is set equal to -1 for all j. Beginning with i = 1, if there is at least one value j such that n_ij = r_i = c_j, and if cs_j = -1, then l_i is defined to be the smallest such value of j, and cs_j is set equal to i. Otherwise, if n_il = r_i, then l_i is defined to be equal to l. If neither condition is true, then l_i is taken to be the smallest value of j such that n_ij = r_i.
The asymptotic standard error is the square root of the asymptotic variance.
The formulas for lambda asymmetric Ī»(R/C) can be obtained by interchanging the indices.
šœ†(š‘…/š¶) =
āˆ‘ š‘š‘—š‘— āˆ’ š‘
š‘ āˆ’ š‘
The Symmetric lambda is the average of the two asymmetric lambdas, Ī»(C/R) and Ī»(R/C). Its range is {-
1, 1}. Lambda symmetric is computed as
šœ† =
āˆ‘ š‘Ÿš‘–š‘– + āˆ‘ š‘š‘—š‘— āˆ’ š‘Ÿ āˆ’ š‘
2š‘ āˆ’ š‘Ÿ āˆ’ š‘
The asymptotic variance is
š‘£š‘Žš‘Ÿ( šœ†) =
1
š‘¤4
{ š‘¤š‘£š‘¦ āˆ’ 2š‘¤2
[š‘ āˆ’ āˆ‘ āˆ‘(š‘›š‘–š‘—|š‘— = š‘™š‘–, š‘– = š‘˜š‘—)
š‘—š‘–
] āˆ’ 2š‘£2
(š‘ āˆ’ š‘› š‘˜š‘™)}
Where
š‘¤ = 2š‘› āˆ’ š‘Ÿ āˆ’ š‘ š‘Žš‘›š‘‘ š‘£ = 2š‘› āˆ’ āˆ‘ š‘Ÿš‘–
š‘–
āˆ’ āˆ‘ š‘š‘—
š‘—
š‘Žš‘›š‘‘ š‘„
= āˆ‘(š‘Ÿš‘–
| š‘™š‘– = š‘™)
š‘–
+ āˆ‘(š‘š‘—
| š‘˜š‘— = š‘˜)
š‘—
+ š‘Ÿš‘˜ + š‘š‘™ š‘Žš‘›š‘‘ š‘¦ = 8š‘ āˆ’ š‘¤ āˆ’ š‘£ āˆ’ 2š‘„
The definitions of l and li are given in the previous section. The values k and kj are defined in a similar
way for lambda asymmetric (R/C).
Uncertainty Coefficient
The uncertainty coefficient U(C/R), or column-variable-dependent U, measures the proportion of uncertainty (entropy) in the column variable Y that is explained by the row variable X. Its range is {0, 1}. The uncertainty coefficient is computed as

U(C/R) = U_{\text{column variable dependent}} = \frac{H(X) + H(Y) - H(XY)}{H(Y)}
Where
š»(š‘‹) = āˆ’ āˆ‘
š‘›š‘–.
š‘›
ln (
š‘›š‘–.
š‘›
)
š‘–
š‘Žš‘›š‘‘ š»(š‘Œ) = āˆ’ āˆ‘
š‘›.š‘—
š‘›
ln (
š‘›.š‘—
š‘›
)
š‘–
š‘Žš‘›š‘‘ š»(š‘‹š‘Œ)
= āˆ’ āˆ‘ āˆ‘
š‘›š‘–š‘—
š‘›
ln (
š‘›š‘–š‘—
š‘›
)
š‘—š‘–
The asymptotic variance is
š‘£š‘Žš‘Ÿ(š‘ˆ(š¶/š‘…)) =
1
š‘›2 š‘¤4
āˆ‘ āˆ‘ š‘›š‘–š‘— {š»(š‘Œ) ln (
š‘›š‘–š‘—
š‘›š‘–.
) + (H(X) āˆ’ H(XY)) ln (
š‘›.š‘—
š‘›
)}
2
š‘—š‘–
The asymptotic standard error is the root square of the asymptotic variance.
The formulas for the uncertainty coefficient U(R/C) can be obtained by interchanging the indices.
The symmetric uncertainty coefficient is computed as
š‘ˆ =
2 āˆ— [H(X) + H(Y) āˆ’ H(XY)]
H(X) + H(Y)
The asymptotic variance is
š‘£š‘Žš‘Ÿ(š‘ˆ) = 4 āˆ‘ āˆ‘
š‘›š‘–š‘— {š»(š‘‹š‘Œ) ln (
š‘›š‘–. š‘›.š‘—
š‘›2 ) āˆ’ (H(X) āˆ’ H(Y)) ln (
š‘›.š‘—
š‘› )}
2
š‘›2(H(X) + H(Y))4
š‘—š‘–
The asymptotic standard error is the root square of the asymptotic variance.
Ordinal by Ordinal Measures of Association
Let n_ij denote the observed frequency in cell (i, j) of an I×J contingency table. Let N be the total frequency and

A_{ij} = \sum_{k<i}\sum_{l<j} n_{kl} + \sum_{k>i}\sum_{l>j} n_{kl}

D_{ij} = \sum_{k>i}\sum_{l<j} n_{kl} + \sum_{k<i}\sum_{l>j} n_{kl}

P = \sum_i\sum_j n_{ij}\,A_{ij} \qquad \text{and} \qquad Q = \sum_i\sum_j n_{ij}\,D_{ij}
Gamma Coefficient
The gamma (G) statistic is based only on the number of concordant and discordant pairs of observations.
It ignores tied pairs (that is, pairs of observations that have equal values of X or equal values of Y).
Gamma is appropriate only when both variables lie on an ordinal scale. The range of gamma is {-1, 1}. If
the row and column variables are independent, then gamma tends to be close to zero.
Gamma is estimated by
šŗ =
š‘ƒ āˆ’ š‘„
š‘ƒ + š‘„
The asymptotic variance is
š‘£š‘Žš‘Ÿ(šŗ) =
16
( š‘ƒ + š‘„)2
{ āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— (š‘„š“š‘–š‘— āˆ’ š‘ƒš·š‘–š‘—)2
š½
š‘—=1
š¼
š‘–=1
}
The asymptotic standard error is the root square of the asymptotic variance.
The variance under the null hypothesis that gamma equals zero is computed as
š‘£š‘Žš‘Ÿ0(šŗ) =
4
( š‘ƒ + š‘„)2
{ āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘—
2
š½
š‘—=1
āˆ’
(š‘ƒ āˆ’ š‘„)2
š‘
š¼
š‘–=1
}
Where dij = Aij - Dij
The asymptotic standard error under the null hypothesis that d equals zero is the root square of the
variance.
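A minimal sketch of the concordance quantities A_ij, D_ij, P and Q, and of gamma, assuming NumPy; the nested loops follow the definitions above directly and the names are illustrative.

import numpy as np

def concordance(table):
    """P, Q and the gamma statistic for an I x J ordinal table."""
    n = np.asarray(table, dtype=float)
    I, J = n.shape
    A = np.zeros_like(n)
    D = np.zeros_like(n)
    for i in range(I):
        for j in range(J):
            A[i, j] = n[:i, :j].sum() + n[i+1:, j+1:].sum()
            D[i, j] = n[i+1:, :j].sum() + n[:i, j+1:].sum()
    P = np.sum(n * A)
    Q = np.sum(n * D)
    return P, Q, (P - Q) / (P + Q)

print(concordance([[10, 5, 2], [4, 8, 6], [1, 3, 9]]))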
Kendall's tau-b
Kendallā€™s tau-b is similar to gamma except that tau-b uses a correction for ties. Tau-b is appropriate only
when both variables lie on an ordinal scale. The range of tau-b is {-1, 1}. Kendallā€™s tau-b is estimated by
šœ š‘ =
š‘ƒ āˆ’ š‘„
š‘¤
Where
š‘¤š‘Ÿ = š‘›2
āˆ’ āˆ‘ š‘›š‘–.
2
š‘–
š‘Žš‘›š‘‘ š‘¤š‘ = š‘›2
āˆ’ āˆ‘ š‘›.š‘—
2
š‘–
š‘Žš‘›š‘‘ š‘¤ = āˆš š‘¤š‘Ÿ š‘¤š‘
The asymptotic variance is
š‘£š‘Žš‘Ÿ( šœ š‘) =
1
š‘¤4
{ āˆ‘ āˆ‘ š‘›š‘–š‘—(2š‘¤š‘‘š‘–š‘— + šœ š‘ š‘£š‘–š‘—)2
š½
š‘—=1
š¼
š‘–=1
āˆ’ š‘3
šœ š‘
2
( š‘¤ š‘Ÿ + š‘¤ š‘)2
}
where
š‘£š‘–š‘— = š‘¤ š‘ š‘›š‘–. + š‘¤ š‘Ÿ š‘›.š‘—
The asymptotic standard error is the root square of the asymptotic variance.
The variance under the null hypothesis that tau-b equals zero is computed as
š‘£š‘Žš‘Ÿ0( šœ š‘) =
4
š‘¤ š‘Ÿ š‘¤ š‘
{ āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘—
2
š½
š‘—=1
āˆ’
(š‘ƒ āˆ’ š‘„)2
š‘
š¼
š‘–=1
}
The asymptotic standard error under the null hypothesis that d equals zero is the root square of the
variance.
Stuart-Kendall's tau-c
Stuart-Kendallā€™s tau-c makes an adjustment for table size in addition to a correction for ties. Tau-c is
appropriate only when both variables lie on an ordinal scale. The range of tau-c is {-1, 1}. Stuart-
Kendallā€™s tau-c is estimated by
šœ š‘ =
š‘š(š‘ƒ āˆ’ š‘„)
š‘2(š‘š āˆ’ 1)
Where m =min {I, J}.
The asymptotic variance is
š‘£š‘Žš‘Ÿ( šœ š‘) =
4š‘š2
š‘4
(š‘š āˆ’ 1)2 { āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘—
2
š½
š‘—=1
āˆ’
(š‘ƒ āˆ’ š‘„)2
š‘
š¼
š‘–=1
}
The asymptotic standard error is the root square of the asymptotic variance.
The variance under the null hypothesis that tau-c equals zero is the same as the asymptotic variance.
Somers' D
Somersā€™ D(C/R) and Somersā€™ D(R/C) are asymmetric modifications of tau-b. C/R indicates that the row
variable X is regarded as the independent variable and the column variable Y is regarded as dependent.
Similarly, R/C indicates that the column variable Y is regarded as the independent variable and the row
variable X is regarded as dependent. Somersā€™ D differs from tau-b in that it uses a correction only for
pairs that are tied on the independent variable. Somersā€™ D is appropriate only when both variables lie on
an ordinal scale. The range of Somersā€™ D is {-1, 1}. Somersā€™ D is computed as
š·(š¶/š‘…) = š· š‘š‘œš‘™š‘¢š‘šš‘› š‘£š‘Žš‘Ÿš‘–š‘Žš‘š‘™š‘’ š‘‘š‘’š‘š‘’š‘›š‘‘š‘’š‘›š‘” =
š‘ƒ āˆ’ š‘„
š‘¤š‘Ÿ
The asymptotic variance is
š‘£š‘Žš‘Ÿ( š·(š¶/š‘…)) =
4
š‘¤ š‘Ÿ
4
{ āˆ‘ āˆ‘ š‘›š‘–š‘—[š‘¤š‘Ÿ š‘‘š‘–š‘— āˆ’ (š‘ƒ āˆ’ š‘„)(š‘ āˆ’ š‘›š‘–.)]2
š½
š‘—=1
š¼
š‘–=1
}
The asymptotic standard error is the root square of the asymptotic variance.
The variance under the null hypothesis that D(C/R) equals zero is computed as
š‘£š‘Žš‘Ÿ0( š·(š¶/š‘…)) =
4
š‘¤ š‘Ÿ
2
{ āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘—
2
š½
š‘—=1
āˆ’
(š‘ƒ āˆ’ š‘„)2
š‘
š¼
š‘–=1
}
The asymptotic standard error under the null hypothesis that D(C/R) equals zero is the root square of the
variance.
Formulas for Somersā€™ D(R/C) are obtained by interchanging the indices.
The symmetric version of Somers' d is

d = \frac{P - Q}{(w_r + w_c)/2}

The standard error is

ASE(d) = \frac{2\,\sigma_{\tau_b}\,w}{w_r + w_c}

where σ_τb is the asymptotic standard error of Kendall's tau-b.
The variance under the null hypothesis that d equals zero is computed as
š‘£š‘Žš‘Ÿ0(š‘‘) =
16
( š‘¤ š‘Ÿ + š‘¤ š‘)2
{ āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘—
2
š½
š‘—=1
āˆ’
(š‘ƒ āˆ’ š‘„)2
š‘
š¼
š‘–=1
}
The asymptotic standard error under the null hypothesis that d equals zero is the root square of the
variance.
Confidence Bounds and One-Sided Tests
Suppose you are testing the null hypothesis H0: μ ≥ μ0 against the one-sided alternative H1: μ < μ0. Rather than give a two-sided confidence interval for μ, the more appropriate procedure in this setting is to give an upper confidence bound. This upper confidence bound has a direct relationship to the one-sided test, namely:
1. A level α test of H0: μ ≥ μ0 against the one-sided alternative H1: μ < μ0 rejects H0 exactly when the value μ0 is above the 1-α upper confidence bound.
2. A level α test of H0: μ ≤ μ0 against the one-sided alternative H1: μ > μ0 rejects H0 exactly when the value μ0 is below the 1-α lower confidence bound.
ANOVA Test
š‘†š‘†š‘‡š‘œš‘”š‘Žš‘™ = āˆ‘ āˆ‘(š‘¦š‘–š‘— āˆ’ š‘Œ..
Ģ…)2
š‘š‘–
š‘—=1
š‘˜
š‘–=1
š‘†š‘†š¼š‘›š‘”š‘’š‘Ÿ = āˆ‘ š‘›š‘–(š‘ŒĢ…š‘–. āˆ’ š‘Œ..
Ģ…)2
š‘˜
š‘–=1
š‘†š‘†š¼š‘›š‘”š‘Ÿš‘Ž = āˆ‘ āˆ‘(š‘¦š‘–š‘— āˆ’ š‘Œš‘–.
Ģ… )2
š‘› š‘–
š‘—=1
š‘˜
š‘–=1
= š‘†š‘†š‘‡š‘œš‘”š‘Žš‘™ āˆ’ š‘†š‘†š¼š‘›š‘”š‘’š‘Ÿ
DF Total = N ā€“ 1
DF Inter = k ā€“ 1
DF Intra = N ā€“ k
š‘€š‘†š‘‡š‘œš‘”š‘Žš‘™ =
SSTotal
DFTotal
š‘€š‘†š¼š‘›š‘”š‘’š‘Ÿ =
SSInter
DFInter
š‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž =
SSIntra
DFIntra
š¹ =
MSInter
MSIntra
where
• F is the result of the test
• k is the number of different groups to which the sampled cases belong
• N = \sum_{i=1}^{k} n_i is the total sample size
• n_i is the number of cases in the i-th group
• y_ij is the value of the measured variable for the j-th case from the i-th group
• Ȳ.. is the mean of all y_ij
• Ȳ_i. is the mean of the y_ij for group i.
The test statistic has an F-distribution with DF Inter and DF Intra degrees of freedom. Thus the null hypothesis is rejected if F \ge F(1-\alpha)_{k-1,\ N-k}.
ANOVA Multiple Comparisons
Difference of Means
š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘—
Standard Error of the Difference of Means Estimator
š‘†š‘”š‘‘. šøš‘Ÿš‘Ÿš‘œš‘Ÿ = āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— (
1
š‘›š‘–
+
1
š‘›š‘—
)
Scheffeā€™s Method
Confidence Interval for Difference of Means
š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± āˆšš·š¹š¼š‘›š‘”š‘’š‘Ÿ āˆ— š‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— š¹(1 āˆ’ š›¼) š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž
š·š¹ š¼š‘›š‘”š‘’š‘Ÿ
āˆ— (
1
š‘›š‘–
+
1
š‘›š‘—
)
Source: http://en.wikipedia.org/wiki/Scheff%C3%A9%27s_method
Tukey's range test HSD
Confidence Interval for Difference of Means
š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘ž(1 āˆ’ š›¼) š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž
š‘˜
āˆš
š‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž
2
āˆ— (
1
š‘›š‘–
+
1
š‘›š‘—
)
Where q is the studentized range distribution.
Source: https://en.wikipedia.org/wiki/Tukey%27s_range_test
Fisher's Method LSD
If the overall ANOVA test is not significant, you must not consider any results of the Fisher test, significant or not.
Confidence Interval for Difference of Means
š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘”(1 āˆ’ š›¼
2ā„ )
š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž
āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— (
1
š‘›š‘–
+
1
š‘›š‘—
)
Where t is the student distribution.
Bonferroni's Method
The family-wise significance level (FWER) is α = 1 - Confidence Level. Thus any comparison flagged by ISSTATS as significant is based on a Bonferroni correction:
š›¼ā€² =
2š›¼
š‘˜(š‘˜ āˆ’ 1)
š‘ā€² = š‘
š‘˜(š‘˜ āˆ’ 1)
2
Where k is the number of groups.
Confidence Interval for Difference of Means
š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘” (1 āˆ’ š›¼ā€²
2ā„ )
š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž
āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— (
1
š‘›š‘–
+
1
š‘›š‘—
)
Where t is the student distribution.
Sidak's Method
The family-wise significance level (FWER) is α = 1 - Confidence Level. So any comparison flagged by ISSTATS as significant is based on a Sidak correction:

\alpha' = 1 - (1 - \alpha)^{\frac{2}{k(k - 1)}}, \qquad p' = 1 - e^{\log(1 - p)\,\frac{k(k - 1)}{2}}
Where k is the number of groups.
Confidence Interval for Difference of Means
š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘” (1 āˆ’ š›¼ā€²
2ā„ )
š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž
āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— (
1
š‘›š‘–
+
1
š‘›š‘—
)
Where t is the student distribution.
Welchā€™s Test for equality of means
The test statistic, F*, is defined as follows:

F^* = \frac{\dfrac{\sum_{i=1}^{k} w_i(\bar{x}_i - \tilde{X})^2}{k - 1}}{1 + \dfrac{2(k - 2)}{k^2 - 1}\sum_{i=1}^{k} h_i}
where
• F* is the result of the test
• k is the number of different groups to which the sampled cases belong
• n_i is the number of cases in the i-th group
• w_i = n_i / S_i^2
ļ‚· š‘Š = āˆ‘ š‘¤š‘– = āˆ‘
š‘› š‘–
š‘†š‘–
2
š‘˜
š‘–=1
š‘˜
š‘–=1
ļ‚· š‘‹Ģƒ =
āˆ‘ š‘¤ š‘– š‘„Ģ… š‘–
š‘˜
š‘–=1
š‘Š
ļ‚· ā„Žš‘– =
(1āˆ’
š‘¤ š‘–
š‘Š
)
2
š‘› š‘–āˆ’1
The test statistic has approximately an F-distribution with k - 1 and df = \frac{k^2 - 1}{3\sum_{i=1}^{k} h_i} degrees of freedom. Thus the null hypothesis is rejected if F^* \ge F(1-\alpha)_{k-1,\ df}.
Brownā€“Forsythe Test for equality of means
The test statistic, F*, is defined as follows:

F^* = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{X}_{..})^2}{\sum_{i=1}^{k}\left(1 - \dfrac{n_i}{N}\right)S_i^2}
where
• F* is the result of the test
• k is the number of different groups to which the sampled cases belong
• n_i is the number of cases in the i-th group (sample size of group i)
• N = \sum_{i=1}^{k} n_i is the total sample size
• \bar{X}_{..} = \sum_{i=1}^{k} n_i\,\bar{x}_i / N is the overall mean.
The test statistic has approximately a F-distribution with k-1 and df degrees of freedom. Where df is
obtained with the Satterthwaite (1941) approximation as
\frac{1}{df} = \sum_{i=1}^{k}\frac{c_i^2}{n_i - 1}
with
š‘š‘— =
(1 āˆ’
š‘›š‘—
š‘) š‘†š‘—
2
āˆ‘ (1 āˆ’
š‘›š‘–
š‘) š‘†š‘–
2š‘˜
š‘–=1
Thus the null hypothesis is rejected if F^* \ge F(1-\alpha)_{k-1,\ df}.
Homoscedasticity Tests
Levene's Test
The test statistic, F, is defined as follows:
š¹ =
š‘ āˆ’ š‘˜
š‘˜ āˆ’ 1
āˆ—
āˆ‘ š‘›š‘–(š‘Ģ…š‘–. āˆ’ š‘Ģ…..)2š‘˜
š‘–=1
āˆ‘ āˆ‘ (š‘š‘–š‘— āˆ’ š‘Ģ…š‘–.)2š‘› š‘–
š‘—=1
š‘˜
š‘–=1
where
• F is the result of the test
• k is the number of different groups to which the sampled cases belong
• N = \sum_{i=1}^{k} n_i is the total sample size
• n_i is the number of cases in the i-th group
• Y_ij is the value of the measured variable for the j-th case from the i-th group
• Z_{ij} = |Y_{ij} - \bar{Y}_{i.}|, where \bar{Y}_{i.} is the mean of the i-th group
• Z̄.. is the mean of all Z_ij
• Z̄_i. is the mean of the Z_ij for group i.
The test statistic has an F-distribution with k - 1 and N - k degrees of freedom. Thus the null hypothesis is rejected if F \ge F(1-\alpha)_{k-1,\ N-k}.
Source: http://en.wikipedia.org/wiki/Levene%27s_test
Brownā€“Forsythe Test for equality of variances
The test statistic, F, is defined as follows:
š¹ =
š‘ āˆ’ š‘˜
š‘˜ āˆ’ 1
āˆ—
āˆ‘ š‘›š‘–(š‘Ģ…š‘–. āˆ’ š‘Ģ…..)2š‘˜
š‘–=1
āˆ‘ āˆ‘ (š‘š‘–š‘— āˆ’ š‘Ģ…š‘–.)2š‘›š‘–
š‘—=1
š‘˜
š‘–=1
where
• F is the result of the test
• k is the number of different groups to which the sampled cases belong
• N = \sum_{i=1}^{k} n_i is the total sample size
• n_i is the number of cases in the i-th group
• Y_ij is the value of the measured variable for the j-th case from the i-th group
• Z_{ij} = |Y_{ij} - \tilde{Y}_{i.}|, where \tilde{Y}_{i.} is the median of the i-th group
• Z̄.. is the mean of all Z_ij
• Z̄_i. is the mean of the Z_ij for group i.
The test statistic has an F-distribution with k - 1 and N - k degrees of freedom. Thus the null hypothesis is rejected if F \ge F(1-\alpha)_{k-1,\ N-k}.
Source: http://en.wikipedia.org/wiki/Levene%27s_test
Bartlett's Test
Bartlett's test is used to test the null hypothesis, H0 that all k population variances are equal against the
alternative that at least two are different.
If there are k samples with sizes n_i and sample variances S_i^2, then Bartlett's test statistic is

\chi^2 = \frac{(N - k)\ln(S_p^2) - \sum_{i=1}^{k}(n_i - 1)\ln(S_i^2)}{1 + \dfrac{1}{3(k - 1)}\left(\sum_{i=1}^{k}\dfrac{1}{n_i - 1} - \dfrac{1}{N - k}\right)}
where
ļ‚· š‘ = āˆ‘ š‘›š‘–
š‘˜
š‘–=1 is the total sample size
ļ‚· š‘† š‘
2
=
āˆ‘ (š‘› š‘–āˆ’1)š‘†š‘–
2š‘˜
š‘–=1
š‘āˆ’š‘˜
is the pooled estimate for the variance
The test statistic has approximately a chi-squared distribution with k-1 degrees of freedom. Thus the null
hypothesis is rejected if šœ’2
ā‰„ šœ’ š‘˜āˆ’1
2
(1 āˆ’ š›¼).
Source: http://en.wikipedia.org/wiki/Bartlett%27s_test
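A minimal sketch of Bartlett's statistic as defined above, assuming NumPy/SciPy; names are illustrative.

import numpy as np
from scipy.stats import chi2

def bartlett_test(*groups):
    """Bartlett's test of homogeneity of variances."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n = np.array([g.size for g in groups], dtype=float)
    s2 = np.array([g.var(ddof=1) for g in groups])
    N = n.sum()
    sp2 = np.sum((n - 1) * s2) / (N - k)           # pooled variance
    num = (N - k) * np.log(sp2) - np.sum((n - 1) * np.log(s2))
    den = 1 + (np.sum(1 / (n - 1)) - 1 / (N - k)) / (3 * (k - 1))
    stat = num / den
    return stat, chi2.sf(stat, k - 1)

print(bartlett_test([8.9, 10.2, 9.5, 9.9], [7.1, 8.4, 6.5, 9.0], [10.5, 11.9, 12.3, 10.1]))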
Bivariate Correlation Tests
Sample Covariance
S_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{N - 1}

where N is the total sample size (the number of paired observations).
Source: http://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance
Sample Pearson Product-Moment Correlation Coefficient
r = \frac{1}{N - 1}\cdot\frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{S_x S_y} = \frac{S_{xy}}{S_x S_y}
where Sx and Sy are the sample standard deviation of the paired sample (xi, yi), Sxy is the sample
covariance and N is the total sample size.
Source: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample
Test for the Significance of the Pearson Product-Moment Correlation Coefficient
The test hypotheses are:
• H0: the sample values come from a population in which ρ = 0
• H1: the sample values come from a population in which ρ ≠ 0
Test statistic is
t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}

where
• N is the total sample size
• r is the sample Pearson product-moment correlation coefficient
The test statistic has a Student's t distribution with N - 2 degrees of freedom.
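A minimal sketch of the Pearson correlation and its significance test, assuming NumPy/SciPy and illustrative names.

import numpy as np
from scipy.stats import t as t_dist

def pearson_test(x, y):
    """Sample covariance, Pearson r and two-sided p-value for H0: rho = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = x.size
    sxy = np.sum((x - x.mean()) * (y - y.mean())) / (N - 1)
    r = sxy / (x.std(ddof=1) * y.std(ddof=1))
    t = r * np.sqrt(N - 2) / np.sqrt(1 - r**2)
    return r, 2 * t_dist.sf(abs(t), N - 2)

print(pearson_test([1, 2, 3, 4, 5, 6], [2.1, 2.9, 3.2, 4.8, 5.1, 5.9]))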
Spearman Correlation Coefficient
For each of the variables X and Y separately, the observations are sorted into ascending order and
replaced by their ranks. Identical values (rank ties or value duplicates) are assigned a rank equal to the
average of their positions in the ascending order of the values. Each time t observations are tied (t > 1), the quantity t^3 - t is calculated and summed separately for each variable. These sums are designated ST_x and ST_y.
For each of the N observations, the difference between the rank of X and rank of Y is computed as:
di = Rank(Xi) āˆ’ Rank(Yi)
If there are no ties in both samples, Spearmanā€™s rho (Ļ) is calculated as
Ļ = 1 āˆ’
6 āˆ‘ š‘‘š‘–
N(š‘2 āˆ’ 1)
If there are any ties in any of the samples, Spearmanā€™s rho (Ļ) is calculated as (Siegel, 1956):
Ļ =
š‘‡š‘„ + š‘‡š‘¦ āˆ’ āˆ‘ di
2āˆš š‘‡š‘„ āˆ— š‘‡š‘¦
where
š‘‡š‘„ =
N(š‘2
āˆ’ 1) āˆ’ š‘†š‘‡š‘„
12
š‘‡š‘¦ =
N(š‘2
āˆ’ 1) āˆ’ š‘†š‘‡š‘¦
12
If Tx or Ty is 0, the statistic is not computed.
Source:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2F
alg_nonpar_corr_spearman.htm
Test for the Significance of the Spearmanā€™s Correlation Coefficient
The test hypotheses are:
• H0: the sample values come from a population in which ρ = 0
• H1: the sample values come from a population in which ρ ≠ 0
Test statistic is
t = \frac{\rho\sqrt{N - 2}}{\sqrt{1 - \rho^2}}

The test statistic has a Student's t distribution with N - 2 degrees of freedom.
Kendall's Tau-b Correlation Coefficient
For each of the variables X and Y separately, the observations are sorted into ascending order and
replaced by their ranks. In situations where t observations are tied, the average rank is assigned.
Each time t > 1, the following quantities are computed and summed over all groups of ties for each
variable separately.
T_1 = \sum (t^2 - t), \qquad T_2 = \sum (t^2 - t)(t - 2), \qquad T_3 = \sum (t^2 - t)(2t + 5)
Each of the N cases is compared to the others to determine with how many cases its ranking of X and Y is
concordant or discordant. The following procedure is used. For each distinct pair of cases (i, j), where i <
j the quantity
dij=[Rank(Xj)āˆ’Rank(Xi)][Rank(Yj)āˆ’Rank(Yi)]
is computed. If the sign of this product is positive, the pair of observations (i, j) is concordant. If the sign
is negative, the pair is discordant. The number of concordant pairs minus the number of discordant pairs
is
S = \sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\mathrm{sign}(d_{ij})
where sign(dij) is defined as +1 or ā€“1 depending on the sign of dij. Pairs in which dij=0 are ignored in the
computation of S.
If there are no ties in both samples, Kendallā€™s tau (Ļ„) is computed as
\tau = \frac{2S}{N^2 - N}
If there are any ties in any of the samples, Kendallā€™s tau (Ļ„) is computed as
\tau = \frac{2S}{\sqrt{N^2 - N - T_{1x}}\,\sqrt{N^2 - N - T_{1y}}}
If the denominator is 0, the statistic is not computed.
Source: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Tau-b
Test for the Significance of the Kendall's Tau-b Correlation Coefficient
The variance of S is estimated by (Kendall, 1955):
Var = \frac{(N^2 - N)(2N + 5) - T_{3x} - T_{3y}}{18} + \frac{T_{2x}\,T_{2y}}{9(N^2 - N)(N - 2)} + \frac{T_{1x}\,T_{1y}}{2(N^2 - N)}
The significance level is obtained using
Z = \frac{S}{\sqrt{Var}}
Which, under the null hypothesis, is approximately distributed as a standard normal when the variables
are statistically independent.
Sources: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Significance_tests
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2F
alg_nonpar_corr_kendalls.htm
Parametric Value at Risk
Value at Risk of a single asset
Given the time series of daily return rates for an asset, the daily mean of the return rates is μ and the daily variance of the return rates is σ². Let P be the position, holding or investment in the asset.
One-day Expected Return is:
ER = PĪ¼
The Standard Deviation or Volatility is the square root of the Variance:
šœŽ = āˆš šœŽ2
One-day Value at Risk is:
š‘‰š‘Žš‘…1āˆ’š›¼ = āˆ’(Ī¼ + š‘§ š›¼ šœŽ)P
where zĪ± is the left-tail Ī± quantile of the normal standard distribution.
Total Value at Risk for n trading days is:
š‘‰š‘Žš‘…1āˆ’š›¼
š‘› š‘‘š‘Žš‘¦š‘ 
= š‘‰š‘Žš‘…1āˆ’š›¼ āˆ— āˆš š‘› = āˆ’(Ī¼ + š‘§ š›¼ šœŽ)Pāˆš š‘›
Portfolio Value at Risk
Given the time series of daily return rates on different assets, the daily mean of the return rates for the i-th asset is μ_i, the daily variance of the return rates for the i-th asset is σ_i², and the daily standard deviation (or volatility) of the return rates for the i-th asset is σ_i. The covariance of the daily return rates of the i-th and j-th assets is σ_ij. All parameters are unbiased estimates. Let P_i be the holding, position or investment in each of these assets.
The total position is
P = \sum_{i=1}^{N} P_i
The weighting of each position is
š‘¤š‘– =
š‘ƒš‘–
š‘ƒ
The weighted mean of the portfolio is
Ī¼ š‘ƒ = āˆ‘ š‘¤š‘– šœ‡š‘– =
š‘
š‘–=1
1
š‘ƒ
āˆ‘ š‘ƒš‘– šœ‡š‘–
š‘
š‘–=1
One-day Expected Return of the portfolio is the weighted mean of the portfolio multiplied by the total
position
ER = P\,\mu_P = P\sum_{i=1}^{N} w_i\,\mu_i = \sum_{i=1}^{N} P_i\,\mu_i
The Portfolio Variance is
šœŽ š‘ƒ
2
= [š‘¤1 ā€¦ š‘¤š‘– ā€¦ š‘¤ š‘›] [
šœŽ1
2
ā‹Æ šœŽ1š‘›
ā‹® ā‹± ā‹®
šœŽ š‘›1 ā‹Æ šœŽ š‘›
2
]
[
š‘¤1
ā‹®
š‘¤š‘–
ā‹®
š‘¤ š‘›]
= š‘Š š‘‡
š‘€š‘Š
where W is the vector of weights and M is the covariance matrix. The i-th diagonal element of M is the daily variance of the return rates for the i-th asset; the elements outside the diagonal are covariances.
The portfolio variance can also be computed as:
šœŽ š‘ƒ
2
=
1
š‘ƒ2
āˆ— [š‘ƒ1 ā€¦ š‘ƒš‘– ā€¦ š‘ƒš‘›] [
šœŽ1
2
ā‹Æ šœŽ1š‘›
ā‹® ā‹± ā‹®
šœŽ š‘›1 ā‹Æ šœŽ š‘›
2
]
[
š‘ƒ1
ā‹®
š‘ƒš‘–
ā‹®
š‘ƒš‘›]
=
1
š‘ƒ2
āˆ— š‘‹ š‘‡
š‘€š‘‹
where X is the vector of positions.
The Portfolio Standard Deviation or Portfolio Volatility is the square root of the Portfolio Variance:
šœŽ š‘ƒ = āˆššœŽ š‘ƒ
2
One-day Value at Risk is:
š‘‰š‘Žš‘…1āˆ’š›¼ = āˆ’(Ī¼ š‘ƒ + š‘§ š›¼ šœŽ š‘ƒ)P
where z_α is the left-tail α quantile of the standard normal distribution.
Total Value at Risk for n trading days is:
š‘‰š‘Žš‘…1āˆ’š›¼
š‘› š‘‘š‘Žš‘¦š‘ 
= š‘‰š‘Žš‘…1āˆ’š›¼ āˆ— āˆš š‘› = āˆ’(Ī¼ š‘ƒ + š‘§ š›¼ šœŽ š‘ƒ)Pāˆš š‘›
š‘‰š‘Žš‘…1āˆ’š›¼
š‘› š‘‘š‘Žš‘¦š‘ 
is the minimum potential loss that a portfolio can suffer in the Ī±% worst cases in n days.
About the Signs: A positive value of VaR is an expected loss. A negative VaR would imply the portfolio
has a high probability of making a profit.
Source: http://www.jpmorgan.com/tss/General/Risk_Management/1159360877242
Remark: Some texts about VaR express the covariance as Ļƒij = ĻƒiĻƒjĻij where Ļij is the correlation
coefficient.
Remark: Sometimes VaR is taken to be the portfolio volatility multiplied by the position, because the expected return is assumed to be approximately zero. ISSTATS does NOT equate VaR with the portfolio volatility and does NOT assume that the expected return is zero.
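A hedged sketch of the parametric portfolio VaR above, assuming NumPy/SciPy, a (T x n) matrix of daily return rates and a vector of currency positions; the simulated input and all names are illustrative.

import numpy as np
from scipy.stats import norm

def portfolio_var(returns, positions, alpha=0.05, days=1):
    """Parametric portfolio VaR; a positive result is an expected loss."""
    R = np.asarray(returns, dtype=float)
    P = np.asarray(positions, dtype=float)
    total = P.sum()
    w = P / total
    mu_p = w @ R.mean(axis=0)
    M = np.cov(R, rowvar=False, ddof=1)            # covariance matrix of returns
    sigma_p = np.sqrt(w @ M @ w)
    z = norm.ppf(alpha)                            # left-tail quantile
    return -(mu_p + z * sigma_p) * total * np.sqrt(days)

rng = np.random.default_rng(0)
sim = rng.normal(0.0005, 0.01, size=(250, 3))      # simulated daily returns
print(portfolio_var(sim, [10000, 5000, 8000], alpha=0.05, days=10))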
Marginal Value at Risk
Marginal Value at Risk is the change in portfolio VaR resulting from a marginal change in the currency
(dollar, euroā€¦) position in component i:
š‘€š‘‰š‘Žš‘…š‘– =
šœ•š‘‰š‘Žš‘…
šœ•š‘ƒš‘–
Assuming the linearity of the risk in the parametric approach, the vector of Marginal Value at Risk is
\begin{bmatrix} MVaR_1 \\ \vdots \\ MVaR_i \\ \vdots \\ MVaR_n \end{bmatrix} = -\left(\begin{bmatrix} \mu_1 \\ \vdots \\ \mu_i \\ \vdots \\ \mu_n \end{bmatrix} + \frac{z_\alpha}{\sigma_P}\begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{bmatrix}\begin{bmatrix} w_1 \\ \vdots \\ w_i \\ \vdots \\ w_n \end{bmatrix}\right)

\begin{bmatrix} MVaR_1 \\ \vdots \\ MVaR_i \\ \vdots \\ MVaR_n \end{bmatrix} = -\left(\begin{bmatrix} \mu_1 \\ \vdots \\ \mu_i \\ \vdots \\ \mu_n \end{bmatrix} + \frac{z_\alpha}{P\,\sigma_P}\begin{bmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{bmatrix}\begin{bmatrix} P_1 \\ \vdots \\ P_i \\ \vdots \\ P_n \end{bmatrix}\right)
Total Marginal Value at Risk for n trading days is:
š‘€š‘‰š‘Žš‘…š‘–
š‘› š‘‘š‘Žš‘¦š‘ 
= š‘€š‘‰š‘Žš‘…š‘– āˆ— āˆš š‘›
Component Value at Risk
Component Value at Risk is a partition of the portfolio VaR that indicates the change of VaR if a given
component was deleted.
š¶š‘‰š‘Žš‘…š‘– =
šœ•š‘‰š‘Žš‘…
šœ•š‘ƒš‘–
š‘ƒš‘– = š‘€š‘‰š‘Žš‘…š‘– āˆ— š‘ƒš‘–
Note that the sum of all component VaRs (CVaR) is the VaR for the entire portfolio:
š‘‰š‘Žš‘… = āˆ‘ š¶š‘‰š‘Žš‘…š‘–
š‘
š‘–=1
= āˆ‘
šœ•š‘‰š‘Žš‘…
šœ•š‘ƒš‘–
š‘
š‘–=1
š‘ƒš‘– = āˆ‘ š‘€š‘‰š‘Žš‘…š‘–
š‘
š‘–=1
āˆ— š‘ƒš‘–
Total Component Value at Risk for n trading days is:
š¶š‘‰š‘Žš‘…š‘–
š‘› š‘‘š‘Žš‘¦š‘ 
= š¶š‘‰š‘Žš‘…š‘– āˆ— āˆš š‘›
Source: http://www.math.nus.edu.sg/~urops/Projects/valueatrisk.pdf
Incremental Value at Risk
Incremental VaR of a given position is the VaR of the portfolio with the given position minus the VaR of
the portfolio without the given position, which measures the change in VaR due to a new position on the
portfolio:
IVaR (a) = VaR (P) ā€“ VaR (P - a)
Source:
http://www.jpmorgan.com/tss/General/Portfolio_Management_With_Incremental_VaR/1259104336084
Conditional Value at Risk, Expected Shortfall, Expected Tail Loss or Average Value at Risk
šøš‘†1āˆ’š›¼
1 š‘‘š‘Žš‘¦
is the expected value of the loss of the portfolio in the Ī±% worst cases in one day.
Under Multivariate Normal Assumption, Expected Shortfall, also known as Expected Tail Loss (ETL),
Conditional Value-at-Risk (CVaR), Average Value at Risk (AVaR) and Worst Conditional Expectation,
is computed by
ES(-VaR) = -E(x \mid x < -VaR)\cdot P = -\left[\mu + ES(z_\alpha)\,\sigma\right]\cdot P = -\left[\mu + E(z \mid z < z_\alpha)\,\sigma\right]\cdot P = -\left[\mu + \frac{\int_{-\infty}^{z_\alpha} t\,e^{-t^2/2}\,dt}{\alpha\sqrt{2\pi}}\,\sigma\right]\cdot P = -\left(\mu - \frac{e^{-z_\alpha^2/2}}{\alpha\sqrt{2\pi}}\,\sigma\right)\cdot P
where z_α is the left-tail α quantile of the standard normal distribution.
About the Sign: Because ISSTATS reports VaR with a negative sign, as J.P. Morgan recommends, its original value is used to perform the calculations (-VaR = μ + z_α σ). Once the ES is computed, it is reported with a negative sign. That is, a positive value of ES is an expected loss; a negative value of ES would imply the portfolio has a high probability of making a profit even in the worst cases.
Source: http://www.imes.boj.or.jp/english/publication/mes/2002/me20-1-3.pdf
Exponentially Weighted Moving Average (EWMA) Forecast
Given a series of k daily return rates {r1, ā€¦ā€¦.., rk} computed as Continuously Compounded Return:
š‘Ÿš‘– = ln (
š‘ š‘–
š‘ š‘–āˆ’1
)
Where r1 corresponds to the earliest date in the series, and rk corresponds to the latest or most recent date.
Assuming k > 50 and that the sample mean of the daily returns is zero, the EWMA estimate of the one-day variance for a given sequence of k returns is:
šœŽ2
= (1 āˆ’ šœ†) āˆ‘ šœ†š‘–
š‘Ÿš‘˜āˆ’š‘–
2
š‘˜āˆ’1
š‘–=0
where 0 < Ī»< 1 is the decay factor.
The one-day volatility is:
šœŽ = āˆš šœŽ2
For horizons greater than one-day, the T-period (i.e., over T days) forecasts of the volatility is:
šœŽ š‘‡ š‘‘š‘Žš‘¦š‘  = šœŽāˆšš‘‡
For two return series, assuming that both averages are zero, the EWMA estimate of one-day covariance
for a given sequence of k returns is given by
š‘š‘œš‘£1,2 = šœŽ1,2 = (1 āˆ’ šœ†) āˆ‘ šœ†š‘–
š‘Ÿ1,š‘˜āˆ’š‘– š‘Ÿ2,š‘˜āˆ’š‘–
š‘˜āˆ’1
š‘–=0
The corresponding one-day correlation forecast for the two returns is given by
šœŒ1,2 =
š‘š‘œš‘£1,2
šœŽ1 šœŽ2
=
šœŽ1,2
šœŽ1 šœŽ2
For horizons greater than one-day, the T-period (i.e., over T days) forecasts of the covariance is:
š‘š‘œš‘£1,2
š‘‡ š‘‘š‘Žš‘¦š‘ 
= šœŽ1,2 š‘‡
Source: http://pascal.iseg.utl.pt/~aafonso/eif/rm/TD4ePt_2.pdf
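A hedged sketch of the EWMA one-day variance forecast above, assuming NumPy, a price series as input, and the commonly used decay factor 0.94; the simulated series and names are illustrative.

import numpy as np

def ewma_variance(prices, lam=0.94):
    """EWMA one-day variance forecast from a price series, assuming zero mean return."""
    p = np.asarray(prices, dtype=float)
    r = np.log(p[1:] / p[:-1])                     # continuously compounded returns
    k = r.size
    weights = (1 - lam) * lam ** np.arange(k)      # i = 0 weights the most recent return
    return np.sum(weights * r[::-1]**2)

prices = 100 * np.exp(np.cumsum(np.random.default_rng(1).normal(0, 0.01, 300)))
print(np.sqrt(ewma_variance(prices)))              # one-day volatility forecast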
Value at Risk of a single asset, Portfolio Value at Risk, Marginal Value at Risk, Component Value at Risk and Incremental Value at Risk by the EWMA method: see the methods and formulas under Parametric Value at Risk.
Linear Regression
Given n equations for a regression model with p predictor variables, the i-th equation is

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}
The n equations stacked together and written in vector form is
\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ 1 & \cdots & x_{np} \end{bmatrix}\begin{bmatrix} \beta_0 \\ \vdots \\ \beta_i \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_i \\ \vdots \\ \epsilon_n \end{bmatrix}
In matrix notation:
Y = XĪ² + Ō‘
X is here named the design matrix, of dimensions n-by-(p+1).
If the constant is not included, the matrices are

\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_i \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_i \\ \vdots \\ \epsilon_n \end{bmatrix}

If the constant is not included, the design matrix X has dimensions n-by-p.
The estimated value of the unknown parameter β is:

\hat{\beta} = (X^T X)^{-1} X^T Y
Estimation can be carried out if, and only if, there is no perfect multicollinearity between the predictor
variables.
If the constant is not included, the parameters can also be estimated by

\hat{\beta}_j = \frac{\sum_{i=1}^{n} x_{ij}\,y_i}{\sum_{i=1}^{n} x_{ij}^2}
The standardized coefficients are
š›½Ģ‚š‘–
š‘ š‘”
=
š›½Ģ‚š‘– āˆ— š‘† š‘„š‘–
š‘† š‘¦
Where
• S_xi is the unbiased standard deviation of the i-th predictor variable
• S_y is the unbiased standard deviation of the response variable y
The estimate of the standard error of each coefficient is obtained by
š‘ š‘’(š›½Ģ‚š‘–) = āˆšš‘€š‘†šø āˆ— (š‘‹š‘‹ š‘‡)š‘–š‘–
āˆ’1
Where MSE is the mean squared error of the regression model.
It is known that
š›½Ģ‚š‘–
š‘ š‘’(š›½Ģ‚š‘–)
ā† š‘” š‘›āˆ’š‘āˆ’1
Where
• p is the number of predictor variables
• n is the total number of observations (number of rows in the design matrix)
If the constant is not included, the degrees of freedom for the t statistics are n - p.
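A hedged OLS sketch following the estimator, standard errors and t statistics above, assuming NumPy/SciPy; the data and names are illustrative.

import numpy as np
from scipy.stats import t as t_dist

def ols(X, y, constant=True):
    """OLS estimates, standard errors and two-sided p-values for each coefficient."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if constant:
        X = np.column_stack([np.ones(len(y)), X])
    n, m = X.shape                                  # m = p + 1 with constant, p without
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    df = n - m
    mse = resid @ resid / df
    se = np.sqrt(mse * np.diag(XtX_inv))
    t_stats = beta / se
    return beta, se, 2 * t_dist.sf(np.abs(t_stats), df)

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 7.0], [6.0, 5.0]]
y = [3.1, 3.9, 7.2, 8.1, 11.8, 12.2]
print(ols(X, y))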
ANOVA for linear regression
If the constant is included.
Component   Sum of squares   Degrees of freedom   Mean of squares      F
Model       SSM              p                    MSM = SSM/p          MSM/MSE
Error       SSE              n-p-1                MSE = SSE/(n-p-1)
Total       SST              n-1                  MST = SST/(n-1)
Being

SSM = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad SST = \sum_{i=1}^{n}(y_i - \bar{y})^2
Where
• p is the number of predictor variables
• n is the total number of observations (number of rows in the design matrix)
• SSE = sum of squared residuals
• MSE = mean squared error of the regression model
The test statistic has an F-distribution with p and (n - p - 1) degrees of freedom. Thus the ANOVA null hypothesis is rejected if F \ge F(1-\alpha)_{p,\ n-p-1}.
The coefficient of determination R² is defined as SSM/SST. It is output as a percentage.
The adjusted R² is defined as 1 - MSE/MST. It is output as a percentage.
The square root of MSE is called the standard error of the regression, or standard error of the Estimate.
If the constant is not included.
Component   Sum of squares   Degrees of freedom   Mean of squares    F
Model       SSM              p                    MSM = SSM/p        MSM/MSE
Error       SSE              n-p                  MSE = SSE/(n-p)
Total       SST              n                    SST/n
Being

SSM = \sum_{i=1}^{n}\hat{y}_i^2, \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad SST = \sum_{i=1}^{n} y_i^2
Unstandardized Predicted Values
The fitted values (or unstandardized predicted values) from the regression will be
š‘ŒĢ‚ = š‘‹š›½Ģ‚ = š‘‹(š‘‹š‘‹ š‘‡
)āˆ’1
š‘‹ š‘‡
š‘Œ = HY
where H is the projection matrix (also known as hat matrix)
H = X(XXT
)-1
XT
Standardized Predicted Values
Once the mean and the unbiased standard deviation of the unstandardized predicted values have been computed, the fitted values are standardized as

\hat{y}_i^{st} = \frac{\hat{y}_i - \bar{\hat{y}}}{S_{\hat{y}}}
When new predictions are included outside of the design matrix, they are standardized with the above
values.
Prediction Intervals for Mean
Define the vector of given predictors as

X_h = (1,\ x_{h,1},\ x_{h,2},\ \ldots,\ x_{h,p})^T

We define the standard error of the fit at X_h by:

se(\hat{y}_h) = \sqrt{MSE\cdot X_h^T (X^T X)^{-1} X_h}

Then, the Confidence Interval for the Mean Response is

\hat{y}_h \pm t_{\alpha/2;\,n-p-1}\cdot se(\hat{y}_h)
Where
• X is the design matrix
• ŷ_h is the "fitted value" or "predicted value" of the response when the predictor values are X_h
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables
Prediction Intervals for Individuals
Define the vector of given predictors as

X_h = (1,\ x_{h,1},\ x_{h,2},\ \ldots,\ x_{h,p})^T

We define the standard error of the fit at X_h by:

se(\hat{y}_h) = \sqrt{MSE\cdot\left[1 + X_h^T (X^T X)^{-1} X_h\right]}

Then, the Confidence Interval for individuals or new observations is

\hat{y}_h \pm t_{\alpha/2;\,n-p-1}\cdot se(\hat{y}_h)
Where
• X is the design matrix
• ŷ_h is the "fitted value" or "predicted value" of the response when the predictor values are X_h
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables
Unstandardized Residuals
The Unstandardized Residual for the i-th data unit is defined as:
ĆŖi = yi - Å·i
In matrix notation
\hat{E} = Y - \hat{Y} = Y - HY = (I_{n\times n} - H)\,Y
Where H is the hat matrix.
Standardized Residuals
The standardized Residual for the i-th data unit is defined as:
esĢ‚ š‘– =
eĢ‚ š‘–
āˆš š‘€š‘†šø
Where
ļ‚· ĆŖi is the unstandardized residual for the i-th data unit.
ļ‚· MSE is the mean squared error of the regression model
Studentized Residuals (internally studentized residuals)
The leverage score for the i-th data unit is defined as:
hii = [H]ii
the i-th diagonal element of the projection matrix (also known as hat matrix)
H = X(X^T X)^{-1}X^T
where X is the design matrix.
The Studentized Residual for the i-th data unit is defined as:
š‘”š‘– =
š‘’Ģ‚ š‘–
āˆšš‘€š‘†šø āˆ— (1 āˆ’ ā„Žš‘–š‘–)
Where
ļ‚· ĆŖi is the unstandardized residual for the i-th data unit.
ļ‚· MSE is the mean squared error of the regression model
Source: https://en.wikipedia.org/wiki/Studentized_residual
Centered Leverage Values
The regular leverage score for the i-th data unit is defined as:
hii = [H]ii
the i-th diagonal element of the projection matrix (also known as hat matrix)
H = X(X^T X)^{-1}X^T
where X is the design matrix.
The centered leverage value for the i-th data unit is defined as:
clvi = hii ā€“ 1/n
Where n is the number of observations.
If the intercept is not included, then the centered leverage value for the i-th data unit is defined as:
clvi = hii
Source: https://en.wikipedia.org/wiki/Leverage_(statistics)
Mahalanobis Distance
The Mahalanobis Distance for the i-th data unit is defined as:
D_i^2 = (n - 1)\,(h_{ii} - 1/n) = (n - 1)\,clv_i
Where
• h_ii is the i-th diagonal element of the projection matrix.
• n is the number of observations
If the intercept is not included, the Mahalanobis Distance for the i-th data unit is defined as:
D_i^2 = n\,h_{ii}
Source: https://en.wikipedia.org/wiki/Mahalanobis_distance
Cookā€™s Distance
The Cookā€™s Distance for the i-th data unit is defined as:
š·š‘– =
š‘’Ģ‚ š‘–
2
ā„Žš‘–š‘–
š‘€š‘†šø āˆ— (š‘ + 1) āˆ— (1 āˆ’ ā„Žš‘–š‘–)2
Where
• h_ii is the i-th diagonal element of the projection matrix.
• p is the number of predictor variables
• ê_i is the unstandardized residual for the i-th data unit.
• MSE is the mean squared error of the regression model
If the intercept is not included, the Cookā€™s Distance for the i-th data unit is defined as:
š·š‘– =
š‘’Ģ‚ š‘–
2
ā„Žš‘–š‘–
š‘€š‘†šø āˆ— š‘ āˆ— (1 āˆ’ ā„Žš‘–š‘–)2
Source: https://en.wikipedia.org/wiki/Cook%27s_distance
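A hedged sketch combining the regression diagnostics above (leverages, studentized residuals, Mahalanobis and Cook's distances), assuming NumPy; data and names are illustrative.

import numpy as np

def regression_diagnostics(X, y, constant=True):
    """Leverages, studentized residuals, Mahalanobis and Cook's distances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if constant:
        X = np.column_stack([np.ones(len(y)), X])
    n, m = X.shape                                  # m = p + 1 with constant, p without
    H = X @ np.linalg.inv(X.T @ X) @ X.T            # projection (hat) matrix
    h = np.diag(H)
    e = y - H @ y                                   # unstandardized residuals
    mse = e @ e / (n - m)
    studentized = e / np.sqrt(mse * (1 - h))
    clv = h - (1 / n if constant else 0.0)          # centered leverage values
    mahal = (n - 1) * clv if constant else n * h
    cook = e**2 * h / (mse * m * (1 - h)**2)
    return h, studentized, mahal, cook

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [10.0]]
y = [2.0, 4.1, 6.2, 7.9, 10.1, 25.0]
print(regression_diagnostics(X, y))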
Curve Estimation Models
Linear. Model whose equation is Y = b0 + (b1 * t). The series values are modeled as a linear
function of time.
Quadratic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2). The quadratic model can be
used to model a series that "takes off" or a series that dampens.
Cubic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3).
Quartic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4
* t**4).
Quintic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4
* t**4) + (b5 * t**5).
Sextic. Model that is defined by the equation Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 *
t**4) + (b5 * t**5) + (b6 * t**6).
Logarithmic. Model whose equation is Y = b0 + (b1 * ln(t)).
Inverse. Model whose equation is Y = b0 + (b1 / t).
Power. Model whose equation is Y = b0 * (t**b1) or ln(Y) = ln(b0) + (b1 * ln(t)).
Compound. Model whose equation is Y = b0 * (b1**t) or ln(Y) = ln(b0) + (ln(b1) * t).
S-curve. Model whose equation is Y = e**(b0 + (b1/t)) or ln(Y) = b0 + (b1/t).
Logistic. Model whose equation is Y = 1 / (1/u + (b0 * (b1**t))) or ln(1/Y - 1/u) = ln(b0) + (ln(b1) * t), where u is the upper boundary value. After selecting Logistic, specify the upper boundary value to use in the regression equation. The value must be a positive number greater than the largest dependent variable value.
Growth. Model whose equation is Y = e**(b0 + (b1 * t)) or ln(Y) = b0 + (b1 * t).
Exponential. Model whose equation is Y = b0 * (e**(b1 * t)) or ln(Y) = ln(b0) + (b1 * t).
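Each of these models is either a polynomial in t or can be linearized by a log transform, so the coefficients can be recovered with ordinary least squares on the transformed data. The sketch below is an illustration under stated assumptions (not the ISSTATS curve-fitting code) for the polynomial, Exponential and Logistic forms; it requires strictly positive Y for the log transforms and, for the logistic case, an upper boundary u greater than the largest Y value:

import numpy as np

def fit_polynomial(t, y, degree):
    # Linear ... Sextic: Y = b0 + b1*t + ... + b_d*t**d
    return np.polyfit(t, y, degree)[::-1]      # returned as b0, b1, ..., b_d

def fit_exponential(t, y):
    # Exponential: Y = b0 * exp(b1*t), fitted as ln(Y) = ln(b0) + b1*t
    b1, ln_b0 = np.polyfit(t, np.log(y), 1)
    return np.exp(ln_b0), b1                   # b0, b1

def fit_logistic(t, y, u):
    # Logistic: Y = 1 / (1/u + b0*b1**t), fitted as ln(1/Y - 1/u) = ln(b0) + ln(b1)*t
    z = np.log(1.0 / y - 1.0 / u)
    ln_b1, ln_b0 = np.polyfit(t, z, 1)
    return np.exp(ln_b0), np.exp(ln_b1)        # b0, b1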
© Copyright InnerSoft 2017. All rights reserved.
The lost children of the Sinclair ZX Spectrum 128K (RANDOMIZE USR 123456)
innersoft@itspanish.org
innersoft@gmail.com
http://isstats.itspanish.org/
More Related Content

What's hot

Studentā€™s t test
Studentā€™s  t testStudentā€™s  t test
Studentā€™s t test
Lorena Villela
Ā 
Nonparametric statistics
Nonparametric statisticsNonparametric statistics
Nonparametric statistics
Tarun Gehlot
Ā 
Chi square test
Chi square testChi square test
Chi square test
Patel Parth
Ā 
My regression lecture mk3 (uploaded to web ct)
My regression lecture   mk3 (uploaded to web ct)My regression lecture   mk3 (uploaded to web ct)
My regression lecture mk3 (uploaded to web ct)
chrisstiff
Ā 
F test Analysis of Variance (ANOVA)
F test Analysis of Variance (ANOVA)F test Analysis of Variance (ANOVA)
F test Analysis of Variance (ANOVA)
Marianne Maluyo
Ā 

What's hot (20)

Chi square using excel
Chi square using excelChi square using excel
Chi square using excel
Ā 
Solution to the practice test ch 10 correlation reg ch 11 gof ch12 anova
Solution to the practice test ch 10 correlation reg ch 11 gof ch12 anovaSolution to the practice test ch 10 correlation reg ch 11 gof ch12 anova
Solution to the practice test ch 10 correlation reg ch 11 gof ch12 anova
Ā 
Categorical data analysis
Categorical data analysisCategorical data analysis
Categorical data analysis
Ā 
One-Way ANOVA
One-Way ANOVAOne-Way ANOVA
One-Way ANOVA
Ā 
Studentā€™s t test
Studentā€™s  t testStudentā€™s  t test
Studentā€™s t test
Ā 
Mc namer test of correlation
Mc namer test of correlationMc namer test of correlation
Mc namer test of correlation
Ā 
Goodness of Fit Notation
Goodness of Fit NotationGoodness of Fit Notation
Goodness of Fit Notation
Ā 
Analysis of variance (ANOVA)
Analysis of variance (ANOVA)Analysis of variance (ANOVA)
Analysis of variance (ANOVA)
Ā 
Nonparametric statistics
Nonparametric statisticsNonparametric statistics
Nonparametric statistics
Ā 
Analysis of Variance-ANOVA
Analysis of Variance-ANOVAAnalysis of Variance-ANOVA
Analysis of Variance-ANOVA
Ā 
Chi square test
Chi square testChi square test
Chi square test
Ā 
My regression lecture mk3 (uploaded to web ct)
My regression lecture   mk3 (uploaded to web ct)My regression lecture   mk3 (uploaded to web ct)
My regression lecture mk3 (uploaded to web ct)
Ā 
Lesson 27 using statistical techniques in analyzing data
Lesson 27 using statistical techniques in analyzing dataLesson 27 using statistical techniques in analyzing data
Lesson 27 using statistical techniques in analyzing data
Ā 
F test Analysis of Variance (ANOVA)
F test Analysis of Variance (ANOVA)F test Analysis of Variance (ANOVA)
F test Analysis of Variance (ANOVA)
Ā 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
Ā 
Chapter 14
Chapter 14 Chapter 14
Chapter 14
Ā 
PG STAT 531 Lecture 2 Descriptive statistics
PG STAT 531 Lecture 2 Descriptive statisticsPG STAT 531 Lecture 2 Descriptive statistics
PG STAT 531 Lecture 2 Descriptive statistics
Ā 
Contingency Tables
Contingency TablesContingency Tables
Contingency Tables
Ā 
Data Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVAData Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVA
Ā 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
Ā 

Similar to InnerSoft STATS - Methods and formulas help

Data Science Cheatsheet.pdf
Data Science Cheatsheet.pdfData Science Cheatsheet.pdf
Data Science Cheatsheet.pdf
qawali1
Ā 
Descriptive Statistics Formula Sheet Sample Populatio.docx
Descriptive Statistics Formula Sheet    Sample Populatio.docxDescriptive Statistics Formula Sheet    Sample Populatio.docx
Descriptive Statistics Formula Sheet Sample Populatio.docx
simonithomas47935
Ā 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
priyarokz
Ā 

Similar to InnerSoft STATS - Methods and formulas help (20)

Simple Regression.pptx
Simple Regression.pptxSimple Regression.pptx
Simple Regression.pptx
Ā 
What is chi square test
What  is  chi square testWhat  is  chi square test
What is chi square test
Ā 
Variance component analysis by paravayya c pujeri
Variance component analysis by paravayya c pujeriVariance component analysis by paravayya c pujeri
Variance component analysis by paravayya c pujeri
Ā 
Data Science Cheatsheet.pdf
Data Science Cheatsheet.pdfData Science Cheatsheet.pdf
Data Science Cheatsheet.pdf
Ā 
Descriptive Statistics Formula Sheet Sample Populatio.docx
Descriptive Statistics Formula Sheet    Sample Populatio.docxDescriptive Statistics Formula Sheet    Sample Populatio.docx
Descriptive Statistics Formula Sheet Sample Populatio.docx
Ā 
Sampling distribution.pptx
Sampling distribution.pptxSampling distribution.pptx
Sampling distribution.pptx
Ā 
Lecture 4
Lecture 4Lecture 4
Lecture 4
Ā 
Statistical parameters
Statistical parametersStatistical parameters
Statistical parameters
Ā 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
Ā 
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
Ā 
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
A Mathematical Model for the Hormonal Responses During Neurally Mediated Sync...
Ā 
Statistics78 (2)
Statistics78 (2)Statistics78 (2)
Statistics78 (2)
Ā 
Test of hypothesis test of significance
Test of hypothesis test of significanceTest of hypothesis test of significance
Test of hypothesis test of significance
Ā 
Testing of hypothesis
Testing of hypothesisTesting of hypothesis
Testing of hypothesis
Ā 
Categorical data analysis full lecture note PPT.pptx
Categorical data analysis full lecture note  PPT.pptxCategorical data analysis full lecture note  PPT.pptx
Categorical data analysis full lecture note PPT.pptx
Ā 
Stat3 central tendency & dispersion
Stat3 central tendency & dispersionStat3 central tendency & dispersion
Stat3 central tendency & dispersion
Ā 
Stat3 central tendency & dispersion
Stat3 central tendency & dispersionStat3 central tendency & dispersion
Stat3 central tendency & dispersion
Ā 
Medical statistics2
Medical statistics2Medical statistics2
Medical statistics2
Ā 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
Ā 
Application of Statistical and mathematical equations in Chemistry Part 2
Application of Statistical and mathematical equations in Chemistry Part 2Application of Statistical and mathematical equations in Chemistry Part 2
Application of Statistical and mathematical equations in Chemistry Part 2
Ā 

More from InnerSoft

InnerSoft CAD para AutoCAD, v4.0 Manual
InnerSoft CAD para AutoCAD, v4.0 ManualInnerSoft CAD para AutoCAD, v4.0 Manual
InnerSoft CAD para AutoCAD, v4.0 Manual
InnerSoft
Ā 
Manual de InnerSoft CAD en espaƱol
Manual de InnerSoft CAD en espaƱolManual de InnerSoft CAD en espaƱol
Manual de InnerSoft CAD en espaƱol
InnerSoft
Ā 

More from InnerSoft (10)

InnerSoft CAD para AutoCAD, v4.0 Manual
InnerSoft CAD para AutoCAD, v4.0 ManualInnerSoft CAD para AutoCAD, v4.0 Manual
InnerSoft CAD para AutoCAD, v4.0 Manual
Ā 
InnerSoft STATS - Introduction
InnerSoft STATS - IntroductionInnerSoft STATS - Introduction
InnerSoft STATS - Introduction
Ā 
InnerSoft STATS - Index
InnerSoft STATS - IndexInnerSoft STATS - Index
InnerSoft STATS - Index
Ā 
InnerSoft STATS - Graphs
InnerSoft STATS - GraphsInnerSoft STATS - Graphs
InnerSoft STATS - Graphs
Ā 
InnerSoft STATS - Analyze
InnerSoft STATS - AnalyzeInnerSoft STATS - Analyze
InnerSoft STATS - Analyze
Ā 
Manual InnerSoft STATS
Manual InnerSoft STATSManual InnerSoft STATS
Manual InnerSoft STATS
Ā 
IngenierĆ­a de caminos rurales
IngenierĆ­a de caminos ruralesIngenierĆ­a de caminos rurales
IngenierĆ­a de caminos rurales
Ā 
InnerSoft CAD Manual
InnerSoft CAD ManualInnerSoft CAD Manual
InnerSoft CAD Manual
Ā 
Norma 3.1 ic. trazado, de la instrucciĆ³n de carreteras
Norma 3.1 ic. trazado, de la instrucciĆ³n de carreterasNorma 3.1 ic. trazado, de la instrucciĆ³n de carreteras
Norma 3.1 ic. trazado, de la instrucciĆ³n de carreteras
Ā 
Manual de InnerSoft CAD en espaƱol
Manual de InnerSoft CAD en espaƱolManual de InnerSoft CAD en espaƱol
Manual de InnerSoft CAD en espaƱol
Ā 

Recently uploaded

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
Ā 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
Ā 

Recently uploaded (20)

Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
Ā 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Ā 
latest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answerslatest AZ-104 Exam Questions and Answers
latest AZ-104 Exam Questions and Answers
Ā 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
Ā 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
Ā 
Philosophy of china and it's charactistics
Philosophy of china and it's charactisticsPhilosophy of china and it's charactistics
Philosophy of china and it's charactistics
Ā 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
Ā 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
Ā 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
Ā 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
Ā 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Ā 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Ā 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
Ā 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
Ā 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
Ā 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Ā 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
Ā 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf arts
Ā 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
Ā 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
Ā 

InnerSoft STATS - Methods and formulas help

  • 2. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 2 Mean The arithmetic mean is the sum of a collection of numbers divided by the number of numbers in the collection. Sample Variance The estimator of population variance, also called the unbiased sample variance, is: š‘†2 = āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)2š‘› š‘–=1 š‘› āˆ’ 1 Source: http://en.wikipedia.org/wiki/Variance Sample Kurtosis The estimators of population kurtosis is: šŗ2 = š‘˜4 š‘˜2 2 = (š‘› + 1)š‘› (š‘› āˆ’ 1)(š‘› āˆ’ 2)(š‘› āˆ’ 3) āˆ— āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)4š‘› š‘–=1 š‘˜2 2 āˆ’ 3 (š‘› āˆ’ 1)2 (š‘› āˆ’ 2)(š‘› āˆ’ 3) The standard error of the sample kurtosis of a sample of size n from the normal distribution is: š¾ š‘†š‘”š‘‘. šøš‘Ÿš‘Ÿš‘œš‘Ÿ = āˆš 4[6š‘›(š‘› āˆ’ 1)2(š‘› + 1)] (š‘› āˆ’ 3)(š‘› āˆ’ 2)(š‘› + 1)(š‘› + 3)(š‘› + 5) Source: http://en.wikipedia.org/wiki/Kurtosis#Estimators_of_population_kurtosis Sample Skewness Skewness of a population sample is estimated by the adjusted Fisherā€“Pearson standardized moment coefficient: šŗ = š‘› (š‘› āˆ’ 1)(š‘› āˆ’ 2) āˆ‘ ( š‘„š‘– āˆ’ š‘„Ģ… š‘  ) 3š‘› š‘–=1 where n is the sample size and s is the sample standard deviation. The standard error of the skewness of a sample of size n from a normal distribution is: šŗ š‘†š‘”š‘‘. šøš‘Ÿš‘Ÿš‘œš‘Ÿ = āˆš 6š‘›(š‘› āˆ’ 1) (š‘› āˆ’ 2)(š‘› + 1)(š‘› + 3) Source: https://en.wikipedia.org/wiki/Skewness#Sample_skewness Total Variance Variance of the entire population is: šœŽ2 = āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)2š‘› š‘–=1 š‘›
  • 3. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 3 Source: http://en.wikipedia.org/wiki/Variance Total Kurtosis Kurtosis of the entire population is: šŗ2 = āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)4š‘› š‘–=1 š‘› šœŽ4 āˆ’ 3 where n is the sample size and Ļƒ is the total standard deviation. Source: http://en.wikipedia.org/wiki/Kurtosis Total Skewness Skewness of the entire population is: šŗ = āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)3š‘› š‘–=1 š‘› šœŽ3 where n is the sample size and Ļƒ is the total standard deviation. Source: https://en.wikipedia.org/wiki/Skewness Quantiles of a population ISSTATS uses the same method as Rā€“7, Excel CUARTIL.INC function, SciPyā€“(1,1), SPSS and Minitab. Qp, the estimate for the kth qā€“quantile, where p = k/q and h = (Nā€“1)*p + 1, is computing by Qp = Linear interpolation of the modes for the order statistics for the uniform distribution on [0, 1]. When p = 1, use xN. Source: http://en.wikipedia.org/wiki/Quantile#Estimating_the_quantiles_of_a_population MSSD (Mean of the squared successive differences) It is calculated by taking the sum of the differences between consecutive observations squared, then taking the mean of that sum and dividing by two. š‘€š‘†š‘†š· = āˆ‘ (š‘„š‘–+1 āˆ’ š‘„š‘–)2š‘› š‘–=1 2(š‘› āˆ’ 1) The MSSD has the desirable property that one half the MSSD is an unbiased estimator of true variance. Pearson Chi Square Test The value of the test-statistic is
  • 4. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 4 šœ’2 = āˆ‘ (š‘‚š‘– āˆ’ šøš‘–)2 šøš‘– š‘› š‘–=1 Where ļ‚· šœ’2 is the Pearson's cumulative test statistic, which asymptotically approaches a šœ’2 distribution with (r - 1)(c - 1) degrees of freedom. ļ‚· š‘‚š‘– is the number of observations of type i. ļ‚· šøš‘– is the expected (theoretical) frequency of type i Yates's Continuity Correction The value of the test-statistic is šœ’2 = āˆ‘ (š‘šš‘Žš‘„{0, |š‘‚š‘– āˆ’ šøš‘–| āˆ’ 0.5})2 šøš‘– š‘› š‘–=1 When |š‘‚š‘– āˆ’ šøš‘–| āˆ’ 0.5 is below zero, the null value is computed. The effect of Yates' correction is to prevent overestimation of statistical significance for small data. This formula is chiefly used when at least one cell of the table has an expected count smaller than 5. Likelihood Ratio G-Test The value of the test-statistic is šŗ = 2 (āˆ‘ āˆ‘ š‘‚š‘–š‘— āˆ— š‘™š‘›( š‘‚š‘–š‘— šøš‘–š‘— ) š‘ š‘—=1 š‘Ÿ š‘–=1 ) where ļ‚· Oij is the observed count in row i and column j ļ‚· Eij is the expected count in row i and column j G has an asymptotically approximate Ļ‡2 distribution with (r - 1)(c - 1) degrees of freedom when the null hypothesis is true and n is large enough. Mantel-Haenszel Chi-Square Test The Mantel-Haenszel chi-square statistic tests the alternative hypothesis that there is a linear association between the row variable and the column variable. Both variables must lie on an ordinal scale. The Mantel-Haenszel chi-square statistic is computed as: š‘„ š‘€š» = (š‘› āˆ’ 1)š‘Ÿ2 Where r is the Pearson correlation between the row variable and the column variable, n is the sample size. Under the null hypothesis of no association, has an asymptotic chi-square distribution with one degree of freedom.
  • 5. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 5 Fisher's Exact Test Fisherā€™s exact test assumes that the row and column totals are fixed, and then uses the hypergeometric distribution to compute probabilities of possible tables conditional on the observed row and column totals. Fisherā€™s exact test does not depend on any large-sample distribution assumptions, and so it is appropriate even for small sample sizes and for sparse tables. This test is computed for 2X2 tables such as š“ = ( š‘Ž š‘ š‘ š‘‘ ) For an efficient computing, the elements of the matrix A are reordered Aā€™ = ( š‘Žā€² š‘ā€² š‘ā€² š‘‘ā€² ) Being aā€™ the cell of A that have the minimum marginals (minimum row and column totals). The test result does not depend on the cells disposition. The left-sided ā€“value sums the probability for all the tables that have equal or smaller aā€™. p š‘™š‘’š‘“š‘” = P(š‘„ ā‰¤ š‘Žā€²) = āˆ‘ ( š¾ = š‘Žā€² + š‘ā€² š‘– ) ( š‘ āˆ’ š¾ š‘› āˆ’ š‘– ) ( š‘ = š‘Žā€² + š‘ā€² + š‘ā€² + š‘‘ā€² š‘› = š‘Žā€² + š‘ā€² ) š‘Žā€² š‘–=0 The right-sided ā€“value sums the probability for all the tables that have equal or larger aā€™. p š‘Ÿš‘–š‘”ā„Žš‘” = P(š‘„ ā‰„ š‘Žā€²) = āˆ‘ ( š¾ = š‘Žā€² + š‘ā€² š‘– ) ( š‘ āˆ’ š¾ š‘› āˆ’ š‘– ) ( š‘ = š‘Žā€² + š‘ā€² + š‘ā€² + š‘‘ā€² š‘› = š‘Žā€² + š‘ā€² ) š¾=š‘Žā€²+š‘ā€² š‘–=š‘Žā€² Most of the statistical packages output -as the one-sided test result- the minimum value of pleft and pright. The Fisher two-tailed p-value for a table A is defined as the sum of probabilities for all tables consistent with the marginals that are as likely as the current table. McNemar's Test This test is computed for 2X2 tables such as š“ = ( š‘Ž š‘ š‘ š‘‘ ) The value of the test-statistic is
  • 6. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 6 šœ’2 = (š‘ āˆ’ š‘)2 š‘ + š‘ The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom. Edwards Continuity Correction The value of the test-statistic is šœ’2 = (š‘šš‘Žš‘„{0, |š‘ āˆ’ š‘| āˆ’ 1})2 š‘ + š‘ When |š‘ āˆ’ š‘| āˆ’ 1 is below zero, the statistic is zero. The statistic is asymptotically distributed like a chi-squared distribution with 1 degree of freedom. McNemar Exact Binomial Assuming that b < c. Let be n = b + c, and B(x, n, p) the binomial distribution Two āˆ’ sided p āˆ’ value = 2 āˆ— (one āˆ’ sided p āˆ’ value) = 2 āˆ— āˆ‘ šµ(š‘„, š‘›, 0.5) š‘ š‘„=0 = 2 āˆ— āˆ‘ ( š‘› š‘„ ) āˆ— 0.5 š‘„ āˆ— 0.5 š‘›āˆ’š‘„ š‘ š‘„=0 = 2 āˆ— 1 2 š‘› āˆ— āˆ‘ ( š‘› š‘„ ) š‘ š‘„=0 If b = c, the exact p-value equals 1.0. Mid-P McNemar Test Let be n = b + c. Assuming that b < c. Mid āˆ’ P value = 2 āˆ— āˆ‘ šµ(š‘„, š‘›, 0.5) š‘ š‘„=0 āˆ’ šµ(š‘, š‘›, 0.5) = 2 āˆ— 1 2 š‘› āˆ— āˆ‘ ( š‘› š‘„ ) āˆ’ ( š‘› š‘ ) āˆ— 1 2 š‘› š‘ š‘„=0 If b = c, the mid p-value is 1.0 āˆ’ 1 2 ( š‘› š‘ ) āˆ— 1 2 š‘› Bowkerā€™s Test of Symmetry This test is computed for m-by-m square matrix as: šµš‘Š = āˆ‘ āˆ‘ (š‘›š‘–š‘— āˆ’ š‘›š‘—š‘–)2 š‘›š‘–š‘— + š‘›š‘—š‘– š‘–āˆ’1 š‘—=1 š‘šāˆ’1 š‘–=1 For large samples, BW has an asymptotic chi-square distribution with M*(M - 1)/2 ā€“ R degrees of freedom under the null hypothesis of symmetry, where R is the number of off-diagonal cells with nij + nji = 0.
  • 7. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 7 Risk Test Let be Risk Factor Disease status Cohort = Present Cohort = Absent Present a b Absent c d Odds ratio The odds ratio (Risk Factor = Present / Risk Factor = Absent) is computed as: š‘‚š‘… = š‘Ž š‘ā„ š‘ š‘‘ā„ The distribution of the log odds ratio is approximately normal with: šœ’ ~ š‘(log(š‘‚š‘…) , šœŽ2 ) The standard error for the log odds ratio is approximately š‘†šø = āˆš 1 š‘Ž + 1 š‘ + 1 š‘ + 1 š‘‘ The 95% confidence interval for the odds ratio is computed as [exp(log(š‘‚š‘…) āˆ’ š‘§0.025 āˆ— š‘†šø) ; exp(log(š‘‚š‘…) + š‘§0.025 āˆ— š‘†šø)] To test the hypothesis that the population odds ratio equals one, is computed the two-sided p-value as š‘ š‘–š‘”š‘›š‘–š‘“š‘–š‘š‘Žš‘›š‘š‘’ (2 āˆ’ š‘ š‘–š‘‘š‘’š‘‘) = 2 āˆ— š‘ƒ(š‘§ ā‰¤ āˆ’|log(š‘‚š‘…)| š‘†šø ) Source: https://en.wikipedia.org/wiki/Odds_ratio Relative Risk The relative risk (for cohort Disease status = Present) is computed as š‘…š‘… = š‘Ž š‘Ž + š‘ā„ š‘ š‘ + š‘‘ā„ The distribution of the log relative risk is approximately normal with: šœ’ ~ š‘(log(š‘‚š‘…) , šœŽ2 )
  • 8. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 8 The standard error for the log relative risk is approximately š‘†šø = āˆš 1 š‘Ž + 1 š‘ āˆ’ 1 š‘Ž + š‘ āˆ’ 1 š‘ + š‘‘ The 95% confidence interval for the relative risk is computed as [exp(log(š‘…š‘…) āˆ’ š‘§0.025 āˆ— š‘†šø) ; exp(log(š‘…š‘…) + š‘§0.025 āˆ— š‘†šø)] To test the hypothesis that the population relative risk equals one, is computed the two-sided p-value as š‘ š‘–š‘”š‘›š‘–š‘“š‘–š‘š‘Žš‘›š‘š‘’ (2 āˆ’ š‘ š‘–š‘‘š‘’š‘‘) = 2 āˆ— š‘ƒ(š‘§ ā‰¤ āˆ’|log(š‘…š‘…)| š‘†šø ) The relative risk (for cohort Disease status = Absent) is computed as š‘…š‘… = š‘ š‘Ž + š‘ā„ š‘‘ š‘ + š‘‘ā„ Epidemiology Risk All the parameters are computed for cohort Disease status = Present. Attributable risk, represents how much the risk factor increase/decrease the risk of disease š“š‘… = š‘Ž š‘Ž + š‘ āˆ’ š‘ š‘ + š‘‘ If AR > 0 there an increase of the risk. If AR < 0 there is a reduction of the risk. Relative Attributable Risk š‘…š‘… = š‘Ž š‘Ž + š‘ āˆ’ š‘ š‘ + š‘‘ š‘ š‘ + š‘‘ = š“š‘… š‘ š‘ + š‘‘ Number Needed to Harm š‘š‘š» = 1 š‘Ž š‘Ž + š‘ āˆ’ š‘ š‘ + š‘‘ = 1 š“š‘… The number needed to harm (NNH) is an epidemiological measure that indicates how many patients on average need to be exposed to a risk-factor over a specific period to cause harm in an average of one patient who would not otherwise have been harmed. A negative number would not be presented as a NNH, rather, as the risk factor is not harmful, it is expressed as a number needed to treat (NNT) or number needed to avoid to expose to risk.
  • 9. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 9 Attributable risk per unit š“š‘…š‘ƒ = š‘…š‘… āˆ’ 1 š‘…š‘… Preventive fraction š‘ƒš¹ = 1 āˆ’ š‘…š‘… Etiologic fraction is the proportion of cases in which the exposure has played a causal role in disease development. šøš¹ = š‘Ž āˆ’ š‘ š‘Ž A similar parameters are computed for cohort Disease status = Absent. Source: https://en.wikipedia.org/wiki/Relative_risk Cohen's Kappa Test Given a k-by-k square matrix, which collect the scores of two raters who each classify N items into k mutually exclusive categories, the equation for Cohen's kappa coefficient is š‘˜Ģ‚ = š‘ š‘œ āˆ’ š‘ š‘’ 1 āˆ’ š‘ š‘’ Where š‘ š‘œ = āˆ‘ š‘›š‘–š‘– š‘ = āˆ‘ š‘š‘–š‘– š‘˜ š‘–=1 š‘˜ š‘–=1 š‘Žš‘›š‘‘ š‘š‘’ = āˆ‘ š‘š‘–. š‘.š‘– š‘˜ š‘–=1 š‘¤ā„Žš‘’š‘Ÿš‘’ š‘š‘–š‘— = š‘›š‘–š‘— š‘ š‘Žš‘›š‘‘ š‘š‘–. = āˆ‘ š‘›š‘–š‘— š‘ š‘˜ š‘—=1 š‘Žš‘›š‘‘ š‘.š‘— = āˆ‘ š‘›š‘–š‘— š‘ š‘˜ š‘–=1 The asymptotic variance is computed by š‘£š‘Žš‘Ÿ(š‘˜Ģ‚) = 1 š‘(1 āˆ’ š‘š‘’)4 { āˆ‘ š‘š‘–š‘–[(1 āˆ’ š‘š‘’) āˆ’ (š‘.š‘– + š‘š‘–.)(1 āˆ’ š‘ š‘œ)]2 š‘˜ š‘–=1 + (1 āˆ’ š‘0)2 āˆ‘ āˆ‘ š‘š‘–š‘—(š‘.š‘– + š‘š‘—.)2 š‘˜ š‘—=1,š‘—ā‰ š‘– āˆ’ (š‘ š‘œ š‘š‘’ āˆ’ 2š‘š‘’ + š‘ š‘œ)2 š‘˜ š‘–=1 }
  • 10. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 10 The formulae is given by Fleiss, Cohen, and Everitt (1969), and modified by Fleiss (1981). The asymptotic standard error is the root square of the value given above. This standard error and the normal distribution N(0,1) must be used to compute confidence intervals. š‘˜Ģ‚ Ā± š‘§āˆ/2āˆš š‘£š‘Žš‘Ÿ(š‘˜Ģ‚) To compute an asymptotic test for the kappa coefficient, ISSTATS uses a standardized test statistic T which has an asymptotic standard normal distribution under the null hypothesis that kappa equals zero (H0: k = 0). The standardized test statistic is computed as š‘‡ = š‘˜Ģ‚ āˆš š‘£š‘Žš‘Ÿ0(š‘˜Ģ‚) ā‰… š‘(0,1) Where the variance of the kappa coefficient under the null hypothesis is š‘£š‘Žš‘Ÿ0(š‘˜Ģ‚) = 1 š‘(1 āˆ’ š‘š‘’)2 { š‘š‘’ + š‘š‘’ 2 āˆ’ āˆ‘ š‘.š‘– š‘š‘–.(š‘.š‘–+ š‘š‘–.) š‘˜ š‘–=1 } Refer to Fleiss (1981) Source: https://v8doc.sas.com/sashtml/stat/chap28/sect26.htm Nominal by Nominal Measures of Association Contingency Coefficient Cramer's V is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). š¶ = āˆš šœ’2 šœ’2 + š‘ Where ļ‚· šœ’2 is the Pearson's cumulative test statistic. ļ‚· N is the total sample size. C asymptotically approaches a šœ’2 distribution with (r - 1)(c - 1) degrees of freedom. Standardized Contingency Coefficient
  • 11. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 11 If X and Y have the same number of categories (r = c), then the maximum value for the contingency coefficient is calculated as: š‘ š‘šš‘Žš‘„ = āˆš š‘Ÿ āˆ’ 1 š‘Ÿ If X and Y have a differing number of categories (r ā‰  c), then the maximum value for the contingency coefficient is calculated as š‘ š‘šš‘Žš‘„ = āˆš (š‘Ÿ āˆ’ 1)(š‘ āˆ’ 1) š‘Ÿ āˆ— š‘ 4 The standardized contingency coefficient is calculated as the ratio: š‘š‘†š‘”š‘Žš‘›š‘‘š‘Žš‘Ÿš‘‘š‘–š‘§š‘’š‘‘ = š¶ š‘ š‘šš‘Žš‘„ which varies between 0 and 1 with 0 indicating independence and 1 dependence. Phi coefficient The phi coefficient is a measure of association for two nominal variables. š›· = āˆš šœ’2 š‘ Where ļ‚· šœ’2 is the Pearson's cumulative test statistic. ļ‚· N is the total sample size. Phi asymptotically approaches a šœ’2 distribution with (r - 1)(c - 1) degrees of freedom. Cramer's V Cramer's V is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). š‘‰ = āˆš šœ’2 š‘ ā„ š‘šš‘–š‘›{š‘Ÿ āˆ’ 1, š‘ āˆ’ 1} Where ļ‚· šœ’2 is the Pearson's cumulative test statistic. ļ‚· N is the total sample size. V asymptotically approaches a Ļ‡2 distribution with (r - 1)(c - 1) degrees of freedom.
  • 12. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 12 Tschuprow's T Tschuprow's T is a measure of association between two nominal variables, giving a value between 0 and 1 (inclusive). š‘‡ = āˆš šœ’2 š‘ ā„ āˆš(š‘Ÿ āˆ’ 1)(š‘ āˆ’ 1) Lambda Asymmetric lambda, Ī»(C/R) or column variable dependent, is interpreted as the probable improvement in predicting the column variable Y given knowledge of the row variable X. The range of asymmetric lambda is {0, 1}. Asymmetric lambda (C/R) or column variable dependent is computed as šœ†(š¶/š‘…) = āˆ‘ š‘Ÿš‘–š‘– āˆ’ š‘Ÿ š‘ āˆ’ š‘Ÿ The asymptotic variance is š‘£š‘Žš‘Ÿ( šœ†(š¶/š‘…)) = š‘ āˆ’ āˆ‘ š‘Ÿš‘–š‘– ( š‘Ÿ āˆ’ š‘)3 { āˆ‘ š‘Ÿš‘– š‘– + š‘Ÿ āˆ’ 2 āˆ‘(š‘Ÿš‘–|š‘™š‘– = š‘™) š‘– } Where š‘Ÿš‘– = max š‘— {š‘›š‘–š‘—} š‘Žš‘›š‘‘ š‘Ÿ = max š‘— {š‘Ÿ.š‘—} š‘Žš‘›š‘‘ š‘š‘— = max š‘– {š‘›š‘–š‘—} š‘Žš‘›š‘‘ š‘ = max š‘– {š‘›š‘–.} The values of li and l are determined as follows. Denote by li the unique value of j such that ri = nij, and let l be the unique value of j such that r = n.j. Because of the uniqueness assumptions, ties in the frequencies or in the marginal totals must be broken in an arbitrary but consistent manner. In case of ties, l is defined as the smallest value of such that r = n.j. For those columns containing a cell (i, j) for which nij = ri = cj, csj records the row in which cj is assumed to occur. Initially is set equal to ā€“1 for all j. Beginning with i = 1, if there is at least one value j such that nij = ri = cj, and if csj = -1, then li is defined to be the smallest such value of j, and csj is set equal to i. Otherwise, if nil = ri, then li is defined to be equal to l. If neither condition is true, then li is taken to be the smallest value of j such that nij = ri. The asymptotic standard error is the root square of the asymptotic variance. The formulas for lambda asymmetric Ī»(R/C) can be obtained by interchanging the indices.
  • 13. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 13 šœ†(š‘…/š¶) = āˆ‘ š‘š‘—š‘— āˆ’ š‘ š‘ āˆ’ š‘ The Symmetric lambda is the average of the two asymmetric lambdas, Ī»(C/R) and Ī»(R/C). Its range is {- 1, 1}. Lambda symmetric is computed as šœ† = āˆ‘ š‘Ÿš‘–š‘– + āˆ‘ š‘š‘—š‘— āˆ’ š‘Ÿ āˆ’ š‘ 2š‘ āˆ’ š‘Ÿ āˆ’ š‘ The asymptotic variance is š‘£š‘Žš‘Ÿ( šœ†) = 1 š‘¤4 { š‘¤š‘£š‘¦ āˆ’ 2š‘¤2 [š‘ āˆ’ āˆ‘ āˆ‘(š‘›š‘–š‘—|š‘— = š‘™š‘–, š‘– = š‘˜š‘—) š‘—š‘– ] āˆ’ 2š‘£2 (š‘ āˆ’ š‘› š‘˜š‘™)} Where š‘¤ = 2š‘› āˆ’ š‘Ÿ āˆ’ š‘ š‘Žš‘›š‘‘ š‘£ = 2š‘› āˆ’ āˆ‘ š‘Ÿš‘– š‘– āˆ’ āˆ‘ š‘š‘— š‘— š‘Žš‘›š‘‘ š‘„ = āˆ‘(š‘Ÿš‘– | š‘™š‘– = š‘™) š‘– + āˆ‘(š‘š‘— | š‘˜š‘— = š‘˜) š‘— + š‘Ÿš‘˜ + š‘š‘™ š‘Žš‘›š‘‘ š‘¦ = 8š‘ āˆ’ š‘¤ āˆ’ š‘£ āˆ’ 2š‘„ The definitions of l and li are given in the previous section. The values k and kj are defined in a similar way for lambda asymmetric (R/C). Uncertainty Coefficient The uncertainty coefficient U (C/R) -or column variable dependent U- measures the proportion of uncertainty (entropy) in the column variable Y that is explained by the row variable X. Its range is {0, 1}. The uncertainty coefficient is computed as š‘ˆ(š¶/š‘…) = š‘ˆ š‘š‘œš‘™š‘¢š‘šš‘› š‘£š‘Žš‘Ÿš‘–š‘Žš‘š‘™š‘’ š‘‘š‘’š‘š‘’š‘›š‘‘š‘’š‘›š‘” = š‘£ š‘¤ = H(X) + H(Y) āˆ’ H(XY) H(Y) Where š»(š‘‹) = āˆ’ āˆ‘ š‘›š‘–. š‘› ln ( š‘›š‘–. š‘› ) š‘– š‘Žš‘›š‘‘ š»(š‘Œ) = āˆ’ āˆ‘ š‘›.š‘— š‘› ln ( š‘›.š‘— š‘› ) š‘– š‘Žš‘›š‘‘ š»(š‘‹š‘Œ) = āˆ’ āˆ‘ āˆ‘ š‘›š‘–š‘— š‘› ln ( š‘›š‘–š‘— š‘› ) š‘—š‘– The asymptotic variance is š‘£š‘Žš‘Ÿ(š‘ˆ(š¶/š‘…)) = 1 š‘›2 š‘¤4 āˆ‘ āˆ‘ š‘›š‘–š‘— {š»(š‘Œ) ln ( š‘›š‘–š‘— š‘›š‘–. ) + (H(X) āˆ’ H(XY)) ln ( š‘›.š‘— š‘› )} 2 š‘—š‘– The asymptotic standard error is the root square of the asymptotic variance.
  • 14. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 14 The formulas for the uncertainty coefficient U (C/R) can be obtained by interchanging the indices. The symmetric uncertainty coefficient is computed as š‘ˆ = 2 āˆ— [H(X) + H(Y) āˆ’ H(XY)] H(X) + H(Y) The asymptotic variance is š‘£š‘Žš‘Ÿ(š‘ˆ) = 4 āˆ‘ āˆ‘ š‘›š‘–š‘— {š»(š‘‹š‘Œ) ln ( š‘›š‘–. š‘›.š‘— š‘›2 ) āˆ’ (H(X) āˆ’ H(Y)) ln ( š‘›.š‘— š‘› )} 2 š‘›2(H(X) + H(Y))4 š‘—š‘– The asymptotic standard error is the root square of the asymptotic variance. Ordinal by Ordinal Measures of Association Let nij denote the observed frequency in cell (i, j) in a IxJ contingency table. Let be N the total frequency and š“š‘–š‘— = āˆ‘ āˆ‘ š‘› š‘˜š‘™ š‘™<š‘—š‘˜<š‘– + āˆ‘ āˆ‘ š‘› š‘˜š‘™ š‘™>š‘—š‘˜>š‘– š·š‘–š‘— = āˆ‘ āˆ‘ š‘› š‘˜š‘™ š‘™<š‘—š‘˜>š‘– + āˆ‘ āˆ‘ š‘› š‘˜š‘™ š‘™>š‘—š‘˜<š‘– š‘ƒ = āˆ‘ āˆ‘ š‘Žš‘–š‘— š“š‘–š‘— š‘—š‘– š‘Žš‘›š‘‘ š‘„ = āˆ‘ āˆ‘ š‘Žš‘–š‘— š·š‘–š‘— š‘—š‘– Gamma Coefficient The gamma (G) statistic is based only on the number of concordant and discordant pairs of observations. It ignores tied pairs (that is, pairs of observations that have equal values of X or equal values of Y). Gamma is appropriate only when both variables lie on an ordinal scale. The range of gamma is {-1, 1}. If the row and column variables are independent, then gamma tends to be close to zero. Gamma is estimated by šŗ = š‘ƒ āˆ’ š‘„ š‘ƒ + š‘„ The asymptotic variance is
  • 15. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 15 š‘£š‘Žš‘Ÿ(šŗ) = 16 ( š‘ƒ + š‘„)2 { āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— (š‘„š“š‘–š‘— āˆ’ š‘ƒš·š‘–š‘—)2 š½ š‘—=1 š¼ š‘–=1 } The asymptotic standard error is the root square of the asymptotic variance. The variance under the null hypothesis that gamma equals zero is computed as š‘£š‘Žš‘Ÿ0(šŗ) = 4 ( š‘ƒ + š‘„)2 { āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘— 2 š½ š‘—=1 āˆ’ (š‘ƒ āˆ’ š‘„)2 š‘ š¼ š‘–=1 } Where dij = Aij - Dij The asymptotic standard error under the null hypothesis that d equals zero is the root square of the variance. Kendall's tau-b Kendallā€™s tau-b is similar to gamma except that tau-b uses a correction for ties. Tau-b is appropriate only when both variables lie on an ordinal scale. The range of tau-b is {-1, 1}. Kendallā€™s tau-b is estimated by šœ š‘ = š‘ƒ āˆ’ š‘„ š‘¤ Where š‘¤š‘Ÿ = š‘›2 āˆ’ āˆ‘ š‘›š‘–. 2 š‘– š‘Žš‘›š‘‘ š‘¤š‘ = š‘›2 āˆ’ āˆ‘ š‘›.š‘— 2 š‘– š‘Žš‘›š‘‘ š‘¤ = āˆš š‘¤š‘Ÿ š‘¤š‘ The asymptotic variance is š‘£š‘Žš‘Ÿ( šœ š‘) = 1 š‘¤4 { āˆ‘ āˆ‘ š‘›š‘–š‘—(2š‘¤š‘‘š‘–š‘— + šœ š‘ š‘£š‘–š‘—)2 š½ š‘—=1 š¼ š‘–=1 āˆ’ š‘3 šœ š‘ 2 ( š‘¤ š‘Ÿ + š‘¤ š‘)2 } where š‘£š‘–š‘— = š‘¤ š‘ š‘›š‘–. + š‘¤ š‘Ÿ š‘›.š‘— The asymptotic standard error is the root square of the asymptotic variance. The variance under the null hypothesis that tau-b equals zero is computed as
  • 16. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 16 š‘£š‘Žš‘Ÿ0( šœ š‘) = 4 š‘¤ š‘Ÿ š‘¤ š‘ { āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘— 2 š½ š‘—=1 āˆ’ (š‘ƒ āˆ’ š‘„)2 š‘ š¼ š‘–=1 } The asymptotic standard error under the null hypothesis that d equals zero is the root square of the variance. Stuart-Kendall's tau-c Stuart-Kendallā€™s tau-c makes an adjustment for table size in addition to a correction for ties. Tau-c is appropriate only when both variables lie on an ordinal scale. The range of tau-c is {-1, 1}. Stuart- Kendallā€™s tau-c is estimated by šœ š‘ = š‘š(š‘ƒ āˆ’ š‘„) š‘2(š‘š āˆ’ 1) Where m =min {I, J}. The asymptotic variance is š‘£š‘Žš‘Ÿ( šœ š‘) = 4š‘š2 š‘4 (š‘š āˆ’ 1)2 { āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘— 2 š½ š‘—=1 āˆ’ (š‘ƒ āˆ’ š‘„)2 š‘ š¼ š‘–=1 } The asymptotic standard error is the root square of the asymptotic variance. The variance under the null hypothesis that tau-c equals zero is the same as the asymptotic variance. Sommersā€™ D Somersā€™ D(C/R) and Somersā€™ D(R/C) are asymmetric modifications of tau-b. C/R indicates that the row variable X is regarded as the independent variable and the column variable Y is regarded as dependent. Similarly, R/C indicates that the column variable Y is regarded as the independent variable and the row variable X is regarded as dependent. Somersā€™ D differs from tau-b in that it uses a correction only for pairs that are tied on the independent variable. Somersā€™ D is appropriate only when both variables lie on an ordinal scale. The range of Somersā€™ D is {-1, 1}. Somersā€™ D is computed as š·(š¶/š‘…) = š· š‘š‘œš‘™š‘¢š‘šš‘› š‘£š‘Žš‘Ÿš‘–š‘Žš‘š‘™š‘’ š‘‘š‘’š‘š‘’š‘›š‘‘š‘’š‘›š‘” = š‘ƒ āˆ’ š‘„ š‘¤š‘Ÿ The asymptotic variance is
  • 17. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 17 š‘£š‘Žš‘Ÿ( š·(š¶/š‘…)) = 4 š‘¤ š‘Ÿ 4 { āˆ‘ āˆ‘ š‘›š‘–š‘—[š‘¤š‘Ÿ š‘‘š‘–š‘— āˆ’ (š‘ƒ āˆ’ š‘„)(š‘ āˆ’ š‘›š‘–.)]2 š½ š‘—=1 š¼ š‘–=1 } The asymptotic standard error is the root square of the asymptotic variance. The variance under the null hypothesis that D(C/R) equals zero is computed as š‘£š‘Žš‘Ÿ0( š·(š¶/š‘…)) = 4 š‘¤ š‘Ÿ 2 { āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘— 2 š½ š‘—=1 āˆ’ (š‘ƒ āˆ’ š‘„)2 š‘ š¼ š‘–=1 } The asymptotic standard error under the null hypothesis that D(C/R) equals zero is the root square of the variance. Formulas for Somersā€™ D(R/C) are obtained by interchanging the indices. Symmetric version of Somersā€™ d is š‘‘ = š‘ƒ āˆ’ š‘„ š‘¤š‘Ÿ + š‘¤š‘ 2 The standard error is š“š‘†šø(š‘‘) = 2šœŽ šœš‘ š‘¤ š‘¤ š‘Ÿ + š‘¤ š‘ where ĻƒĻ„b is the asymptotic standard error of Kendallā€™s tau-b. The variance under the null hypothesis that d equals zero is computed as š‘£š‘Žš‘Ÿ0(š‘‘) = 16 ( š‘¤ š‘Ÿ + š‘¤ š‘)2 { āˆ‘ āˆ‘ š‘›š‘–š‘— āˆ— š‘‘š‘–š‘— 2 š½ š‘—=1 āˆ’ (š‘ƒ āˆ’ š‘„)2 š‘ š¼ š‘–=1 } The asymptotic standard error under the null hypothesis that d equals zero is the root square of the variance. Confidence Bounds and One-Sided Tests Suppose you are testing the null hypothesis H0: ļ­ ā‰„ ļ­0 against the one-sided alternative H1: ļ­ < ļ­0. Rather than give a two-sided confidence interval for ļ­, the more appropriate procedure is to give an upper confidence bound in this setting. This upper confidence bound has a direct relationship to the one-sided test, namely:
  • 18. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 18 1. A level ļ” test of H0: ļ­ ā‰„ ļ­0 against the one-sided alternative H1: ļ­ < ļ­0 rejects H0 exactly when the value ļ­0 is above the 1ā€“Ī± upper confidence bound. 2. A level ļ” test of H0: ļ­ ā‰¤ ļ­0 against the one-sided alternative H1: ļ­ > ļ­0 rejects H0 exactly when the value ļ­0 is above the 1ā€“Ī± lower confidence bound. ANOVA Test š‘†š‘†š‘‡š‘œš‘”š‘Žš‘™ = āˆ‘ āˆ‘(š‘¦š‘–š‘— āˆ’ š‘Œ.. Ģ…)2 š‘š‘– š‘—=1 š‘˜ š‘–=1 š‘†š‘†š¼š‘›š‘”š‘’š‘Ÿ = āˆ‘ š‘›š‘–(š‘ŒĢ…š‘–. āˆ’ š‘Œ.. Ģ…)2 š‘˜ š‘–=1 š‘†š‘†š¼š‘›š‘”š‘Ÿš‘Ž = āˆ‘ āˆ‘(š‘¦š‘–š‘— āˆ’ š‘Œš‘–. Ģ… )2 š‘› š‘– š‘—=1 š‘˜ š‘–=1 = š‘†š‘†š‘‡š‘œš‘”š‘Žš‘™ āˆ’ š‘†š‘†š¼š‘›š‘”š‘’š‘Ÿ DF Total = N ā€“ 1 DF Inter = k ā€“ 1 DF Intra = N ā€“ k š‘€š‘†š‘‡š‘œš‘”š‘Žš‘™ = SSTotal DFTotal š‘€š‘†š¼š‘›š‘”š‘’š‘Ÿ = SSInter DFInter š‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž = SSIntra DFIntra š¹ = MSInter MSIntra where ļ‚· F is the result of the test ļ‚· k is the number of different groups to which the sampled cases belong ļ‚· š‘ = āˆ‘ š‘›š‘– š‘˜ š‘–=1 is the total sample size ļ‚· ni is the number of cases in the i-th group ļ‚· yij is the value of the measured variable for the j-th case from the i-th group ļ‚· š‘ŒĢ….. is the mean of all yij ļ‚· š‘ŒĢ…š‘–. is the mean of the yij for group i. The test statistic has a F-distribution with DF Inter and DF Intra degrees of freedom. Thus the null hypothesis is rejected if š¹ ā‰„ š¹(1 āˆ’ š›¼) š‘āˆ’š‘˜ š‘˜āˆ’1
  • 19. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 19 ANOVA Multiple Comparisons Difference of Means š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Standard Error of the Difference of Means Estimator š‘†š‘”š‘‘. šøš‘Ÿš‘Ÿš‘œš‘Ÿ = āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— ( 1 š‘›š‘– + 1 š‘›š‘— ) Scheffeā€™s Method Confidence Interval for Difference of Means š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± āˆšš·š¹š¼š‘›š‘”š‘’š‘Ÿ āˆ— š‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— š¹(1 āˆ’ š›¼) š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž š·š¹ š¼š‘›š‘”š‘’š‘Ÿ āˆ— ( 1 š‘›š‘– + 1 š‘›š‘— ) Source: http://en.wikipedia.org/wiki/Scheff%C3%A9%27s_method Tukey's range test HSD Confidence Interval for Difference of Means š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘ž(1 āˆ’ š›¼) š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž š‘˜ āˆš š‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž 2 āˆ— ( 1 š‘›š‘– + 1 š‘›š‘— ) Where q is the studentized range distribution. Source: https://en.wikipedia.org/wiki/Tukey%27s_range_test Fisher's Method LSD If overall ANOVA test is not significant, you must not consider any results of Fisher test, significant or not. Confidence Interval for Difference of Means š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘”(1 āˆ’ š›¼ 2ā„ ) š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— ( 1 š‘›š‘– + 1 š‘›š‘— ) Where t is the student distribution. Bonferroni's Method The family-wise significance level (FWER) is Ī± = 1 - Confidence Level. Thus any comparison flagged by ISSTATS as significant is based on a Bonferroni Correction:
  • 20. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 20 š›¼ā€² = 2š›¼ š‘˜(š‘˜ āˆ’ 1) š‘ā€² = š‘ š‘˜(š‘˜ āˆ’ 1) 2 Where k is the number of groups. Confidence Interval for Difference of Means š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘” (1 āˆ’ š›¼ā€² 2ā„ ) š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— ( 1 š‘›š‘– + 1 š‘›š‘— ) Where t is the student distribution. Sidak's Method The family-wise significance level (FWER) is Ī± = 1 - Confidence Level. So any comparison flagged by ISSTATS as significant is based on a Sidak Correction: š›¼ā€² = (1 āˆ’ š›¼) 2 š‘˜(š‘˜āˆ’1) š‘ā€² = 1 āˆ’ š‘’ log(1āˆ’š‘)š‘˜(š‘˜āˆ’1) 2 Where k is the number of groups. Confidence Interval for Difference of Means š¶š¼ (1 āˆ’ š›¼) = š‘¦Ģ…š‘– āˆ’ š‘¦Ģ…š‘— Ā± š‘” (1 āˆ’ š›¼ā€² 2ā„ ) š·š¹ š¼š‘›š‘”š‘Ÿš‘Ž āˆšš‘€š‘†š¼š‘›š‘”š‘Ÿš‘Ž āˆ— ( 1 š‘›š‘– + 1 š‘›š‘— ) Where t is the student distribution. Welchā€™s Test for equality of means The test statistic, F* , is defined as follows: š¹āˆ— = āˆ‘ š‘¤š‘–(š‘„Ģ…š‘– āˆ’ š‘‹Ģƒ)2š‘˜ š‘–=1 š‘˜ āˆ’ 1 1 + 2(š‘˜ āˆ’ 2) š‘˜2 āˆ’ 1 āˆ— āˆ‘ ā„Žš‘– š‘˜ š‘–=1 where ļ‚· F* is the result of the test ļ‚· k is the number of different groups to which the sampled cases belong ļ‚· ni is the number of cases in the i-th group ļ‚· š‘¤š‘– = š‘› š‘– š‘†š‘– 2
  • 21. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 21 ļ‚· š‘Š = āˆ‘ š‘¤š‘– = āˆ‘ š‘› š‘– š‘†š‘– 2 š‘˜ š‘–=1 š‘˜ š‘–=1 ļ‚· š‘‹Ģƒ = āˆ‘ š‘¤ š‘– š‘„Ģ… š‘– š‘˜ š‘–=1 š‘Š ļ‚· ā„Žš‘– = (1āˆ’ š‘¤ š‘– š‘Š ) 2 š‘› š‘–āˆ’1 The test statistic has approximately a F-distribution with k-1 and š‘‘š‘“ = š‘˜2āˆ’1 3āˆ—āˆ‘ ā„Ž š‘– š‘˜ š‘–=1 degrees of freedom. Thus the null hypothesis is rejected if š¹āˆ— ā‰„ š¹(1 āˆ’ š›¼) š‘‘š‘“ š‘˜āˆ’1 Brownā€“Forsythe Test for equality of means The test statistic, F* , is defined as follows: š¹āˆ— = āˆ‘ š‘›š‘–(š‘„Ģ…š‘– āˆ’ š‘‹Ģ…..)2š‘˜ š‘–=1 āˆ‘ (1 āˆ’ š‘›š‘– š‘) š‘†š‘– 2š‘˜ š‘–=1 where ļ‚· F* is the result of the test ļ‚· k is the number of different groups to which the sampled cases belong ļ‚· ni is the number of cases in the i-th group (sample size of group i) ļ‚· š‘ = āˆ‘ š‘›š‘– š‘˜ š‘–=1 is the total sample size ļ‚· š‘‹Ģ….. = āˆ‘ š‘› š‘– š‘„Ģ… š‘– š‘˜ š‘–=1 š‘ is the overall mean. The test statistic has approximately a F-distribution with k-1 and df degrees of freedom. Where df is obtained with the Satterthwaite (1941) approximation as 1 df = āˆ‘ ci 2 ni āˆ’ 1 k i=1 with š‘š‘— = (1 āˆ’ š‘›š‘— š‘) š‘†š‘— 2 āˆ‘ (1 āˆ’ š‘›š‘– š‘) š‘†š‘– 2š‘˜ š‘–=1 Thus the null hypothesis is rejected if š¹āˆ— ā‰„ š¹(1 āˆ’ š›¼) š‘‘š‘“ š‘˜āˆ’1 Homoscedasticity Tests Levene's Test The test statistic, F, is defined as follows: š¹ = š‘ āˆ’ š‘˜ š‘˜ āˆ’ 1 āˆ— āˆ‘ š‘›š‘–(š‘Ģ…š‘–. āˆ’ š‘Ģ…..)2š‘˜ š‘–=1 āˆ‘ āˆ‘ (š‘š‘–š‘— āˆ’ š‘Ģ…š‘–.)2š‘› š‘– š‘—=1 š‘˜ š‘–=1
  • 22. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 22 where ļ‚· F is the result of the test ļ‚· k is the number of different groups to which the sampled cases belong ļ‚· š‘ = āˆ‘ š‘›š‘– š‘˜ š‘–=1 is the total sample size ļ‚· ni is the number of cases in the i-th group ļ‚· Yij is the value of the measured variable for the j-th case from the i-th group ļ‚· š‘š‘–š‘— = |š‘Œš‘–š‘— āˆ’ š‘ŒĢ…š‘–.| where š‘ŒĢ…š‘–. is a mean of i-th group ļ‚· š‘Ģ….. is the mean of all Zij ļ‚· š‘Ģ…š‘–. is the mean of the Zij for group i. The test statistic has a F-distribution with k-1 and N-k degrees of freedom. Thus the null hypothesis is rejected if š¹ ā‰„ š¹(1 āˆ’ š›¼) š‘āˆ’š‘˜ š‘˜āˆ’1 Source: http://en.wikipedia.org/wiki/Levene%27s_test Brownā€“Forsythe Test for equality of variances The test statistic, F, is defined as follows: š¹ = š‘ āˆ’ š‘˜ š‘˜ āˆ’ 1 āˆ— āˆ‘ š‘›š‘–(š‘Ģ…š‘–. āˆ’ š‘Ģ…..)2š‘˜ š‘–=1 āˆ‘ āˆ‘ (š‘š‘–š‘— āˆ’ š‘Ģ…š‘–.)2š‘›š‘– š‘—=1 š‘˜ š‘–=1 where ļ‚· F is the result of the test ļ‚· k is the number of different groups to which the sampled cases belong ļ‚· š‘ = āˆ‘ š‘›š‘– š‘˜ š‘–=1 is the total sample size ļ‚· ni is the number of cases in the i-th group ļ‚· Yij is the value of the measured variable for the j-th case from the i-th group ļ‚· š‘š‘–š‘— = |š‘Œš‘–š‘— āˆ’ š‘ŒĢƒš‘–.| where š‘ŒĢƒš‘–. is a median of i-th group ļ‚· š‘Ģ….. is the mean of all Zij ļ‚· š‘Ģ…š‘–. is the mean of the Zij for group i. The test statistic has a F-distribution with k-1 and N-k degrees of freedom. Thus the null hypothesis is rejected if š¹ ā‰„ š¹(1 āˆ’ š›¼) š‘āˆ’š‘˜ š‘˜āˆ’1 Source: http://en.wikipedia.org/wiki/Levene%27s_test Bartlett's Test Bartlett's test is used to test the null hypothesis, H0 that all k population variances are equal against the alternative that at least two are different. If there are k samples with size ni and sample variances S2 i then Bartlett's test statistic is šœ’2 = (š‘ āˆ’ š‘˜)š‘™š‘›(š‘† š‘ 2 ) āˆ’ āˆ‘ (š‘›š‘– āˆ’ 1)š‘™š‘›(š‘†š‘– 2 )š‘˜ š‘–=1 1 + 1 3(š‘˜ āˆ’ 1) āˆ— (āˆ‘ ( 1 š‘›š‘– āˆ’ 1)š‘˜ š‘–=1 āˆ’ 1 š‘ āˆ’ š‘˜ ) where
  • 23. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 23 ļ‚· š‘ = āˆ‘ š‘›š‘– š‘˜ š‘–=1 is the total sample size ļ‚· š‘† š‘ 2 = āˆ‘ (š‘› š‘–āˆ’1)š‘†š‘– 2š‘˜ š‘–=1 š‘āˆ’š‘˜ is the pooled estimate for the variance The test statistic has approximately a chi-squared distribution with k-1 degrees of freedom. Thus the null hypothesis is rejected if šœ’2 ā‰„ šœ’ š‘˜āˆ’1 2 (1 āˆ’ š›¼). Source: http://en.wikipedia.org/wiki/Bartlett%27s_test Bivariate Correlation Tests Sample Covariance Sxy = āˆ‘ (xi āˆ’ xĢ…)(yi āˆ’ yĢ…)N i=1 N āˆ’ 1 Where š‘ = āˆ‘ š‘›š‘– š‘˜ š‘–=1 is the total sample size. Source: http://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance Sample Pearson Product-Moment Correlation Coefficient r = 1 N āˆ’ 1 āˆ— āˆ‘ (š‘„š‘– āˆ’ š‘„Ģ…)(š‘¦š‘– āˆ’ š‘¦Ģ…)š‘ š‘–=1 š‘† š‘„ š‘† š‘¦ = š‘† š‘„š‘¦ š‘† š‘„ š‘† š‘¦ where Sx and Sy are the sample standard deviation of the paired sample (xi, yi), Sxy is the sample covariance and N is the total sample size. Source: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample Test for the Significance of the Pearson Product-Moment Correlation Coefficient Test hypothesis are: ļ‚· H0: the sample values come from a population in which Ļ=0 ļ‚· H1: the sample values come from a population in which Ļā‰ 0 Test statistic is t = r āˆ— āˆšN āˆ’ 2 āˆš1 āˆ’ r2 where ļ‚· š‘ = āˆ‘ š‘›š‘– š‘˜ š‘–=1 is the total sample size ļ‚· r is the Sample Pearson Product-Moment Correlation Coefficient
  • 24. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 24 The test statistic has a t-student distribution with N-2 degrees of freedom. Spearman Correlation Coefficient For each of the variables X and Y separately, the observations are sorted into ascending order and replaced by their ranks. Identical values (rank ties or value duplicates) are assigned a rank equal to the average of their positions in the ascending order of the values. Each time t observations are tied (t>1), the quantity t3 āˆ’t is calculated and summed separately for each variable. These sums will be designated STx and STy. For each of the N observations, the difference between the rank of X and rank of Y is computed as: di = Rank(Xi) āˆ’ Rank(Yi) If there are no ties in both samples, Spearmanā€™s rho (Ļ) is calculated as Ļ = 1 āˆ’ 6 āˆ‘ š‘‘š‘– N(š‘2 āˆ’ 1) If there are any ties in any of the samples, Spearmanā€™s rho (Ļ) is calculated as (Siegel, 1956): Ļ = š‘‡š‘„ + š‘‡š‘¦ āˆ’ āˆ‘ di 2āˆš š‘‡š‘„ āˆ— š‘‡š‘¦ where š‘‡š‘„ = N(š‘2 āˆ’ 1) āˆ’ š‘†š‘‡š‘„ 12 š‘‡š‘¦ = N(š‘2 āˆ’ 1) āˆ’ š‘†š‘‡š‘¦ 12 If Tx or Ty is 0, the statistic is not computed. Source: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2F alg_nonpar_corr_spearman.htm Test for the Significance of the Spearmanā€™s Correlation Coefficient Test hypothesis are: ļ‚· H0: the sample values come from a population in which Ļ=0 ļ‚· H1: the sample values come from a population in which Ļā‰ 0 Test statistic is t = Ļ āˆ— āˆšN āˆ’ 2 āˆš1 āˆ’ Ļ2 The test statistic has a t-student distribution with N-2 degrees of freedom.
  • 25. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 25 Kendall's Tau-b Correlation Coefficient For each of the variables X and Y separately, the observations are sorted into ascending order and replaced by their ranks. In situations where t observations are tied, the average rank is assigned. Each time t > 1, the following quantities are computed and summed over all groups of ties for each variable separately. T1 = āˆ‘ š‘”2 āˆ’ š‘” T2 = āˆ‘(š‘”2 āˆ’ š‘”)(š‘” āˆ’ 2) T3 = āˆ‘(š‘”2 āˆ’ š‘”)(2š‘” + 5) Each of the N cases is compared to the others to determine with how many cases its ranking of X and Y is concordant or discordant. The following procedure is used. For each distinct pair of cases (i, j), where i < j the quantity dij=[Rank(Xj)āˆ’Rank(Xi)][Rank(Yj)āˆ’Rank(Yi)] is computed. If the sign of this product is positive, the pair of observations (i, j) is concordant. If the sign is negative, the pair is discordant. The number of concordant pairs minus the number of discordant pairs is S = āˆ‘ āˆ‘ š‘ š‘–š‘”š‘›(š‘‘š‘–š‘—) š‘ š‘—=š‘–+1 š‘āˆ’1 š‘–=1 where sign(dij) is defined as +1 or ā€“1 depending on the sign of dij. Pairs in which dij=0 are ignored in the computation of S. If there are no ties in both samples, Kendallā€™s tau (Ļ„) is computed as Ļ„ = 2S N2 āˆ’ N If there are any ties in any of the samples, Kendallā€™s tau (Ļ„) is computed as Ļ„ = 2S āˆšN2 āˆ’ N āˆ’ š‘‡1 š‘„āˆšN2 āˆ’ N āˆ’ š‘‡1 š‘¦ If the denominator is 0, the statistic is not computed. Source: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Tau-b Test for the Significance of the Kendall's Tau-b Correlation Coefficient The variance of S is estimated by (Kendall, 1955):
  • 26. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 26 Var = (N2 āˆ’ N)(2N + 5) āˆ’ T3x āˆ’ T3y 18 + T2x āˆ— T2y 9(N2 āˆ’ N)(N āˆ’ 2) + T1x āˆ— T1y 2(N2 āˆ’ N) The significance level is obtained using Z = S āˆšVar Which, under the null hypothesis, is approximately distributed as a standard normal when the variables are statistically independent. Sources: http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient#Significance_tests http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2F alg_nonpar_corr_kendalls.htm Parametric Value at Risk Value at Risk of a single asset Given the time series of daily return rates for an asset, the daily mean of the return rates is Ī¼, the daily variance of the daily return rates is Ļƒ2 . Given the position, hold or investment in the asset P. One-day Expected Return is: ER = PĪ¼ The Standard Deviation or Volatility is the square root of the Variance: šœŽ = āˆš šœŽ2 One-day Value at Risk is: š‘‰š‘Žš‘…1āˆ’š›¼ = āˆ’(Ī¼ + š‘§ š›¼ šœŽ)P where zĪ± is the left-tail Ī± quantile of the normal standard distribution. Total Value at Risk for n trading days is: š‘‰š‘Žš‘…1āˆ’š›¼ š‘› š‘‘š‘Žš‘¦š‘  = š‘‰š‘Žš‘…1āˆ’š›¼ āˆ— āˆš š‘› = āˆ’(Ī¼ + š‘§ š›¼ šœŽ)Pāˆš š‘› Portfolio Value at Risk Given the time series of daily return rates on different assets, the daily mean of the return rates for the i-th asset is Ī¼i, the daily variance of the return rate for the i-th asset is Ļƒi 2 , the daily standard deviation (or volatility) of the return rates for the i-th asset is Ļƒi. The covariance of the daily return rates of i-th and j-th assets is Ļƒij. All parameters are unbiased estimates. Given the holds, positions or investments on each of these assets: Pi Total positions is
  • 27. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 27 P = āˆ‘ š‘ƒš‘– š‘ š‘–=1 The weighting of each position is š‘¤š‘– = š‘ƒš‘– š‘ƒ The weighted mean of the portfolio is Ī¼ š‘ƒ = āˆ‘ š‘¤š‘– šœ‡š‘– = š‘ š‘–=1 1 š‘ƒ āˆ‘ š‘ƒš‘– šœ‡š‘– š‘ š‘–=1 One-day Expected Return of the portfolio is the weighted mean of the portfolio multiplied by the total position ER = PĪ¼ š‘ƒ = P āˆ‘ š‘¤š‘– šœ‡š‘– = š‘ š‘–=1 āˆ‘ š‘ƒš‘– šœ‡š‘– š‘ š‘–=1 The Portfolio Variance is šœŽ š‘ƒ 2 = [š‘¤1 ā€¦ š‘¤š‘– ā€¦ š‘¤ š‘›] [ šœŽ1 2 ā‹Æ šœŽ1š‘› ā‹® ā‹± ā‹® šœŽ š‘›1 ā‹Æ šœŽ š‘› 2 ] [ š‘¤1 ā‹® š‘¤š‘– ā‹® š‘¤ š‘›] = š‘Š š‘‡ š‘€š‘Š where W is the vector of weights and M is the covariance matrix. The item i-th in the diagonal of M is the daily variance of the return rates for the i-th asset. The items outside the diagonal are covariances. Portfolio Variance also can be computed as: šœŽ š‘ƒ 2 = 1 š‘ƒ2 āˆ— [š‘ƒ1 ā€¦ š‘ƒš‘– ā€¦ š‘ƒš‘›] [ šœŽ1 2 ā‹Æ šœŽ1š‘› ā‹® ā‹± ā‹® šœŽ š‘›1 ā‹Æ šœŽ š‘› 2 ] [ š‘ƒ1 ā‹® š‘ƒš‘– ā‹® š‘ƒš‘›] = 1 š‘ƒ2 āˆ— š‘‹ š‘‡ š‘€š‘‹ where X is the vector of positions. The Portfolio Standard Deviation or Portfolio Volatility is the square root of the Portfolio Variance: šœŽ š‘ƒ = āˆššœŽ š‘ƒ 2 One-day Value at Risk is: š‘‰š‘Žš‘…1āˆ’š›¼ = āˆ’(Ī¼ š‘ƒ + š‘§ š›¼ šœŽ š‘ƒ)P
  • 28. METHODS AND FORMULAS HELP V2.1 InnerSoft STATS 28 Where zĪ± is the left-tail Ī± quantile of the normal standard distribution. Total Value at Risk for n trading days is: š‘‰š‘Žš‘…1āˆ’š›¼ š‘› š‘‘š‘Žš‘¦š‘  = š‘‰š‘Žš‘…1āˆ’š›¼ āˆ— āˆš š‘› = āˆ’(Ī¼ š‘ƒ + š‘§ š›¼ šœŽ š‘ƒ)Pāˆš š‘› š‘‰š‘Žš‘…1āˆ’š›¼ š‘› š‘‘š‘Žš‘¦š‘  is the minimum potential loss that a portfolio can suffer in the Ī±% worst cases in n days. About the Signs: A positive value of VaR is an expected loss. A negative VaR would imply the portfolio has a high probability of making a profit. Source: http://www.jpmorgan.com/tss/General/Risk_Management/1159360877242 Remark: Some texts about VaR express the covariance as Ļƒij = ĻƒiĻƒjĻij where Ļij is the correlation coefficient. Remark: Sometimes VaR is assumed to be the Portfolio Volatility multiplied by the position as expected return is supposed to be approximately zero. ISSTATS does NOT consider VaR as Portfolio Volatility and do NOT suppose expected return is zero. Marginal Value at Risk Marginal Value at Risk is the change in portfolio VaR resulting from a marginal change in the currency (dollar, euroā€¦) position in component i: š‘€š‘‰š‘Žš‘…š‘– = šœ•š‘‰š‘Žš‘… šœ•š‘ƒš‘– Assuming the linearity of the risk in the parametric approach, the vector of Marginal Value at Risk is [ š‘€š‘‰š‘Žš‘…1 ā‹® š‘€š‘‰š‘Žš‘…š‘– ā‹® š‘€š‘‰š‘Žš‘… š‘›] = āˆ’ ([ šœ‡1 ā‹® šœ‡š‘– ā‹® šœ‡ š‘›] + š‘§ š›¼ šœŽ š‘ƒ āˆ— [ šœŽ1 2 ā‹Æ šœŽ1š‘› ā‹® ā‹± ā‹® šœŽ š‘›1 ā‹Æ šœŽ š‘› 2 ] [ š‘¤1 ā‹® š‘¤š‘– ā‹® š‘¤ š‘›]) [ š‘€š‘‰š‘Žš‘…1 ā‹® š‘€š‘‰š‘Žš‘…š‘– ā‹® š‘€š‘‰š‘Žš‘… š‘›] = āˆ’ ([ šœ‡1 ā‹® šœ‡š‘– ā‹® šœ‡ š‘›] + š‘§ š›¼ š‘ƒ āˆ— šœŽ š‘ƒ āˆ— [ šœŽ1 2 ā‹Æ šœŽ1š‘› ā‹® ā‹± ā‹® šœŽ š‘›1 ā‹Æ šœŽ š‘› 2 ] [ š‘ƒ1 ā‹® š‘ƒš‘– ā‹® š‘ƒš‘›]) Total Marginal Value at Risk for n trading days is: š‘€š‘‰š‘Žš‘…š‘– š‘› š‘‘š‘Žš‘¦š‘  = š‘€š‘‰š‘Žš‘…š‘– āˆ— āˆš š‘› Component Value at Risk Component Value at Risk is a partition of the portfolio VaR that indicates the change of VaR if a given component was deleted.
Component Value at Risk

Component Value at Risk is a partition of the portfolio VaR that indicates how the VaR would change if a given component were deleted:

CVaR_i = \frac{\partial VaR}{\partial P_i}\, P_i = MVaR_i \cdot P_i

Note that the sum of all component VaRs is the VaR of the entire portfolio:

VaR = \sum_{i=1}^{N} CVaR_i = \sum_{i=1}^{N} \frac{\partial VaR}{\partial P_i}\, P_i = \sum_{i=1}^{N} MVaR_i \cdot P_i

The total Component Value at Risk for n trading days is:

CVaR_i^{n\ days} = CVaR_i \cdot \sqrt{n}

Source: http://www.math.nus.edu.sg/~urops/Projects/valueatrisk.pdf

Incremental Value at Risk

The incremental VaR of a given position is the VaR of the portfolio with that position minus the VaR of the portfolio without it, which measures the change in VaR due to a new position in the portfolio:

IVaR(a) = VaR(P) - VaR(P - a)

Source: http://www.jpmorgan.com/tss/General/Portfolio_Management_With_Incremental_VaR/1259104336084

Conditional Value at Risk, Expected Shortfall, Expected Tail Loss or Average Value at Risk

ES_{1-\alpha}^{1\ day} is the expected value of the loss of the portfolio in the α% worst cases in one day. Under the multivariate normal assumption, the Expected Shortfall, also known as Expected Tail Loss (ETL), Conditional Value-at-Risk (CVaR), Average Value at Risk (AVaR) and Worst Conditional Expectation, is computed as

ES(-VaR) = -E(x \mid x < -VaR) \cdot P = -[\mu + E(z \mid z < z_\alpha)\,\sigma] \cdot P = -\left[\mu + \frac{1}{\alpha}\int_{-\infty}^{z_\alpha} t\,\frac{e^{-t^2/2}}{\sqrt{2\pi}}\,dt\ \sigma\right] \cdot P = -\left(\mu - \frac{e^{-z_\alpha^2/2}}{\alpha\sqrt{2\pi}}\,\sigma\right) \cdot P

where z_α is the left-tail α quantile of the standard normal distribution.

About the sign: because ISSTATS reports VaR with a leading negative sign, as J.P. Morgan recommends, the calculation uses the unsigned quantity (-VaR = μ + z_ασ). Once the ES is computed, it is reported with its sign reversed; that is, a positive value of ES is an expected loss, while a negative value of ES would imply that the portfolio has a high probability of making a profit even in the worst cases.

Source: http://www.imes.boj.or.jp/english/publication/mes/2002/me20-1-3.pdf
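A minimal sketch of the component VaR partition and the normal expected shortfall (illustrative Python, not ISSTATS code; function names are assumptions):

```python
import numpy as np
from scipy.stats import norm

def component_var(positions, mvar, n_days=1):
    """CVaR_i = MVaR_i * P_i; the components sum to the portfolio VaR."""
    return np.asarray(mvar, dtype=float) * np.asarray(positions, dtype=float) * np.sqrt(n_days)

def expected_shortfall(mu_p, sigma_p, total_position, alpha=0.05):
    """Normal ES: -(mu - phi(z_alpha)/alpha * sigma) * P, with phi the standard normal density."""
    z_alpha = norm.ppf(alpha)
    return -(mu_p - norm.pdf(z_alpha) / alpha * sigma_p) * total_position
```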
Exponentially Weighted Moving Average (EWMA) Forecast

Given a series of k daily return rates {r_1, ..., r_k} computed as continuously compounded returns:

r_i = \ln\left(\frac{s_i}{s_{i-1}}\right)

where r_1 corresponds to the earliest date in the series and r_k to the latest or most recent date. Assuming k > 50 and that the sample mean of the daily returns is zero, the EWMA estimate of the one-day variance for a sequence of k returns is:

\sigma^2 = (1 - \lambda) \sum_{i=0}^{k-1} \lambda^i\, r_{k-i}^2

where 0 < λ < 1 is the decay factor. The one-day volatility is:

\sigma = \sqrt{\sigma^2}

For horizons greater than one day, the T-period (i.e., over T days) forecast of the volatility is:

\sigma^{T\ days} = \sigma\sqrt{T}

For two return series, assuming both means are zero, the EWMA estimate of the one-day covariance for a sequence of k returns is

cov_{1,2} = \sigma_{1,2} = (1 - \lambda) \sum_{i=0}^{k-1} \lambda^i\, r_{1,k-i}\, r_{2,k-i}

The corresponding one-day correlation forecast for the two return series is

\rho_{1,2} = \frac{cov_{1,2}}{\sigma_1 \sigma_2} = \frac{\sigma_{1,2}}{\sigma_1 \sigma_2}

For horizons greater than one day, the T-period (i.e., over T days) forecast of the covariance is:

cov_{1,2}^{T\ days} = \sigma_{1,2}\, T

Source: http://pascal.iseg.utl.pt/~aafonso/eif/rm/TD4ePt_2.pdf

Value at Risk of a single asset, Portfolio Value at Risk, Marginal Value at Risk, Component Value at Risk and Incremental Value at Risk by the EWMA method: see the methods and formulas under Parametric Value at Risk.
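A minimal sketch of the EWMA one-day variance and covariance forecasts (illustrative Python, not ISSTATS code; the default decay factor of 0.94 is an assumption, taken from the usual RiskMetrics convention for daily data):

```python
import numpy as np

def ewma_variance(returns, lam=0.94):
    """One-day EWMA variance, assuming a zero mean; returns[0] is the oldest observation."""
    r = np.asarray(returns, dtype=float)
    weights = (1.0 - lam) * lam ** np.arange(len(r))   # weight lam**i for the return r_{k-i}
    return float(np.sum(weights * r[::-1] ** 2))

def ewma_covariance(returns_1, returns_2, lam=0.94):
    """One-day EWMA covariance of two return series, assuming zero means."""
    r1 = np.asarray(returns_1, dtype=float)[::-1]      # most recent observation first
    r2 = np.asarray(returns_2, dtype=float)[::-1]
    weights = (1.0 - lam) * lam ** np.arange(len(r1))
    return float(np.sum(weights * r1 * r2))
```

The T-day forecasts follow by multiplying the volatility by √T and the covariance by T, as above.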
Linear Regression

Given n equations for a regression model with p predictor variables, the i-th equation is

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i

The n equations stacked together and written in vector form are

\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_i \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_i \\ \vdots \\ \epsilon_n \end{bmatrix}

In matrix notation:

Y = X\beta + \epsilon

X is here named the design matrix, of dimensions n-by-(p+1). If the constant is not included, the matrices are

\begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_i \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_i \\ \vdots \\ \epsilon_n \end{bmatrix}

and the design matrix X then has dimensions n-by-p.

The estimate of the unknown parameter vector β is

\hat{\beta} = (X^T X)^{-1} X^T Y

Estimation can be carried out if, and only if, there is no perfect multicollinearity between the predictor variables.

If the constant is not included, the parameters can also be estimated by

\hat{\beta}_j = \frac{\sum_{i=1}^{n} x_{ij}\, y_i}{\sum_{i=1}^{n} x_{ij}^2}

(this holds for a single predictor, or when the predictor columns are orthogonal).
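A minimal sketch of the coefficient estimate (illustrative Python with NumPy, not ISSTATS code; the function name and the include_constant flag are assumptions):

```python
import numpy as np

def ols_coefficients(X, y, include_constant=True):
    """Least-squares estimate beta_hat = (X^T X)^{-1} X^T y for the design matrix X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if include_constant:
        X = np.column_stack([np.ones(len(X)), X])   # prepend the column of ones
    # np.linalg.lstsq solves the same normal equations in a numerically stable way
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat
```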
The standardized coefficients are

\hat{\beta}_i^{st} = \hat{\beta}_i \cdot \frac{S_{x_i}}{S_y}

where
• S_{x_i} is the unbiased standard deviation of the i-th predictor variable
• S_y is the unbiased standard deviation of the response variable y

The estimate of the standard error of each coefficient is obtained by

se(\hat{\beta}_i) = \sqrt{MSE \cdot (X^T X)^{-1}_{ii}}

where MSE is the mean squared error of the regression model. Under the null hypothesis that the coefficient is zero,

\frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \sim t_{n-p-1}

where
• p is the number of predictor variables
• n is the total number of observations (number of rows of the design matrix)

If the constant is not included, the degrees of freedom for the t statistics are n − p.

ANOVA for linear regression

If the constant is included:

Component | Sum of squares | Degrees of freedom | Mean of squares       | F
Model     | SSM            | p                  | MSM = SSM/p           | MSM/MSE
Error     | SSE            | n − p − 1          | MSE = SSE/(n − p − 1) |
Total     | SST            | n − 1              | MST = SST/(n − 1)     |

where

SSM = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

and
• p is the number of predictor variables
• n is the total number of observations (number of rows of the design matrix)
• SSE is the sum of squared residuals
• MSE is the mean squared error of the regression model

The test statistic has an F-distribution with p and (n − p − 1) degrees of freedom. Thus the ANOVA null hypothesis is rejected if

F \geq F_{1-\alpha;\ p,\ n-p-1}
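A minimal sketch of the coefficient standard errors, t statistics and ANOVA F test for a model that includes the constant (illustrative Python, not ISSTATS code; function and variable names are assumptions):

```python
import numpy as np
from scipy import stats

def ols_inference(X_pred, y):
    """Standard errors, t statistics and F statistic; X_pred holds the p predictor columns."""
    X = np.column_stack([np.ones(len(X_pred)), np.asarray(X_pred, dtype=float)])  # design matrix
    y = np.asarray(y, dtype=float)
    n, k = X.shape                          # k = p + 1
    p = k - 1
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    y_hat = X @ beta
    sse = np.sum((y - y_hat) ** 2)          # sum of squared residuals
    ssm = np.sum((y_hat - y.mean()) ** 2)   # model sum of squares
    mse = sse / (n - p - 1)                 # mean squared error
    se = np.sqrt(mse * np.diag(xtx_inv))    # se(beta_i)
    t_stats = beta / se
    t_pvalues = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)
    F = (ssm / p) / mse
    F_pvalue = stats.f.sf(F, p, n - p - 1)
    return beta, se, t_stats, t_pvalues, F, F_pvalue
```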
The coefficient of determination R² is defined as SSM/SST. It is output as a percentage.

The adjusted R² is defined as 1 − MSE/MST. It is output as a percentage.

The square root of MSE is called the standard error of the regression, or standard error of the estimate.

If the constant is not included:

Component | Sum of squares | Degrees of freedom | Mean of squares   | F
Model     | SSM            | p                  | MSM = SSM/p       | MSM/MSE
Error     | SSE            | n − p              | MSE = SSE/(n − p) |
Total     | SST            | n                  | SST/n             |

where

SSM = \sum_{i=1}^{n} \hat{y}_i^2
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
SST = \sum_{i=1}^{n} y_i^2

Unstandardized Predicted Values

The fitted values (or unstandardized predicted values) from the regression are

\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY

where H is the projection matrix (also known as the hat matrix):

H = X(X^T X)^{-1} X^T

Standardized Predicted Values

After computing the mean and unbiased standard deviation of the unstandardized predicted values, the fitted values are standardized as

\hat{y}_i^{st} = \frac{\hat{y}_i - \bar{\hat{y}}}{S_{\hat{y}}}

When predictions are made for new cases outside the design matrix, they are standardized with the same mean and standard deviation.
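A minimal sketch of the fitted values via the hat matrix and their standardized form (illustrative Python, not ISSTATS code; X is assumed to already be the design matrix, including the column of ones when the constant is in the model):

```python
import numpy as np

def fitted_values(X, y):
    """Unstandardized and standardized predicted values from H = X (X^T X)^{-1} X^T."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    H = X @ np.linalg.inv(X.T @ X) @ X.T                     # projection (hat) matrix
    y_hat = H @ y                                            # unstandardized predicted values
    y_hat_std = (y_hat - y_hat.mean()) / y_hat.std(ddof=1)   # standardized with the unbiased SD
    return y_hat, y_hat_std
```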
Prediction Intervals for Mean

Define the vector of given predictor values as

X_h = (1, x_{h,1}, x_{h,2}, \dots, x_{h,p})^T

The standard error of the fit at X_h is given by:

se(\hat{y}_h) = \sqrt{MSE \cdot X_h^T (X^T X)^{-1} X_h}

Then the confidence interval for the mean response is

\hat{y}_h \pm t_{\alpha/2;\ n-p-1} \cdot se(\hat{y}_h)

where
• X is the design matrix
• \hat{y}_h is the fitted or predicted value of the response when the predictor values are X_h
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables

Prediction Intervals for Individuals

Define the vector of given predictor values as

X_h = (1, x_{h,1}, x_{h,2}, \dots, x_{h,p})^T

The standard error of the prediction at X_h is given by:

se(\hat{y}_h) = \sqrt{MSE \cdot [1 + X_h^T (X^T X)^{-1} X_h]}

Then the prediction interval for individuals or new observations is

\hat{y}_h \pm t_{\alpha/2;\ n-p-1} \cdot se(\hat{y}_h)

where
• X is the design matrix
• \hat{y}_h is the fitted or predicted value of the response when the predictor values are X_h
• MSE is the mean squared error of the regression model
• n is the total number of observations
• p is the number of predictor variables
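A minimal sketch computing both intervals at a new predictor vector x_h (illustrative Python, not ISSTATS code; the inputs beta and mse are assumed to come from a previous fit, and x_h must include the leading 1 when the constant is in the model):

```python
import numpy as np
from scipy import stats

def intervals_at(x_h, X, beta, mse, alpha=0.05):
    """Confidence interval for the mean response and prediction interval for a new observation."""
    X = np.asarray(X, dtype=float)          # design matrix used in the fit
    x_h = np.asarray(x_h, dtype=float)
    n, k = X.shape                          # k = p + 1 when the constant is included
    y_hat = float(x_h @ beta)
    quad = x_h @ np.linalg.inv(X.T @ X) @ x_h
    se_mean = np.sqrt(mse * quad)           # standard error of the fit
    se_new = np.sqrt(mse * (1.0 + quad))    # standard error of the prediction
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
    mean_ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pred_ci = (y_hat - t_crit * se_new, y_hat + t_crit * se_new)
    return mean_ci, pred_ci
```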
Unstandardized Residuals

The unstandardized residual for the i-th data unit is defined as:

\hat{e}_i = y_i - \hat{y}_i

In matrix notation:

\hat{E} = Y - \hat{Y} = Y - HY = (I_{n \times n} - H)Y

where H is the hat matrix.

Standardized Residuals

The standardized residual for the i-th data unit is defined as:

\hat{e}_i^{s} = \frac{\hat{e}_i}{\sqrt{MSE}}

where
• \hat{e}_i is the unstandardized residual for the i-th data unit
• MSE is the mean squared error of the regression model

Studentized Residuals (internally studentized residuals)

The leverage score for the i-th data unit is defined as h_{ii} = [H]_{ii}, the i-th diagonal element of the projection matrix (also known as the hat matrix)

H = X(X^T X)^{-1} X^T

where X is the design matrix. The studentized residual for the i-th data unit is defined as:

t_i = \frac{\hat{e}_i}{\sqrt{MSE \cdot (1 - h_{ii})}}

where
• \hat{e}_i is the unstandardized residual for the i-th data unit
• MSE is the mean squared error of the regression model

Source: https://en.wikipedia.org/wiki/Studentized_residual
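A minimal sketch of the three residual types (illustrative Python, not ISSTATS code; X is assumed to be the design matrix and beta, mse the estimates from the fit):

```python
import numpy as np

def residual_diagnostics(X, y, beta, mse):
    """Unstandardized, standardized and internally studentized residuals."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    e = y - X @ beta                                   # unstandardized residuals
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages h_ii
    e_standardized = e / np.sqrt(mse)
    e_studentized = e / np.sqrt(mse * (1.0 - h))
    return e, e_standardized, e_studentized
```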
Centered Leverage Values

The regular leverage score for the i-th data unit is defined as h_{ii} = [H]_{ii}, the i-th diagonal element of the projection matrix (also known as the hat matrix)

H = X(X^T X)^{-1} X^T

where X is the design matrix.

The centered leverage value for the i-th data unit is defined as:

clv_i = h_{ii} - \frac{1}{n}

where n is the number of observations. If the intercept is not included, then the centered leverage value for the i-th data unit is defined as:

clv_i = h_{ii}

Source: https://en.wikipedia.org/wiki/Leverage_(statistics)

Mahalanobis Distance

The Mahalanobis distance for the i-th data unit is defined as:

D_i^2 = (n - 1)\left(h_{ii} - \frac{1}{n}\right) = (n - 1) \cdot clv_i

where
• h_{ii} is the i-th diagonal element of the projection matrix
• n is the number of observations

If the intercept is not included, the Mahalanobis distance for the i-th data unit is defined as:

D_i^2 = n \cdot h_{ii}

Source: https://en.wikipedia.org/wiki/Mahalanobis_distance
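A minimal sketch of the centered leverage values and the Mahalanobis distances for a model with the intercept (illustrative Python, not ISSTATS code):

```python
import numpy as np

def leverage_and_mahalanobis(X):
    """Regular leverages, centered leverage values and Mahalanobis distances (intercept included)."""
    X = np.asarray(X, dtype=float)                 # design matrix including the column of ones
    n = len(X)
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # h_ii
    clv = h - 1.0 / n                              # centered leverage values
    mahalanobis = (n - 1) * clv                    # D_i^2
    return h, clv, mahalanobis
```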
Cook's Distance

Cook's distance for the i-th data unit is defined as:

D_i = \frac{\hat{e}_i^2\, h_{ii}}{MSE \cdot (p + 1) \cdot (1 - h_{ii})^2}

where
• h_{ii} is the i-th diagonal element of the projection matrix
• p is the number of predictor variables
• \hat{e}_i is the unstandardized residual for the i-th data unit
• MSE is the mean squared error of the regression model

If the intercept is not included, Cook's distance for the i-th data unit is defined as:

D_i = \frac{\hat{e}_i^2\, h_{ii}}{MSE \cdot p \cdot (1 - h_{ii})^2}

Source: https://en.wikipedia.org/wiki/Cook%27s_distance
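A minimal sketch of Cook's distance given the residuals, leverages and MSE from the fit (illustrative Python, not ISSTATS code):

```python
import numpy as np

def cooks_distance(residuals, leverages, mse, p, intercept=True):
    """Cook's distance D_i; p is the number of predictor variables."""
    e = np.asarray(residuals, dtype=float)
    h = np.asarray(leverages, dtype=float)
    n_coef = p + 1 if intercept else p             # number of estimated coefficients
    return (e ** 2 * h) / (mse * n_coef * (1.0 - h) ** 2)
```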
Curve Estimation Models

• Linear. Model whose equation is Y = b0 + (b1 * t). The series values are modeled as a linear function of time.
• Quadratic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2). The quadratic model can be used to model a series that "takes off" or a series that dampens.
• Cubic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3).
• Quartic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 * t**4).
• Quintic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 * t**4) + (b5 * t**5).
• Sextic. Model whose equation is Y = b0 + (b1 * t) + (b2 * t**2) + (b3 * t**3) + (b4 * t**4) + (b5 * t**5) + (b6 * t**6).
• Logarithmic. Model whose equation is Y = b0 + (b1 * ln(t)).
• Inverse. Model whose equation is Y = b0 + (b1 / t).
• Power. Model whose equation is Y = b0 * (t**b1) or ln(Y) = ln(b0) + (b1 * ln(t)).
• Compound. Model whose equation is Y = b0 * (b1**t) or ln(Y) = ln(b0) + (ln(b1) * t).
• S-curve. Model whose equation is Y = e**(b0 + (b1/t)) or ln(Y) = b0 + (b1/t).
• Logistic. Model whose equation is Y = 1 / (1/u + (b0 * (b1**t))) or ln(1/Y - 1/u) = ln(b0) + (ln(b1) * t), where u is the upper boundary value. After selecting Logistic, specify the upper boundary value to use in the regression equation. The value must be a positive number greater than the largest dependent variable value.
• Growth. Model whose equation is Y = e**(b0 + (b1 * t)) or ln(Y) = b0 + (b1 * t).
• Exponential. Model whose equation is Y = b0 * (e**(b1 * t)) or ln(Y) = ln(b0) + (b1 * t).
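As an illustration of how the log-linearizable models above reduce to ordinary least squares, a minimal sketch for the Exponential model (illustrative Python, not ISSTATS code; assumes all Y values are strictly positive):

```python
import numpy as np

def fit_exponential(t, y):
    """Fit Y = b0 * exp(b1 * t) by regressing ln(Y) on t: ln(Y) = ln(b0) + b1 * t."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)                 # must be strictly positive
    A = np.column_stack([np.ones(len(t)), t])
    coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    b0, b1 = np.exp(coef[0]), coef[1]
    return b0, b1
```

The Power, Compound, S-curve, Growth and Logistic models can be fitted in the same way using their transformed equations listed above (for the Logistic model the upper boundary u must be supplied first).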
© Copyright InnerSoft 2017. All rights reserved.
The lost children of the Sinclair ZX Spectrum 128K (RANDOMIZE USR 123456)
innersoft@itspanish.org
innersoft@gmail.com
http://isstats.itspanish.org/