Principles of Statistical Inference
School of Mathematical Sciences, University of Adelaide
Two-sample t-test homework question
In a study of the effects of diabetes on problems associated with
the wearing of contact lenses, 16 diabetics and a control group of
16 non-diabetics wore contact lenses for a prescribed length of
time. The swelling of their eyes was measured (as a percentage)
immediately after removal of the lenses. The following data were
“We use the two-independent samples t-test. . .”
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}, \qquad s_p = 1.712 \]
t = −2.67, df = 30, p-value = 0.012
Conclude swelling significantly lower for diabetic subjects.
What are we actually concluding?
Underlying the analysis is a model for the data.
• The 16 diabetic subjects are sampled from a larger population
of diabetics.
• The 16 controls are sampled from a population of
non-diabetics.
• Let µ1 be the mean swelling percentage for the diabetic
population.
• Let µ2 be the mean swelling percentage for the control
population.
• When we conclude that “swelling was significantly lower for
the diabetic subjects” we are saying that the data give us
reason to believe that µ1 < µ2.
More about the model
• The data in each group are a random sample from its parent
population.
• The two samples are independent.
Specific to the t-test we used:
The two populations are normal and have the same standard
deviation, σ1 = σ2.
More about the conclusion
• The sample means for our data are x1 = 8.33% for the
diabetic patients and x2 = 9.95% for the controls.
• Eventually, we conclude that we have reason to believe that
the population means satisfy µ1 < µ2 .
• This is not just because the sample means satisfy x1 < x2.
• An alternative explanation is that there is no difference
between diabetics and controls at the population level and the
observed difference occurred by chance.
• The variability in the data suggests this could happen.
• The purpose of the t-test is to decide whether such an
explanation is really plausible.
• The population means, µ1 and µ2 , are unknown parameters.
• The purpose of the experiment is to make conclusions about
µ1 − µ2 based on the data.
• The statement H0 : µ1 = µ2 is the Null Hypothesis.
• This is the “chance explanation” for the data.
• The statement H1 : µ1 ≠ µ2 is the Alternative Hypothesis.
• This is what we will conclude if we can eliminate the null
hypothesis, H0 .
• The hypothesis test is a procedure that uses data to reach one
of the conclusions: Accept H0 or Reject H0 .
Anatomy of the t-statistic
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}} \]
• Think of x1 as an estimate of µ1 and x2 as an estimate of µ2 .
• Therefore x1 − x2 is an estimate of µ1 − µ2 .
• The quantity \( s_p \sqrt{1/n_1 + 1/n_2} \) is called the standard error.
• It measures the accuracy of the estimate.
• As a rule of thumb, estimates are accurate to ±2 standard
errors.
• If H0 is true (i.e. µ1 − µ2 = 0) then x1 − x2 will be estimating
zero and should be small relative to the standard error.
• If H0 is not true then x1 − x2 will be estimating a non-zero
quantity and may be large relative to the standard error.
• When we reject H0 , it is because x1 − x2 is too large relative to
the standard error for us to believe it is just estimating zero.
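The pieces of the t-statistic can be checked numerically. The following is a minimal sketch that uses only the rounded summary values quoted in these slides (sample means 8.33 and 9.95, pooled standard deviation 1.712, 16 subjects per group); the variable names are illustrative and the raw data are not used.

```python
import math

# Rounded summary statistics quoted in the slides.
mean_diabetic = 8.33   # sample mean swelling (%), diabetic group
mean_control = 9.95    # sample mean swelling (%), control group
pooled_sd = 1.712      # pooled standard deviation s_p
n_diabetic = 16
n_control = 16

# Standard error of the difference between the two sample means.
standard_error = pooled_sd * math.sqrt(1 / n_diabetic + 1 / n_control)

# t-statistic: the estimate of mu1 - mu2 divided by its standard error.
t_statistic = (mean_diabetic - mean_control) / standard_error
degrees_of_freedom = n_diabetic + n_control - 2

print(f"standard error = {standard_error:.3f}")
print(f"t = {t_statistic:.2f}, df = {degrees_of_freedom}")
```

Because the inputs are rounded, the result agrees with the reported t = −2.67 only up to rounding in the last digit.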
• The evidence for our decision is summarised by the P-value.
The P-value is the probability that we would obtain
a t-statistic as different from zero as that actually
observed if the null hypothesis were true.
For the diabetes data, the t-statistic is −2.67 and the P-value is
0.012.
• In the diabetes example the P-value is 1.2%.
• If there really were no difference between the diabetes and
control populations, then the chance that we would have
obtained data that led to a t-statistic ≤ −2.67 or ≥ 2.67 is
1.2%.
• This is a low probability.
• Therefore the data are not what we would have expected if H0
were true.
• Consequently we reject H0 and conclude the two populations
are not the same.
• In general, it is conventional to reject H0 for P-values ≤ 0.05
and accept H0 otherwise.
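The definition of the P-value can be illustrated by simulation. The sketch below assumes the model stated earlier (normal populations with equal standard deviations), repeatedly draws two samples of 16 under H0, and counts how often the pooled t-statistic is at least as far from zero as 2.67. The function names, the number of simulations and the seed are illustrative choices, not part of the original analysis.

```python
import math
import random


def pooled_t(sample1, sample2):
    """Two-sample t-statistic using a pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss2 = sum((x - m2) ** 2 for x in sample2)
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))


def simulated_p_value(t_observed, n1, n2, n_sim=20000, seed=1):
    """Estimate the two-sided P-value by simulating data under H0."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sim):
        # Under H0 both groups share the same mean; the t-statistic does
        # not depend on which common mean and standard deviation are used.
        s1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        s2 = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        if abs(pooled_t(s1, s2)) >= abs(t_observed):
            extreme += 1
    return extreme / n_sim


p_value = simulated_p_value(-2.67, 16, 16)
print(f"simulated two-sided P-value = {p_value:.3f}")
```

The simulated proportion should land close to the reported 0.012, with some Monte Carlo error.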
• Hypothesis tests are used to decide whether to accept or
reject the null hypothesis on the basis of observed data.
• The null hypothesis is formulated in terms of the (unknown)
parameters.
• The test statistic calculates the discrepancy between what was
observed in the data and what we would expect if the null
hypothesis were true.
• The evidence in the test statistic can be expressed as a
P-value.
• H0 is rejected for small P-values and accepted otherwise.
• In conventional applications, the threshold is P-value ≤ 0.05.
Hypothesis tests can be cast as a decision problem.
              H0 true         H0 false
Accept H0     Correct         Type II Error
Reject H0     Type I Error    Correct
• A Type I error occurs when H0 is actually true but the test
leads us to reject.
• This corresponds to a false positive finding.
• The Type I error rate is the probability of falsely rejecting
when H0 is true.
• A Type II error occurs when H0 is not true but the test leads
us to accept.
• This corresponds to a false negative.
Controlling the error rates.
• It can be seen that the Type I error rate is the P-value
threshold for rejection.
• If we reject for P-value ≤ 0.05 the Type I error rate is 5%.
• That is, for every 100 true null hypotheses we test, we would
expect to falsely reject 5 by chance.
• Therefore we can adjust the Type I error rate by changing the
threshold of rejection.
• For example, if we reject for P-value ≤ 0.01 the Type I error
rate becomes 1%.
• But this will increase the chance of Type II errors.
• Type II errors are not controlled by a separate parameter.
Once the Type I error rate is fixed, the Type II errors can only
be reduced by selecting a suitable sample size and appropriate
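The trade-off between the two error rates can be seen in a small simulation. This is a sketch under assumptions not taken from the slides: normal data with standard deviation 1, 16 subjects per group, and a hypothetical true difference of 1 when H0 is false. The critical values 2.042 and 2.750 are the standard two-sided 5% and 1% points of the t-distribution with 30 degrees of freedom.

```python
import math
import random

# Two-sided critical values of the t-distribution with 30 degrees of freedom.
T_CRIT = {0.05: 2.042, 0.01: 2.750}


def pooled_t(sample1, sample2):
    """Two-sample t-statistic using a pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss2 = sum((x - m2) ** 2 for x in sample2)
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))


def rejection_rates(true_diff, n=16, n_sim=5000, seed=2):
    """Proportion of simulated experiments rejecting H0 at each threshold."""
    rng = random.Random(seed)
    counts = {alpha: 0 for alpha in T_CRIT}
    for _ in range(n_sim):
        s1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s2 = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        t_abs = abs(pooled_t(s1, s2))
        for alpha, crit in T_CRIT.items():
            if t_abs >= crit:
                counts[alpha] += 1
    return {alpha: count / n_sim for alpha, count in counts.items()}


# H0 true: every rejection is a Type I error.
type1 = rejection_rates(true_diff=0.0)
# H0 false (hypothetical difference of 1): every non-rejection is a Type II error.
power = rejection_rates(true_diff=1.0)
type2 = {alpha: 1 - rate for alpha, rate in power.items()}

for alpha in T_CRIT:
    print(f"threshold {alpha}: Type I rate {type1[alpha]:.3f}, "
          f"Type II rate {type2[alpha]:.3f}")
```

Under these assumptions the Type I rate tracks the chosen threshold, while tightening the threshold from 0.05 to 0.01 raises the Type II rate.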
• An important application of hypothesis testing is differential
expression (DE).
• To test differential expression of a single gene under two
different conditions is like a two-sample t-test.
• For microarray data, the moderated t-statistic produced by
LIMMA is literally a variation on the t-statistic described here.
• For RNA-Seq data, a different type of test statistic is used but
the notions of P-value and error rate still apply.
• In all cases, a major factor is the large number of tests being
conducted in parallel.
• For example, the Affymetrix HG-U133 set comprises 45,000
probe sets derived from 33,000 established human genes.
• If we screen for DE between two groups, this means
performing a test for each gene.
• Suppose now we conduct a sequence of tests to screen for DE
in 20,000 genes.
• If we were to just use the standard 5% level of significance, we
would be swamped by false positives.
[Table: true non-DE genes, true DE genes, expected false positives]
• Conventional adjustments for multiple testing were introduced
in the context of a relatively small number of tests.
• For example, 5 or 10 or 20 tests.
• Applying methods such as the Bonferroni adjustment to large
scale multiple testing problems leads to inefficient procedures.
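The scale of the problem can be made concrete with a little arithmetic. The sketch below assumes, as a hypothetical worst case, that none of the 20,000 genes is DE, and it uses the usual form of the Bonferroni adjustment (dividing the significance level by the number of tests). The variable names are illustrative.

```python
n_tests = 20000   # number of genes screened
alpha = 0.05      # conventional significance level

# Hypothetical worst case: every null hypothesis is true, so each test
# has a 5% chance of producing a false positive.
expected_false_positives = alpha * n_tests

# Bonferroni adjustment: per-test threshold that keeps the chance of
# any false positive across all tests at or below alpha.
bonferroni_threshold = alpha / n_tests

print(f"expected false positives at 5%: {expected_false_positives:.0f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.1e}")
```

A per-test threshold this small makes it hard to detect any but the largest effects, which is the inefficiency referred to above.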
False Discovery Rates
• In large scale multiple testing problems, it is more useful to
consider quantities such as the false discovery rate (FDR).
• Roughly speaking, the false discovery rate is
\[ \mathrm{FDR} = E\left[ \frac{\text{False Positives}}{\text{False Positives} + \text{True Positives}} \right] \]
• As a hypothetical illustration, suppose out of 20,000 genes
500 are actually DE and the remaining 19,500 are non-DE.
• Suppose we reject H0 for all 500 of the true DE genes.
• Suppose we also make 975 false rejections from the non-DE
genes.
• In this case the rate of false discoveries is
975/(975 + 500) ≈ 66%.
• Although it is not possible to discern a false positive from a
true positive in any single test, there are methods for
estimating and controlling the rate of false positives.
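The hypothetical illustration can be reproduced directly, and one widely used method for controlling the FDR can be sketched alongside it. The Benjamini-Hochberg procedure shown here is not named in the slides and is included only as an example of such a method; the function name and the ten P-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    # Indices of the P-values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            n_reject = rank
    # Reject the hypotheses with the n_reject smallest P-values.
    rejected = [False] * m
    for idx in order[:n_reject]:
        rejected[idx] = True
    return rejected


# The hypothetical illustration from the slides: 500 true positives
# and 975 false positives among the rejected genes.
false_positives = 975
true_positives = 500
false_discovery_proportion = false_positives / (false_positives + true_positives)
print(f"proportion of false discoveries = {false_discovery_proportion:.3f}")

# Hypothetical P-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example_p, q=0.05)
print(f"rejected {sum(decisions)} of {len(example_p)} hypotheses")
```

With these ten P-values only the two smallest pass the step-up thresholds, whereas an unadjusted 5% cut-off would reject five.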
• Statistical inference is concerned with making conclusions
about a model assumed to have generated the observed data.
• The framework we considered was that of sample and
population.
• In reality such a simple framework is not usually realistic.
• For example, with the diabetes data we don’t really have a
random sample from the population and observations made on
subjects are not guaranteed to be reproducible.
• If the framework is well understood and appropriately
modelled, the statistical conclusions can be taken at face
value.
• In bioinformatics applications, the framework can be
• Biological material is often not randomly sampled.
• There may be several levels of technical variability and
• There may also be non-random components of error that need
to be estimated and allowed for.
• Perhaps the worst errors occur when the analysis does not
account for the true complexity of the framework and the
numbers are taken at face value.
• Nevertheless, you can’t do good bioinformatics without good