Michael Festing - MedicReS World Congress 2011

1
The design and statistical analysis of laboratory animal experiments
An interactive program aimed at helping scientists to improve the design of their
animal experiments and reduce the number of animals which they use.
The main menu is designed to be worked through sequentially, but you are free to dip
in wherever you want.
© Michael Festing
This disk may be copied for personal use but must not be
used for commercial purposes and must not be sold
michaelfesting@aol.com

2
Home page
1. Why bother?
2. Types of experiment
3. The experimental unit
4. “A good experiment..”
5. Avoiding bias
6. Power and sample size
7. Controlling variability
8. Strains of mice and rats
9. Experimental designs
10. Factorial experiments
11. Statistical analysis
12. Presenting your results
13. The ARRIVE guidelines
14. Test yourself
15. Summary of main points

3
Why bother?
Because there is a good chance that you will:
1. Improve the quality of your science
2. Get published in journals with a higher
impact
3. Save yourself time when doing your
research
4. Use fewer animals
5. Save money
6. Afford to do more experiments & buy better
apparatus
This section introduces you to the ethical aspects of
animal experimentation (the “3Rs”) and then shows
you the results of some surveys of published papers
which suggest that there is scope for improving
animal research.
A black and tan rabbit

4
Principles of Humane Experimental Technique
In 1959 two British scientists, Bill Russell and Rex Burch (right), wrote
“The Principles of Humane Experimental Technique”, in which
they introduced the “3Rs”. These provide a useful background against
which we can assess each individual experiment.
Replacement:
Wherever possible live animals should be replaced by non-sentient
or less sentient alternatives such as cell cultures, lower
forms of life or even mathematical models.
Some people claim that you can’t replace animals in this way.
But there are many examples where this has been possible.
Rabbits and mice used to be used to assay batches of insulin,
but now this can be done chemically. Monoclonal antibodies
used to be grown as ascites tumours in mice, but this is now
done in vitro. Even in toxicity testing there are a number of tests,
such as the Ames test for mutagenesis, which, although they do
not completely replace the use of animals, at least partially
replace them so that fewer are used.
Russell,W.M.S. and R.L.Burch.
1959. The principles of humane
experimental technique. Special
Edition, Universities Federation
for Animal Welfare. Potters Bar,
England.

5
Refinement:
If the use of animals cannot be avoided, then pain, distress or lasting
harm should be minimised.
Anaesthesia and analgesia should be used where appropriate. In
procedures which result in death, humane end-points should be used
so that the animals are painlessly killed rather than being allowed to
die in pain. Tumours should not be allowed to grow to an excessive
size.
The animals should be protected from disease and be provided with
an enriched environment where they have sufficient space to be able
to behave naturally. Food and water should only be withdrawn for
strictly limited periods. Social animals such as mice and rats should
be housed with other animals.
Russell and Burch

6
Reduction (the main subject of this program)
Investigators should use the minimum number of animals consistent
with achieving the objectives of the study. This involves:
1. The development of a research strategy with clearly defined
objectives that can be achieved with the available resources
2. Choice of a suitable animal model including genotype
(inbred/outbred, mutant, genetically modified) if using rodents
3. Well designed experiments which use neither too many animals
so that resources are wasted nor too few animals so that important
effects are missed
4. The correct statistical analysis of the results, including summary
statistics such as means and standard deviations as well as
indicators of uncertainty such as significance levels and confidence
intervals.
Note that there is a possible conflict between Reduction
and Refinement. Fewer animals are needed if the response
to a treatment is greater. But giving a higher “dose” (or the
equivalent) may involve more pain.

7
Test yourself
A scientist improves the design
of an experiment so that it gives
more significant results. Which
of the 3Rs does this count as?
Replacement
Refinement
Reduction
Routine sterilisation of all
materials entering an animal
house helps prevent the entry
of disease-causing micro-
organisms. What does this
count as?
Replacement
Refinement
Reduction
A neuro-scientist finds that she
can use rats instead of non-
human primates for a project.
What does this count as?
Replacement
Refinement
Reduction

8
Feedback
Reducing the number of animals used, or getting more
information out of an experiment both count as “Reduction”
because in the latter case there will be fewer false negative
results, so the research should progress more rapidly.
Preventing a disease from entering the animal house stops
the animals getting sick, and therefore it counts as a
Refinement. However, because the animals will be more
uniform and because diseased animals may give the wrong
results, it also counts as “Reduction”.
Rats are regarded as less sentient than non-human primates,
so this counts as a Replacement. If she could do the work using
Drosophila or C. elegans, that would be a further replacement.

9
Survey of statistical quality of published papers
Surveys of published papers suggest that there is ample scope for
improving the quality of animal experimentation. This one
(McCance, 1995 Aust. Vet. Journal. 72:322), commissioned by the
editors, found that, in the opinion of the statistician:
• 61% would have required statistical revision had they been
seen before publication
• 5% had errors so serious that the conclusions were
not supported by the data
• 30% had deficiencies in the design of the studies, including
failure to randomise, inappropriate size, heterogeneity of
subjects and possible bias
• 45% had deficiencies in the statistical analysis, including
the use of sub-optimal methods and errors in calculation
• 33% had deficiencies in presentation of the results, including
unexplained omission of data and inappropriate
statistical methods

10
A meta-analysis of 44 randomised controlled animal
studies of fluid resuscitation
If humans or animals lose a large proportion of their blood they go
into a state of shock. These studies used animal models to find
ways of reducing mortality. But a meta-analysis of 44 such studies
found that:
• Only 2 said how animals had been allocated (i.e. whether they had
been randomised)
• None had sufficient power to detect reliably a halving in risk of death,
a clinically relevant end point
• There was substantial scope for bias in the experimental designs
• There was substantial heterogeneity in results, due to method of
inducing the bleeding
• As a result the odds ratios were impossible to interpret
• The authors queried whether these animal experiments made any
contribution to human medicine
Roberts et al 2002, BMJ 324:474

11
Poor agreement between animal and human responses
In this study the authors looked at six interventions where the human outcome
was known and then did a meta-analysis of all the animal papers on the same
topic to see if the results were in agreement.
Intervention | Human results | Animal results (meta-analysis) | Agree?
Corticosteroids for head injury | No improvement | Improved neurological outcome (n=17) | No
Antifibrinolytics for surgery | Reduces blood loss | Too little good quality data (n=8) | No
Thrombolysis with TPA for acute ischemic stroke | Reduces death | Reduces death but publication bias and overstatement (n=113) | Yes
Perel et al (2007) BMJ 334:197-200
First three interventions. Next page for rest of results

12
Poor agreement between animal and human responses
Three more interventions
Intervention | Human results | Animal results (meta-analysis) | Agree?
Tirilazad for stroke | Increases risk of death | Reduced infarct volume and improved behavioural score (n=18) | No
Corticosteroids for premature birth | Reduces mortality | Reduces mortality (n=56) | Yes
Bisphosphonates for osteoporosis | Increase bone density | Increase bone density (n=16) | Yes
Perel et al (2007) BMJ 334:197-200
Overall, in three of the six cases the response in animals and humans differed. Was
this poor experimental design or were the models inadequate?
This is not known, but the authors stated that in many cases the designs
seemed to be inadequate.

13
A survey of a random sample of 271 papers involving live
mice, rats or non-human primates found
Of the papers studied:
• 5% did not clearly state the purpose of the study
• 6% did not indicate how many separate experiments were done
• 13% did not identify the experimental unit
• 26% failed to state the sex of the animals
• 24% reported neither age nor weight of animals
• 4% did not mention the number of animals used
• 0% justified the sample sizes used
• In 35% of those which reported the numbers used, the numbers differed
between the materials and methods and the results sections
• etc.
Kilkenny et al (2009), PLoS One Vol. 4, e7824
Overall conclusion
Most papers reported their results inadequately. None justified the
numbers they used, and in many cases the design of the
experiments and/or the statistical analysis were inadequate.

14
In the survey published in the Aust. Vet. J. what proportion of papers did the statistician think had
defects in the design of the experiments? (click ?)
10-20% 21-30% 31-40% >40%
What proportion did he think would have required statistical revision had he seen them
before publication?
10-20% 21-30% 31-40% 41-50% 51-60% >60%
In the survey of 271 papers, what proportion gave a justification for the sample size which they
used?
0-20% 21-30% 31-40% >40%

15

16
There are several types of experiment which you might use. These are:
Pilot study/experiment
The aim is to study the logistics of a proposed experiment and to obtain
preliminary information. These experiments are usually small, and results
should be treated with caution. It is probably best not to publish them.
They may provide an estimate of the standard deviation which can later be
used in a “power analysis” to determine sample size, but because of small
sample size that estimate may be inaccurate.
Exploratory experiment
The aim is to provide data to generate new hypotheses.
Typically, these experiments may “work” or “not work”.
There are often many outcomes (characters) measured.
Statistical analysis may be problematical due to false positive results and
data snooping (looking at the results then doing a test of those which
seem most interesting). As a result, the p-values may not be correct.
However this sort of experiment can be useful at the start of a new
project.

17
Types of experiment
Other types of study
Experiments may be set up just to estimate means or
standard deviations. “Uncontrolled” experiments (where
there is no comparison between groups) are sometimes
done. An LD50 test is an example. Regression and
correlation studies may be done to estimate the
relationship between variables (e.g. time and weight as
in a growth curve).
Confirmatory experiments
Formal hypothesis stated a priori. The double-blinded,
randomised placebo-controlled experiment is the gold
standard. p-values must be correct. It is this type of
experiment which is explored in more detail here.
18
A hairless mouse.

19
The experimental unit is
“The smallest division of the experimental material such that any
two experimental units can receive different treatments”.
It is the unit of randomisation and of statistical analysis to compare
groups.
The animals are all housed in one
cage but the treatment is given by
injection. Any two can receive
different treatments, so the animal
is the experimental unit and “N” (the
total number of subjects) is 8

20
The experimental unit
Here the animals are housed two per cage and
the treatment is given in the feed or water. The
two animals can not have different treatments.
What do you think is “N”, the total number of experimental units in this case?
2? 4? 8?

21
Here are two tanks each with seven fish. The treatment is given in the water,
and two fish have died
What is N, the total number of
experimental units?
One
Two
Seven
Fourteen

22
It is sometimes possible to do within-animal experiments.
In a crossover experiment an animal could be given a treatment for
a period, then rested and given a different treatment for a period. In
this case the experimental unit is an animal for a period of time. It is
assumed that the treatment doesn’t alter the animal, so it has to be
very mild.
In this experiment animals are given four treatments, sequentially in random order.
What do you think is N?
3? 4? 12?

23
Another within-animal experiment
The animal has had its back shaved and four
treatments have been applied topically in random
locations. In this case, N=12.

24
Teratology: mother treated, young measured
Mothers, not the pups, are the experimental unit because pups from
the same mother can not receive different treatments
N=2
n=1

25
A black and tan fancy mouse.

26
A well designed experiment should (summary):
1. Have a clear specification of the aims of the experiment.
The hypothesis to be studied needs to be clearly stated before
planning the experiment.
It would be a serious error to look at the results of the experiment
and then adjust the hypothesis to fit them!
2. Be unbiased
There should be no systematic differences between the treated
and control groups apart from the effects of the treatment
Bias may result in false positive results when the effects of some
other factor are assumed to be due to the treatment
Bias is avoided by correct identification of the experimental
unit, blinding, and by randomisation
3. The experiment should be powerful
If the treatment really has an effect, there should be a high chance
that it can be detected
Experiments which lack power have too many false negative
results
Power depends on sample size, control of variability and
sensitivity of the subject

27
A well designed experiment should:
4. Have a wide range of applicability
Where possible the extent to which a result can be generalised
across strains, diets, environments or techniques should be known.
An experiment where the results can only be replicated in some
animal houses but not in others lacks generality
The range of applicability is explored using factorial designs
5. Experiments should be simple
They should not be so complex that mistakes are made or they are
impossible to interpret.
Clearly written protocols should be used
6. It should be possible to statistically analyse the result of an
experiment.
The statistical analysis and the experiment should be planned at the
same time.
An investigator should never start an experiment without
knowing how it is going to be analysed

28
How can you increase the power of an experiment (there may be more than
one correct answer)?
By:
Using factorial designs
Increasing sample size
Better randomisation
Avoiding bias
Better statistical methods
Controlling variability

29

30
Avoiding bias
Bias is avoided by:
1. Correct selection of the experimental unit (as discussed previously)
2. Randomisation of the experimental units to the treatment groups in a
method appropriate to the type of experiment.
3. Randomisation of the order in which measurements are made and the
animals are housed
4. “Blinding” and the use of coded samples to ensure that the investigator or
other staff can not easily influence the outcome of the experiment.

31
Randomisation
Why do we randomise?
Because it ensures that there can be no
systematic differences between the treatment
groups
Randomisation is easy using a spreadsheet
such as EXCEL, and all good statistical
packages provide ways of putting numbers or
letters in random order. The next page shows
how it might be done using EXCEL or another
spreadsheet.

32
Randomisation of 12 animals to three treatments (A-C) using EXCEL
Original =rand() Sorted on =rand() Animal number
A 0.527 A 0.067 1
A 0.100 A 0.100 2
A 0.067 A 0.122 3
A 0.122 C 0.210 4
B 0.665 B 0.248 5
B 0.875 C 0.265 6
B 0.478 B 0.478 7
B 0.248 A 0.527 8
C 0.210 C 0.628 9
C 0.628 B 0.665 10
C 0.265 B 0.875 11
C 0.895 C 0.895 12
1. The treatment designations A-C were put in the first column
2. A random number was put in the second one (as “values”)
3. The columns were then sorted on the random number column to
give column 3 in random order. The animal numbers are then
added
4. In this case the first three animals will be assigned to A, the 4th
to C etc.
Sometimes a random
order doesn’t look very
random, such as when
the first three animals
(here) all receive
treatment A.
But use this sort of
method and you won’t
go far wrong.
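The spreadsheet steps above can be sketched in a few lines of Python, as an alternative for anyone who prefers a script. This is only an illustration: the function name `randomise` and the seed value are assumptions made here, not part of the original program. It mirrors the same idea of attaching a random number to each treatment label and sorting on it.

```python
import random

def randomise(treatments, per_group, seed=None):
    """Return a dict mapping animal number -> treatment label.

    Mirrors the spreadsheet method: give every label a random
    number, then sort on that number.
    """
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    labels = [t for t in treatments for _ in range(per_group)]
    keyed = [(rng.random(), label) for label in labels]
    keyed.sort()  # sort on the random number column
    return {animal: label
            for animal, (_, label) in enumerate(keyed, start=1)}

allocation = randomise(["A", "B", "C"], per_group=4, seed=2011)
for animal, treatment in allocation.items():
    print(animal, treatment)
```

Recording the seed alongside the protocol lets the same allocation be regenerated later if it is queried.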

33
Randomisation
Treatment (randomised)   Random number (now sorted)   Animal number
A 0.067 1
A 0.100 2
A 0.122 3
C 0.210 4
B 0.248 5
C 0.265 6
B 0.478 7
A 0.527 8
C 0.628 9
B 0.665 10
B 0.875 11
C 0.895 12
How should the animals be caged?
Single animal/cage:         A | A | A | C | B | etc
Single with companion (X):  A,X | A,X | A,X | C,X | B,X | etc
Several/cage at random:     A,A,A,C | B,C,B,A | C,B,B,C
Randomised block design:    A,B,C | A,B,C | A,B,C | A,B,C
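For the randomised block option, where each cage holds one animal on each treatment, the allocation within each cage can still be randomised. The following is a minimal sketch under that assumption (the cage is treated as the block, and the function name `randomised_blocks` is hypothetical):

```python
import random

def randomised_blocks(treatments, n_blocks, seed=None):
    """Return one randomly ordered copy of the treatments per block (cage)."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        order = list(treatments)
        rng.shuffle(order)  # random order of treatments within this cage
        blocks.append(order)
    return blocks

cages = randomised_blocks(["A", "B", "C"], n_blocks=4, seed=2011)
for cage_number, order in enumerate(cages, start=1):
    print(cage_number, order)
```

Every cage contains each treatment exactly once, so differences between cages do not bias the comparison between treatments.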

34
Number of animals per cage
There is no correct answer to the numbers of animals housed per cage. It
depends on species and the nature of the experiment.
With rats and mice single housing may be stressful. But male mice may fight,
depending on the strain and husbandry conditions.
Sometimes, in order to avoid stress, expensive animals (those fitted with
telemetry apparatus, for example) can be housed with a companion which is
not part of the experiment.
Group housing poses problems if treatment is given in the food or water as the
cage is then the experimental unit unless sophisticated apparatus is used so
that each animal can have a different diet. This is sometimes done with rats
and farm animals.
Group housing may also be a problem if drug treatments are involved as rats
and mice are coprophagous so control animals may consume metabolites of
the test compound if the animals of different treatment groups are housed
together.
Finally, it is not a good idea to house all the controls in one cage, all of
treatment 1 in a second cage etc. as there can be “cage effects” due to social
interactions which could seriously bias the results (e.g. if all the controls are
fighting, but the treated animals are not).

35
Blinding
We usually have some idea of what we would like
to find in our experiments. So it is better, where
possible, to use coded samples so that we do not
bias our results by favouring (often inadvertently)
one or more of the treatment groups.
This is particularly important when scoring
histological sections or measuring behaviour.
The next slide shows the consequences of failing
to randomise and/or blind a study

36
Failure to randomise and/or blind leads to false positive results
290 animal studies were scored for blinding, randomisation and whether the
outcome was positive or negative, as defined by the authors
(Bebarta et al 2003 Acad. Emerg. Med. 10:684-687)
Comparison                                     Odds ratio (95% CI)
Blind / not blind                              3.4 (1.7-6.9)
Random / not random                            3.2 (1.3-7.7)
Blind and random / not blind and not random    5.2 (2.0-13.5)
An odds ratio of one would imply that blinding or randomisation was not
associated with the outcome of an experiment. These odds ratios, all greater
than one, imply that on average studies which were not blinded and/or randomised
produced excessive numbers of false positive results.
In other words, studies where there was no blinding or randomisation were
unreliable. They give too many false positive results.

37
Randomisation is used to ensure that the means of each group are
identical
We randomise our animals so that they won’t fight so much
Randomisation ensures that each experimental unit has an equal probability of
being assigned to a particular treatment group
Randomisation is the only way to avoid bias
Blinding is used so that other people can not copy our data
Blinding helps us to avoid unintentionally biasing our results

38
C57BL/6 mice like to explore their environment.

39
Power and Sample size
It is important not to use too many animals (or other experimental
units) in an experiment because it costs money, time and effort,
and it is unethical.
Conversely, if too few animals are used the experiment may be
unable to detect a clinically or scientifically important response to
the treatment. This also wastes resources and could have serious
consequences, particularly in safety assessment.
We need to avoid making either of these mistakes

40
Minimising statistical errors
The null hypothesis
In a controlled experiment the aim is usually to compare two or more
means (or sometimes medians or proportions). We normally set up a “null
hypothesis” that there is no difference between the means, and the aim of
our experiment is to disprove that null hypothesis.
However, as a result of inter-individual variability we may make a mistake.
If we fail to find a true difference, then we have a false negative result,
also known as a type II or β error. Conversely, if we think that there is a
difference when in fact it is just due to chance, then we have a false
positive, Type I, or α error. These are shown in the table below.
                         Experimental conclusion
State of nature          Accept null hypothesis    Reject null hypothesis
Null hypothesis true     Correct conclusion        Type I or α error
Null hypothesis false    Type II or β error        Correct conclusion

41
Power analysis and the control of statistical errors
We can control type I errors because we can estimate the probability
that the means could differ to a given degree knowing the sample
sizes and the degree of variability (and making some assumptions
about the distribution of the data).
If it is highly unlikely that they came from the same population, we
reject the null hypothesis and assume that the treatment has had an
effect.
We usually set the probability of a type I error at 0.05, or 5%. When the
null hypothesis is true, for every 100 experiments we would expect, on average,
five type I errors to be made.
We don’t usually set it much lower than this because that will increase
the probability of a type II error.
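The 5% figure can be checked with a small simulation. This is only a sketch, and it makes simplifying assumptions that are not in the original: both groups are drawn from the same normal population with a known standard deviation of 1, and a two-sided z-test is used in place of a t-test so that only the Python standard library is needed.

```python
import random
from statistics import NormalDist, mean

def false_positive_rate(n_per_group=10, n_experiments=2000,
                        alpha=0.05, seed=1):
    """Simulate experiments in which the null hypothesis is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    se = (2 / n_per_group) ** 0.5  # SE of a difference of means, SD = 1
    rejections = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / se
        if abs(z) > z_crit:
            rejections += 1  # a type I error: no real difference exists
    return rejections / n_experiments

rate = false_positive_rate()
print(f"Proportion of false positives: {rate:.3f}")
```

The proportion printed should be close to 0.05, whatever the sample size, because the significance level fixes the type I error rate.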

42
Power analysis and the control of statistical errors
Type II errors are more difficult to control. False negative results
occur when there is excessive variation (“noise”) or there is only a
small response to the treatment (a low “signal”). We can specify
the probability of a type II error or the statistical power (one
minus the type II error) if we use a power analysis.
There is a mathematical relationship between the six variables
discussed in the next two slides such that if five of them are
specified or fixed, the sixth can be estimated.

43
Power analysis: the variables
(More details on the next slide)
[Diagram linking the six variables:]
Signal: effect size of scientific interest (or actual response)
Noise (SD): variability of the experimental material
Sample size
Power of the experiment (80-90%?)
Chance of a false positive result: significance level (0.05?)
Sidedness of statistical test (usually 2-sided)

44
Variables involved in a power analysis
1. The effect size of scientific interest (the signal)
This is the magnitude of response to the treatment likely to be of scientific or
clinical importance. It has to be specified by the investigator. Alternatively, if the
experiment has already been done it is the actual response (difference between
treated and control means)
2. The variability among experimental units (the noise)
This is the standard deviation of the character of interest. It has to come from a
previous study or the literature as the experiment has not yet been done
3. The power of the proposed experiment
This is 1-β where β is the probability of a type II error. This also has to be specified
by the investigator. It is often set at 0.8 to 0.9 (80 or 90%)
4. The alternative hypothesis
The null hypothesis is that the means of the two groups do not differ. The
alternative hypothesis may be that they do differ (two sided), or that they differ in a
particular direction (one sided)
5. The significance level
As previously explained, this is usually set at 0.05
6. The sample size
This is the number in each group. It is usually what we want to estimate. However,
we sometimes have only a fixed number of subjects in which case the power
analysis can be used to estimate power or effect size.

45
The standardised effect size or signal/noise ratio
The signal/noise ratio is:
((Control mean) - (treated mean)) / (pooled standard deviation)
This is also known as the “standardised effect size” and “Cohen’s d”.
Most statistical packages provide power calculations and there are several
web sites that will do them. However, the graph on the next page is probably
sufficiently accurate for most people, given the uncertainty in deciding how
large an effect is going to be of biological importance, and the fact that the
estimate of the standard deviation may not be all that accurate.
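A signal/noise ratio can also be turned into an approximate sample size with a short calculation. The sketch below is not the method used by any particular package: it uses the normal approximation for a two-sided, two-sample comparison, plus a commonly used small-sample correction term (z²/4), and the function name is an assumption made for illustration. It should land close to t-test based calculations but may differ by an animal or so per group.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(signal_noise, power=0.9, alpha=0.05):
    """Approximate n per group for a two-sided two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) / signal_noise) ** 2
    n += z_alpha ** 2 / 4  # correction for using z instead of t
    return ceil(n)

for ratio in (0.56, 2.22):
    print(ratio, sample_size_per_group(ratio))
```

For the two ratios used in the dog example later in this section (0.56 and 2.22) this gives 68 and 6 per group, the same values as quoted there.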

46
Sample size as a function of signal/noise ratio
[Graph: sample size per group (y-axis, 10 to 130) against signal to noise ratio (x-axis, 0.4 to 2.2)]
Sample size as a function of the signal/noise
ratio in a two-sample t-test with a 5%
significance level and a two-sided test. Red
circles are for a power of 90%, triangles for a
power of 80%.
Note that a sample size of about 17-23 in
each group, depending on power, is needed
to detect a signal/noise ratio of 1.0.
A sample size of about 9-11 in each group,
depending on power, is needed to detect a
signal/noise ratio of 1.4
Signal/noise ratios of less than 0.05 require
large numbers of animals in each group

47
Power analysis software
Most modern statistical packages will do power analysis calculations for
the two-sample situation. Some, such as “nQuery Advisor”, will also do the
calculations for more complex situations. There are also a number of web
sites which will do the calculations for you (do a Google search for
“statistical power calculations”).
The R statistical package is a free, command-driven package used by
professional statisticians. The command below generated the output which
follows it, for the random dogs (see example, next slides), using the
signal/noise ratio 0.56 with a 90% power and a 5% significance level. Note
that we do not need “n” to five decimal places!
power.t.test(delta=0.56, sd=1, power=0.9, sig.level=0.05)
Two-sample t test power calculation
n = 67.98649
delta = 0.56
sd = 1
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group

48
An example:
A vet wants to compare the effect on blood pressure
of two anaesthetics for dogs under clinical conditions.
He has some preliminary data. The dogs are unsexed
healthy animals:
weight 3.8 to 42.6 kg
mean systolic BP of 141 (SD 36) mm Hg
Assume that:
1. a difference of 20 mmHg or more would be of clinical
importance (a clinical not a statistical decision).
2. a significance level of α of 0.05,
3. a power of 90%
4. and a 2-sided t-test,
Then the signal/noise ratio would be 20/36 = 0.56
From the graph on the previous page the required
sample size is about 68 dogs/group.

49
A different vet has some beagles
Male Beagles weighing 17-23 kg,
Mean BP of 108 (SD 9) mm Hg.
Assuming a 20 mmHg difference between groups would
be of clinical importance (as before)
With the same assumptions as previous slide:
Signal/noise ratio = 20/9 = 2.22
Referring again to the graph:
Required sample size 6/group (although it can
not be read very accurately off the graph)

50
Summary for two sources of dogs: aim is to be able to detect a 20mmHg change in blood pressure
Type of dog    SD   Signal/noise   Sample size/group (1)   % Power, n=8 (2)
Random dogs    36   0.56           68                      18
Male beagles   9    2.22           6                       98
(1) Sample size: 90% power
(2) Power, sample size 8/group (this cannot be read off the graph)
Assumes α=5%, 2-sided t-test and effect size 20mmHg
Conclusion: It would not be sensible to do the experiment with the random
dogs. Either an investigator should assume that Beagles can represent
“Dogs” or the experiment could be done using several breeds, but using a
factorial design, discussed later.
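The power available with a fixed group size can be approximated in the same spirit. The sketch below is an illustration only: it uses a normal approximation rather than the t-distribution, so with only eight animals per group it is somewhat optimistic compared with the 18% and 98% in the table, and the function name is an assumption.

```python
from statistics import NormalDist

def approx_power(signal_noise, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic if the true effect is the signal
    shift = signal_noise * (n_per_group / 2) ** 0.5
    # Probability of falling in either rejection region
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)

random_dogs = approx_power(20 / 36, n_per_group=8)
beagles = approx_power(20 / 9, n_per_group=8)
print(f"Random dogs: {random_dogs:.0%}, male beagles: {beagles:.0%}")
```

The contrast between the two sources of dogs is the point: with the same eight animals per group, the more uniform beagles give a far higher chance of detecting a 20 mmHg difference.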

51
The resource equation
A power analysis is not always possible.
1. If lots of characters are being measured it may
not be clear which one is the most important.
2. There may be no estimate of the standard
deviation.
3. In fundamental research it may be impossible to
specify an effect size likely to be of scientific
importance.
4. Experiments may be complex with many
treatment groups and possible interactions.
An alternative is the “Resource Equation” method.
This depends on the law of diminishing returns.
E = (Total number of experimental units) - (number of
treatment groups)
E should be between 10 and 20
[Graph: information (y-axis, 0 to 12) against degrees of freedom, E (x-axis, 1 to 29)]
Diminishing returns set in rapidly as the
total number of subjects increases.
Adding units up to E=10 is good value
but increasing E beyond about 20
provides little extra information.

52
Resource equation example
You decide to do an experiment with four treatment
groups (a control and 3 dose levels) and eight animals
per group.
E= 32 – 4 = 28. So this is unnecessarily large.
With six animals per group E=20, which is acceptable
Control Low dose Mid dose High dose
Each rectangle represents
a single experimental unit
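The arithmetic of the resource equation is simple enough to script when comparing candidate designs. A minimal sketch, assuming equal group sizes (the function names are hypothetical):

```python
def resource_equation_e(n_groups, units_per_group):
    """E = total number of experimental units - number of treatment groups."""
    total_units = n_groups * units_per_group
    return total_units - n_groups

def is_acceptable(e, low=10, high=20):
    """E should lie between 10 and 20."""
    return low <= e <= high

for per_group in (8, 6):
    e = resource_equation_e(n_groups=4, units_per_group=per_group)
    print(per_group, e, is_acceptable(e))
```

With four groups this reports E=28 for eight animals per group (too large) and E=20 for six per group (acceptable), as in the example above.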

53
A mutant dwarf mouse and wild-type litter mate.

54
Controlling variation
If variation can be reduced the signal/noise
ratio goes up and either sample size can be
reduced, power can be increased or a
smaller response can be detected.
[Plot: body weight of mice housed 1, 2, 4 or 8 per cage (weight axis, 35 to 55)]
Mice/cage   SD
1           5.8
2           3.9
4           3.2
8           2.9
Chvedoff et al (1980) Arch.Toxicol. Suppl 4:435
The plot shows that mice housed singly are
more variable than those housed in pairs or
groups, although they weigh slightly more on
average.

55
The consequences of inter-individual variability
Assume you want to do an experiment to see whether a specified
drug treatment affects body weight in mice, with individual mice
being the experimental unit.
You plan to compare treated and control means and consider that
if the two means differ by 5g or more (the signal) this would be of
biological interest. You plan to use a two-sided t-test with a
significance level of 0.05. Should you house your mice singly or in
pairs? (You rule out having more per cage.)
Mice/cage   Mean   SD (noise)   Signal/noise   Number needed/group
1           46.0   5.8          5/5.8=0.86     30
2           44.7   3.9          5/3.9=1.28     14
Assuming that the response (signal) is not affected by number
per cage you would need only about half the number of animals if they
were to be housed in pairs.
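The group sizes in the table can be approximated from the signal/noise ratio using the usual normal-approximation formula for a two-sided, two-sample comparison. This is only a sketch and not necessarily how the slide was produced: the power of 90% and the small-sample correction term are assumptions, chosen because they reproduce the tabulated values of 30 and 14.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(signal_noise: float, alpha: float = 0.05, power: float = 0.90) -> int:
    """Approximate animals needed per group for a two-sided two-sample t-test.

    signal_noise is the difference to be detected divided by the SD.
    The power of 0.90 is an assumption, not a value stated on the slide.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / signal_noise ** 2
    n += z_alpha ** 2 / 4                # correction for using t rather than z
    return ceil(n)


for ratio in (0.86, 1.28):
    print(f"signal/noise {ratio}: about {n_per_group(ratio)} per group")
```

Because the number needed depends on the square of the signal/noise ratio, even a modest reduction in the SD gives a large saving in animals.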

56
Controlling genetic variability
The data below show the mean and standard deviation (N=25-47) of sleeping time (minutes) in
five inbred strains and two outbred stocks of mice under hexobarbital anaesthetic.
Strain      Mean   SD   Sig/noise   Group size*   Power**
A/N         48     4    1.0         23            86
BALB/c      41     2    2.0         7             99
C57BL/He    33     3    1.3         13            98
C3HB/He     22     3    1.3         13            98
SWR/HeN     18     4    1.0         23            86
Stock
CFW         48     12   0.3         191           17
Swiss       43     15   0.3         297           13
* To detect a 4 min. change in the mean (2-sided) with α=0.05, power = 90%
** Power (%) to detect a 4 min. change in the mean with 20 mice/group
Data from Jay 1955 Proc Soc. Exp Biol Med 90:378
Note the much greater variability in the outbred stocks. This substantially reduces the
signal/noise ratio (assuming an effect size of 4 minutes), and means much larger sample
sizes are needed.

57
Variation in kidney weight in 58 groups of rats
[Figure: variability (%, axis 0 to 90) of each group plotted against sample number (1 to 58), with groups marked as Mycoplasma, Outbred, F1 or F2]
(re-drawn from Gartner,K. (1990), Laboratory Animals, 24:71-77.)
This study shows the variability
of kidney weight in 58 groups
of rats (N=approx 30 in each
group).
Groups have been ranked in
order of variability which is
expressed as a percentage.
Some groups were affected
with Mycoplasma pulmonis
causing chronic respiratory
disease (in red), some were
outbred, some F1 hybrid and
some F2 hybrids.
The effects of these sources of
variation are shown on the next
slide.

58
Variation, sample size and power
Suppose the aim of an experiment is to find out whether a drug affects the
weight of the kidneys in rats. We can use a power analysis to find out how
many rats of each type shown on the previous page would be needed.
Assume that we want to be able to detect a 20% change in kidney weight
(either way), we want a power of 80%, a significance level of 5%, and we
have data on the variability of each group. The results are shown in the table
below.
The table also shows the power of the experiment to detect a 20% change if
the sample size is fixed at 10 animals per group with all other assumptions
the same.
Factor     Type              Std.Dev (%)   Signal/noise   Sample size   Power (%)
Genetics   F1 hybrid         13.5          1.48           10            87
           F2 hybrid         18.4          1.09           15            63
           Outbred           20.1          0.99           18            55
Disease    Mycoplasma free   18.6          1.08           15            62
           With Mycoplasma   43.3          0.46           76            14

59
Controlling variation: Conclusion
Four examples:
The random dogs versus beagles in the previous section,
Housing mice singly or in groups,
Sleeping time under anaesthetics,
Kidney weight in rats of various types.
All show that increased variability reduces the
signal/noise ratio so larger sample sizes are needed to
detect the effect of a treatment.
This will cost money, time and effort, and it is also
unethical not to control such variation wherever possible.
Uncontrolled variation in almost any controlled
experiment is “bad news”

60
A hooded rat.

61
76% of animals used in research in the UK in 2008 were mice or rats
But there are lots of types
of these species. What are
they all and what are their
properties?
The main types are:
Inbred strains*
Outbred stocks*
Mutants
Genetically modified strains
(not discussed here)
* Outbreds are known as “stocks”, inbreds as “strains”

62
Outbred stocks of rats and mice:
“Sprague-Dawley” and “Wistar” rats
“Swiss”, “CD-1” and “MF-1” mice
They can change rapidly in characteristics due to selection,
inbreeding and random genetic drift.
They are “genetically undefined”. Nothing is known about the
genotype of any individual.
Stock names such as “Sprague-Dawley” (SD) have no
genetic meaning (no genetic markers to define them).
Each animal is genetically different, and colonies with the same name will also
differ genetically to some extent.
But they are vigorous, cheap and prolific and are widely used in research.
63
Outbred stocks
[Figure: percent responders (0 to 100) plotted against sample number (1 to 25)]
Percent responders to a synthetic polypeptide in
successive samples of about 30 outbred SD rats
from the same commercial breeder over a
period of 18 months.
Note that this is not just sampling variation. The
high & low responding rats must have come
from different colonies.
Seven inbred strains gave consistent results
(100% responders or non-responders)
The variability of outbred stocks
(compared with inbred strains)
leads to lower signal/noise
ratios and less powerful
experiments
DNA fingerprint shows genetic heterogeneity

64
Outbred stocks
Geneticists use outbred stocks only when they have no alternative, or
for a few genetic studies. For example, an outbred stock can be used as
a base population for a selective breeding experiment. More recently
they are sometimes used in gene association experiments where the
genotypes of many individuals are recorded at many gene loci to see if
there are associations with a disease or response to an experimental
treatment. But these are specialised (and expensive) studies.
For the vast majority of work geneticists recognise that they should
control the genetic background, and that means using inbred strains.
Some scientists attempt to justify the use of outbred stocks on the
grounds that “humans are outbred”. But humans and animals differ in
many ways. We don’t insist on using animals weighing 70kg on the
grounds that humans weigh about that. And why do we so often use
albino animals? Possibly the reason is that this is the easiest way of
making the animals all look the same, and good scientists know that
they should control variability if they want powerful experiments.

65
Inbred strains
All animals in an inbred strain
are genetically identical
But each strain is different
Produced by >20 generations of brother x sister mating
Genetically stable. Cannot be changed by selective breeding
Sublines have arisen as a result of “residual heterozygosity” (the sublines were
separated before the strain was fully inbred).
Sublines can also arise as a result of new mutations (relatively rare)
There are >400 inbred strains of mice and 150 inbred strains of rats

66
“Derived” inbred strains. Brief details only are given here
There are a number of more specialised strains derived from straight inbred strains. These
include:
Coisogenic strains: A pair of strains which differ at only a single genetic locus as a
result of a mutation. “Knockout” strains usually fall into this category. They are used in
the study of the mutation
Congenic strains: A pair of strains which differ at a single genetic locus plus a section
of chromosome. These are produced by backcrossing a mutation or polymorphism to
an inbred strain. The length of associated chromosome depends on the number of
backcrossing generations.
Recombinant inbred (RI) strains: These are sets of inbred strains developed from an
F1 cross between two standard inbred strains. They are used to determine the mode
of inheritance of some measured phenotype.
Recombinant congenic (RC) strains. Like RI strains except they are produced from a
backcross generation of a cross between two inbred strains.

67
Inbred strains: nomenclature
Inbred strains of mice and rats are designated by a code starting with an upper case letter
followed by letters and/or numbers.
Examples: Rat strains: LEW, F344, BDIX, PVG
Mouse strains: DBA, CBA, BALB, C57BL, SJL
There are some exceptions such as 129P1.
Sublines are designated by a slash followed by a code involving letters and/or numbers.
Examples: Rat: LEW/Ss, F344/N
Mouse: BALB/c, C57BL/10
A code showing the breeder may be appended, e.g. C57BL/6J where the “J” stands for
the Jackson Laboratory.

68
Inbred strains
Petko M. Petkov et al. Genome Res. 2004; 14: 1806-1811
Genetic similarities in mice based on many
genetic markers
DNA fingerprints show that within a
strain all animals are genetically
identical but strains differ

69
Controlling genetic variability (this slide was shown in a previous section)
The data below show the mean and standard deviation (N=25-47) of sleeping time (minutes) in
five inbred strains and two outbred stocks of mice under hexobarbital anaesthetic.
Strain      Mean   SD   Sig/noise   Group size*   Power**
A/N         48     4    1.0         23            86
BALB/c      41     2    2.0         7             99
C57BL/He    33     3    1.3         13            98
C3HB/He     22     3    1.3         13            98
SWR/HeN     18     4    1.0         23            86
Stock
CFW         48     12   0.3         191           17
Swiss       43     15   0.3         297           13
* To detect a 4 min. change in the mean (2-sided) with α=0.05, power = 90%
** Power (%) to detect a 4 min. change in the mean with 20 mice/group
Data from Jay 1955 Proc Soc. Exp Biol Med 90:378
Note the much greater variability in the outbred stocks. This substantially reduces the
signal/noise ratio (assuming an effect size of 4 minutes), and means much larger sample
sizes are needed.
Controlling the within-group variation using inbred strains
increases statistical power or allows fewer animals to be used

70
Genetics is important: Twenty two Nobel Prizes since 1960 for work depending on inbred strains
Inbred strains and derivatives (C.C. Little, DBA, 1909; Jackson Laboratory)
Cancer: MMTV
Transmissible encephalopathies/prions: Prusiner
Retroviruses, oncogenes & growth factors: Cohen, Levi-Montalcini, Varmus, Bishop, Baltimore, Temin
Monoclonal antibodies (BALB/c mice): Köhler and Milstein
Smell: Axel & Buck
ES cells: Evans
“Knockouts”: Capecchi, Smithies
Genetics of the MHC: Snell
Immunological tolerance: Medawar, Burnet
H2 restriction: Doherty, Zinkernagel
Immune responses: Benacerraf (G.pigs)
T-cell receptor: Jerne
Antibody diversity: Tonegawa

71
Mutant strains
Clockwise from top left:
Hairless mouse, nude mouse,
obese mouse, Rowett nude rat
(with graft of hamster skin), viable
yellow mutants, New Zealand
nude rat (with skin grafts), dwarf
mouse.
These and other mutants are
widely used in biomedical
research. Refer to textbooks for
more details.

72
Experimental designs
There are a number of formal experimental designs available for use. These
include:
1. The completely randomised design. Subjects are simply assigned to
treatments at random. This is the commonest design in work with laboratory
animals.
2. The randomised block design. The experiment is split up into a number of
“mini-experiments”. This is for convenience, to have subjects in different
treatment groups as similar as possible so as to increase power, to build in
some repeatability, or to take account of some natural structure in the
material such as litters.
Randomised block designs have several names such as “within-subject”,
“repeated measures”, “crossover” or “matched subjects” designs. These
depend on the nature of the experimental unit and whether replication is in
time or space.
3. Latin square designs. These are used to further balance an experiment in
special situations.

73
The completely randomised design
[Figure: experimental units 1 to 20 in a row, each coloured according to its treatment]
Assume that the experimental units were reasonably homogeneous.
Treatments grey, green, red and orange were assigned at random to
experimental units 1-20 using EXCEL exactly as described in section 3.
The fact that 4/5 of the grey treatment are in the first ten and 4/5 orange
treatments are in the last ten would not matter in most cases. In short-term
experiments it is unlikely that there would be any important variables that
affect the first and last subjects differently.
This “completely randomised” design in which subjects are assigned at
random to the treatments is simple, can tolerate unequal numbers in each
group and is perfectly adequate in many experimental situations.
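The same kind of random assignment can be sketched in Python rather than EXCEL. This is an illustration, not the procedure from section 3: the function name is hypothetical, equal group sizes are assumed for simplicity, and the seed is fixed only so that the allocation can be reproduced.

```python
import random


def completely_randomised(units, treatments, seed=None):
    """Assign treatments to units at random, with equal numbers per treatment."""
    units = list(units)
    if len(units) % len(treatments):
        raise ValueError("units must divide equally among the treatments")
    labels = list(treatments) * (len(units) // len(treatments))
    random.Random(seed).shuffle(labels)   # one shuffle across the whole experiment
    return dict(zip(units, labels))


allocation = completely_randomised(
    range(1, 21), ["grey", "green", "red", "orange"], seed=1
)
for unit, treatment in allocation.items():
    print(unit, treatment)
```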

74
The completely randomised design
However, for long term experiments or ones where, for example, those
doing the experiments become more skilled, or more fatigued, this may
introduce extra variation and bias the results.
Also, the animals may not be very homogeneous, or they may have some
natural structure, such as coming in litters at different times. It may also be
more convenient if the experiment can be split up into smaller, more
homogeneous bits which are easier to handle, particularly if the experiment
is large.
In these cases a “randomised block” design might be more powerful and
more convenient.

75
The randomised block design
[Figure: five groups of experimental units labelled Block 1 to Block 5]
Here the experiment has been split up into
five homogeneous blocks.
Each block has subjects matched for size.
Each block has exactly one experimental
unit of each of the four treatments.
Randomisation is done separately within
each block
This design has one factor “Treatment”
which is a “fixed factor” (i.e. we can do
treatment 1 again if we wish) and one
factor “Block” which is a random factor
(you can’t replicate the effect of block one)
in which we have little or no interest.
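Randomising separately within each block can be sketched as follows. The unit numbering is hypothetical: it assumes the 20 units have already been sorted by size so that consecutive sets of four form the five blocks.

```python
import random


def randomised_block(blocks, treatments, seed=None):
    """Give each treatment to exactly one unit in every block, in random order."""
    rng = random.Random(seed)
    plan = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)                 # a separate randomisation for this block
        plan[block] = dict(zip(units, order))
    return plan


treatments = ["grey", "green", "red", "orange"]
# Hypothetical layout: block 1 holds units 1-4, block 2 holds units 5-8, and so on.
blocks = {b: list(range(4 * b - 3, 4 * b + 1)) for b in range(1, 6)}
plan = randomised_block(blocks, treatments, seed=1)
for block, assignment in plan.items():
    print("Block", block, assignment)
```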

76
The randomised block design
Blocks can be done at different times (even weeks
apart) and/or housed in different locations. Within
each block the subjects can be matched.
The main advantages of the RB design are that:
1. It can deal with heterogeneous material by
matching subjects in each block (increasing
power).
2. It is often more convenient to break the experiment
down into smaller bits which can then be done
more carefully.
The main disadvantages of the RB design are that:
1. It is not very tolerant of missing observations.
2. It should not be used for very small experiments
(say fewer than about 16 experimental units in total)
because there may be a slight loss of power.

77
Various types of randomised block designs
A Matched pairs design
(might be suitable for
comparing mutant and
wild-type in each litter as
it becomes available)
A before & after
experiment. But no
randomisation is
possible (can’t have
an after before a
before)

78
Various types of randomised block designs
[Figure: grid with rows Animal 1 to Animal 3 and columns Time 1 to Time 3]
A crossover or repeated measures** design (within-subject, but the
experimental unit is an animal for a period of time).
** The term “repeated measures” means different things to
different statisticians. It can also be used to describe an
experiment where each individual is measured several times
but without receiving a different treatment each time. This is
pseudo replication, which needs to be taken into account.
A within-subject design
(experimental unit is an area on
an animal)

79
The Latin square design
The number of subjects is the number of
treatments squared.
This is a 5x5 Latin square. It has five rows, five
columns and five treatments (Grey, Red,
Yellow, Green, White).
Note that there is one of each treatment in
each row and in each column.
It has not yet been randomised. To maintain
the layout we randomise whole rows and then
whole columns.
It has one fixed factor (Treatment) and two
random factors (Rows and Columns).
We would use it if there are two factors such
as day of the week (represented as columns)
and time of the day (rows) which may
influence the outcome, and we want these
balanced out.
Latin squares with more than 7 treatments can
become too large, and those with fewer than four
are too small. However, small ones (as small as
2x2) can be replicated.
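The randomisation described above (whole rows, then whole columns) can be sketched in Python. Starting from a cyclic square is an assumption made for illustration; any valid unrandomised square would serve.

```python
import random


def latin_square(treatments, seed=None):
    """Build a cyclic Latin square, then randomise whole rows and whole columns."""
    rng = random.Random(seed)
    k = len(treatments)
    # Unrandomised square: row i is the treatment list shifted by i places.
    square = [[treatments[(i + j) % k] for j in range(k)] for i in range(k)]
    rng.shuffle(square)                    # randomise whole rows
    columns = list(range(k))
    rng.shuffle(columns)                   # randomise whole columns
    return [[row[c] for c in columns] for row in square]


treatments = ["Grey", "Red", "Yellow", "Green", "White"]
square = latin_square(treatments, seed=1)
for row in square:
    print(" ".join(f"{t:<6}" for t in row))
```

Permuting whole rows and whole columns keeps the defining property that each treatment appears once in every row and once in every column.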

80
This was a wild one in India

81
Factorial designs
[Figure: a 2x2 factorial design with Treated and Control groups, animals coloured Blue or Green; E=16-4 = 12]
Factorial designs are common in research
involving laboratory animals.
The design on the right has two factors, the
treatment (Control versus Treated) and the
colour (Blue versus Green). This might
represent the two sexes, or two strains or two
diets or any other factor of possible interest.
The aim is usually to see whether that other
factor influences the results.
This is a 2x2 factorial because there
are two factors each at two levels. The
Resource Equation method of sample
size determination is shown here. E is
more than 10 and less than 20 so size
is probably OK.

82
Factorial designs
Factorial designs are efficient because the effect
of the treatment is still determined by eight
treated and eight control animals but there is the
additional information of Blue versus Green
(also 8 vs 8) averaging across treatments, and it
can be seen whether the response to the
treatment is the same in the Blue and Green
groups (known as the interaction effect).

83
Factorial experiments
[Figure: a 3x3 factorial layout with columns Control, Dose 1 and Dose 2 and rows Diet 1, Diet 2 and Diet 3]
E = (36-9) = 27
Sample size is a bit too large. Three
animals per group might be better (E = 27-9 = 18).
Factorial experiments can have
any number of factors and each
can be at any number of levels.
Here there are two factors:
Treatment (with levels Control,
Dose 1 and Dose 2) and Diet
(with levels 1, 2 and 3).
Levels can be either qualitative
such as diets 1, 2, and 3 or
quantitative such as dose 0,
1000 and 2000 mg/kg.

84
Factorial experiments
A 3x3x2 Factorial design
[Figure: columns Control, Dose 1 and Dose 2 and rows Diet 1, Diet 2 and Diet 3, with sex shown by patterns]
E = (36-18) = 18
Sample size is about right with 2
per smallest sub-group.
Here there are three factors
represented by Dose (3 levels),
Diet (3 levels) and Sex (say Male
and Female, represented
by the patterns).
Note that the smallest sub-group
can be quite small with
this type of design because in
calculating the means we
average over the other groups.

85
Effect of chloramphenicol (2000 mg/kg) on RBC counts in mice of two strains
Strain      Control   Treated   Strain means
BALB/c      10.10     8.95
            10.08     8.45
            9.73      8.68
            10.09     8.89      9.37
C57BL       9.60      8.82
            9.56      8.24
            9.14      8.18
            9.20      8.10      8.86
Treatment
mean        9.69      8.54
A real example.
We want to know:
1. Does treatment have an effect on RBC
counts?
2. Do strains differ in RBC counts?
3. Do strains differ in their response
(interaction)?
Clearly the treatment reduces red
blood cell (RBC) counts. There is no
overlap between treated and control
individuals. Also, C57BL seems to
have lower counts than BALB/c.
Whether or not there is an interaction
can best be seen graphically.

86
Effect of chloramphenicol (2000 mg/kg) on RBC counts in mice of two strains
[Figure: plot of mean RBC count (8.5 to 10.0) for control (c) and treated (t) mice, with one line for each strain (BALB/c and C57BL)]
A plot of the means shows that the
reduction is much the same in each
strain.
A 2-way analysis of variance is
needed to show statistical
significance. This is shown in
Section 11. It finds that the
treatment and strain differences are
statistically significant (unlikely to be
due to sampling variation), but the
interaction is not significant, exactly
as we determined just by looking at
the data.

87
Effect of chloramphenicol (2000mg/kg) on RBC count
Strain      Control   Treated   Strain means
C3H         7.85      7.81
            8.77      7.21
            8.48      6.96
            8.22      7.10      7.80
CD-1        9.01      9.18
            7.76      8.31
            8.42      8.47
            8.83      8.67      8.58
Treatment
means       8.42      7.96
Here are two different
mouse strains. In this case
the treatment seems to
have reduced the RBC
counts in C3H but not in
CD-1.
Statistical analysis shows
a highly significant
interaction effect (see
section 11). This can be
seen in a plot of the
means

88
[Figure: plot of mean RBC count (7.4 to 8.6) for control (C) and treated (T) mice, with one line for each strain (C3H and CD-1)]
In this case the CD-1 outbred stock
of mice was resistant to
chloramphenicol at this dose level,
but C3H has responded strongly
with a highly significant interaction
effect
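The size of an interaction in a 2x2 design can be expressed as a "difference of differences" between the cell means. The sketch below uses the C3H and CD-1 counts tabulated earlier; the variable names are illustrative, and the formal significance test is the two-way ANOVA referred to above, not this calculation.

```python
from statistics import mean

# RBC counts from the C3H / CD-1 table.
rbc = {
    ("C3H", "control"): [7.85, 8.77, 8.48, 8.22],
    ("C3H", "treated"): [7.81, 7.21, 6.96, 7.10],
    ("CD-1", "control"): [9.01, 7.76, 8.42, 8.83],
    ("CD-1", "treated"): [9.18, 8.31, 8.47, 8.67],
}

cell_mean = {key: mean(values) for key, values in rbc.items()}

# Reduction in mean RBC count caused by treatment, strain by strain.
reduction = {
    strain: cell_mean[(strain, "control")] - cell_mean[(strain, "treated")]
    for strain in ("C3H", "CD-1")
}

# With no interaction the two reductions would be about equal.
interaction = reduction["C3H"] - reduction["CD-1"]

for strain, value in reduction.items():
    print(f"{strain}: reduction = {value:.2f}")
print(f"difference of differences = {interaction:.2f}")
```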

89
A velvet rabbit.

90
Statistical analysis
You will need statistical software with good graphics.
EXCEL is not recommended for statistical analysis, although it is
useful for data entry prior to reading it into your chosen package. In
some cases EXCEL graphics are quite useful.
If you have access to one of the larger commercial packages such as
MINITAB, SPSS, SAS, or Graphpad Prism then use it. But allow time
to get to know how it works and how to interpret the output. Better still
take a course if one is available.
There are a number of open source programs available. R is widely
used by professional statisticians. It is command driven and difficult
to learn but there is a front end called “R Commander” (Rcmdr) which
is menu driven and a lot easier to use. But still expect to spend time
learning how to use it. R and Rcmdr can be downloaded from the
CRAN web site.

91
It would probably be sensible to get a statistical textbook where the
examples are analysed using the statistical package that is available
to you. Go to the web site associated with your package and see if
they recommend any suitable texts.
This section is “about” the statistical analysis, not “how to do the
statistical analysis”
Size matters
The aim of most controlled experiments is to estimate the magnitude
of any differences between the means (or less frequently the medians
or proportion affected) of the treatment groups for a trait of interest.
The statistical analysis normally estimates the probability that
differences at least as large as those observed could have arisen by chance
sampling variation alone. These are the so-called “p-values”.
If it is very unlikely that the differences could have arisen by chance,
then they are assumed to be the result of the treatment.

92
[Figure: red blood cell counts (RBC, 7.0 to 10.5) in CBA mice given
various dose levels of chloramphenicol (dose axis 0 to 2500), showing individual observations]
The first step is to look carefully at the
raw data to see whether there are any
obvious errors and to get a feel for
what is happening.
Graphical methods which show
individual observations should be
used.
In this case there is one obvious
outlier at the 1000 dose level. It was
checked and was not a transcription
error so it was not deleted. The
statistical analysis was done with it
and without it. In fact it made no
difference to the obvious conclusion
that chloramphenicol reduces red
blood cell counts.

93
GP1 GP2
34 42
46 42
35 51
42 48
42 44
42 43
44 41
43 45
39 42
34 44
46 44
Means 40.6 44.2
SDs 4.50 2.96
These are body weights of mice
(g) fed different diets. The
question is whether the
difference in means is due to
the effect of the diet, or could it
just be due to chance sampling
variation? There is quite a bit of
variation within each group
quantified by the standard
deviations (SDs).
What is your guess? Is the
difference likely to be due to the
effect of the treatment?
SD stands for “standard deviation”. It is a measure of the
variability in a group of numbers. It has the same units as the
numbers, so in this case it is g.

94
[Figure: individual body weights (35 to 50 g) plotted by group (1 and 2)]
A plot with individual
observations helps. Differences
in means seem to depend on
about 3-4 individuals.
We need some objective way of
reaching a decision of whether
this is likely to be due to
chance. The p-value for the
difference between the two
means provides this.
There are several ways of
calculating p-values, one of
which is to use a two-sample
t-test.

95
The two-sample t-test is shown
below. “t” is a “test statistic” which
is used in conjunction with the
df (“degrees of freedom”) to
obtain the p-value of 0.04114
shown here. In this version of the
t-test it is assumed that the variation
is the same in each group.
Here the test rejects the null
hypothesis that the difference
between the means is zero, and it
gives a 95% confidence interval for the true
difference of -6.93 to -0.16.
If the diets truly had no effect, a
difference at least as large as this would
arise from sampling variation only about
4% of the time. The difference is therefore
“statistically significant
at p=0.04” (we usually quote the
actual p-value).
Two Sample t-test
data: GP1 and GP2
t = -2.1829, df = 20, p-value = 0.04114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.9334705 -0.1574386
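The t statistic and confidence interval in this output can be checked by hand. The sketch below uses only the Python standard library, which has no t distribution, so the critical value of 2.086 (two-sided, 5%, 20 df) is taken from tables and the p-value itself is not recomputed. It assumes equal variances, as in the output above.

```python
from math import sqrt
from statistics import mean, variance

gp1 = [34, 46, 35, 42, 42, 42, 44, 43, 39, 34, 46]
gp2 = [42, 42, 51, 48, 44, 43, 41, 45, 42, 44, 44]


def pooled_t(a, b):
    """Two-sample t statistic, df and standard error, assuming equal variances."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / se, df, se


t, df, se = pooled_t(gp1, gp2)
T_CRIT = 2.086                           # tabulated value for 20 df, not computed here
diff = mean(gp1) - mean(gp2)
ci = (diff - T_CRIT * se, diff + T_CRIT * se)

print(f"t = {t:.4f}, df = {df}")
print(f"95% confidence interval: {ci[0]:.2f} to {ci[1]:.2f}")
```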

96
We could also use the ANOVA
(analysis of variance) to
estimate the p-values. When
there are only two groups the
ANOVA and t-test are
mathematically identical. Both
give a p-value of 0.041 as
shown below.
For an explanation of the
ANOVA table see the next page
One-way ANOVA: Wt versus Gp
Analysis of Variance for Wt
Source DF SS MS F P
Gp 1 69.1 69.1 4.77 0.041
Error 20 290.2 14.5
Total 21 359.3

97
Statistical analysis: The Analysis of Variance (ANOVA)
Analysis of Variance for Wt
Source   DF   SS      MS     F      P
Gp       1    69.1    69.1   4.77   0.041
Error    20   290.2   14.5
Total    21   359.3
Source: the source of variation (groups, error or residual, total)
DF: degrees of freedom (n-1)
SS: sum of squares, a quantification of the variation due to each source
MS: mean square (SS/DF). The error mean square is the variance (SD squared)
F: a test statistic like t (actually t²)
P: the p-value
This is the most widely used
method of statistical analysis. It is
very versatile and is essential for
analysing randomised block and
factorial designs.
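The sums of squares, mean squares and F in the table above can be reproduced from the raw body weights. This is a minimal standard-library sketch with an illustrative function name; the p-value is not computed because that needs the F distribution, which a statistical package supplies.

```python
from statistics import mean

groups = {
    "GP1": [34, 46, 35, 42, 42, 42, 44, 43, 39, 34, 46],
    "GP2": [42, 42, 51, 48, 44, 43, 41, 45, 42, 44, 44],
}


def one_way_anova(groups):
    """Return between-group SS, error SS, their degrees of freedom, and F."""
    values = [x for g in groups.values() for x in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_error = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    df_between = len(groups) - 1
    df_error = len(values) - len(groups)
    f = (ss_between / df_between) / (ss_error / df_error)
    return ss_between, ss_error, df_between, df_error, f


ss_between, ss_error, df_between, df_error, f = one_way_anova(groups)
print(f"Gp:    DF={df_between}  SS={ss_between:.1f}  F={f:.2f}")
print(f"Error: DF={df_error}  SS={ss_error:.1f}  MS={ss_error / df_error:.1f}")
```

With two groups F is the square of the t statistic, which is why the ANOVA and the t-test give the same p-value.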

98
Assumptions
[Figure: Normal Probability Plot of the Residuals (response is Wt), residual against normal score]
[Figure: Residuals Versus the Fitted Values (response is Wt), residual against fitted value]
The t-test and the ANOVA are so called
“parametric” tests. They depend on three
assumptions:
1. That the numbers are independent
observations. This depends on correct
randomisation of independent experimental
units
2. The residuals (deviation of each
observation from its group mean) have a
normal (bell-shaped) distribution.
3. The variation is the same in each group
These “Residuals diagnostic plots” are used
to investigate whether these assumptions
hold. The top one shows residuals versus fits
(group means). All four corners should be
equally filled, as shown.
The bottom one should be a straight line if
the residuals have a normal distribution, as
is the case here.
Scatter of points should be approximately the same
Points should lie on a straight line
Residuals diagnostic plots

99
[Figure: residuals versus fitted values (response is TumCount)]
What if these assumptions are not met?
The ANOVA is quite “robust”. Some deviation
from the assumptions can be tolerated.
However, if the variation is much greater in the
group with a larger mean and the normal plot
is not a straight line then the next step is to try
a transformation.
Top right shows a plot where the residuals for
the smaller counts vary less than those of the
larger counts. The bottom plot shows the
residuals plots following a transformation
X=log(Y+1), where Y is the original value and
one has been added to avoid missing data
when the count was zero (log of zero is
undefined).
An analysis using the transformed data may
be more reliable than one using the raw data,
although this is a marginal case that may not
even need transformation
[Figure: residuals versus fitted values after transformation (response is LogTums+)]
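The transformation X=log(Y+1) is a one-line calculation. In this sketch the counts are made up for illustration, and base 10 logarithms are an assumption, since the slide does not say which base was used.

```python
from math import log10


def log_plus_one(counts):
    """Transform counts as X = log10(Y + 1) so that zero counts stay defined."""
    return [log10(y + 1) for y in counts]


# Hypothetical tumour counts, including a zero.
counts = [0, 9, 99]
print(log_plus_one(counts))
```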

100
More than two treatment groups
A t-test is only suitable for comparing two groups.
But an ANOVA can be used with any number of
treatment groups (if the assumptions are reasonably
well met). It tests the overall null hypothesis that all
the group means are equal against the alternative
hypothesis that at least one of them differs.
But it cannot differentiate between the two situations to
the right. In both cases it just gives an overall p-value.
The most common (but not necessarily the best) way of
finding out which groups differ significantly from each
other is to use “post-hoc comparisons”. These will be
available in all good statistical packages. There are
many different ones. Use the ones in your package.
Statisticians would usually rather use orthogonal
contrasts, but these are not discussed here.
Experiment 1.
Three groups all
different
Experiment 2.
Two groups
the same, one
group different

101
Randomised blocks and the two-way ANOVA
[Figure: apoptosis score (0 to 500) in weeks 1 to 3 for Control, CGP and STAU]
The aim of the experiment, right, was to
determine whether two drugs CGP and
STAU affected apoptosis in rat
thymocytes (compared with the vehicle
control). Each week they humanely killed
one rat and prepared three dishes of
thymocytes which received one of the
three treatments. Apoptosis was scored
after incubation for a fixed period.
This was a small randomised block
experiment with the blocking factor being
Week. Notice the large week-to-week
variation, but a similar relationship
between groups each week.
Raw data
         C       CGP     STAU
Week 1   365     398     421
Week 2   423     432     459
Week 3   308     320     329
Means    365.3   383.3   403.0

102
Two-way ANOVA without interaction
Source   DF   SS      MS      F        P
Block    2    21764   10882   114.82   0.000
Treat    2    2129    1064    11.23    0.023
Error    4    379     94
Total    8    24272
The experiment needs to be analysed
using a 2-way ANOVA without
interaction, shown above. The
overall test of the null hypothesis that
there is no difference among the
treatment groups gives a p-value of
0.023. Hence we reject the null
hypothesis. Notice that most of the
total variation is due to the blocks, but
this is of little interest because we know
that it is very difficult to get identical
absolute measurements each week in
such studies.
The SD is the square root of the error
mean square (√94 = 9.7).
The post-hoc Dunnett’s test shows that
STAU but not CGP differs significantly
from the control. But this is a very small
experiment which lacks power.
          Means   p-value*
Control   365.3   -----
CGP       383.3   0.14
STAU      403.0   0.02
SD        9.7
* Using Dunnett’s test, a post-hoc
test for comparing the means of
treated groups with controls (not
discussed here).
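The sums of squares in this table can be reproduced from the nine raw scores. The sketch below is a standard-library illustration with a hypothetical function name. It partitions the total variation into block, treatment and error parts but does not compute p-values or Dunnett's test, which need a statistical package.

```python
from math import sqrt
from statistics import mean

# Apoptosis scores from the raw data table: one row per week (block).
scores = {
    "Week 1": {"Control": 365, "CGP": 398, "STAU": 421},
    "Week 2": {"Control": 423, "CGP": 432, "STAU": 459},
    "Week 3": {"Control": 308, "CGP": 320, "STAU": 329},
}


def randomised_block_anova(data):
    """Two-way ANOVA without interaction, one observation per block/treatment cell."""
    blocks = list(data)
    treatments = list(data[blocks[0]])
    values = [data[b][t] for b in blocks for t in treatments]
    grand = mean(values)
    ss_total = sum((x - grand) ** 2 for x in values)
    ss_block = len(treatments) * sum(
        (mean(list(data[b].values())) - grand) ** 2 for b in blocks
    )
    ss_treat = len(blocks) * sum(
        (mean([data[b][t] for b in blocks]) - grand) ** 2 for t in treatments
    )
    ss_error = ss_total - ss_block - ss_treat
    df_treat = len(treatments) - 1
    df_error = (len(blocks) - 1) * df_treat
    ms_error = ss_error / df_error
    return {
        "ss_block": ss_block,
        "ss_treat": ss_treat,
        "ss_error": ss_error,
        "f_treat": (ss_treat / df_treat) / ms_error,
        "sd": sqrt(ms_error),
    }


result = randomised_block_anova(scores)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Removing the large week-to-week (block) variation from the error term is what allows such a small experiment to detect the treatment effect at all.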

103
Residual Model Diagnostics
[Figure: four residual diagnostic plots: Normal Plot of Residuals, I Chart of Residuals, Histogram of Residuals, and Residuals vs. Fits]
The residuals diagnostic plots show
that the two assumptions of normality
of residuals and homogeneous
variances appear to be reasonably
well met. (plots top left and bottom
right)

104
Strain          Control  Treated  Strain means
BALB/c          10.10    8.95
                10.08    8.45
                 9.73    8.68
                10.09    8.89     9.37
C57BL            9.60    8.82
                 9.56    8.24
                 9.14    8.18
                 9.20    8.10     8.86
Treatment mean   9.69    8.54
A real example.
We want to know:
1. Does treatment have an effect on RBC counts?
2. Do strains differ in RBC counts?
3. Do strains differ in their response (interaction)?
This data was shown earlier. It can be analysed using a two-way ANOVA
with interaction.
Effect of chloramphenicol (200mg/kg) on Red Blood Cell counts in mice of two strains
The two-way ANOVA (next page)
shows that there are significant
effects associated with strain and
treatment, but no significant
interactions (p=0.40)

105
[Figure: Plot of Means. y-axis: mean of Chloramphenicol2$RBC; x-axis: Chloramphenicol2$Treat (c, t); one line per Chloramphenicol2$Strain (BALB/c, C57BL)]
A plot of the means shows that the reduction is similar in
each strain, and the 2-way ANOVA with interaction confirms
this (the Strain*Treatment term is not significant, p=0.40).
Analysis of Variance for RBCs
Source DF SS MS F P
Strain 1 1.0661 1.0661 17.15 0.001
Treatment 1 5.2785 5.2785 84.92 0.000
Strain*Treatment 1 0.0473 0.0473 0.76 0.400
Error 12 0.7459 0.0622
Total 15 7.1377
The effect can be stated in the original units as 9.69-8.54 = 1.15 units
or it can be expressed in standard deviation units. The standard
deviation is the square root of the error mean square in the ANOVA,
√0.0622 = 0.2494. So the response was 1.15/0.2494 = 4.61 standard
deviations. (But note that this should not be done with smaller sample
sizes, which need a correction factor.)
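
The arithmetic behind this table can be checked directly. Below is a minimal sketch in plain Python, assuming the BALB/c and C57BL counts tabulated earlier and a balanced design (four mice per cell); the names are illustrative, and p-values are left to a statistical package.

```python
# RBC counts: rbc[strain][treatment] holds the four mice in that cell.
rbc = {
    "BALB/c": {"control": [10.10, 10.08, 9.73, 10.09],
               "treated": [8.95, 8.45, 8.68, 8.89]},
    "C57BL":  {"control": [9.60, 9.56, 9.14, 9.20],
               "treated": [8.82, 8.24, 8.18, 8.10]},
}

def mean(xs):
    return sum(xs) / len(xs)

strains = list(rbc)
treats = ["control", "treated"]
n = 4  # mice per cell (balanced design assumed)

all_values = [x for s in strains for t in treats for x in rbc[s][t]]
grand = mean(all_values)

strain_mean = {s: mean(rbc[s]["control"] + rbc[s]["treated"]) for s in strains}
treat_mean = {t: mean([x for s in strains for x in rbc[s][t]]) for t in treats}
cell_mean = {(s, t): mean(rbc[s][t]) for s in strains for t in treats}

# Balanced-design sums of squares.
ss_strain = n * len(treats) * sum((m - grand) ** 2 for m in strain_mean.values())
ss_treat = n * len(strains) * sum((m - grand) ** 2 for m in treat_mean.values())
ss_cells = n * sum((m - grand) ** 2 for m in cell_mean.values())
ss_inter = ss_cells - ss_strain - ss_treat
ss_error = sum((x - cell_mean[(s, t)]) ** 2
               for s in strains for t in treats for x in rbc[s][t])

df_error = len(all_values) - len(strains) * len(treats)  # 16 - 4 = 12
ms_error = ss_error / df_error

# Each effect has 1 degree of freedom here, so its mean square equals its SS.
f_strain = ss_strain / ms_error
f_treat = ss_treat / ms_error
f_inter = ss_inter / ms_error

# Response in standard deviation units (difference / pooled SD).
pooled_sd = ms_error ** 0.5
effect = treat_mean["control"] - treat_mean["treated"]
ses = effect / pooled_sd

print(f"F strain={f_strain:.2f}, treatment={f_treat:.2f}, interaction={f_inter:.2f}")
print(f"Effect={effect:.2f} units, pooled SD={pooled_sd:.4f}, SES={ses:.2f}")
```

The F ratios match the ANOVA table, and the standardised response comes out at about 4.6 standard deviations, as in the text.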

106
Effect of chloramphenicol (2000mg/kg) on RBC count
Strain           Control  Treated  Strain means
C3H              7.85     7.81
                 8.77     7.21
                 8.48     6.96
                 8.22     7.10     7.80
CD-1             9.01     9.18
                 7.76     8.31
                 8.42     8.47
                 8.83     8.67     8.58
Treatment means  8.42     7.96
Here are two different
mouse strains. In this case
the treatment seems to
have reduced the RBC
counts in C3H but not in
CD-1.
Statistical analysis shows
a significant
interaction effect (see
section 11). This can be
seen in a plot of the
means.

107
[Figure: Plot of Means. y-axis: mean of Chlorampehicol$RBC; x-axis: Chlorampehicol$Trt (C, T); one line per Chlorampehicol$Strain (C3H, CD-1)]
In this case the CD-1 outbred stock of mice was
resistant to chloramphenicol at this dose level, but C3H
responded strongly, giving a significant strain by
treatment interaction.
Source Df Sum Sq Mean Sq F value Pr(>F)
Trt 1 0.82356 0.82356 4.4302 0.057057 .
Strain 1 2.44141 2.44141 13.1330 0.003489 **
Trt:Strain 1 1.47016 1.47016 7.9084 0.015686 *
Residuals 12 2.23077 0.18590
Note that this ANOVA was done using the R statistical
package which doesn’t show totals and heads the p-value
column as Pr(>F), but numerically it gives the same results.
Note the interaction is significant at p=0.01569.
In this situation the response should be expressed separately for each strain
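
One way of expressing the response separately for each strain is sketched below in plain Python. It assumes the C3H and CD-1 counts tabulated earlier and takes the pooled SD from the residual mean square in the R output (0.18590); the names are illustrative. With only four mice per group the standardised values are biased upwards, so they are shown here without the small-sample correction mentioned earlier.

```python
control = {"C3H": [7.85, 8.77, 8.48, 8.22], "CD-1": [9.01, 7.76, 8.42, 8.83]}
treated = {"C3H": [7.81, 7.21, 6.96, 7.10], "CD-1": [9.18, 8.31, 8.47, 8.67]}

MS_ERROR = 0.18590              # residual mean square from the ANOVA
pooled_sd = MS_ERROR ** 0.5

def mean(xs):
    return sum(xs) / len(xs)

# For each strain: (treated - control) in original units and in SD units.
effects = {}
for strain in control:
    diff = mean(treated[strain]) - mean(control[strain])
    effects[strain] = (diff, diff / pooled_sd)

for strain, (diff, ses) in effects.items():
    print(f"{strain}: change = {diff:+.2f} units ({ses:+.2f} SD)")
```

C3H shows a fall of roughly one unit, while the change in CD-1 is small and in the opposite direction, which is why a single pooled treatment effect would be misleading here.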

108
A black hooded rat.

109
Presenting your results
The aim of your scientific paper or report is to
communicate your results as clearly and concisely as
possible.
Sufficient information should be provided to enable
somebody else to repeat the experiments. The ARRIVE
guidelines, given in the next section, can be used as a
check-list to ensure that nothing has been forgotten.
This section gives some general advice on the
presentation of the numerical results.
Decimal places
Means, medians and standard deviations should normally
be given to no more than three significant digits, e.g. 13.3,
0.0124.

110
Standard deviation, Standard Error or Confidence interval?
1. A standard deviation (SD) is used to describe individual
variability.
2. A standard error (SE or, better, SEM) is used to describe the
variability of means.
3. A confidence interval (CI) is used to indicate the range within
which we can be reasonably sure the true mean lies.
4. In all three cases it is important to know the numbers in each
group.
5. Rather than using a ± it is better to use a designation such as
“Mean = 10.1 (SD 1.5, n=8)” or “Mean = 10.1 (SE 1.5, n=8)” so
that there can be no confusion between standard error and
standard deviation.
6. When two means are being compared, the size of the difference
between them should be quoted, with a confidence interval.
7. The difference could also be expressed as the Standardised
Effect Size, SES (the difference divided by the pooled SD).
However, this is biased upwards if “n” is very low (say less than
10). The SES is a ratio without units and can be used to
compare different characters.
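
The points above can be illustrated with a short calculation. This is a sketch only, using Python's standard library and, as example data, the BALB/c control and treated RBC counts from section 11. The critical value of t (2.447 for 6 degrees of freedom) is taken from tables and hard-coded; the function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

control = [10.10, 10.08, 9.73, 10.09]
treated = [8.95, 8.45, 8.68, 8.89]

def describe(values):
    """Unambiguous summary: mean with SD and n stated explicitly."""
    return f"Mean = {mean(values):.2f} (SD {stdev(values):.2f}, n={len(values)})"

def sem(values):
    """Standard error of the mean."""
    return stdev(values) / sqrt(len(values))

n1, n2 = len(control), len(treated)

# Pooled SD across the two groups (assumes similar variation in each).
pooled_var = ((n1 - 1) * stdev(control) ** 2 +
              (n2 - 1) * stdev(treated) ** 2) / (n1 + n2 - 2)
pooled_sd = sqrt(pooled_var)

# Difference between means with a 95% confidence interval.
diff = mean(control) - mean(treated)
se_diff = pooled_sd * sqrt(1 / n1 + 1 / n2)
T_CRIT = 2.447  # two-sided 95% value of t for n1 + n2 - 2 = 6 df (from tables)
ci_low, ci_high = diff - T_CRIT * se_diff, diff + T_CRIT * se_diff

# Standardised effect size (unitless).
ses = diff / pooled_sd

print(describe(control))
print(describe(treated))
print(f"SEM control = {sem(control):.3f}, SEM treated = {sem(treated):.3f}")
print(f"Difference = {diff:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"SES = {ses:.2f}")
```

With only four animals per group the SES printed here will be biased upwards, as noted in point 7.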

111
When medians are being quoted, the 25 and 75%
centiles can be given.
Where means are tabulated, they should be shown in
columns rather than rows as this makes it easier to
compare them.
If the means have been compared using an analysis
of variance, then the assumption will have been made
(and tested using residuals plots) that the variation is
the same in each group. In this case a pooled
standard deviation should be quoted rather than
showing separate SDs for each mean.
When an analysis of variance has been used to
analyse the results, an F-value should be quoted
with numerator and denominator degrees of freedom,
as well as a p-value (e.g. F(3, 9) = 3.91, p = 0.049).

115
• Your papers should be written in such a way that other
scientists can replicate your results.
• The ARRIVE guidelines shown in this section can be used
as a check-list to make sure that you have not forgotten
anything.
• The points that they make may seem obvious, but many of
the errors discussed in section 1 are the result of poorly
written papers.
• Other errors are the result of poorly designed experiments.
If you have got this far in this document, your experiments
should have been reasonably well designed.
• Remember that a picture is said to be worth a thousand
words, but make sure that it is worthwhile. Presenting
means as bar diagrams may be good, but in some cases it
is just a waste of space.

117
The ARRIVE Guidelines
Concerns about the quality of research involving animals
were expressed in Section 1 “Why bother”. Anyone who
has worked through this presentation should have a better
idea of how to design and analyse an animal experiment,
although the discussion of the statistical analysis needs to
be supported by a good statistical textbook and software.
But writing the paper is a major bottleneck, and it doesn’t
always get the attention that it deserves. Information vital
for assessing the importance and reliability of a paper is
often missing.
The ARRIVE (Animals in Research: Reporting In Vivo
Experiments) Guidelines are based on the CONSORT
statement for randomised clinical trials (see references
below), which was published in 2001 and is now widely
used when reporting clinical research.
The following pages list the 20 items which need to be
taken into account when writing a paper involving the use
of laboratory animals.
Moher D, Schulz KF, Altman DG, for the CONSORT Group (2001) The CONSORT
statement: revised recommendations for improving the quality of reports of
parallel-group randomised trials. Lancet 357: 1191–1194.
Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG (2010) Improving
bioscience research reporting: the ARRIVE guidelines for reporting animal
research. PLoS Biol 8: e1000412.

118
The ARRIVE guidelines
(re-formatted from the original publication)
1. TITLE Provide as accurate and concise a description of the
content of the article as possible.
2. ABSTRACT Provide an accurate summary of the
background, research objectives (including details of the
species or strain of animal used), key methods, principal
findings, and conclusions of the study.
INTRODUCTION
3. Background.
• a. Include sufficient scientific background (including
relevant references to previous work) to understand the
motivation and context for the study, and explain the
experimental approach and rationale.
• b. Explain how and why the animal species and model
being used can address the scientific objectives and, where
appropriate, the study’s relevance to human biology.
4. Objectives. Clearly describe the primary and any secondary
objectives of the study, or specific hypotheses being tested.

119
METHODS
5. Ethical statement
• Indicate the nature of the ethical review permissions, relevant
licences (e.g. Animal [Scientific Procedures] Act 1986), and national
or institutional guidelines for the care and use of animals, that cover
the research.
6. Study design For each experiment, give brief details of the study
design, including:
• a. The number of experimental and control groups.
• b. Any steps taken to minimise the effects of subjective bias when
allocating animals to treatment (e.g., randomisation procedure) and
when assessing results (e.g., if done, describe who was blinded and
when).
• c. The experimental unit (e.g. a single animal, group, or cage of
animals). A time-line diagram or flow chart can be useful to illustrate
how complex study designs were carried out.

120
METHODS (continued)
7. Experimental procedures.
For each experiment and each experimental group, including controls,
provide precise details of all procedures carried out. For example:
• a. How (e.g., drug formulation and dose, site and route of administration,
anaesthesia and analgesia used [including monitoring], surgical
procedure, method of euthanasia). Provide details of any specialist
equipment used, including supplier(s).
• b. When (e.g., time of day).
• c. Where (e.g., home cage, laboratory, water maze).
• d. Why (e.g., rationale for choice of specific anaesthetic, route of
administration, drug dose used).
8. Experimental animals
• a. Provide details of the animals used, including species, strain, sex,
developmental stage (e.g., mean or median age plus age range), and
weight (e.g., mean or median weight plus weight range).
• b. Provide further relevant information such as the source of animals,
international strain nomenclature, genetic modification status (e.g.
knock-out or transgenic), genotype, health/immune status, drug- or
test-naive, previous procedures, etc.

121
METHODS (continued)
9. Housing and husbandry
Provide details of:
• a. Housing (e.g., type of facility, e.g., specific pathogen free (SPF); type
of cage or housing; bedding material; number of cage companions;
tank shape and material etc. for fish).
• b. Husbandry conditions (e.g., breeding programme, light/dark cycle,
temperature, quality of water etc. for fish, type of food, access to food
and water, environmental enrichment).
• c. Welfare-related assessments and interventions that were carried out
before, during, or after the experiment.
10. Sample size
• a. Specify the total number of animals used in each experiment and the
number of animals in each experimental group.
• b. Explain how the number of animals was decided. Provide details of
any sample size calculation used.
• c. Indicate the number of independent replications of each experiment,
if relevant.

122
METHODS (continued)
11. Allocating animals to experimental groups
• a. Give full details of how animals were allocated to experimental
groups, including randomisation or matching if done.
• b. Describe the order in which the animals in the different experimental
groups were treated and assessed.
12. Experimental outcomes
• Clearly define the primary and secondary experimental outcomes
assessed (e.g., cell death, molecular markers, behavioural changes).
13. Statistical methods
• a. Provide details of the statistical methods used for each analysis.
• b. Specify the unit of analysis for each dataset (e.g. single animal, group
of animals, single neuron).
• c. Describe any methods used to assess whether the data met the
assumptions of the statistical approach.

123
RESULTS
14. Baseline data
• For each experimental group, report relevant characteristics and health
status of animals (e.g., weight, microbiological status, and drug- or test-
naive) before treatment or testing (this information can often be tabulated).
15. Numbers analysed
• a. Report the number of animals in each group included in each analysis.
Report absolute numbers (e.g. 10/20, not 50%).
• b. If any animals or data were not included in the analysis, explain why.
16. Outcomes and estimation
• Report the results for each analysis carried out, with a measure of
precision (e.g., standard error or confidence interval).
17. Adverse events
• a. Give details of all important adverse events in each experimental group.
• b. Describe any modifications to the experimental protocols made to
reduce adverse events.

124
DISCUSSION
18. Interpretation/scientific implications
• a. Interpret the results, taking into account the study objectives and
hypotheses, current theory, and other relevant studies in the
literature.
• b. Comment on the study limitations including any potential sources
of bias, any limitations of the animal model, and the imprecision
associated with the results.
• c. Describe any implications of your experimental methods or findings
for the replacement, refinement, or reduction (the 3Rs) of the use of
animals in research.
19. Generalisability/translation.
Comment on whether, and how, the findings of this study are likely to
translate to other species or systems, including any relevance to
human biology.
20. Funding.
List all funding sources (including grant number) and the role of the
funder(s) in the study.


126
Question 1
An investigator plans an experiment with a control and treated
group, but reduces the numbers in the treated group and
increases those in the control group because she fears that the
treated animals may experience pain. Is this an example of:
1. Replacement
2. Refinement
3. Reduction
Question 2
Which is the best way to randomise 12 animals all in the same
cage to three treatment groups A, B and C?
1. Roll a die and assign the first animal to group A if the die
shows 1 or 2, to group B if it shows 3 or 4 or to group C if it shows
5 or 6.
2. Assign the first animal to group A, the next to Group B and the
third to group C, and repeat this four times
3. Assign the first four animals to group A, the next four to group
B and the last four to group C.
4. Use EXCEL to randomise a column with 4 As, 4 Bs and 4 Cs
and assign the animals according to the random sequence

127
Feedback
Question 1.
This is an example of Refinement because the aim is to reduce over-all
suffering. Many people also feel that it is better for large numbers of
animals to have a mild stress rather than fewer having more severe
stress/pain, although this is not the situation here.
Question 2.
Option 1. Although random, this would not necessarily result in the same
number of animals in each group, so it would not be good.
Option 2. This does not assign the animals at random. If the first animals
are easiest to catch there would be a tendency for group A to have more
easily caught animals than group C.
Option 3. This would be even worse than option 2.
Option 4. This is the best method. It will result in equal numbers per group,
and the animals are truly assigned at random
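
Option 4 does not have to be done in EXCEL. As a minimal sketch, the same idea (shuffle a column of 4 As, 4 Bs and 4 Cs, then assign animals in order) can be written in Python; the function name, the animal numbering and the optional seed are illustrative assumptions.

```python
import random

def randomise(animal_ids, groups, seed=None):
    """Return {animal_id: group} with equal group sizes in random order."""
    if len(animal_ids) % len(groups) != 0:
        raise ValueError("number of animals must be a multiple of the number of groups")
    per_group = len(animal_ids) // len(groups)
    # Build the balanced list of labels, e.g. 4 As, 4 Bs and 4 Cs.
    labels = [g for g in groups for _ in range(per_group)]
    # Shuffle the labels; a seed makes the sequence reproducible for records.
    random.Random(seed).shuffle(labels)
    return dict(zip(animal_ids, labels))

allocation = randomise(list(range(1, 13)), ["A", "B", "C"], seed=2011)
for animal, group in allocation.items():
    print(f"Animal {animal:2d} -> group {group}")
```

Animals are numbered in the order in which they are caught, so the order of catching cannot influence which group an animal ends up in.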

128
Question 3. Properties of inbred strains and outbred stocks of mice and rats
Assign each property to: Inbred strains, Outbred stocks, or Both.
Cheaper to buy
Phenotypically more uniform
Genetically more stable and less likely to change
Easier genetic quality control
Most commonly used by toxicologists
Most commonly used by geneticists
Like an immortal clone of genetically identical individuals
Large strain differences
Characteristics may change following selective breeding
Well established and widely used strain nomenclature

129
4. What are the “3Rs” of humane experimental technique?
Relocation
Removable
Replacement
Research
Resolve
Resource
Results
Resurface
Revealed
Rewrite
Radish
Reading
Ready
Real
Relish
Reduction
Refinement
Refreshments
Related
Re-locatable

130
In the Power Analysis method of determining sample size, if
you make the changes noted below, how would it alter the
group (sample) size needed?
Sample size would be: Increased / Decreased
Power. Increase it from 80% to 90%
Significance level. Increase it from 0.05 to 0.10
Standard deviation. Decrease it by choosing uniform animals
Alternative hypothesis. Make it one instead of two sided
Effect size. Increase it by increasing the dose
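
One way to check your answers to this question is to try each change in a sample size formula. The sketch below uses the usual normal approximation for comparing two means, n per group ≈ 2((z_alpha + z_power) × SD / effect)², which is an assumption made here for illustration and gives slightly smaller numbers than an exact calculation based on the t distribution. The baseline values (effect 1.0, SD 1.0) and the sizes of the changes in SD and effect are arbitrary.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect, sd, alpha=0.05, power=0.80, two_sided=True):
    """Approximate animals per group for comparing two means."""
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)    # critical value for the significance level
    z_power = z.inv_cdf(power)       # value corresponding to the required power
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

baseline = n_per_group(effect=1.0, sd=1.0)

changes = {
    "Power 80% -> 90%": n_per_group(1.0, 1.0, power=0.90),
    "Significance level 0.05 -> 0.10": n_per_group(1.0, 1.0, alpha=0.10),
    "Standard deviation 1.0 -> 0.8": n_per_group(1.0, 0.8),
    "One-sided alternative": n_per_group(1.0, 1.0, two_sided=False),
    "Effect size 1.0 -> 1.25": n_per_group(1.25, 1.0),
}

print(f"Baseline: {baseline} per group")
for label, n in changes.items():
    direction = "increased" if n > baseline else "decreased" if n < baseline else "unchanged"
    print(f"{label}: {n} per group ({direction})")
```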

131
An investigator wants to find out whether a drug affects
activity of mice when given access to a running wheel
over a period of a week. The drug will be given in the
food.
He plans to use two strains of mice, three dose levels
and both sexes in a factorial design.
Using the Resource Equation method, what group
size could be recommended?
2-3
4-5
6-7
8-9
>9

132
Feedback
With three doses, two strains and both sexes there
are 12 treatment groups altogether.
With two animals per group E= 24-12 = 12
With three animals per group E=36-12= 24
E should be between about 10 and 20, so 2-3
animals per group would be adequate.
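
The calculation in this feedback can be written as a few lines of Python. This is a sketch assuming the form of the Resource Equation used above, E = (total number of animals) - (number of treatment groups), with E aimed at roughly 10 to 20; the function name is illustrative.

```python
def resource_equation_e(n_groups, n_per_group):
    """E = total number of animals minus number of treatment groups."""
    return n_groups * n_per_group - n_groups

# Two strains x three dose levels x two sexes.
n_groups = 2 * 3 * 2

for n in range(2, 6):
    print(f"{n} per group: E = {resource_equation_e(n_groups, n)}")

# Group sizes for which E falls strictly within 10 to 20.
acceptable = [n for n in range(2, 10)
              if 10 <= resource_equation_e(n_groups, n) <= 20]
print("Group sizes with 10 <= E <= 20:", acceptable)
```

Taken strictly, only two animals per group gives an E between 10 and 20; three per group gives E = 24, slightly above the range, which the feedback treats as acceptable because the limits are approximate.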

134
Most important points
State clearly the purpose of the study
Explain why you have chosen a particular animal model
Think about the 3Rs in relation to your experiments
Explicitly identify your experimental unit
Explain how you decided sample size (power analysis,
resource equation or fixed by availability)
Explain how the experimental units were randomised to the
treatment groups
Use coded samples where possible to blind yourself (and
others) to which treatment group a subject belongs
Think about ways of reducing the variability to increase power
e.g. optimum/non-stressful housing, freedom from disease
If using rodents, use inbred strains or justify not using
them

135
Most important points (continued)
Choose a suitable experimental design (completely randomised,
randomised block, Latin square, split plot etc)
Consider using a factorial design to explore generality of your
results
Decide how you are going to do the statistical analysis before
starting the experiment, recognising that methods may need to
be modified when the results are obtained.
Choose a good statistical package and learn how to use it.
Make use of graphical methods, particularly those showing
individual points, to screen and display your results
Consider quoting/displaying the results in standard
deviation units (this will also help those doing a meta-
analysis)

136
Most important points (continued)
Learn some statistics (buy a good statistics textbook,
take a course on statistics)
Learn to use the analysis of variance
One way (completely randomised design)
Two-way without interaction (randomised block
design)
Two-or-more-way with interaction (factorial
design)
Be very honest about deleting observations, but try
analysis with/without to see if they make any difference
Use the ARRIVE guidelines to ensure that you have not
missed anything when writing your paper/thesis

137
A fancy guinea-pig.
