2. Chapter 1: Introduction
Statistics: A field of study concerned with:
collection, organization, analysis, summarization and
interpretation of numerical data, &
the drawing of inferences about a body of data when only a small
part of the data is observed.
The subject of statistics covers:
the design of a study
the collection of data
the analysis of data
the presentation of suitably summarized information, often in a
graphical or tabular form
The interpretation of the analyses in a manner which
communicates the findings accurately
3. Biostatistics: it is the application of statistical methods to the
fields of biological and medical sciences.
Concerned with interpretation of biological data & the
communication of information derived from these data
Has central role in medical investigations
The numbers must be presented in such a way that valid
interpretations are possible
4. Why statistics?
Do research and publish scientific literature
Integral component of epidemiology
Risk analysis and predictions
Analysis of data from diagnostic services
Analysis of data from pharmaceutical and agrochemical
industries
Safety and quality of food for human consumption
5. Uses of Biostatistics
• Provide methods of organizing information
• Assessment of health status
• Health program evaluation
• Resource allocation
• Magnitude of association
– Strong vs weak association between exposure and
outcome
Example: Feeding vs Production
Health vs Production
6. Uses of biostatistics
• Assessing risk factors
– Cause & effect relationship (e.g. Environment/Housing vs Production)
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of animals free from
the disease is greater among the vaccinated than the
unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population
7. What does biostatistics cover?
Research Planning
Design
Execution (Data collection)
Data Processing
Data Analysis
Presentation
Interpretation
Publication
Biostatistical thinking contributes to every step of a research project
The best way to learn about biostatistics is to follow the flow of a
research project from inception to the final publication
8. Analysis
• Analysis is the major part of learning about biostatistics
– There are dozens of different methods of analysis, which
makes it difficult to choose the correct method for a
particular case
– It is necessary to consider the philosophy that underlies all
methods of analysis:
• Use data from a sample to draw inference about a wider
population
9. Analysis
The raw data are meaningless unless certain statistical treatment
is given to them.
Analysis of data means to make the raw data meaningful or to
draw some information from the data
Thus, the analysis of data serves the following main functions:
• To make the raw data meaningful
• To test the null hypothesis
• To test statistical significance
• To draw inferences or make generalizations
• To estimate parameters (using sample statistics to estimate
population parameters)
10. Interpretation
• Interpretation of results of statistical analysis is not always
straightforward, but is simpler when the study has a clear
aim.
• If the study has been well designed and correctly analyzed
the interpretation of results can be fairly simple.
11. Types of Statistics
1. Descriptive statistics:
Ways of organizing and summarizing data
Helps to identify the general features and trends in a set of
data and extracting useful information
Also very important in conveying the final results of a study
Example: tables, graphs, numerical summary measures
12. Types of Statistics
2. Inferential statistics:
• Methods used for drawing conclusions about a population
based on the information obtained from a sample of
observations drawn from that population
Example: Principles of probability, estimation, confidence
interval, comparison of two or more means or
proportions, hypothesis testing, etc.
13. Statistical variables and data
• A variable is a set of observations on a particular character that
can take values which vary from individual to individual or group
to group,
• e.g. height, weight, housing, blood count, enzyme activity,
coat colour, percentage of a flock which are pregnant, which
are diseased etc…
• Data are records of measurement, counts or observations of
variables.
• Examples of data are records of weights of calves in kg, milk
yield of cows in liter, male or female sex, and black or white
coat color of cattle.
14. Types of Data
1. Primary data: collected from the items or individual
respondents directly by the researcher for the purpose of a
study.
2. Secondary data: data that were collected and statistically treated
by other people or organizations, and whose information is then
used by others for a different purpose.
• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation etc…
15. Types of variables
Categorical (qualitative)
  Nominal scale
  Ordinal scale
Numerical (quantitative)
  Discrete variable
  Continuous variable
• Nominal scale (classification or group): the distinct categories
which define the variable are unordered and each can be assigned
a name,
• It has categories that cannot be ranked.
e.g. coat colors (black, white)
Sex (male or female)
Breed (local, exotic)
16. Types of variable…
• Ordinal scale (ranked variables: small, medium, large): the
categories which constitute the variable have some intrinsic order
but there are no consistent and defined intervals between the
various categories;
• An ordinal variable has categories that can be ranked.
For example: Body condition scores, Degree of vigor of
motility of larvae.
These “scales” are often given numerical values 1 to n. However,
the differences among those numbers do not have numerical
meaning.
Those scores depict categories, but not a numerical scale.
17. Types of variable…
Quantitative variables
• Consisting of numerical values (true numerical measurement)
on a well defined scale (measurement unit).
• Quantitative data relate to amounts, rather than just indicating
classes,
• These data may be further divided into: discrete and
continuous.
18. Types of variable…
• A discrete variable can take only one of a specified set of values,
such as whole numbers. Discrete data are often counts, i.e. they
are countable. For example:
• The number of ticks collected from animals
• Number of animals per households
• Number of parasite eggs per gram of feces
• A continuous variable can, in theory, take any value within a
defined range, including any value between two other values
(though the range can be infinite). Examples are body weight,
height, milk yield, temperature, and antibody titre.
20. Types of variable in a statistical model
•Dependent vs. Independent variables:
• Dependent (response variables or outcome variable) which
vary depending on the effects of Independent variables
Examples: Weight, milk yield, Disease status (i.e. its
presence or absence)
• Independent (explanatory or predictor) variables are those
variables that affect the dependent variables
Examples: can be sex, age, environment, breed,
management, genotype etc
21. Data coding
Data Coding is an analytical process in which data are categorized by
numerical value to facilitate analysis.
Coding means the transformation of data into a form understandable by
computer (statistical) software.
Both qualitative and quantitative data can be coded to make
computation with statistical software easier.
Questionnaire data can be pre-coded (process of assigning codes to
expected answers on designed questionnaire),
field-coded (process of assigning codes as soon as data is available,
usually during fieldwork),
post-coded (coding of open questions on completed questionnaires) or
office-coded (done after fieldwork).
22. Examples of data coding
Qualitative variables
Variable                Category        Code
Sex                     male            1
                        female          0
Physiological status    pregnant        1
                        lactating       2
                        dry             3
Body condition          good            1
                        poor            0
Mastitis                positive        1
                        negative        0

Quantitative variables
Variable                Category        Code
Cattle herd size        < 40            0
                        > 40            1
Animal age              < 4 years       0
                        > 4 years       1
Milk yield              1 liter         1
                        2 – 3 liters    2
                        above 3 liters  3
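The coding scheme in the table can be applied with a simple lookup. The following is a minimal sketch, not a prescribed procedure: the codes for sex and mastitis come from the table, while the names `SEX_CODES`, `MASTITIS_CODES`, `code_records` and the three animal records are hypothetical and used only for illustration.

```python
# Code books taken from the example table; all names here are
# illustrative assumptions, not part of the original slides.
SEX_CODES = {"male": 1, "female": 0}
MASTITIS_CODES = {"positive": 1, "negative": 0}


def code_records(records, codebook, field):
    """Return the numeric code of one field for every record."""
    return [codebook[rec[field].strip().lower()] for rec in records]


# Hypothetical field records (illustration only).
animals = [
    {"sex": "Male", "mastitis": "negative"},
    {"sex": "female", "mastitis": "Positive"},
    {"sex": "Female", "mastitis": "negative"},
]

sex_coded = code_records(animals, SEX_CODES, "sex")
mastitis_coded = code_records(animals, MASTITIS_CODES, "mastitis")
print(sex_coded, mastitis_coded)
```

Normalizing the text (trimming spaces, lower-casing) before the lookup avoids the same category receiving no code because of inconsistent spelling in the field sheets.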
23. Chapter 2:
Strategies For Understanding The Meanings Of Data
Data is collected with the intention of gathering (assembling)
information
Information can be easily obtained from raw data when the
data set comprises relatively few observations made on a small
group of animals
As the number of observations becomes high, it is difficult to
obtain an overall ‘picture’ of the data
The primary stage in the process of obtaining this picture is to
organize the data to establish how often different values occur
(frequency distributions).
24. Data description…
The next step is to further condense the data, reducing to a
manageable size and obtain a snapshot view as an aid for
understanding and interpretation
There are various methods we adopt
Tables to exhibit features of the data
Diagrams to illustrate patterns
Numerical measures to summarize the data
25. Data description…
Graphical presentations of qualitative variables can include bar,
column or pie-charts.
When describing qualitative data each observation is assigned
to a specific category. Data are then described by the number of
observations in each category or by the proportion of the total
number of observations.
The most widely used graph for presentation of quantitative data
is a histogram.
In order to present a distribution, the quantitative data are
partitioned into classes and the histogram shows the number or
relative frequency of observations for each class.
26. Frequency distribution
A frequency distribution shows the frequencies of occurrence of
the observation in a data set.
When making frequency distributions, it is vital to distinguish
between categorical and quantitative variables.
When a variable is categorical, the frequency is the number of
observations in each class or category of the variable.
When the variable is quantitative, classes can be created from
non-overlapping, preferably equal, intervals
27. Frequency distribution
The number of observations belonging to each class is the class
frequency; the set of class frequencies makes up the frequency
distribution
The frequency distribution is presented in the form of a table or
a bar chart (discrete variable) or a histogram (continuous
variables)
Relative frequency refers to the proportion or percentage of
observations in each class or category
The sum of the relative frequencies of all the categories is unity
(or 100%) apart from rounding errors
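As a small illustration of class frequencies and relative frequencies for a categorical variable, the sketch below counts hypothetical coat-colour records. The data and variable names are assumptions made for the example only.

```python
from collections import Counter

# Hypothetical coat colours recorded for ten cattle (illustration only).
coat_colours = ["black", "white", "black", "black", "white",
                "black", "white", "black", "black", "white"]

frequency = Counter(coat_colours)          # class frequencies
total = sum(frequency.values())
relative = {cat: count / total for cat, count in frequency.items()}

# Print a small frequency table: category, frequency, relative frequency.
for category in sorted(frequency):
    print(f"{category:<6} {frequency[category]:>3} {relative[category]:6.1%}")
```

The relative frequencies add up to 1 (100%), as stated above.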
28. Tables
A table is an orderly arrangement of observations, usually
numbers, in rows and columns,
The layout of the table will be dictated by the data, and
therefore will vary for different types of data
Table 1. Percentage of the households’ sources of income
29. Diagrams
A diagram is a graphic representation of data and may take
several forms (Chart, graph and Schematic)
It is often easier to distinguish important patterns from a
diagram rather than a table,
Diagrams are also useful for conveying information quickly
30. Categorical data
Bar chart
Is a diagram in which every category of the variable is
represented;
The length of each bar, which should be of constant width,
depicts the number or percentage of individuals belonging
to that category.
The length of the bar is proportional to the frequency in the
relevant category, so it is essential that the scale showing the
frequency should start at zero for each bar
31. Categorical data
Figure xx: Prevalence of (a) dairy cattle diseases and (b) ticks in
cattle over time in a pastoral region
32. Pie chart
Is a circle divided into segments with each segment portraying a different
category of the qualitative variable.
The total area of the circle represents the total frequency or
percentage, and the area of a given sector is proportional to the
percentage of individuals falling into that category.
A pie chart should include a statement of the percentage or actual
number of individuals in each segment
Generally, a bar chart is preferable to a pie chart as the former is easier
to construct and is more useful for comparative purposes, partly because
it is easier to compare lengths by eye than angles
34. Quantitative data
When the data are quantitative, we may use Dot plot, Histogram,
Scatter plot, line graph, Box plot, Stem and Leaf
Dot diagram
If the data set is of a manageable size, the best way of displaying it is
to show every value in a dot diagram/plot
Fig xx. Dot diagram of mean daily tick count of different
species around Hawassa
35. Histogram
Histogram is a two-dimensional diagram in which usually the
horizontal axis represents the units of the measurement of the
variable of interest, with each class interval being clearly
delineated
To construct a histogram, the data range is divided into 5 to 20
classes (bins) of equal width
Range = maximum value – minimum value
If the intervals are of equal width, then the height of the bin
(rectangle) is proportional to the frequency
Histogram gives a good picture of the frequency distribution of
quantitative variables
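The construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `equal_width_classes` is hypothetical, the choice of 5 classes is arbitrary, and the data are the 14 milk progesterone values (ng/ml) that appear in an exercise later in these notes.

```python
def equal_width_classes(values, n_classes):
    """Split the data range into n_classes equal-width classes and count values."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; the range is zero")
    width = (high - low) / n_classes       # class width = range / number of classes
    counts = [0] * n_classes
    for v in values:
        # The maximum lies on the last edge, so clamp it into the last class.
        index = min(int((v - low) / width), n_classes - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_classes + 1)]
    return edges, counts


progesterone = [4.37, 4.87, 4.35, 3.92, 4.68, 4.54, 5.24,
                4.57, 4.59, 4.66, 4.40, 4.73, 4.83, 4.21]
edges, counts = equal_width_classes(progesterone, 5)

# A text histogram: one star per observation in each class.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {'*' * count}")
```

Because the classes have equal width, the number of stars (the bar height) is proportional to the class frequency.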
36. Histogram…
The distribution is symmetrical if its shape to the right of a
central value is a mirror image of that to the left of the central
value
A histogram can be used to assess whether the data are
approximately normally distributed
The tails of the frequency distribution represent the
frequencies at the extremes of the distribution
The frequency distribution is skewed to the right (positively
skewed) if the right-hand tail is extended
The frequency distribution is skewed to the left (negatively
skewed) if the left-hand tail is extended
It is common to find biological data which are skewed to the
right
38. Line Graph
Line graphs compare two variables, one plotted along the x-axis
and the other along the y-axis.
They show specific values of data and trends in the data, and
enable the viewer to make predictions.
Figure xx: Line graph showing mean monthly minimum, maximum
and average temperature for Borana areas (1976-2011)
39. Box- plot
The scale of measurement of the variable is usually drawn
vertically
The diagram comprises a box whose horizontal limits mark the
upper and the lower quartiles; the box therefore spans the
interquartile range, i.e. the central 50% of the observations,
with the median marked by a horizontal line within the box
The whiskers extend from the box to the extremes of the data,
which may be taken as the minimum and maximum values of the
set of observations or as the 2.5th and 97.5th percentiles
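The quantities drawn in a box plot can be computed directly. The sketch below uses the goat infection rates (%) from an exercise later in these notes and Python's standard `statistics.quantiles`; note that quartile conventions differ between software packages, so the quartiles from other programs may differ slightly. The variable names are assumptions for illustration.

```python
import statistics

# Infection rates (%) taken from the exercise later in these notes.
rates = [44.4, 67.6, 76.2, 64.7, 80.0, 64.2, 75.0, 34.2, 29.2]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(rates, n=4)
iqr = q3 - q1                      # interquartile range (height of the box)

summary = {
    "minimum": min(rates),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "maximum": max(rates),
    "IQR": iqr,
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.1f}")
```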
40. Fig xx. Box plot showing livestock wealth by species diversity
(horizontal axis: number of livestock species, 1 to 6; vertical axis
scale: 0 to 100)
41. Scatter diagram
The scatter diagram is an effective way of presenting data when
we are interested in trends and relationship between two
variables.
The diagram is a two-dimensional plot in which each axis
represents the scale of measurement of one of the two variables.
Using this rectangular co-ordinate system, we relate the value
for an individual on the horizontal scale to the corresponding
value for that individual on the vertical scale by marking with
an appropriate symbol
The points can be joined to produce a line graph, or a line can be
drawn which best represents the relationship
42. Fig xx. Relationship of cattle population with rainfall in Borana
(between 1976 and 2011)
43. Stem and Leaf
Each value is divided into two parts, ‘Stem’ and ‘Leaf’. ‘Stem’
corresponds to higher decimal places, and ‘Leaf’ corresponds to
lower decimal places.
‘Stems’ are sorted in ascending order in the first column.
The appropriate ‘Leaf’ for each observation is recorded in the
row with the appropriate ‘Stem’
Fig xx. A ‘Stem and Leaf’ plot of
the weights of calves
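A stem-and-leaf display is easy to build by splitting each value into tens (stem) and units (leaf). The following is a minimal sketch; the function name `stem_and_leaf` and the calf weights are hypothetical, chosen only to show the layout.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} using tens as stems and units as leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical calf weights in kg (illustration only).
weights = [32, 35, 41, 44, 44, 47, 50, 53, 58, 61]
plot = stem_and_leaf(weights)

# Stems in ascending order, each followed by its leaves.
for stem, leaves in sorted(plot.items()):
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```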
44. Numerical measures of description
If we are able to determine some form of average that measures
the central tendency of the data set, and if we know how widely
scattered the observations are in either direction from that
average, then we will have a reasonable ‘picture’ of the data.
These two characteristics of a set of observations measured on a
numerical variable are known as
Measures of location (averages, Central Tendency)
give useful information about the center of the data
Measures of dispersion (spread)
show how “spread out” the numbers are about the center.
45. Measures of location
The tendency of statistical data to get concentrated at certain
values is called the “Central Tendency” and
The various methods of determining the actual value at which the
data tend to concentrate are called measures of central
Tendency or averages.
Hence, an average is a value which tends to sum up or describe the
mass of the data.
Measures of central tendency are numbers that tell us where the
majority of values in the distribution are located
Common measures of central tendency are Mean, Median and
Mode.
46. Measures of location
1. Arithmetic mean
is the most commonly used measure of location.
It is obtained by adding together the observations in a data set
and dividing by the number of observations in the set
The mean has the disadvantage that its value is influenced by
outliers
An outlier is an observation whose value is highly inconsistent
with the main body of the data.
An outlier with an excessively large value will tend to increase
the mean unduly, whilst a particularly small value will decrease it.
47. The mean is especially appropriate as a measure of location if the
observations were sampled from a symmetrical distribution.
The mean can be misleading if there are any extreme values in a group
of numbers.
For example, the mean of the group 1, 2, 3, 2, 4, 5,19 is 5.1. The
value 19 is an extreme value, since it is far higher than any of the
other numbers in the group. Since only one of the values in the
group is actually 5.1 or greater, the mean is not representative of
the group.
In this case, the median may provide a better representation.
The mean will be ‘pulled’ to the right (increased in value) if the
distribution is skewed to the right, and ‘pulled’ to the left (decreased in
value) if the distribution is skewed to the left.
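48. The arithmetic mean of a sample of n numbers y1,y2,..., yn is:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
The arithmetic mean for grouped data is:
$$\bar{y} = \frac{\sum_{i} f_i y_i}{\sum_{i} f_i}$$
where $y_i$ is the value (or midpoint) of class i and $f_i$ is its frequency.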
The arithmetic mean and the median are close or equal in value if
the distribution is symmetrical.
Geometric mean: It is obtained by taking the nth root of the
product of “n” values, i.e. if the values of the observations are
denoted by x1, x2, …, xn then
$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
It is preferable to the arithmetic mean if the series of observations
contains one or more unusually large values.
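As a rough check of the geometric mean, the sketch below computes it in two equivalent ways for the group 1, 2, 3, 2, 4, 5, 19 used earlier in these notes: as the nth root of the product, and as the back-transformed mean of the logarithms. Variable names are assumptions for illustration; all values must be positive for this to work.

```python
import math

values = [1, 2, 3, 2, 4, 5, 19]      # group used earlier in these notes
n = len(values)

# nth root of the product of the n values
gm_root = math.prod(values) ** (1 / n)

# equivalent: exponentiate the mean of the natural logarithms
gm_log = math.exp(sum(math.log(v) for v in values) / n)

arithmetic_mean = sum(values) / n
print(f"geometric mean:  {gm_root:.2f}")
print(f"arithmetic mean: {arithmetic_mean:.2f}")
```

The geometric mean is pulled upwards by the large value 19 much less than the arithmetic mean is.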
49. 2. Median
is the middle value of the observations when they are arranged in
order of magnitude.
It is appropriate for skewed data.
To calculate the median: we have to arrange all of the recorded
values in order of size and then find the middle value.
If we arrange the above numbers in numerical order, we
obtain: 1, 2, 2, 3, 4, 5, 19. The median is 3.
In the above example, the median is much more representative of
the group than the mean (5.1). Extreme values do not affect the
median, and the median value is usually typical of the data.
50. 2. Median
If there is an even number of values, use the mean of the two
middle values:
For example, for 19, 24, 26, 30, 31, 34, The median is (26 +
30)/2 = 28.
The arithmetic mean and the median are close or equal in
value if the distribution is symmetrical.
The advantage of the median is that it is not affected by
outliers or by skewness in the distribution of the data.
The median will be less than the mean if the data are
skewed to the right, and greater than the mean if the
data are skewed to the left.
51. 3. Mode
is the most frequently occurring observation in a data set.
It is not affected by extreme values and is most commonly used for
skewed data.
It can be determined by creating a frequency table.
Its calculation disregards most of the observations
Some distributions do not have a mode, whilst other
distributions may have more than one mode (a distribution with
one mode is unimodal; with two modes, bimodal).
If we arrange the previous numbers in numerical
order, we obtain: 1, 2, 2, 3, 4, 5, 19. The mode is 2.
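The three measures of location for the running example can be reproduced with Python's standard `statistics` module. This is only a sketch for checking the hand calculations; the variable names are assumptions.

```python
import statistics

values = [1, 2, 3, 2, 4, 5, 19]          # running example from these notes
mean = statistics.mean(values)           # 36 / 7
median = statistics.median(values)       # middle value of the ordered data
mode = statistics.mode(values)           # most frequent value

# Even number of observations: median is the mean of the two middle values.
even_values = [19, 24, 26, 30, 31, 34]
median_even = statistics.median(even_values)

print(f"mean={mean:.1f}, median={median}, mode={mode}, "
      f"median of even-sized set={median_even}")
```

For this right-skewed group the ordering is mode < median < mean, because the single large value 19 pulls the mean upwards.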
52. Although the mean is the measure that is most common, when
distributions are asymmetric, the median and mode can give better
information about the set of data.
Unusually extreme values in a sample will affect the arithmetic
mean more than the median. In that case the median is a more
representative measure of central tendency than the arithmetic
mean.
For extremely asymmetric distributions the mode is the best
measure.
53. Skewness: If extremely low or extremely high observations
are present in a distribution, then the mean tends to shift
towards those scores. Based on the type of skewness,
distributions can be:
a) Negatively skewed distribution: occurs when the majority of
scores are at the right end of the curve and a few small scores
are scattered at the left end. (if it has a long tail to the left)
b) Positively skewed distribution: Occurs when the majority
of scores are at the left end of the curve and a few extreme
large scores are scattered at the right end. (if it has a long
tail to the right)
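The direction of skew can be checked numerically with a moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). This is a sketch using one common definition; statistical packages may apply small-sample adjustments and report slightly different values. The function name and the mirrored data set are assumptions for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 4, 5, 19]               # long tail to the right
left_skewed = [20 - v for v in right_skewed]        # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(f"right-skewed: {skewness(right_skewed):+.2f}")
print(f"left-skewed:  {skewness(left_skewed):+.2f}")
print(f"symmetric:    {skewness(symmetric):+.2f}")
```

A positive coefficient indicates a positively skewed distribution, a negative coefficient a negatively skewed one, and a value near zero a symmetric one.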
54. Consider the three distributions shown in Figure
For example, observation of the “No Skew” distribution would
yield the following data: 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9,
9, 10, 10, 10, 11, 11. Using SPSS software, the following descriptive
statistics were obtained for these three distributions
55. Exercise:
The following data, 44.4, 67.6, 76.2, 64.7, 80.0, 64.2, 75.0, 34.2,
29.2, represent the infection of goats with the viral condition peste
des petits ruminants. Calculate the median.
Calculate the mean and the median of the following data set.
What evidence is there for concluding that the data are or are not
symmetrically distributed?
56. Arranged in ascending order, the rates (%) are: 29.2, 34.2, 44.4,
64.2, 64.7, 67.6, 75.0, 76.2 and 80.0. There are nine observations, so
the median is the (9 + 1)/2 = 5th observation in the ordered set, i.e.
the median is 64.7%.
Mean = 761.2/16 = 47.58 g, median = 51.95 g. The mean and the
median do not coincide, indicating that the data are skewed. The
mean is less than the median, indicating that the data are skewed
to the left.
57. Which Measure Should You Use?
The choice of a particular measure of central tendency depends
on the shape of the population distribution. When we are dealing
with sample-based data, the distribution of the data from the
sample may suggest the shape of the population distribution.
For normally distributed data, mathematical theory of the
normal distribution suggests that the arithmetic mean is the most
appropriate measure of central tendency.
If a log transformation creates normally distributed data, then the
geometric mean is appropriate to the raw data.
58. Which Measure Should You Use?
For symmetric distributions, the mean and median are equal. If
the distribution is symmetric and has only one mode, all three
measures are the same.
For skewed distributions, with a single mode, the three measures
differ.
For positively skewed distributions (where the upper, or right, tail
of the distribution is longer than the lower, or left, tail)
the measures are ordered as follows: mode < median < mean.
For negatively skewed distributions (where the lower tail of the
distribution is longer than the upper tail), the reverse ordering
occurs: mean < median < mode.
59. Which Measure Should You Use?
Figure xxx Symmetric (B) and skewed distributions:
right skewed (A) and left skewed (C)
60. Measures of Dispersion
Consider the following data sets:
         Observations            Mean
Set 1:   60 40 30 50 60 40 70    50
Set 2:   50 49 49 51 48 50 53    50
The two data sets given above have a mean of 50, but obviously set 1
is more “spread out” than set 2. How do we express this numerically?
The object of measuring this scatter or dispersion is to obtain a
single summary figure which adequately exhibits whether the
distribution is compact or spread out.
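The difference in spread between the two sets can be expressed numerically. The sketch below assumes that the final 50 in each row above is the entry of the "Mean" column, so each set has seven observations; the helper name `describe` is hypothetical. Range, variance and standard deviation are defined in the slides that follow.

```python
import statistics

# Assumption: the last number in each row of the slide is the Mean column.
set_1 = [60, 40, 30, 50, 60, 40, 70]
set_2 = [50, 49, 49, 51, 48, 50, 53]


def describe(values):
    """Return the mean and three measures of dispersion for a sample."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),   # divides by n - 1
        "sd": statistics.stdev(values),
    }


d1, d2 = describe(set_1), describe(set_2)
for name, d in (("Set 1", d1), ("Set 2", d2)):
    print(f"{name}: mean={d['mean']}, range={d['range']}, "
          f"variance={d['variance']:.2f}, sd={d['sd']:.2f}")
```

Both sets have the same mean, but every measure of dispersion is much larger for set 1.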
61. Figure shows the frequency polygons for two populations that have
equal means but different amounts of variability. Population B,
which is more variable than population A, is more spread out. If the
values are widely scattered, the dispersion is greater.
Figure xx: Two frequency distributions with equal mean but
different amount of dispersion
62. Common measures of variability are the range, variance, standard
deviation and coefficient of variation.
1. Range is the difference between the maximum and minimum
values in a set of observations.
It wastes information because it uses only the two extreme values
and ignores the rest of the data.
It gives undue weight to extreme values and will, therefore,
overestimate the dispersion of most of the observations if outliers
are present
2. Variance is the expected squared deviation of a random
variable from its mean
Measures of Dispersion
63. The sample variance is calculated as:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
The variance is determined by calculating the deviation of each
observation from the mean.
This deviation will be large if the observation is far from the
mean, and it will be small if the observation is close to the mean.
3. Standard deviation (S) is a measure of the scatter of the
observations in relation to their mean, i.e. how close the
observations are to their mean.
It is the square root of the variance, which gives a measure of
dispersion in the original units of measurement:
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$
64. 4. Coefficient of variation (CV): the standard deviation expressed
as a percentage of the mean:
$$CV = \frac{s}{\bar{x}} \times 100$$
It can be used for comparing relative amounts of variation. It is
especially useful when variability is compared among sets of data
that are measured in different units, and it can also be used when
the unit of measurement is the same
The standard error of the mean (SEM) is a measure of the
precision of the sample mean as an estimate of the population
mean. It evaluates the sampling error by giving an indication of
how close a sample mean is to the population mean it is estimating
(inferring); it is an indication of the reliability of the mean:
$$SEM = \frac{s}{\sqrt{n}}$$
65. Confidence interval (Confidence limit)
The confidence interval (CI) is the range of values within which the
true population mean is expected to lie with a certain probability
(e.g. 95%).
Its end points are the lower and the upper confidence limits
If the confidence interval is wide, then the sample mean is an
imprecise estimate of the population mean.
If the confidence interval is narrow, then the sample mean is a
precise estimate of the population mean.
The 95% confidence interval for the mean is calculated as
Mean ± 1.96 × SEM (the standard error of the mean, not the
standard deviation)
66. Exercise
Calculate the standard deviation, variance and standard error
of the following data
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
68. Calculate the mean (see the first column, xi): 55/11 = 5
Subtract the mean from each observation to find the deviations
from the mean (see the 2nd column, xi − x̄).
Square the deviations from the mean (see the 3rd column,
(xi − x̄)²).
Sum the squared deviations (see the 3rd column) = 110
Divide the sum of the squared deviations by n–1 to find the
variance: 110/10 = 11
Take the square root of the variance to calculate the standard
deviation: s = √11 ≈ 3.32
SE = s/√n = √11/√11 = 1.0 (rounding s to 3.3 first gives
3.3/√11 ≈ 0.9949)
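The steps of this worked solution can be followed one by one in code. This is a minimal sketch with assumed variable names; it carries full precision throughout, which gives a standard error of exactly 1.0, whereas rounding the standard deviation to 3.3 before dividing gives 0.9949.

```python
import math

data = list(range(11))                       # 0, 1, 2, ..., 10
n = len(data)

mean = sum(data) / n                         # step 1: 55 / 11
deviations = [x - mean for x in data]        # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]       # step 3: squared deviations
sum_sq = sum(squared)                        # step 4: sum of squared deviations
variance = sum_sq / (n - 1)                  # step 5: divide by n - 1
sd = math.sqrt(variance)                     # step 6: standard deviation
se = sd / math.sqrt(n)                       # standard error of the mean

print(f"mean={mean}, sum of squares={sum_sq}, variance={variance}, "
      f"sd={sd:.2f}, se={se:.2f}")
```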
69. Exercise
The following are progesterone concentrations in the milk (ng/ml) of
14 cows: 4.37, 4.87, 4.35, 3.92, 4.68, 4.54, 5.24, 4.57, 4.59, 4.66,
4.40, 4.73, 4.83, 4.21.
Given the variance of 0.10177, calculate the
A) Arithmetic mean
B) Median
C) Standard deviation
D) Standard error
E) 95% Confidence interval
F) Coefficient of variation (CV)
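One way to check the answers to this exercise is sketched below, using only the formulas given in these notes. Variable names are assumptions. The interval uses the 1.96 multiplier stated above; for a sample as small as 14, a multiplier from the t distribution is often preferred and would give a slightly wider interval.

```python
import math
import statistics

progesterone = [4.37, 4.87, 4.35, 3.92, 4.68, 4.54, 5.24,
                4.57, 4.59, 4.66, 4.40, 4.73, 4.83, 4.21]
n = len(progesterone)

mean = statistics.mean(progesterone)           # A) arithmetic mean
median = statistics.median(progesterone)       # B) median
variance = statistics.variance(progesterone)   # sample variance (n - 1)
sd = math.sqrt(variance)                       # C) standard deviation
sem = sd / math.sqrt(n)                        # D) standard error of the mean
ci_low = mean - 1.96 * sem                     # E) 95% confidence interval
ci_high = mean + 1.96 * sem
cv = sd / mean * 100                           # F) coefficient of variation (%)

print(f"mean={mean:.2f}, median={median:.2f}, sd={sd:.3f}, sem={sem:.3f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}, CV={cv:.1f}%")
```

The computed variance agrees with the value of 0.10177 given in the exercise.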
71. Suppose two samples yield the following results:
Which is more variable, the weights of the 25-year-olds or the
weights of the 11-year-olds?
72. A comparison of the standard deviations might lead one to
conclude that the two samples possess equal variability.
If we compute the coefficients of variation, however, we have
for the 25-year-olds
and for the 11-year-olds
If we compare these results, we get quite a different impression.
It is clear from this example that variation is much higher in the
sample of 11-year-olds than in the sample of 25-year-olds.
73. Kurtosis is a measure of the degree to which a distribution is
“peaked” or flat in comparison to a normal distribution whose
graph is characterized by a bell-shaped appearance.
A distribution, in comparison to a normal distribution, may
possess an excessive proportion of observations in its tails, so
that its graph exhibits a flattened appearance. Such a
distribution is said to be platykurtic.
Conversely, a distribution, in comparison to a normal
distribution, may possess a smaller proportion of observations
in its tails, so that its graph exhibits a more peaked appearance.
Such a distribution is said to be leptokurtic.
A normal, or bell-shaped distribution, is said to be mesokurtic.
74. Consider the three distributions shown in Figure
For example, observation of the “mesokurtic” distribution would
yield the following data: 1, 2, 2, 3, 3, 3, 3, 3, … , 9, 9, 9, 9, 9, 10, 10,
11. Using SPSS software, the following descriptive statistics were
obtained for these three distributions: